The Influence of MiniGPT Parameters on Training Results – Single-Sentence Text

Last time, we trained our own model for the first time and tried to train it through texts of different lengths (single sentence text and Stray Birds). At the same time, we found a parameter with good performance among the randomly configured parameters. In this article, we will explore how to find the most suitable training parameters in the future.

Firstly, let’s introduce the functions of each parameter:

These parameters will affect the learning speed, performance, and stability of the model:

  1. batch size
    • Meaning: The number of samples processed in parallel in one training step.
    • Small batch: More frequent updates → faster learning, but with high noise and possible oscillation.
    • Large batch: More stable updates → smoother convergence, but requires more video memory.
    • Exploration point: With small datasets, the batch size should not be too large, otherwise overfitting is likely.
  2. blocksize (context length)
    • Meaning: The sequence length that the model can see at once.
    • Small block: The model has a short “field of view” and can only learn local relationships.
    • Big block: The model can learn long-range dependencies, but the training time and video memory consumption significantly increase.
    • Exploration point: Many GPT papers gradually increase the block size to observe the performance improvement.
  3. embedsize
    • Meaning: The dimension of character/word vectors.
    • Small embedding: Insufficient model expression ability.
    • Big embedding: stronger expressive ability, but with increased parameter count and training time.
  4. n_layers
    • Meaning: Transformer stacking layers.
    • Few layers: The model is shallow and cannot learn complex patterns.
    • Multiple layers: stronger modeling ability, but more prone to overfitting and slower training.
  5. learning_rate
    • Meaning: The step size for parameter updates.
    • Too small: slow convergence.
    • Too large: Loss oscillations or even divergence.
    • Usually used in conjunction with a learning rate scheduler, such as warmup or decay.
  6. dropout rate
    • Meaning: A regularization method to prevent overfitting.
    • 0: No regularization at all; the strongest fitting ability, but the highest risk of overfitting.
    • 0.1~0.3: Common range.
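To make the roles of these knobs concrete, here is a minimal sketch of a configuration collecting them in one place. The key names (batch_size, block_size, etc.) are illustrative assumptions; the actual Train.py may use different identifiers.

```python
# Hypothetical hyperparameter configuration for a MiniGPT-style run.
config = {
    "batch_size": 32,       # samples processed in parallel per step
    "block_size": 64,       # context length the model sees at once
    "embed_size": 128,      # dimension of token embeddings
    "n_layers": 4,          # number of stacked Transformer blocks
    "learning_rate": 1e-4,  # step size for parameter updates
    "dropout": 0.1,         # regularization; 0.1–0.3 is a common range
}
```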

2. How to judge the quality of training results

  1. Train/Validate Loss
    • The lower the loss, the better the model fits on the training (or validation) set.
    • But it is necessary to prevent overfitting (low training loss, high validation loss).
  2. Generated text quality
    • More intuitive.
    • Observable generated text:
      • Coherence (whether the sentence is natural and fluent).
      • Reasonableness (whether the logic is self-consistent).
      • Diversity (whether it is not mechanical repetition).
  3. Convergence speed
    • Under the same parameters, it can converge to a lower loss faster, indicating a more efficient configuration.
    • This is why it is worth recording the training time for each run.
  4. Perplexity (PPL)
    • Common indicators of language models.
    • PPL = exp(loss), The lower the value, the better.
    • For example:
      • PPL = 50 → on average, the model is about as uncertain as choosing among 50 equally likely next tokens.
      • PPL = 10 → the model is more confident.
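Since PPL is just exp(loss), it is easy to compute alongside the loss curve. A small self-contained sketch:

```python
import math

def perplexity(loss: float) -> float:
    """PPL = exp(loss); lower means the model is more confident."""
    return math.exp(loss)

# A loss of ~3.91 corresponds to PPL ≈ 50: on average, the model is about
# as uncertain as choosing among 50 equally likely next tokens.
print(round(perplexity(math.log(50))))  # → 50
print(round(perplexity(math.log(10))))  # → 10
```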

  3. Goal

To compare the effects of different parameters on training, you can observe:

  1. Loss curve (whether it decreases faster and more stably).
  2. Training time (cost vs. benefit).
  3. Generated sample text (whether it looks more like the original text).

This will gradually establish an intuitive mapping of “parameters → training performance”.

Based on this, we added Sweep.py on top of the previous project. It calls Train.py to train with different parameter combinations on the training data and observes the training results. Each run generates a folder for the corresponding parameters, containing the loss records and curves, the corresponding model, and finally text samples produced by the model.
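The core of such a sweep script can be sketched as follows. The parameter grids and the train(...) callback signature are assumptions for illustration, not the project's actual interface:

```python
from itertools import product

# Hypothetical parameter grids; the real sweep may use different values.
grid = {
    "batch_size": [8, 16, 32, 64],
    "embed_dim": [32, 64, 128, 256],
    "num_layers": [2, 4, 8, 12],
    "lr": [1e-4, 5e-4, 1e-3],
    "dropout": [0.0, 0.1, 0.2, 0.3],
}

def sweep(train_fn):
    """Run train_fn once per parameter combination and collect results."""
    results = []
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        # In the real project, each run would also write its own folder
        # with loss curves, the saved model, and generated samples.
        results.append((params, train_fn(**params)))
    return results
```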

It should be noted that I created the block_size=64 combination in Sweep, and the original data may be too short, causing an index overflow when generating xb/yb in load_dataset. Without drop_last=True, the last batch may be shorter than block_size, causing the model to index non-existent tokens in the forward pass, so this setting must be added.
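The failure mode can be illustrated with a toy batcher (this is a sketch, not the project's actual load_dataset):

```python
# Why drop_last matters: with a manual batcher, the final slice can be
# shorter than block_size, so xb/yb would be misaligned or index past
# the end of the data when the model looks up targets.
def make_batches(data, block_size, drop_last=True):
    batches = []
    for i in range(0, len(data) - 1, block_size):
        xb = data[i : i + block_size]
        yb = data[i + 1 : i + 1 + block_size]  # targets shifted by one
        if drop_last and len(xb) < block_size:
            continue  # skip the short tail instead of feeding bad indices
        batches.append((xb, yb))
    return batches

tokens = list(range(10))
print(len(make_batches(tokens, block_size=4)))                   # → 2
print(len(make_batches(tokens, block_size=4, drop_last=False)))  # → 3
```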

Due to the long duration of this training, I ended the training after completing the single sentence text. Let’s first observe the training effect on the single sentence text.

GPT's analysis of the results

Best Results

  • Optimal experimental configuration:
    • batch_size=8
    • embed_dim=64
    • num_layers=2
    • hidden_dim=128
    • lr=0.0001
    • dropout=0.3
  • final_loss = 0.000022, which performs very well.

The influence of parameters on the results

1. Dropout

  • The best average performance is 0.1 (0.5439), followed by 0.3.
  • 0.0 (no regularization) and 0.2 are slightly worse.
    ➡️ A small amount of dropout can help improve the effect.

2. Learning rate (lr)

  • 0.0001 is best (0.5465); the effect gets slightly worse as lr increases:
    • 0.0001 < 0.0005 < 0.0010
      ➡️ A smaller learning rate is more stable.

3. Batch size

  • 8 and 64 are better (around 0.5459), while the median values (16, 32) are slightly worse.
    ➡️ Small or large batches are more suitable, and there may be fluctuations in the middle.

4. Embedding Dimension

  • The larger the better: 256 is optimal (0.5457), followed by 128.
  • The effect of small embeddings (32, 64) is poor.
    ➡️ The semantic expression ability improves with the increase of dimensions.

5. Hidden layer dimension

  • 128 is the best option (0.5453), with limited improvement in large dimensions and slightly inferior performance in 64 and 256.

6. Layers (num_layers)

  • 2 layers are clearly optimal (0.00011); loss rises sharply as the number of layers increases, and 12 layers is almost unusable.
    ➡️ If the model is too deep, it may overfit or become unstable during training.

Summary and Suggestions

  1. Optimal configuration
    batch_size=8, embed_dim=128~256, hidden_dim=128, num_layers=2, lr=0.0001, dropout=0.1~0.3
  2. Trend:
    • A small learning rate converges stably.
    • Moderate dropout is beneficial.
    • The larger the embedding, the better, but hidden-dim 128 is sufficient.
    • A shallow model works best; deep models lead to training collapse.

However, looking at the whole training run, this group's loss is quite good while the generated text is not ideal. The data is probably too simple and the model overfits: it may simply have memorized the training set, driving the training loss to almost 0. Many groups reproduce the training data exactly at the start, but then mix in a lot of invalid content afterwards.

Why does this phenomenon occur

  1. The amount of data is too small
    • The training data consists of only one sentence, but the model has a very large number of parameters (hundreds of thousands or even millions).
    • The result is that the model is easy to memorize the sentence by rote, with a loss close to 0 during training.
  2. Repetition and loss of control during generation
    • During the inference phase, the model will continue to generate tokens based on probability.
    • It can reproduce training sentences, but then it will enter a “fabricated” state (because there is no other training data to refer to).
    • So the beginning is completely consistent, followed by mixed invalid content.
  3. The Essence of Language Models
    • LM learns about the distribution of the next word, not just memorizing one sentence.
    • When the context exceeds the given sentence, the model “has not learned” and can only randomly sample.

How to improve

If you want the model to only generate that sentence or reproduce the content more stably, you can consider several methods:

  1. Fine tuning ideas
    • Single-sentence training ≠ fine-tuning. Your current result is “overfitting to one sentence”.
    • You can add more similar sentences (such as tens to hundreds of short sentences), so that the model can learn a more robust distribution.
  2. Training Method Adjustment
    • Reduce model size: for example, use a smaller embedding, hidden_dim, or num_layers. An oversized model leads to rote memorization plus random generation.
    • Add dropout: so the model does not “memorize” so quickly.
    • Modify the training objective: if you only want reproduction, switch to a seq2seq or classification task instead of language modeling.
  3. Control during reasoning
    • Lower the temperature (such as 0.1), and the model will output the content you have trained more accurately without any random divergence.
    • You can even use greedy decoding (top_k=1), so that it will almost only reproduce the training data.
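These inference-time controls can be sketched in plain Python. The softmax-with-temperature and top_k=1 (greedy) behavior shown here are standard, though the real project's sampling code may differ:

```python
import math
import random

def sample(logits, temperature=1.0, top_k=None):
    """Pick the next token id from raw logits."""
    if top_k == 1:
        # Greedy decoding: always take the most likely token.
        return max(range(len(logits)), key=lambda i: logits[i])
    # Lower temperature sharpens the distribution toward the argmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs)[0]

logits = [2.0, 1.0, 0.1]
print(sample(logits, top_k=1))  # → 0 (the argmax token)
```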

In summary:
Your model has “perfectly remembered” that sentence (so the loss is close to 0), but because there is too little training data, it cannot learn the natural language distribution, so it “made up” in the second half of the generation process.

This time is mainly for the expansion of the original project. Next time we will use Stray Birds for training. At present, there are many problems to be solved, such as too long training time.

The corresponding version this time is 1.6-1.8.

Published on August 24, 2025.
