The Influence of miniGPT Parameters on Training Results – Stray Birds

In this model training experiment, we ran a systematic sweep over different parameter combinations: hidden layer size (H), dropout ratio (dp), and learning rate (lr). This time the dataset is replaced with Stray Birds. Based on the experimental results, we analyze how each parameter affects training and offer recommended configurations for reference. One caveat: because the timeout was set too short, most runs exited after only a few epochs rather than training for many rounds as in the earlier single-sentence text experiment. Between hardware limitations (a 4070 should not be the bottleneck, though the small dataset may leave the GPU largely idle) and time constraints (the sweep ran for about a week before ending), I did not cover every planned parameter combination and did not obtain fully satisfactory results. If you are interested, I recommend optimizing the training script for further study.

1. Experimental Design

  • Dataset: Stray Birds (about 800 lines)
  • Model Architecture: Simplified Transformer/RNN, with optional hidden layer sizes H of 64, 128, 256, 512
  • Hardware: Single RTX 4070, CPU i5-13600K, 32GB memory
  • Training parameter sweep:
    • Hidden layer H: 64/128/256/512
    • Dropout dp: 0.0 / 0.1 / 0.2 / 0.3
    • Learning rate lr: 0.0001 / 0.0005 / 0.001
  • Training limitations: some experiments did not fully converge because of the per-run timeout.
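The grid above can be sketched as a simple Python generator. This is only an illustration of the sweep design; the config key names are mine, not the repository's actual schema:

```python
from itertools import product

# The swept values from the experimental design above.
HIDDEN_SIZES = [64, 128, 256, 512]
DROPOUTS = [0.0, 0.1, 0.2, 0.3]
LEARNING_RATES = [0.0001, 0.0005, 0.001]

def sweep_configs():
    """Yield every (H, dp, lr) combination in the grid."""
    for h, dp, lr in product(HIDDEN_SIZES, DROPOUTS, LEARNING_RATES):
        yield {"hidden_size": h, "dropout": dp, "lr": lr}

configs = list(sweep_configs())
print(len(configs))  # 4 * 4 * 3 = 48 combinations
```

With a per-run timeout, the full grid of 48 runs is exactly why a week was not enough to finish everything.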

2. Comparison of Experimental Results

| Hidden Layer H | Dropout dp | Learning Rate lr | Final Loss | Result | Remarks |
|---|---|---|---|---|---|
| 64 | 0.0 | 0.001 | 1.207 | Moderate convergence | Learning rate relatively high; loss stable at 1.x |
| 64 | 0.1 | 0.0005 | 0.981 | Stable convergence | |
| 64 | 0.2 | 0.0001 | 1.310 | Underfitting | Learning rate too low; not enough learned |
| 128 | 0.0 | 0.001 | 1.692 | Underfitting/oscillation | Learning rate too high; training unstable |
| 128 | 0.2 | 0.001 | 0.075 | Overfitting | Very low loss; almost memorized the training set |
| 128 | 0.3 | 0.0005 | 0.990 | Stable convergence | Good generalization; recommended combination |
| 256 | 0.0 | 0.001 | 0.192 | Severe overfitting | Loss close to 0; rote memorization |
| 256 | 0.1 | 0.0005 | 1.105 | | |
| 256 | 0.2 | 0.0001 | 1.432 | Underfitting | Learning rate too low; stops at high loss |
| 512 | 0.2 | 0.0001 | 0.573 | Good convergence | Large model capacity; long training time |
| 512 | 0.3 | 0.0005 | | Moderate convergence | Stable, but time-consuming |


3. Parameter Analysis

3.1 Impact of Learning Rate

  • lr=0.001: Training is prone to oscillation or overfitting, especially when the hidden layer is large.
  • lr=0.0005: Training is the most stable; the loss usually lands in the 0.9-1.1 range, and the generated text generalizes well.
  • lr=0.0001: Convergence is slow and tends to stall at a high loss, i.e. underfitting.
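The three behaviors above (oscillation, stable convergence, stalling) can be reproduced in miniature with plain gradient descent on a badly-scaled quadratic. This is only an analogy for the sweep, not the actual loss curves:

```python
# Toy illustration (not miniGPT itself): gradient descent on the badly-scaled
# quadratic f(x) = 1000 * x**2, whose gradient is 2000 * x. The update
# x -= lr * grad multiplies x by (1 - 2000 * lr), so the three sweep
# learning rates behave very differently.

def descend(lr, steps=50, x=1.0):
    """Run `steps` gradient-descent updates and return the final |x|."""
    for _ in range(steps):
        x -= lr * 2000 * x  # gradient of 1000 * x**2 is 2000 * x
    return abs(x)

print(descend(0.001))   # factor -1: oscillates forever, |x| stays at 1.0
print(descend(0.0005))  # factor 0: jumps straight to the minimum
print(descend(0.0001))  # factor 0.8: slow but steady decay toward 0
```

The constants here were chosen to make each regime obvious; real training losses oscillate or stall for the same reason, just less cleanly.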

3.2 Dropout Function

  • dp=0.0: The loss often drops to 0.1 or below; the model essentially memorizes the training set, and generalization suffers.
  • dp=0.2-0.3: The loss is stable at 0.9-1.1, resulting in more natural and generalizable results.
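For reference, a minimal inverted-dropout sketch in plain Python, assuming miniGPT uses standard inverted dropout (the scheme behind `torch.nn.Dropout`):

```python
import random

def dropout(values, p, training=True):
    """Zero each value with probability p; rescale survivors by 1/(1-p)."""
    if not training or p == 0.0:
        return list(values)  # identity at inference time or with dp=0.0
    keep = 1.0 - p
    return [v / keep if random.random() < keep else 0.0 for v in values]
```

At dp=0.0 this is a no-op, which matches the memorization-prone runs above; at dp=0.2-0.3 roughly a fifth to a third of activations are zeroed on each training pass.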

3.3 Model Size H

  • H=64: The model is too small, with limited expressive ability, and the training effect is average.
  • H=128-256: Moderate capacity, able to converge without overfitting.
  • H=512: Strong expressive ability but long training time; some runs completed only one round before the timeout, so the loss had not fully converged.
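One reason H=512 trains so slowly is that per-block parameter count grows quadratically with H. A rough estimate, assuming a standard transformer block (four attention projections plus a 4x feed-forward, ignoring biases, layer norms, and embeddings):

```python
def block_params(h, mlp_ratio=4):
    """Rough parameter count of one transformer block of width h:
    4*h*h for the Q, K, V, and output projections, plus
    2 * mlp_ratio * h * h for the two feed-forward matrices."""
    return 4 * h * h + 2 * mlp_ratio * h * h

for h in (64, 128, 256, 512):
    print(h, block_params(h))
```

By this estimate the H=512 block carries 64x the parameters of the H=64 block, so the "long training time" above is expected.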

3.4 Timeout Issue

  • The timeout of a single round of training resulted in some experiments not converging, especially for the combination of large models and small learning rates.
  • Solution strategy:
    • Shorten sequence length and reduce computational complexity
    • Increase learning rate and accelerate convergence
    • Limit the sweep parameter space and avoid combining large models with small LR
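Beyond the strategies above, the timeout itself can be made non-destructive by capping wall-clock time per run and checkpointing before exiting. A sketch, where `train_step` and `save_checkpoint` are placeholders rather than real miniGPT functions:

```python
import time

def train_with_budget(train_step, save_checkpoint, budget_s=60.0, max_steps=10_000):
    """Run training steps until max_steps or a wall-clock budget expires."""
    deadline = time.monotonic() + budget_s
    for step in range(max_steps):
        if time.monotonic() >= deadline:
            save_checkpoint(step)  # persist progress instead of losing it
            return step            # steps completed before the budget ran out
        train_step(step)
    save_checkpoint(max_steps)
    return max_steps
```

With this pattern, a large-model/small-lr run that hits the budget can resume from its checkpoint in the next run instead of restarting from scratch.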

4. Recommended Configuration

  • Best configuration combination:
    • Hidden layer H: 128 or 256
    • Dropout dp:0.2–0.3
    • Learning rate lr: 0.0005
  • Training effect: the loss stabilizes at 0.9-1.1, the model does not memorize by rote, and the generated text is more natural and generalizes well.
  • Precautions:
    • An extremely low loss is not an ideal result; it usually indicates overfitting
    • Too large a model or insufficient training time can lead to insufficient convergence of the experiment
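The recommendations above, written out as a config plus a sanity check derived from the sweep. The key names are illustrative, not the repository's actual config schema:

```python
# Recommended configuration from this post's sweep (key names are illustrative).
RECOMMENDED = {
    "hidden_size": 128,  # 128 or 256 both worked; 128 trains faster
    "dropout": 0.25,     # anywhere in the 0.2-0.3 band
    "lr": 0.0005,        # the most stable learning rate in the sweep
}

def sanity_check(cfg):
    """Flag the combinations this post found problematic."""
    warnings = []
    if cfg["dropout"] < 0.1:
        warnings.append("dropout < 0.1 risks memorizing the training set")
    if cfg["lr"] >= 0.001 and cfg["hidden_size"] >= 128:
        warnings.append("lr=0.001 with H>=128 oscillated or overfit")
    return warnings

print(sanity_check(RECOMMENDED))  # []
```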

[Figure: partial configuration training results]

5. Conclusion

  1. Moderate capacity + moderate dropout is the best choice with a limited corpus, balancing style preservation and novelty.
  2. Too large a capacity or too small a dropout easily leads to overfitting and mechanical, memorized text.
  3. Insufficient training rounds limit the potential of large models; even with large capacity, they may occasionally generate disjointed sentences.
  4. Future optimization direction:
    • Increase the number of training rounds to fully converge the large model
    • Introduce data augmentation (such as poetry restructuring or synonym rewriting)
    • Try a hybrid capacity model to achieve a balance between speed and generation quality
https://github.com/spikec137/miniGPT

The corresponding repository versions for this post are 1.9-2.3.

Published on September 14, 2025.
