Training miniGPT from Scratch: From a Single Sentence to a Stray Birds Generation Experiment

1.Project Objectives and an Introduction to Language Modeling

    We want to create a language model, for example:

    • Given the input: “Hello, the weather today”
    • Let the model automatically generate the continuation: “is really nice”

    This is called Language Modeling, and its core task is:

    Given a token sequence, predict the next token.

    2.Project Module Division and Responsibilities

    Building such a system requires four major modules:


    Module   Question   Corresponding file
    1. Data preparation   What do I train on, and how do I turn it into a format the model can learn from?   dataset.py + data/tiny.txt
    2. Model structure   How do I build a neural network that predicts the next token?   model.py
    3. Model training   How do I train it? Which loss function and optimizer?   train.py
    4. Model inference   After training, how do I use it to generate text?   generate.py

    3.Data Preparation: dataset.py

    Neural networks cannot directly process character text and must convert strings into tensors (numbers).

    The text we write is a string, for example:


    hello world

    But the model only recognizes numbers, so it needs to:

    1. Character table: build stoi = {“h”: 0, “e”: 1, …}
    2. Character encoding: convert the text to [0, 1, 2, …]
    3. Sliding window: cut it into small (input, target) chunks for training

    These steps are encapsulated in a CharDataset class so that the data can be loaded and batched through PyTorch.

    Reading, cleaning, and batching the data are all handed over to dataset.py.

    4.Model Structure: model.py

    The model structure lives in its own file: the model is an independent component that can be reused for both training and inference.

    This way, you can:

    • train a model,
    • load it from another script to generate text,
    • or switch to a different training method later while keeping the model structure unchanged.

    So, writing the model definition in model.py lets both the training and inference scripts import it, which helps with modularity, reuse, and decoupling.

    5.Model training: train.py

    A dedicated training script keeps the structure clear, makes the code reusable, and makes hyperparameter tuning easier.

    It is responsible for:

    • loading the data
    • initializing the model
    • setting up the optimizer
    • training the model for multiple rounds
    • saving the weights

    Advantages:

    • Change the training data without changing the model
    • Modify hyperparameters without altering the model structure
    • Train, generate, and evaluate independently of each other

    6.Model Storage Directory and Format

    The checkpoints/ directory stores the trained model in .pt format.
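    Saving and reloading a .pt checkpoint can be sketched as follows; the state-dict approach is standard PyTorch, while the stand-in model and file name are illustrative:

```python
import os
import torch
import torch.nn as nn

os.makedirs("checkpoints", exist_ok=True)
model = nn.Linear(8, 8)  # stand-in for MiniGPT

# save only the weights (the state dict), not the whole Python object
torch.save(model.state_dict(), "checkpoints/mini_gpt.pt")

# later, e.g. in generate.py: rebuild the model and load the weights back
model2 = nn.Linear(8, 8)
model2.load_state_dict(torch.load("checkpoints/mini_gpt.pt"))
```

    Saving the state dict rather than the full model object keeps the checkpoint independent of the exact class definition's file location.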

    7.Text generation: generate.py

    generate.py loads the trained model and runs inference.

    Module   Responsibilities
    dataset.py   Data preparation, responsible for “eating data”
    model.py   Network structure, responsible for “computing predictions”
    train.py   Training process, responsible for “learning knowledge”
    generate.py   Inference script, responsible for “writing sentences”
    checkpoints/   Saves the model for easy reuse later

    8.Specific Implementation Process

    Therefore, we created our directory structure in this way.

    Let’s first add a sentence to the data/tiny.txt file:

    “Hello, world! The weather is nice today, let’s go out and play.”

    This is the “corpus” we will train on.

    Next, we start writing dataset.py so that the model can:

    • read the text
    • build the character-to-index mapping
    • split it into training pairs

    Function   Description
    CharDataset   A character-level dataset class
    stoi/itos   Dictionaries for converting between characters and numbers
    __getitem__()   Cuts the continuous text into (input, target) pairs, where the target is the input shifted one character to the right
    load_dataset()   Reads text from a .txt file and returns a dataset object ready for training
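    One possible shape of this class is sketched below; the names (CharDataset, stoi, itos, __getitem__, load_dataset) follow the table above, while the exact implementation details are an assumption:

```python
import torch
from torch.utils.data import Dataset

class CharDataset(Dataset):
    """Character-level dataset: (input, target) pairs via a sliding window."""

    def __init__(self, text, block_size):
        chars = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(chars)}  # char -> index
        self.itos = {i: ch for i, ch in enumerate(chars)}  # index -> char
        self.block_size = block_size
        self.data = [self.stoi[ch] for ch in text]

    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        chunk = self.data[idx:idx + self.block_size + 1]
        x = torch.tensor(chunk[:-1], dtype=torch.long)  # input
        y = torch.tensor(chunk[1:], dtype=torch.long)   # target, shifted by one
        return x, y

def load_dataset(path, block_size=8):
    """Read a .txt file and return a CharDataset ready for training."""
    with open(path, encoding="utf-8") as f:
        return CharDataset(f.read(), block_size)
```

    Because CharDataset implements `__len__` and `__getitem__`, it plugs directly into PyTorch's DataLoader for batching.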

    Then we implement model.py and build the most basic GPT model structure (MiniGPT).

    Mainly includes:

    • Token Embedding
    • Position Embedding
    • A simplified version of Transformer block
    • Output the probability of predicting the next token

    We have now completed model.py and built a minimal but fully functional GPT model.

    Explain its composition in modules:

    1.SelfAttention

    Implements single-head attention (a simplified version of multi-head attention):

    • Compute Query, Key, and Value with three separate Linear layers
    • Apply scaled dot-product attention: QKᵀ divided by a scaling factor
    • Use a tril (lower-triangular) mask so the model cannot look at the “future”
    • Output the attention-weighted V
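    These four bullets map directly onto a small module; a minimal sketch, assuming illustrative hyperparameter names:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Single-head causal self-attention (simplified multi-head)."""

    def __init__(self, embed_size, block_size):
        super().__init__()
        # three separate Linear layers for Q, K, V
        self.query = nn.Linear(embed_size, embed_size)
        self.key = nn.Linear(embed_size, embed_size)
        self.value = nn.Linear(embed_size, embed_size)
        # lower-triangular mask: position t may only attend to positions <= t
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)
        att = q @ k.transpose(-2, -1) / math.sqrt(C)         # scaled dot product
        att = att.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        return att @ v                                       # weighted sum of V
```

    The `-inf` fill before the softmax is what zeroes out attention to future positions.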

    2.TransformerBlock

    Similar to GPT structure: LayerNorm → SelfAttention → LayerNorm → FeedForward

    • Each layer has residual connections (x + …)
    • LayerNorm keeps training stable
    • A simplified version that is easy to understand

    3.MiniGPT

    This is the encapsulation of the entire model, which includes:

    Component   Function
    token_embedding   Converts character indices into vectors
    position_embedding   Adds position information
    blocks   A stack of Transformer blocks
    ln_f   Final normalization layer
    head   Outputs logits over the vocabulary to predict the next character’s distribution
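    How these components assemble into a forward pass can be sketched as follows; this version substitutes PyTorch's built-in nn.MultiheadAttention (with one head) for the custom SelfAttention so the sketch is self-contained, and all sizes are illustrative:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Pre-norm Transformer block with residual connections."""

    def __init__(self, embed_size):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_size)
        self.attn = nn.MultiheadAttention(embed_size, num_heads=1, batch_first=True)
        self.ln2 = nn.LayerNorm(embed_size)
        self.ff = nn.Sequential(
            nn.Linear(embed_size, 4 * embed_size), nn.ReLU(),
            nn.Linear(4 * embed_size, embed_size))

    def forward(self, x):
        h = self.ln1(x)
        # causal mask: True entries are positions that may NOT be attended to
        mask = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool,
                                     device=x.device), diagonal=1)
        a, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + a                        # residual around attention
        return x + self.ff(self.ln2(x))  # residual around feed-forward

class MiniGPT(nn.Module):
    def __init__(self, vocab_size, block_size, embed_size=64, num_layers=2):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embed_size)
        self.position_embedding = nn.Embedding(block_size, embed_size)
        self.blocks = nn.Sequential(*[Block(embed_size) for _ in range(num_layers)])
        self.ln_f = nn.LayerNorm(embed_size)
        self.head = nn.Linear(embed_size, vocab_size)

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.token_embedding(idx) + self.position_embedding(pos)
        x = self.blocks(x)
        return self.head(self.ln_f(x))  # logits over the vocabulary
```

    The output has shape (batch, time, vocab_size): one next-character distribution per position.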

    The generate() method

    This is our custom inference function that can:

    • Receive a seed sequence (character indices)
    • At each step, predict the next token, sample it, and append it to the sequence
    • Finally decode the sequence back into a string and return it
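    The sampling loop behind generate() can be written model-agnostically; in this sketch `model` is a stand-in bigram table (an nn.Embedding) so it runs without the full MiniGPT, and the default block_size is illustrative:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, stoi, itos, block_size=8):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]                 # crop to the context window
        logits = model(idx_cond)                        # (B, T, vocab)
        logits = logits[:, -1, :]                       # distribution for next token
        probs = F.softmax(logits, dim=-1)
        nxt = torch.multinomial(probs, num_samples=1)   # sample one token
        idx = torch.cat([idx, nxt], dim=1)              # append to the sequence
    return "".join(itos[int(i)] for i in idx[0])        # decode to a string
```

    Sampling with `torch.multinomial` (rather than always taking the argmax) is what makes repeated generations differ.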

    Next, let’s write the training script train.py, which will:

    1. Load the dataset (dataset.py)
    2. Initialize the model (model.py)
    3. Train with CrossEntropyLoss and the Adam optimizer
    4. Print the loss and save the model every few steps
    5. Generate sample text after training completes
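    The core of such a script is a short loop; this sketch uses a stand-in embedding model and a toy identity target (in train.py the target is the next character) so it is self-contained, with illustrative hyperparameters:

```python
import torch
import torch.nn as nn

vocab_size, block_size = 12, 8
model = nn.Embedding(vocab_size, vocab_size)   # stand-in for MiniGPT
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

losses = []
for step in range(200):
    x = torch.randint(0, vocab_size, (4, block_size))  # stand-in batch
    y = x.clone()            # toy target so the loop demonstrably converges
    logits = model(x)        # (B, T, vocab)
    # CrossEntropyLoss wants (N, vocab) vs (N,), so flatten batch and time
    loss = loss_fn(logits.view(-1, vocab_size), y.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        losses.append(loss.item())  # record for the loss curve
```

    Flattening (B, T, vocab) logits to (B·T, vocab) before the loss is the standard trick for token-level cross-entropy.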

    9.Training Results

    After running it, you can see the output. The model trained successfully and the loss dropped very low, indicating it has almost memorized the sentence. This matches the expected overfitting behavior of a small model on tiny data: it memorizes by rote but cannot generalize, because we fed it only one sentence.

    The current results indicate that:

    • Training is correct: the loss dropped significantly, meaning the model can accurately predict each character.
    • The generation logic works: the generate function successfully strings characters together into new text.
    • Memory is excellent: the model repeatedly produces “let’s go out and play”, showing it has memorized the structure.

    A model named mini_gpt.pt has now been generated under checkpoints/.

    We can also use generate.py to load the model and output the results.

    10.Visualize the training process and expand the model

    Next, we add a training-loss-curve visualization and try stacking more Transformer layers. Both are extensions of the miniGPT project that gradually upgrade it into a more complete, powerful, and analyzable Transformer training system.

    Training loss curve visualization (understanding whether the model has “learned”)

    It helps monitor the model’s learning process and determine:

    • whether the loss is decreasing normally;
    • whether the model is overfitting or underfitting;
    • whether hyperparameter choices (such as the learning rate) are reasonable;
    • whether the model is oscillating or failing to converge.

    Just record the loss at every eval_interval in train.py and plot it with matplotlib.

    We added the plotting step to train.py and set the model’s number of Transformer layers to 4.

    You can see the corresponding chart.
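    The plotting step amounts to a few matplotlib calls; here the recorded losses are dummy values standing in for the ones collected every eval_interval during training:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so no display is needed
import matplotlib.pyplot as plt

eval_interval = 50
losses = [2.9, 1.6, 0.7, 0.2, 0.05]  # dummy recorded losses

steps = [i * eval_interval for i in range(len(losses))]
plt.plot(steps, losses)
plt.xlabel("step")
plt.ylabel("loss")
plt.title("Training loss curve")
plt.savefig("loss_curve.png")  # written next to the training script
```

    A smoothly decreasing curve like this one is the "healthy" shape; oscillation or a plateau points at learning-rate or capacity problems.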

    11.Model extension and optimization suggestions

    Next, we can try it ourselves:

    1. Add more training texts (to improve generalization ability)

    You can replace tiny.txt with multi-sentence text, for example:

    Hello, world! The weather is nice today, let’s go out and play.

    Hello, I am an intelligent agent.

    Good morning, today is Friday.

    Let’s go play in the park.

    This way, the model won’t just repeat one sentence.

    2. Try generating different beginnings

    You can switch to other starting tokens, such as:

    context = torch.tensor([[stoi[ch] for ch in “Good morning”]], dtype=torch.long).to(device)

    print(model.generate(context, max_new_tokens=100, stoi=stoi, itos=itos))

    3. Larger model structure

    model = MiniGPT(vocab_size=vocab_size, block_size=block_size, embed_size=128, num_layers=4).to(device)

    But the GPU memory and training-time requirements will increase.

    12.Replace corpus with Stray Birds

    Next, I tried to replace tiny.txt with the Chinese text of Stray Birds.

    Loss Value Range   Meaning
    0.0~0.1   Very good; the model has essentially fitted the data
    0.1~1.0   Still learning; converging
    >1.0   Poor predictions (common early in training) or underfitting
    Very large or NaN   Model collapse, possibly from gradient explosion or too high a learning rate


    If the predicted word probability distribution of the model is consistent with the targets, the loss will approach 0.

    If the model guesses randomly, the loss will be relatively high.
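    These two extremes can be checked with one line of arithmetic: a uniform random guess over a vocabulary of V characters gives a cross-entropy loss of ln(V), while a perfect prediction gives 0. The vocabulary size here is illustrative:

```python
import math

vocab_size = 50  # illustrative vocabulary size
random_guess_loss = math.log(vocab_size)  # cross-entropy of a uniform guess
print(round(random_guess_loss, 2))        # around 3.9 for V = 50
```

    This is also a useful sanity check at step 0 of training: the initial loss should sit near ln(V), and anything far above it suggests a bug.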

    The larger configurations ended up with noticeably higher loss than the smaller ones; on this corpus, the small models actually trained significantly better.

    The corpus we used (Stray Birds) is relatively small, stylistically consistent, and limited in vocabulary. Overly large models, such as ones with:

    • a high embedding dimension (256 or 512)
    • many layers (e.g. a 6-layer Transformer)
    • many attention heads

    can be hard to train well on it, and may overfit severely or get stuck in poor local optima. Small models actually learn the patterns of a small corpus more easily.

    13.Training Results and Summary

    Our model successfully memorized the training corpus and can generate fluent text.

    Small models are better suited to small corpora, avoiding overfitting and gradient instability.

    Visualizing the training curve helps determine whether the model is converging.

    Next step: systematically test how model size affects training.

    All project files mentioned in this article:

    https://github.com/spikec137/miniGPT

    Reference link:

    https://www.shigeku.org/shiku/ws/wg/tagore.htm

    Published on August 3, 2025.
