Training miniGPT from Scratch: From a Single Sentence to a Stray Birds Generation Experiment

1.Project Objectives and an Introduction to Language Modeling

    We want to create a language model, for example:

    • Given the input: “Hello, the weather today”
    • Let the model automatically generate the continuation: “is really nice”

    This is called Language Modeling, and its core task is:

    Given a token sequence, predict the next token.

    2.Project Module Division and Responsibilities

    Building such a system requires four major modules:


    Module   Question   Corresponding file
    1. Data preparation   What do I train on, and how do I turn it into a format the model can learn from?   dataset.py + data/tiny.txt
    2. Model structure   How do I build a neural network that predicts the next token?   model.py
    3. Model training   How do I train it? Which loss function and optimizer?   train.py
    4. Model inference   After training, how do I use it to generate text?   generate.py

    3.Data Preparation: dataset.py

    Neural networks cannot directly process character text and must convert strings into tensors (numbers).

    The text we write is a string, for example:


    hello world

    But the model only recognizes numbers, so it needs to:

    1. Character table: build stoi = {“h”: 0, “e”: 1, …}
    2. Character encoding: convert the text to [0, 1, 2, …]
    3. Sliding window: cut it into small (input, target) chunks for training

    These steps are encapsulated in a CharDataset class so that the data can be loaded and batched through PyTorch.

    Reading, cleaning, and batching the data are all handed over to dataset.py.

    4.Model Structure: model.py

    The model structure lives in its own file: the model is an independent component that can be reused for both training and inference.

    This way, you can:

    • train a model,
    • load it from another script to generate text,
    • or switch to a different training method later while keeping the model structure unchanged.

    So, writing the model definition in model.py lets both the training and inference scripts import it, which helps with modularity, reuse, and decoupling.

    5.Model training: train.py

    A dedicated training script keeps the structure clear, makes the code reusable, and makes hyperparameter tuning easier.

    It is responsible for:

    • loading the data
    • initializing the model
    • setting up the optimizer
    • training the model for multiple rounds
    • saving the weights

    Advantages:

    • Change the training data without changing the model
    • Modify hyperparameters without altering the model structure
    • Train, generate, and evaluate independently of each other

    6.Model Storage Directory and Format

    The checkpoints/ directory stores the trained model in .pt format.
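    Saving and reloading a .pt checkpoint can be sketched as follows; the state-dict approach is standard PyTorch, while the stand-in model and file name are illustrative:

```python
import os
import torch
import torch.nn as nn

os.makedirs("checkpoints", exist_ok=True)
model = nn.Linear(8, 8)  # stand-in for MiniGPT

# save only the weights (the state dict), not the whole Python object
torch.save(model.state_dict(), "checkpoints/mini_gpt.pt")

# later, e.g. in generate.py: rebuild the model and load the weights back
model2 = nn.Linear(8, 8)
model2.load_state_dict(torch.load("checkpoints/mini_gpt.pt"))
```

    Saving the state dict rather than the full model object keeps the checkpoint independent of the exact class definition's file location.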

    7.Text generation: generate.py

    generate.py loads the trained model and runs inference.

    Module   Responsibilities
    dataset.py   Data preparation, responsible for “eating data”
    model.py   Network structure, responsible for “computing predictions”
    train.py   Training process, responsible for “learning knowledge”
    generate.py   Inference script, responsible for “writing sentences”
    checkpoints/   Saves the model for easy reuse later

    8.Specific Implementation Process

    Therefore, we created our directory structure in this way.

    Let’s first add a sentence to the data/tiny.txt file:

    “Hello, world! The weather is nice today, let’s go out and play.”

    This is the “corpus” we will train on.

    Next, we start writing dataset.py so that the model can:

    • read the text
    • build the character-to-index mapping
    • split it into training pairs

    Function   Description
    CharDataset   A character-level dataset class
    stoi/itos   Dictionaries for converting between characters and numbers
    __getitem__()   Cuts the continuous text into (input, target) pairs, where the target is the input shifted one character to the right
    load_dataset()   Reads text from a .txt file and returns a dataset object ready for training
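    One possible shape of this class is sketched below; the names (CharDataset, stoi, itos, __getitem__, load_dataset) follow the table above, while the exact implementation details are an assumption:

```python
import torch
from torch.utils.data import Dataset

class CharDataset(Dataset):
    """Character-level dataset: (input, target) pairs via a sliding window."""

    def __init__(self, text, block_size):
        chars = sorted(set(text))
        self.stoi = {ch: i for i, ch in enumerate(chars)}  # char -> index
        self.itos = {i: ch for i, ch in enumerate(chars)}  # index -> char
        self.block_size = block_size
        self.data = [self.stoi[ch] for ch in text]

    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        chunk = self.data[idx:idx + self.block_size + 1]
        x = torch.tensor(chunk[:-1], dtype=torch.long)  # input
        y = torch.tensor(chunk[1:], dtype=torch.long)   # target, shifted by one
        return x, y

def load_dataset(path, block_size=8):
    """Read a .txt file and return a CharDataset ready for training."""
    with open(path, encoding="utf-8") as f:
        return CharDataset(f.read(), block_size)
```

    Because CharDataset implements `__len__` and `__getitem__`, it plugs directly into PyTorch's DataLoader for batching.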

    Then we implement model.py and build the most basic GPT model structure (MiniGPT).

    Mainly includes:

    • Token Embedding
    • Position Embedding
    • A simplified version of Transformer block
    • Output the probability of predicting the next token

    We have now completed model.py and built a minimal but fully functional GPT model.

    Explain its composition in modules:

    1.SelfAttention

    Implements single-head attention (a simplified version of multi-head attention):

    • Compute Query, Key, and Value with three separate Linear layers
    • Apply scaled dot-product attention: QKᵀ divided by a scaling factor
    • Use a tril (lower-triangular) mask so the model cannot look at the “future”
    • Output the attention-weighted V
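    These four bullets map directly onto a small module; a minimal sketch, assuming illustrative hyperparameter names:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Single-head causal self-attention (simplified multi-head)."""

    def __init__(self, embed_size, block_size):
        super().__init__()
        # three separate Linear layers for Q, K, V
        self.query = nn.Linear(embed_size, embed_size)
        self.key = nn.Linear(embed_size, embed_size)
        self.value = nn.Linear(embed_size, embed_size)
        # lower-triangular mask: position t may only attend to positions <= t
        self.register_buffer("tril", torch.tril(torch.ones(block_size, block_size)))

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.query(x), self.key(x), self.value(x)
        att = q @ k.transpose(-2, -1) / math.sqrt(C)         # scaled dot product
        att = att.masked_fill(self.tril[:T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        return att @ v                                       # weighted sum of V
```

    The `-inf` fill before the softmax is what zeroes out attention to future positions.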

    2.TransformerBlock

    Similar to GPT structure: LayerNorm → SelfAttention → LayerNorm → FeedForward

    • Each layer has residual connections (x + …)
    • LayerNorm keeps training stable
    • A simplified version that is easy to understand

    3.MiniGPT

    This is the encapsulation of the entire model, which includes:

    Component   Function
    token_embedding   Converts character indices into vectors
    position_embedding   Adds position information
    blocks   A stack of Transformer blocks
    ln_f   Final normalization layer
    head   Outputs logits over the vocabulary to predict the next character’s distribution
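    How these components assemble into a forward pass can be sketched as follows; this version substitutes PyTorch's built-in nn.MultiheadAttention (with one head) for the custom SelfAttention so the sketch is self-contained, and all sizes are illustrative:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Pre-norm Transformer block with residual connections."""

    def __init__(self, embed_size):
        super().__init__()
        self.ln1 = nn.LayerNorm(embed_size)
        self.attn = nn.MultiheadAttention(embed_size, num_heads=1, batch_first=True)
        self.ln2 = nn.LayerNorm(embed_size)
        self.ff = nn.Sequential(
            nn.Linear(embed_size, 4 * embed_size), nn.ReLU(),
            nn.Linear(4 * embed_size, embed_size))

    def forward(self, x):
        h = self.ln1(x)
        # causal mask: True entries are positions that may NOT be attended to
        mask = torch.triu(torch.ones(x.size(1), x.size(1), dtype=torch.bool,
                                     device=x.device), diagonal=1)
        a, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + a                        # residual around attention
        return x + self.ff(self.ln2(x))  # residual around feed-forward

class MiniGPT(nn.Module):
    def __init__(self, vocab_size, block_size, embed_size=64, num_layers=2):
        super().__init__()
        self.token_embedding = nn.Embedding(vocab_size, embed_size)
        self.position_embedding = nn.Embedding(block_size, embed_size)
        self.blocks = nn.Sequential(*[Block(embed_size) for _ in range(num_layers)])
        self.ln_f = nn.LayerNorm(embed_size)
        self.head = nn.Linear(embed_size, vocab_size)

    def forward(self, idx):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.token_embedding(idx) + self.position_embedding(pos)
        x = self.blocks(x)
        return self.head(self.ln_f(x))  # logits over the vocabulary
```

    The output has shape (batch, time, vocab_size): one next-character distribution per position.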

    The generate() method

    This is our custom inference function that can:

    • Receive a seed sequence (character indices)
    • At each step, predict the next token, sample it, and append it to the sequence
    • Finally decode the sequence back into a string and return it
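    The sampling loop behind generate() can be written model-agnostically; in this sketch `model` is a stand-in bigram table (an nn.Embedding) so it runs without the full MiniGPT, and the default block_size is illustrative:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, idx, max_new_tokens, stoi, itos, block_size=8):
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -block_size:]                 # crop to the context window
        logits = model(idx_cond)                        # (B, T, vocab)
        logits = logits[:, -1, :]                       # distribution for next token
        probs = F.softmax(logits, dim=-1)
        nxt = torch.multinomial(probs, num_samples=1)   # sample one token
        idx = torch.cat([idx, nxt], dim=1)              # append to the sequence
    return "".join(itos[int(i)] for i in idx[0])        # decode to a string
```

    Sampling with `torch.multinomial` (rather than always taking the argmax) is what makes repeated generations differ.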

    Next, let’s write the training script train.py, which will:

    1. Load the dataset (dataset.py)
    2. Initialize the model (model.py)
    3. Train with CrossEntropyLoss and the Adam optimizer
    4. Print the loss and save the model every few steps
    5. Generate sample text after training completes
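    The core of such a script is a short loop; this sketch uses a stand-in embedding model and a toy identity target (in train.py the target is the next character) so it is self-contained, with illustrative hyperparameters:

```python
import torch
import torch.nn as nn

vocab_size, block_size = 12, 8
model = nn.Embedding(vocab_size, vocab_size)   # stand-in for MiniGPT
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

losses = []
for step in range(200):
    x = torch.randint(0, vocab_size, (4, block_size))  # stand-in batch
    y = x.clone()            # toy target so the loop demonstrably converges
    logits = model(x)        # (B, T, vocab)
    # CrossEntropyLoss wants (N, vocab) vs (N,), so flatten batch and time
    loss = loss_fn(logits.view(-1, vocab_size), y.view(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        losses.append(loss.item())  # record for the loss curve
```

    Flattening (B, T, vocab) logits to (B·T, vocab) before the loss is the standard trick for token-level cross-entropy.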

    9.Training Results

    After running it, you can see the output. The model trained successfully and the loss dropped very low, indicating it has almost memorized the sentence. This matches the expected overfitting behavior of a small model on tiny data: it memorizes by rote but cannot generalize, because we fed it only one sentence.

    The current results indicate that:

    • Training is correct: the loss dropped significantly, meaning the model can accurately predict each character.
    • The generation logic works: the generate function successfully strings characters together into new text.
    • Memory is excellent: the model repeatedly produces “let’s go out and play”, showing it has memorized the structure.

    A model named mini_gpt.pt has now been generated under checkpoints/.

    We can also use generate.py to load the model and output the results.

    10.Visualize the training process and expand the model

    Next, we add a training-loss-curve visualization and try stacking more Transformer layers. Both are extensions of the miniGPT project that gradually upgrade it into a more complete, powerful, and analyzable Transformer training system.

    Training loss curve visualization (understanding whether the model has “learned”)

    It helps monitor the model’s learning process and determine:

    • whether the loss is decreasing normally;
    • whether the model is overfitting or underfitting;
    • whether hyperparameter choices (such as the learning rate) are reasonable;
    • whether the model is oscillating or failing to converge.

    Just record the loss at every eval_interval in train.py and plot it with matplotlib.

    We added the plotting step to train.py and set the model’s number of Transformer layers to 4.

    You can see the corresponding chart.
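    The plotting step amounts to a few matplotlib calls; here the recorded losses are dummy values standing in for the ones collected every eval_interval during training:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so no display is needed
import matplotlib.pyplot as plt

eval_interval = 50
losses = [2.9, 1.6, 0.7, 0.2, 0.05]  # dummy recorded losses

steps = [i * eval_interval for i in range(len(losses))]
plt.plot(steps, losses)
plt.xlabel("step")
plt.ylabel("loss")
plt.title("Training loss curve")
plt.savefig("loss_curve.png")  # written next to the training script
```

    A smoothly decreasing curve like this one is the "healthy" shape; oscillation or a plateau points at learning-rate or capacity problems.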

    11.Model extension and optimization suggestions

    Next, we can try it ourselves:

    1. Add more training texts (to improve generalization ability)

    You can replace tiny.txt with multi-sentence text, for example:

    Hello, world! The weather is nice today, let’s go out and play.

    Hello, I am an intelligent agent.

    Good morning, today is Friday.

    Let’s go play in the park.

    This way, the model won’t just repeat one sentence.

    2. Try generating different beginnings

    You can switch to other starting tokens, such as:

    context = torch.tensor([[stoi[ch] for ch in “Good morning”]], dtype=torch.long).to(device)

    print(model.generate(context, max_new_tokens=100, stoi=stoi, itos=itos))

    3. Larger model structure

    model = MiniGPT(vocab_size=vocab_size, block_size=block_size, embed_size=128, num_layers=4).to(device)

    But the GPU memory and training-time requirements will increase.

    12.Replace corpus with Stray Birds

    Next, I tried to replace tiny.txt with the Chinese text of Stray Birds.

    Loss Value Range   Meaning
    0.0~0.1   Very good; the model has essentially fitted the data
    0.1~1.0   Still learning; converging
    >1.0   Poor predictions (common early in training) or underfitting
    Very large or NaN   Model collapse, possibly from gradient explosion or too high a learning rate


    If the predicted word probability distribution of the model is consistent with the targets, the loss will approach 0.

    If the model guesses randomly, the loss will be relatively high.
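    These two extremes can be checked with one line of arithmetic: a uniform random guess over a vocabulary of V characters gives a cross-entropy loss of ln(V), while a perfect prediction gives 0. The vocabulary size here is illustrative:

```python
import math

vocab_size = 50  # illustrative vocabulary size
random_guess_loss = math.log(vocab_size)  # cross-entropy of a uniform guess
print(round(random_guess_loss, 2))        # around 3.9 for V = 50
```

    This is also a useful sanity check at step 0 of training: the initial loss should sit near ln(V), and anything far above it suggests a bug.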

    The larger configurations ended up with noticeably higher loss than the smaller ones; on this corpus, the small models actually trained significantly better.

    The corpus we used (Stray Birds) is relatively small, stylistically consistent, and limited in vocabulary. Overly large models, such as ones with:

    • a high embedding dimension (256 or 512)
    • many layers (e.g. a 6-layer Transformer)
    • many attention heads

    can be hard to train well on it, and may overfit severely or get stuck in poor local optima. Small models actually learn the patterns of a small corpus more easily.

    13.Training Results and Summary

    Our model successfully memorized the training corpus and can generate fluent text.

    Small models are better suited to small corpora, avoiding overfitting and gradient instability.

    Visualizing the training curve helps determine whether the model is converging.

    Next step: systematically test how model size affects training.

    All project files mentioned in this article:

    https://github.com/spikec137/miniGPT

    Reference link:

    https://www.shigeku.org/shiku/ws/wg/tagore.htm

    Published on August 3, 2025.
