Building GPT from Scratch: 6. Attention
Finally, attention - the core of the Transformer architecture. Roughly speaking, the attention mechanism identifies how important each neighboring word is to a given word, and uses that information in producing the word’s representation.
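To make that concrete, here is a minimal sketch of single-head scaled dot-product attention in PyTorch. This is my own illustration, not the GPT-Neo implementation; the function name and tensor shapes are assumptions.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, seq_len, d_head). Scores measure how relevant
    # each position (word) is to every other position.
    d_head = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)
    if mask is not None:
        # Causal mask: a position may only attend to itself and earlier positions.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)  # importance of each neighbor
    return weights @ v                   # weighted mix of neighbor information

# Toy usage: one sequence of 4 tokens with an 8-dimensional head.
q = k = v = torch.randn(1, 4, 8)
causal = torch.tril(torch.ones(4, 4))
out = scaled_dot_product_attention(q, k, v, causal)
print(out.shape)  # torch.Size([1, 4, 8])
```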
The next component I want to replace is GPTNeoBlock. Since it’s not a single layer but a composition of several layers, I plan to copy its implementation and replace each layer individually.
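As a sketch of that plan, the copied block might look like the following. The pre-norm residual layout and the attribute names (`ln_1`, `attn`, `ln_2`, `mlp`) follow the usual GPT-style block; verify them against the actual GPTNeoBlock source before assuming they match.

```python
import torch
import torch.nn as nn

class MyBlock(nn.Module):
    """A stand-in for GPTNeoBlock whose sublayers can be swapped one at a time."""

    def __init__(self, hidden_size, attn, mlp):
        super().__init__()
        self.ln_1 = nn.LayerNorm(hidden_size)
        self.attn = attn  # keep the original sublayer at first, replace it later
        self.ln_2 = nn.LayerNorm(hidden_size)
        self.mlp = mlp

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # residual connection around attention
        x = x + self.mlp(self.ln_2(x))   # residual connection around the MLP
        return x

# Smoke test with identity sublayers; each can be swapped independently later.
block = MyBlock(hidden_size=8, attn=nn.Identity(), mlp=nn.Identity())
print(block(torch.randn(1, 4, 8)).shape)  # torch.Size([1, 4, 8])
```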
When I look at the constructor of the GPT class, I notice that the first two modules are Embedding. So, what exactly is an embedding? Neural networks, including large language models (LLMs), require numeric inputs rather than raw text, so an embedding maps each discrete token ID to a dense vector the network can work with.
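A small sketch of those two embeddings, assuming PyTorch; the vocabulary size, vector width, and token IDs below are arbitrary GPT-2-like numbers I chose for illustration.

```python
import torch
import torch.nn as nn

# One embedding table for token identities, one for positions.
tok_emb = nn.Embedding(num_embeddings=50257, embedding_dim=768)  # one row per token ID
pos_emb = nn.Embedding(num_embeddings=2048, embedding_dim=768)   # one row per position

token_ids = torch.tensor([[11, 42, 7]])                   # batch of one 3-token sequence
positions = torch.arange(token_ids.size(1)).unsqueeze(0)  # [[0, 1, 2]]
x = tok_emb(token_ids) + pos_emb(positions)               # (1, 3, 768), fed to the blocks
print(x.shape)
```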
Training the tiny GPT model is relatively straightforward. For each data point, the training process splits it into X and Y, where Y is simply X shifted by one token. This means the model is trained to predict, at every position, the token that comes next.
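The shift is easiest to see on a toy tensor. `make_xy` and `block_size` are hypothetical names of my own, not anything from the model code.

```python
import torch

def make_xy(tokens, block_size):
    # X is a window of tokens; Y is the same window shifted left by one,
    # so y[t] is the target the model should predict after seeing x[0..t].
    x = tokens[:block_size]
    y = tokens[1:block_size + 1]
    return x, y

data = torch.tensor([10, 20, 30, 40, 50, 60])
x, y = make_xy(data, block_size=4)
print(x)  # tensor([10, 20, 30, 40])
print(y)  # tensor([20, 30, 40, 50])
```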
I will use GPT-Neo to implement a lightweight version of GPT. GPT-Neo requires a configuration instance that includes hyperparameters such as the context size and the number of layers.
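For reference, a deliberately tiny configuration might look like this, using the `GPTNeoConfig` and `GPTNeoForCausalLM` classes from Hugging Face transformers. The specific values are my own choices for a small model, not anything GPT-Neo prescribes.

```python
from transformers import GPTNeoConfig, GPTNeoForCausalLM

config = GPTNeoConfig(
    vocab_size=50257,
    max_position_embeddings=256,  # context size
    hidden_size=128,
    num_layers=4,
    num_heads=4,
    # GPT-Neo alternates attention kinds per layer; this pattern
    # expands to 4 layers: global, local, global, local.
    attention_types=[[["global", "local"], 2]],
)

model = GPTNeoForCausalLM(config)  # randomly initialized, not pretrained
print(sum(p.numel() for p in model.parameters()))  # parameter count of the tiny model
```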
As a software engineer, I find that one of the most effective ways to learn new concepts is through hands-on practice. In a series of posts, I’ll document my journey of understanding how GPT works.