Building GPT from Scratch: 6. Attention
Finally, attention - the core of the Transformer architecture. Roughly speaking, the attention mechanism identifies how important each neighboring word is to a given word, and uses that information in producing the word’s representation.
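To make that concrete, here is a minimal sketch of single-head scaled dot-product attention in PyTorch. This is my own illustration, not the GPT-Neo implementation; the function name and tensor shapes are assumptions.

```python
import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: (batch, seq_len, d_head). Scores measure how relevant
    # each position (word) is to every other position.
    d_head = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)
    if mask is not None:
        # Causal mask: a position may only attend to itself and earlier positions.
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = F.softmax(scores, dim=-1)  # importance of each neighbor
    return weights @ v                   # weighted mix of neighbor information

# Toy usage: one sequence of 4 tokens with an 8-dimensional head.
q = k = v = torch.randn(1, 4, 8)
causal = torch.tril(torch.ones(4, 4))
out = scaled_dot_product_attention(q, k, v, causal)
print(out.shape)  # torch.Size([1, 4, 8])
```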
The next component I want to replace is GPTNeoBlock. Since it’s not a single layer but a composition of several layers, I plan to copy its implementation and replace each layer individually.
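As a sketch of that plan, the copied block might look like the following. The pre-norm residual layout and the attribute names (`ln_1`, `attn`, `ln_2`, `mlp`) follow the usual GPT-style block; verify them against the actual GPTNeoBlock source before assuming they match.

```python
import torch
import torch.nn as nn

class MyBlock(nn.Module):
    """A stand-in for GPTNeoBlock whose sublayers can be swapped one at a time."""

    def __init__(self, hidden_size, attn, mlp):
        super().__init__()
        self.ln_1 = nn.LayerNorm(hidden_size)
        self.attn = attn  # keep the original sublayer at first, replace it later
        self.ln_2 = nn.LayerNorm(hidden_size)
        self.mlp = mlp

    def forward(self, x):
        x = x + self.attn(self.ln_1(x))  # residual connection around attention
        x = x + self.mlp(self.ln_2(x))   # residual connection around the MLP
        return x

# Smoke test with identity sublayers; each can be swapped independently later.
block = MyBlock(hidden_size=8, attn=nn.Identity(), mlp=nn.Identity())
print(block(torch.randn(1, 4, 8)).shape)  # torch.Size([1, 4, 8])
```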
When I look at the constructor of the GPT class, I notice that the first two modules are Embedding. So, what exactly is an embedding? Neural networks, including large language models (LLMs), require numeric inputs rather than raw text, so an embedding maps each discrete token ID to a dense vector the network can work with.
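A small sketch of those two embeddings, assuming PyTorch; the vocabulary size, vector width, and token IDs below are arbitrary GPT-2-like numbers I chose for illustration.

```python
import torch
import torch.nn as nn

# One embedding table for token identities, one for positions.
tok_emb = nn.Embedding(num_embeddings=50257, embedding_dim=768)  # one row per token ID
pos_emb = nn.Embedding(num_embeddings=2048, embedding_dim=768)   # one row per position

token_ids = torch.tensor([[11, 42, 7]])                   # batch of one 3-token sequence
positions = torch.arange(token_ids.size(1)).unsqueeze(0)  # [[0, 1, 2]]
x = tok_emb(token_ids) + pos_emb(positions)               # (1, 3, 768), fed to the blocks
print(x.shape)
```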
Training the tiny GPT model is relatively straightforward. For each data point, the training process splits it into X and Y, where Y is simply X shifted by one token. This means the model is trained to predict, at every position, the token that comes next.
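The shift is easiest to see on a toy tensor. `make_xy` and `block_size` are hypothetical names of my own, not anything from the model code.

```python
import torch

def make_xy(tokens, block_size):
    # X is a window of tokens; Y is the same window shifted left by one,
    # so y[t] is the target the model should predict after seeing x[0..t].
    x = tokens[:block_size]
    y = tokens[1:block_size + 1]
    return x, y

data = torch.tensor([10, 20, 30, 40, 50, 60])
x, y = make_xy(data, block_size=4)
print(x)  # tensor([10, 20, 30, 40])
print(y)  # tensor([20, 30, 40, 50])
```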
I will use GPT-Neo to implement a lightweight version of GPT. GPT-Neo requires a configuration instance that includes hyperparameters such as the context size and the number of layers.
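For reference, a deliberately tiny configuration might look like this, using the `GPTNeoConfig` and `GPTNeoForCausalLM` classes from Hugging Face transformers. The specific values are my own choices for a small model, not anything GPT-Neo prescribes.

```python
from transformers import GPTNeoConfig, GPTNeoForCausalLM

config = GPTNeoConfig(
    vocab_size=50257,
    max_position_embeddings=256,  # context size
    hidden_size=128,
    num_layers=4,
    num_heads=4,
    # GPT-Neo alternates attention kinds per layer; this pattern
    # expands to 4 layers: global, local, global, local.
    attention_types=[[["global", "local"], 2]],
)

model = GPTNeoForCausalLM(config)  # randomly initialized, not pretrained
print(sum(p.numel() for p in model.parameters()))  # parameter count of the tiny model
```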
As a software engineer, I find that one of the most effective ways to learn new concepts is through hands-on practice. In a series of posts, I’ll document my journey of understanding how GPT works.