Building GPT from Scratch: 5. LayerNorm
The next component I want to replace is GPTNeoBlock. Since it is not a single layer but a composition of several layers, I plan to copy its implementation first and then replace each layer individually. The GPTNeoBlock implementation can be copied straight from the transformers source on GitHub.
# model.py
import torch.nn as nn

from transformers.models.gpt_neo.modeling_gpt_neo import GPTNeoAttention, GPTNeoMLP


class GPTBlock(nn.Module):
    def __init__(self, config, layer_id):
        super().__init__()
        hidden_size = config.hidden_size
        inner_dim = config.intermediate_size if config.intermediate_size is not None else 4 * hidden_size
        self.ln_1 = nn.LayerNorm(hidden_size, eps=config.layer_norm_epsilon)
        self.attn = GPTNeoAttention(config, layer_id)
        self.ln_2 = nn.LayerNorm(hidden_size, eps=config.layer_norm_epsilon)
        self.mlp = GPTNeoMLP(inner_dim, config)

    def forward(
        self,
        hidden_states,
        layer_past=None,
        attention_mask=None,
        head_mask=None,
        use_cache=False,
        output_attentions=False,
    ):
        residual = hidden_states
        hidden_states = self.ln_1(hidden_states)
        attn_outputs = self.attn(
            hidden_states,
            layer_past=layer_past,
            attention_mask=attention_mask,
            head_mask=head_mask,
            use_cache=use_cache,
            output_attentions=output_attentions,
        )
        attn_output = attn_outputs[0]  # output_attn: a, present, (attentions)
        outputs = attn_outputs[1:]
        # residual connection
        hidden_states = attn_output + residual

        residual = hidden_states
        hidden_states = self.ln_2(hidden_states)
        feed_forward_hidden_states = self.mlp(hidden_states)
        # residual connection
        hidden_states = residual + feed_forward_hidden_states

        if use_cache:
            outputs = (hidden_states,) + outputs
        else:
            outputs = (hidden_states,) + outputs[1:]

        return outputs  # hidden_states, present, (attentions, cross_attentions)
After that, I can replace GPTNeoBlock with GPTBlock.
- self.blocks = nn.ModuleList([GPTNeoBlock(config, layer_id=i) for i in range(config.num_layers)])
+ self.blocks = nn.ModuleList([GPTBlock(config, layer_id=i) for i in range(config.num_layers)])
The first layer inside GPTBlock I will replace is nn.LayerNorm. Layer normalization was introduced in the Layer Normalization paper. In short, it is known to make training more stable. The concept is straightforward: LayerNorm normalizes each input (i.e., the output of the previous layer) using the following normalization terms:

\[\mu = \frac{1}{H} \sum_{i=1}^{H} a_i, \qquad \sigma = \sqrt{\frac{1}{H} \sum_{i=1}^{H} (a_i - \mu)^2}\]

where \(H\) is the number of hidden units in the layer and \(a_i\) is the input of each unit. Then, the input can be normalized as follows.
\[x_{normalized} = \frac{x - \mu}{\sigma}\]

In practice, to avoid division-by-zero errors, it's common to add a small constant \(\epsilon\) to the denominator. Therefore,

\[x_{normalized} = \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}}\]

Now let's implement our own LayerNorm.
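Before jumping into the code, here is a quick sanity check of the formula with a toy example of my own (not from the paper). For the input \(a = (1, 3, 5)\),

\[\mu = \frac{1 + 3 + 5}{3} = 3, \qquad \sigma^2 = \frac{(1-3)^2 + (3-3)^2 + (5-3)^2}{3} = \frac{8}{3}\]

so \(x_{normalized} \approx (-1.2247,\ 0,\ 1.2247)\), with \(\epsilon = 10^{-5}\) making a negligible difference.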
# norm.py
import torch
import torch.nn as nn


class LayerNorm(nn.Module):
    def __init__(self, epsilon=1e-5):
        super().__init__()
        self.epsilon = epsilon
        # A dummy buffer is registered so this module can be moved to the GPU;
        # PyTorch complains when a module without any Parameters or Buffers is sent to the GPU.
        self.register_buffer("dummy", torch.empty(0))

    def forward(self, X):
        # Mean and (biased) variance over the last, i.e. hidden, dimension
        mu = torch.sum(X, dim=-1, keepdim=True) / X.shape[-1]
        variance = torch.sum(torch.square(X - mu), dim=-1, keepdim=True) / X.shape[-1]
        return (X - mu) / torch.sqrt(variance + self.epsilon)
I can test the implementation by comparing its output with that of PyTorch’s LayerNorm.
>>> import torch
>>> from tinyllm import norm
>>>
>>> X = torch.randn(2, 3, 4)
>>>
>>> norm.LayerNorm()(X)
tensor([[[ 0.2828,  1.4225, -1.3231, -0.3822],
         [-0.5129, -1.2819,  0.4060,  1.3888],
         [-0.1348, -0.1960,  1.5598, -1.2290]],

        [[ 1.6377, -0.3495, -1.0720, -0.2162],
         [ 1.1558, -1.1942, -0.7673,  0.8058],
         [-1.5553,  0.3317,  0.0108,  1.2128]]])
>>> torch.nn.LayerNorm(4)(X)
tensor([[[ 0.2828,  1.4225, -1.3231, -0.3822],
         [-0.5129, -1.2819,  0.4060,  1.3888],
         [-0.1348, -0.1960,  1.5598, -1.2290]],

        [[ 1.6377, -0.3495, -1.0720, -0.2162],
         [ 1.1558, -1.1942, -0.7673,  0.8058],
         [-1.5553,  0.3317,  0.0108,  1.2128]]],
       grad_fn=<NativeLayerNormBackward0>)
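Rather than eyeballing the printed values, the two outputs can also be compared numerically. This torch.allclose check is my own addition; since both implementations use \(\epsilon = 10^{-5}\), it should print True.

>>> torch.allclose(norm.LayerNorm()(X), torch.nn.LayerNorm(4)(X), atol=1e-6)
True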
As I’ve verified, the custom LayerNorm implementation produces the same results as PyTorch’s nn.LayerNorm. Now, I can replace nn.LayerNorm with norm.LayerNorm in TinyGPT.
-        self.ln_1 = nn.LayerNorm(hidden_size, eps=config.layer_norm_epsilon)
+        self.ln_1 = norm.LayerNorm()
         self.attn = GPTNeoAttention(config, layer_id)
-        self.ln_2 = nn.LayerNorm(hidden_size, eps=config.layer_norm_epsilon)
+        self.ln_2 = norm.LayerNorm()
That wraps up the replacement of LayerNorm. In PyTorch’s implementation, there are two parameters, \(\gamma\) and \(\beta\), which serve as scale and shift factors, respectively. These are omitted in norm.LayerNorm. I haven’t read it yet, but Understanding and Improving Layer Normalization might offer deeper insights into layer normalization and the role of these parameters.
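For reference, here is a minimal sketch of what norm.LayerNorm could look like with those affine parameters added. The hidden_size argument and the gamma/beta attribute names are my own choices for illustration and are not part of TinyGPT.

# norm.py (hypothetical variant with learnable scale and shift)
import torch
import torch.nn as nn


class AffineLayerNorm(nn.Module):
    def __init__(self, hidden_size, epsilon=1e-5):
        super().__init__()
        self.epsilon = epsilon
        # gamma (scale) starts at 1 and beta (shift) at 0, so the initial
        # behavior matches the plain normalization above.
        self.gamma = nn.Parameter(torch.ones(hidden_size))
        self.beta = nn.Parameter(torch.zeros(hidden_size))

    def forward(self, X):
        mu = torch.sum(X, dim=-1, keepdim=True) / X.shape[-1]
        variance = torch.sum(torch.square(X - mu), dim=-1, keepdim=True) / X.shape[-1]
        return (X - mu) / torch.sqrt(variance + self.epsilon) * self.gamma + self.beta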
So far, I’ve replaced a few simple modules, and it’s already been a great learning experience with PyTorch. In the next post, I’ll implement Attention, which is the core of the Transformer, and hence the GPT architecture.