GPT-3 stands for “Generative Pre-trained Transformer 3” (GPT-3.5 is a later revision of the same model). It is a transformer network.

The architecture is a standard transformer network (with a few engineering tweaks) at unprecedented scale: a 2048-token context window and 175 billion parameters, requiring 800 GB of storage.
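As a rough sanity check on the 175-billion figure, the hyper-parameters published for the largest GPT-3 model (96 layers, a hidden size of 12,288, a ~50K-token BPE vocabulary, 2048-token context) can be plugged into the standard transformer parameter-count formula. The sketch below ignores layer norms and biases, so the total is approximate, not an exact accounting of the released model:

```python
# Back-of-the-envelope parameter count for GPT-3 175B, using the
# hyper-parameters reported in the GPT-3 paper (Brown et al., 2020).
# Layer norms and biases are omitted, so the result is approximate.

n_layers = 96        # transformer blocks
d_model = 12288      # hidden size
n_ctx = 2048         # context window (tokens)
n_vocab = 50257      # BPE vocabulary size

# Per block: attention projections (Q, K, V, output) = 4 * d_model^2,
# feed-forward network (two 4x-expansion matrices) = 8 * d_model^2.
per_block = 12 * d_model ** 2
blocks = n_layers * per_block

# Token embeddings plus learned positional embeddings.
embeddings = (n_vocab + n_ctx) * d_model

total = blocks + embeddings
print(f"approx. parameters: {total / 1e9:.1f} B")      # ~174.6 B
print(f"fp32 storage:       {total * 4 / 1e9:.0f} GB")  # ~700 GB
```

At four bytes per parameter this comes to roughly 700 GB, the same order of magnitude as the 800 GB storage figure quoted above (checkpoint formats and optimizer state add overhead).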

The model was trained on massive amounts of web text: a filtered version of Common Crawl contributed 410 billion tokens and was given a 60% weight in the training mix.
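The 60% figure is a sampling weight rather than a share of the raw token count: during training, examples are drawn from each corpus in proportion to its weight, not its size. A minimal sketch of that kind of mixture-weighted sampling follows; only the Common Crawl weight comes from the text above, and the corpus names and helper function are illustrative, not the actual training code.

```python
import random

# Illustrative mixture-weighted sampling: each draw picks a corpus in
# proportion to its sampling weight, independent of corpus size.
# Only the Common Crawl weight (60%) is taken from the text above;
# the other entry is a placeholder, not the real GPT-3 mixture.
corpora = {
    "common_crawl": 0.60,   # 410B tokens, 60% sampling weight
    "other_corpora": 0.40,  # placeholder for the remaining sources
}

def sample_corpus(weights: dict) -> str:
    """Pick a corpus name with probability proportional to its weight."""
    names, probs = zip(*weights.items())
    return random.choices(names, weights=probs, k=1)[0]

# Over many draws, roughly 60% of examples come from Common Crawl.
draws = [sample_corpus(corpora) for _ in range(10_000)]
print(draws.count("common_crawl") / len(draws))  # ≈ 0.6
```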