Artificial intelligence (AI) is the simulation of human intelligence processes by machines, especially computer systems.
In May 2020, OpenAI published a groundbreaking paper titled Language Models Are Few-Shot Learners, in which the team presented GPT-3, a language model. With 175 billion parameters, it holds the record for the largest neural network ever created, and it's an order of magnitude larger than the largest previous language models. GPT-3 was trained on a huge corpus of text drawn largely from the Internet. It showed amazing performance in various NLP (natural language processing) tasks, including translation, question-answering, and cloze tasks, even surpassing state-of-the-art models.
In another astonishing display of its power, GPT-3 was able to generate “news articles” almost indistinguishable from human-made pieces. Judges barely achieved above-chance accuracy (52%) at correctly classifying GPT-3 texts.
Before getting into the article’s meat, I want to explain what GPT-3 is and how it works. I won’t go into much detail here, as there are a lot of good resources out there already. For those of you who don’t know anything about GPT-3, this section will serve as a contextual reference. You don’t need to remember (or understand) any of this to enjoy the rest of the article. But it can give you a better perspective on all the fuss generated around this AI system.
All these concepts relate to GPT models in some sense. For now, I’ll tell you the definitions. I’ll avoid too much technical detail although some previous knowledge might be required to follow through. I’ll show later how they are linked to each other and GPT-3.
Transformers are a type of neural network that appeared in 2017 as a new framework for solving machine translation problems (problems in which both the input and the output are sequences). The authors wanted to get rid of convolutions and recurrence (CNNs and RNNs) and rely completely on attention mechanisms. Transformers are state-of-the-art in NLP.
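To make "attention" a little more concrete, here is a minimal sketch of scaled dot-product attention, the basic operation transformers are built on. It is only an illustration in plain Python: the function names and the toy 2-dimensional vectors are my own assumptions, and a real transformer adds learned projections, multiple heads, and much more.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(dim).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
            for k in keys
        ]
        weights = softmax(scores)
        # The output is a weighted average of the value vectors.
        mixed = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(mixed)
    return outputs

# Toy (made-up) embeddings for a three-token sequence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Self-attention: queries, keys, and values all come from the same tokens.
result = attention(tokens, tokens, tokens)
for row in result:
    print([round(x, 3) for x in row])
```

Each output row is a blend of all the input vectors, weighted by how similar each token is to the one being processed. That is how every position can "look at" every other position without recurrence or convolutions.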
Jason Brownlee defines language models as
“probabilistic models that can predict the next word in the sequence given the words that precede it.”
These models can solve many NLP tasks, such as machine translation, question answering, text summarization, or image captioning.
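To get a feel for what "predicting the next word" means, here is a minimal sketch of a bigram language model that only counts which word follows which. The tiny corpus and the function names are made up for illustration; GPT models use a transformer network rather than counts, but the goal (a probability for each possible next word) is the same.

```python
from collections import Counter, defaultdict

def train_bigram_model(text):
    """Count, for every word, which words follow it in the text."""
    words = text.lower().split()
    followers = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        followers[current][nxt] += 1
    return followers

def next_word_probabilities(model, word):
    """Estimate p(next word | word) from the follower counts."""
    counts = model.get(word, Counter())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# A made-up toy corpus.
corpus = "the cat sat on the mat the cat ate the fish"
model = train_bigram_model(corpus)

probs = next_word_probabilities(model, "the")
print(probs)  # {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
```

In this corpus "the" is followed by "cat" twice and by "mat" and "fish" once each, so the model gives "cat" half of the probability.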
In statistics, there are discriminative and generative models, which are often used to perform classification tasks. Discriminative models encode the conditional probability of a given pair of observable and target variables: p(y|x). Generative models encode the joint probability: p(x,y). Generative models can “generate new data similar to existing data,” which is the key idea to take away. Apart from GPT, other popular examples of generative models are GANs (generative adversarial networks) and VAEs (variational autoencoders).
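The difference between p(y|x) and p(x,y) is easier to see with numbers. The sketch below estimates both from a handful of made-up (weather, activity) observations; the data and the function names are assumptions for illustration only. Note that only the joint distribution lets us sample brand-new (x, y) pairs.

```python
import random
from collections import Counter

# Made-up observations: (x = weather, y = activity).
data = [
    ("sunny", "walk"), ("sunny", "walk"), ("sunny", "read"),
    ("rainy", "read"), ("rainy", "read"), ("rainy", "walk"),
]

pair_counts = Counter(data)
x_counts = Counter(x for x, _ in data)
n = len(data)

def joint(x, y):
    """Generative view: p(x, y), how likely the whole pair is."""
    return pair_counts[(x, y)] / n

def conditional(y, x):
    """Discriminative view: p(y | x) = p(x, y) / p(x)."""
    return pair_counts[(x, y)] / x_counts[x]

def sample_pair(rng):
    """Generate a new (x, y) pair by sampling from the joint distribution."""
    pairs = list(pair_counts)
    weights = [pair_counts[p] for p in pairs]
    return rng.choices(pairs, weights=weights, k=1)[0]

print(joint("sunny", "walk"))        # 2 of 6 observations
print(conditional("walk", "sunny"))  # 2 of the 3 sunny observations
print(sample_pair(random.Random(0)))
```

A discriminative model only needs the second function to classify. A generative model captures the first, and sampling from it is what "generating new data similar to existing data" means.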
This training paradigm combines unsupervised pre-training with supervised fine-tuning. The idea is to train a model on a very large dataset in an unsupervised way, and then adapt (fine-tune) the model to different tasks by using supervised training on smaller datasets. This paradigm solves two problems: it doesn’t need much expensive labelled data, and tasks without large datasets can be tackled. It’s worth mentioning that GPT-2 and GPT-3 are fully unsupervised (more about this soon).
Usually, deep learning systems are trained and tested for a specific set of classes. If a computer vision system is trained to classify cat, dog, and horse images, it can be tested only on those three classes. In contrast, in a zero-shot learning setup the system is shown at test time, without any weight updates, classes it has not seen at training time (for instance, testing the system on elephant images). The same goes for one-shot and few-shot settings, but in these cases the system sees one or a few examples of the new classes at test time, respectively. The idea is that a powerful enough system could perform well in these situations, which OpenAI proved with GPT-2 and GPT-3.
Most deep learning systems are single-task. One popular example is AlphaZero. It can learn a few games like chess or Go, but it can only play one type of game at a time. If it knows how to play chess, it doesn’t know how to play Go. Multitask systems overcome this limitation. They’re trained to be able to solve different tasks for a given input. For instance, if I feed the word ‘cat’ to the system, I could ask it to find the Spanish translation ‘gato’, I could ask it to show me the image of a cat, or I could ask it to describe its features. Different tasks for the same input.
Zero/one/few-shot task transfer:
The idea is to combine the concepts of zero/one/few-shot learning and multitask learning. Instead of showing the system new classes at test time, we could ask it to perform new tasks (showing it zero, one, or a few examples of the new task). For instance, let’s take a system trained on a huge text corpus. In a one-shot task transfer setting, we could write: “I love you -> Te quiero. I hate you -> ____.” We are implicitly asking the system to translate a sentence from English to Spanish (a task it hasn’t been trained on) by showing it a single example (one-shot).
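The translation example can be turned into a small prompt builder. This is only a sketch of how zero-, one-, and few-shot prompts differ as plain text; `build_prompt`, its parameters, and the extra example pairs are hypothetical and not part of any OpenAI API. The resulting string is what we would hand to the language model to complete.

```python
def build_prompt(examples, query, instruction=None, separator=" -> "):
    """Format an optional instruction, some demonstrations, and an open query.

    examples: list of (source, target) pairs shown to the model.
    query: the new input whose completion we want the model to write.
    """
    lines = []
    if instruction:
        lines.append(instruction)
    lines.extend(f"{source}{separator}{target}" for source, target in examples)
    lines.append(f"{query}{separator}")  # left open for the model to fill in
    return "\n".join(lines)

# Zero-shot: no examples, so the task has to be stated in words.
zero_shot = build_prompt([], "I hate you", instruction="Translate English to Spanish.")

# One-shot: a single demonstration implies the task.
one_shot = build_prompt([("I love you", "Te quiero")], "I hate you")

# Few-shot: a handful of demonstrations (the extra pairs are made up).
few_shot = build_prompt(
    [("I love you", "Te quiero"), ("Good morning", "Buenos días"), ("Thank you", "Gracias")],
    "I hate you",
)

print(one_shot)
```

Nothing about the model changes between the three settings: there are no weight updates, and the only difference is how many demonstrations appear in the text it is asked to continue.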
All these concepts come together in the definition of a GPT model. GPT stands for Generative Pre-trained Transformer. Models of the GPT family have in common that they are language models based on the transformer architecture, pre-trained in a generative, unsupervised manner, that show decent performance in zero/one/few-shot multitask settings. This isn’t an explanation of how all these concepts work together in practice, but a simple way to remember that together they build up what a GPT model is.
GPT-3: A revolution for artificial intelligence
GPT-3 was bigger than its brothers (100x bigger than GPT-2). This detail is important because, although the similarity is high among GPT models, the performance of GPT-3 surpassed every possible expectation. Its sheer size, a quantitative leap from GPT-2, seems to have produced qualitatively better results.
The significance of this fact lies in its effect on a long-standing debate in artificial intelligence: How can we achieve artificial general intelligence? Should we design specific modules (common-sense reasoning, causality, intuitive physics, theory of mind), or will we get there simply by building bigger models with more parameters and more training data? GPT-3 showed amazing performance, surpassing state-of-the-art models on various tasks in the few-shot setting (and in some cases even in the zero-shot setting). Its superior size combined with a few examples was enough to outperform competitors in machine translation, question-answering, and cloze (fill-in-the-blank) tasks. (It’s important to note that in other tasks GPT-3 doesn’t even get close to state-of-the-art supervised fine-tuned models.)
The authors pointed out that few-shot results were considerably better than zero-shot results, and this gap seemed to grow in parallel with model capacity. This implies that GPT-3 can learn what task it’s expected to do just by seeing some examples of it, and can then perform that task with notable proficiency. Indeed, Rohin Shah notes that “few-shot performance increases as the number of parameters increases, and the rate of increase is faster than the corresponding rate for zero-shot performance.” This is the main hypothesis and the reason behind the paper’s title.
OpenAI opened the beta because they wanted to see what GPT-3 could do and what new uses people could find. They had already tested the system on standard NLP benchmarks (which aren’t as creative or entertaining as the ones I’m going to show here). As expected, in no time Twitter and other blogs were flooded with amazing results from GPT-3. Below is an extensive review of the most popular ones (I recommend checking out the examples to build up the amazement and then coming back to the article).
GPT-3’s conversational skills
GPT-3 has stored huge amounts of internet data, so it knows a lot about public and historical figures. It’s more surprising, however, that it can emulate people. It can be used as a chatbot, which is impressive because chatting can’t be specified as a task in the prompt. Let’s see some examples.
ZeroCater CEO Arram Sabeti used GPT-3 to make Tim Ferriss interview Marcus Aurelius about Stoicism. Mckay Wrigley designed Bionicai, an app that aims to help people learn from anyone, from philosophy with Aristotle to writing skills with Shakespeare. He shared on Twitter some of the results people got. Psychologist Scott Barry Kaufman was impressed when he read an excerpt of his GPT-3 doppelgänger. Jordan Moore made a Twitter thread where he talked with the GPT-3 versions of Jesus Christ, Steve Jobs, Elon Musk, Cleopatra, and Kurt Cobain. And Gwern did a very good job further exploring the possibilities of the model regarding conversations and personification.
GPT-3’s useful possibilities
Some found applications for the system that not even the creators had thought of, such as writing code from English prompts. Sharif Shameem built a “layout generator” with which he could give instructions to GPT-3 in natural language for it to write the corresponding JSX code. He also developed Debuild.co, a tool we can use to make GPT-3 write code for a React app from only a description. Jordan Singer created a Figma plugin on top of GPT-3 that designs for him. Another interesting use was found by Shreya Shankar, who built a demo to translate equations from English to LaTeX. And Paras Chopra created a Q&A search engine that would output the answer to a question along with the corresponding URL.
GPT-3 has an artist’s soul
Moving to the creative side of GPT-3, we find OpenAI researcher Amanda Askell, who used the system to create a guitar tab titled Idle Summer Days and to write a funny story of Georg Cantor in a hotel. Arram Sabeti told GPT-3 to write a poem about Elon Musk by Dr Seuss and a rap song about Harry Potter by Lil Wayne. But the most impressive creative feat of GPT-3 has to be the game AI Dungeon. In 2019, Nick Walton built the role-playing game on top of GPT-2. He has now adapted it to GPT-3 and it’s earning $16,000 a month on Patreon.
GPT-3’s reasoning abilities
The most intrepid tested GPT-3 in areas in which only humans excel. Parse CTO Kevin Lacker wondered about common-sense reasoning and logic, and found that GPT-3 was able to keep up, although it failed when entering “surreal territory”. However, Nick Cammarata found that specifying uncertainty in the prompt allowed GPT-3 to handle “surreal” questions by answering “yo be real”. Gwern explains that GPT-3 may need explicit uncertainty prompts because we humans tend not to say “I don’t know” and the system is simply imitating this flaw.
GPT-3 also proved capable of having spiritual and philosophical conversations that might go beyond our cognitive boundaries. Tomer Ullman made GPT-3 conceive 10 philosophical/moral thought experiments. Messagink is a tool that outputs the meaning of life according to “famous people, things, objects, [or] feelings.”
And Bernhard Mueller, in an attempt to unveil the philosophical Holy Grail, made the ultimate test for GPT-3. He gave it a prompt to find the question to 42 and after some exchanges, GPT-3 said: “The answer is so far beyond your understanding that you cannot comprehend the question. And that, my child, is the answer to life, the Universe and everything.” Amazing and scary at the same time.
GPT-3 produced amazing results, received wild hype, generated increasing worry, and received a wave of critiques and counter-critiques. I don’t know what to expect in the future from these types of models but what’s for sure is that GPT-3 remains unmatched right now. It’s the most powerful neural network as of today and accordingly, it has received the most intense focus, in every possible sense.
All eyes are on GPT-3: there are those who acclaim it as a great step forward towards human-like artificial intelligence and those who reduce it to little more than an overhyped, powerful autocomplete. There are interesting arguments on both sides. Now, it’s your turn to think about what it means for the present of AI and what it’ll mean for the future of the world.