Welcome to the future of AI: generative AI! Have you ever wondered how machines learn to understand human language and respond accordingly? Let’s take a look at ChatGPT, a revolutionary language model developed by OpenAI. Built on the GPT-3.5 architecture, ChatGPT has taken the world by storm, transforming machine communication and opening endless possibilities for human-machine interaction. The race has officially begun with the recent launch of ChatGPT’s competitor, Google BARD, powered by PaLM 2. In this article, we’ll look at the inner workings of ChatGPT: the different training steps, such as pre-training and RLHF, and how the model understands and generates human-like text with remarkable accuracy.
“Generative AI opens up new creative possibilities that we never thought possible before.”
Douglas Eck, researcher at Google Brain
Explore the inner workings of ChatGPT and discover how it can perceive and generate human-like text with remarkable accuracy. Get ready to be amazed by ChatGPT’s cutting-edge technology and discover the limitless potential of this powerful language model.
The main objectives of the article are –
- Discuss the steps involved in training the ChatGPT model.
- Learn the benefits of using reinforcement learning from human feedback (RLHF).
- Learn how people contribute to improving models like ChatGPT.
Overview of ChatGPT Training
ChatGPT is a large language model (LLM) optimized for dialogue. It is built on GPT 3.5 using Reinforcement Learning from Human Feedback (RLHF). It is trained on massive amounts of data from the Internet.
There are essentially three steps involved in creating ChatGPT:
- Pre-training the LLM
- Supervised Fine-Tuning of the LLM (SFT)
- Reinforcement Learning from Human Feedback (RLHF)
The first step is to pre-train the LLM (GPT 3.5) on unsupervised data to predict the next word in a sentence. This forces the LLM to learn text representations and the various nuances of language.
In the next step, we fine-tune the LLM on demonstration data: a dataset of prompt-and-response pairs. This optimizes the LLM for dialogue.
In the final step, we use RLHF to steer the responses generated by the LLM, teaching the model to prefer its better responses.
Now we will discuss each step in detail.
Pre-training the LLM
Language models are statistical models that predict the next word in a sequence. Large language models are deep learning models trained on billions of words. The training data is pulled from many sources such as Reddit, StackOverflow, Wikipedia, books, ArXiv, GitHub, etc.
The image above gives an idea of the size of the datasets and the number of parameters. Pre-training an LLM is computationally expensive because it requires huge amounts of hardware and data. At the end of pre-training, we get an LLM that can predict the next word in a sentence given a prompt. For example, if we give it the sentence “roses are red and,” it might respond with “violets are blue.” The image below shows what GPT-3 can do at the end of pre-training:
We can see that the model tries to complete the sentence rather than answer a question. But we want an answer, not just a continuation of the text. What could be the next step to achieve that? Let’s see in the next section.
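Before moving on, here is a minimal sketch of this completion behavior, using the openly available GPT-2 as a stand-in for GPT-3.5 (whose weights are not public); the exact continuation it produces will differ.

```python
# A minimal sketch of next-word prediction with a pre-trained causal LM.
# GPT-2 stands in for GPT-3.5 here; it only continues the prompt.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "roses are red and"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding: the model simply keeps predicting the most likely next token.
output_ids = model.generate(
    **inputs,
    max_new_tokens=10,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```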
Supervised Fine-Tuning of the LLM (SFT)
So how do we get the LLM to answer a question rather than predict the next word? Supervised fine-tuning of the model solves this problem. We show the model the desired response for a given query and fine-tune it on those examples. For this, we create a set of different types of questions to ask the conversational model, and human labelers provide appropriate responses so that the model learns the expected output. This dataset of prompt-and-response pairs is called demonstration data. Now let’s look at a sample set of queries and their responses in the demonstration data.
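Conceptually, fine-tuning on demonstration data looks something like the sketch below. The model name and the two example pairs are placeholders for illustration, not the actual ChatGPT data or base model.

```python
# A minimal sketch of supervised fine-tuning (SFT) on demonstration data.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Hypothetical prompt/response pairs; real SFT uses thousands of human-written examples.
demonstrations = [
    {"prompt": "What is the capital of France?",
     "response": "The capital of France is Paris."},
    {"prompt": "Explain photosynthesis in one sentence.",
     "response": "Photosynthesis is the process by which plants convert sunlight into chemical energy."},
]

model.train()
for pair in demonstrations:
    # Concatenate prompt and response so the model learns to answer, not just continue text.
    text = pair["prompt"] + "\n" + pair["response"] + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    # Using the input ids as labels gives the standard next-token (causal LM) loss.
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```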
Reinforcement Learning from Human Feedback (RLHF)
Now we are going to introduce RLHF. Before looking at how it works, let us first see why it is needed.
Why RLHF?
After supervised fine-tuning, our model should provide appropriate responses to given prompts, right? Unfortunately, no! The model may still not answer every question we ask it, and it may still be unable to judge which answer is good and which is not, because it may have overfit the demonstration data. Let’s see what can happen when it overfits. While writing this article, I asked Bard this:
I had not provided any link, article, or text to summarize. But it just summarized something and gave it to me, which was unexpected.
Another problem that can arise is toxicity. Even when an answer is factually correct, it may not be ethically or morally acceptable. For example, look at the image below, which you may have seen before. When asked for websites to download movies from, the model first responds that this is not ethical, but with the next prompt it can easily be manipulated, as shown.
Ok, now go to ChatGPT and try the same example. Did it give you the same result?
Why don’t we get the same answer? Did they retrain the entire network? Probably not! It was most likely a bit of fine-tuning with RLHF. You can refer to this excellent gist for more reasons.
Reward Model
The first step in RLHF is to train the reward model. This model must take a query and a response as input and output a scalar value representing how good the response is. To teach the machine what a good answer looks like, could we simply ask annotators to score each response with a reward? If we did, different annotators would score answers differently, introducing bias, and the model might not learn to reward responses consistently. Instead, annotators rank the model’s responses, which greatly reduces bias in the annotations. The image below shows the chosen and rejected responses for a given query from Anthropic’s hh-rlhf dataset.
From this data, the model tries to distinguish between good and bad responses.
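In code, reward-model training with such ranked pairs typically uses a pairwise ranking loss, as in the sketch below. The base model, query, and the two responses are illustrative placeholders, not the actual ChatGPT reward model or data.

```python
# A minimal sketch of reward-model training with a pairwise ranking loss,
# in the spirit of preference datasets like Anthropic's hh-rlhf.
import torch
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
# num_labels=1 so the model outputs a single scalar "reward" per (query, response).
reward_model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=1
)
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)

query = "How do I learn Python quickly?"
chosen = "Start with the official tutorial and practice small projects daily."
rejected = "I don't know."

def score(query: str, response: str) -> torch.Tensor:
    # Encode the (query, response) pair and return the scalar reward.
    batch = tokenizer(query, response, return_tensors="pt", truncation=True)
    return reward_model(**batch).logits.squeeze(-1)

# Ranking loss: push the chosen response's score above the rejected one's.
loss = -F.logsigmoid(score(query, chosen) - score(query, rejected)).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```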
Fine-Tuning the LLM with the Reward Model Using RL
We now fine-tune the LLM with Proximal Policy Optimization (PPO). In this approach, the reward model scores the responses generated by the current fine-tuned model, and we compare the current language model against the initial (frozen) language model so that it does not drift too far from its original behavior while still producing neat, clean, and readable output. KL-divergence measures the difference between the two models, and this penalty is combined with the reward when refining the LLM.
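The sketch below shows only the KL-penalized reward computation for one already-sampled response; the PPO policy update itself is omitted. The models, the sample text, the beta coefficient, and the reward value are placeholders, not the actual ChatGPT setup.

```python
# A minimal sketch of the KL-penalized reward used during the RL step.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")      # model being fine-tuned
reference = AutoModelForCausalLM.from_pretrained("gpt2")   # frozen SFT model
reference.eval()

prompt_and_response = "What is RLHF?\nRLHF fine-tunes a model using human preference data."
ids = tokenizer(prompt_and_response, return_tensors="pt").input_ids

def token_logprobs(model, ids):
    # Log-probability each model assigns to every actual next token in the sequence.
    with torch.no_grad():
        logits = model(ids).logits[:, :-1, :]
    return torch.log_softmax(logits, dim=-1).gather(
        -1, ids[:, 1:].unsqueeze(-1)
    ).squeeze(-1)

logp_policy = token_logprobs(policy, ids)
logp_ref = token_logprobs(reference, ids)

# Per-token KL penalty keeps the policy close to the reference model.
kl_penalty = (logp_policy - logp_ref).sum()

beta = 0.02                          # KL coefficient (hypothetical value)
reward_from_rm = torch.tensor(1.3)   # placeholder score from the reward model
total_reward = reward_from_rm - beta * kl_penalty
print(float(total_reward))
```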
Model evaluation
The models were evaluated at the end of each step across different model sizes. You can see the methods and their corresponding scores in the images below:
In the figure above, we can compare the performance of LLMs at different stages and model sizes. As you can see, there is a significant improvement in scores after each training phase.
The human feedback in RLHF can also be replaced with AI feedback (RLAIF). This significantly reduces labeling costs and has the potential to perform even better than RLHF. Let’s discuss that in the next article.
Conclusion
In this article, we saw how conversational LLMs like ChatGPT are trained. We walked through the three stages of ChatGPT training and how reinforcement learning from human feedback helps the model improve its performance. We also saw the importance of each step, without which the LLM would be far less accurate.
Hope you enjoyed reading it. Feel free to leave comments below for any queries or feedback. Happy learning!
Frequently Asked Questions
Q. Where does ChatGPT get its data from?
A. ChatGPT gets its data from many websites like Reddit, StackOverflow, Wikipedia, books, ArXiv, GitHub, etc. It uses this data to learn patterns, grammar, and facts.
Q. How can you earn money with ChatGPT?
A. ChatGPT does not provide a direct way to earn money. However, individuals or organizations can use its capabilities to build revenue-generating applications or services, such as blogging, virtual assistants, customer support bots, or content generation tools.
Q. What is ChatGPT?
A. ChatGPT is a large language model optimized for dialogue. It takes prompts as input and returns a response. It is built on GPT 3.5 and uses Reinforcement Learning from Human Feedback (RLHF) as its core operating principles.
Q. How does ChatGPT work?
A. ChatGPT uses deep learning and reinforcement learning behind the scenes. It is developed in three phases: pre-training a large language model (GPT 3.5), supervised fine-tuning, and reinforcement learning from human feedback (RLHF).