As we have seen, more parameters do not equal better performance. For better performance, we need quality tokens (texts), but these are in short supply. How can we get them? Can artificial intelligence help us?
Why don’t we use ChatGPT to generate text?
If we humans don’t produce enough text, why not automate the process? Recent research shows that this approach is not optimal. Stanford Alpaca was trained on 52,000 examples derived from GPT-3, but only apparently achieved similar performance: the model learns the style of the target model, not its knowledge.
Why don’t we train for longer?
For PaLM, Gopher, and LLaMA (as well as other LLMs), the papers clearly state that the models are trained for only a few epochs (one, or very few). This is not a limitation of the transformer, since, for example, Vision Transformers (ViT) were trained for 300 epochs on ImageNet (1 million images), as shown in the table:
Because it costs a lot. In the LLaMA paper, the authors trained for only one epoch (and for two epochs on only part of the dataset). Even so, the authors report:
When training a 65B-parameter model, our code processes about 380 tokens/sec/GPU on 2048 A100 GPUs with 80 GB of RAM. This means that training over our dataset containing 1.4T tokens takes approximately 21 days. (source)
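As a quick sanity check of that figure, the three constants from the quote can be plugged into a few lines of Python; everything beyond those three numbers is plain arithmetic.

```python
# Back-of-the-envelope check of the reported LLaMA-65B training time.
# The three constants come from the quote above; the rest is arithmetic.

tokens_per_sec_per_gpu = 380        # reported throughput per GPU
num_gpus = 2048                     # A100 80GB GPUs
total_tokens = 1.4e12               # size of the training set in tokens

cluster_tokens_per_sec = tokens_per_sec_per_gpu * num_gpus
seconds_needed = total_tokens / cluster_tokens_per_sec
days_needed = seconds_needed / (60 * 60 * 24)

print(f"{days_needed:.1f} days")    # ~20.8 days, matching the reported ~21 days
```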
Training an LLM even for just a few epochs is very expensive. As calculated by Dmytro Nikolayev (Dimid), this would mean $4.0 million to train a model like META’s LLaMA on Google Cloud Platform.
So training for additional epochs multiplies an already enormous cost. Moreover, we don’t know whether this extra training is actually helpful: it hasn’t really been tested.
Recently, a team of researchers in Singapore studied what happens when we train an LLM for multiple epochs:
So far, we know that model performance depends not only on the number of parameters but also on the number of quality tokens used for training. However, these quality tokens are not infinite, and we are approaching the limit. If we can’t find enough quality tokens, and generating them with AI is not an option, what can we do?
Can we use the same training set and train longer?
There is a Latin saying that repetition is beneficial (“repetita iuvant”), to which over time someone added “but carried on, they cut” (“sed continuata secant”).
The same applies to neural networks: increasing the number of epochs improves the performance of the network (lower loss); at some point, however, while the loss on the training set keeps decreasing, the loss on the validation set starts to increase. The neural network has overfit: it has started to learn patterns that exist only in the training set and has lost its ability to generalize.
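As a reminder of what this looks like in practice, here is a minimal, framework-free sketch of the usual early-stopping heuristic: stop once validation loss has not improved for a few epochs, even if training loss keeps falling. The loss values below are synthetic, purely for illustration.

```python
# Classic early stopping: stop once validation loss stops improving,
# even though training loss keeps decreasing.

def should_stop(val_losses, patience=3):
    """Return True if validation loss has not improved for `patience` epochs."""
    if len(val_losses) <= patience:
        return False
    best_so_far = min(val_losses[:-patience])
    return all(loss >= best_so_far for loss in val_losses[-patience:])

# Synthetic example: training loss keeps falling, validation loss turns around.
train_losses = [2.1, 1.7, 1.4, 1.2, 1.05, 0.95, 0.88]
val_losses = [2.2, 1.9, 1.7, 1.65, 1.70, 1.78, 1.85]

for epoch in range(1, len(val_losses) + 1):
    if should_stop(val_losses[:epoch]):
        print(f"Overfitting detected: stop after epoch {epoch}")
        break
```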
Well, this has been studied extensively for small neural networks, but what about huge transformers?
The authors of this study used the T5 model (an encoder-decoder model) on the C4 dataset. They trained several versions of the model, increasing the number of parameters until the larger model outperformed the smaller one (indicating that the larger model received a sufficient number of tokens, in line with the Chinchilla law). They noted a linear relationship between the number of tokens needed and the size of the model (confirming what DeepMind found with Chinchilla).
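To make that linear relationship concrete, here is a tiny sketch using the commonly cited Chinchilla rule of thumb of roughly 20 training tokens per parameter. The 20x ratio is the widely quoted approximation from DeepMind’s Chinchilla work, not a figure taken from the study discussed here.

```python
# Chinchilla-style rule of thumb: compute-optimal training tokens grow
# roughly linearly with model size (~20 tokens per parameter is the
# commonly cited approximation; an assumption for illustration).

def optimal_tokens(num_params, tokens_per_param=20):
    return num_params * tokens_per_param

for params in (1e9, 7e9, 70e9):
    print(f"{params / 1e9:>4.0f}B params -> ~{optimal_tokens(params) / 1e12:.2f}T tokens")
# 70B parameters -> ~1.40T tokens, roughly the size of the LLaMA training set
```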
The C4 dataset is limited (it does not contain infinite tokens), so as the number of parameters grew, the authors found themselves in a situation of token scarcity. They therefore decided to simulate what happens when an LLM sees repeated data: they fixed a certain number of tokens so that the model would see them more than once during training (a minimal sketch of this setup follows the list below). This showed:
- Repeated tokens cause degraded performance.
- Larger models are more susceptible to overfitting under a token-crisis (so, despite consuming more computational resources, they end up with degraded performance).
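Here is a minimal sketch of that repeated-data setup: the total number of tokens the model consumes (the compute budget) stays fixed, while the number of unique tokens shrinks, so the model effectively trains for several epochs on the same data. All numbers and the fake token source are made up for illustration.

```python
# Simulating a "token crisis": fixed training budget, fewer unique tokens,
# so the same tokens are repeated across several effective epochs.

import random

total_budget = 1_000_000      # tokens the model will consume during training
unique_tokens = 250_000       # unique tokens actually available
corpus = [random.randint(0, 32_000) for _ in range(unique_tokens)]  # fake token ids

epochs = total_budget // unique_tokens
training_stream = corpus * epochs   # each unique token is seen `epochs` times

print(f"Each token is seen {epochs} times during training")
```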
Additionally, these models are used for downstream tasks. Often an LLM is trained on a large amount of text in an unsupervised manner and then fine-tuned on a smaller dataset for a downstream task, or it goes through a process called alignment (as in the case of ChatGPT).
When an LLM is trained on repeated data, performance degrades even after it is fine-tuned on other datasets. So downstream tasks are affected as well.
We have just seen that repeated tokens are detrimental to training. But why does this happen?
The authors decided to investigate by keeping the number of repeated tokens fixed and increasing the total number of tokens in the dataset. The results show that a larger dataset alleviates the multi-epoch degradation problem.
Last year, Galactica was published (a model that was supposed to help scientists but lasted only three days online). Beyond its spectacular debacle, its paper suggested that part of its results came from the quality of the data. According to the authors, data quality reduced the risk of overfitting:
We are able to train on it for multiple epochs without overfitting, where upstream and downstream performance improves with use of repeated tokens. (source)
For the Galactica authors, repeated tokens not only do not harm model training but actually improve performance.
In this new study, the authors use the Wikipedia dataset, which is considered higher quality than C4, and add repeated tokens. The results show a similar level of degradation, contradicting what the Galactica paper claims.
The authors also sought to investigate whether this effect is due to model scaling. When a model is scaled up, both the number of parameters and the computational cost increase. The authors decided to study these two factors separately (a toy sketch of both follows the list below):
- Mixture-of-Experts (MoE), because although it increases the number of parameters, it maintains a similar computational cost.
- ParamShare, on the other hand, reduces the number of parameters while keeping the same computational cost.
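The toy bookkeeping below illustrates the direction of each trade-off. The layer sizes, number of experts, and number of shared layers are made-up values, not taken from the paper; the only point is how parameters and per-token compute move in opposite directions in the two settings.

```python
# Toy parameter/compute bookkeeping for the two settings above (sizes are made up).

d_model, d_ff = 1024, 4096
ffn_params = 2 * d_model * d_ff        # one dense feed-forward block (biases ignored)

# Dense baseline: every token goes through the same FFN.
dense_params = ffn_params
dense_flops_per_token = ffn_params

# Mixture-of-Experts: 8 expert FFNs, each token routed to only 1 of them,
# so parameters grow ~8x while per-token compute stays roughly the same.
num_experts, active_experts = 8, 1
moe_params = num_experts * ffn_params
moe_flops_per_token = active_experts * ffn_params

# ParamShare-style weight sharing: the same FFN weights reused across 4 layers,
# so parameters shrink while per-token compute is unchanged.
shared_layers = 4
paramshare_params = ffn_params                            # one set of weights...
paramshare_flops_per_token = shared_layers * ffn_params   # ...applied 4 times

print("MoE params vs dense:", moe_params / dense_params)                        # ~8x parameters
print("MoE FLOPs/token vs dense:", moe_flops_per_token / dense_flops_per_token) # ~1x compute
print("ParamShare params vs 4 unshared layers:",
      paramshare_params / (shared_layers * ffn_params))                          # 0.25x parameters
```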
The results show that the model with fewer parameters is less affected by repeated tokens. In contrast, the MoE model (with its higher number of parameters) is more prone to overfitting. The result is interesting because MoE has been used successfully in many artificial intelligence models, so the authors suggest that while MoE is a useful technique when there is enough data, it can hurt performance when there are not enough tokens.
The authors also investigated whether the training objective affects performance degradation. In general, there are two training objectives:
- Next-token prediction (causal language modeling), where the model predicts the next token in the sequence.
- Masked language modeling, where some tokens are masked and the model must reconstruct them.
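The toy snippet below shows how inputs and targets differ between the two objectives on a fake token-id sequence. Real implementations work on tensors and use dedicated mask tokens; `MASK_ID` here is just a made-up placeholder for illustration.

```python
# Toy illustration of the two pretraining objectives on fake token ids.

import random

tokens = [11, 42, 7, 93, 15, 28]
MASK_ID = -1  # hypothetical placeholder id for a masked position

# 1) Next-token prediction (causal language modeling):
#    the target is simply the input shifted by one position.
clm_inputs, clm_targets = tokens[:-1], tokens[1:]

# 2) Masked language modeling: some positions are hidden and the model
#    must reconstruct the original tokens at exactly those positions.
masked_positions = sorted(random.sample(range(len(tokens)), k=2))
mlm_inputs = [MASK_ID if i in masked_positions else t for i, t in enumerate(tokens)]
mlm_targets = {i: tokens[i] for i in masked_positions}

print("causal LM:", clm_inputs, "->", clm_targets)
print("masked LM:", mlm_inputs, "->", mlm_targets)
```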
Recently, with PaLM 2, Google introduced UL2, which is a mix of these two training objectives. UL2 has been shown to speed up model training, although, interestingly, UL2 is more prone to overfitting and shows greater multi-epoch degradation.
The authors then explored how multi-epoch degradation might be mitigated. Since regularization techniques exist precisely to avoid overfitting, they tested whether these techniques have a beneficial effect here as well.
Dropout turns out to be one of the most effective methods for alleviating the problem. This is not surprising: it is one of the most effective regularization techniques, is easily parallelized, and is used by most models.
Moreover, the authors find that it works even better to start training without dropout and only add it later in training.
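The sketch below shows one way such a schedule could look: keep the dropout rate at zero for an initial stretch of training, then switch it on. The warmup threshold and the 0.1 rate are made-up values, not the paper’s actual schedule.

```python
# Minimal sketch of "add dropout only later in training" (values are assumptions).

import numpy as np

def dropout(x, p, rng):
    """Standard inverted dropout: zero each unit with probability p, rescale the rest."""
    if p == 0.0:
        return x
    keep = rng.random(x.shape) >= p
    return x * keep / (1.0 - p)

def dropout_rate(step, warmup_steps=10_000, p_late=0.1):
    """No dropout early in training, dropout p_late afterwards."""
    return 0.0 if step < warmup_steps else p_late

rng = np.random.default_rng(0)
activations = rng.standard_normal((4, 8))
print(dropout(activations, dropout_rate(step=1_000), rng))    # unchanged early on
print(dropout(activations, dropout_rate(step=50_000), rng))   # ~10% of units zeroed later
```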
On the other hand, the authors note that using dropout in some models, especially large ones, can cause a slight drop in performance. So while it helps against overfitting, it can lead to unexpected behavior in other contexts. So much so that GPT-3, PaLM, LLaMA, Chinchilla, and Gopher do not use dropout in their architectures.
As shown in the table below, the authors used what are now considered almost small models for their experiments. Even so, testing different hyperparameters during LLM design is expensive:
For example, in our specific scenario, training the T5-XL five times would require approximately $37,000 USD to rent Google Cloud TPUs. Given even larger models like PaLM and GPT-4 trained on even larger datasets, this cost becomes unmanageable (source)
Since in their experiments the Sparse MoE model approximates the behavior of the dense model (which is computationally more expensive), it can be used to find the best hyperparameters.
For example, the authors show that one can test different learning rates on the MoE model, and it exhibits the same behavior as the equivalent dense model. The authors can therefore tune hyperparameters on the MoE model and then train the dense model with the chosen parameters, saving cost:
Sweeping the MoE Large model cost about $10.6K on Google Cloud Platform. By contrast, training the Dense XL model only once took $7.4K. Therefore, the entire development process, including sweeping, amounted to a total cost of $18K USD, which is only 0.48 times the cost of directly tuning the Dense XL model (source)
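The arithmetic behind that comparison can be checked using only the dollar figures reported in the two quotes above; nothing else is assumed.

```python
# Quick check of the cost comparison, using only the reported figures.

dense_xl_single_run = 7.4                  # thousand USD, one Dense XL training run
dense_xl_sweep = 5 * dense_xl_single_run   # ~37K USD for five runs (as quoted earlier)
moe_sweep = 10.6                           # thousand USD, hyperparameter sweep on the MoE proxy

moe_based_workflow = moe_sweep + dense_xl_single_run   # sweep on MoE, then train dense once
print(f"MoE-based workflow: {moe_based_workflow:.1f}K USD")                      # ~18K USD
print(f"Fraction of a direct dense sweep: {moe_based_workflow / dense_xl_sweep:.2f}")
# ~0.49, in line with the ~0.48 reported in the quote
```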