Large language models (LLMs) such as GPT-3 and ChatGPT have revolutionized AI with their natural language understanding and content generation capabilities. But they are expensive to develop, which limits access and further research. Researchers estimate that training GPT-3 cost OpenAI about $5 million. Still, Microsoft recognized the potential, investing $1 billion in OpenAI in 2019 and a reported $10 billion in 2023.
LLMs are machine learning models trained on large amounts of text data for NLP applications. They are based on the Transformer architecture and use attention mechanisms for NLP tasks such as question answering, machine translation, and sentiment analysis.
The question arises: can these large models be trained more efficiently, reducing both computational cost and training time?
Several approaches, such as progressive neural networks, network morphism, intra-layer model parallelism, and knowledge inheritance, have been developed to reduce the computational cost of training neural networks. The novel LiGO (linear growth operator) approach discussed here sets a new benchmark, roughly halving the computational cost of training LLMs.
Before discussing these techniques, it is necessary to examine the factors contributing to the high cost of producing LLMs.
The cost of building large language models
The three main costs of developing LLMs are as follows:
1. Computing resources
Building LLMs requires massive computing resources to train on large datasets. These models have billions of parameters and must learn complex patterns from massive amounts of text data.
Investment in specialized hardware such as graphics processing units (GPUs) and tensor processing units (TPUs) is required to build and train LLMs to achieve state-of-the-art performance.
For example, GPT-3 was trained on a supercomputer Microsoft built for OpenAI with 10,000 enterprise-class NVIDIA V100 GPUs and 285,000 CPU cores.
2. Energy consumption
The intensive computing resources required to build LLMs result in significant power consumption. For example, GPT-3 took 14.8 days to train its 175 billion parameters on 10,000 V100 GPUs, which equates to roughly 3.55 million GPU hours. Such high levels of energy consumption also have significant environmental impacts.
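As a quick sanity check on those figures, the GPU-hour total follows directly from the numbers above. In the short calculation below, the ~300 W per-GPU power draw used for the energy estimate is an assumption for illustration, not a figure from the article.

```python
# Back-of-the-envelope check of the GPU-hour figure quoted above.
num_gpus = 10_000        # V100 GPUs used for GPT-3 training
training_days = 14.8     # reported training duration

gpu_hours = num_gpus * training_days * 24
print(f"GPU hours: {gpu_hours / 1e6:.2f} million")   # ~3.55 million

# Assumed per-GPU power draw (rough V100 board power), for illustration only.
assumed_watts_per_gpu = 300
energy_mwh = gpu_hours * assumed_watts_per_gpu / 1e6
print(f"Approximate GPU energy: {energy_mwh:,.0f} MWh")
```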
3. Data storage and management
LLMs are trained on large datasets. For example, GPT-3 was trained on a huge corpus of text data, including Common Crawl, WebText2, Books1, Books2, and Wikipedia, among other sources. A significant infrastructure investment is required to collect, curate, and store these datasets.
Cloud storage is also needed for the data itself, along with human expertise for preprocessing and version control. What's more, ensuring the data strategy complies with regulations such as GDPR adds further cost.
The LiGO technique: Cut the cost of building large language models in half
LiGO (Linear Growth Operator) is a new technique developed by MIT researchers to reduce the computational cost of training LLMs by 50%. The method involves initializing the weights of larger models from smaller pre-trained models, which ensures efficient scaling of neural networks.
Yoon Kim, senior author of the paper, says:
"It is estimated that training models at the scale ChatGPT runs on could cost millions of dollars for just a single training run. Can we improve the efficiency of these training methods so we still get good models, in less time and for less money? We propose to do this by leveraging smaller language models that have previously been trained."
This method retains the performance benefits of the larger model while reducing computational cost and training time compared with training it from scratch. LiGO uses a learned linear growth operator that combines width- and depth-expansion operators, as sketched below.
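To make the idea concrete, here is a minimal sketch in PyTorch of the width-growth step: a larger weight matrix is initialized as a learned linear map of a smaller pretrained one. The names (grow_width, A, B) are hypothetical, and this is not the authors' implementation, which additionally learns a depth operator and structures these maps more carefully.

```python
# Minimal sketch of width growth: initialize a large layer from a small one
# via learned linear expansion maps (illustrative only, not the LiGO code).
import torch
import torch.nn as nn

d_small, d_large = 256, 512

# Stand-in for a pretrained small-model weight matrix.
W_small = torch.randn(d_small, d_small)

# Learnable expansion maps mixing the small dimensions into the large ones.
A = nn.Parameter(torch.randn(d_large, d_small) * 0.02)
B = nn.Parameter(torch.randn(d_large, d_small) * 0.02)

def grow_width(w_small: torch.Tensor) -> torch.Tensor:
    # W_large = A @ W_small @ B^T: a linear map from the small parameter
    # space to the large one.
    return A @ w_small @ B.t()

W_large_init = grow_width(W_small)
print(W_large_init.shape)  # torch.Size([512, 512])

# In LiGO, maps like A and B (plus an analogous depth operator) are fit
# briefly so the grown model mimics the small one, and the larger model is
# then trained normally from this initialization.
```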
The paper's text experiments used several datasets, including the English Wikipedia corpus for training the BERT and RoBERTa models and the C4 dataset for training GPT2.
The scaling experiments included growing BERT-Small to BERT-Base, BERT-Base to BERT-Large, RoBERTa-Small to RoBERTa-Base, GPT2-Base to GPT2-Medium, and CaiT-XS to CaiT-S.
The researchers compared their approach to several other baselines, including training from scratch, progressive training, bert2BERT, and KI.
By reusing the BERT-Small model, the LiGO technique offered a 44.7% savings in FLOPs (floating-point operations) and a 40.7% savings in wall time compared with training BERT-Base from scratch. The LiGO growth operator also outperforms StackBERT, MSLT, bert2BERT, and KI in training efficiency.
Advantages of using training optimization techniques like LiGO
LiGO is an effective neural network training method that has the following advantages:
1. Faster training
As mentioned earlier, faster training is the main advantage of the LiGO technique. It trains LLMs in half the time, increasing productivity and reducing costs.
2. Resource efficient
LiGO is resource-efficient as it reduces wall time and FLOPs, leading to a more cost-effective and environmentally friendly approach to training large transformer models.
3. Generalization
The LiGO technique improved the performance of both language and vision transformers, suggesting that it is a generalizable technique that can be applied to a variety of tasks.
Building commercial AI products is only one part of the overall cost of AI systems; another major component comes from day-to-day operations. For example, ChatGPT reportedly costs OpenAI around $700,000 per day to answer user queries. Researchers are expected to continue exploring approaches that make LLMs cheaper to train and more affordable to run.
For more AI-related content, visit unite.ai.