A key feature of human intelligence is that people can learn to perform new tasks by reasoning from only a few examples. Scaling up language models has unlocked a range of new applications and paradigms in machine learning, including the ability to perform challenging reasoning tasks via in-context learning. Language models, however, are still sensitive to the way prompts are written, indicating that they do not reason robustly. For instance, language models often require heavy prompt engineering or phrasing of tasks as instructions, and they exhibit unexpected behaviors such as performance on tasks being unaffected even when shown incorrect labels.
In “Symbol tuning improves in-context learning in language models”, we propose a simple fine-tuning procedure that we call symbol tuning, which can improve in-context learning by emphasizing input-label mappings. We experiment with symbol tuning on Flan-PaLM models and observe benefits across various settings:
- Symbol tuning boosts performance on unseen in-context learning tasks and is much more robust to underspecified prompts, such as those without instructions or without natural language labels.
- Symbol-tuned models are much stronger at algorithmic reasoning tasks.
- Finally, symbol-tuned models show large improvements in following flipped labels presented in context, meaning that they are better able to use in-context information to override prior knowledge.
An overview of symbol tuning, where models are fine-tuned on tasks in which natural language labels are replaced by arbitrary symbols. Symbol tuning relies on the intuition that when instructions and relevant labels are not available, models must use in-context exemplars to learn the task.
Motivation
Instruction tuning is a common fine-tuning method that has been shown to improve performance and allow models to better follow in-context exemplars. One shortcoming, however, is that models are not forced to learn from the exemplars because the task is redundantly specified in the evaluation example via instructions and natural language labels. For example, in the figure above on the left, although the exemplars can help the model understand the task (sentiment analysis), they are not strictly necessary because the model could ignore the exemplars and simply read the instruction that states what the task is.
In symbol tuning, the model is fine-tuned on exemplars where the instructions are removed and natural language labels are replaced with semantically unrelated labels (e.g., “Foo”, “Bar”, etc.). In this setup, the task is unclear without looking at the in-context exemplars. For example, in the figure above on the right, multiple in-context exemplars would be needed to figure out the task. Because symbol tuning teaches the model to reason over the in-context exemplars, symbol-tuned models should perform better on tasks that require reasoning between in-context exemplars and their labels.
Datasets and task types used for symbol tuning.
Symbol tuning procedure
We selected 22 publicly available natural language processing (NLP) datasets to use for the symbol tuning procedure. These tasks have been widely used in the past, and we chose only classification-type tasks because our method requires discrete labels. We then remap the labels of each dataset to a random label from a set of ~30K arbitrary labels selected from three categories: integers, character combinations, and words.
For our experiments, we symbol-tune Flan-PaLM, the instruction-tuned variant of PaLM. We use three different sizes of Flan-PaLM models: Flan-PaLM-8B, Flan-PaLM-62B, and Flan-PaLM-540B. We also tested Flan-cont-PaLM-62B (Flan-PaLM-62B trained on 1.3T tokens instead of 780B tokens), which we abbreviate as 62B-c.
We use a set of ~300K arbitrary symbols from three categories (integers, character combinations, and words). ~30K symbols are used during tuning and the rest are held out for evaluation.
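To make the procedure above concrete, here is a minimal Python sketch of how a symbol pool could be built, split into tuning and held-out subsets, and used to remap a dataset’s natural language labels into an instruction-free prompt. The symbol categories, pool sizes, word list, prompt template, and example sentences are illustrative assumptions; the actual data pipeline used for symbol tuning may differ.

```python
import random
import string

def build_symbol_pool(n_per_category=100_000, seed=0):
    """Build arbitrary symbols from three categories: integers, character combinations, and words."""
    rng = random.Random(seed)
    integers = [str(rng.randint(0, 10**6)) for _ in range(n_per_category)]
    char_combos = [
        "".join(rng.choices(string.ascii_lowercase, k=rng.randint(3, 6)))
        for _ in range(n_per_category)
    ]
    # Placeholder word vocabulary; the real word list is not specified here.
    words = rng.choices(["foo", "bar", "baz", "apple", "orange", "delta"], k=n_per_category)
    return integers + char_combos + words

def remap_labels(examples, label_names, symbols, rng):
    """Replace each natural language label with a distinct, randomly chosen arbitrary symbol."""
    mapping = dict(zip(label_names, rng.sample(symbols, len(label_names))))
    return [(text, mapping[label]) for text, label in examples], mapping

def format_prompt(exemplars, eval_input):
    """Format in-context exemplars with no task instruction, only input/label pairs."""
    lines = [f"Input: {x}\nLabel: {y}" for x, y in exemplars]
    lines.append(f"Input: {eval_input}\nLabel:")
    return "\n\n".join(lines)

rng = random.Random(42)
pool = build_symbol_pool()
rng.shuffle(pool)
tuning_symbols, heldout_symbols = pool[:30_000], pool[30_000:]  # ~30K for tuning, rest held out

examples = [("A truly wonderful film.", "positive"), ("A waste of two hours.", "negative")]
remapped, mapping = remap_labels(examples, ["positive", "negative"], tuning_symbols, rng)
print(format_prompt(remapped, "An unforgettable performance."))
```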
Experimental setup
We want to evaluate a model’s ability to perform unseen tasks, so we cannot evaluate on tasks used in symbol tuning (22 datasets) or used during instruction tuning (1.8K tasks). Hence, we choose 11 NLP datasets that were not used during fine-tuning.
In-context learning
In the symbol tuning procedure, models must learn to reason with in-context exemplars in order to perform tasks successfully, because prompts are modified so that tasks cannot simply be learned from relevant labels or instructions. Symbol-tuned models should therefore perform better in settings where tasks are ambiguous and require reasoning between in-context exemplars and their labels. To explore these settings, we define four in-context learning settings that vary the amount of reasoning between inputs and labels needed to learn the task (based on the availability of instructions/relevant labels).
Depending on the availability of instructions and relevant natural language labels, models may need to do varying amounts of reasoning with in-context exemplars. When these features are not available, models must reason with the given in-context exemplars to successfully perform the task.
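The sketch below illustrates, under assumed prompt templates and example data, how the four settings could be constructed by toggling whether an instruction is included and whether labels are relevant natural language or arbitrary symbols. It only illustrates the idea and is not the exact evaluation code used in the paper.

```python
# Hypothetical illustration of the four in-context learning settings described above.
EXEMPLARS = [("A truly wonderful film.", "positive"), ("A waste of two hours.", "negative")]
INSTRUCTION = "Classify the sentiment of the sentence as positive or negative."
SYMBOL_MAP = {"positive": "foo", "negative": "bar"}  # arbitrary, semantically unrelated labels

def build_prompt(with_instruction: bool, relevant_labels: bool, eval_input: str) -> str:
    def label(y: str) -> str:
        return y if relevant_labels else SYMBOL_MAP[y]

    parts = [INSTRUCTION] if with_instruction else []
    parts += [f"Input: {x}\nLabel: {label(y)}" for x, y in EXEMPLARS]
    parts.append(f"Input: {eval_input}\nLabel:")
    return "\n\n".join(parts)

# The task is fully specified in the first setting and maximally ambiguous in the last.
for with_instruction in (True, False):
    for relevant_labels in (True, False):
        print(f"--- instruction={with_instruction}, relevant_labels={relevant_labels} ---")
        print(build_prompt(with_instruction, relevant_labels, "An unforgettable performance."))
```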
Symbol tuning improves performance across all settings for models 62B and larger, with small gains in settings with relevant natural language labels (+0.8% to +4.2%) and substantial gains in settings without relevant natural language labels (+5.5% to +15.5%). Strikingly, when relevant labels are unavailable, symbol-tuned Flan-PaLM-8B outperforms Flan-PaLM-62B, and symbol-tuned Flan-PaLM-62B outperforms Flan-PaLM-540B. This performance difference suggests that symbol tuning can allow much smaller models to perform these tasks as well as large models (effectively saving ~10x in inference compute).
Large-enough symbol-tuned models are better at in-context learning than baselines, especially in settings where relevant labels are not available. Performance is shown as the average model accuracy (%) across eleven tasks.
Algorithmic reasoning
We also experiment with algorithmic reasoning tasks from BIG-Bench. There are two main groups of tasks: 1) list functions, in which the model must identify a transformation function (e.g., removing the last element in a list) between input and output lists containing non-negative integers; and 2) simple Turing concepts, in which the model must reason with binary strings to learn the concept that maps an input to an output (e.g., swapping 0s and 1s in a string).
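For illustration, the toy Python functions below implement the two example transformations mentioned above (dropping the last list element and swapping 0s and 1s) and show how such a task appears to the model as plain input-output exemplars. These are assumed, simplified examples rather than the exact BIG-Bench task formats.

```python
# Toy reference versions of the two task families; illustrative target functions only,
# not the models' reasoning and not the exact BIG-Bench prompts.

def list_function_drop_last(xs):
    """Example list function: remove the last element of a list of non-negative integers."""
    return xs[:-1]

def turing_concept_swap_bits(s):
    """Example simple Turing concept: swap 0s and 1s in a binary string."""
    return s.translate(str.maketrans("01", "10"))

# As in-context exemplars, the model only sees input/output pairs and must infer the rule.
pairs = [([4, 0, 7, 2], list_function_drop_last([4, 0, 7, 2])),
         ([9, 1], list_function_drop_last([9, 1]))]
prompt = "\n".join(f"{x} -> {y}" for x, y in pairs) + "\n[3, 8, 5] ->"
print(prompt)
print(turing_concept_swap_bits("01101"))  # prints "10010"
```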
On the list function and simple Turing concept tasks, symbol tuning results in average performance improvements of 18.2% and 15.3%, respectively. Additionally, Flan-cont-PaLM-62B with symbol tuning outperforms Flan-PaLM-540B on the list function tasks on average, which is equivalent to a ~10x reduction in inference compute. These improvements suggest that symbol tuning strengthens the model’s ability to learn in context on unseen task types, since the symbol tuning data did not include any algorithmic content.
Symbol-tuned models achieve higher performance on list function tasks and simple Turing concept tasks. (A–E): categories of list function tasks. (F): simple Turing concepts task.
Flipped labels
In the flipped-label experiment, the labels of in-context and evaluation exemplars are flipped, meaning that prior knowledge and the input-label mappings disagree (e.g., sentences containing positive sentiment labeled as “negative sentiment”), which allows us to study whether models can override prior knowledge. Previous work has shown that while pre-trained models (without instruction tuning) can, to some extent, follow flipped labels presented in context, instruction tuning degrades this ability.
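The following minimal sketch, using assumed sentiment data and prompt formatting, shows how labels can be flipped so that the in-context mapping contradicts prior knowledge. It only illustrates the setup, not the paper’s exact evaluation code.

```python
# Hypothetical flipped-label setup: every label is inverted, so following the
# in-context mapping requires overriding prior knowledge about sentiment.
FLIP = {"positive": "negative", "negative": "positive"}

def flip_labels(examples):
    """Invert every label so the input-label mapping contradicts prior knowledge."""
    return [(text, FLIP[label]) for text, label in examples]

exemplars = [("A truly wonderful film.", "positive"), ("A waste of two hours.", "negative")]
flipped = flip_labels(exemplars)

prompt = "\n\n".join(f"Input: {x}\nLabel: {y}" for x, y in flipped)
prompt += "\n\nInput: An unforgettable performance.\nLabel:"
# A model that follows the flipped in-context mapping should output "negative" here,
# even though the evaluation sentence expresses positive sentiment.
print(prompt)
```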
We see a similar trend across all model sizes: symbol-tuned models are much better at following flipped labels than instruction-tuned models. We found that after symbol tuning, Flan-PaLM-8B sees an average improvement across all datasets of 26.5%, Flan-PaLM-62B sees an improvement of 33.7%, and Flan-PaLM-540B sees an improvement of 34.0%. Additionally, symbol-tuned models achieve average performance similar to or better than pre-training-only models.
Symbol-tuned models are much better at following flipped labels presented in context than instruction-tuned models.
Conclusion
We presented symbol tuning, a new method of tuning models on tasks where natural language labels are remapped to arbitrary symbols. Symbol tuning is based on the intuition that when a model cannot use instructions or relevant labels to determine a presented task, it must instead learn from in-context exemplars. We tuned four language models using our symbol tuning procedure, with a mixture of 22 datasets and approximately 30K arbitrary symbols as labels.
We first showed that symbol tuning improves performance on unseen in-context learning tasks, especially when prompts do not contain instructions or relevant labels. We also found that symbol-tuned models were much better at algorithmic reasoning tasks, despite the lack of numerical or algorithmic data in the symbol tuning procedure. Finally, in an in-context learning setting where inputs have flipped labels, symbol tuning (for some datasets) restores the ability to follow flipped labels that was lost during instruction tuning.
Future work
Through symbol tuning, we aim to increase the degree to which models can examine and learn from input-label mappings during in-context learning. We hope that our results encourage further work towards improving language models’ ability to reason over symbols presented in context.
Acknowledgments
The authors of this post are now part of Google DeepMind. This work was conducted by Jerry Wei, Le Hou, Andrew Lampinen, Xiangning Chen, Da Huang, Yi Tay, Xinyun Chen, Yifeng Lu, Denny Zhou, Tengyu Ma, and Quoc V. Le. We would like to thank our colleagues at Google Research and Google DeepMind for their advice and helpful discussions.