Paper: GPT-3 "Language Models are Few-Shot Learners" (Translation and Commentary)

OpenAI

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei

Recent years have featured a trend towards pre-trained language representations in NLP systems, applied in increasingly  flexible and task-agnostic ways for downstream transfer. First, single-layer representations were learned using word  vectors [MCCD13, PSM14] and fed to task-specific architectures, then RNNs with multiple layers of representations  and contextual state were used to form stronger representations [DL15, MBXS17, PNZtY18] (though still applied to  task-specific architectures), and more recently pre-trained recurrent or transformer language models [VSP+17] have  been directly fine-tuned, entirely removing the need for task-specific architectures [RNSS18, DCLT18, HR18].

This last paradigm has led to substantial progress on many challenging NLP tasks such as reading comprehension, question answering, textual entailment, and many others, and has continued to advance based on new architectures and algorithms [RSR+19, LOG+19, YDY+19, LCG+19]. However, a major limitation to this approach is that while the architecture is task-agnostic, there is still a need for task-specific datasets and task-specific fine-tuning: to achieve strong performance on a desired task typically requires fine-tuning on a dataset of thousands to hundreds of thousands of examples specific to that task. Removing this limitation would be desirable, for several reasons.

First, from a practical perspective, the need for a large dataset of labeled examples for every new task limits the applicability of language models. There exists a very wide range of possible useful language tasks, encompassing anything from correcting grammar, to generating examples of an abstract concept, to critiquing a short story. For many of these tasks it is difficult to collect a large supervised training dataset, especially when the process must be repeated for every new task.

One potential route to addressing these issues is meta-learning, which in the context of language models means the model develops a broad set of skills and pattern-recognition abilities at training time, and then uses those abilities at inference time to rapidly adapt to or recognize the desired task (see Figure 1.1). Recent work [RWC+19] attempts to do this via what we call "in-context learning", using the text input of a pretrained language model as a form of task specification: the model is conditioned on a natural language instruction and/or a few demonstrations of the task, and is then expected to complete further instances of the task simply by predicting what comes next. While it has shown some initial promise, this approach still achieves results far inferior to fine-tuning; for example [RWC+19] achieves only 4% on Natural Questions, and even its 55 F1 CoQA result is now more than 35 points behind the state of the art. Meta-learning clearly requires substantial improvement in order to be viable as a practical method of solving language tasks. Another recent trend in language modeling may offer a way forward. In recent years the capacity of transformer language models has increased substantially, from 100 million parameters [RNSS18], to 300 million parameters [DCLT18], to 1.5 billion parameters [RWC+19], to 8 billion parameters [SPP+19], 11 billion parameters [RSR+19], and finally 17 billion parameters [Tur20]. Each increase has brought improvements in text synthesis and/or downstream NLP tasks, and there is evidence that log loss, which correlates well with many downstream tasks, improves smoothly with scale [KMH+20]. Since in-context learning involves absorbing many skills and tasks within the parameters of the model, it is plausible that in-context learning abilities might show similarly strong gains with scale.

Figure 2.1 shows the four methods using the example of translating English to French. In this paper we focus on zero-shot, one-shot and few-shot, with the aim of comparing them not as competing alternatives, but as different problem settings which offer a varying trade-off between performance on specific benchmarks and sample efficiency. We especially highlight the few-shot results as many of them are only slightly behind state-of-the-art fine-tuned models. Ultimately, however, one-shot, or even sometimes zero-shot, seem like the fairest comparisons to human performance, and are important targets for future work.

Sections 2.1-2.3 below give details on our models, training data, and training process respectively. Section 2.4 discusses the details of how we do few-shot, one-shot, and zero-shot evaluations.

We use the same model and architecture as GPT-2 [RWC+19], including the modified initialization, pre-normalization, and reversible tokenization described therein, with the exception that we use alternating dense and locally banded sparse attention patterns in the layers of the transformer, similar to the Sparse Transformer [CGRS19]. To study the dependence of ML performance on model size, we train 8 different sizes of model, ranging over three orders of magnitude from 125 million parameters to 175 billion parameters, with the last being the model we call GPT-3. Previous work [KMH+20] suggests that with enough training data, scaling of validation loss should be approximately a smooth power law as a function of size; training models of many different sizes allows us to test this hypothesis both for validation loss and for downstream language tasks.

Datasets for language models have rapidly expanded, culminating in the Common Crawl dataset2 [RSR+19] constituting nearly a trillion words. This size of dataset is sufficient to train our largest models without ever updating on the same sequence twice. However, we have found that unfiltered or lightly filtered versions of Common Crawl tend to have lower quality than more curated datasets. Therefore, we took 3 steps to improve the average quality of our datasets: (1) we downloaded and filtered a version of CommonCrawl based on similarity to a range of high-quality reference corpora, (2) we performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of our held-out validation set as an accurate measure of overfitting, and (3) we also added known high-quality reference corpora to the training mix to augment CommonCrawl and increase its diversity.

Details of the first two points (processing of Common Crawl) are described in Appendix A. For the third, we added several curated high-quality datasets, including an expanded version of the WebText dataset [RWC+19], collected by scraping links over a longer period of time, and first described in [KMH+20], two internet-based books corpora (Books1 and Books2) and English-language Wikipedia.

For few-shot learning, we evaluate each example in the evaluation set by randomly drawing K examples from that task's training set as conditioning, delimited by 1 or 2 newlines depending on the task. For LAMBADA and Storycloze there is no supervised training set available, so we draw conditioning examples from the development set and evaluate on the test set. For Winograd (the original, not the SuperGLUE version) there is only one dataset, so we draw conditioning examples directly from it.

K can be any value from 0 to the maximum amount allowed by the model’s context window, which is nctx = 2048 for all models and typically fits 10 to 100 examples. Larger values of K are usually but not always better, so when a separate development and test set are available, we experiment with a few values of K on the development set and then run the best value on the test set. For some tasks (see Appendix G) we also use a natural language prompt in addition to (or for K = 0, instead of) demonstrations.

On tasks that involve choosing one correct completion from several options (multiple choice), we provide K examples of context plus correct completion, followed by one example of context only, and compare the LM likelihood of each completion. For most tasks we compare the per-token likelihood (to normalize for length), however on a small number of datasets (ARC, OpenBookQA, and RACE) we gain additional benefit as measured on the development set by normalizing by the unconditional probability of each completion, computing P(completion|context) / P(completion|answer context), where answer context is the string "Answer: " or "A: " and is used to prompt that the completion should be an answer but is otherwise generic.
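As an illustration of the protocol above, the sketch below shows one way the K conditioning examples could be assembled into a prompt and how candidate completions could be compared. This is our own minimal reconstruction, not the authors' evaluation code; `lm_logprob` is a hypothetical helper that returns the log-probability a language model assigns to a continuation given a context.

```python
# Minimal sketch of the few-shot evaluation protocol described above.
import random

def build_prompt(train_examples, test_context, k, sep="\n\n"):
    """Draw K demonstrations from the training set and prepend them to the test context."""
    demos = random.sample(train_examples, k)
    return sep.join(demos + [test_context])

def score_choices(prompt, choices, lm_logprob, answer_context=None):
    """Pick the completion whose (normalized) log-likelihood under the LM is highest.

    If answer_context (e.g. "Answer: ") is given, normalize by the unconditional
    probability of each completion, i.e. compare
    P(completion | context) / P(completion | answer context);
    otherwise apply a crude per-token normalization for length.
    """
    scores = []
    for choice in choices:
        lp = lm_logprob(prompt, choice)
        if answer_context is not None:
            lp -= lm_logprob(answer_context, choice)
        else:
            lp /= max(len(choice.split()), 1)
        scores.append(lp)
    return choices[scores.index(max(scores))]
```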

On tasks that involve binary classification, we give the options more semantically meaningful names (e.g. “True” or “False” rather than 0 or 1) and then treat the task like multiple choice; we also sometimes frame the task similar to what is done by [RSR+19] (see Appendix G for details).

On tasks with free-form completion, we use beam search with the same parameters as [RSR+19]: a beam width of 4 and a length penalty of α = 0.6. We score the model using F1 similarity score, BLEU, or exact match, depending on what is standard for the dataset at hand.
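The F1 similarity and exact-match metrics mentioned here are the standard SQuAD-style scores; a simplified sketch (omitting the usual punctuation and article stripping) might look as follows:

```python
# Sketch of SQuAD-style exact-match and token-level F1 scoring for free-form answers.
from collections import Counter

def _normalize(text):
    return " ".join(text.lower().strip().split())

def exact_match(prediction, reference):
    return float(_normalize(prediction) == _normalize(reference))

def f1_score(prediction, reference):
    pred_tokens = _normalize(prediction).split()
    ref_tokens = _normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```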

Final results are reported on the test set when publicly available, for each model size and learning setting (zero-, one-, and few-shot). When the test set is private, our model is often too large to fit on the test server, so we report results on the development set. We do submit to the test server on a small number of datasets (SuperGLUE, TriviaQA, PiQa) where we were able to make submission work, and we submit only the 175B few-shot results, and report development set results for everything else.

In Figure 3.1 we display training curves for the 8 models described in Section 2. For this graph we also include 6  additional extra-small models with as few as 100,000 parameters. As observed in [KMH+20], language modeling  performance follows a power-law when making efficient use of training compute. After extending this trend by two  more orders of magnitude, we observe only a slight (if any) departure from the power-law. One might worry that these  improvements in cross-entropy loss come only from modeling spurious details of our training corpus. However, we will  see in the following sections that improvements in cross-entropy loss lead to consistent performance gains across a  broad spectrum of natural language tasks.

Below, we evaluate the 8 models described in Section 2 (the 175 billion parameter GPT-3 and 7 smaller models) on a wide range of datasets. We group the datasets into 9 categories representing roughly similar tasks.

The LAMBADA dataset [PKL+16] tests the modeling of long-range dependencies in text – the model is asked to predict the last word of sentences which require reading a paragraph of context. It has recently been suggested that the continued scaling of language models is yielding diminishing returns on this difficult benchmark. [BHT+20] reflect on the small 1.5% improvement achieved by a doubling of model size between two recent state of the art results ([SPP+19] and [Tur20]) and argue that “continuing to expand hardware and data sizes by orders of magnitude is not the path forward”. We find that path is still promising and in a zero-shot setting GPT-3 achieves 76% on LAMBADA, a gain of 8% over the previous state of the art.

LAMBADA is also a demonstration of the flexibility of few-shot learning as it provides a way to address a problem that  classically occurs with this dataset. Although the completion in LAMBADA is always the last word in a sentence, a  standard language model has no way of knowing this detail. It thus assigns probability not only to the correct ending but  also to other valid continuations of the paragraph. This problem has been partially addressed in the past with stop-word  filters [RWC+19] (which ban “continuation” words). The few-shot setting instead allows us to “frame” the task as a  cloze-test and allows the language model to infer from examples that a completion of exactly one word is desired. We  use the following fill-in-the-blank format:
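For illustration, a few-shot prompt in this fill-in-the-blank style could be assembled as below; the demonstration sentences are stand-ins of our own, not necessarily the exact examples shown in the paper's figure.

```python
# Illustrative few-shot cloze prompt in the fill-in-the-blank format described above.
demonstrations = [
    "Alice was friends with Bob. Alice went to visit her friend ____. -> Bob",
    "George bought some baseball equipment, a ball, a glove, and a ____. -> bat",
]
test_passage = "<paragraph of context ending just before the missing final word>"
prompt = "\n".join(demonstrations) + "\n" + test_passage + " ____. ->"
print(prompt)
```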

When presented with examples formatted this way, GPT-3 achieves 86.4% accuracy in the few-shot setting, an increase  of over 18% from the previous state-of-the-art. We observe that few-shot performance improves strongly with model  size. While this setting decreases the performance of the smallest model by almost 20%, for GPT-3 it improves accuracy  by 10%. Finally, the fill-in-blank method is not effective one-shot, where it always performs worse than the zero-shot  setting. Perhaps this is because all models still require several examples to recognize the pattern.

One note of caution is that an analysis of test set contamination identified that a significant minority of the LAMBADA dataset appears to be present in our training data – however analysis performed in Section 4 suggests negligible impact on performance.

In this section we measure GPT-3's ability to answer questions about broad factual knowledge. Because of the immense number of possible queries, this task has normally been approached by using an information retrieval system to find relevant text, in combination with a model that learns to generate an answer given the question and the retrieved text. Since this setting allows the system to search for and condition on text that potentially contains the answer, it is denoted "open-book". [RRS20] recently demonstrated that a large language model can perform surprisingly well answering questions directly, without conditioning on auxiliary information; they denote this more restrictive evaluation setting "closed-book". Their work suggests that higher-capacity models could perform even better, and we test this hypothesis with GPT-3. We evaluate GPT-3 on the 3 datasets in [RRS20]: Natural Questions [KPR+19], WebQuestions [BCFL13], and TriviaQA [JCWZ17], using the same splits. Note that in addition to all results being in the closed-book setting, our use of few-shot, one-shot, and zero-shot evaluations represents an even stricter setting than previous closed-book QA work: in addition to external content not being allowed, fine-tuning on the Q&A dataset itself is also not permitted.

The results for GPT-3 are shown in Table 3.3. On TriviaQA, we achieve 64.3% in the zero-shot setting, 68.0% in the one-shot setting, and 71.2% in the few-shot setting. The zero-shot result already outperforms the fine-tuned T5-11B by 14.2%, and also outperforms a version with Q&A-tailored span prediction during pre-training by 3.8%. The one-shot result improves by 3.7% and matches the SOTA for an open-domain QA system which not only fine-tunes but also makes use of a learned retrieval mechanism over a 15.3B parameter dense vector index of 21M documents [LPP+20]. GPT-3's few-shot result further improves performance by another 3.2% beyond this.

On WebQuestions (WebQs), GPT-3 achieves 14.4% in the zero-shot setting, 25.3% in the one-shot setting, and 41.5% in the few-shot setting. This compares to 37.4% for fine-tuned T5-11B, and 44.7% for fine-tuned T5-11B+SSM, which uses a Q&A-specific pre-training procedure. GPT-3 in the few-shot setting approaches the performance of state-of-the-art fine-tuned models. Notably, compared to TriviaQA, WebQs shows a much larger gain from zero-shot to few-shot (and indeed its zero-shot and one-shot performance are poor), perhaps suggesting that the WebQs questions and/or the style of their answers are out-of-distribution for GPT-3. Nevertheless, GPT-3 appears able to adapt to this distribution, recovering strong performance in the few-shot setting.

On Natural Questions (NQs) GPT-3 achieves 14.6% in the zero-shot setting, 23.0% in the one-shot setting, and 29.9% in  the few-shot setting, compared to 36.6% for fine-tuned T5 11B+SSM. Similar to WebQS, the large gain from zero-shot  to few-shot may suggest a distribution shift, and may also explain the less competitive performance compared to  TriviaQA and WebQS. In particular, the questions in NQs tend towards very fine-grained knowledge on Wikipedia  specifically which could be testing the limits of GPT-3’s capacity and broad pretraining distribution.

Overall, on one of the three datasets GPT-3's one-shot matches the open-domain fine-tuning SOTA. On the other two datasets it approaches the performance of the closed-book SOTA despite not using fine-tuning. On all 3 datasets, we find that performance scales very smoothly with model size (Figure 3.3 and Appendix H Figure H.7), possibly reflecting the idea that model capacity translates directly to more "knowledge" absorbed in the parameters of the model.

The Winograd Schemas Challenge [LDM12] is a classical task in NLP that involves determining which word a pronoun refers to when the pronoun is grammatically ambiguous but semantically unambiguous to a human. Recently, fine-tuned language models have achieved near-human performance on the original Winograd dataset, but more difficult versions such as the adversarially-mined Winogrande dataset [SBBC19] still significantly lag human performance. We test GPT-3's performance on both Winograd and Winogrande, as usual in the zero-shot, one-shot, and few-shot settings.

On Winograd we test GPT-3 on the original set of 273 Winograd schemas, using the same "partial evaluation" method described in [RWC+19]. Note that this setting differs slightly from the WSC task in the SuperGLUE benchmark, which is presented as binary classification and requires entity extraction to convert to the form described in this section. On Winograd, GPT-3 achieves 88.3%, 89.7%, and 88.6% in the zero-shot, one-shot, and few-shot settings, showing no clear in-context learning but in all cases achieving strong results just a few points below state-of-the-art and estimated human performance. We note that contamination analysis found some Winograd schemas in the training data, but this appears to have only a small effect on results (see Section 4).

ARC [CCE+18] is a dataset of multiple-choice questions collected from 3rd to 9th grade science exams. On the  “Challenge” version of the dataset which has been filtered to questions which simple statistical or information retrieval  methods are unable to correctly answer, GPT-3 achieves 51.4% accuracy in the zero-shot setting, 53.2% in the one-shot  setting, and 51.5% in the few-shot setting. This is approaching the performance of a fine-tuned RoBERTa baseline  (55.9%) from UnifiedQA [KKS+20]. On the “Easy” version of the dataset (questions which either of the mentioned  baseline approaches answered correctly), GPT-3 achieves 68.8%, 71.2%, and 70.1% which slightly exceeds a fine-tuned  RoBERTa baseline from [KKS+20]. However, both of these results are still much worse than the overall SOTAs  achieved by the UnifiedQA which exceeds GPT-3’s few-shot results by 27% on the challenge set and 22% on the easy  set.

On OpenBookQA [MCKS18], GPT-3 improves significantly from zero to few shot settings but is still over 20 points  short of the overall SOTA. GPT-3’s few-shot performance is similar to a fine-tuned BERT Large baseline on the  leaderboard.   Overall, in-context learning with GPT-3 shows mixed results on commonsense reasoning tasks, with only small and  inconsistent gains observed in the one and few-shot learning settings for both PIQA and ARC, but a significant  improvement is observed on OpenBookQA. GPT-3 sets SOTA on the new PIQA dataset in all evaluation settings.

Next we evaluate GPT-3 on the task of reading comprehension. We use a suite of 5 datasets including abstractive,  multiple choice, and span based answer formats in both dialog and single question settings. We observe a wide spread  in GPT-3’s performance across these datasets suggestive of varying capability with different answer formats. In general  we observe GPT-3 is on par with initial baselines and early results trained using contextual representations on each  respective dataset.

GPT-3 performs best (within 3 points of the human baseline) on CoQA [RCM19], a free-form conversational dataset, and worst (13 F1 below an ELMo baseline) on QuAC [CHI+18], a dataset which requires modeling structured dialog acts and answer span selection in teacher-student interactions. On DROP [DWD+19], a dataset testing discrete reasoning and numeracy in the context of reading comprehension, GPT-3 in the few-shot setting outperforms the fine-tuned BERT baseline from the original paper but is still well below both human performance and state-of-the-art approaches which augment neural networks with symbolic systems [RLL+19]. On SQuAD 2.0 [RJL18], GPT-3 demonstrates its few-shot learning capabilities, improving by almost 10 F1 (to 69.8) compared to the zero-shot setting; this allows it to slightly outperform the best fine-tuned result in the original paper. On RACE [LXL+17], a multiple-choice dataset of middle and high school English examinations, GPT-3 performs relatively weakly, is only competitive with the earliest work utilizing contextual representations, and is still 45% behind SOTA.

In order to better aggregate results on NLP tasks and compare to popular models such as BERT and RoBERTa in a  more systematic way, we also evaluate GPT-3 on a standardized collection of datasets, the SuperGLUE benchmark  [WPN+19] [WPN+19] [CLC+19] [DMST19] [RBG11] [KCR+18] [ZLL+18] [DGM06] [BHDD+06] [GMDD07]  [BDD+09] [PCC18] [PHR+18]. GPT-3’s test-set performance on the SuperGLUE dataset is shown in Table 3.8. In the  few-shot setting, we used 32 examples for all tasks, sampled randomly from the training set. For all tasks except WSC and MultiRC, we sampled a new set of examples to use in the context for each problem. For WSC and MultiRC, we  used the same set of randomly drawn examples from the training set as context for all of the problems we evaluated.   We observe a wide range in GPT-3’s performance across tasks. On COPA and ReCoRD GPT-3 achieves near-SOTA  performance in the one-shot and few-shot settings, with COPA falling only a couple points short and achieving  second place on the leaderboard, where first place is held by a fine-tuned 11 billion parameter model (T5). On WSC,  performance is still relatively strong, achieving 80.1% in the few-shot setting (note that GPT-3 achieves 88.6% on the  original Winograd dataset as described in Section 3.4). On BoolQ, MultiRC, and RTE, performance is reasonable,  roughly matching that of a fine-tuned BERT-Large. On CB, we see signs of life at 75.6% in the few-shot setting.

WiC is a notable weak spot with few-shot performance at 49.4% (at random chance). We tried a number of different  phrasings and formulations for WiC (which involves determining if a word is being used with the same meaning in two  sentences), none of which was able to achieve strong performance. This hints at a phenomenon that will become clearer  in the next section (which discusses the ANLI benchmark) – GPT-3 appears to be weak in the few-shot or one-shot  setting at some tasks that involve comparing two sentences or snippets, for example whether a word is used the same  way in two sentences (WiC), whether one sentence is a paraphrase of another, or whether one sentence implies another.  This could also explain the comparatively low scores for RTE and CB, which also follow this format. Despite these  weaknesses, GPT-3 still outperforms a fine-tuned BERT-large on four of eight tasks and on two tasks GPT-3 is close to  the state-of-the-art held by a fine-tuned 11 billion parameter model.

Finally, we note that the few-shot SuperGLUE score steadily improves with both model size and with number of  examples in the context showing increasing benefits from in-context learning (Figure 3.8). We scale K up to 32  examples per task, after which point additional examples will not reliably fit into our context. When sweeping over  values of K, we find that GPT-3 requires less than eight total examples per task to outperform a fine-tuned BERT-Large  on overall SuperGLUE score.

Natural Language Inference (NLI) [Fyo00] concerns the ability to understand the relationship between two sentences. In practice, this task is usually structured as a two- or three-class classification problem where the model classifies whether the second sentence logically follows from the first, contradicts the first, or is possibly true (neutral). SuperGLUE includes an NLI dataset, RTE, which evaluates the binary version of the task. On RTE, only the largest version of GPT-3 performs convincingly better than random (56%) in any evaluation setting, but in the few-shot setting GPT-3 performs similarly to a single-task fine-tuned BERT Large. We also evaluate on the recently introduced Adversarial Natural Language Inference (ANLI) dataset [NWD+19]. ANLI is a difficult dataset employing a series of adversarially-mined natural language inference questions in three rounds (R1, R2, and R3). Similar to RTE, all of our models smaller than GPT-3 perform at almost exactly random chance on ANLI, even in the few-shot setting (~33%), whereas GPT-3 itself shows signs of life on Round 3. Results for ANLI R3 are highlighted in Figure 3.9, and full results for all rounds can be found in Appendix H. These results on both RTE and ANLI suggest that NLI is still a very difficult task for language models, and that they are only just beginning to show signs of progress.

One way to probe GPT-3's range of abilities in the few-shot (and zero- and one-shot) settings is to give it tasks that require it to perform simple on-the-fly computational reasoning, to recognize a novel pattern that is unlikely to have occurred in training, or to adapt quickly to an unusual task. We devise several tasks to test this class of abilities. First, we test GPT-3's ability to perform arithmetic. Second, we create several tasks that involve rearranging or unscrambling the letters in a word, tasks which are unlikely to have been seen exactly during training. Third, we test GPT-3's ability to solve SAT-style analogy problems. Finally, we test GPT-3 on several qualitative tasks, including using new words in a sentence, correcting English grammar, and generating news articles. We will release the synthetic datasets in the hope of stimulating further study of the test-time behavior of language models.

To test GPT-3’s ability to perform simple arithmetic operations without task-specific training, we developed a small  battery of 10 tests that involve asking GPT-3 a simple arithmetic problem in natural language:

In all 10 tasks the model must generate the correct answer exactly. For each task we generate a dataset of 2,000 random  instances of the task and evaluate all models on those instances.  First we evaluate GPT-3 in the few-shot setting, for which results are shown in Figure 3.10. On addition and subtraction,  GPT-3 displays strong proficiency when the number of digits is small, achieving 100% accuracy on 2 digit addition,  98.9% at 2 digit subtraction, 80.2% at 3 digit addition, and 94.2% at 3-digit subtraction. Performance decreases as the  number of digits increases, but GPT-3 still achieves 25-26% accuracy on four digit operations and 9-10% accuracy on  five digit operations, suggesting at least some capacity to generalize to larger numbers of digits. GPT-3 also achieves  29.2% accuracy at 2 digit multiplication, an especially computationally intensive operation. Finally, GPT-3 achieves  21.3% accuracy at single digit combined operations (for example, 9*(7+5)), suggesting that it has some robustness  beyond just single operations.
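As a rough sketch of how such a battery can be built, the snippet below generates 2,000 random 2-digit addition problems and scores exact-match accuracy. The question phrasing and the `ask_model` generation call are assumptions for illustration, not the paper's exact templates or API.

```python
# Sketch: generate 2,000 random 2-digit addition problems and score exact-match accuracy.
import random

def make_two_digit_addition(n=2000, seed=0):
    """Generate n natural-language 2-digit addition problems with their answers."""
    rng = random.Random(seed)
    return [
        (f"Q: What is {a} plus {b}?\nA:", str(a + b))
        for a, b in ((rng.randint(10, 99), rng.randint(10, 99)) for _ in range(n))
    ]

def exact_match_accuracy(problems, ask_model):
    """ask_model(prompt) is a hypothetical call returning the model's text completion."""
    correct = sum(ask_model(question).strip() == answer for question, answer in problems)
    return correct / len(problems)
```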

As Figure 3.10 makes clear, small models do poorly on all of these tasks – even the 13 billion parameter model (the  second largest after the 175 billion full GPT-3) can solve 2 digit addition and subtraction only half the time, and all  other operations less than 10% of the time.

One-shot and zero-shot performance are somewhat degraded relative to few-shot performance, suggesting that adaptation  to the task (or at the very least recognition of the task) is important to performing these computations correctly.  Nevertheless, one-shot performance is still quite strong, and even zero-shot performance of the full GPT-3 significantly outperforms few-shot learning for all smaller models. All three settings for the full GPT-3 are shown in Table 3.9, and  model capacity scaling for all three settings is shown in Appendix H.

To spot-check whether the model is simply memorizing specific arithmetic problems, we took the 3-digit arithmetic  problems in our test set and searched for them in our training data in both the forms "<NUM1> + <NUM2> =" and  "<NUM1> plus <NUM2>". Out of 2,000 addition problems we found only 17 matches (0.8%) and out of 2,000  subtraction problems we found only 2 matches (0.1%), suggesting that only a trivial fraction of the correct answers  could have been memorized. In addition, inspection of incorrect answers reveals that the model often makes mistakes  such as not carrying a “1”, suggesting it is actually attempting to perform the relevant computation rather than  memorizing a table.   Overall, GPT-3 displays reasonable proficiency at moderately complex arithmetic in few-shot, one-shot, and even  zero-shot settings.
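The spot-check described above amounts to a simple string search over the training corpus; a minimal sketch, assuming the training text is available as an iterable of documents and each problem is an operand pair:

```python
# Sketch: count test-set arithmetic problems that appear verbatim in the training data,
# in either the symbolic or the natural-language form described above.
def count_memorized(problems, training_docs):
    """problems is a list of (a, b) operand pairs; training_docs yields raw text."""
    docs = list(training_docs)
    hits = 0
    for a, b in problems:
        forms = (f"{a} + {b} =", f"{a} plus {b}")
        if any(form in doc for doc in docs for form in forms):
            hits += 1
    return hits
```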

To test GPT-3’s ability to learn novel symbolic manipulations from a few examples, we designed a small battery of 5 “character manipulation” tasks. Each task involves giving the model a word distorted by some combination of scrambling, addition, or deletion of characters, and asking it to recover the original word. The 5 tasks are cycle letters in word (CL), anagrams of all but the first and last characters (A1), anagrams of all but the first and last two characters (A2), random insertion in word (RI), and reversed words (RW).
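A minimal sketch of how the five transformations could be instantiated for a single word (our own reconstruction for illustration, not the authors' data-generation code):

```python
# Sketch of the five character-manipulation transforms applied to a single word.
import random

def cycle_letters(word, rng):
    """CL: rotate the letters of the word by a random offset."""
    k = rng.randrange(1, len(word))
    return word[k:] + word[:k]

def anagram_inner(word, rng, keep=1):
    """A1 (keep=1) / A2 (keep=2): shuffle all letters except the first/last `keep`."""
    inner = list(word[keep:-keep])
    rng.shuffle(inner)
    return word[:keep] + "".join(inner) + word[-keep:]

def random_insertion(word, rng):
    """RI: insert a random punctuation or space character between each pair of letters."""
    sep_chars = " .,!*"
    return word[0] + "".join(rng.choice(sep_chars) + c for c in word[1:])

def reversed_word(word, rng):
    """RW: spell the word backwards."""
    return word[::-1]

rng = random.Random(0)
print(cycle_letters("inevitably", rng), anagram_inner("inevitably", rng, keep=2))
```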

We can further quantify performance by plotting “in-context learning curves”, which show task performance as a  function of the number of in-context examples. We show in-context learning curves for the Symbol Insertion task  in Figure 1.2. We can see that larger models are able to make increasingly effective use of in-context information,  including both task examples and natural language task descriptions.

Finally, it is worth adding that solving these tasks requires character-level manipulations, whereas our BPE encoding  operates on significant fractions of a word (on average ∼ 0.7 words per token), so from the LM’s perspective succeeding  at these tasks involves not just manipulating BPE tokens but understanding and pulling apart their substructure. Also,  CL, A1, and A2 are not bijective (that is, the unscrambled word is not a deterministic function of the scrambled word),  requiring the model to perform some search to find the correct unscrambling. Thus, the skills involved appear to require  non-trivial pattern-matching and computation.

Previous work on generative language models qualitatively tested their ability to generate synthetic "news articles" by conditionally sampling from the model given a human-written prompt consisting of a plausible first sentence for a news story. Relative to [RWC+19], the dataset used to train GPT-3 is much less weighted towards news articles, so trying to generate news articles via raw unconditional samples is less effective; for example, GPT-3 often interprets the proposed first sentence of a "news article" as a tweet and then posts synthetic responses or follow-up tweets. To solve this problem, we employed GPT-3's few-shot learning abilities by providing three previous news articles in the model's context to condition it. Given the title and subtitle of a proposed next article, the model is able to reliably generate short articles in the "news" genre.

To gauge the quality of news-article generation by GPT-3 (which we believe is likely to be correlated with conditional sample generation quality in general), we decided to measure human ability to distinguish GPT-3-generated articles from real ones. Similar work has been carried out by Kreps et al. [KMB20] and Zellers et al. [ZHR+19]. Generative language models are trained to match the distribution of content generated by humans, so the (in)ability of humans to distinguish the two is a potentially important measure of quality.

The articles we selected were not in the models' training data, and the model outputs were formatted and selected programmatically to prevent human cherry-picking. All models used the same context to condition outputs on and were pre-trained with the same context size, and the same article titles and subtitles were used as prompts for each model. However, we also ran an experiment to control for participant effort and attention that followed the same format but involved intentionally bad model-generated articles. This was done by generating articles from a "control model": a 160M parameter model with no context and increased output randomness.

Mean human accuracy (the ratio of correct assignments to non-neutral assignments per participant) at detecting that the intentionally bad articles were model generated was ∼86%, where 50% is chance level performance. By contrast, mean human accuracy at detecting articles that were produced by the 175B parameter model was barely above chance at ∼52% (see Table 3.11). Human abilities to detect model-generated text appear to decrease as model size increases: there appears to be a trend towards chance accuracy with model size, and human detection of GPT-3 is close to chance. This is true despite the fact that participants spend more time on each output as model size increases (see Appendix E).

Examples of synthetic articles from GPT-3 are given in Figures 3.14 and 3.15. Much of the text is, as indicated by the evaluations, difficult for humans to distinguish from authentic human content. Factual inaccuracies can be an indicator that an article is model generated since, unlike human authors, the models have no access to the specific facts that the article titles refer to or when the article was written. Other indicators include repetition, non sequiturs, and unusual phrasings, though these are often subtle enough that they are not noticed.

Related work on language model detection by Ippolito et al. [IDCBE19] indicates that automatic discriminators like GROVER [ZHR+19] and GLTR [GSR19] may have greater success at detecting model-generated text than human evaluators. Automatic detection of these models may be a promising area of future research.

Ippolito et al. [IDCBE19] also note that human accuracy at detecting model generated text increases as humans observe  more tokens. To do a preliminary investigation of how good humans are at detecting longer news articles generated  by GPT-3 175B, we selected 12 world news articles from Reuters with an average length of 569 words and generated  completions of these articles from GPT-3 with an average length of 498 words (298 words longer than our initial  experiments). Following the methodology above, we ran two experiments, each on around 80 US-based participants, to  compare human abilities to detect the articles generated by GPT-3 and a control model.   We found that mean human accuracy at detecting the intentionally bad longer articles from the control model was  ∼ 88%, while mean human accuracy at detecting the longer articles that were produced by GPT-3 175B was still barely  above chance at ∼ 52% (see Table 3.12). This indicates that, for news articles that are around 500 words long, GPT-3  continues to produce articles that humans find difficult to distinguish from human written news articles.

A task studied in developmental linguistics [CB78] is the ability to learn and use new words, for example using a word in a sentence after seeing it defined only once, or conversely inferring a word's meaning from only one usage. Here we qualitatively test GPT-3's ability to do the former. Specifically, we give GPT-3 the definition of a nonexistent word, such as "Gigamuru", and then ask it to use it in a sentence. We provide one to five previous examples of a (separate) nonexistent word being defined and used in a sentence, so the task is few-shot in terms of previous examples of the broad task and one-shot in terms of the specific word. Table 3.16 shows the 6 examples we generated; all definitions were human-generated, and the first answer was human-generated as conditioning while the subsequent answers were generated by GPT-3. These examples were generated continuously in one sitting, and we did not omit or repeatedly try any prompts. In all cases the generated sentence appears to be a correct or at least plausible use of the word. In the final sentence the model generates a plausible conjugation for the word "screeg" (namely "screeghed"), although the use of the word is slightly awkward ("screeghed at each other"), even if plausible in the sense that it could describe a toy sword fight. Overall, GPT-3 appears to be at least proficient at the task of using novel words in a sentence.

Since our training dataset is sourced from the internet, it is possible that our model was trained on some of our benchmark test sets. Accurately detecting test contamination from internet-scale datasets is a new area of research without established best practices. While it is common practice to train large models without investigating contamination, given the increasing scale of pretraining datasets, we believe this issue is becoming increasingly important to attend to.

This concern is not merely hypothetical. One of the first papers to train a language model on Common Crawl data [TL18] detected and removed a training document which overlapped with one of their evaluation datasets. Other work such as GPT-2 [RWC+19] also conducted post-hoc overlap analysis. Their study was relatively encouraging, finding that although models did perform moderately better on data that overlapped between training and testing, this did not significantly affect the reported results, because only a small fraction of the data was contaminated (often just a few percent).

GPT-3 operates in a somewhat different regime. On the one hand, the dataset and model size are about two orders of magnitude larger than those used for GPT-2, and they include a large amount of Common Crawl, which increases the potential for contamination and memorization. On the other hand, precisely because of the large amount of data, even GPT-3 175B does not overfit its training set by a significant amount, measured relative to a held-out validation set (Figure 4.1). Thus, we expect that contamination is likely to be frequent, but that its effects may not be as large as feared.

We initially tried to address the issue of contamination by proactively searching for and attempting to remove any overlap between our training data and the development and test sets of all benchmarks studied in this paper. Unfortunately, a bug resulted in only partial removal of the detected overlaps from the training data. Due to the cost of training, it was not feasible to retrain the model. To address this, we investigate in detail how the remaining detected overlap affects the results.

For each benchmark, we produce a "clean" version which removes all potentially leaked examples, defined roughly as examples that have a 13-gram overlap with anything in the pretraining set (or that overlap as a whole with the pretraining set when the example is shorter than 13-grams). The goal is to very conservatively flag anything that could potentially be contaminated, so as to produce a clean subset that is free of contamination with high confidence. The exact procedure is detailed in Appendix C.
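A minimal sketch of this flagging rule, assuming the pretraining 13-grams (plus whole short documents) have already been collected into a set; the real procedure in Appendix C handles tokenization and edge cases more carefully:

```python
# Sketch: flag a benchmark example as potentially contaminated via 13-gram overlap.
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_potentially_leaked(example_text, pretraining_ngrams, n=13):
    """Flag the example if any of its n-grams overlaps the pretraining n-gram set;
    examples shorter than n tokens are compared as a single whole-example tuple."""
    tokens = tuple(example_text.lower().split())
    candidate_grams = ngrams(tokens, n) if len(tokens) >= n else {tokens}
    return not candidate_grams.isdisjoint(pretraining_ngrams)
```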

We then evaluate GPT-3 on these clean benchmarks and compare to the original scores. If the score on the clean subset is similar to the score on the whole dataset, this suggests that contamination, even if present, does not significantly affect the reported results. If the score on the clean subset is lower, this suggests contamination may be inflating the results. The results are summarized in Figure 4.2. Although potential contamination is often high (with a quarter of the benchmarks scoring over 50%), in most cases performance changes only negligibly, and we see no evidence that the contamination level and the performance difference are correlated. We conclude that either our conservative method substantially overestimated contamination, or that contamination has little effect on performance.

Below, we review in more detail the specific cases where either (1) the model performs significantly worse on the cleaned version, or (2) the potential contamination is very high, which makes measuring the performance difference difficult.

Our analysis flagged six groups of benchmarks for further investigation: Word Scrambling, Reading Comprehension  (QuAC, SQuAD2, DROP), PIQA, Winograd, language modeling tasks (Wikitext tasks, 1BW), and German to English translation. Since our overlap analysis is designed to be extremely conservative, we expect it to produce some false  positives. We summarize the results for each group of tasks below:

We also inspected datasets where contamination was high, but the impact on performance was close to zero, simply  to verify how much actual contamination existed. These appeared to often contain false positives. They had either  no actual contamination, or had contamination that did not give away the answer to the task. One notable exception  was LAMBADA, which appeared to have substantial genuine contamination, yet the impact on performance was very  small, with the clean subset scoring within 0.5% of the full dataset. Also, strictly speaking, our fill-in-the-blank format  precludes the simplest form of memorization. Nevertheless, since we made very large gains on LAMBADA in this  paper, the potential contamination is noted in the results section.   An important limitation of our contamination analysis is that we cannot be sure that the clean subset is drawn from the  same distribution as the original dataset. It remains possible that memorization inflates results but at the same time  is precisely counteracted by some statistical bias causing the clean subset to be easier. However, the sheer number  of shifts close to zero suggests this is unlikely, and we also observed no noticeable difference in the shifts for small  models, which are unlikely to be memorizing.

Overall, we have made a best effort to measure and document the effects of data contamination, and to note or outright remove problematic results, depending on the severity. Much work remains to be done to address this important and subtle issue for the field in general, both when designing benchmarks and when training models. For a more detailed explanation of our analysis, we refer the reader to Appendix C.

GPT-3 and our analysis of it have a number of limitations. Below we describe some of these and suggest directions for future work.

First, despite the strong quantitative and qualitative improvements of GPT-3, particularly compared to its direct predecessor GPT-2, it still has notable weaknesses in text synthesis and several NLP tasks. On text synthesis, although the overall quality is high, GPT-3 samples still sometimes repeat themselves semantically at the document level, start to lose coherence over sufficiently long passages, contradict themselves, and occasionally contain non-sequitur sentences or paragraphs. We will release a collection of 500 uncurated unconditional samples to help provide a better sense of GPT-3's limitations and strengths at text synthesis. Within the domain of discrete language tasks, we have noticed informally that GPT-3 seems to have special difficulty with "common sense physics", despite doing well on some datasets (such as PIQA [BZB+19]) that test this domain. Specifically, GPT-3 has difficulty with questions of the type "If I put cheese into the fridge, will it melt?". Quantitatively, GPT-3's in-context learning performance has some notable gaps on our suite of benchmarks, as described in Section 3; in particular it does little better than chance when evaluated one-shot or even few-shot on some "comparison" tasks, such as determining whether two words are used the same way in a sentence, or whether one sentence implies another (WiC and ANLI respectively), as well as on a subset of reading comprehension tasks. This is especially striking given GPT-3's strong few-shot performance on many other tasks.

GPT-3 has several structural and algorithmic limitations, which could account for some of the issues above. We focused on exploring in-context learning behavior in autoregressive language models because it is straightforward to both sample from and compute likelihoods with this model class. As a result, our experiments do not include any bidirectional architectures or other training objectives such as denoising. This is a noticeable difference from much of the recent literature, which has documented improved fine-tuning performance when using these approaches over standard language models [RSR+19]. Thus, our design decision comes at the cost of potentially worse performance on tasks which empirically benefit from bidirectionality. This may include fill-in-the-blank tasks, tasks that involve looking back and comparing two pieces of content, or tasks that require re-reading or carefully considering a long passage and then generating a very short answer. This could be a possible explanation for GPT-3's lagging few-shot performance on a few of the tasks, such as WiC (which involves comparing the use of a word in two sentences), ANLI (which involves comparing two sentences to see if one implies the other), and several reading comprehension tasks (e.g. QuAC and RACE). We also conjecture, based on past literature, that a large bidirectional model would be stronger at fine-tuning than GPT-3. Making a bidirectional model at the scale of GPT-3, and/or trying to make bidirectional models work with few-shot or zero-shot learning, is a promising direction for future research, and could help achieve the "best of both worlds".

A limitation associated with models at the scale of GPT-3, regardless of objective function or algorithm, is that they are both expensive and inconvenient to perform inference on, which may present a challenge for practical applicability of models of this scale in their current form. One possible future direction to address this is distillation [HVD15] of large models down to a manageable size for specific tasks. Large models such as GPT-3 contain a very wide range of skills, most of which are not needed for a specific task, suggesting that in principle aggressive distillation may be possible. Distillation is well-explored in general [LHCG19a] but has not been tried at the scale of hundreds of billions of parameters; new challenges and opportunities may be associated with applying it to models of this size.

Finally, GPT-3 shares some limitations common to most deep learning systems – its decisions are not easily interpretable,  it is not necessarily well-calibrated in its predictions on novel inputs as observed by the much higher variance in  performance than humans on standard benchmarks, and it retains the biases of the data it has been trained on. This  last issue – biases in the data that may lead the model to generate stereotyped or prejudiced content – is of special  concern from a societal perspective, and will be discussed along with other issues in the next section on Broader Impacts  (Section 6).

Language models have a wide range of beneficial applications for society, including code and writing auto-completion, grammar assistance, game narrative generation, improving search engine responses, and answering questions. But they also have potentially harmful applications. GPT-3 improves the quality and adaptability of text generation relative to smaller models and increases the difficulty of distinguishing synthetic text from human-written text. It therefore has the potential to advance both the beneficial and the harmful applications of language models.

Here we focus on the potential harms of improved language models, not because we believe the harms are necessarily greater, but in order to motivate efforts to study and mitigate them. The broader impacts of language models like this are numerous. We focus on two primary issues: the potential for deliberate misuse of language models like GPT-3 in Section 6.1, and issues of bias, fairness, and representation within models like GPT-3 in Section 6.2. We also briefly discuss issues of energy efficiency (Section 6.3).

Malicious uses of language models can be somewhat difficult to anticipate because they often involve repurposing language models in a very different environment or for a different purpose than researchers intended. To help with this, we can think in terms of traditional security risk assessment frameworks, which outline key steps such as identifying threats and potential impacts, assessing likelihood, and determining risk as a combination of likelihood and impact [Ros12]. We discuss three factors: potential misuse applications, threat actors, and external incentive structures.

Any socially harmful activity that relies on generating text could be augmented by powerful language models. Examples  include misinformation, spam, phishing, abuse of legal and governmental processes, fraudulent academic essay writing  and social engineering pretexting. Many of these applications bottleneck on human beings to write sufficiently high  quality text. Language models that produce high quality text generation could lower existing barriers to carrying out  these activities and increase their efficacy.

The misuse potential of language models increases as the quality of text synthesis improves. The ability of GPT-3 to  generate several paragraphs of synthetic content that people find difficult to distinguish from human-written text in  3.9.4 represents a concerning milestone in this regard.

Threat actors can be organized by skill and resource levels, ranging from low- or moderately-skilled and resourced actors who may be able to build a malicious product, to "advanced persistent threats" (APTs): highly skilled and well-resourced (e.g. state-sponsored) groups with long-term agendas [SBC+19].

To understand how low- and mid-skill actors think about language models, we have been monitoring forums and chat groups where misinformation tactics, malware distribution, and computer fraud are frequently discussed. While we did find significant discussion of misuse following the initial release of GPT-2 in spring of 2019, we found fewer instances of experimentation and no successful deployments since then. Additionally, those misuse discussions were correlated with media coverage of language model technologies. From this, we assess that the threat of misuse from these actors is not immediate, but significant improvements in reliability could change this.

Because APTs do not typically discuss operations in the open, we have consulted with professional threat analysts about possible APT activity involving the use of language models. Since the release of GPT-2, there has been no discernible difference in operations that might gain from using language models. The assessment was that language models may not be worth investing significant resources in, because there has been no convincing demonstration that current language models are significantly better than existing methods for generating text, and because methods for "targeting" or "controlling" the content of language models are still at a very early stage.

Each threat actor group also has a set of tactics, techniques, and procedures (TTPs) that they rely on to accomplish their  agenda. TTPs are influenced by economic factors like scalability and ease of deployment; phishing is extremely popular  among all groups because it offers a low-cost, low-effort, high-yield method of deploying malware and stealing login  credentials. Using language models to augment existing TTPs would likely result in an even lower cost of deployment.

Ease of use is another significant incentive. Having stable infrastructure has a large impact on the adoption of TTPs.  The outputs of language models are stochastic, however, and though developers can constrain these (e.g. using top-k  truncation) they are not able to perform consistently without human feedback. If a social media disinformation bot  produces outputs that are reliable 99% of the time, but produces incoherent outputs 1% of the time, this could reduce the  amount of human labor required in operating this bot. But a human is still needed to filter the outputs, which restricts  how scalable the operation can be.   Based on our analysis of this model and analysis of threat actors and the landscape, we suspect AI researchers will  eventually develop language models that are sufficiently consistent and steerable that they will be of greater interest to  malicious actors. We expect this will introduce challenges for the broader research community, and hope to work on  this through a combination of mitigation research, prototyping, and coordinating with other technical developers.

Biases present in training data may lead models to generate stereotyped or prejudiced content. This is concerning,  since model bias could harm people in the relevant groups in different ways by entrenching existing stereotypes and  producing demeaning portrayals amongst other potential harms [Cra17]. We have conducted an analysis of biases in  the model in order to better understand GPT-3’s limitations when it comes to fairness, bias, and representation. 8   Our goal is not to exhaustively characterize GPT-3, but to give a preliminary analysis of some of its limitations and  behaviors. We focus on biases relating to gender, race, and religion, although many other categories of bias are likely  present and could be studied in follow-up work. This is a preliminary analysis and does not reflect all of the model’s  biases even within the studied categories.

Broadly, our analysis indicates that internet-trained models have internet-scale biases; models tend to reflect stereotypes  present in their training data. Below we discuss our preliminary findings of bias along the dimensions of gender, race,  and religion. We probe for bias in the 175 billion parameter model and also in similar smaller models, to see if and how  they are different in this dimension.

We also carried out pronoun resolution on the Winogender dataset [RNLVD18] using two methods which further corroborated the model's tendency to associate most occupations with males. One method measured the model's ability to correctly assign a pronoun as the occupation or the participant. For example, we fed the model a context such as "The advisor met with the advisee because she wanted to get advice about job applications. 'She' refers to the" and found the option with the lowest probability between the two possible options (Choices between Occupation Option: advisor; Participant Option: advisee).

Occupation and participant words often have societal biases associated with them, such as the assumption that most occupants are by default male. We found that the language models learnt some of these biases, such as a tendency to associate female pronouns with participant positions more than male pronouns. GPT-3 175B had the highest accuracy of all the models (64.17%) on this task. It was also the only model where the accuracy for Occupant sentences (sentences where the correct answer was the Occupation option) for females was higher than for males (81.7% vs 76.7%). All other models had a higher accuracy for male pronouns with Occupation sentences as compared to female pronouns, with the exception of our second largest model, GPT-3 13B, which had the same accuracy (60%) for both. This offers some preliminary evidence that in places where issues of bias can make language models susceptible to error, the larger models are more robust than smaller models.

We also performed co-occurrence tests, where we analyzed which words are likely to occur in the vicinity of other preselected words. We created a model output sample set by generating 800 outputs of length 50 each with a temperature of 1 and top p of 0.9 for every prompt in our dataset. For gender, we had prompts such as "He was very", "She was very", "He would be described as", "She would be described as". We looked at the adjectives and adverbs in the top 100 most favored words using an off-the-shelf POS tagger [LB02]. We found females were more often described using appearance-oriented words such as "beautiful" and "gorgeous" as compared to men, who were more often described using adjectives that span a greater spectrum.

Table 6.1 shows the top 10 most favored descriptive words for the model along with the raw number of times each word co-occurred with a pronoun indicator. "Most Favored" here indicates words which were most skewed towards a category by co-occurring with it at a higher rate as compared to the other category. To put these numbers in perspective, we have also included the average for the number of co-occurrences across all qualifying words for each gender.
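A rough sketch of such a co-occurrence test, using NLTK's off-the-shelf part-of-speech tagger as a stand-in for the tagger cited in the paper; `sample_model` is a hypothetical generation call exposing temperature and top-p parameters:

```python
# Sketch: count adjectives/adverbs co-occurring with gendered prompts in model samples.
from collections import Counter
import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' resources

def descriptive_word_counts(prompts, sample_model, n_samples=800, length=50):
    """Tally descriptive words appearing in samples conditioned on each prompt."""
    counts = {prompt: Counter() for prompt in prompts}
    for prompt in prompts:  # e.g. "He was very", "She was very"
        for _ in range(n_samples):
            text = sample_model(prompt, max_tokens=length, temperature=1.0, top_p=0.9)
            tagged = nltk.pos_tag(nltk.word_tokenize(text))
            counts[prompt].update(
                w.lower() for w, tag in tagged if tag.startswith(("JJ", "RB"))
            )
    return counts
```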

Several prior efforts have focused specifically on question-answering, which constitutes a significant fraction of the tasks we tested on. Recent efforts include [RSR+19, RRS20], which fine-tuned an 11 billion parameter language model, and [GLT+20], which focused on attending over a large corpus of data at test time. Our work differs in its focus on in-context learning, but could be combined in the future with the work of [GLT+20, LPP+20].

Metalearning in language models has been utilized in [RWC+19], though with much more limited results and no systematic study. More broadly, language model metalearning has an inner-loop-outer-loop structure, making it structurally similar to metalearning as applied to ML in general. There is an extensive literature here, including matching networks [VBL+16], RL2 [DSC+16], learning to optimize [RL16, ADG+16, LM17], and MAML [FAL17]. Our approach of stuffing the model's context with previous examples is most structurally similar to RL2 and also resembles [HYC01], in that an inner loop of adaptation takes place through computation in the model's activations across timesteps, without updating the weights, while an outer loop (in this case just language model pre-training) updates the weights and implicitly learns the ability to adapt to, or at least recognize, tasks defined at inference time. Few-shot auto-regressive density estimation was explored in [RCP+17], and [GWC+18] studied low-resource NMT as a few-shot learning problem.

Although the mechanism of our few-shot approach is different, prior work has also explored ways of using pre-trained language models in combination with gradient descent to perform few-shot learning [SS20]. Another sub-field with similar goals is semi-supervised learning, where approaches such as UDA [XDH+19] also explore methods of fine-tuning when very little labeled data is available.

Giving multi-task models instructions in natural language was first formalized in a supervised setting in [MKXS18] and was used for some tasks (such as summarization) in a language model in [RWC+19]. The notion of presenting tasks in natural language was also explored in the text-to-text transformer [RSR+19], although there it was applied to multi-task fine-tuning rather than to in-context learning without weight updates.

Another approach to increasing generality and transfer-learning capability in language models is multi-task learning [Car97], which fine-tunes on a mixture of downstream tasks together rather than separately updating the weights for each one. If successful, multi-task learning could allow a single model to be used for many tasks without updating the weights (similar to our in-context learning approach), or alternatively could improve sample efficiency when updating the weights for a new task. Multi-task learning has shown some promising initial results [LGH+15, LSP+18], multi-stage fine-tuning has recently become a standardized part of SOTA results on some datasets [PFB18], and it has pushed the boundaries on certain tasks [KKS+20], but it is still limited by the need to manually curate collections of datasets and set up training curricula. By contrast, pre-training at large scale appears to offer a "natural" broad distribution of tasks implicitly contained in predicting the text itself. One direction for future work might be attempting to generate a broader set of explicit tasks for multi-task learning, for example through procedural generation [TFR+17], human interaction [ZSW+19b], or active learning [Mac92].

THE END