Adapting Language Models for Your Business using Retrieval-Augmented Language Models

Key Takeaways

  • Pre-trained language models are limited by their lack of access to up-to-date information.

  • Language models are prone to producing hallucinations or inaccurate responses, even with built-in safeguards.

  • Retrieval-augmented language modeling (RALM) can overcome these limitations by embedding factual content directly into the user's prompt, allowing for more accurate and relevant responses.

  • RALM enables organizations to leverage their internal data sources and build flexible retrieval systems to enhance language model responses.

TL;DR

As language models continue to revolutionize the field of natural language processing, businesses are looking to harness their power for internal applications and processes. However, it's crucial to understand the limitations of pre-trained language models, particularly their lack of access to up-to-date information and tendency to produce inaccurate responses. In this post, I explore retrieval-augmented language modeling (RALM) as a promising solution to overcome these limitations. By embedding factual content directly into the user's prompt, RALM enables language models to generate more accurate and relevant responses, while also allowing organizations to leverage their internal data sources and build flexible retrieval systems.

Pre-trained Language Models Lack Timely Information

Groundbreaking advances in language models, such as GPT-4, have transformed the field of natural language processing. These models have enabled machines to not only comprehend human language more effectively but also generate responses that closely resemble human-like communication. This has opened up a world of possibilities for businesses seeking to harness the power of language models for internal applications and processes.

However, a critical aspect to consider is that these models are predominantly trained on general text corpora, which may not encompass the most recent data on world events, industry-specific information, or company-specific metrics. With the recent release of GPT-4, OpenAI notes this limitation explicitly:

“GPT-4 generally lacks knowledge of events that have occurred after the vast majority of its data cuts off (September 2021), and does not learn from its experience.”
— OpenAI

For example, if a user were to prompt a language model to provide current financial data or stock prices for a specific company, the model may not possess up-to-date information and could generate inaccurate results. Nevertheless, some language models, such as OpenAI's ChatGPT-3.5 and later versions, incorporate built-in safeguards to mitigate these limitations. As an example, when prompted to deliver Aon's latest financial metrics, ChatGPT-3.5 acknowledges its inability to access real-time data and clarifies that it cannot provide the most recent information.

As an AI language model, I don't have access to real-time data, so I cannot provide you with the latest financial performance metrics of Aon…

Similarly, GPT-3.5 prefaces its response with a statement that directs the user to the company's 2020 Annual Report, indicating awareness of the limited timeliness of the information available to the model:

Aon's most recent financial performance metrics can be found in its 2020 Annual Report…

Despite built-in safeguards, the inability to access the most recent data remains a notable limitation for language models, particularly from a business use case perspective. Organizations must exercise caution and diligence when implementing language models for various tasks and applications, ensuring that they account for these constraints and supplement AI-generated information with up-to-date, accurate data sources where necessary. By doing so, businesses can maximize the utility of these powerful tools while minimizing the risks associated with outdated or incomplete information.

Pre-trained Language Models Are Prone to Hallucinations

The queries above exemplify the most significant limitation of language models for business applications: generative language models are highly susceptible to producing hallucinations, or responses that deviate from factual content. Let's examine these queries more closely.

Prompt: What are Aon’s latest financial performance metrics? 
Response: <Text Omitted>
  • Total Revenue: $11.0 billion
  • Net Income: $1.4 billion
  • EPS: $6.09
  • Adjusted EPS: $9.74
  • Total Assets: $35.9 billion
  • Shareholders’ equity: $10.1 billion

ChatGPT-3.5

Prompt: What are Aon’s latest financial performance metrics?
Response: <Text Omitted>
  • Total Revenue: $12.5 billion 
  • Net Income: $1.7 billion 
  • Total Assets: $45.4 billion 
  • Return on Equity: 17.1%
  • Operating Margin: 18.7%
  • Total Shareholders' Equity: $7.5 billion

GPT-3.5


In these examples, I excluded some text to compare the responses of the two models. The primary observation is the inconsistency between the responses. Each model provides different financial metrics based on its interpretation of the prompt, and since the prompt was vague, some variation in the results is reasonable. However, it's important to note that the models report different values even for the same financial metrics.

So, which model is correct?

As it turns out, they are both wrong! According to Aon’s 2020 Annual Report (and other sources), the correct metrics are:

  • Total Revenue: $11.1 billion

  • Net Income: $1.9 billion

  • Total Assets: $32 billion

  • Shareholder’s Equity: $3.5 billion

While the models' responses were generally in the right range for the correct financial metrics (except shareholder’s equity), they would certainly fail to meet the acceptance criteria for use cases that require high precision or factual responses. This limitation is particularly problematic because the confident tone of the responses, coupled with the language models' susceptibility to producing hallucinations, can lead users to believe the results are accurate even when they're not.

Retrieval-Augmented Generation Can Solve These Limitations

A toy example demonstrating a prototypical retrieval-augmentation system. Retrieval system: The retrieval system receives a user query as an input and scores documents within the reference corpus according to their relevance to the query. Once documents are scored, the retrieval system will output a set of documents most relevant to the user query. In-context learning system: The user’s original query is combined with text from the set of relevant documents. The augmented prompt could combine relevant sentences, paragraphs, documents, or even clusters of documents with the user query. Once combined, the augmented prompt is then submitted to the language model for evaluation.

How can businesses adapt language models for high-precision use cases without incurring prohibitive costs? In the past, this would have required significant amounts of human-annotated data to train or fine-tune models for specific tasks. In the current paradigm, however, large language models are zero-shot learners, which enables a single model to adapt to multiple tasks with minimal human annotation and no additional training. Nevertheless, given language models' potential for inaccuracy, what should businesses invest in to achieve high-precision results?

The answer, I believe, lies in Retrieval-Augmented Language Modeling (RALM). RALM is an emerging paradigm that shows great promise in addressing the limitations of language models for business use cases. With RALM, a system is designed to augment input prompts with factual content that the language model can use to answer a user's query. This approach allows the language model to draw on more specific, relevant, and timely information, especially in zero or few-shot inference applications where a language model must provide accurate responses with minimal or no training data.

The system diagram above highlights a typical RALM system for the financial metrics use case.
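Putting the two stages together, a minimal RALM loop might look like the sketch below. This is a toy illustration, not a production design: the scorer is a crude word-overlap heuristic, the corpus snippets are made up, and `call_llm` is a hypothetical stand-in for whatever model API an organization actually uses.

```python
def overlap(query: str, doc: str) -> int:
    """Crude relevance score: number of lowercase words shared by query and document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, corpus: list[str], k: int = 3) -> list[str]:
    """Rank every document in the corpus against the query and keep the top k."""
    return sorted(corpus, key=lambda doc: overlap(query, doc), reverse=True)[:k]

def answer(query: str, corpus: list[str], call_llm) -> str:
    """Retrieve relevant passages, splice them into the prompt, and query the model."""
    passages = retrieve(query, corpus)
    prompt = (
        "Answer using only the context below.\n"
        "Context:\n" + "\n".join(f"- {p}" for p in passages) +
        f"\n\nQuestion: {query}\nAnswer:"
    )
    return call_llm(prompt)

# A stand-in "model" that just echoes its prompt, so the assembled input can be inspected:
corpus = [
    "Aon reported total revenue of $11.1 billion for 2020.",
    "An unrelated note about office locations.",
]
print(answer("What was Aon's total revenue for 2020?", corpus, lambda p: p))
```

The key design point is the separation of concerns: the retrieval and augmentation steps can be swapped out independently of the model behind `call_llm`.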

The Retrieval System

A retrieval system is designed to retrieve relevant information from a large corpus of data in response to a user's query or request. The system analyzes the query and compares it to the content of the corpus, using various algorithms to determine the most relevant documents or information to return. Retrieval systems are commonly used in search engines, recommendation systems, and other applications where large amounts of data need to be efficiently searched and organized.
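As a concrete illustration of the scoring step, here is a minimal bag-of-words cosine-similarity scorer. It is a toy stand-in for the vector search or learned rankers used in real systems, and the corpus snippets are invented for the example.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> Counter:
    """Lowercase word counts, ignoring punctuation."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def score(query: str, doc: str) -> float:
    """Cosine similarity between bag-of-words vectors of the query and a document."""
    q, d = tokenize(query), tokenize(doc)
    dot = sum(q[t] * d[t] for t in q)
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

corpus = [
    "Aon's 2020 Annual Report lists total revenue of $11.1 billion.",
    "A recipe for deep-dish pizza.",
]
query = "What are Aon's latest financial performance metrics?"
ranked = sorted(corpus, key=lambda doc: score(query, doc), reverse=True)
```

Production retrieval systems typically replace this lexical scoring with dense embeddings and an approximate nearest-neighbor index, but the contract is the same: query in, ranked documents out.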

Retrieval systems, such as Google Search, are a familiar part of daily life. In the context of Retrieval-Augmented Language Modeling (RALM), a company can design a retrieval system that works the same way: algorithms analyze a user's query and return the most relevant documents or information from a corpus of data. For instance, when I submit my original query to Google, these are the relevant documents returned by its retrieval system:

Relevant documents returned by Google’s retrieval algorithm. These documents can be parsed and used to augment the original query to provide factual content to the language model.

The retrieval system outputs a set of relevant documents based on the input query. In this example, the system performs well despite the ambiguity of the query. This is due to Google's effective retrieval algorithm, which can interpret the user's intent even when the query is not specific.

In-Context Learning System

With the retrieved documents, the original input query can be augmented with additional context and submitted to the language model for evaluation. This approach relies on in-context learning to exploit the language model's zero-shot inference capabilities to return more precise responses. In more advanced cases, the prompt can be further enhanced by providing demonstrations to the model, which can improve its performance even more.
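A sketch of the augmentation step is shown below. The instruction wording and passage contents are purely illustrative; real systems tune both, and may also prepend few-shot demonstrations ahead of the context.

```python
def augment_prompt(query: str, passages: list[str]) -> str:
    """Splice retrieved passages into the prompt so the model answers from them."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Using only the context below, answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

prompt = augment_prompt(
    "What are Aon's latest financial performance metrics?",
    ["Aon's 2020 total revenue was $11.1 billion.",
     "Aon's 2020 net income was $1.9 billion."],
)
print(prompt)
```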

In this example, only the first section of the first returned document is used to augment the original query, as it contained the necessary information to answer the user's query. However, in real-world scenarios, the augmentation system may need to be more flexible, incorporating information from multiple sources to provide a comprehensive answer. This flexibility is crucial to ensuring that the language model can provide accurate and relevant responses across a wide range of business use cases.
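One simple way to get that flexibility is to greedily pack top-ranked passages into the prompt until a context budget is spent. The sketch below uses a rough character budget as an assumption; a production system would count tokens with the model's own tokenizer.

```python
def pack_context(passages: list[str], max_chars: int = 2000) -> str:
    """Greedily keep top-ranked passages until the character budget is spent.

    Assumes `passages` is already sorted by relevance, best first.
    """
    picked, used = [], 0
    for passage in passages:
        if used + len(passage) > max_chars:
            break  # stop at the first passage that would overflow the budget
        picked.append(passage)
        used += len(passage)
    return "\n\n".join(picked)
```

Stopping at the first overflow keeps the highest-ranked evidence intact; fancier schemes summarize or truncate lower-ranked passages instead of dropping them.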

Here is the output from GPT-3.5 using the augmented prompt (I also instructed the model to provide bullet points):

The response generated by GPT-3.5 is accurate when factual content is provided within the prompt. This is likely because the augmented prompt enhances the model’s Bayesian inference capabilities, enabling it to reason probabilistically about the input and arrive at more precise responses. By incorporating factual content into the prompt, the model can more accurately interpret the user's query and provide relevant, accurate responses, especially given that the correct answers are embedded in the prior available to the model. (This is an interesting research area that I’ll probably explore more in a later post.)

Conclusion

Language models are transforming the way businesses operate, with applications such as Microsoft Copilot already targeting productivity. However, businesses must be aware of the limitations of language models, including timeliness and factual accuracy. Language models are only exposed to information up to a certain point in time, and they can be prone to generating inaccurate responses, or "hallucinations." Retrieval-augmented language modeling (RALM) is a promising solution to overcome these limitations. This approach embeds factual content directly into the user's prompt, allowing the language model to generate more accurate and factual responses.

Building on my previous post, Retrieval-augmented language modeling (RALM) can enable organizations to adapt internal data assets for new use cases by providing a flexible and scalable approach to leveraging existing data. With RALM, organizations can build retrieval systems that can access and incorporate relevant data from their internal data sources. This data can then be used to enhance the language model's responses to user queries, providing more accurate and relevant information.