Large language models (LLMs) can be used to analyze complex documents and provide summaries and answers to questions. The post Domain-adaptation Fine-tuning of Foundation Models in Amazon SageMaker JumpStart on Financial data describes how to fine-tune an LLM using your own data. Once you have a solid LLM, you'll want to expose it to business users so they can process new documents that may be hundreds of pages long. In this post, we show how to build a real-time user interface that lets business users process a PDF document of arbitrary length. After the file is processed, you can summarize the document or ask questions about its content. The sample solution described in this post is available on GitHub.
Working with financial documents
Financial statements, such as quarterly earnings reports and annual reports to shareholders, are often tens or hundreds of pages long. These documents contain a lot of boilerplate language, such as disclaimers and legalese. If you want to extract key data points from one of these documents, you need time and some familiarity with the boilerplate to identify the interesting facts. And of course, you can't ask an LLM questions about a document it has never seen.
LLMs used for summarization impose a limit on the number of tokens (word or sub-word pieces) passed into the model, and with a few exceptions, this is typically no more than a few thousand tokens. That usually rules out summarizing longer documents directly.
Our solution handles documents that exceed an LLM's maximum token sequence length and makes those documents available to the LLM for question answering.
Solution overview
Our design has three important parts:
- It has an interactive web application for business users to upload and process PDFs
- It uses the LangChain library to split a large PDF into more manageable chunks
- It uses retrieval augmented generation techniques to answer user questions about new data the LLM has not seen before
As shown in the following diagram, we use a front end implemented with React JavaScript, hosted in an Amazon Simple Storage Service (Amazon S3) bucket fronted by Amazon CloudFront. The front-end application lets users upload PDF documents to Amazon S3. After the upload is complete, you can trigger a text extraction job powered by Amazon Textract. As part of post-processing, an AWS Lambda function inserts special markers into the text indicating page boundaries. When that work is done, you can invoke an API that either summarizes the text or answers questions about it.
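The extraction step can be started with a small handler along the lines of the following sketch. The function name, bucket, and key values are placeholders, and the marker post-processing described above would run once the Textract job completes.

```python
import boto3

textract = boto3.client("textract")

def start_extraction(bucket: str, key: str) -> str:
    """Kick off an asynchronous Textract text detection job for the uploaded PDF."""
    response = textract.start_document_text_detection(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}}
    )
    # The JobId is stored so a later function can page through the results and
    # insert the page and chunk markers during post-processing.
    return response["JobId"]
```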
Because some of these steps take time, the architecture uses a decoupled, asynchronous approach. For example, a call to summarize a document invokes a Lambda function that posts a message to an Amazon Simple Queue Service (Amazon SQS) queue. Another Lambda function picks up that message and starts an Amazon Elastic Container Service (Amazon ECS) AWS Fargate task. The Fargate task calls an Amazon SageMaker inference endpoint. We use a Fargate task here because summarizing a very long PDF may take more time and memory than a Lambda function has available. When the summarization is complete, the front-end application can pick up the results from an Amazon DynamoDB table.
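The following is a minimal sketch of that hand-off. The queue URL environment variable and the event shape are assumptions for illustration; the sample repository may structure its handlers differently.

```python
import json
import os
import boto3

sqs = boto3.client("sqs")

def handler(event, context):
    # The request body identifies which processed document to summarize.
    body = json.loads(event["body"])
    sqs.send_message(
        QueueUrl=os.environ["SUMMARY_QUEUE_URL"],  # hypothetical environment variable
        MessageBody=json.dumps({"document_key": body["document_key"]}),
    )
    # Work continues asynchronously; the front end polls DynamoDB for the result.
    return {"statusCode": 202, "body": json.dumps({"status": "queued"})}
```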
For summarization, we use AI21's Summarize model, one of the foundation models available through Amazon SageMaker JumpStart. Although this model handles documents of up to 10,000 words (about 40 pages), we use LangChain's text splitter to make sure that each summarization call to the LLM stays under 10,000 words. For text generation we use Cohere's Medium model, and for embeddings we use GPT-J, both via JumpStart.
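As a rough illustration of how little code a JumpStart deployment takes, the following sketch deploys a model as a real-time endpoint with the SageMaker Python SDK. The model ID and instance type shown are examples, not necessarily the exact values used by the sample application; look up the IDs for the AI21, Cohere, and GPT-J models in the JumpStart catalog.

```python
from sagemaker.jumpstart.model import JumpStartModel

# Illustrative model ID and instance type; substitute the catalog entries you need.
model = JumpStartModel(model_id="huggingface-textembedding-gpt-j-6b-fp16")
predictor = model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")
print(predictor.endpoint_name)  # referenced by the LangChain wrappers shown later
```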
Summary processing
When processing larger documents, we need to decide how to split the document into smaller pieces. When we receive the text extraction results from Amazon Textract, we insert markers for larger chunks of text (a configurable number of pages), individual pages, and line breaks. LangChain splits on those markers and assembles smaller documents that stay under the token limit. See the following code:
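The following is a minimal sketch of that splitting step. The marker strings and chunk parameters shown here are assumptions rather than the exact values used in the sample repository.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

def split_extracted_text(raw_text: str, chunk_size: int = 8000, chunk_overlap: int = 200):
    # Split on the post-processing markers first, then fall back to page markers
    # and plain line breaks, so each chunk stays under the model's input limit.
    text_splitter = RecursiveCharacterTextSplitter(
        separators=["<CHUNK>", "<PAGE>", "\n"],
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )
    # Returns a list of LangChain Document objects.
    return text_splitter.create_documents([raw_text])

texts = split_extracted_text(extracted_text)  # extracted_text: output of the Textract post-processing
```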
The LLM in the summarization chain is a thin wrapper around our SageMaker endpoint:
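The following sketch shows one way to build that wrapper with LangChain's SagemakerEndpoint class and feed the chunks to a summarization chain. The endpoint name and the request/response JSON keys ("source", "summary") are assumptions about the summarization endpoint's payload format, so adjust them to match the deployed model.

```python
import json
from langchain.llms.sagemaker_endpoint import SagemakerEndpoint, LLMContentHandler
from langchain.chains.summarize import load_summarize_chain

class SummarizeContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        # Hypothetical request shape; match it to the summarization endpoint's schema.
        return json.dumps({"source": prompt, **model_kwargs}).encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        # Hypothetical response shape with a top-level "summary" field.
        response = json.loads(output.read().decode("utf-8"))
        return response["summary"]

summary_llm = SagemakerEndpoint(
    endpoint_name="jumpstart-summarize-endpoint",  # placeholder endpoint name
    region_name="us-east-1",
    content_handler=SummarizeContentHandler(),
)

# map_reduce summarizes each chunk, then combines the partial summaries.
summary_chain = load_summarize_chain(summary_llm, chain_type="map_reduce")
summary = summary_chain.run(texts)  # texts: chunks produced by the splitter above
```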
Question answering
With the retrieval augmented generation approach, we first split the document into smaller segments. We create embeddings for each segment and store them in the open-source Chroma vector database through LangChain's interface. We save the database to an Amazon Elastic File System (Amazon EFS) file system for later use. See the following code:
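The following sketch shows the general shape of that step, assuming a GPT-J embedding endpoint. The endpoint name, the EFS path, and the JSON payload keys ("text_inputs", "embedding") are placeholders and may differ from the sample repository.

```python
import json
from langchain.embeddings.sagemaker_endpoint import (
    SagemakerEndpointEmbeddings,
    EmbeddingsContentHandler,
)
from langchain.vectorstores import Chroma

class EmbeddingContentHandler(EmbeddingsContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompts: list, model_kwargs: dict) -> bytes:
        # Hypothetical request shape for the embedding endpoint.
        return json.dumps({"text_inputs": prompts, **model_kwargs}).encode("utf-8")

    def transform_output(self, output: bytes) -> list:
        # Expected to return one embedding vector per input string.
        response = json.loads(output.read().decode("utf-8"))
        return response["embedding"]

embeddings = SagemakerEndpointEmbeddings(
    endpoint_name="jumpstart-gptj-embedding-endpoint",  # placeholder endpoint name
    region_name="us-east-1",
    content_handler=EmbeddingContentHandler(),
)

# Persist the index on the EFS mount so later invocations can reuse it
# without re-embedding the document.
vectordb = Chroma.from_documents(
    documents=texts,  # chunks produced by the splitter
    embedding=embeddings,
    persist_directory="/mnt/efs/chroma/<document-id>",  # placeholder path
)
vectordb.persist()
```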
When the embeddings are ready, the user can ask a question. We search the vector database for the chunks of text that most closely match the question:
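A minimal sketch of the lookup, reusing the index persisted in the previous step (the path and question are placeholders):

```python
from langchain.vectorstores import Chroma

# Reopen the index persisted on EFS, using the same embeddings wrapper as before.
vectordb = Chroma(
    persist_directory="/mnt/efs/chroma/<document-id>",  # placeholder path
    embedding_function=embeddings,
)

question = "What was the operating margin in the most recent quarter?"
docs = vectordb.similarity_search(question, k=3)  # top matching chunks, closest first
```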
We take the closest match and use it as the context for the text generation model to answer the question:
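The following sketch wires the retrieved chunks into a question answering chain backed by the text generation endpoint. Again, the endpoint name and the payload keys ("prompt", "generations") are assumptions about the deployed model's schema.

```python
import json
from langchain.llms.sagemaker_endpoint import SagemakerEndpoint, LLMContentHandler
from langchain.chains.question_answering import load_qa_chain

class GenerationContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        # Hypothetical request shape for the text generation endpoint.
        return json.dumps({"prompt": prompt, "max_tokens": 200, **model_kwargs}).encode("utf-8")

    def transform_output(self, output: bytes) -> str:
        # Hypothetical response shape; adjust to the endpoint's actual schema.
        response = json.loads(output.read().decode("utf-8"))
        return response["generations"][0]["text"]

generation_llm = SagemakerEndpoint(
    endpoint_name="jumpstart-cohere-medium-endpoint",  # placeholder endpoint name
    region_name="us-east-1",
    content_handler=GenerationContentHandler(),
)

# The "stuff" chain inserts the retrieved chunks directly into the prompt as context.
qa_chain = load_qa_chain(generation_llm, chain_type="stuff")
answer = qa_chain.run(input_documents=docs, question=question)
```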
User experience
Although LLMs represent advanced data science, most use cases for LLMs ultimately involve interaction with non-technical users. Our example web application handles an interactive use case in which business users can upload and process a new PDF document.
The following diagram shows the user interface. The user starts by uploading a PDF. Once the document is stored in Amazon S3, the user can start the text extraction job. When that is done, the user can kick off a summarization job or ask questions. The user interface exposes some advanced options, such as chunk size and chunk overlap, which are useful for advanced users testing the application on new documents.
Next steps
LLMs provide significant new information retrieval capabilities, and business users need convenient access to them. There are two directions to consider for future work:
- Take advantage of the powerful LLMs already available as JumpStart foundation models. With just a few lines of code, our sample application can deploy and use advanced LLMs from AI21 and Cohere for text summarization and generation.
- Make these capabilities available to non-technical users. A prerequisite for processing PDF documents is extracting the text from the document, and summarization jobs can take several minutes to run. That calls for a simple user interface with asynchronous backend processing, which is easy to design using cloud services such as Lambda and Fargate.
We also note that a PDF document contains semi-structured information. Important cues such as section headings are difficult to recognize programmatically, because they rely on font sizes and other visual indicators. Identifying the underlying structure of the information would help the LLM process the data more accurately, at least until LLMs can handle input of unbounded length.
Conclusion
In this post, we showed how to build an interactive web application that lets business users upload and process PDF documents for summarization and question answering. We saw how to use JumpStart foundation models to access advanced LLMs, and how to use text splitting and retrieval augmented generation techniques to process longer documents and make them available to the LLM as information.
At this point, there's no reason not to make these powerful capabilities available to your users. We encourage you to start using JumpStart foundation models today.
About the author
Randy DeFauw is a Senior Principal Solutions Architect at AWS. He holds an MSEE from the University of Michigan, where he worked on computer vision for autonomous vehicles. He also holds an MBA from Colorado State University. Randy has held various positions in the technology space, ranging from software engineering to product management. He entered the big data space in 2013 and continues to explore that area. He is actively working on projects in the ML space and has presented at numerous conferences, including Strata and GlueCon.