Digitizing documents and images within AWS

Find out how to digitize existing documents to make data-driven decisions

October 18, 2021, Esteban Uscanga-Olea

The first step towards being data-driven

There is a wealth of solutions available, Tesseract OCR for example. The challenge is operating the deployed service: any solution has to scale horizontally without demanding too much time and effort, and it has to fit the existing architecture. Read on to learn how Amazon Textract helps solve these challenges.

Text analysis with Amazon Textract

Amazon Textract is a machine learning service that automatically detects and extracts printed text, handwriting, and structured data (such as tables and fields of interest with their values) from images and scans of printed documents. Because the service is driven entirely through API calls, the digitization process can be automated from end to end.
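As a rough illustration of what such an API call can look like, the sketch below collects the detected lines of text from a single document using the `detect_document_text` operation of the boto3 Textract client. The helper names `extract_lines` and `extract_lines_from_file` are hypothetical and are not taken from our project; the second one assumes boto3 is installed and AWS credentials are configured.

```python
from typing import Any, Dict, List


def extract_lines(textract_client: Any, document_bytes: bytes) -> List[str]:
    """Return the text of every LINE block Textract reports, in response order."""
    response: Dict[str, Any] = textract_client.detect_document_text(
        Document={"Bytes": document_bytes}
    )
    # The response is a flat list of blocks (PAGE, LINE, WORD, ...);
    # LINE blocks carry one detected line of text each.
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


def extract_lines_from_file(path: str) -> List[str]:
    """Hypothetical convenience wrapper; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so extract_lines can be used without it

    with open(path, "rb") as handle:
        return extract_lines(boto3.client("textract"), handle.read())
```

Passing the client in as a parameter keeps the parsing logic easy to exercise with a stub, without calling the real service.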

Amazon Textract's machine learning models have been trained on millions of documents so that virtually any document type you upload is automatically recognized and processed for text extraction. When information is extracted from documents, the service returns a confidence score for each element it identifies so that you can make informed decisions about how you want to use the results. See more here.
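One way to act on these confidence scores is to accept high-confidence results automatically and route the rest to a person. The following is a minimal sketch under stated assumptions: the function name `split_by_confidence` and the threshold of 90 are illustrative choices rather than recommendations from AWS, and the input is assumed to be the `Blocks` list of a Textract response, where `Confidence` is reported on a 0 to 100 scale.

```python
from typing import Any, Dict, Iterable, List, Tuple

Block = Dict[str, Any]


def split_by_confidence(
    blocks: Iterable[Block],
    threshold: float = 90.0,  # assumed cut-off; tune it for your documents
    block_type: str = "LINE",
) -> Tuple[List[Block], List[Block]]:
    """Split blocks of one type into (accepted, needs_review) by confidence."""
    accepted: List[Block] = []
    needs_review: List[Block] = []
    for block in blocks:
        # Skip other block types and any block without a confidence value.
        if block.get("BlockType") != block_type or "Confidence" not in block:
            continue
        if block["Confidence"] >= threshold:
            accepted.append(block)
        else:
            needs_review.append(block)
    return accepted, needs_review
```

Blocks in the second list could, for example, be queued for a manual check while the first list is stored directly.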

Amazon Textract in Action

We recently used Amazon Textract to demonstrate how effectively it digitizes documents. Our task was to extract data from PDF files attached to incoming emails. Done manually, this took 5-10 minutes per PDF, since an operator had to check the extracted data by hand. The volume grew month by month until we were receiving hundreds of documents each month and needed to automate the process.

When we began to use Amazon Textract to digitize the process, we built a serverless pipeline that automatically processes the emails, extracts data from the attachments, refines it and stores it in DynamoDB. Once the relevant data had been digitized, we could leverage the full power of AWS services.
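The steps of such a pipeline can be sketched as follows. This is not the code from our project but a simplified illustration under several assumptions: the raw email is stored as an object in S3 and an S3 event triggers the Lambda function, each attachment is a single-page PDF small enough for the synchronous `detect_document_text` call (multi-page documents would need the asynchronous `start_document_text_detection` operation instead), and the table name `ExtractedDocuments`, the item layout and all function names are hypothetical.

```python
import email
from email import policy
from typing import Any, Dict, List
from urllib.parse import unquote_plus


def process_email(raw_email: bytes, textract_client: Any, table: Any) -> List[Dict[str, Any]]:
    """Extract text from every PDF attachment and store one item per attachment."""
    message = email.message_from_bytes(raw_email, policy=policy.default)
    message_id = str(message.get("Message-ID", "unknown")).strip()
    stored: List[Dict[str, Any]] = []
    for part in message.iter_attachments():
        if part.get_content_type() != "application/pdf":
            continue
        filename = part.get_filename() or "unnamed.pdf"
        response = textract_client.detect_document_text(
            Document={"Bytes": part.get_payload(decode=True)}
        )
        lines = [
            block["Text"]
            for block in response.get("Blocks", [])
            if block.get("BlockType") == "LINE"
        ]
        # Hypothetical item layout; only strings are stored to keep it simple.
        item = {"document_id": f"{message_id}#{filename}", "lines": lines}
        table.put_item(Item=item)
        stored.append(item)
    return stored


def handler(event: Dict[str, Any], context: Any) -> Dict[str, int]:
    """Lambda entry point, assuming an S3 'object created' event per email."""
    import boto3  # available in the Lambda Python runtime

    s3 = boto3.client("s3")
    textract = boto3.client("textract")
    table = boto3.resource("dynamodb").Table("ExtractedDocuments")  # assumed name
    count = 0
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        raw_email = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        count += len(process_email(raw_email, textract, table))
    return {"documents_stored": count}
```

Keeping `process_email` free of direct AWS client creation means the refinement and storage logic can be tested locally with stand-in objects before it is deployed.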

[Diagram showing the Textract process]

As the diagram demonstrates, Amazon Textract is an effective service for digitizing historical documents. In this instance, Amazon Textract takes the content, processes it using machine learning and manages the digitization. Supporting services such as AWS Lambda and Amazon DynamoDB process and store the data that we collect.

This process eliminates the need for manual checks and has enabled us to save around 10-12 days of work each month. Furthermore, because the architecture is event-driven and serverless, the solution can scale on demand and process documents autonomously.

Our experience demonstrates that AWS Textract can perform efficient and automated digitization of existing documents, reducing the time and costs associated with performing manual tasks.

Esteban Uscanga-Olea, Cloud Solutions Architect, T-Systems International GmbH

What does this mean for me?

Our experience demonstrates that Amazon Textract is an effective service for managing the digitization process. The solution enables significant time savings compared to manual or part-manual approaches. Because the extracted data lands directly in AWS, your organization can apply advanced analytics to it in a matter of seconds, taking the first steps towards a more data-driven enterprise.

If you are interested in using Amazon Textract and experiencing how well it integrates with other AWS services for effective digitization, you can access the code and repository by clicking here.

About the author

Esteban Uscanga-Olea

Cloud Solutions Architect, T-Systems International GmbH