How we extracted insights from unstructured PDF documents

Abstract: For a client in the insurance industry, we converted thousands of unstructured health records into structured insights. Challenges included handling the wide variation in how health records are formatted, and the fact that relevant data can reside anywhere—in tables, as free text, or even in handwritten notes. With our solution, the client can now tap into a previously unused dataset and drive novel use cases.


A customer in the insurance industry approached us with a challenge we hoped to solve using machine learning technologies. A huge part of the required data, however, resided in thousands of raw, mostly scanned PDF documents.

Those documents included health reports for individual patients. The greatest hurdle was the lack of structure in those documents, as relevant information often resided in tables, as free text, or even in handwritten notes.

We quickly found that only by getting the data out of those documents could we train our machine learning algorithm to provide reliable predictions.


At SPRYFOX, we dug deep into the problem and broke down our approach into three steps:

1. Finding a tool for reliable optical character recognition.

We began with an extensive manual labeling step to establish a ground truth for the data contained in the raw PDFs. From there, we evaluated different optical character recognition technologies against that ground truth, ranging from free tools like Tesseract to cloud-based services such as AWS Textract. The performance of each tool in this evaluation phase determined our follow-up steps.

2. Transforming image-like PDF documents into a processable representation.

Based on our evaluation, we designed a fully automated PDF-processing framework that could extract data from thousands of documents within a reasonable timeframe and save the results to a well-partitioned data sink.

3. Extracting the training data from the processable document representations.

Finally, based on the extracted texts, we designed and built an extraction routine that retrieved the inputs for our ML model training.
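To make the OCR evaluation in step 1 concrete: a common way to score a tool's output against manually labeled ground truth is the character error rate (CER), i.e. the edit distance between reference and OCR output, normalized by the reference length. The function names below are ours for illustration; the actual evaluation harness is not shown in this article.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance normalized by reference length; lower is better."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)
```

Running every candidate tool over the same labeled pages and comparing mean CER gives a single, comparable quality number per tool.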


For our customer, we decided to employ Amazon Textract to reliably process raw PDF documents within reasonable timeframes. To keep costs in check, we chose the smallest functionality subset that still delivered the required data quality.
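A minimal sketch of such a Textract call, assuming the text-only `detect_document_text` API (Textract's cheaper option; `analyze_document` with `FeatureTypes=["TABLES"]` would be required for table structure at a higher per-page price). The function names and the region are illustrative:

```python
def extract_lines(document_bytes: bytes, region: str = "eu-central-1") -> list[str]:
    """Run Textract's synchronous, text-only detection on a single document page."""
    import boto3  # imported here so the pure helper below has no AWS dependency

    client = boto3.client("textract", region_name=region)
    response = client.detect_document_text(Document={"Bytes": document_bytes})
    return assemble_lines(response["Blocks"])


def assemble_lines(blocks: list[dict]) -> list[str]:
    """Keep only LINE blocks from a Textract response, in reading order."""
    return [b["Text"] for b in blocks if b["BlockType"] == "LINE"]
```

Textract returns a flat list of PAGE, LINE, and WORD blocks; filtering to LINE blocks is often enough for downstream free-text extraction.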

We ultimately delivered a cloud-based solution that could orchestrate the whole pipeline from feature extraction to preprocessing, and from representation mapping to a database housing the valuable extracted information for further use.
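The "well-partitioned data sink" mentioned above can be as simple as a date-partitioned object-store layout that lets downstream jobs prune by date. A sketch, with an illustrative prefix and partition scheme (the real layout depends on the client's query patterns):

```python
from datetime import date


def partition_key(document_id: str, processed_on: date,
                  prefix: str = "extracted") -> str:
    """Build a Hive-style partitioned key, e.g. for an S3-backed data sink."""
    return (f"{prefix}/year={processed_on.year}/"
            f"month={processed_on.month:02d}/day={processed_on.day:02d}/"
            f"{document_id}.json")
```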

We built the solution to be fully defined as Infrastructure as Code (IaC), using high-level AWS services. Our customer can now provide raw documents and get precise predictions with no additional input.