Create OCR-Processed PDFs In 2 Steps

Siddhant Bahuguna

Khalid El Aaboudi

Last updated on

Feb 26, 2025


In this article, we’ll show you how to transform images, scanned PDFs, and similar files into OCR PDF documents—making them fully searchable.

Below is the two-step process we use at Mindee to convert your files into high-quality, OCR-enabled PDFs:

First, we use Mindee docTR, an open-source tool, to perform OCR (Optical Character Recognition) on the image or scanned PDF. The docTR OCR results are then exported as an XML file in hOCR format.
Second, we convert the hOCR file to a PDF using another open-source tool, OCRmyPDF.

Why Create an OCR-Enabled PDF?

We turn images and scanned documents or PDFs into OCR-enabled PDFs so that we can search them for specific keywords or phrases. A few lines of code are all that's needed. With the approach we present, the OCR also covers text embedded within images (logos, watermarks, etc.), which is normally left out.

To better understand why we need to digitize documents, let's take a look at two use cases: searching through a huge PDF and searching through a folder full of PDFs.

Below is a non-exhaustive list of documents that fall under each of the two use cases.

Searching Within a Large PDF

Examples include:

Contracts (terms and conditions, loan contracts, employment contracts, etc.)
Specifications
Scientific and technical reports
Insurance policies
Request for information/quotation/proposals

Searching Through a Folder of PDFs

Examples include:

Resumes (to find specific skills)
Questionnaires/forms (to locate particular answers)
Invoices, receipts, and quotations (to identify a specific item, customer, or supplier)
Presentations (to search for any keyword)
Old scanned news articles (to retrieve specific news content)

Creating an OCR-readable PDF streamlines this process for developers and non-developers alike, enabling quick keyword searches in any PDF reader.

You can also check out our article on how to read the passport MRZ lines!

Why Use docTR for Creating OCR PDFs?

A Quick Overview of docTR

docTR is one of the best open-source OCR solutions available on the market. It uses state-of-the-art detection and recognition models to seamlessly process documents for Natural Language Understanding tasks. With just 3 lines of code, we can load a document and extract text with a predictor!

pip install python-doctr

from doctr.io import DocumentFile
from doctr.models import ocr_predictor

model = ocr_predictor(pretrained=True)
# PDF
doc = DocumentFile.from_pdf("path/to/your/doc.pdf")
# Analyze
result = model(doc)
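
To give a sense of what `result` contains, the sketch below walks the nested dictionary that `result.export()` returns (pages, then blocks, lines, and words). The helper name `collect_words` and the confidence threshold are illustrative assumptions, and the small hand-written dictionary stands in for a real export so the snippet runs without loading a model.

```python
from typing import Any, Dict, List, Tuple


def collect_words(
    export: Dict[str, Any], min_confidence: float = 0.0
) -> List[Tuple[int, str, float]]:
    """Flatten a docTR-style export into (page_index, word, confidence) tuples."""
    words = []
    for page_idx, page in enumerate(export.get("pages", [])):
        for block in page.get("blocks", []):
            for line in block.get("lines", []):
                for word in line.get("words", []):
                    # Keep only words the recognizer is reasonably sure about
                    if word["confidence"] >= min_confidence:
                        words.append((page_idx, word["value"], word["confidence"]))
    return words


# Hand-written stand-in for `result.export()`, trimmed to the keys used above
sample_export = {
    "pages": [
        {
            "blocks": [
                {
                    "lines": [
                        {
                            "words": [
                                {"value": "Publication", "confidence": 0.98},
                                {"value": "Year", "confidence": 0.91},
                                {"value": "2O15", "confidence": 0.32},
                            ]
                        }
                    ]
                }
            ]
        }
    ]
}

for page_idx, value, confidence in collect_words(sample_export, min_confidence=0.5):
    print(page_idx, value, confidence)
```

With a real model output, you would pass `result.export()` instead of `sample_export`.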

docTR supports pretrained backbones such as db_resnet50_rotation for text detection and recognition. For more details on available backbones, please refer to the documentation. Additionally, because docTR can be trained with small rotations, it is robust to slightly rotated text, which helps when creating accurate OCR PDF files. A list of supported vocabularies is available here.

Feature	minimum	Basic WL	basic	comfort	extended
Basic Document Data Extraction	fr	fr	Yes	Yes	Yes
Template-based Processing	Yes	Yes	Yes	Yes	Yes
Multi-page Document Support	No	Yes	Yes	Yes	Yes
Structured Data Output	No	No	Yes	Yes	Yes
API Integration	No	No	Yes	Yes	Yes
Advanced Data Validation	No	No	No	Yes	Yes
Custom Field Extraction	No	No	No	No	Yes

Performance Comparison

Using various datasets, the table below compares docTR with alternative OCR solutions:

🔒 Note: The dataset used for the comparison could not be made public due to the sensitive information included in it.

Architecture	Receipts		Invoices		IDs
Architecture	Recall	Precision	Recall	Precision	Recall	Precision
(docTR)db_resnet50 + master	79	81.42	65.57	69.86	51.34	52.90
(docTR)db_resnet50 + sar_resnet31	78.94	81.37	65.89	70.79	51.78	53.35
AWS Textract	75.77	77.70	70.47	69.13	46.39	43.32
Gvision doc. text detection	68.91	59.89	63.20	52.85	43.70	29.21

Comparisons with public datasets such as FUNSD and CORD also demonstrate docTR’s competitive performance:

	FUNSD		CORD
Architecture	Recall	Precision	Recall	Precision
(docTR)db_resnet50 + master	71.03	76.06	84.49	81.94
(docTR)db_resnet50 + sar_resnet31	71.25	76.29	84.5	81.96
AWS Textract	78.1	83	87.5	66
Gvision doc. text detection	64	53.3	68.9	61.1

The above OCR models have been evaluated using both the training and evaluation sets of FUNSD and CORD. For further information regarding the metrics being used, see Task evaluation.
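
As a rough illustration of how recall and precision figures like those above are formed: recall is the share of ground-truth words that were found, and precision is the share of predicted words that are correct. The sketch below uses exact string matching on word lists, which is a simplifying assumption. The evaluation described in the docTR documentation also takes box localization into account, so treat this only as an intuition aid. The function name and sample words are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def recall_precision(ground_truth: List[str], predicted: List[str]) -> Tuple[float, float]:
    """Word-level recall and precision using exact string matches (simplified)."""
    gt_counts = Counter(ground_truth)
    pred_counts = Counter(predicted)
    # Multiset intersection: each ground-truth word can be matched at most once
    matched = sum((gt_counts & pred_counts).values())
    recall = matched / len(ground_truth) if ground_truth else 0.0
    precision = matched / len(predicted) if predicted else 0.0
    return recall, precision


gt = ["Publication", "Year", "2015", "2016"]
pred = ["Publication", "Year", "2015", "2O16", "extra"]

recall, precision = recall_precision(gt, pred)
print(f"recall={recall:.2%} precision={precision:.2%}")
# prints: recall=75.00% precision=60.00%
```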

Jumping Into the Code!

To create lightweight OCR PDFs using docTR and OCRmyPDF, start by installing the necessary tools: Mindee docTR and OCRmyPDF.

# installing requirements
!pip install "python-doctr[tf]"
!pip install ocrmypdf

You can use this example or any image/scanned PDF of your choice. For this tutorial, we’ll use the sample image below:

Downloading the Sample Image

To download our sample image, you can run the following code:

# sample input image (create the target folder first so wget can write to it)
!mkdir -p ./data/images
!wget https://pbs.twimg.com/media/B_UpX3WU8AA2j3r.jpg -O ./data/images/image.jpg

Alternatively, save the image locally on your computer.

As mentioned earlier, we break the process into two steps:

Step 1: Extract Text Using docTR

Define the output folders for the OCR PDF and the hOCR data:

import os
# define output folder
output_folder = "./output/"
output_hocr_folder = output_folder + "hocr/"
output_pdf_folder = output_folder + "pdf/"

os.makedirs(output_hocr_folder,exist_ok=True)
os.makedirs(output_pdf_folder,exist_ok=True)

Then load the image (or scanned PDF) and perform OCR:

📄 Note: if you are using a scanned PDF, you’ll need to use the DocumentFile.from_pdf method instead and run an OCR with docTR.

from doctr.models import ocr_predictor
from doctr.io import DocumentFile

# load image
image_path = "./data/images/image.jpg"

# extracting text from input image using docTR
docs = DocumentFile.from_images(image_path)

# load model
model = ocr_predictor(
    det_arch='db_resnet50',
    reco_arch='crnn_vgg16_bn',
    pretrained=True,
)

result = model(docs)

# display ocr boxes
result.show(docs)

Below we can see the docTR result which shows the detected and highlighted text in the image:

the image used for the example with highlights
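
If your inputs may be either images or scanned PDFs, the choice between `DocumentFile.from_images` and `DocumentFile.from_pdf` can be made from the file extension. The sketch below is one way to do that; the helper names `loader_name` and `load_document` and the list of image suffixes are assumptions made for illustration, not part of docTR.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to the formats you actually receive
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff", ".bmp"}


def loader_name(path: str) -> str:
    """Return the name of the DocumentFile constructor suited to this file."""
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        return "from_pdf"
    if suffix in IMAGE_SUFFIXES:
        return "from_images"
    raise ValueError(f"Unsupported input type: {path}")


def load_document(path: str):
    """Load an image or a scanned PDF with the matching docTR loader."""
    # Imported lazily so that loader_name() can be used without docTR installed
    from doctr.io import DocumentFile

    return getattr(DocumentFile, loader_name(path))(path)


print(loader_name("./data/images/image.jpg"))
print(loader_name("contract_scan.PDF"))
```

The object returned by `load_document` can then be passed to the model exactly as `docs` is in the code above.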

Step 2: Export hOCR and Generate the OCR PDF

Export the docTR OCR results as an XML file in hOCR format:

# export xml file
xml_outputs = result.export_as_xml()
with open(os.path.join(output_hocr_folder,"doctr_image_hocr.xml"),"w") as f :
    f.write(xml_outputs[0][0].decode())
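
Before converting, it can be useful to check what the hOCR file contains. hOCR stores each recognized word in an element with the class `ocrx_word`, with its bounding box in the `title` attribute. The sketch below reads those entries with the standard library; the helper name `hocr_words` and the tiny sample document are hypothetical, and a real docTR export will contain more attributes and nesting than shown here.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

BBox = Tuple[int, int, int, int]


def hocr_words(hocr_xml: str) -> List[Tuple[str, BBox]]:
    """Return (text, (x0, y0, x1, y1)) for every ocrx_word element."""
    root = ET.fromstring(hocr_xml)
    words = []
    for elem in root.iter():  # iterating all elements sidesteps XML namespaces
        if elem.get("class") != "ocrx_word":
            continue
        bbox = None
        # The title attribute looks like "bbox 40 50 180 80; x_wconf 98"
        for prop in elem.get("title", "").split(";"):
            parts = prop.split()
            if len(parts) == 5 and parts[0] == "bbox":
                bbox = tuple(int(float(v)) for v in parts[1:])
        if bbox is not None:
            words.append(("".join(elem.itertext()).strip(), bbox))
    return words


# Minimal hand-written hOCR sample (assumed shape, for illustration only)
sample_hocr = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<div class="ocr_page" title="image; bbox 0 0 600 800; ppageno 0">'
    '<span class="ocr_line" title="bbox 40 50 300 80">'
    '<span class="ocrx_word" title="bbox 40 50 180 80; x_wconf 98">Publication</span> '
    '<span class="ocrx_word" title="bbox 190 50 300 80; x_wconf 91">Year</span>'
    "</span></div></body></html>"
)

for text, bbox in hocr_words(sample_hocr):
    print(text, bbox)
```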

After exporting the hOCR result of docTR as XML, we can use OCRmyPDF to convert it to an OCR-readable PDF:

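import os
from ocrmypdf.hocrtransform import HocrTransform

# the hOCR file written in the previous step is the input; the PDF is the output
hocr_path = os.path.join(output_hocr_folder, "doctr_image_hocr.xml")
output_pdf_path = os.path.join(output_pdf_folder, "hocr_output.pdf")

hocr = HocrTransform(
    hocr_filename=hocr_path,
    dpi=300
)

# step to obtain the OCR-processed PDF
hocr.to_pdf(
    out_filename=output_pdf_path,
    image_filename=image_path,
    show_bounding_boxes=False,
    interword_spaces=False,
)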

And voilà! You’ve created your OCR PDF file.

Searching Through Multiple OCR PDFs

On Ubuntu, for example, you can use the pdfgrep command in the terminal to search a folder full of digital or digitized PDFs.

To do this, let’s first install pdfgrep:

# first let's install pdfgrep
sudo apt-get update
sudo apt-get install pdfgrep

Now we can use pdfgrep to search for any information using a keyword. We can do simple searches with an exact match or use a regex for more flexibility. Let’s look at some examples:

Below, we want to look for year-specific information using the keyword “Year.”

pdfgrep -r "Year"

./hocr_output.pdf: APPLICANTS   ForPublication Year2015-2016

We can also search for a specific range of years, say from 2010 to 2019, using a simple regex.

pdfgrep -r -P "\b201[0-9]\b"

./hocr_output.pdf: APPLICANTS   ForPublication Year2015-2016
./hocr_output.pdf:andsubmittotheVarsitarianofficeonorbeforeMARCH:27,2015.
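
To see why both lines match, the same pattern can be tried in Python. This sketch assumes that Python's `re` module treats `\b` the same way as the Perl-compatible syntax enabled by `-P`, which holds for plain ASCII text like this. Note that in the first line it is "2016" that matches, not "2015": there is no word boundary between "Year" and "2015" because the OCR output glued them together.

```python
import re

# Same pattern as in the pdfgrep command: a standalone year from 2010 to 2019
YEAR_2010S = re.compile(r"\b201[0-9]\b")

lines = [
    "APPLICANTS   ForPublication Year2015-2016",
    "andsubmittotheVarsitarianofficeonorbeforeMARCH:27,2015.",
]

matches = {line: YEAR_2010S.findall(line) for line in lines}
for line, found in matches.items():
    print(found, "<-", line)
# prints:
# ['2016'] <- APPLICANTS   ForPublication Year2015-2016
# ['2015'] <- andsubmittotheVarsitarianofficeonorbeforeMARCH:27,2015.
```

If you also want to catch years attached to a preceding word, dropping the leading `\b` is one option, at the cost of more false positives.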

From the above examples, you can see how easy it is to search a folder of OCR-readable PDFs using only a few commands.

Why Use OCRmyPDF?

OCRmyPDF is a tool that adds text layers to scanned image PDFs, making them searchable. It also optimizes PDFs by compressing them without loss of quality—optimizations are applied only after successful OCR processing. With options for rotation correction, batch processing, and selective OCR, it’s a versatile solution for generating high-quality OCR-processed PDF files.

However, OCRmyPDF relies on the Tesseract OCR engine by default, which can limit accuracy: Tesseract is not as accurate as a state-of-the-art OCR solution (you can test OCR accuracy with our benchmark tool), and poor-quality scans can produce poor-quality OCR. This is the main limitation this article works around, and it is why we use docTR in place of OCRmyPDF's default OCR engine to produce better OCR PDF results.
