This blog post is a long read. If you prefer watching or listening, check out this recording of ABBYY’s session during PDF Association PDFest 2020, which covers this topic.
The topic of this blog post is how optical character recognition (OCR) technology facilitates digital transformation for (Portable Document Format) PDF documents. This may sound a bit silly at first: PDFs are already digital, one would say, what is there to transform? OCR has been around for quite a long time, so you may be aware how OCR is used to digitize paper documents. Yet, PDFs are not that simple, OCR can do much more than just digitize old papers, it enables you to transform your work with many PDFs in a truly digital way.
Why OCR for a PDF?
Why do we need to “digitize” PDFs with OCR? They are already inside the computer, in a digital format, aren’t they?
While most PDFs are perfectly well suited to display nicely on a screen, things get much more difficult once you need to analyze, modify, and reuse information stored in a PDF effectively. To do so, we first need to understand PDF document structure.
Let’s have a look at the different types of PDFs:
- In scanned or “image-only” PDFs, we have only a picture. There is no text inside to work with and we don’t know anything about the structure within the PDF itself.
- In searchable and so-called “digital-born” PDFs, we usually have usable text in the document within the file, but still—the document structure is not known.
PDFs do not contain information about the document structure. This means from a PDF itself, we don’t know what parts are text, images, lines or other elements. We don’t know what are the roles of each of those elements and how they are related to each other. This is what OCR can help to identify.
That said, some types of PDFs, such as PDF/UA, may contain additional information that describes the structure to some extent. However, those are just a small number of all PDF formats, and it still would not be enough to analyze and modify PDFs effectively, which is necessary for many operations we want to do with PDFs.
PDF Features Powered by OCR
What kind of operations can OCR enable? Going beyond working with a PDF document as a whole or a set of pages, OCR allows you to work with the content of a PDF document. Such operations include text editing, full text search, redaction, extracting tables, comparing documents and so on. For some of them it is quite clear why and how they depend on OCR, such as operations with scanned PDFs, but for the others, such as the operations with digital PDFs, it isn’t that obvious. Later, we will take a closer look at three use cases.
Before that, let’s have a simplified overview of how OCR works to clarify what kind of useful information about a document it can provide.
What does OCR do with a document?
There are three major steps in an OCR process.
First, the document pages are analyzed using a system called Document Analysis (DA). DA almost literally “looks” at each page and analyzes the image starting from its general appearance and goes further down to detect the smallest parts of the image that could be separate words and characters. This is how DA finds which parts of the page contain texts, which are pictures or tables, and identifies if there are background pictures on the page. DA detects barcodes as well. In addition, DA analyzes the tables to find out which parts of the table image are separators, which are the cells, and what is inside each cell.
The second step is to recognize each bit of text found by DA. This is done by the OCR system itself. OCR “reads” images of each character or character combination, giving us the digital text to work with further in the form of computer codes.
In the third step, the Synthesis system comes into play.
By now, we have information about where the texts, pictures and tables are in a page, where the table cells and separators are, and which character sequences exist in every text section and table cell. We also know (from the DA process) details such as how the image separates into lines and words, and where this happens on the page.
We have all the pieces, but they are not assembled yet. This is a task for the Synthesis system. It basically assembles the whole document in a digital form using all of the pieces identified in the previous steps. Synthesis analyzes the parameters and order of the pieces, looks for specific patterns and similarities on a page and between the pages, and does many other things to recreate the document structure. At this stage, we finally get information about the content in a PDF such as:
- Functions of the text pieces: a paragraph, a heading, a header/footer, a list, etc.,
- Where to place all these pieces in the digital document,
- Format of each paragraph, or even parts of a paragraph: character and line spacing, indentations, alignment, and so on.
Now, when we finally have a digital representation of a PDF document structure, we know enough to analyze, compare, modify or extract the content effectively.
Three PDF Tasks Transformed by OCR
Now that we have an overview of how OCR can identify structure, let’s look at how this knowledge enables solving some specific PDF-related tasks.
1. PDF paragraph-level editing
What's the deal?
Well, what’s the deal exactly with editing a digital PDF? An average PDF user probably expects that text and everything else in digital PDF documents should be accessible. Meaning it should not be any problem to edit them.
We wish that were true.
The reality of PDF is, unfortunately, different:
- The PDF format was not created with editing functions in mind. Rather the opposite; PDFs are not editable by design.
- As mentioned earlier, PDFs contain only information about separate characters – there is no information about words, lines, paragraphs, or other document elements. There is no information about the document structure that dictates layout when adding or deleting some text.
- Simply adding or deleting characters could be possible, but it wouldn’t change how the surrounding structurally connected characters are displayed. For example, look at the picture below. This is how it would look if we were simply inserting or deleting characters in a PDF.
To sum it all up, the deal is that the missing information in a PDF about how to rearrange other elements when adding or deleting information is crucial to keep a paragraph of text consistent when editing it.
OCR to the rescue
How to edit a paragraph in a digital PDF with the help of OCR is quite simple. The text is taken from the PDF as it exists. OCR detects the markup, which we need to know and follow in order to edit the whole paragraph properly.
The OCR involvement in PDF paragraph-level editing processes looks like this:
Once a user initiates editing, the DA (Document Analysis) processes the raster image of the page and finds the elements such as texts and pictures.
Then, the Synthesis system creates a temporary copy of the page with all necessary markup added.
Digital text taken from the PDF itself aligns with the detected structure. This combination allows the page to be edited by the user.
Then the user edits, and since the program already knows and can now follow the structure of paragraphs, the text edits are made smoothly. This allows transitions from line to line; uniformity in line and character spacing; the font is selected automatically; paragraph borders can expand or shrink according to the edits, and so on. All of the edits are displayed to the user in real time.
Once a user is done editing, only the part that has been changed is updated in the PDF. Despite using OCR, the editing doesn’t require creation of the resulting document as a copy of the original one made by the conversion process. As the edits are done in the original document itself, everything that has not been edited is preserved as is.
2. Extracting PDF Tables
Can we get what we can see?
Utilizing OCR also helps to effectively reuse tables by extracting them directly from PDFs. Let’s take a closer look.
Tables look great in digital PDFs. We can easily select text in an individual cell and copy it elsewhere. If we can copy one cell, we should be able to extract the information we see quickly and easily by selecting the whole table and copying it, right?
Why not just copy?
Unfortunately, not. Selecting the whole table in a digital PDF, as we do with text, will not retain the table structure, so you will have to recreate it from scratch.
Why does it happen this way? In a digital PDF, a set of objects, such as lines and rectangles, define the visual appearance of a table. These separate objects are completely on their own. The objects are defined in terms of how they shall look, but are independent to other objects and are not associated with the content in the table. That means it is not known which table cell each piece of text belongs to, or whether it belongs to the table at all.
Since lines and rectangles are already in the PDF, why not simply copy and paste? If we did that, in the best case, we could probably paste what we copied only into another PDF. Even so, the success would be questionable. Moreover, we could not paste what we copied into any other application such as Microsoft Word or Excel, for example.
If we can't "read" about it in PDF, we can't "see" it!
Nevertheless, there is a solution. If we can’t, so to say, “read” about a table from a PDF, since tables aren’t described even in digital-born PDFs, let’s “see” it! OCR can describe and recreate the structure of a table based on its image.
With the help of OCR, a table can be extracted in its rasterized entirety. This is how it works:
First, the selected area with a table is turned into an image.
Next, Document Analysis (DA) analyzes the image to find the table elements such as cells, separators, backgrounds, text and picture elements in the cells.
Then, we need the text itself. If the PDF text is legible, it is taken from the PDF. If not, the text can be identified with OCR.
Finally, the Synthesis system “assembles” the detected structure and the content into a properly marked up piece. This is placed into the clipboard, and the user can paste it as a whole table into another application such as Excel, Word, an email, and so on.
3. Comparing "Digitally-born" PDF Documents
We have all the characters, so what else do we need?
The third case for today is when OCR helps to compare digital PDFs more accurately.
When it comes to comparing digital PDFs, once again, you may ask, “we have all the characters inside the PDF, what else do we need to have the PDFs compared successfully?”
Let’s clarify this.
In general, when comparing two copies of a document in any format, not only PDF, one of the goals is to minimize false differences. The other is not to miss any of the actual differences between the two texts.
Three main causes for false differences can be due to:
The first cause for false differences between two copies of a document occurs when the same text is formatted in a different way or placed somewhat differently within the page, but the overall order of how the text appears in the document has not changed. The second cause occurs when a header/footer or an insert breaks the main text in different places. Both situations may be a result of edits made in the text of one of the copy, or changes made to the layout, such as setting the margins differently, for example. For both situations, the solution to avoid false differences would be, as you probably already guessed, to have and use information about the document structure – that is, to apply OCR to recreate it. However, there is also the third cause for possible false differences, and that is OCR accuracy. Which, generally speaking, is not 100% accurate. A kind of contradiction here, and we will return to this a bit later.
Considering structure or not – What’s the difference?
In the screenshot above, there is only one real difference in the document. The copy on the left contains a footer, while the one on the right does not. The rest of the text is the same, although it is distributed between the pages differently, as you can see (look at the beginning of the Section 4 in both copies).
In the screenshot below, this is the same pair of the documents, compared without consideration of the document structure:
As you may see, although just one real difference exists, as many as four extra false differences are identified. False differences are unnecessary and unwanted points of attention, which waste your time and energy leading to decreased levels of attention and overall productivity. That’s why we want to minimize them.
In summary, the problem is that if we simply take the text from a digital PDF and use it for comparison, we risk creating many false differences because we did not – and could not – take into consideration document structure. This includes understanding how the text may flow from page to page differently, may be broken down by headers and footers that do not belong to the main text, and so on.
Another potential problem with just taking the text from a PDF is that the text layer in a PDF is not always accurate or usable.
Is any solution good enough?
To accurately compare digital PDFs, we have clarified the benefits of OCR to understand document structure, the challenges presented due to possible false differences identified by OCR, and yet to address problems with the text layer quality, it is better to use OCR… So, is there any good solution that enables you to tackle all of these challenges?
Yes, there sure is. We shall use OCR intelligently. The solution minimizes utilization of character recognition by taking as much of the text from the PDF as possible in digital, while using enough information about the document structure to correctly identify what exactly, and in which order, it shall be compared.
Here is how comparison works in the case of digital PDFs:
First, they are prepared with rasterization and pre-recognition.
Next, DA and Synthesis systems define the document structure, finding elements such as paragraphs, lists, inserts, headers and footers, page numbering, and so on.
Then, the quality of the text layer is assessed. That is the reason, by the way, why preliminary recognition is required on the first stage of preparation. Valid parts of the text are taken for comparison directly from the PDF. The bad parts are recognized.
Then, after paragraphs, headers, and inserts from the two copies are properly aligned, the comparison itself happens. Even if they do not align by pixel-match or have shifted in the two copies, they can now be compared properly.
These are just three examples of operations with PDFs that benefit from, or even depend on, OCR technology. There are many more such operations. Even excluding purely document conversion operations, about one third of the feature list for ABBYY FineReader PDF 15 consists of such features, and the importance of a high quality OCR in software to work with PDFs should not be underestimated.
Try all features and capabilities of FineReader PDF 15 for yourself!