Artificial intelligence (AI) plays an increasingly significant role in the development of business technologies. ABBYY's innovative AI-based technologies and solutions enable people to put their information to work and are defining the next phase of digital transformation. But what does AI have to do with ABBYY FineReader PDF, the product that this blog is devoted to? How important is AI for the quality of OCR, and how does it enable FineReader PDF to be an efficient and versatile PDF tool for all kinds of documents: digital and scanned PDFs, scans, and papers?
Join us in this Q&A for insights on AI and OCR from Ivan Zagaynov, Head of Computer Vision at ABBYY.
AI trends in OCR and FineReader
When working with any kind of document image, be it an image file or a scanned PDF, the technology used is optical character recognition (OCR). OCR is not a new technology, yet it is constantly developing. What are the latest trends?
Ivan Zagaynov: That’s right; OCR technology is a field that continues to develop, and all modern approaches, including the latest developments in artificial intelligence (AI), are put to use. At ABBYY, we constantly refine and streamline our approach to OCR innovation, adding new features and supporting ever more complex scenarios.
AI, actually, is not a new thing in OCR.
Over two decades ago, we at ABBYY introduced machine learning techniques in our technologies to solve various tasks in the OCR process and enable intelligent document processing. Since then, these techniques have matured and become robust tools that are not only used to solve specific tasks but are also able to select the optimal general approach and architecture for that task.
For example, a new approach that has recently been driving our OCR development is to fully automate the recognition process using AI technologies, with little to no interaction from the end user. We no longer rely on the correct orientation of the image, on the resolution stated in the file metadata, or on the indicated languages, and we no longer require this information as manual input for successful recognition. This is especially important for unattended processing of large volumes of documents or for on-demand processing provided as a service, which is why this approach is already being implemented in solutions such as ABBYY Vantage.
Handwriting OCR (i.e., recognition of cursive writing) is another rapidly developing technology and has recently been added to ABBYY’s OCR engine. Various documents like contracts, invoices, and receipts may contain handwritten fields, and for tasks requiring extraction of data and information from documents, it’s important to be able to recognize them. For handwriting recognition, we now use specifically designed deep learning models based on the Transformer architecture, which is well known in the NLP world. Incidentally, the very same architecture blocks are the basis for the well-known ChatGPT models.
We still have some work ahead of us in this area to reach the high standard of accuracy and broad range of language support known in ABBYY’s solutions and we look forward to making progress with these new developments. Even so, we have already integrated handwriting recognition into some ABBYY solutions, and as the technology matures and use cases evolve, we will continue to innovate for our customers.
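At the heart of the Transformer blocks mentioned above is scaled dot-product attention. Here is a dependency-free toy of that single operation, just to show the mechanism; real handwriting models stack many such blocks with learned weights, and this sketch is purely illustrative, not ABBYY code.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        # Similarity of the query to every key, scaled by sqrt(dimension).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output is the weights-blended combination of the value vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, values))
                    for i in range(len(values[0]))])
    return out

# One query attending over two key/value pairs: the query matches the
# first key, so the output leans toward the first value vector.
q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[10.0, 0.0], [0.0, 10.0]]
print(attention(q, k, v))
```

Because the weights always sum to one, the output stays a convex blend of the values; stacking such blocks with learned projections is what gives Transformers their modeling power.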
What about more traditional OCR, the processing and recognition of printed text?
I.Z.: To further improve accuracy, we have implemented a family of language models inside the recognition process in the OCR engine. The models estimate the probability of a word “token” occurring in the context of other words (“tokens”) to resolve ambiguities between similar OCR results. These models were pre-trained on large amounts of text and retain knowledge about the contexts of specific natural languages. They share the same Transformer block architecture found in ChatGPT and other recently published large language models (LLMs), which demonstrate outstanding generative abilities to produce text output for user queries. But results from publicly available models always require some sort of manual review, as we cannot rely on them completely.
So, in our case, we treat the preliminary result of our OCR process as a “query” that is then fed as input to one of our language models to “fine-tune” the results. However, we intentionally and strongly restrict our model’s generative abilities so that the overall results remain consistent with what was originally on the page (something that ChatGPT-like models cannot guarantee!). This restriction also helps us significantly reduce the model size and the computational effort required for inference.
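As a rough illustration of this constrained “fine-tuning,” here is a hypothetical sketch with a tiny bigram model standing in for the real Transformer-based one: the rescorer may only choose among OCR-proposed candidates and never generates new text.

```python
from collections import Counter
from itertools import product

CORPUS = "the quick brown fox jumps over the lazy dog the dog barks".split()

# A minimal bigram model standing in for the pre-trained language model
# described in the interview (illustrative only).
BIGRAMS = Counter(zip(CORPUS, CORPUS[1:]))
UNIGRAMS = Counter(CORPUS)

def bigram_score(words):
    """Sum of smoothed bigram likelihood proxies for a word sequence."""
    score = 0.0
    for prev, cur in zip(words, words[1:]):
        # Add-one smoothing keeps unseen pairs from zeroing the score.
        score += (BIGRAMS[(prev, cur)] + 1) / (UNIGRAMS[prev] + len(UNIGRAMS))
    return score

def rescore(candidates_per_token):
    """Pick the candidate sequence the language model prefers.

    The key constraint: we choose only among OCR-proposed candidates,
    never generate new tokens -- mirroring the restriction described above.
    """
    best = max(product(*candidates_per_token), key=bigram_score)
    return list(best)

# "lazy" vs. the visually similar misread "1azy"; "dog" vs. "d0g".
print(rescore([["the"], ["lazy", "1azy"], ["dog", "d0g"]]))
# → ['the', 'lazy', 'dog']
```

Because the output is always one of the OCR’s own candidate sequences, the result can never drift away from what was actually on the page, which is the guarantee pure generative models cannot give.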
These new language models are already in use in cloud- and server-based ABBYY solutions.
Thinking about OCR in general now, it is no longer just about robust recognition of ideal document scans, but rather about extracting structured, textual information from any source image. And recently developed AI approaches are extremely valuable for these non-trivial tasks.
FineReader PDF is an AI-powered PDF solution. Can you explain what that means and what kind of AI is in use? There’s a common opinion that AI is able to learn and improve itself over time, as people do; can you address this?
I.Z.: First, let me clarify the definition of AI in this context. We think of it as an area of computer science concerned with designing machines and algorithms that work and react like humans. This is usually called “weak AI,” but don’t let the name mislead you: it is in fact a very strong, capable approach, focused on solving a specific, well-defined problem. Why am I clarifying this? Because sometimes the label “AI” refers to artificial general intelligence, or so-called “strong AI”: human-like intelligence capable of solving any new, previously unseen problem.
Although FineReader PDF doesn’t learn as people do, on a micro level it does gain experience with each processed page, which helps it obtain a better result. When FineReader PDF starts to recognize a page, it has a number of pre-trained neural network models. For example, there are models for character recognition tasks, and there are also mechanisms that calculate statistics to “fine-tune” the recognition process on the fly. Thus, character by character and line by line, FineReader PDF uses the experience obtained earlier while processing the same page. This helps the tool gain speed and improve quality, producing better results for a whole page of natural-language text than for separate lines or a random sequence of characters. Context plays a significant role here and is used in our OCR engine much as a human uses it when reading: people often predict words and verify themselves based on the meaning of the whole sentence. However, this gained experience is not shared between pages or documents. On the other hand, this keeps behavior reproducible at the page level and lets context be used correctly on every page.
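The idea of per-page adaptation can be sketched as follows. This is a toy model with hypothetical class and method names, not ABBYY internals: statistics gathered while reading a page bias later ambiguous decisions on that same page, and a fresh page starts with no carried-over experience.

```python
from collections import Counter

class PageRecognizer:
    """Toy per-page recognizer: experience is local to one page."""

    def __init__(self):
        self.stats = Counter()  # experience gained on this page only

    def recognize_char(self, candidates, confidence):
        """candidates: dict mapping possible characters to raw scores."""
        # Bias raw scores by what we've already seen on this page.
        best = max(candidates, key=lambda c: candidates[c] + 0.05 * self.stats[c])
        if confidence > 0.9:  # confident results become "experience"
            self.stats[best] += 1
        return best

doc = PageRecognizer()
# Early on the page, clearly printed digits teach the recognizer that
# this is numeric text...
for _ in range(5):
    doc.recognize_char({"0": 0.95, "O": 0.2}, confidence=0.95)
# ...so a later, genuinely ambiguous glyph is resolved toward "0".
print(doc.recognize_char({"O": 0.52, "0": 0.5}, confidence=0.5))  # → 0

# A new page object carries no experience, so the same ambiguous glyph
# is resolved purely on its raw scores.
print(PageRecognizer().recognize_char({"O": 0.52, "0": 0.5}, confidence=0.5))  # → O
```

Resetting the statistics between pages is what keeps the behavior reproducible at the page level, as described above.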
Reading languages with neural networks
What are some improvements made in FineReader PDF that couldn’t happen without using AI technologies?
I.Z.: Without AI technologies, we could not have improved the speed and quality of recognizing complex scripts like Chinese, Japanese, and Korean as significantly as we did. We achieved this with the help of deep convolutional neural network architectures for character recognition tasks that we designed ourselves. The approaches used before that, including traditional machine learning, produced unstable results in difficult cases and were generally slower.
The newest AI algorithms enable greater accuracy of languages based on a complex script, such as Chinese, Japanese, and Korean. Is there a different AI approach for Latin- and Cyrillic-based languages?
I.Z.: Neural networks help to recognize these languages as well. Script complexity simply dictates the correct choice of model architecture that allows us to reach the optimum balance between OCR speed and quality. Therefore, we use different models for Latin, Cyrillic, Arabic, and other script types. For example, for Arabic script we use an end-to-end approach for word recognition without character separation; a special architecture combining convolutional and recurrent neural networks solves that task. For Latin and Cyrillic, we use a mixed approach and switch between different recognition models based on the visual quality of the text. This has helped us significantly improve both speed and accuracy.
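The routing logic described here can be sketched as a simple dispatcher. The model names and the quality threshold below are placeholders for illustration, not ABBYY’s actual components.

```python
def pick_model(script: str, quality: float) -> str:
    """Choose a recognition strategy from script type and visual quality.

    Hypothetical model names; the structure mirrors the mixed approach
    described in the interview.
    """
    if script == "arabic":
        # End-to-end word recognition: no character segmentation step.
        return "conv-recurrent-word-model"
    if script in ("chinese", "japanese", "korean"):
        return "deep-conv-character-model"
    if script in ("latin", "cyrillic"):
        # Mixed approach: switch models on the line's visual quality.
        return "fast-character-model" if quality > 0.8 else "robust-character-model"
    return "generic-model"

print(pick_model("arabic", 0.9))   # → conv-recurrent-word-model
print(pick_model("latin", 0.95))   # → fast-character-model
print(pick_model("latin", 0.4))    # → robust-character-model
```

The design point is that the expensive, robust model is only invoked when the cheap, fast one is likely to struggle, which is how both speed and accuracy can improve at once.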
Not only for “pure” recognition
How else are AI techniques used in PDF processing? Are there other AI applications used “outside” of OCR in FineReader PDF?
I.Z.: Of course, we apply AI techniques at almost every stage of our document processing pipeline. Analyzing document structure or preprocessing images usually involves a large number of object detection and classification tasks. Many of these tasks are solved with deep learning approaches as well as with traditional machine learning. When detecting a table or barcode, or finding a header or footer on a page, pre-trained models are used to locate the relevant objects on document images.
For example, the accuracy of table detection has improved significantly since the previous version, FineReader PDF 15, thanks to a combination of a convolutional neural network and a linear partition graph for document analysis. The neural network estimates, for every pixel on the page, the probability of belonging to each type of element: text areas, tables, pictures, etc. The graph then accurately finds the elements themselves, in their entirety, based on the information obtained from the neural network. The resulting improvements are especially noticeable in difficult cases: documents with complex multi-column structures, background images, tables with no separators, and so on.
Thus, AI techniques enable a whole range of FineReader PDF functions beyond document conversion itself: copying tables, searching for keywords in a scanned PDF, and comparing two copies of a document to find differences, to name just a few.
Solving document tasks well needs intelligence
Is there any task that FineReader solves, as a PDF and document conversion solution, that couldn’t be solved successfully without AI?
I.Z.: The definition of AI itself implies that “it is an algorithm that works and reacts like humans.” No matter what is inside it, whether a heuristic solution or neural networks, it performs just like a human being. So, if a task is solved successfully, meeting a high level of customer expectations, then we should say there is at least “weak AI” behind it. There are some very good examples that our solution delivers: we can make any PDF document searchable on the fly, convert it to any other format like DOCX, XLSX, etc., and even edit scanned PDFs at the paragraph level. Moreover, we can create a PDF from any scan or digital photo while preserving the original document layout and style formatting, like columns, fonts, and text colors. Is all that possible without AI?
To sum up, can you tell our readers why AI technologies are important for software development that aims to streamline and automate various document-related tasks?
I.Z.: I would say that the incorporation of the latest technologies in our products allows us to deliver the best possible solutions to our customers, and since it is our mission to push innovation forward to new levels of automation, the importance of AI technologies must not be underestimated. It is critical to be on the cutting edge!
Try it yourself: Test AI-powered ABBYY FineReader PDF in your daily work with PDF and scanned documents and experience the effectiveness and convenience it brings.
Editorial note: This interview was originally published on December 5, 2019, and was updated on August 14, 2023, to expand on the latest AI-related developments in ABBYY OCR technologies.