Mathieu Orhan
13.06.2024
12 min read
Building High Performance Document OCR Systems
Follow along as Mathieu teaches you how to build high performance document OCR systems.
Introduction
Document Understanding for scanned documents might appear to be a more or less solved task. Yet off-the-shelf models are not always sufficient for building high-performance applications in specific domains. The performance of APIs such as Google’s Cloud Vision is great on average, but for high-stakes applications it’s likely not enough. For instance, the OCR might consistently miss certain patterns or fail to read some words, and its performance degrades with non-Latin scripts, the use of jargon, or the presence of unique watermarks. At the same time, academia has not focussed on Document OCR for the last few years.
In this blog post, we share some of our learnings from building a high-performance system that can understand pictures of scanned documents, and introduce approaches for building OCR systems with real and synthetic data engines.
Text Detection
How does Text Detection work?
In Text Detection, we wish to locate isolated words (or sometimes characters), for instance with a bounding box (a rectangle around each word). Specialised models are used to encode inductive biases, and they are all very data hungry. Annotating documents word by word is a very time consuming task: a single document might require hundreds of bounding boxes. This is why the available pretrained models for documents are trained on synthetic datasets: there are simply no very large real datasets for text detection.
One of the most widely used synthetic datasets is SynthText, which has been used to train models such as CRAFT. SynthText has a few biases and limitations: it has a small English corpus, and it tries to emulate 3D scenes containing few words. On scanned documents, we might see words from different lines being grouped together, and visual artefacts such as tables or lines might confuse models in various ways. Non-Latin characters are very likely to be problematic, too.
Training your own model
To go beyond off-the-shelf model performance, we will need to train models with domain-specific data - but manually annotating data is really expensive. Instead, we can build a data engine.
Generating synthetic documents from scratch is generally not the right choice unless your layout is simple and consistent, because Text Detection requires a lot of data. Using any kind of generative model is not a good idea either: for starters, these models are terrible at rendering text.
The most efficient approach is to select an open source baseline that already performs well and can be efficiently fine-tuned with real data. This excludes, in particular, CRAFT, which requires character-level annotations.
The annotation process can be made highly efficient by pre-populating annotations with model predictions (pseudo-labels) and focussing only on corrections: missing words, cropped words, grouped words, and other issues. The corrections are used to train a new model, which in turn generates new predictions to be corrected. This iterative process is the key to achieving very high performance and correcting the initial biases of the pre-trained model. Having very clear annotation guidelines is crucial, and words should be split as consistently as possible to avoid issues in the recognition step.
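The loop described above can be sketched as follows. This is a minimal sketch, not a real training pipeline: `detect_words`, `correct`, and `train` stand in for your detector, your annotation tool, and your training code.

```python
def detect_words(model, image):
    # Hypothetical: run the current detector and return predicted boxes.
    return model(image)

def annotation_round(model, images, correct, train):
    """One iteration: pre-populate labels, fix them, retrain.

    `correct` is where a human fixes missing, cropped, or grouped words;
    `train` fits the next (less biased) model on the corrected set.
    """
    corrected = []
    for image in images:
        boxes = detect_words(model, image)  # pseudo-labels
        boxes = correct(image, boxes)       # human corrections only
        corrected.append((image, boxes))
    return train(model, corrected)
```

Each round's output model becomes the input to the next, so correction effort shrinks as the model improves.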
If you have access to a large quantity of documents, this approach can be improved with active learning, combining diversity sampling and uncertainty sampling strategies. For instance, one can ensemble multiple models to identify samples with high prediction variance. Designing the right strategies for active learning is often problem-specific: one might want to oversample certain formats or languages, or exploit the fact that a certain number of predictions is expected per document.
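One cheap way to implement the ensemble-variance idea is to score each sample by how much the ensemble disagrees on it. The sketch below (an assumption about how one might set this up, not a prescribed method) uses the number of detected words per document as the quantity whose variance is measured:

```python
from statistics import pvariance

def uncertainty_scores(ensemble_predictions):
    """ensemble_predictions[m][i] = word count predicted by model m on sample i.

    High variance across the ensemble flags samples worth annotating;
    word count is a crude proxy, box-level IoU disagreement works too.
    """
    n_samples = len(ensemble_predictions[0])
    return [pvariance([preds[i] for preds in ensemble_predictions])
            for i in range(n_samples)]

def select_for_annotation(ensemble_predictions, k):
    """Return the indices of the k highest-variance samples."""
    scores = uncertainty_scores(ensemble_predictions)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]
```

For example, if three models predict 50, 80, and 20 words on the same page, that page is a far better candidate for annotation than one where all three agree.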
Text Recognition
How does Text Recognition work?
In Text Recognition, we build on top of Text Detection. The detected words are cropped from the original image, and the task is to turn these small patches into actual text. This task is much harder than the previous one! In this task, we generally produce a sequence of characters (from our vocabulary, the set of all characters we can read) and train models using the CTC loss. Examples of models include CRNN or SAR.
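At inference time, a CTC-trained model emits one prediction per image frame, and the frame sequence is collapsed into text. The greedy-decoding sketch below illustrates the collapsing rule (merge repeats, then drop blanks); the vocabulary and blank index are illustrative assumptions, and real models output per-frame probability distributions from which these ids would be taken by argmax:

```python
def ctc_greedy_decode(frame_ids, blank=0, vocab="-abcdefghijklmnopqrstuvwxyz"):
    """Collapse a per-frame argmax sequence into text, CTC-style.

    Rule: merge consecutive repeated ids, then drop blanks. The blank
    (index 0 here) lets the model separate genuine double letters.
    """
    out = []
    prev = None
    for i in frame_ids:
        if i != prev and i != blank:
            out.append(vocab[i])
        prev = i
    return "".join(out)
```

For instance, the frame sequence c, c, blank, a, a, a, blank, blank, t, t decodes to "cat", while a genuine double letter needs a blank between the repeats to survive the merge.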
Building Restricted Models with Data Generation
In this task, a useful trick is to build a specific model for each type of field to be recognised. Recognition models are generally lightweight, as they need to run on many small crops. If a box is supposed to contain an English word, a pre-trained model might work well. But if you already know a box contains, say, a phone number, performance can be greatly improved by training a model specialised in phone numbers, with a reduced vocabulary.
For simple fields, building a data generator is an extremely effective strategy. The first step is to build a (simpler) text generator that covers all desired variations of the pattern and produces the desired normalised output (our label).
For instance, you might want to map all of the following to the label 2124567890:
2124567890
212-456-7890
(212)456-7890
(212)-456-7890
212.456.7890
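A generator for this pattern can be only a few lines. The sketch below produces the formatted variants listed above and the digits-only label; the format list is illustrative and would grow with whatever variations appear in your data:

```python
import random
import re

def normalise(raw):
    """Label = digits only, e.g. '(212)-456-7890' -> '2124567890'."""
    return re.sub(r"\D", "", raw)

def phone_variant(digits, rng=random):
    """Render a 10-digit number in one of several common formats."""
    a, b, c = digits[:3], digits[3:6], digits[6:]
    formats = [
        "{a}{b}{c}",
        "{a}-{b}-{c}",
        "({a}){b}-{c}",
        "({a})-{b}-{c}",
        "{a}.{b}.{c}",
    ]
    return rng.choice(formats).format(a=a, b=b, c=c)
```

Whatever format `phone_variant` picks, `normalise` recovers the same label, so every rendered image comes with a perfect ground truth for free.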
To build the corresponding images, Pillow or OpenCV can be used to generate a canvas, write text, and add artefacts, while Albumentations and other libraries can perturb the images. You can compute the size, in pixels, of the generated text, which lets you crop the generated image precisely and realistically, and tightly control the generation process. You want to generate a wide variety of challenging images that are as realistic as possible: make sure to use a large and relevant pool of fonts and backgrounds, add blur and noise, and blend the text naturally.
Again, using any kind of generative model is probably a terrible idea here: it doesn’t give you perfect control over the generated text, labels, and images.
The best part about building a data generator is that you can iterate on your model without collecting data, by identifying error modes and replicating their patterns.
Improving Models with Active Learning
Adding real data to the training dataset with active learning is an excellent way to fight data drift and continuously improve model performance as production data arrives. Using the output of the text detection step, with additional augmentations, can help deal with the unique spacing and token widths of your data distribution.
The general idea is to identify low-confidence or likely wrong samples and correct their predictions. In many problems, specific patterns are expected for some fields or for the whole document. For instance, you might expect to detect two phone numbers, and the two should be identical; if they aren’t, one of them is wrong. A phone number also has an expected length, and some fields carry a checksum. Finally, model uncertainty can be estimated and used to select the most interesting data samples.
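The phone-number example can be turned into a concrete validation heuristic. The sketch below checks length and cross-field agreement; the field names and the ten-digit assumption are illustrative, and a real document type would add its own checks (checksums, known prefixes, and so on):

```python
def validate_phone_pair(phone_a, phone_b):
    """Cross-field checks on two phone-number readings.

    Both should be exactly 10 digits and identical to each other.
    Returns a list of problems; an empty list means the pair passes
    and needs no human review.
    """
    problems = []
    for name, value in (("first", phone_a), ("second", phone_b)):
        if len(value) != 10 or not value.isdigit():
            problems.append(f"{name} number has unexpected format: {value!r}")
    if phone_a != phone_b:
        problems.append("the two numbers disagree; at least one is wrong")
    return problems
```

Documents that fail such checks are exactly the samples worth routing to annotation, closing the active-learning loop.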
Conclusion
Building excellent OCR systems and continuously improving their performance mostly requires a good understanding of the data. This understanding enables you to generate data, break general problems down into simpler ones, and build heuristics to validate the results.