How to Compare OCR Tools: Tesseract OCR vs Amazon Textract vs Azure OCR vs Google OCR
Optical Character Recognition (OCR) tools are software that detect and extract text from images. They are used in the early steps of scanned-document analysis to recognize and automatically process the information that the documents contain. Depending on the complexity of the documents to be analyzed, OCR tools can be used both to detect and to extract the text or, in some pipelines, only to extract the text from previously identified regions of interest, e.g. paragraphs, tables and titles. The latter case is the one I am particularly interested in, so this article focuses on the extraction task.
Several OCR tools are available today: open source ones, such as Tesseract OCR, and commercial ones, like Azure OCR, Amazon Textract and Google OCR. How do we decide which is better for our use case? Some comparisons have been made in the past but, as far as I know, no in-depth and up-to-date comparison is available today. In most of the comparisons that I analyzed, tests were done on only a few examples and were more qualitative than quantitative. In this article I describe an initial attempt at comparing Tesseract OCR, Amazon Textract, Azure OCR and Google OCR using quantitative measures on a public dataset; more results and analysis will come in the future.
I decided to make this comparison because I think that nowadays, with many commercial, very easy-to-use, black-box models available, it is increasingly important to estimate their effectiveness for a specific problem. Estimating the effectiveness of a model matters regardless of whether it is provided by a major vendor or by the research community, built from “scratch” or derived from an existing model. This article should be seen only as an idea of how we could compare these tools, and as a reminder that their performance can vary considerably depending on the dataset.
The dataset that I used is not very big and is not sufficiently diverse in the types of images it contains, but the results presented here at least highlight that the choice of one OCR tool over another should be made with some care.
Warning! The results presented in this work are not to be considered definitive or complete, but only an example of how different OCR tools could be compared in a quantitative way to support the choice of which one to use.
Dataset and Metrics
The dataset used for the comparison is based on the FUNSD dataset:
Jaume, Guillaume, Hazim Kemal Ekenel, and Jean-Philippe Thiran. “FUNSD: A Dataset for Form Understanding in Noisy Scanned Documents.” 2019 International Conference on Document Analysis and Recognition Workshops (ICDARW). Vol. 2. IEEE, 2019. https://guillaumejaume.github.io/FUNSD/
FUNSD is a dataset created to evaluate the task of form understanding. It was built from RVL-CDIP, a dataset for the evaluation of document classification models that contains grayscale images of scanned pages. To build FUNSD, 199 images belonging to the Form category of RVL-CDIP were selected and manually labeled with different information, including the location of the forms, the location of the fields contained in the forms and their text. FUNSD is a challenging dataset containing a wide range of image types. The images are generally of low quality, with different types of noise added by successive scanning and printing procedures. FUNSD is originally provided in two sets, the training set and the test set. In the presented experiments, the OCR tools are compared on a new dataset, derived from FUNSD, that we will call the “OCR Evaluation Dataset” in this article. It is worth mentioning that the paper that introduced FUNSD also presented a comparison between different OCR tools, but it is not up to date and it looks at different aspects.
The OCR Evaluation Dataset is made up of the subimages representing the fields of the forms contained in the original FUNSD dataset. The texts extracted by the OCR tools will be compared with the correct texts of the fields, taken from the original FUNSD annotations (Figure 1).
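As a rough illustration of how the fields could be read from the annotations, here is a minimal sketch. It assumes the FUNSD JSON layout, where each annotation file has a "form" list whose entries carry a "box" ([x0, y0, x1, y1]) and a "text". The function name iter_fields and the choice to skip fields with empty text are assumptions made for this example; this is not the pipeline used to build the OCR Evaluation Dataset.

```python
import json


def iter_fields(annotation_json: str):
    """Yield (box, text) pairs for every non-empty field of one annotated form.

    Assumes the FUNSD layout: {"form": [{"box": [x0, y0, x1, y1], "text": ...}, ...]}.
    """
    data = json.loads(annotation_json)
    for entity in data.get("form", []):
        text = entity.get("text", "").strip()
        if not text:
            # Assumption for this sketch: fields without text are skipped.
            continue
        x0, y0, x1, y1 = entity["box"]
        yield (x0, y0, x1, y1), text


# Tiny hand-written annotation in the assumed layout (not taken from FUNSD).
sample = json.dumps({
    "form": [
        {"box": [84, 109, 136, 119], "text": "COMPOUND", "label": "question",
         "words": [], "linking": [], "id": 0},
        {"box": [10, 10, 20, 20], "text": "", "label": "other",
         "words": [], "linking": [], "id": 1},
    ]
})

fields = list(iter_fields(sample))
for box, text in fields:
    print(box, text)
```

Each box could then be used to crop the corresponding subimage from the page, for example with an imaging library such as Pillow.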
Table 1 reports some statistics of the OCR Evaluation Dataset that I built and that will be used for the comparison.
The training (test) set contains only fields extracted from the documents contained in the training (test) set of the FUNSD dataset. The OCR tools will be compared only on the test set.
To measure the similarity between the extracted text and the ground truth text contained in the annotations, the following measures will be used:
The OCR tools will be compared on the mean accuracy and the mean similarity computed over all the examples of the test set. I decided to also use the similarity measure to account for minor errors produced by the OCR tools and because the original annotations of the FUNSD dataset contain some minor annotation errors (Figure 2).
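To make the two aggregated scores concrete, here is a minimal sketch. It assumes that accuracy is an exact string match (1 or 0 per example) and that similarity is one minus the Levenshtein distance normalized by the length of the longer string. Both definitions, and the function names, are assumptions made for illustration and may differ from the measures actually used in the experiments.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def accuracy(predicted: str, truth: str) -> float:
    """Assumed definition: 1.0 on an exact match, 0.0 otherwise."""
    return 1.0 if predicted == truth else 0.0


def similarity(predicted: str, truth: str) -> float:
    """Assumed definition: 1 - normalized edit distance, in [0, 1]."""
    longest = max(len(predicted), len(truth))
    if longest == 0:
        return 1.0  # two empty strings are identical
    return 1.0 - levenshtein(predicted, truth) / longest


def mean_scores(pairs):
    """Return (mean accuracy, mean similarity) over (predicted, truth) pairs."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one example is required")
    mean_accuracy = sum(accuracy(p, t) for p, t in pairs) / len(pairs)
    mean_similarity = sum(similarity(p, t) for p, t in pairs) / len(pairs)
    return mean_accuracy, mean_similarity


print(mean_scores([("Date", "Date"), ("Nane", "Name")]))
```

With a definition like this, a single wrong character lowers the similarity only slightly while the accuracy of that example drops to zero, which is why the two measures complement each other.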
Tesseract OCR is freely available from the project page: https://tesseract-ocr.github.io/tessdoc/Home.html. I will use the Pytesseract Python wrapper to interact with Tesseract. Pytesseract can be installed with pip, following the instructions at: https://pypi.org/project/pytesseract/. For the experiments I will use Tesseract 5.0.0 alpha for Windows, https://github.com/UB-Mannheim/tesseract/wiki.
Azure OCR is the commercial OCR provided by Microsoft: https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/overview-ocr
Amazon Textract is the commercial OCR provided by Amazon: https://aws.amazon.com/it/textract/
Google OCR is the commercial OCR provided by Google:
The comparison uses the default parameters of the OCR tools; no additional tuning is performed. Azure OCR expects input images of at least 50x50 pixels. For this reason, all the images with a lower resolution are resized to have a minimum side length of 50 pixels, and the resizing is done by padding the original image. Because the aim is to compare the OCR tools without tuning any particular parameter, I applied this resizing only for Azure OCR, where it is required in order to use the tool.
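The padding step can be sketched as follows. This example works on a grayscale image stored as a list of pixel rows so that it stays self-contained. The function name, the white fill value and the choice to add the padding on the right and bottom edges are assumptions made for illustration, not necessarily what was done in the experiments.

```python
def pad_to_min_side(pixels, min_side=50, fill=255):
    """Pad a grayscale image (list of rows) so both sides are at least min_side.

    Assumptions of this sketch: padding is added on the right and bottom
    edges and filled with white (255). Larger images are returned unchanged
    in content.
    """
    height = len(pixels)
    width = len(pixels[0]) if height else 0
    new_width = max(width, min_side)
    new_height = max(height, min_side)
    # Extend every existing row to the new width.
    padded = [row + [fill] * (new_width - width) for row in pixels]
    # Append fully padded rows until the new height is reached.
    padded += [[fill] * new_width for _ in range(new_height - height)]
    return padded


small = [[0, 0, 0], [0, 0, 0]]  # a 3x2 black image, too small for the API
padded = pad_to_min_side(small)
print(len(padded[0]), len(padded))  # width and height after padding
```

Padding leaves the original pixels untouched, so the text in the field is not rescaled or distorted.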
The FUNSD annotations do not contain newline characters. For this reason I replaced every newline character in the outputs of the tested OCR tools with a space, and then replaced every run of two or more consecutive spaces with a single space in both the ground truth and the outputs of the OCR tools.
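The normalization described above can be sketched in a few lines. The function names are hypothetical and chosen only for this example.

```python
import re


def collapse_spaces(text: str) -> str:
    """Replace every run of two or more spaces with a single space."""
    return re.sub(r" {2,}", " ", text)


def normalize_ocr_output(text: str) -> str:
    """Replace newline characters with spaces, then collapse repeated spaces."""
    return collapse_spaces(text.replace("\n", " "))


def normalize_ground_truth(text: str) -> str:
    """The ground truth has no newlines, so only repeated spaces are collapsed."""
    return collapse_spaces(text)


print(normalize_ocr_output("Total\namount:  42"))
```

Applying the same space-collapsing rule to both sides keeps the comparison from penalizing a tool for whitespace differences that carry no information.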
Costs (updated to 31/05/2021 — Not considering free transactions*)
Considering the setup experiments and the presented results, I estimate that I performed roughly 10000 transactions with Azure OCR (10.00 USD), 10000 transactions with Amazon Textract (15.00 USD) and 10000 transactions with Google OCR (15.00 USD).
*All the commercial OCR tools offer a number of free transactions; for example, with Google OCR the first 1000 transactions of every month are free.
Table 2 shows the results obtained by the different tools on the test set of the OCR Evaluation Dataset, in terms of accuracy and similarity.
Azure OCR and Google OCR show the best performance on both metrics, while Tesseract OCR and Amazon Textract are the worst.
Looking at the scatter plots of the different pairs of OCR results (Figure 5), there is no clear correlation between the results, except for the pair Azure OCR and Google OCR. In particular, although Tesseract OCR and Amazon Textract perform similarly overall, their results are not strongly correlated.
The differences between Azure OCR and Google OCR and the other tools could be explained by many factors: maybe Azure OCR and Google OCR were trained on images similar to the ones contained in the OCR Evaluation Dataset, or maybe not. Either way, the substantial differences between these two tools and the others show that the choice of OCR tool for a project should also be driven by an initial evaluation of the tools' expected performance.
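One way to put a number on what the scatter plots suggest is a Pearson correlation coefficient between the per-example scores of two tools. The sketch below assumes each tool has one score per test example (for instance its similarity), listed in the same order. The function name and the toy scores are made up for illustration and are not results from the experiments.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of per-example scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    if variance_x == 0 or variance_y == 0:
        return float("nan")  # undefined when one tool scores identically everywhere
    return covariance / math.sqrt(variance_x * variance_y)


# Toy per-example scores for two hypothetical tools.
tool_a = [1.0, 0.8, 0.2, 0.9]
tool_b = [0.9, 0.7, 0.3, 1.0]
print(pearson(tool_a, tool_b))
```

A value close to 1 would mean the two tools tend to succeed and fail on the same fields, while a value close to 0 would mean their errors are largely unrelated.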
In the next few days, the pipeline to extract the dataset that has been used in these experiments will be made public. A correlation study between the OCR tools could be interesting.
The Story has been written by:
Federico Ricciuti, Data Scientist
Feel free to add me on Linkedin:
with the kind support of:
Christian Onzaca, Data Scientist
Sabatino Severino, AI Application Architect
Lorenzo Raimondi, Data Scientist
Gabriella Esposito, Cloud Engineer
In collaboration with: