Testing Mistral AI's New OCR Model

On March 6th, the French company Mistral AI released an LLM designed for OCR tasks. For those unfamiliar with the field, OCR (Optical Character Recognition) technology converts images or scanned documents into editable text. Essentially, it allows a computer to recognise letters and numbers in photos or printed documents, making it easy to copy, edit, or search for information without having to type it out manually.

Traditionally, convolutional neural networks (CNNs) have been used for these tasks. More recent projects, such as Google’s Tesseract 4.0, combined convolutional networks for character extraction with recurrent neural networks (RNNs) to structure sequences of characters and words. With the advent of the transformer architecture, OCR capabilities are expanding: these models can understand and process every element of a document (text, images, tables, equations…), which opens up a very broad range of possibilities.

Mistral OCR vs SOTA

According to Mistral AI’s own press release, the new Mistral OCR 2503 model positions itself as the new SOTA (State Of The Art) model in OCR, with an overall accuracy of 94.89%, compared to alternatives such as Google’s Gemini Flash models or GPT-4o, which sit at around 90%. While this difference may seem small, Mistral AI points out that the model also extracts images alongside text, so it is not only more accurate in text extraction but also offers functionality that the others lack.

Thanks to the transformer architecture, all these models excel at understanding context, different languages, and types of objects — categories in which Mistral OCR also reports slight improvements over other SOTA models. Another major advantage of this architecture is that it allows instructions to be given via prompts to focus on specific aspects when extracting information and formatting results into custom JSON outputs.

One of the strengths of Mistral OCR is that it is an extremely efficient model, capable of processing up to 2,000 pages per minute at a cost of $1 per 1,000 pages processed. Mistral AI offers several solutions for clients, whether consuming the service through their API, via a cloud provider, or hosted locally. For non-local options, data is stored in the European Union — a relevant point for compliance with European data protection regulations.
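To put those figures in perspective, here is a small back-of-the-envelope sketch. It only uses the two numbers quoted above ($1 per 1,000 pages and up to 2,000 pages per minute); the function name and the batch size are my own, and the time is a best case that ignores rate limits, retries, and upload overhead.

```python
# Figures quoted by Mistral AI (claims from the announcement, not my measurements).
PRICE_USD_PER_1000_PAGES = 1.0
MAX_PAGES_PER_MINUTE = 2000


def estimate_batch(pages: int) -> tuple[float, float]:
    """Return (cost in USD, best-case minutes) for a batch of `pages` pages."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    cost = pages / 1000 * PRICE_USD_PER_1000_PAGES
    minutes = pages / MAX_PAGES_PER_MINUTE
    return cost, minutes


# Hypothetical archive of 250,000 scanned pages.
cost, minutes = estimate_batch(250_000)
print(f"250,000 pages: about ${cost:.2f} and at least {minutes:.0f} minutes")
```

At those rates, a quarter of a million pages would cost about $250 and take a little over two hours in the best case.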

Putting It to the Test

I tested the model with documents that presented certain complex structures such as multi-column layouts, tables with merged rows, images embedded within text, and inlined charts and tables. Since extracting information from PDF reports is of particular interest for business use cases, I focused the test on these documents, although the model also handles other scenarios perfectly well, such as extracting handwritten text or text from document photographs.

In general terms, the results on simple documents are incredible. While the API does occasionally throw a server error, it processes documents with remarkable speed. Like other models, it is very accurate with text. My impression is that it does not quite reach the near-100% levels shown in benchmarks, especially with footnotes and table elements, but in the body text it is very consistent.
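Given the occasional server errors, it is worth wrapping the request in a retry loop. The sketch below is generic and does not use any Mistral-specific API: `TransientServerError` and `flaky_ocr_request` are hypothetical stand-ins for whatever exception and call your own client code exposes, and the backoff parameters are arbitrary.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientServerError(Exception):
    """Hypothetical stand-in for the 5xx error raised by your client."""


def with_retries(call: Callable[[], T], attempts: int = 4,
                 base_delay: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `call`, retrying on TransientServerError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return call()
        except TransientServerError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the last error
            # 1x, 2x, 4x... the base delay, plus jitter to avoid lockstep retries
            sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay / 2))
    raise ValueError("attempts must be at least 1")


# Simulated request that fails twice before succeeding.
calls = {"n": 0}


def flaky_ocr_request() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransientServerError("simulated 500")
    return "# Page 1 markdown"


# The sleep function is injected so the demo does not actually wait.
result = with_retries(flaky_ocr_request, sleep=lambda seconds: None)
print(result, "after", calls["n"], "calls")
```

Injecting `sleep` keeps the helper easy to test; in real use you would leave the default so the delays actually apply.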

The extraction result is returned in Markdown format, which is especially useful for vector storage and for subsequent processing and consumption by LLMs. However, Markdown limits how much formatting can be preserved, particularly with complex tables, which in advanced scenarios fail to reflect the original structure correctly. On the other hand, the fact that it processes documents page by page is especially useful for managing metadata in RAG-based schemas, but it introduces some coherence issues within the document. For example, some page breaks cut sentences off abruptly, which can produce incoherent fragments and loss of information.

Image 1 – Page-break case in a 2-column article (Original)

Image 1 – Page-break case in a 2-column article (Result)
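One way to soften the page-break cut-offs is a post-processing pass over the per-page Markdown. The sketch below is a heuristic of my own, not something the model or its API provides: it assumes the pages arrive as a list of Markdown strings, and it treats a page that ends without closing punctuation, followed by a page that starts with a lowercase letter, as a sentence split across the break.

```python
import re

# A page "ends cleanly" if its last character is closing punctuation
# (or a table pipe, so table rows are not mistaken for cut sentences).
SENTENCE_END = re.compile(r'[.!?:;"”)\]|]\s*$')


def stitch_page_breaks(pages: list[str]) -> list[str]:
    """Move a sentence tail that spilled onto the next page back to its page."""
    fixed = [page.strip() for page in pages]
    for i in range(len(fixed) - 1):
        current, following = fixed[i], fixed[i + 1]
        if not current or not following:
            continue
        cut_mid_sentence = not SENTENCE_END.search(current)
        continues_lowercase = following[0].islower()
        if cut_mid_sentence and continues_lowercase:
            # The first paragraph of the next page is treated as the tail.
            tail, _, rest = following.partition("\n\n")
            fixed[i] = f"{current} {tail.strip()}"
            fixed[i + 1] = rest.strip()
    return fixed


pages = [
    "The model processes documents page by page, which",
    "introduces some coherence issues.\n\nA new paragraph starts here.",
]
for number, text in enumerate(stitch_page_breaks(pages), start=1):
    print(f"[page {number}] {text}")
```

Because the list keeps one entry per page, page-level metadata for RAG still lines up after stitching. The heuristic will miss cases such as a continuation that starts with a proper noun, so it is worth spot-checking on your own documents.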

In scenarios with text distributed across columns, it usually gets the order right and correctly concatenates text that spans multiple columns, being very accurate in this regard. As for chart extraction and text integration, again, with simple formats, it is capable of recognising when an image is relevant, extracting it, and incorporating it into the result.

Image 2 – Charts (Original)

Image 2 – Charts (Result)
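To check which pages had images extracted, the Markdown itself can be scanned for image references. This is a minimal sketch that assumes extracted images appear in each page's Markdown using the standard `![alt](target)` syntax; the file name `img-0.jpeg` in the example is illustrative, not taken from my test documents.

```python
import re

# Standard Markdown image syntax: ![alt text](target)
IMAGE_REF = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")


def images_by_page(pages: list[str]) -> dict[int, list[str]]:
    """Map each 0-based page index to the image targets its Markdown references."""
    return {
        index: [target for _alt, target in IMAGE_REF.findall(markdown)]
        for index, markdown in enumerate(pages)
    }


# Illustrative input: one page with an embedded chart, one without.
sample_pages = [
    "Intro text.\n\n![img-0.jpeg](img-0.jpeg)\n\nMore text.",
    "No figures here.",
]
print(images_by_page(sample_pages))
```

Comparing this map against the original document is a quick way to spot pages where a relevant chart was skipped or an irrelevant element was extracted.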

However, with heavy graphics or pages packed with visual elements, it gets confused fairly easily and makes strange or irrelevant extractions:

Image 3 – Heavy graphics and visually dense pages (Original)

Image 3 – Heavy graphics and visually dense pages (Result)

Image 4 – Heavy graphics and visually dense pages (Original)

Image 4 – Heavy graphics and visually dense pages (Result)

It also has considerable trouble with text that wraps onto a new line because of column width but belongs to the same table cell, as well as with merged cells and hierarchical structures:

Image 5 – Merged cells and hierarchical structures (Original)

Image 5 – Merged cells and hierarchical structures (Result)

Image 6 – Merged cells and hierarchical structures (Original)

Image 6 – Merged cells and hierarchical structures (Result)
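For the wrapped-cell case, a simple clean-up pass can help. The sketch below rests on an assumption about how the problem shows up: that a wrapped line becomes an extra Markdown table row whose first cell is empty. Under that assumption it folds such rows into the row above. It does not handle escaped pipes, and it will do the wrong thing on tables where an empty first cell is legitimate (for example, genuinely merged cells).

```python
def parse_row(line: str) -> list[str]:
    """Split a Markdown table row like '| a | b |' into stripped cells."""
    text = line.strip().removeprefix("|").removesuffix("|")
    return [cell.strip() for cell in text.split("|")]


def merge_wrapped_rows(table: str) -> str:
    """Fold rows whose first cell is empty into the row above them."""
    lines = [ln for ln in table.strip().splitlines() if ln.strip()]
    header, separator, body = lines[0], lines[1], lines[2:]
    merged: list[list[str]] = []
    for line in body:
        cells = parse_row(line)
        if merged and cells[0] == "" and len(cells) == len(merged[-1]):
            # Assumed continuation: append each cell to the one above it.
            merged[-1] = [f"{a} {b}".strip() for a, b in zip(merged[-1], cells)]
        else:
            merged.append(cells)
    rows = ["| " + " | ".join(cells) + " |" for cells in merged]
    return "\n".join([header, separator, *rows])


# Illustrative table in which one description wrapped onto a second row.
table = """\
| Item | Description | Price |
| --- | --- | --- |
| A1 | Industrial widget with | 10 |
|  | reinforced casing |  |
| B2 | Small bolt | 2 |
"""
print(merge_wrapped_rows(table))
```

Merged cells and hierarchical headers are harder, since Markdown tables have no way to express them; for those, the clean-up would need the original layout rather than the Markdown alone.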

As for formulas, superscripts, subscripts, and other structural formats, it interprets them well and consistently.

Image 7 – Formulas, superscripts, subscripts, and other formats (Original)

Image 7 – Formulas, superscripts, subscripts, and other formats (Result)

Preliminary Conclusions

In general, this is a very good model. Perhaps the big news is that it was developed by Mistral, a European company committed to the open-source community. I hope this release is a precursor to an open version for the community, and that with models of a contained size (below 7B parameters), it could run on consumer devices and be more accessible for on-premises deployments in companies, thereby guaranteeing complete data privacy.

Even today, in a digitalised world highly exposed to the web and interoperability, a large proportion of information remains in unstructured or semi-structured documents that are difficult for machines to interpret. These kinds of technologies are key components for automation, interoperability, and the deployment of artificial intelligence and data exploitation in enterprises, as they sit at the very foundation of the stack.

Although existing solutions already offered a high degree of reliability in text extraction, this solution's integration of other document elements and its ease of use open up new possibilities, especially given the potential these advances hold for RAG-based agent systems or fine-tuned models. The use of visual elements beyond plain text is not as novel as it is being made out to be, since Anthropic introduced similar improvements in its Claude 3.5 Sonnet model in November 2024. However, that system is neither open nor accessible, unlike this tool, which allows extractions for personal use.

Furthermore, its accessible price and speed make it applicable in sectors where it was previously not considered viable due to costs or operational requirements, much as happened with smaller model families such as OpenAI’s Mini, Google’s Flash, Anthropic’s Haiku, or the smaller and quantised versions of Llama, Qwen, and other open-source models.

Although it has certain limitations that will likely be refined over time, it is a useful tool and, in some sense, the enthusiasm it is generating in the community is justified.
