vikp 18 hours ago

I'm a fan of the Allen AI team and their work. Unfortunately, the benchmarking of olmocr against marker (https://github.com/VikParuchuri/marker) is quite flawed.

Throughput - they benchmarked marker's API cost against local inference cost for olmocr. In our testing, marker running locally gets 20-120 pages per second on an H100 (without custom kernels, etc.). Olmocr in our testing gets between 0.4 (unoptimized) and 4 (with sglang) pages per second on the same machine.

Accuracy - their quality benchmarks are based on win rate with only 75 samples, which differ between each tool pair. The samples were filtered down from a set of ~2,000 based on opaque criteria. They then asked researchers at Allen AI to judge which output was better. When we benchmarked with our existing set and an LLM as judge, we got a 56% win rate for marker across 1,107 documents. We had to filter out non-English docs, since olmocr is English-only (marker is not).
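
For anyone curious, the LLM-as-judge comparison is conceptually just this (a simplified sketch, not the exact benchmark code - the real implementation is in the repo linked below, and the judge call here is a placeholder):

  def ask_judge(prompt: str) -> str:
      """Placeholder for whatever judge LLM you call; returns 'A' or 'B'."""
      raise NotImplementedError

  def win_rate(pairs):
      """pairs: (marker_output, olmocr_output, source_excerpt) tuples."""
      wins = 0
      for marker_md, olmocr_md, source in pairs:
          prompt = (
              "Which conversion of this document is more faithful?\n"
              f"Source:\n{source}\n\nA (marker):\n{marker_md}\n\nB (olmocr):\n{olmocr_md}\n"
              "Answer with a single letter, A or B."
          )
          if ask_judge(prompt).strip().upper().startswith("A"):
              wins += 1
      return wins / len(pairs)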

Hallucinations/other problems - we noticed a lot of missing text and hallucinations with olmocr in our benchmark set. You can see sample output and llm ratings here - https://huggingface.co/datasets/datalab-to/marker_benchmark_... .

You can see all benchmark code at https://github.com/VikParuchuri/marker/tree/master/benchmark... .

Happy to chat more with anyone at Allen AI who wants to discuss this. I think olmocr is a great contribution - happy to help you benchmark marker more fairly.

Krasnol 4 minutes ago

Make it an .exe file and storm the world's offices.

rahimnathwani 20 hours ago

Good:

- no cloud service required, can run on local Nvidia GPU

- outputs a single stream of text with the correct reading order (even for multi column PDF)

- recognizes handwriting and stuff

Bad:

- doesn't seem to extract the text within diagrams (which I guess is fine because that text would be useless to an LLM)

OP is the demo page, which lets you OCR 10 pages.

The code needs an Nvidia GPU to run: https://github.com/allenai/olmocr

Not sure of the VRAM requirements because I haven't tried running it locally yet.

  • thelittleone 13 hours ago

    Text from diagrams can be useful to LLMs. For example, an LLM can understand a flow chart's decision-making shapes etc., but without the text it could misinterpret the information. I process a bunch of PDFs, including procedures. Diagrams are converted to code. The text helps in many cases.

    • rahimnathwani 13 hours ago

        Diagrams are converted to code
      
      That's cool. May I ask what your pipeline looks like? And what code format do you use for diagrams? Mermaid?
fschuett 16 hours ago

Very impressive. It's the only AI vision toolkit so far that actually recognizes Latin and medieval scripts. I've been trying to somehow translate public-domain medieval books (including the artwork and original layout) to PDF, so they can be re-printed, i.e. pages like this: https://i.imgur.com/YLuF9sa.png - I tried a Google Vision + o1 solution, which did work to some extent, but not on the first try. This even recognizes the "E" of the artwork initial (or fixes it because of the context), which many OCR or AI solutions fail at.

The only thing I'd need now is a way to get the original font and artwork positions (would be a great addition to OlmOCR). Potentially I could work up a solution to create the font manually (as most medieval books are written in the same writing style), then find the shapes of the glyphs in the original image once I have the text, and then mask out the artwork with some OpenCV magic.
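
For the masking step, something like this could work (a rough sketch, assuming you already have per-glyph or per-word bounding boxes from the OCR output; the filenames and box format are placeholders):

  import cv2

  page = cv2.imread("page.png")
  boxes = [(120, 340, 40, 60)]  # (x, y, w, h) for each recognized glyph/word

  artwork_only = page.copy()
  for x, y, w, h in boxes:
      # Paint over the text regions so only the initials/artwork remain.
      cv2.rectangle(artwork_only, (x, y), (x + w, y + h), (255, 255, 255), -1)

  cv2.imwrite("artwork_only.png", artwork_only)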

chad1n 19 hours ago

These "OCR" tools who are actually multimodals are interesting because they can do more than just text abstraction, but their biggest flaw is hallucinations and overall the nondeterministic nature. Lately, I've been using Gemini to turn my notebooks into Latex documents, so I can see a pretty nice usecase for this project, but it's not for "important" papers or papers that need 100% accuracy.

  • thelittleone 13 hours ago

    How about building a tool which indexes OCR chunks/tokens with a confidence grading, sets a tolerance level, and defines actions for when a token or chunk(s) falls below that level? Actions could include automated verification using another model or, as a last resort, a human.
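
    A minimal sketch of that routing idea (the thresholds, chunk shape, and action names are all placeholders):

      from dataclasses import dataclass

      @dataclass
      class Chunk:
          text: str
          confidence: float  # 0.0-1.0, as reported by the OCR/VLM engine

      def route(chunk: Chunk, tolerance: float = 0.85) -> str:
          if chunk.confidence >= tolerance:
              return "accept"
          if chunk.confidence >= 0.5:
              return "verify_with_second_model"
          return "human_review"  # last resort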

TZubiri 19 hours ago

It's amazing how many of these solutions exist.

Such a hard problem that we create for ourselves.

zitterbewegung 20 hours ago

Would like to know how this compares to https://github.com/tesseract-ocr/tesseract

  • maleldil 4 hours ago

    I haven't checked OlmOCR, but in my experience, Tesseract is awful for scientific papers. The structure is mangled, formulas are complete rubbish, tables are nearly useless, etc.

    I also tried Docling (which I believe is LLM-based), which works fine, but the references section of the paper was too noisy, and Gemini 2.0 Flash was okay but too slow for a large number of PDFs[1].

    I settled for downloading the LaTeX code from arXiv and using pandoc to parse that. I also needed to process citations, which was easy using pandoc's support for BibTeX to CSL JSON.
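
    Roughly, the pandoc steps look like this (flags and paths are illustrative):

      import subprocess

      # LaTeX source from arXiv -> markdown
      subprocess.run(
          ["pandoc", "main.tex", "-f", "latex", "-t", "markdown", "-o", "paper.md"],
          check=True,
      )

      # BibTeX -> CSL JSON for the citations
      subprocess.run(
          ["pandoc", "refs.bib", "-f", "bibtex", "-t", "csljson", "-o", "refs.json"],
          check=True,
      )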

    [1] Because of the number of output tokens, I had to split the PDF into pages and individually convert each one. Sometimes, the API would take too long to respond, making the overall system quite slow.

  • rahimnathwani 20 hours ago

    Tesseract is multilingual.

    Tesseract extracts all the text from a doc, without trying to fix reading order.

    Tesseract runs in many more places, as it doesn't require a GPU.

    Tesseract's pure text output tends to have a lot of extra bits, e.g. bits of text that appear in diagrams. Good as a starting point and fine for most downstream tasks.
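
    For reference, typical Tesseract usage via pytesseract looks like this (a small sketch; language codes and paths are illustrative):

      import pytesseract
      from PIL import Image

      img = Image.open("page.png")
      text = pytesseract.image_to_string(img, lang="eng+deu")  # multilingual
      data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
      # `data` has per-word boxes and confidences, but reading order and
      # layout reconstruction are left to you.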

  • jesuslop 20 hours ago

    and mathpix

    • rahimnathwani 20 hours ago

      Wow. The Mathpix mobile app has support for reading two column PDFs as a single column.

      You can't run it locally, though, right?

      • kergonath 20 hours ago

        > The Mathpix mobile app has support for reading two column PDFs as a single column.

        Mathpix is what gave the best results when I tried a whole bunch of OCR solutions on technical PDFs (multi-column with diagrams, figures and equations). It is brilliant.

        > You can't run it locally, though, right?

        Unfortunately, no. Which is a shame because I also have confidential documents to OCR and there is no way I put them on someone else’s cloud.

        • rahimnathwani 19 hours ago

          Did you try marker? https://github.com/VikParuchuri/marker

          I haven't tried olmocr yet, and I now realize my 8GB GPU probably won't cut it, as it uses a 7B-param VLM under the hood.
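
          Back of the envelope: 7B parameters at bf16 is 7B x 2 bytes, roughly 14 GB for the weights alone, before activations and KV cache, so 8GB won't fit without quantization.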

          • kergonath 19 hours ago

            > Did you try marker?

            I did not, but I will. Thanks for the pointer!

constantinum 16 hours ago

Tested it with the following documents:

* Loan application form: It picks up checkboxes and handwriting, but it missed a lot of form fields. Not sure why.

* Edsger W. Dijkstra’s handwritten notes (from the Texas university archive) - Parsing is good.

* Badly scanned (misaligned) bill - Parsing is good. Observation: there is a name field, but it produced a similar but incorrect name instead of the name on the bill - hallucination?

* Investment fund factsheet - It could parse the bar charts and tables, but it whimsically excluded many vital data points from the document.

* Investment fund factsheet, complex tables - Bad extraction, could not extract merged tables and again whimsical elimination of rows and columns.

If anyone's curious, try LLMWhisperer[1] for OCR. It doesn't use LLMs, so there are no hallucination side effects. It also preserves the layout of the input document for more context and clarity.

There's also Docling[2], which is handy for converting tables from PDFs into markdown. It uses Tesseract/EasyOCR under the hood, though, which can sometimes make the OCR results a bit less accurate.

[1] - https://pg.llmwhisperer.unstract.com/ [2] - https://github.com/DS4SD/docling
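
The Docling route for tables is short, something like this (per its README; treat the exact calls as an assumption if your version differs):

  from docling.document_converter import DocumentConverter

  converter = DocumentConverter()
  result = converter.convert("fund_factsheet.pdf")
  print(result.document.export_to_markdown())  # tables come out as markdown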

Zardoz84 9 hours ago

I was expecting a tool to extract text from PDFs, not another LLM pretending to be a reliable OCR.

  • dredmorbius 4 hours ago

    For better or worse, LLM PDF OCR conversion is likely where the state of the art / the market is headed.

    "Reliable OCR" has been a rather phenomenal oxymoron for going on five or more decades, so if you've something more reliable in mind I'd appreciate your sharing it.

xz18r 20 hours ago

Why exactly does this need to be AI? OCR was a thing way before the boom and works pretty fine, usually. Seems like overkill.

  • Sophira 19 hours ago

    I think it's interesting how we call this AI, because neural networks have been used for OCR for literally decades at this point.

    Where does "neural networks" stop and "AI" begin?

    • adrian_b 8 hours ago

      In my opinion, the use of AI/ML/neural networks for the recognition of individual letters or of ligatures is perfectly valid.

      However for OCR, I do not want any kind of AI that attempts to use a context bigger than that, i.e. attempting to recognize words, phrases, sentences.

      I find an OCR tool that fails to recognize all the characters, marking some as unknown, much more acceptable than a tool that returns even a single wrongly guessed word.

      While there may be some kinds of documents where an LLM could guess missing content with reasonable accuracy, all the documents that I would want to process with OCR, i.e. mostly old or very old books, have content that one could guess successfully only with a very thorough knowledge of the author's other writings, of the subject matter, and of the characteristics of the language used in that historical period.

      An LLM trained very specifically on the text's author, subject, and contemporaneous writings might have a chance of success, but none of the available LLMs is like this, and it is much cheaper anyway to use an OCR tool that does not attempt contextual guesses and then have humans resolve any unreadable characters.

      • dredmorbius 4 hours ago

        Though it's worth mentioning that inference from context (which an LLM OCR tool is presumed to do) is precisely what a human transcriptionist does. This can result in "nonfaithful" reproduction where, say, a tyop in an original document is corrected, consciously or unconsciously, when transcribing. Taking into account both the local context (on a given page) and the larger context of a work (within a subject area, collection, etc.), I'd expect an AI-based OCR tool to behave similarly.

        For archival / academic work, what would be nifty would be a tool which would note the original text image context, a certainty probability score, and possible alternative transcriptions in cases of ambiguity.

        It's a nice idea to get humans in the loop, but realistically this simply won't always be possible, and it's helpful to think of what next-best approaches might be. It wouldn't surprise me if AIs turned out to be more generally reliable in such cases, though I would also expect some wild mis-fires and fumbles along the way.
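
        Purely as an illustration, the record such a tool emits might look like this (the field names are made up):

          from dataclasses import dataclass, field

          @dataclass
          class TranscriptionSpan:
              text: str                # chosen transcription
              confidence: float        # certainty score, 0.0-1.0
              image_bbox: tuple        # (x, y, w, h) back into the scanned page
              alternatives: list = field(default_factory=list)  # other plausible readings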

    • remram 16 hours ago

      I'm in academia, my understanding is that all statistical methods are AI now.

      Search? AI.

      Linear regression? AI.

  • simonw 18 hours ago

    Know of any classic OCR tools that can reliably extract tabular data from scrappy PDFs? I've been hunting for a dependable option for that for years.

    • cle 3 hours ago

      I've not found one either. I did this at a very large scale recently and ended up just using pdfplumber. I did a POC with Table Transformer, but the cost was too high at my scale--there are probably better options now anyway. Most seem to focus on structure detection and then use traditional OCR for the actual content extraction.

      It's a very hard space in the long tail, like tables that span pages or tables with complex internal structures. I went into it thinking "eh how hard can tables be?". Very hard. Thankfully it's a pretty active research area.
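
      The pdfplumber route, roughly (paths are illustrative; tuning the table settings per document family is usually the real work):

        import pdfplumber

        with pdfplumber.open("report.pdf") as pdf:
            for page in pdf.pages:
                for table in page.extract_tables():
                    for row in table:
                        print(row)  # list of cell strings (None for empty cells)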

    • constantinum 16 hours ago

      If you're looking for better accuracy and table layout preservation, give LLMWhisperer and Docling a try! Both keep tables tidy with a Markdown-like structure.

  • rahimnathwani 20 hours ago

    Look at pages 18-20 of the technical report. I don't know of any non-AI OCR that can do as good a job as that.