vikp 18 hours ago

I'm a fan of the Allen AI team and their work. Unfortunately, the benchmarking of olmocr against marker (https://github.com/VikParuchuri/marker) is quite flawed.

Throughput - they benchmarked marker's API cost against local inference cost for olmocr. In our testing, marker running locally gets 20-120 pages per second on an H100 (without custom kernels, etc.). Olmocr in our testing gets between 0.4 (unoptimized) and 4 (with sglang) pages per second on the same machine.

Accuracy - their quality benchmarks are based on win rate with only 75 samples, which differ between each tool pair. The samples were filtered down from a set of ~2,000 based on opaque criteria. They then asked researchers at Allen AI to judge which output was better. When we benchmarked with our existing set and an LLM as judge, we got a 56% win rate for marker across 1,107 documents. We had to filter out non-English docs, since olmocr is English-only (marker is not).
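
For anyone curious, the LLM-as-judge comparison is conceptually just this (a simplified sketch, not the exact benchmark code - the real implementation is in the repo linked below, and the judge call here is a placeholder):

  def ask_judge(prompt: str) -> str:
      """Placeholder for whatever judge LLM you call; returns 'A' or 'B'."""
      raise NotImplementedError

  def win_rate(pairs):
      """pairs: (marker_output, olmocr_output, source_excerpt) tuples."""
      wins = 0
      for marker_md, olmocr_md, source in pairs:
          prompt = (
              "Which conversion of this document is more faithful?\n"
              f"Source:\n{source}\n\nA (marker):\n{marker_md}\n\nB (olmocr):\n{olmocr_md}\n"
              "Answer with a single letter, A or B."
          )
          if ask_judge(prompt).strip().upper().startswith("A"):
              wins += 1
      return wins / len(pairs)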

Hallucinations/other problems - we noticed a lot of missing text and hallucinations with olmocr in our benchmark set. You can see sample output and llm ratings here - https://huggingface.co/datasets/datalab-to/marker_benchmark_... .

You can see all benchmark code at https://github.com/VikParuchuri/marker/tree/master/benchmark... .

Happy to chat more with anyone at Allen AI who wants to discuss this. I think olmocr is a great contribution - happy to help you benchmark marker more fairly.

Krasnol 4 minutes ago

Make it an .exe file and storm the world's offices.

rahimnathwani 20 hours ago

Good:

- no cloud service required, can run on local Nvidia GPU

- outputs a single stream of text with the correct reading order (even for multi column PDF)

- recognizes handwriting and stuff

Bad:

- doesn't seem to extract the text within diagrams (which I guess is fine because that text would be useless to an LLM)

OP is the demo page, which lets you OCR 10 pages.

The code needs an Nvidia GPU to run: https://github.com/allenai/olmocr

Not sure of the VRAM requirements because I haven't tried running it locally yet.

  • thelittleone 13 hours ago

    Text from diagrams can be useful to LLMs. For example, an LLM can understand a flow chart's decision-making shapes etc., but without the text it could misinterpret the information. I process a bunch of PDFs, including procedures. Diagrams are converted to code. The text helps in many cases.

    • rahimnathwani 13 hours ago

        Diagrams are converted to code
      
      That's cool. May I ask what your pipeline looks like? And what code format do you use for diagrams? Mermaid?
fschuett 16 hours ago

Very impressive. It's the only AI vision toolkit so far that actually recognizes Latin and medieval scripts. I've been trying to somehow translate public-domain medieval books (including the artwork and original layout) to PDF, so they can be re-printed, i.e. pages like this: https://i.imgur.com/YLuF9sa.png - I tried a Google Vision + o1 solution, which did work to some extent, but not on the first try. This even recognizes the "E" of the artwork initial (or fixes it because of the context), which many OCR or AI solutions fail at.

The only thing I'd need now is a way to get the original font and artwork positions (would be a great addition to OlmOCR). Potentially I could work up a solution to create the font manually (as most medieval books are written in the same writing style), then find the shapes of the glyphs in the original image once I have the text, and then mask out the artwork with some OpenCV magic.
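
For the masking step, something like this could work (a rough sketch, assuming you already have per-glyph or per-word bounding boxes from the OCR output; the filenames and box format are placeholders):

  import cv2

  page = cv2.imread("page.png")
  boxes = [(120, 340, 40, 60)]  # (x, y, w, h) for each recognized glyph/word

  artwork_only = page.copy()
  for x, y, w, h in boxes:
      # Paint over the text regions so only the initials/artwork remain.
      cv2.rectangle(artwork_only, (x, y), (x + w, y + h), (255, 255, 255), -1)

  cv2.imwrite("artwork_only.png", artwork_only)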

chad1n 19 hours ago

These "OCR" tools who are actually multimodals are interesting because they can do more than just text abstraction, but their biggest flaw is hallucinations and overall the nondeterministic nature. Lately, I've been using Gemini to turn my notebooks into Latex documents, so I can see a pretty nice usecase for this project, but it's not for "important" papers or papers that need 100% accuracy.

  • thelittleone 13 hours ago

    How about building a tool which indexes OCR chunks/tokens with a confidence grading, sets a tolerance level, and defines actions for when a token or chunk(s) falls below that level? Actions could include automated verification using another model or, as a last resort, a human.
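
    A minimal sketch of that routing idea (the thresholds, chunk shape, and action names are all placeholders):

      from dataclasses import dataclass

      @dataclass
      class Chunk:
          text: str
          confidence: float  # 0.0-1.0, as reported by the OCR/VLM engine

      def route(chunk: Chunk, tolerance: float = 0.85) -> str:
          if chunk.confidence >= tolerance:
              return "accept"
          if chunk.confidence >= 0.5:
              return "verify_with_second_model"
          return "human_review"  # last resort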

TZubiri 19 hours ago

It's amazing how many of these solutions exist.

Such a hard problem that we create for ourselves.

zitterbewegung 20 hours ago

Would like to know how this compares to https://github.com/tesseract-ocr/tesseract

  • maleldil 4 hours ago

    I haven't checked OlmOCR, but in my experience, Tesseract is awful for scientific papers. The structure is mangled, formulas are complete rubbish, tables are nearly useless, etc.

    I also tried Docling (which I believe is LLM-based), which works fine, but the references section of the paper was too noisy, and Gemini 2.0 Flash was okay but too slow for a large number of PDFs[1].

    I settled for downloading the LaTeX code from arXiv and using pandoc to parse that. I also needed to process citations, which was easy using pandoc's support for BibTeX to CSL JSON.
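
    Roughly, the pandoc steps look like this (flags and paths are illustrative):

      import subprocess

      # LaTeX source from arXiv -> markdown
      subprocess.run(
          ["pandoc", "main.tex", "-f", "latex", "-t", "markdown", "-o", "paper.md"],
          check=True,
      )

      # BibTeX -> CSL JSON for the citations
      subprocess.run(
          ["pandoc", "refs.bib", "-f", "bibtex", "-t", "csljson", "-o", "refs.json"],
          check=True,
      )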

    [1] Because of the number of output tokens, I had to split the PDF into pages and individually convert each one. Sometimes, the API would take too long to respond, making the overall system quite slow.

  • rahimnathwani 20 hours ago

    Tesseract is multilingual.

    Tesseract extracts all the text from a doc, without trying to fix reading order.

    Tesseract runs in many more places, as it doesn't require a GPU.

    Tesseract's pure text output tends to have a lot of extra bits, e.g. bits of text that appear in diagrams. Good as a starting point and fine for most downstream tasks.
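
    For reference, typical Tesseract usage via pytesseract looks like this (a small sketch; language codes and paths are illustrative):

      import pytesseract
      from PIL import Image

      img = Image.open("page.png")
      text = pytesseract.image_to_string(img, lang="eng+deu")  # multilingual
      data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
      # `data` has per-word boxes and confidences, but reading order and
      # layout reconstruction are left to you.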

  • jesuslop 20 hours ago

    and mathpix

    • rahimnathwani 20 hours ago

      Wow. The Mathpix mobile app has support for reading two column PDFs as a single column.

      You can't run it locally, though, right?

      • kergonath 20 hours ago

        > The Mathpix mobile app has support for reading two column PDFs as a single column.

        Mathpix is what gave the best results when I tried a whole bunch of OCR solutions on technical PDFs (multi-column with diagrams, figures and equations). It is brilliant.

        > You can't run it locally, though, right?

        Unfortunately, no. Which is a shame because I also have confidential documents to OCR and there is no way I put them on someone else’s cloud.

        • rahimnathwani 19 hours ago

          Did you try marker? https://github.com/VikParuchuri/marker

          I haven't tried olmocr yet, and I now realize my 8GB GPU probably won't cut it, as it uses a 7B-param VLM under the hood.
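
          Back of the envelope: 7B parameters at bf16 is 7B x 2 bytes, roughly 14 GB for the weights alone, before activations and KV cache, so 8GB won't fit without quantization.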

          • kergonath 19 hours ago

            > Did you try marker?

            I did not, but I will. Thanks for the pointer!

constantinum 16 hours ago

Tested it with the following documents:

* Loan application form: It picks up checkboxes and handwriting, but it missed a lot of form fields. Not sure why.

* Edsger W. Dijkstra’s handwritten notes (from the Texas university archive) - Parsing is good.

* Badly scanned (misaligned) bill - Parsing is good. Observation: there is a name field, but it produced a similar but incorrect name instead of the name on the bill - hallucination?

* Investment fund factsheet - It could parse the bar charts and tables, but it whimsically excluded many vital data points from the document.

* Investment fund factsheet, complex tables - Bad extraction, could not extract merged tables and again whimsical elimination of rows and columns.

If anyone's curious, try LLMWhisperer[1] for OCR. It doesn't use LLMs, so there are no hallucination side effects. It also preserves the layout of the input document for more context and clarity.

There's also Docling[2], which is handy for converting tables from PDFs into markdown. It uses Tesseract/EasyOCR under the hood, though, which can sometimes make the OCR results a bit less accurate.

[1] - https://pg.llmwhisperer.unstract.com/ [2] - https://github.com/DS4SD/docling
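
The Docling route for tables is short, something like this (per its README; treat the exact calls as an assumption if your version differs):

  from docling.document_converter import DocumentConverter

  converter = DocumentConverter()
  result = converter.convert("fund_factsheet.pdf")
  print(result.document.export_to_markdown())  # tables come out as markdown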

Zardoz84 9 hours ago

I was expecting a tool to extract text from PDFs, not another LLM pretending to be a reliable OCR.

  • dredmorbius 4 hours ago

    For better or worse, LLM PDF OCR conversion is likely where the state of the art / the market is headed.

    "Reliable OCR" has been a rather phenomenal oxymoron for going on five or more decades, so if you've something more reliable in mind I'd appreciate your sharing it.

xz18r 20 hours ago

Why exactly does this need to be AI? OCR was a thing way before the boom and works pretty fine, usually. Seems like overkill.

  • Sophira 19 hours ago

    I think it's interesting how we call this AI, because neural networks have been used for OCR for literally decades at this point.

    Where does "neural networks" stop and "AI" begin?

    • adrian_b 8 hours ago

      In my opinion, the use of AI/ML/neural networks for the recognition of individual letters or of ligatures is perfectly valid.

      However for OCR, I do not want any kind of AI that attempts to use a context bigger than that, i.e. attempting to recognize words, phrases, sentences.

      I find an OCR tool that fails to recognize all the characters, marking some as unknown, much more acceptable than a tool that returns even a single wrongly guessed word.

      While there may be some kinds of documents where an LLM could guess missing content with reasonable accuracy, all the documents that I would want to process with OCR, i.e. mostly old or very old books, have content that one could guess successfully only with a very thorough knowledge of the author's other writings, of the subject matter, and of the characteristics of the language used in that historical period.

      An LLM trained very specifically on the text's author, subject, and contemporaneous writings might have a chance of success, but none of the available LLMs is like this, and it is much cheaper anyway to use an OCR tool that does not attempt contextual guesses and then have humans resolve any unreadable characters.

      • dredmorbius 4 hours ago

        Though it's worth mentioning that inference from context (which an LLM OCR tool is presumed to do) is precisely what a human transcriptionist does. This can result in "nonfaithful" reproduction where, say, a tyop in an original document is corrected, consciously or unconsciously, when transcribing. Taking into account both the local context (on a given page) and the larger context of a work (within a subject area, collection, etc.), I'd expect an AI-based OCR tool to behave similarly.

        For archival / academic work, what would be nifty would be a tool which would note the original text image context, a certainty probability score, and possible alternative transcriptions in cases of ambiguity.

        It's a nice idea to get humans in the loop, but realistically this simply won't always be possible, and it's helpful to think of what next-best approaches might be. It wouldn't surprise me if AIs turned out to be more generally reliable in such cases, though I would also expect some wild mis-fires and fumbles along the way.
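
        Purely as an illustration, the record such a tool emits might look like this (the field names are made up):

          from dataclasses import dataclass, field

          @dataclass
          class TranscriptionSpan:
              text: str                # chosen transcription
              confidence: float        # certainty score, 0.0-1.0
              image_bbox: tuple        # (x, y, w, h) back into the scanned page
              alternatives: list = field(default_factory=list)  # other plausible readings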

    • remram 16 hours ago

      I'm in academia, my understanding is that all statistical methods are AI now.

      Search? AI.

      Linear regression? AI.

  • simonw 18 hours ago

    Know of any classic OCR tools that can reliably extract tabular data from scrappy PDFs? I've been hunting for a dependable option for that for years.

    • cle 3 hours ago

      I've not found one either. I did this at a very large scale recently and ended up just using pdfplumber. I did a POC with Table Transformer, but the cost was too high at my scale--there are probably better options now anyway. Most seem to focus on structure detection and then use traditional OCR for the actual content extraction.

      It's a very hard space in the long tail, like tables that span pages or tables with complex internal structures. I went into it thinking "eh how hard can tables be?". Very hard. Thankfully it's a pretty active research area.
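
      The pdfplumber route, roughly (paths are illustrative; tuning the table settings per document family is usually the real work):

        import pdfplumber

        with pdfplumber.open("report.pdf") as pdf:
            for page in pdf.pages:
                for table in page.extract_tables():
                    for row in table:
                        print(row)  # list of cell strings (None for empty cells)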

    • constantinum 16 hours ago

      If you're looking for better accuracy and table layout preservation, give LLMWhisperer and Docling a try! Both keep tables tidy with a Markdown-like structure.

  • rahimnathwani 20 hours ago

    Look at pages 18-20 of the technical report. I don't know of any non-AI OCR that can do as good a job as that.