On Mon, Nov 14, 2022 at 12:25 PM Peter King via talk <talk@gtalug.org>
wrote:

>
> One of the ways in which OCR contributes real value is if you have a large
> number of documents that are idiosyncratic in the same way ...  If anyone
> knows of anything open-source that works reasonably well, I'd love to hear
> about it.
>

For all that Tesseract is a mass-ingestion OCR tool, it can be fine-tuned.
Whether there are user-friendly tools for training it, I don't know. I'd
really like a tool that would pause Tesseract on matches below a certain
confidence threshold and allow manual control of what gets stored in the
text.
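
I don't know of a ready-made tool for that, but a rough sketch of the idea
is possible after the fact, assuming the pytesseract wrapper and Pillow are
installed. pytesseract.image_to_data() reports a per-word confidence, so a
script can prompt for anything below a cutoff. The function names and the
cutoff of 60 are my own inventions, and this reviews Tesseract's output
once it has finished rather than stopping it mid-run.

```python
def ocr_words(image_path, lang="eng"):
    """Run Tesseract through pytesseract and return (word, confidence) pairs."""
    # Third-party imports are kept local so the helpers below work without them.
    import pytesseract
    from PIL import Image

    data = pytesseract.image_to_data(
        Image.open(image_path),
        lang=lang,
        output_type=pytesseract.Output.DICT,
    )
    return word_confidences(data)


def word_confidences(data):
    """Pair each recognised word with its confidence (0-100)."""
    pairs = []
    for text, conf in zip(data["text"], data["conf"]):
        conf = float(conf)  # may arrive as str, int or float depending on version
        # Layout rows (blocks, lines) carry a confidence of -1 and empty text.
        if conf < 0 or not text.strip():
            continue
        pairs.append((text, conf))
    return pairs


def review(pairs, threshold=60.0, ask=input):
    """Ask a human about every word below the (arbitrary) threshold."""
    accepted = []
    for text, conf in pairs:
        if conf < threshold:
            answer = ask(f"{text!r} (confidence {conf:.0f}), correction or Enter to keep: ")
            if answer.strip():
                text = answer.strip()
        accepted.append(text)
    return " ".join(accepted)
```

Something like review(ocr_words("page.png", lang="frk")) would then walk
through the doubtful words on one page; "page.png" is a placeholder, and
the language data for "frk" has to be installed separately.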

A few years ago Tesseract was used to create a searchable archive of all
available documentation from the Free City of Danzig, the short-lived
city-state that existed from 1920 to 1939 in what is now Gdańsk, Poland.
Most of the paperwork (and there was a *lot*: they were very big on public
participation in deciding how they were going to be run) was printed in
Fraktur (aka blackletter, gothic or textura). Tesseract was trained to read
this script, and now the parameters live in the 'tesseract-ocr-frk' package
for all to use. I wish they could have done the same for Sütterlin, the
handwritten script of the time and one of the great "go home you're drunk"
cursives.

For hands-off OCR on Linux, the ocrmypdf tool is quite amazing. Great way
of stress-testing your hardware, too.
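
A minimal sketch of driving it from a script, assuming ocrmypdf is on your
PATH. By default it uses every CPU core it can find (hence the stress
test); --jobs caps that, and --skip-text leaves pages that already have a
text layer alone. The helper names and file names are made up for
illustration.

```python
import shutil
import subprocess


def build_ocrmypdf_command(src, dst, lang="eng", jobs=None, skip_text=True):
    """Assemble an ocrmypdf command line as a list of arguments."""
    cmd = ["ocrmypdf", "-l", lang]
    if jobs is not None:
        # Limit the number of CPU cores used; without this, all are used.
        cmd += ["--jobs", str(jobs)]
    if skip_text:
        # Do not re-OCR pages that already contain text.
        cmd.append("--skip-text")
    cmd += [src, dst]
    return cmd


def run_ocrmypdf(src, dst, **options):
    """Run ocrmypdf and raise if it is missing or exits with an error."""
    if shutil.which("ocrmypdf") is None:
        raise RuntimeError("ocrmypdf is not installed or not on PATH")
    subprocess.run(build_ocrmypdf_command(src, dst, **options), check=True)
```

For example, run_ocrmypdf("scan.pdf", "scan-ocr.pdf", lang="frk", jobs=2)
would OCR a Fraktur scan on two cores, provided the matching Tesseract
language data is installed.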

 Stewart
---
Post to this mailing list talk@gtalug.org
Unsubscribe from this mailing list https://gtalug.org/mailman/listinfo/talk
