On Mon, Nov 14, 2022 at 12:25 PM Peter King via talk <talk@gtalug.org> wrote:
>
> One of the ways in which OCR contributes real value is if you have a large
> number of documents that are idiosyncratic in the same way ... If anyone
> knows of anything open-source that works reasonably well, I'd love to hear
> about it.
>

For all that Tesseract is a mass-ingestion OCR tool, it can be fine tuned.
Whether there are tools for training it that are user-friendly, I don't know.
I'd really like a tool that would stop tesseract on matches lower than a
certain confidence threshold, and allow manual control of what was stored in
the text.

A few years ago tesseract was used to create a searchable archive of all
available documentation from the Free City of Danzig, the short-lived city
state that existed from 1920-1939 in what is now Gdańsk, Poland. Most of the
paperwork (and there was a *lot*: very big on public participation in deciding
on how they were going to be run) was printed in Fraktur (aka blackletter,
gothic or textura). Tesseract was trained to read this script, and now the
parameters live in the 'tesseract-ocr-frk' package for all to use. I wish they
could have done the same for the then-contemporary written script of
Sütterlin, one of the great "go home you're drunk" cursives.

For very automatic OCR on Linux, the ocrmypdf tool is quite amazing. Great way
of stress-testing your hardware, too.

Stewart
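
On the confidence-threshold wish above: Tesseract won't pause mid-run as far as I know, but it can write per-word confidences to a TSV file (for example `tesseract page.png out -l frk tsv`, where `page.png` and `out` are placeholder names). What follows is only a rough sketch of reviewing that file afterwards. The function name `low_confidence_words` and the cutoff of 60 are my own assumptions, not part of Tesseract.

```python
import csv
import io


def low_confidence_words(tsv_text, threshold=60.0):
    """List words from Tesseract TSV output whose confidence is below threshold.

    Hypothetical helper: returns (page_num, line_num, text, conf) tuples so
    that a person can review and correct those words by hand.
    """
    # QUOTE_NONE because recognised text may itself contain quote characters.
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    flagged = []
    for row in reader:
        text = (row.get("text") or "").strip()
        if not text:
            continue  # page, block, paragraph and line rows carry no text
        conf = float(row["conf"])
        if conf < 0:
            continue  # -1 marks rows that are not recognised words
        if conf < threshold:
            flagged.append(
                (int(row["page_num"]), int(row["line_num"]), text, conf)
            )
    return flagged
```

To try it on real output, read the file in first, e.g. `low_confidence_words(open("out.tsv", encoding="utf-8").read())`, then fix the flagged words by hand. It flags after the fact rather than stopping the OCR run, so it is only half of what I was wishing for.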