Hello, it always helps to provide a few problematic images (a picture is worth a thousand words ;-) ).
For preprocessing, look at scantailor: there are several forks with different improvements, and some also provide a CLI version. IMO it should be able to replace unpaper. I also recommend checking https://github.com/ImageProcessing-ElectronicPublications: it has a great collection of image-processing tools, including implementations of various thresholding algorithms.

Zdenko

On Tue, 15 Oct 2024 at 17:46, [email protected] <[email protected]> wrote:

> I work on corpus research with texts whose scanning quality may be
> abysmal; yet the texts themselves are valuable. Based on my previous
> experience, as well as the comments and complaints that I notice, I don't
> think that we will ever be able to fully automate the whole process of OCR
> with reliable fidelity, but the situation is not entirely hopeless, since
> the human-expert side of it could be "easily" and optimally managed
> through a corpus of known good data maintained by experts (such as
> wikipedia and gutenberg.org) and by managing human proofreaders through a
> GUI (directing them exactly to where OCR seems to have gotten it wrong,
> even presenting contextual options to the user, keeping an editing
> history for each text, ...). OCR mistakes which could be easily handled
> based on context using corpora are: "another" OCRed as "mother", and
> "Andre ?\farie Arnpere" in an equally messy yet hopeful context such as
> "Andre ?\farie Arnpere ( 1775--1836) , professor of
> mathematical analysis and n1echanics at the f::cole Polytechnique".
>
> I am especially interested in the following aspects:
> 1) options for pre-processing images so that tesseract works optimally;
> since I will be working mostly with scientific texts, different font
> sizes and typefaces, glyphs and multi-encoded text (texts containing
> formulas, charts, annotated pictures) must be handled well or at least
> flagged;
> 2) images within the visual text should be spotted and extracted
> separately from the actual text (including text segments which are part
> of the images, think cartoons):
>
> https://superuser.com/questions/1857597/preferably-linux-based-os-utility-to-extract-images-from-image-based-pdf-file
> 3) relating to point 2), tables should also be handled well;
> 4) multilingually encoded texts (which I think tesseract handles well).
> ~
> An important project such as unpaper (preprocessing of pages to be fed
> to tesseract) was apparently abandoned without accompanying
> documentation of the mathematical basis of its algorithm:
>
> // __ document algorithms
>
> https://github.com/unpaper/unpaper/issues/6
> ~
> I have long noticed complaints about tesseract-ocr's blanket
> assumptions about font size, which make it fail on texts with multiple
> font sizes, such as flyers, and on texts with a curved gradient (either
> by artistic design or partly as an artifact of lousy scanning; in some
> of the texts you can even see the fingers of the person scanning them).
> I think troubleshooting those problems is not that difficult.
> Given the nature and degree of complexity of the problem at hand, I am
> mostly interested in open, functionally described and well-documented
> step-by-step approaches, not "results".
> Do you know of any similar prior art?
> Any shared experiences and general suggestions regarding possible
> roadblocks that such a project may encounter?
> My search on:
>
> https://groups.google.com/g/tesseract-ocr/search?q=pre-processing%20unpaper
> resulted in only 8 hits, which were somewhat helpful.
> lbrtchx
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/8f3510a6-f019-4ef8-9a79-0ba86754e2dcn%40googlegroups.com
> .

--
To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/CAJbzG8ySC-PbXtJhp%3Dv602U5aN7VNgTa3P17OfB6WtoC_PrnnA%40mail.gmail.com.
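
As a rough illustration of the corpus-based checking described in the quoted message, here is a minimal Python sketch that flags OCR tokens missing from a known-good vocabulary and proposes close matches for a human reviewer. The names `KNOWN_GOOD` and `flag_suspects`, the tiny word list, and the 0.75 similarity cutoff are all assumptions made for this example; a real setup would build the vocabulary from a trusted corpus. It uses only the standard library (`difflib`) and is not tied to tesseract in any way.

```python
import difflib
import re

# Hypothetical stand-in for a vocabulary built from trusted text
# (for example, word lists derived from a curated corpus).
KNOWN_GOOD = [
    "andre", "marie", "ampere", "professor", "of", "mathematical",
    "analysis", "and", "mechanics", "at", "the", "ecole",
    "polytechnique", "another", "mother",
]


def flag_suspects(text, vocabulary=KNOWN_GOOD, cutoff=0.75):
    """Return (token, suggestions) pairs for tokens not in the vocabulary.

    Suggestions are vocabulary words whose difflib similarity ratio to
    the token is at least `cutoff`, best match first. An empty list
    means the token looks wrong but nothing similar is known.
    """
    vocab = set(vocabulary)
    suspects = []
    for token in re.findall(r"[A-Za-z0-9]+", text):
        if token.isdigit():
            continue  # leave years and other plain numbers alone
        word = token.lower()
        if word in vocab:
            continue  # a known word; only context could show it is wrong
        suggestions = difflib.get_close_matches(
            word, vocabulary, n=3, cutoff=cutoff
        )
        suspects.append((token, suggestions))
    return suspects


if __name__ == "__main__":
    line = "Andre Arnpere, professor of mathematical analysis and n1echanics"
    for token, suggestions in flag_suspects(line):
        print(token, "->", suggestions)
```

Note the limit of this approach: a misreading such as "mother" for "another" is itself a valid word, so a plain vocabulary lookup will not flag it. Catching that kind of error needs context (for instance word-sequence statistics from the corpus), which this sketch does not attempt.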

