If there were a "one size fits all" answer, it'd probably already be done 
automatically by Tesseract, but you might have a look at some of the work 
the eMOP project did to deal with OCRing challenging texts at scale to see 
if you can reuse some of their tooling or learnings (although a lot of what 
they were doing was focused on custom training, as opposed to other types 
of image processing).

Tom

On Thursday, August 18, 2016 at 4:18:40 PM UTC-4, Luis Zertuche wrote:
>
> Hello Tesseract Gurus!
>
> I'm working on a pdf2text extraction for legal documents. I've done some 
> searches and found tips to improve quality, but I was wondering if someone 
> here can provide info beyond the basics. Image processing-wise I've been 
> resizing to 600 dpi, correcting for skew angle, and [denoising with a median 
> filter, contrast stretching, dilating with a small structuring element and 
> Otsu thresholding]. All those things improved the results only for a subset 
> of the documents, and given how widely they vary in acquisition quality, 
> noise level and contrast, I've realized the imaging pipeline is not one size 
> fits all. I'm considering creating parallel image processing pipelines, 
> doing OCR on all of them and just picking the best.
>
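For readers unfamiliar with the last step in that list, here is a minimal sketch of Otsu thresholding in plain Python. It assumes the page is already available as a flat list of 8-bit grayscale values; the function names `otsu_threshold` and `binarize` are illustrative and not part of Tesseract or of the poster's code. In practice an imaging library would normally do this step (for example OpenCV's `cv2.threshold` with the `cv2.THRESH_OTSU` flag).

```python
def otsu_threshold(pixels):
    """Return the threshold t in 0..255 that maximizes between-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    `pixels` is assumed to be a flat iterable of integers in 0..255.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # intensity sum of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue          # lower class still empty
        w1 = total - w0
        if w1 == 0:
            break             # upper class empty from here on
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, t):
    """Map each pixel to 0 (value <= t) or 255 (value > t)."""
    return [255 if p > t else 0 for p in pixels]
```

Because the threshold is derived from a single global histogram, pages with uneven lighting or stains can binarize badly, which may be one reason the same pipeline helps only some of the documents.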
> *1. Can anyone comment on what would be some good variations of an imaging 
> pipeline for 'high variance dirty-text'? Or alternatively, can anyone 
> think of an imaging pipeline that would cover a wider range of document 
> quality?*
>
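The "run several pipelines and keep the best" idea could be sketched roughly as below. This is only an outline under stated assumptions: `pipelines`, `ocr` and `score` are hypothetical callables that the caller supplies (a preprocessing function per variant, a wrapper around Tesseract, and a text-quality metric), and nothing here is a Tesseract API.

```python
def pick_best(image, pipelines, ocr, score):
    """Run every preprocessing variant, OCR each result, return the best.

    pipelines: dict mapping a variant name to a function image -> image
    ocr:       function image -> text (hypothetical wrapper around the OCR engine)
    score:     function text -> number, higher is better
    Returns a (score, variant_name, text) tuple. On ties, the first
    variant in dictionary order wins.
    """
    results = []
    for name, preprocess in pipelines.items():
        text = ocr(preprocess(image))
        results.append((score(text), name, text))
    return max(results, key=lambda r: r[0])
```

Keeping the variant name in the result makes it possible to log which pipeline won for each document, which may show whether a small number of variants already covers most of the collection.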
> As far as Tesseract parameters go, I've put together a small parameter 
> exploration loop evaluating iterations with a text-quality metric (M2, based 
> on number of dictionary words, using pyenchant). Here is an example for the 
> linesize value parameter for a single document:
>
> +++For linesize value, 1.25, M2 value is 0.661157024793
> +++For linesize value, 1.35, M2 value is 0.661157024793
> +++For linesize value, 1.45, M2 value is 0.644628099174
> +++For linesize value, 1.55, M2 value is 0.611111111111
> +++For linesize value, 1.65, M2 value is 0.0
> *+++For linesize value, 1.75, M2 value is 0.693693693694*
> +++For linesize value, 1.85, M2 value is 0.672413793103
> +++For linesize value, 1.95, M2 value is 0.631578947368
> +++For linesize value, 2.05, M2 value is 0.0
>
> Here the best linesize value is 1.75 for that document, which yields good 
> results (the ). From this info,
>
> *2. Can anyone recommend some good parameters to apply this method with? 
> Any other tips on combining parameters into something more general, or any 
> other exploration tips?*
>
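A dictionary-word metric and the sweep around it might look something like the sketch below. The exact definition of M2 is not given in the post, so `dictionary_ratio` is only one plausible reading (share of alphabetic tokens found in a word list), and `run_ocr` is a hypothetical callable that runs OCR with a given parameter value and returns the text. A plain Python set stands in for the dictionary to keep the example self-contained; with pyenchant the membership test could be replaced by `enchant.Dict("en_US").check(token)`.

```python
import re


def dictionary_ratio(text, vocabulary):
    """Fraction of alphabetic tokens in `text` that appear in `vocabulary`.

    Returns None when the text contains no tokens at all, so that an
    empty OCR result is not confused with a genuinely poor one.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return None
    hits = sum(1 for token in tokens if token in vocabulary)
    return hits / len(tokens)


def sweep(values, run_ocr, vocabulary):
    """Score the OCR output for each parameter value; return (score, value).

    Values whose output has no tokens are skipped. Returns None if every
    run was empty. On ties, the earliest value in `values` wins.
    """
    scored = []
    for value in values:
        s = dictionary_ratio(run_ocr(value), vocabulary)
        if s is not None:
            scored.append((s, value))
    return max(scored, key=lambda pair: pair[0]) if scored else None
```

Separating "no text at all" from "low score" may be worth doing here: whether the 0.0 entries in the listing come from empty output is not stated in the post, but if they do, they say more about a failed run than about text quality.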
> Thanks for reading! Any other potentially useful tips or info would be 
> greatly appreciated, whether it's on the image processing or on the 
> Tesseract parameters.
>
> Best, Luis. 
>
>
>
