[tesseract-ocr] OCR and tips from Tesseract gurus!

Luis Zertuche Thu, 18 Aug 2016 13:19:19 -0700

Hello Tesseract Gurus!

I'm working on pdf2text extraction for legal documents. I've done some 
searches and found tips to improve quality, but I was wondering if someone 
here can provide info beyond the basics. Image processing-wise, I've been 
resizing to 600 dpi, correcting for skew angle, and [denoising with a median 
filter, contrast stretching, dilating with a small structuring element, and 
Otsu thresholding]. All of those steps improved the results for only a subset 
of the documents, and given how widely they vary in acquisition quality, 
noise level, and contrast, I've realized the imaging pipeline is not one size 
fits all. I'm considering creating parallel image processing pipelines, 
running OCR on all of them, and just picking the best result.

*1. Can anyone comment on what would be some good variations of an imaging
pipeline for 'high-variance dirty text'? Or, alternatively, can anyone
think of an imaging pipeline that would cover a wider range of document
quality?*
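
To make the "run several pipelines and pick the best" idea concrete, here is a minimal sketch of the selection step only. The function name `pick_best_pipeline` and the `pipelines`, `ocr`, and `score` arguments are hypothetical placeholders, not part of Tesseract or of any existing code: each pipeline is assumed to be a callable that takes an image and returns a preprocessed image, `ocr` is assumed to turn an image into text, and `score` is assumed to be some text-quality metric where higher is better.

```python
from typing import Any, Callable, Dict, Tuple


def pick_best_pipeline(
    image: Any,
    pipelines: Dict[str, Callable[[Any], Any]],
    ocr: Callable[[Any], str],
    score: Callable[[str], float],
) -> Tuple[str, str, float]:
    """Run every preprocessing pipeline, OCR each result, keep the best.

    All arguments are hypothetical stand-ins:
      pipelines -- name -> function(image) -> preprocessed image
      ocr       -- function(image) -> recognized text
      score     -- function(text) -> quality score, higher is better
    Returns (pipeline name, recognized text, score) of the winner.
    On a tie, the pipeline listed first wins.
    """
    if not pipelines:
        raise ValueError("at least one pipeline is required")

    best_name, best_text, best_score = None, "", float("-inf")
    for name, preprocess in pipelines.items():
        text = ocr(preprocess(image))
        current = score(text)
        # Strict comparison keeps the earlier pipeline when scores are equal.
        if current > best_score:
            best_name, best_text, best_score = name, text, current
    return best_name, best_text, best_score
```

In practice `ocr` might wrap a call to Tesseract and each pipeline might be one combination of the denoising, stretching, dilation, and thresholding steps listed above, but those choices are left open here. The cost grows linearly with the number of pipelines, since every variant gets its own OCR pass.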

As far as Tesseract parameters go, I've put together a small parameter
exploration loop that evaluates each iteration with a text-quality metric
(M2, based on the number of dictionary words, using pyenchant). Here is an
example for the linesize value parameter for a single document:

+++For linesize value, 1.25, M2 value is 0.661157024793

+++For linesize value, 1.35, M2 value is 0.661157024793

+++For linesize value, 1.45, M2 value is 0.644628099174

+++For linesize value, 1.55, M2 value is 0.611111111111

+++For linesize value, 1.65, M2 value is 0.0

*+++For linesize value, 1.75, M2 value is 0.693693693694*

+++For linesize value, 1.85, M2 value is 0.672413793103

+++For linesize value, 1.95, M2 value is 0.631578947368

+++For linesize value, 2.05, M2 value is 0.0

Here the best linesize value is 1.75 for that document, which yields good
results (the ). From this info,

*2. Can anyone recommend some good parameters to apply this
method to? Any other tips on combining parameters into something more
general, or any other exploration tips?*
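
For reference, the exploration loop described above could look roughly like the sketch below. The exact definition of M2 is not spelled out in this message, so `dictionary_word_ratio` is an assumption: the fraction of alphabetic tokens that a dictionary accepts, with 0.0 returned when the OCR output contains no tokens at all. The names `sweep_parameter`, `run_ocr`, and `is_word` are hypothetical; `run_ocr` stands for whatever runs Tesseract with one parameter value and returns the text.

```python
import re
from typing import Callable, Dict, Iterable, Tuple

# Alphabetic tokens, optionally with one apostrophe part (e.g. "court's").
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def dictionary_word_ratio(text: str, is_word: Callable[[str], bool]) -> float:
    """Assumed stand-in for M2: share of tokens that are dictionary words."""
    tokens = WORD_RE.findall(text)
    if not tokens:
        return 0.0  # empty OCR output scores zero
    hits = sum(1 for token in tokens if is_word(token))
    return hits / len(tokens)


def sweep_parameter(
    values: Iterable[float],
    run_ocr: Callable[[float], str],
    is_word: Callable[[str], bool],
) -> Tuple[float, Dict[float, float]]:
    """Score the OCR output for each value; return (best value, all scores).

    run_ocr is a hypothetical callable: parameter value -> recognized text.
    On a tie, the value that was tried first is reported as best.
    """
    scores = {value: dictionary_word_ratio(run_ocr(value), is_word)
              for value in values}
    if not scores:
        raise ValueError("no parameter values were given")
    best = max(scores, key=scores.get)
    return best, scores
```

With pyenchant, `is_word` could be the `check` method of an `enchant.Dict("en_US")` object. A ratio like this rewards any output made of real words, so it says nothing about whether those words are the right ones for the page; it is only a rough proxy when no ground truth is available.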

Thanks for reading! Any other potentially useful tips or info would be
greatly appreciated, whether it's on the image processing or on the
Tesseract parameters.

Best, Luis.
