Hello Tesseract Gurus!

I'm working on PDF-to-text extraction for legal documents. I've done some
searches and found tips to improve quality, but I was wondering if someone
here can provide info beyond the basics. Image-processing-wise, I've been
resizing to 600 dpi, correcting for skew angle, and [denoising with a median
filter, contrast stretching, dilating with a small structuring element, and
Otsu thresholding]. All of those steps improved the results for only a subset
of the documents, and given how widely they vary in acquisition quality,
noise level and contrast, I've realized the imaging pipeline is not
one-size-fits-all. I'm considering creating parallel image processing
pipelines, running OCR on all of them, and just picking the best result.

*1. Can anyone comment on what would be some good variations of an imaging
pipeline for 'high-variance dirty text'? Or alternatively, can anyone
think of an imaging pipeline that would cover a wider range of document
quality?*
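
To make the "run every variant and keep the best" idea concrete, here is a
minimal sketch of the selection step. It is only an illustration under
stated assumptions: `pick_best`, the `pipelines` mapping, and the `ocr` and
`score` callables are hypothetical names, not part of Tesseract or any
library. In practice `ocr` might wrap something like
`pytesseract.image_to_string`, and `score` could be a dictionary-word metric.

```python
from typing import Callable, Dict, Tuple, TypeVar

Image = TypeVar("Image")  # whatever image type the pipelines operate on


def pick_best(
    image: Image,
    pipelines: Dict[str, Callable[[Image], Image]],
    ocr: Callable[[Image], str],
    score: Callable[[str], float],
) -> Tuple[str, str, float]:
    """Run every preprocessing variant, OCR each result, keep the best one.

    Returns (pipeline_name, text, score) for the highest-scoring variant.
    All names here are illustrative placeholders.
    """
    if not pipelines:
        raise ValueError("at least one pipeline is required")

    results = []
    for name, preprocess in pipelines.items():
        text = ocr(preprocess(image))
        results.append((name, text, score(text)))

    # max() returns the first maximal entry, so on ties the pipeline
    # listed first in the dict wins.
    return max(results, key=lambda r: r[2])
```

Each pipeline is just a function from image to image, so a "light" variant
(for example, deskew only) and a "heavy" variant (denoise, stretch,
threshold) can sit side by side and be compared per document.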

As for Tesseract parameters, I've put together a small parameter
exploration loop that scores each iteration with a text-quality metric (M2,
based on the number of dictionary words, using pyenchant). Here is an
example for the linesize parameter on a single document:

+++For linesize value, 1.25, M2 value is 0.661157024793

+++For linesize value, 1.35, M2 value is 0.661157024793

+++For linesize value, 1.45, M2 value is 0.644628099174

+++For linesize value, 1.55, M2 value is 0.611111111111

+++For linesize value, 1.65, M2 value is 0.0

*+++For linesize value, 1.75, M2 value is 0.693693693694*

+++For linesize value, 1.85, M2 value is 0.672413793103

+++For linesize value, 1.95, M2 value is 0.631578947368

+++For linesize value, 2.05, M2 value is 0.0

Here the best linesize value is 1.75 for that document, which yields good
results (the ). From this info,

*2. Can anyone recommend some good parameters to apply this method to? Any
other tips on combining parameters into something more general, or any
other exploration tips?*
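
For anyone who wants to reproduce this kind of sweep, here is a minimal
sketch of a dictionary-word metric and a one-parameter loop. It is an
assumption about how such a metric could be computed (the fraction of
alphabetic tokens found in a dictionary), not the exact M2 definition.
`dictionary_ratio`, `sweep`, `is_word` and `run_ocr` are hypothetical names;
with pyenchant, `is_word` could be `enchant.Dict("en_US").check`.

```python
import re
from typing import Callable, Dict, Iterable, Tuple


def dictionary_ratio(text: str, is_word: Callable[[str], bool]) -> float:
    """Fraction of alphabetic tokens accepted by `is_word` (0.0 if none)."""
    tokens = re.findall(r"[A-Za-z]+", text)
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if is_word(t)) / len(tokens)


def sweep(
    values: Iterable[float],
    run_ocr: Callable[[float], str],
    is_word: Callable[[str], bool],
) -> Tuple[float, Dict[float, float]]:
    """Score the OCR output for each parameter value; return (best, scores).

    `run_ocr(value)` is a placeholder for "run Tesseract with the parameter
    set to `value` and return the recognized text".
    """
    scores = {v: dictionary_ratio(run_ocr(v), is_word) for v in values}
    if not scores:
        raise ValueError("no parameter values given")
    best = max(scores, key=scores.get)
    return best, scores
```

With pytesseract, `run_ocr` could pass the value through the `config`
argument using Tesseract's `-c name=value` syntax. Note that empty OCR
output also scores 0.0 here, so it may be worth logging the token count
alongside the ratio to tell "no text" apart from "bad text".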

Thanks for reading! Any other potentially useful tips or info would be
greatly appreciated, whether it's on the image processing or on the
Tesseract parameters.

Best, Luis. 

