[tesseract-ocr] ocr on real (dirty) printing

Peter Joh. Brunner Thu, 02 Apr 2015 07:21:13 -0700

I have a problem using Tesseract with German Fraktur.

First, the text to be OCR'd is real printed text from about 1930.
The printing is a little dirty, i.e. there are small dots and strokes
between the letters.
Although these are far smaller than the actual letters, they are
recognized as normal letters.


Is there a way to pass parameters to Tesseract so that it
. either ignores letters that do not fit the majority of the other
  letters,
. or only uses letters within a given size range,
. or first makes the boxes,
  then lets me correct the boxes, by hand or by program,
  and finally recognizes the text using the corrected boxes?
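
To make the second and third options concrete, here is a minimal sketch of what "correcting the boxes by program" could look like. It assumes the usual box-file layout of one line per symbol, "<symbol> <left> <bottom> <right> <top> <page>", and drops every box whose larger side is well below the median. The function names and the 0.4 ratio are my own placeholders, not anything Tesseract provides, and how to feed the filtered boxes back into recognition is exactly the open question.

```python
from statistics import median


def parse_box_line(line):
    """Parse one box-file line: '<symbol> <left> <bottom> <right> <top> <page>'."""
    parts = line.strip().rsplit(" ", 5)
    symbol = parts[0]
    left, bottom, right, top, page = (int(p) for p in parts[1:])
    return symbol, left, bottom, right, top, page


def drop_small_boxes(lines, min_ratio=0.4):
    """Keep boxes whose larger side is at least min_ratio times the median.

    min_ratio is a hypothetical tuning value; it has to be chosen per scan.
    """
    boxes = [parse_box_line(line) for line in lines if line.strip()]
    if not boxes:
        return []
    # Use the larger of width and height so that flat glyphs such as
    # hyphens are not mistaken for specks.
    sizes = [max(right - left, top - bottom)
             for _, left, bottom, right, top, _ in boxes]
    threshold = min_ratio * median(sizes)
    return [box for box, size in zip(boxes, sizes) if size >= threshold]
```

A filter like this would also throw away genuine periods and commas, so it probably only suits material such as name lists, where punctuation matters little.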

A dictionary-based solution is not possible, because the text consists
only of names of persons and places.

Another thing I wonder about:
when I OCR an image from .tiff to .txt
and run makebox on the same image,
a few letters are recognized differently!
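
To pin down which letters differ, the two outputs can be compared character by character. The following is only a sketch: it assumes the same box-file layout as above ("<symbol> <left> <bottom> <right> <top> <page>"), ignores all whitespace in the .txt output, and uses Python's standard difflib. The function names are placeholders of mine.

```python
import difflib


def box_symbols(box_lines):
    """Return the symbols from box-file lines, joined into one string."""
    return "".join(line.strip().rsplit(" ", 5)[0]
                   for line in box_lines if line.strip())


def diff_text_vs_boxes(text, box_lines):
    """List the spans where the .txt output and the box-file symbols differ."""
    # Whitespace is dropped because the box file has no entries for spaces.
    text_chars = "".join(ch for ch in text if not ch.isspace())
    symbols = "".join(ch for ch in box_symbols(box_lines) if not ch.isspace())
    matcher = difflib.SequenceMatcher(None, text_chars, symbols, autojunk=False)
    return [(tag, text_chars[i1:i2], symbols[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]
```

Each returned tuple holds the kind of difference, the characters from the .txt output, and the characters from the box file at that position.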

Thanks in advance for any help.

