[tesseract-ocr] Re: ocr on real (dirty) printing

Peter Joh. Brunner Thu, 16 Apr 2015 02:38:21 -0700

Once again, with more information:

I have a problem using Tesseract with German Fraktur.

I am working with Tesseract 3.02.02 on SUSE Linux 13.2.

First, the text to be OCR'd is real printed text from about 1930.
The printing is a little dirty, i.e. there are small dots and strokes
between the letters.
Although these are far smaller than the actual letters, they are
interpreted as normal letters.

oes-frak.frak.exp017

Is there a way to pass parameters to Tesseract so that it
. either ignores letters that do not match the size of the majority of
the other letters,
. or only accepts letters within a given size range,
. or first produces the boxes, lets me correct them by hand or by
program, and finally recognizes the text using the corrected boxes?
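
The "correct the boxes by program" step could look roughly like the sketch below. It assumes the box file uses the usual Tesseract 3.x line layout (symbol, left, bottom, right, top, page, with the origin at the bottom-left), and the function names and the 0.4 threshold are made up for illustration. It only filters the box lines; it does not show how to feed them back into recognition. Note that a pure height filter would also drop genuine small marks such as periods and hyphens.

```python
from statistics import median


def parse_box_line(line):
    """Split one box-file line into (symbol, left, bottom, right, top, page)."""
    parts = line.split()
    symbol = parts[0]
    left, bottom, right, top, page = (int(p) for p in parts[1:6])
    return symbol, left, bottom, right, top, page


def drop_small_boxes(lines, min_fraction=0.4):
    """Keep only boxes at least min_fraction as tall as the median box.

    min_fraction is a hypothetical tuning value, not a Tesseract parameter.
    """
    boxes = [parse_box_line(line) for line in lines if line.strip()]
    if not boxes:
        return []
    # Box files count y upwards, so height is top minus bottom.
    heights = [top - bottom for _, _, bottom, _, top, _ in boxes]
    limit = min_fraction * median(heights)
    return [
        " ".join(str(field) for field in box)
        for box, height in zip(boxes, heights)
        if height >= limit
    ]


if __name__ == "__main__":
    sample = [
        "E 10 20 30 60 0",
        "l 32 20 40 58 0",
        ". 42 35 44 38 0",  # a speck between two letters
        "s 46 20 60 45 0",
    ]
    for line in drop_small_boxes(sample):
        print(line)
```

Using the median rather than a fixed pixel size keeps the filter independent of scan resolution, but it will misbehave on pages where specks outnumber real letters.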

I have already tried a config file that sets
textord_min_xheight 24
textord_xheight_mode_fraction 0.9
textord_xheight_error_margin 0.1
textord_descx_ratio_min 0.3
tessedit_redo_xheight FALSE
This changes some things, but it does nothing to make Tesseract ignore
the dots and strokes.
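
Since the parameters above do not suppress the specks, one alternative is to remove them from the image before Tesseract sees it. The following is a minimal sketch of that idea, not a Tesseract feature: it assumes the page has already been binarized into a grid of 0 (paper) and 1 (ink), and the function names and the 0.4 threshold are hypothetical. Loading the .tiff into such a grid and writing it back is left out, since that needs an imaging library.

```python
from collections import deque
from statistics import median


def connected_components(img):
    """Return the 8-connected ink blobs as lists of (row, col) pixels."""
    rows = len(img)
    cols = len(img[0]) if img else 0
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if not img[r][c] or seen[r][c]:
                continue
            seen[r][c] = True
            queue, pixels = deque([(r, c)]), []
            while queue:  # breadth-first flood fill of one blob
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def despeckle(img, min_fraction=0.4):
    """Return a copy of img with blobs much shorter than the median erased."""
    components = connected_components(img)
    out = [row[:] for row in img]
    if not components:
        return out
    heights = [
        max(y for y, _ in pixels) - min(y for y, _ in pixels) + 1
        for pixels in components
    ]
    limit = min_fraction * median(heights)
    for pixels, height in zip(components, heights):
        if height < limit:
            for y, x in pixels:
                out[y][x] = 0
    return out


if __name__ == "__main__":
    page = [
        [1, 0, 0, 0, 1],
        [1, 0, 0, 0, 1],
        [1, 0, 1, 0, 1],  # the middle pixel is an isolated speck
        [1, 0, 0, 0, 1],
        [1, 0, 0, 0, 1],
    ]
    for row in despeckle(page):
        print(row)
```

The same caveat applies as for any size-based filter: detached small parts of real characters, such as umlaut dots, i-dots and periods, are also short blobs, so the threshold would have to be chosen with care or combined with a position check.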

Here is an example:
the attached picture is recognized as the text
15 Ellser Exdmsund Mögsgzerg

A dictionary-based solution is not possible, because the text consists
only of names of persons and places.

Another thing I wonder about:
when I OCR an image from .tiff to .txt
and run makebox on the same image,
a few letters are recognized differently!

Thanks in advance for any help.
