I have actually had this problem of symbols being chopped up too.
I do not blame the engine, though: some symbols in my language are two or
three times the width of the average symbol.
E.g. మూ [muu] is almost three times as wide as రీ [rii]. A similar problem
must exist in English with m, i, etc.
Unfortunately for me, the most frequent letter in Telugu is ము [mu], which
is being chopped up into a మ [ma] and a tail that is recognised as ఎ [é].
Are there any variables or code segments that I can look into and tweak for
this?
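
If the variables Jimmy listed (quoted below) are the right place to start, I
assume they can be tried out from a config file with one "name value" pair
per line. A rough sketch in Python of generating such a file; the variable
names are from the quoted list, but the values are untested guesses of mine,
and render_config is just an illustrative helper, not part of Tesseract:

```python
def render_config(variables):
    """Return config text with one 'name value' pair per line."""
    return "".join(f"{name} {value}\n" for name, value in variables.items())

# Names taken from the quoted list; values are guesses, not recommendations.
overrides = {
    "fragments_guide_chopper": 1,
    "heuristic_weight_width": 0.5,
}

config_text = render_config(overrides)
print(config_text, end="")
```

As far as I understand, the resulting text would be saved to a file and that
file's name given as the last argument on the tesseract command line; where
the file has to live may depend on the version.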

Secondly, the symbol ము 'mu' coming out as మ 'ma' followed by the standalone
vowel ఎ 'é' is invalid, because the symbol మ 'ma' already contains a vowel
and so a standalone vowel cannot follow it.
Is there any way I can specify that some characters cannot come after
certain other characters? For example, can I specify that ma-é is an invalid
combination? (In general, standalone vowel symbols like ఎ 'é' cannot occur
in the middle of a word, since each symbol is a syllable in Telugu, as in
Tibetan.) I am looking for non-dictionary-based solutions.
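
To make the rule concrete, here is a minimal sketch in Python of the check I
have in mind, applied to recognised text after the fact. It is not anything
Tesseract provides; the function names are my own, and the rule is the
simplified one stated above (no standalone vowel after the first character of
a word):

```python
# Telugu standalone (independent) vowels: U+0C05..U+0C14, plus U+0C60 and
# U+0C61. The range also spans two unassigned code points, which is harmless.
INDEPENDENT_VOWELS = {chr(cp) for cp in range(0x0C05, 0x0C15)} | {"\u0C60", "\u0C61"}

def has_misplaced_vowel(word):
    """Return True if a standalone vowel occurs after the first character."""
    return any(ch in INDEPENDENT_VOWELS for ch in word[1:])

def pick_valid(candidates):
    """Keep only candidates in which every space-separated word passes."""
    return [
        text for text in candidates
        if not any(has_misplaced_vowel(word) for word in text.split())
    ]

# 'ma' + standalone 'é' is rejected; 'ma' + vowel sign 'u' (= mu) is kept.
print(pick_valid(["\u0C2E\u0C0E", "\u0C2E\u0C41"]))
```

A filter like this only rejects bad output; what I would really like is for
the engine itself to avoid such combinations while segmenting.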

Thanks a lot.
Rakesh.


On 12 August 2010 15:42, Jimmy O'Regan <[email protected]> wrote:

> On 12 August 2010 10:24, Eugene Reimer <[email protected]> wrote:
> > You could probably improve its ability to recognize "00" as two 0's by
> > training it on such paired symbols.
> >
> > Mind you, I have also been surprised by cases where a perfectly clear and
> > flawless symbol gets subdivided, like a N becoming |\| or an H becoming
> > I-I, which indicates that tesseract has code to subdivide blobs other
> > than based on there being "space" between them.  However that code seems
> > to behave in erratic ways.
>
> Actually, on this image, I get:
> Mobile (65) 81(1) 6(l)2
>
> which is more or less the behaviour you're talking about; however, you
> should bear in mind that what looks like a solid shape to you does not
> necessarily look like a solid shape to the recogniser.
>
> Some (possibly) related variables:
>
> INT_VAR (repair_unchopped_blobs, 1, "Fix blobs that aren't chopped");
> double_VAR(tessedit_certainty_threshold, -2.25, "Good blob limit");
> BOOL_VAR(fragments_guide_chopper, FALSE,
>         "Use information from fragments to guide chopping process");
>
> INT_VAR(segment_adjust_debug, 0,
>        "Segmentation adjustment debug");
> BOOL_VAR(assume_fixed_pitch_char_segment, 0,
>         "include fixed-pitch heuristics in char segmentation");
> BOOL_VAR(use_new_state_cost, 0,
>         "use new state cost heuristics for segmentation state evaluation");
> double_VAR(heuristic_segcost_rating_base, 1.25,
>           "base factor for adding segmentation cost into word rating."
>           "It's a multiplying factor, the larger the value above 1, "
>           "the bigger the effect of segmentation cost.");
> double_VAR(heuristic_weight_rating, 1,
>           "weight associated with char rating in combined cost of state");
> double_VAR(heuristic_weight_width, 0,
>           "weight associated with width evidence in combined cost of state");
> double_VAR(heuristic_weight_seamcut, 0,
>           "weight associated with seam cut in combined cost of state");
> double_VAR(heuristic_max_char_wh_ratio, MAX_SQUAT,
>           "max char width-to-height ratio allowed in segmentation");
>
>
>
> --
> <Leftmost> jimregan, that's because deep inside you, you are evil.
> <Leftmost> Also not-so-deep inside you.

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/tesseract-ocr?hl=en.
