Actually, I have had this problem of symbols being chopped up too. I do not blame the engine, though: some symbols in my language are two or three times as wide as the average. E.g. మూ [muu] is almost three times as wide as రీ [rii]. A similar problem must exist in English with m, i, etc. Unfortunately for me, the most frequent letter in Telugu is ము [mu], which is being chopped up into a మ [ma] and a tail that gets read as ఎ [é]. Are there any variables or code segments that I can look into and tweak for this?
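The మ + ఎ split described above can at least be detected in the recognised text without a dictionary. The following is only a sketch of a post-processing check, not anything inside Tesseract: the function names and the `REPAIRS` table are hypothetical, and the code point ranges are taken from the Unicode Telugu block (independent vowels U+0C05 to U+0C14). It flags any independent vowel that directly follows another Telugu character, and optionally rewrites the one split reported here.

```python
# Post-processing sketch (an assumption, not part of Tesseract):
# flag independent Telugu vowels that occur in the middle of a word.

# Unicode Telugu block: independent vowels U+0C05..U+0C14 (the rare
# vocalic vowels U+0C60 and U+0C61 are ignored in this sketch).
INDEPENDENT_VOWELS = range(0x0C05, 0x0C15)
TELUGU_BLOCK = range(0x0C00, 0x0C80)

MA = "\u0c2e"        # మ  consonant ma
E = "\u0c0e"         # ఎ  independent vowel e
MU = "\u0c2e\u0c41"  # ము  ma + vowel sign u

# Hypothetical repair table: the only entry is the split reported
# in this message (ము recognised as మ followed by ఎ).
REPAIRS = {MA + E: MU}


def misplaced_vowels(text):
    """Return the indexes of independent vowels that directly follow
    another Telugu character, i.e. that are not word-initial."""
    hits = []
    for i in range(1, len(text)):
        if (ord(text[i]) in INDEPENDENT_VOWELS
                and ord(text[i - 1]) in TELUGU_BLOCK):
            hits.append(i)
    return hits


def repair(text):
    """Apply the known repairs; anything else stays untouched."""
    for wrong, right in REPAIRS.items():
        text = text.replace(wrong, right)
    return text
```

Word-initial vowels are left alone, so the check follows the rule as stated in this thread; it would also flag any legitimate mid-word standalone vowel, so its hits are better treated as suspects than as certain errors.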
Secondly, the symbol ము [mu] coming out as మ [ma] followed by the standalone vowel ఎ [é] is erroneous, because the symbol మ [ma] already carries a vowel and so a standalone vowel cannot follow it. Is there any way I can specify that some characters cannot come after certain other characters? For example, can I declare ma-é an invalid combination? (In general, standalone vowel symbols like ఎ [é] cannot occur in the middle of a word, since each symbol is a syllable in Telugu, as in Tibetan.) I am looking for non-dictionary-based solutions.

Thanks a lot.
Rakesh.

On 12 August 2010 15:42, Jimmy O'Regan <[email protected]> wrote:
> On 12 August 2010 10:24, Eugene Reimer <[email protected]> wrote:
> > You could probably improve its ability to recognize "00" as two 0's by
> > training it on such paired symbols.
> >
> > Mind you, I have also been surprised by cases where a perfectly clear and
> > flawless symbol gets subdivided, like a N becoming |\| or an H becoming I-I,
> > which indicates that tesseract has code to subdivide blobs other than based
> > on there being "space" between them. However that code seems to behave in
> > erratic ways.
>
> Actually, on this image, I get:
> Mobile (65) 81(1) 6(l)2
>
> which is more or less the behaviour you're talking about; however, you
> should bear in mind that what looks like a solid shape to you does not
> necessarily look like a solid shape to the recogniser.
>
> Some (possibly) related variables:
>
> INT_VAR(repair_unchopped_blobs, 1, "Fix blobs that aren't chopped");
> double_VAR(tessedit_certainty_threshold, -2.25, "Good blob limit");
> BOOL_VAR(fragments_guide_chopper, FALSE,
>          "Use information from fragments to guide chopping process");
>
> INT_VAR(segment_adjust_debug, 0,
>         "Segmentation adjustment debug");
> BOOL_VAR(assume_fixed_pitch_char_segment, 0,
>          "include fixed-pitch heuristics in char segmentation");
> BOOL_VAR(use_new_state_cost, 0,
>          "use new state cost heuristics for segmentation state evaluation");
> double_VAR(heuristic_segcost_rating_base, 1.25,
>            "base factor for adding segmentation cost into word rating."
>            "It's a multiplying factor, the larger the value above 1, "
>            "the bigger the effect of segmentation cost.");
> double_VAR(heuristic_weight_rating, 1,
>            "weight associated with char rating in combined cost of state");
> double_VAR(heuristic_weight_width, 0,
>            "weight associated with width evidence in combined cost of state");
> double_VAR(heuristic_weight_seamcut, 0,
>            "weight associated with seam cut in combined cost of state");
> double_VAR(heuristic_max_char_wh_ratio, MAX_SQUAT,
>            "max char width-to-height ratio allowed in segmentation");
>
> --
> <Leftmost> jimregan, that's because deep inside you, you are evil.
> <Leftmost> Also not-so-deep inside you.
>
> --
> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to [email protected].
> For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en.
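
One way to experiment with the variables quoted above is a Tesseract config file, which is plain text with one "name value" pair per line and is named after the language on the command line. The sketch below only renders such a file and assembles the command. The variable names come from the quoted list, but the chosen values, the helper names, the file name `chop_tweaks` and the language code `tel` are assumptions for illustration, and whether any of these settings actually stops ము from being chopped would have to be tested.

```python
# Sketch: render a Tesseract config file and the matching command line.
# Variable names are from the quoted list; the VALUES are guesses to
# experiment with, not recommended settings.
OVERRIDES = {
    "segment_adjust_debug": 1,      # print segmentation debug output
    "heuristic_weight_width": 1,    # give width evidence some weight
    "heuristic_weight_seamcut": 1,  # give seam-cut evidence some weight
}


def render_config(overrides):
    """Return config-file text: one 'name value' pair per line."""
    return "".join(f"{name} {value}\n" for name, value in overrides.items())


def build_command(image, output_base, lang, config_path):
    """Argument list for: tesseract IMAGE OUTPUTBASE -l LANG CONFIGFILE."""
    return ["tesseract", image, output_base, "-l", lang, config_path]


if __name__ == "__main__":
    # Hypothetical file names and language code, for illustration only.
    print(render_config(OVERRIDES), end="")
    print(" ".join(build_command("page.tif", "page", "tel", "chop_tweaks")))
```

Saving the rendered text as `chop_tweaks` and changing one value at a time makes it easier to see which variable, if any, changes how a wide glyph is segmented.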

