I've been thinking recently about user-patterns and unicharambigs. The functionality user-patterns provides can be very useful, but it's pretty Latin-specific in the patterns it handles (upper/lower case, numeric, etc.). It would be *really* useful if it could be extended to match arbitrary attributes set in the unicharset file.
To give you a concrete example: with my Ancient Greek training, I needed to ensure that characters with a 'breathing' mark over them were only recognised on the first character of a word, or the second in the case of some digraphs. I ended up doing this by creating a ton of unicharambigs rules (about 35,000; I wrote a program to generate them), mostly of the form:

    2 αἀ 2 αά 1

What I'm really asking for is the ability to match on character types (as user-patterns does), but in unicharambigs, and with the ability to 'tag' each character with any number of 'attributes'. So in my case I could replace the 35,000 unicharambigs rules with something like this:

    # unicharset
    α 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 95 0 0 nobreath,alpha
    ἀ 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 78 0 0 breath,smoothbreath,alpha
    ἂ 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 222 0 0 breath,smoothbreath,grave,alpha
    ἁ 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 234 0 0 breath,roughbreath,alpha
    ἅ 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 11 0 0 breath,roughbreath,alpha,acute

    # unicharambigs
    2 <nobreath><smoothbreath,alpha> 2 [matched]ὰ 1
    2 <nobreath><roughbreath,alpha> 2 [matched]ά 1

This example alone would replace hundreds of lines of my current unicharambigs rules. I know I'm getting slightly into regex territory here, but I think that could be very useful. When characters carry a variety of diacritics, the number of possible combinations for a rule like my breathing rules above gets much larger.

This is obviously quite a big change, but for my use case it would make a big difference. I'm curious whether it would be useful to many other people too?

Nick
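For anyone who wants to try the idea without any change to Tesseract, here is a minimal Python sketch of a preprocessor that expands attribute-based rules into literal unicharambigs rules. It rests on several assumptions: the trailing attribute column and the `<...>` / `[matched]` syntax are only the proposal from this post, not something Tesseract reads today. The function names `load_attributes` and `expand_rule` are made up for illustration. `[matched]` is taken to mean "the first matched character", and the output copies the shape of the literal rule quoted above.

```python
import re
from itertools import product

# Hypothetical attribute-tagged unicharset entries from the proposal above.
# The last comma-separated field is the proposed extension, not current
# Tesseract syntax.
UNICHARSET = """\
α 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 95 0 0 nobreath,alpha
ἀ 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 78 0 0 breath,smoothbreath,alpha
ἂ 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 222 0 0 breath,smoothbreath,grave,alpha
ἁ 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 234 0 0 breath,roughbreath,alpha
ἅ 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 11 0 0 breath,roughbreath,alpha,acute
"""


def load_attributes(text):
    """Map each character to the set of attributes in its last field."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank lines and single-field lines
        attrs[fields[0]] = set(fields[-1].split(","))
    return attrs


def expand_rule(rule, attrs):
    """Expand one pattern rule into a list of literal rules.

    Assumes the form: N <a,b><c> M [matched]X TYPE
    where each <...> group matches any character carrying all listed
    attributes.
    """
    _, pattern, _, replacement, rule_type = rule.split()
    groups = [set(g.split(",")) for g in re.findall(r"<([^>]*)>", pattern)]
    # For each group, collect the characters that carry every required attribute.
    candidates = [
        sorted(ch for ch, tags in attrs.items() if group <= tags)
        for group in groups
    ]
    literal_rules = []
    for combo in product(*candidates):
        # Assumption: [matched] stands for the first matched character.
        target = replacement.replace("[matched]", combo[0])
        literal_rules.append(
            f"{len(combo)} {''.join(combo)} {len(target)} {target} {rule_type}"
        )
    return literal_rules


if __name__ == "__main__":
    attrs = load_attributes(UNICHARSET)
    for rule in (
        "2 <nobreath><smoothbreath,alpha> 2 [matched]ὰ 1",
        "2 <nobreath><roughbreath,alpha> 2 [matched]ά 1",
    ):
        for literal in expand_rule(rule, attrs):
            print(literal)
```

With only the five entries above, each pattern rule expands to two literal rules. With a full polytonic character set the same two lines would expand to the hundreds of rules mentioned earlier.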

