I've been thinking recently about user-patterns and unicharambigs.

The functionality that user-patterns provides can be very useful, but
the patterns it handles are pretty Latin-specific (upper/lower case,
numeric, etc). It would be *really* useful if it could be extended
to match arbitrary attributes, which could be set in the unicharset
file.

To give you a concrete example, with my Ancient Greek training, I
needed to ensure that characters with a 'breathing' mark over them
were only recognised on the first character of a word, or the second
in the case of some digraphs. I ended up doing this by creating a
ton of unicharambigs rules (about 35,000; I wrote a program to do it),
mostly of the form:

2       αἀ      2       αά      1

What I'm really asking for is matching on character types (as
user-patterns does), but for unicharambigs, with the ability to 'tag'
each character with any number of 'attributes'.

So in my case I could replace the 35,000 unicharambigs rules with
something like this:

# unicharset
α 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 95 0 0 nobreath,alpha
ἀ 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 78 0 0 breath,smoothbreath,alpha
ἂ 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 222 0 0 breath,smoothbreath,grave,alpha
ἁ 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 234 0 0 breath,roughbreath,alpha
ἅ 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 11 0 0 breath,roughbreath,alpha,acute

# unicharambigs
2       <nobreath><smoothbreath,alpha>  2       [matched]ὰ      1
2       <nobreath><roughbreath,alpha>   2       [matched]ά      1

This example alone would replace hundreds of lines of my current
unicharambigs rules.
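
To make the intended semantics concrete, here is a rough Python sketch
of how the two attribute rules above could be expanded back into plain
rules. Everything in it is an assumption on my part: the trailing
attribute column, the <...> and [matched] syntax, and the helper names
parse_attrs and expand are all hypothetical and not part of Tesseract.
I'm also assuming that [matched] means "keep whatever character the
first slot matched", and the output just mirrors the layout of the rule
shown earlier.

```python
# Sketch only: the attribute column and the <...>/[matched] rule syntax are
# the proposal from this email, not an existing Tesseract feature.

# Sample unicharset lines with the proposed attribute list as the last field.
UNICHARSET = """\
α 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 95 0 0 nobreath,alpha
ἀ 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 78 0 0 breath,smoothbreath,alpha
ἂ 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 222 0 0 breath,smoothbreath,grave,alpha
ἁ 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 234 0 0 breath,roughbreath,alpha
ἅ 3 0,255,0,255,0,32767,0,32767,0,32767 NULL 11 0 0 breath,roughbreath,alpha,acute
"""


def parse_attrs(text):
    """Map each character to the set of attributes in its last field."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        if fields:
            attrs[fields[0]] = set(fields[-1].split(","))
    return attrs


def expand(attrs, first, second, literal, rule_type=1):
    """Expand <first><second> -> [matched]literal into plain rules.

    first and second are sets of attributes a character must carry.
    Assumption: [matched] keeps the character matched by the first slot.
    """
    rules = []
    for a in (c for c, tags in attrs.items() if first <= tags):
        for b in (c for c, tags in attrs.items() if second <= tags):
            rules.append(f"2\t{a}{b}\t2\t{a}{literal}\t{rule_type}")
    return rules


ATTRS = parse_attrs(UNICHARSET)

# Equivalent of: 2  <nobreath><smoothbreath,alpha>  2  [matched]ὰ  1
for rule in expand(ATTRS, {"nobreath"}, {"smoothbreath", "alpha"}, "ὰ"):
    print(rule)
```

With the five sample characters this prints two plain rules, one for
each smooth-breathing alpha. With a full unicharset, every character
tagged nobreath would be paired with every matching breathing
character, which is where the hundreds of lines come from.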

I know I'm getting slightly into regex territory here, but I think
that could be very useful. When characters have a variety of
diacritics, the number of combinations that a rule like the breathing
rules above has to cover gets much larger.

This is obviously quite a big change, but for my use case it would
make a big difference. I'm curious whether it would be useful to
many other people too?

Nick
