Hello.

I am new to Tesseract, and I think it has the potential to solve the problem I 
am facing.
I am trying to detect certain words in images (see the attached sample).

Although the writing looks consistent in the sample, that is not always the 
case. The width and height vary quite a bit (up to 30%), and there are 
different styles, for each of which I may have to create a separate font.

Let's say I want to find all occurrences of the word "son" in the document. 
Based on what I have read, there are two possibilities.

   1. Train an English font by creating a box file that covers each letter in 
   the document. This is the known route, and I need suggestions on
      1. the number of samples I need to create box files for, and
      2. whether to create the box file directly from the sample or to create 
      another image with more space between the letters (see "wife", where the 
      boxes will overlap).
   2. Maybe I can treat a whole word as a single character in a new language, 
   so that the word "son" is one character rather than a combination of 
   letters. I haven't seen much documentation on this, but it might be 
   possible, since I have seen the digraph "ch" treated this way in some 
   documents. Thoughts?
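
To make the two options concrete, here is a minimal sketch in Python. It assumes the Tesseract 3.x box file layout of one "<symbol> <left> <bottom> <right> <top> <page>" entry per line, with the origin at the bottom-left of the image. The coordinates for "wife" are made up for illustration and are not measured from the sample, and the helper names (parse_box_line, overlaps, merge, format_box) are hypothetical. The sketch only checks which letter boxes overlap and shows what a single word-level box would look like; whether training accepts a whole-word label is exactly the open question in option 2.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    label: str
    left: int
    bottom: int
    right: int
    top: int
    page: int


def parse_box_line(line: str) -> Box:
    # Split from the right so the label is whatever precedes the five numbers.
    label, left, bottom, right, top, page = line.rsplit(" ", 5)
    return Box(label, int(left), int(bottom), int(right), int(top), int(page))


def overlaps(a: Box, b: Box) -> bool:
    # Two rectangles on the same page overlap when they intersect on both axes.
    return (
        a.page == b.page
        and a.left < b.right and b.left < a.right
        and a.bottom < b.top and b.bottom < a.top
    )


def merge(boxes: List[Box], label: str) -> Box:
    # Bounding box around all the letter boxes, carrying one word-level label.
    return Box(
        label,
        min(b.left for b in boxes),
        min(b.bottom for b in boxes),
        max(b.right for b in boxes),
        max(b.top for b in boxes),
        boxes[0].page,
    )


def format_box(box: Box) -> str:
    return f"{box.label} {box.left} {box.bottom} {box.right} {box.top} {box.page}"


# Hypothetical letter boxes for the word "wife" (illustrative values only).
SAMPLE = """\
w 10 20 42 50 0
i 40 20 48 58 0
f 47 20 60 60 0
e 62 20 80 50 0"""

boxes = [parse_box_line(line) for line in SAMPLE.splitlines()]

# Option 1: find the letter pairs whose boxes collide.
overlapping = [
    (a.label, b.label)
    for i, a in enumerate(boxes)
    for b in boxes[i + 1:]
    if overlaps(a, b)
]

# Option 2: collapse the letters into one word-level entry.
word = merge(boxes, "wife")

print(overlapping)
print(format_box(word))
```

With these made-up coordinates, the pairs "w"/"i" and "i"/"f" are reported as overlapping, and the merged entry is a single box spanning the whole word.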

I would really appreciate any comments.

Thanks,


Sol

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en



<<attachment: sample.jpg>>
