Hello. I am new to tesseract and I think it has the potential to solve the problem I am facing. I am trying to detect certain words in images (see the attached sample).
Although the writing looks consistent in the sample, that is not always the
case. The width and height vary quite a bit (up to 30%), and there are
different styles, for each of which I may have to create a separate font.
Let's say I want to find all occurrences of the word "son" in the document.
Based on what I have read, there are two possibilities.
1. Train an English font by creating a box file for each letter in the
document. This is a known route, and I need suggestions on
   a. the number of samples I have to create box files for, and
   b. whether to create a box file directly from the sample or to create
      another image with more space between letters (see "wife", where the
      boxes will overlap).
2. Maybe I can treat the whole word as a single character in a new language,
so that "son" is one character rather than a combination of letters. I
haven't seen much documentation on this, but it might be possible, since I
have seen the digraph "ch" treated this way in some documents.
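
To make the two options concrete, here is a minimal Python sketch of what the
box-file entries might look like in each case. It assumes the usual box-file
line layout of symbol, left, bottom, right, top, page, with y measured from
the bottom of the image. The Glyph type, the box_line helper, and all the
coordinates are made up for illustration and are not part of tesseract.
Whether training behaves well with a whole word as the symbol is exactly the
part I am unsure about.

```python
from typing import NamedTuple


class Glyph(NamedTuple):
    """Hypothetical bounding box in image coordinates (origin at top-left)."""
    text: str    # symbol label; one letter, or a whole word for option 2
    left: int
    top: int     # distance from the top edge of the image
    width: int
    height: int


def box_line(glyph: Glyph, image_height: int, page: int = 0) -> str:
    """Format one box-file line: <symbol> <left> <bottom> <right> <top> <page>."""
    left = glyph.left
    right = glyph.left + glyph.width
    # Box files measure y from the bottom of the image, so flip the axis.
    top = image_height - glyph.top
    bottom = image_height - (glyph.top + glyph.height)
    return f"{glyph.text} {left} {bottom} {right} {top} {page}"


IMAGE_HEIGHT = 100  # made-up image height in pixels

# Option 1: one box per letter of "son".
letters = [
    Glyph("s", 10, 20, 12, 18),
    Glyph("o", 24, 20, 12, 18),
    Glyph("n", 38, 20, 12, 18),
]
per_letter_lines = [box_line(g, IMAGE_HEIGHT) for g in letters]

# Option 2: a single box that covers the whole word.
whole_word_line = box_line(Glyph("son", 10, 20, 40, 18), IMAGE_HEIGHT)

if __name__ == "__main__":
    print("\n".join(per_letter_lines))
    print(whole_word_line)
```

With option 1 the letter boxes have to be kept from overlapping, which is the
problem with "wife" in the sample. With option 2 there is only one box per
word, so overlap between letters would not matter.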
Thoughts?
I would really appreciate any comments.
Thanks,
Sol
--
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
<<attachment: sample.jpg>>

