Also, since 80-90% of the text on the page repeats text that my 
program will have seen many times before, is there a way to ignore it, or 
to prevent tesseract from processing it beyond recognizing that it is 99.99% 
likely to be a repeat of previously seen words and characters? Thanks 
again! I'm trying to understand tesseract as fast as possible, but it is 
complicated.
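
To make the question concrete, here is a rough sketch of one way to skip the 
repeated text: crop out only the regions that vary and run OCR on those. It 
assumes every scan is aligned to the same template, so the field positions are 
fixed. FIELD_BOXES, its coordinates, and ocr_fields are made-up names for 
illustration, not part of tesseract.

```python
# Hypothetical pixel boxes (left, upper, right, lower) for the variable fields.
# The static, pre-printed text on the form is never passed to the OCR engine.
FIELD_BOXES = {
    "state": (120, 340, 420, 380),
    "city": (120, 400, 420, 440),
}

def ocr_fields(page, field_boxes, ocr):
    """Run `ocr` only on the cropped field regions of `page`.

    `page` is any object with a crop(box) method (a Pillow Image has one).
    `ocr` is any callable that takes a cropped region and returns text.
    """
    return {name: ocr(page.crop(box)).strip() for name, box in field_boxes.items()}

# Possible usage with Pillow and pytesseract (both assumed to be installed):
#   from PIL import Image
#   import pytesseract
#   page = Image.open("form.png")
#   read_line = lambda region: pytesseract.image_to_string(region, config="--psm 7")
#   values = ocr_fields(page, FIELD_BOXES, read_line)
```

Page segmentation mode 7 tells tesseract to treat each crop as a single line of 
text, which seems to fit one-line form fields. If the scans are not aligned, 
the boxes would need to be located per page first.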

On Tuesday, July 14, 2015 at 1:43:23 PM UTC-7, [email protected] wrote:
>
> I would like to use knowledge of the page layout to greatly improve 
> OCR accuracy. I am working with a large number of forms that are extremely 
> repetitive in structure. Say I know that a particular field in the form 
> holds the value for state/province, and another for city/town. 
>
> Is it too ambitious to attempt to improve the accuracy of tesseract by 
> using this knowledge? For example, I could hypothetically identify the 
> field that holds the state/province, classify this as one of 50 possible 
> states. Then I can have a list of cities in every state, and classify the 
> contents of the city field by choosing the most likely city that is in that 
> state? 
>
> This type of approach could hypothetically be generalized to many other 
> types of very structured information, for example, letting tesseract know 
> that a particular field is likely to contain a year or a phone number, or 
> even potentially a name and choosing from a long list of names. 
>
> Are these types of goals realistic? And if so, is the best way to get 
> started to spend a long time with the source code, make modifications, and 
> compile it myself? Thanks very much!
>
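
The state-then-city idea from the quoted message can be tried as a 
post-processing step on tesseract's raw output, without touching its source. 
This is only a minimal sketch: CITIES_BY_STATE is a tiny made-up table, the 
helper names are hypothetical, and the 0.6 similarity cutoff is a guess that 
would need tuning on real forms.

```python
import difflib

# Hypothetical lookup table; a real one would cover every state and its cities.
CITIES_BY_STATE = {
    "California": ["Los Angeles", "San Diego", "San Francisco"],
    "Oregon": ["Portland", "Salem", "Eugene"],
}

def closest(raw, candidates, cutoff=0.6):
    """Return the candidate most similar to the raw OCR string, or None."""
    lowered = {c.lower(): c for c in candidates}
    hits = difflib.get_close_matches(
        raw.strip().lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[hits[0]] if hits else None

def correct_state_and_city(raw_state, raw_city):
    """Snap the state to a known state, then the city to a city in that state."""
    state = closest(raw_state, CITIES_BY_STATE)
    if state is None:
        return None, None
    city = closest(raw_city, CITIES_BY_STATE[state])
    return state, city

print(correct_state_and_city("Califomia", "San Dieg0"))
```

A None result means no candidate was similar enough, which is a useful signal 
that the field should be reviewed by hand rather than silently corrected.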
