Also, since 80-90% of the text on each page repeats text that my program will have seen many times before, is there a way to skip it, or to keep tesseract from processing it further once it recognizes that the text is 99.99% likely to be a repeat of previously seen words and characters? Thanks again! I'm trying to understand tesseract as quickly as possible, but it is complicated.
On Tuesday, July 14, 2015 at 1:43:23 PM UTC-7, [email protected] wrote:
> I would like to use knowledge of the page layout to greatly improve OCR
> accuracy. I am working with a large number of forms that are extremely
> repetitive in structure. Say I know that a particular field in the form
> holds the value for state/province, and another for city/town.
>
> Is it too ambitious to attempt to improve the accuracy of tesseract by
> using this knowledge? For example, I could hypothetically identify the
> field that holds the state/province and classify it as one of 50 possible
> states. Then, with a list of the cities in every state, I could classify
> the contents of the city field by choosing the most likely city in that
> state.
>
> This type of approach could hypothetically be generalized to many other
> types of very structured information, for example, letting tesseract know
> that a particular field is likely to contain a year or a phone number, or
> even potentially a name, and choosing from a long list of names.
>
> Are these types of goals realistic? And if so, is the best way to get
> started to spend a long time with the source code, make modifications, and
> compile it myself? Thanks very much!
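
To make the state-then-city idea in the quoted message concrete, here is a minimal sketch of doing it as a post-processing step on the text tesseract returns, rather than inside tesseract itself. It uses only Python's standard difflib module. The CITIES_BY_STATE table, the function names, and the 0.6 similarity cutoff are all hypothetical choices for illustration; a real table would list every state and its cities, and the cutoff would need tuning on actual form data.

```python
import difflib

# Hypothetical lookup table; a real one would cover every state and its cities.
CITIES_BY_STATE = {
    "California": ["Los Angeles", "San Diego", "San Francisco"],
    "Oregon": ["Portland", "Salem", "Eugene"],
    "Washington": ["Seattle", "Spokane", "Tacoma"],
}


def closest(raw, candidates, cutoff=0.6):
    """Return the candidate most similar to the raw OCR text, or None.

    Comparison is case-insensitive; the cutoff (an assumed value) is the
    minimum difflib similarity ratio a candidate must reach.
    """
    by_lower = {c.lower(): c for c in candidates}
    hits = difflib.get_close_matches(
        raw.strip().lower(), list(by_lower), n=1, cutoff=cutoff
    )
    return by_lower[hits[0]] if hits else None


def correct_state_and_city(raw_state, raw_city):
    """Snap the state field to a known state, then the city field to a
    city within that state. Returns (None, None) if no state matches."""
    state = closest(raw_state, CITIES_BY_STATE)
    if state is None:
        return None, None
    city = closest(raw_city, CITIES_BY_STATE[state])
    return state, city


if __name__ == "__main__":
    # Typical OCR confusions: "rn" read as "m", "o" read as "0".
    print(correct_state_and_city("Califomia", "San Dieg0"))
```

Because the city is only compared against cities in the matched state, an error in the state field carries over to the city field, so in practice it may be worth rejecting low-similarity state matches and flagging those forms for manual review.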

