I think the UZN file format may be what I am looking for.

On Tuesday, July 14, 2015 at 2:05:42 PM UTC-7, [email protected] wrote:
>
> Also, since 80-90% of the text on the page is a repeat of text that my
> program will have seen many times before, is there a way to ignore it or
> prevent tesseract from processing it beyond understanding that it is 99.99%
> likely to be a repeat of previously seen words and characters? Thanks
> again! I'm trying to understand tesseract as fast as possible but it is
> complicated.
>
> On Tuesday, July 14, 2015 at 1:43:23 PM UTC-7, [email protected] wrote:
>>
>> I would like to use knowledge of the page layout to greatly improve
>> OCR accuracy. I am working with a large number of forms that are extremely
>> repetitive in structure. Say I know that a particular field in the form
>> holds the value for state/province, and another for city/town.
>>
>> Is it too ambitious to attempt to improve the accuracy of tesseract by
>> using this knowledge? For example, I could hypothetically identify the
>> field that holds the state/province and classify it as one of 50 possible
>> states. Then I can have a list of cities in every state, and classify the
>> contents of the city field by choosing the most likely city that is in that
>> state?
>>
>> This type of approach could hypothetically be generalized to many other
>> types of very structured information, for example, letting tesseract know
>> that a particular field is likely to contain a year or a phone number, or
>> even potentially a name, and choosing from a long list of names.
>>
>> Are these types of goals realistic? And if so, is the best way to get
>> started to spend a long time with the source code, make modifications, and
>> compile it myself? Thanks very much!
>>
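
For what it's worth, the state-then-city idea in the original question can be tried without touching the tesseract source at all, by correcting the OCR output of each field afterwards. Below is a minimal sketch of that, assuming the raw text of each field has already been extracted. It uses only Python's standard `difflib` module. The `CITIES_BY_STATE` table and the `closest` and `correct_fields` helpers are hypothetical names made up for illustration, and the table holds only a few sample entries.

```python
import difflib

# Hypothetical, tiny lookup table for illustration only.
# A real one would list every state/province and its cities.
CITIES_BY_STATE = {
    "California": ["Los Angeles", "San Diego", "San Francisco"],
    "Texas": ["Austin", "Dallas", "Houston"],
    "Washington": ["Seattle", "Spokane", "Tacoma"],
}


def closest(raw, candidates, cutoff=0.6):
    """Return the candidate most similar to raw, or None if none clears cutoff."""
    # Compare case-insensitively, but return the canonical spelling.
    lowered = {c.lower(): c for c in candidates}
    hits = difflib.get_close_matches(
        raw.strip().lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[hits[0]] if hits else None


def correct_fields(raw_state, raw_city):
    """Pick the state first, then pick the city only from that state's list."""
    state = closest(raw_state, CITIES_BY_STATE)
    if state is None:
        return None, None
    city = closest(raw_city, CITIES_BY_STATE[state])
    return state, city


if __name__ == "__main__":
    # Typical OCR confusions: "l" read as "1", "o" read as "0".
    print(correct_fields("Ca1ifornia", "San Dieg0"))
```

With the sample table above, the call at the bottom resolves the misread fields to `('California', 'San Diego')`. The `cutoff` value is a guess and would need tuning on real forms. The same pattern should carry over to years, phone numbers, or name lists by swapping in a different candidate list or a format check.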

