I think the UZN file format may be what I am looking for.
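
For anyone finding this later, here is a rough sketch of generating such a file for a fixed form template. It assumes the UZN layout is one zone per line, written as "left top width height label" in pixels from the top-left corner, and that tesseract reads a .uzn file with the same base name as the image when run in a page segmentation mode that skips its own layout analysis (I believe -psm 4, but please check against your version). The zone names and coordinates are made up for illustration.

```python
def format_uzn(zones):
    """Render (left, top, width, height, label) tuples as UZN text, one zone per line."""
    lines = []
    for left, top, width, height, label in zones:
        # Assumption: each line is five space-separated fields, so the
        # label must not contain whitespace.
        lines.append(f"{left} {top} {width} {height} {label.replace(' ', '_')}")
    return "\n".join(lines) + "\n"


# Hypothetical pixel coordinates for two fields on one form template.
FORM_ZONES = [
    (120, 340, 400, 48, "state"),
    (120, 410, 400, 48, "city town"),
]

if __name__ == "__main__":
    # Assumed convention: form.uzn sits next to the image form.png.
    with open("form.uzn", "w") as f:
        f.write(format_uzn(FORM_ZONES))
```

Since the forms share one layout, the same zone list could be reused for every scan, provided the pages are registered consistently.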

On Tuesday, July 14, 2015 at 2:05:42 PM UTC-7, [email protected] wrote:
>
> Also, since 80-90% of the text on the page is a repeat of text that my 
> program will have seen many times before, is there a way to ignore it, or 
> to prevent tesseract from processing it beyond recognizing that it is 
> 99.99% likely to be a repeat of previously seen words and characters? 
> Thanks again! I'm trying to understand tesseract as fast as possible, but 
> it is complicated.
>
> On Tuesday, July 14, 2015 at 1:43:23 PM UTC-7, [email protected] wrote:
>>
>> I would like to use knowledge of the page layout to greatly improve 
>> OCR accuracy. I am working with a large number of forms that are extremely 
>> repetitive in structure. Say I know that a particular field in the form 
>> holds the value for state/province, and another for city/town. 
>>
>> Is it too ambitious to attempt to improve the accuracy of tesseract by 
>> using this knowledge? For example, I could hypothetically identify the 
>> field that holds the state/province, classify this as one of 50 possible 
>> states. Then I can have a list of cities in every state, and classify the 
>> contents of the city field by choosing the most likely city that is in that 
>> state? 
>>
>> This type of approach could hypothetically be generalized to many other 
>> types of very structured information, for example, letting tesseract know 
>> that a particular field is likely to contain a year or a phone number, or 
>> even potentially a name and choosing from a long list of names. 
>>
>> Are these types of goals realistic? And if so, is the best way to get 
>> started to spend a long time with the source code, make modifications, and 
>> compile it myself? Thanks very much!
>>
>
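
The state-then-city idea quoted above does not necessarily require changes to the tesseract source. One option is a post-processing step: OCR each field separately, then snap the raw string to the closest entry in a known list. The sketch below uses Python's standard difflib for the fuzzy match, which is my own choice and not something tesseract provides. The lexicon, function names, and cutoff value are placeholders.

```python
import difflib

# Hypothetical lexicon; a real one would list every state and its cities.
CITIES_BY_STATE = {
    "California": ["Los Angeles", "San Diego", "San Francisco"],
    "Oregon": ["Portland", "Salem", "Eugene"],
}


def snap(raw, candidates, cutoff=0.6):
    """Return the candidate most similar to the raw OCR string, or None."""
    lowered = {c.lower(): c for c in candidates}
    hits = difflib.get_close_matches(
        raw.strip().lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[hits[0]] if hits else None


def correct_fields(raw_state, raw_city):
    """Resolve the state first, then match the city only within that state."""
    state = snap(raw_state, CITIES_BY_STATE)
    if state is None:
        return None, None
    city = snap(raw_city, CITIES_BY_STATE[state])
    return state, city


if __name__ == "__main__":
    print(correct_fields("Ca1ifornia", "San Dieg0"))
```

The same pattern would extend to years, phone numbers, or names by swapping the candidate list for a validator or a larger lexicon. The cutoff of 0.6 is just difflib's default and would need tuning on real scans.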
