Your goals don't sound unreasonable, but I'd suggest starting with an 
approach that focuses on pre- and post-processing before diving in and 
hacking on tesseract itself.  That will let you keep tracking improvements 
in base tesseract without having to worry about re-integrating your 
changes.
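
As a rough illustration of the post-processing side, here is a minimal 
sketch in Python that snaps noisy OCR output for the state field to a known 
list, then restricts the city candidates to that state. It assumes the raw 
field text has already been extracted by tesseract; the lookup table, the 
function names, and the 0.6 similarity cutoff are all hypothetical and 
would need tuning against real forms.

```python
import difflib

# Hypothetical lookup table; a real one would cover every state and its cities.
CITIES_BY_STATE = {
    "New York": ["Albany", "Buffalo", "Rochester"],
    "Ohio": ["Akron", "Cleveland", "Columbus"],
}


def snap_to_vocabulary(raw_text, candidates, cutoff=0.6):
    """Return the candidate most similar to raw_text, or None if none is close enough."""
    # Compare case-insensitively, but return the candidate in its original spelling.
    lowered = {c.lower(): c for c in candidates}
    matches = difflib.get_close_matches(
        raw_text.strip().lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None


def correct_state_and_city(raw_state, raw_city):
    """Correct the state first, then pick a city only from that state's list."""
    state = snap_to_vocabulary(raw_state, CITIES_BY_STATE)
    if state is None:
        return None, None
    city = snap_to_vocabulary(raw_city, CITIES_BY_STATE[state])
    return state, city


# Typical OCR confusions: digit 0 for letter O, digit 1 for letter l.
print(correct_state_and_city("0hio", "C1eveland"))
```

Returning None when nothing clears the cutoff lets a field be flagged for 
manual review instead of being forced onto a wrong value. The same pattern 
should extend to years or phone numbers by validating against a format 
rather than a word list.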

Tom

On Tuesday, July 14, 2015 at 4:43:23 PM UTC-4, [email protected] wrote:
>
> I would like to use knowledge of the page layout to greatly improve 
> OCR accuracy. I am working with a large number of forms that are extremely 
> repetitive in structure. Say I know that a particular field in the form 
> holds the value for state/province, and another for city/town. 
>
> Is it too ambitious to attempt to improve the accuracy of tesseract by 
> using this knowledge? For example, I could hypothetically identify the 
> field that holds the state/province and classify it as one of 50 possible 
> states. Then, with a list of cities in every state, I could classify the 
> contents of the city field by choosing the most likely city in that 
> state. 
>
> This type of approach could hypothetically be generalized to many other 
> types of very structured information, for example, letting tesseract know 
> that a particular field is likely to contain a year or a phone number, or 
> even potentially a name and choosing from a long list of names. 
>
> Are these types of goals realistic? And if so, is the best way to get 
> started to spend a long time with the source code, make modifications, and 
> compile it myself? Thanks very much!
>
