Thanks very much for the feedback!

On Thursday, July 16, 2015 at 9:31:37 AM UTC-7, Tom Morris wrote:
>
> Your goals don't sound unreasonable, but I'd suggest using an approach 
> that focuses on pre and post processing before diving in and hacking on 
> tesseract itself.  That will allow you to easily continue to track 
> improvements in base tesseract without having to worry about re-integrating 
> your changes.
>
> Tom
>
> On Tuesday, July 14, 2015 at 4:43:23 PM UTC-4, [email protected] wrote:
>>
>> I would like to use knowledge of the page layout to greatly improve 
>> OCR accuracy. I am working with a large number of forms that are extremely 
>> repetitive in structure. Say I know that a particular field in the form 
>> holds the value for state/province, and another for city/town. 
>>
>> Is it too ambitious to attempt to improve the accuracy of tesseract by 
>> using this knowledge? For example, I could hypothetically identify the 
>> field that holds the state/province and classify it as one of 50 possible 
>> states. Then, given a list of cities in every state, I could classify the 
>> contents of the city field by choosing the most likely city in that 
>> state.
>>
>> This type of approach could hypothetically be generalized to many other 
>> types of very structured information, for example, letting tesseract know 
>> that a particular field is likely to contain a year or a phone number, or 
>> even potentially a name and choosing from a long list of names. 
>>
>> Are these types of goals realistic? And if so, is the best way to get 
>> started to spend a long time with the source code, make modifications, and 
>> compile it myself? Thanks very much!
>>
>
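
A minimal sketch of the post-processing idea discussed above, assuming each
form field has already been located and OCR'd on its own (for example by
cropping the field and running tesseract on the crop). The lookup table, the
function names and the 0.6 cutoff are all hypothetical placeholders; the only
library call used is difflib.get_close_matches from the Python standard
library. It snaps the raw state text to the nearest known state, then snaps
the raw city text to the nearest city within that state.

```python
import difflib

# Hypothetical lookup table: state -> known cities. A real table would be far larger.
CITIES_BY_STATE = {
    "California": ["Los Angeles", "San Diego", "San Francisco"],
    "Oregon": ["Portland", "Salem", "Eugene"],
}


def snap_to_known(ocr_text, candidates, cutoff=0.6):
    """Return the candidate most similar to ocr_text, or None if none is close enough."""
    # Compare case-insensitively, but return the original spelling.
    lowered = {c.lower(): c for c in candidates}
    matches = difflib.get_close_matches(
        ocr_text.strip().lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None


def correct_state_and_city(raw_state, raw_city):
    """Correct the state first, then restrict the city search to that state."""
    state = snap_to_known(raw_state, CITIES_BY_STATE)
    if state is None:
        return None, None
    city = snap_to_known(raw_city, CITIES_BY_STATE[state])
    return state, city


if __name__ == "__main__":
    # Typical OCR confusions: "l" read as "1", "g" read as "q".
    print(correct_state_and_city("Ca1ifornia", "San Dieqo"))
```

Because this runs entirely on tesseract's output, it keeps working unchanged
when base tesseract is upgraded. The same pattern would extend to other
constrained fields (years, phone numbers, name lists) by swapping the
candidate list or replacing the fuzzy match with a format check.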
