Re: [tesseract-ocr] Need to understand Tesseract code

ravi katiyar Thu, 16 Jun 2016 02:09:05 -0700

Alright, this does give me a starting point.
I am on my way with the R&D :)

Thank you once again.


On Thursday, 16 June 2016 12:49:43 UTC+5:30, Allistair C wrote:
>
> Apologies, missed that! :)
>
> I can't see why you couldn't start with Tesseract as-is for movie poster OCR 
> and focus instead on image preprocessing, i.e. how you send Tesseract 
> the image to interpret. 
>
> I would actually first try the Google Cloud Vision API, as that 
> seems very good at picking out text from more complex scenes. Otherwise, you 
> should read previous posts here on detecting text areas in natural 
> scenes, so you can first extract text rectangles cleanly and send those to 
> Tesseract rather than one big image. I guess it depends on which part of the 
> poster is most important (the movie title, or everything, including the 
> actors etc.). Titles often use very specialised fonts (not always, but 
> often), and I think you will find those very challenging without additional 
> training (see the Tesseract training resources).
>
> Good luck
>
> Sent from my iPhone
>
> On 16 Jun 2016, at 06:18, ravi katiyar <[email protected]> wrote:
>
> Hi
>
> I really appreciate your prompt response; thank you for showing me some 
> direction.
> I understand that modifying Tesseract will be an uphill task, and given 
> that the source code is written entirely in C and C++, it seems even 
> tougher.
>
> I did mention that my use case is to identify text in movie 
> posters printed in newspapers.
> Is anyone aware of something similar to Tesseract that can do this job?
>
> Thanks
> Ravi Katiyar
>
> On Thursday, 16 June 2016 03:41:36 UTC+5:30, Allistair C wrote:
>>
>> Hi,
>>
>> Your question is a little difficult to understand - it sounds like you 
>> are saying on the one hand you have no OCR or image processing background, 
>> know Java, and want to modify Tesseract toward some aim that you do not 
>> specify?
>>
>> Tesseract, as far as I understand, is developed in C/C++ and not Java. 
>> Only the Android JNI bindings would be Java.
>>
>> You can find the Tesseract source code at:
>>
>> https://github.com/tesseract-ocr/tesseract
>>
>> In terms of concepts, you should read "An Overview of the Tesseract OCR 
>> Engine", written by Tesseract's lead, Ray Smith, as it will give you insight 
>> into the algorithms that it employs for OCR.
>>
>>
>> http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/33418.pdf
>>
>> Further concepts for algorithms can be found in the "Techniques" section 
>> at:
>>
>> https://en.wikipedia.org/wiki/Optical_character_recognition
>>
>> Sounds like an uphill struggle to me but I wish you luck!
>>
>> Cheers
>>
>>
>> On 15 June 2016 at 07:28, ravi katiyar <[email protected]> wrote:
>>
>>> Hello All,
>>>
>>> I am new to the world of OCR and image processing. I come 
>>> from a Java background.
>>> Can someone tell me what the prerequisites are for understanding the 
>>> Tesseract code?
>>> For example, the java.awt.image package, or digital image processing 
>>> concepts? What would I need to be thorough with so that I am able to 
>>> understand the Tesseract code?
>>>
>>> I want this understanding because I am aiming to modify 
>>> the code so that Tesseract is able to extract text from a movie poster 
>>> printed in a newspaper.
>>> Tesseract cannot do this currently.
>>>
>>> Thanks
>>> Ravi Katiyar
>>>
>>> -- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit 
>>> https://groups.google.com/d/msgid/tesseract-ocr/9a488786-ac4d-4d2e-a047-ebe329df1ea8%40googlegroups.com
>>>  
>>> <https://groups.google.com/d/msgid/tesseract-ocr/9a488786-ac4d-4d2e-a047-ebe329df1ea8%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>
>
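
For anyone following the preprocessing suggestion quoted above (cut out the text rectangles first, then send each one to Tesseract rather than the whole poster), here is a minimal sketch of what that step might look like. It is only an illustration under assumptions: the image is a plain list of grayscale rows, the box coordinates are hypothetical and would have to come from some text-detection step, and Otsu thresholding is just one common binarization choice, not something prescribed in this thread. Passing the result to Tesseract is left out.

```python
from typing import List, Tuple

# Assumption: an image is a list of rows of 8-bit grayscale pixels
# (0 = black, 255 = white). A real pipeline would use an image library.
Image = List[List[int]]


def crop(image: Image, box: Tuple[int, int, int, int]) -> Image:
    """Cut out box = (left, top, right, bottom); right/bottom are exclusive."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def otsu_threshold(image: Image) -> int:
    """Return the gray level that maximises between-class variance (Otsu)."""
    hist = [0] * 256
    for row in image:
        for pixel in row:
            hist[pixel] += 1
    total = sum(hist)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_level, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += hist[level]
        if weight_bg == 0:
            continue  # no pixels at or below this level yet
        weight_fg = total - weight_bg
        if weight_fg == 0:
            break  # every pixel is already in the background class
        sum_bg += level * hist[level]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_level = variance, level
    return best_level


def binarize(image: Image) -> Image:
    """Map pixels at or below the Otsu threshold to black, the rest to white."""
    threshold = otsu_threshold(image)
    return [[0 if pixel <= threshold else 255 for pixel in row] for row in image]


# Hypothetical 4x4 "poster" with a dark text blob on a light background.
poster = [
    [200, 200, 200, 200],
    [200, 40, 40, 200],
    [200, 40, 200, 200],
    [200, 200, 200, 200],
]
text_region = crop(poster, (1, 1, 3, 3))  # hypothetical text rectangle
clean_region = binarize(text_region)
```

Each cleaned-up region would then be handed to Tesseract separately. Specialised title fonts may still need extra training, as noted in the quoted reply.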

