Re: [tesseract-ocr] Need to understand Tesseract code

2016-06-15 Thread ravi katiyar
Hi

I really appreciate your prompt response; thank you for pointing me in the 
right direction.
I understand that modifying Tesseract will be an uphill task, and given that 
the source code is written entirely in C and C++, it seems even tougher.

As I mentioned, my use case is to extract text from movie posters printed 
in newspapers.
Is anyone aware of something similar to Tesseract that can do this job?

Thanks
Ravi Katiyar

On Thursday, 16 June 2016 03:41:36 UTC+5:30, Allistair C wrote:
>
> Hi,
>
> Your question is a little difficult to understand - it sounds like you are 
> saying on the one hand you have no OCR or image processing background, know 
> Java, and want to modify Tesseract toward some aim that you do not specify?
>
> Tesseract as far as I understand is developed using C/C++ and not Java. 
> Only the Android JNI bindings would be Java.
>
> You can find the Tesseract source code at:
>
> https://github.com/tesseract-ocr/tesseract
>
> In terms of concepts you should read "An Overview of the Tesseract OCR 
> Engine" written by Tesseract's lead Ray Smith as it will give you insight 
> into the algorithms that are employed for its OCR.
>
>
> http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/33418.pdf
>
> Further concepts for algorithms can be found in the "Techniques" section 
> at:
>
> https://en.wikipedia.org/wiki/Optical_character_recognition
>
> Sounds like an uphill struggle to me but I wish you luck!
>
> Cheers
>
>
> On 15 June 2016 at 07:28, ravi katiyar wrote:
>
>> Hello All,
>>
>> I am new to the world of OCR and image processing. I come from a Java 
>> background.
>> Can someone tell me what the prerequisites are for understanding the 
>> Tesseract code?
>> For example, the java.awt.image package, or digital image processing 
>> concepts? What would I need to be thorough with in order to understand 
>> the Tesseract code?
>>
>> I want this understanding because I am aiming to modify the code so that 
>> Tesseract is able to extract text from a movie poster printed in a 
>> newspaper.
>> Tesseract cannot do this currently.
>>
>> Thanks
>> Ravi Katiyar
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to tesseract-oc...@googlegroups.com.
>> To post to this group, send email to tesser...@googlegroups.com.
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/9a488786-ac4d-4d2e-a047-ebe329df1ea8%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/de18b6e5-d87a-4fc3-a4a6-79c3e952a5e0%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: [tesseract-ocr] Need to understand Tesseract code

2016-06-15 Thread Allistair
Hi,

Your question is a little difficult to understand - it sounds like you are
saying on the one hand you have no OCR or image processing background, know
Java, and want to modify Tesseract toward some aim that you do not specify?

Tesseract as far as I understand is developed using C/C++ and not Java.
Only the Android JNI bindings would be Java.

You can find the Tesseract source code at:

https://github.com/tesseract-ocr/tesseract

In terms of concepts you should read "An Overview of the Tesseract OCR
Engine" written by Tesseract's lead Ray Smith as it will give you insight
into the algorithms that are employed for its OCR.

http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/33418.pdf

Further concepts for algorithms can be found in the "Techniques" section at:

https://en.wikipedia.org/wiki/Optical_character_recognition

Sounds like an uphill struggle to me but I wish you luck!

Cheers


On 15 June 2016 at 07:28, ravi katiyar wrote:

> Hello All,
>
> I am new to the world of OCR and image processing. I come from a Java
> background.
> Can someone tell me what the prerequisites are for understanding the
> Tesseract code?
> For example, the java.awt.image package, or digital image processing
> concepts? What would I need to be thorough with in order to understand
> the Tesseract code?
>
> I want this understanding because I am aiming to modify the code so that
> Tesseract is able to extract text from a movie poster printed in a
> newspaper.
> Tesseract cannot do this currently.
>
> Thanks
> Ravi Katiyar



[tesseract-ocr] Re: Possible to prioritise some characters over others during OCR?

2016-06-15 Thread Diederik Hattingh
Hi Stef, 
Thanks for the reply (here and on SO).

The fix mostly works, but unfortunately I am still seeing that tesseract 
sometimes ignores the unicharambigs file I set for it.

For example, I have the following two images:

[image not preserved in the archive]

And:

[image not preserved in the archive]

The only difference between the files is the border around them.


In my eng.unicharambigs file I have added the following lines:


3	: I I	3	: / /	1
3	: / I	3	: / /	1
3	: I /	3	: / /	1
5	. c o m l	5	. c o m /	1
3	: / l	3	: / /	1
3	: l /	3	: / /	1
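
For reference, entries like these follow the v1 unicharambigs layout as I understand it: five tab-separated fields (source length, source characters separated by spaces, target length, target characters, and 1 for a mandatory or 0 for an optional replacement). A minimal Python sketch that generates such lines, where `ambig_line` is a hypothetical helper and not part of Tesseract:

```python
def ambig_line(source, target, mandatory=True):
    """Build one v1-format unicharambigs entry (hypothetical helper).

    Assumes every character is a single unichar, which holds for the
    ASCII examples here but not necessarily for other scripts.
    """
    src, dst = list(source), list(target)
    fields = [
        str(len(src)), " ".join(src),   # length and characters to replace
        str(len(dst)), " ".join(dst),   # length and replacement characters
        "1" if mandatory else "0",      # 1 = mandatory, 0 = optional
    ]
    return "\t".join(fields)


if __name__ == "__main__":
    # Regenerate a few of the entries with explicit tab separators.
    for source, target in [(":II", "://"), (":/I", "://"), (".coml", ".com/")]:
        print(ambig_line(source, target))
```

Generating the lines this way avoids the tabs being silently converted to spaces by an editor, which would be one thing to rule out before looking elsewhere.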


When I run Tesseract on the file without spacing, I get the following output:


http:II1
111.com/


When I run Tesseract on the file with spacing, I get the correct output:


http://1
111.com/


Another example of spacing (or something else?) making a difference:

Smaller border:

[image not preserved in the archive]

Larger border:

[image not preserved in the archive]

Both of these files have spacing around the text, with the first image having 
less spacing (the font also differs very slightly between the two images).


Running Tesseract on the first file gives the correct result: 
http://alphaGl.com/primenumbershittingbearl (except for 6 -> G and the last / 
becoming l).


On the second image I get the output 
http://alpha61.comIprimenumbershittingbearl. It seems as if the 
unicharambigs file is ignored for the .com/ case: the substitution is not 
applied as specified.
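
To compare border widths systematically, one option is to pad the same source image by different margins before running Tesseract on each variant. A minimal sketch, assuming a grayscale image held as a list of pixel rows; `add_border` is a hypothetical helper and not part of Tesseract:

```python
def add_border(pixels, border, fill=255):
    """Return a copy of a grayscale image with a uniform border added.

    `pixels` is a list of equal-length rows of 0-255 values. `fill`
    defaults to white, which assumes dark text on a light background.
    """
    if border < 0:
        raise ValueError("border must be non-negative")
    width = len(pixels[0]) if pixels else 0
    blank_row = [fill] * (width + 2 * border)
    side = [fill] * border
    padded = [list(blank_row) for _ in range(border)]          # top margin
    padded += [side + list(row) + side for row in pixels]      # left/right margins
    padded += [list(blank_row) for _ in range(border)]         # bottom margin
    return padded


if __name__ == "__main__":
    tiny = [[0, 0], [0, 0]]
    # One variant per border width, all derived from the same pixels.
    variants = {b: add_border(tiny, b) for b in (0, 1, 2)}
    for b, img in variants.items():
        print(b, len(img), "x", len(img[0]))
```

On real image files, Pillow's `ImageOps.expand(img, border=10, fill=255)` should give the same kind of padding. Deriving every variant from one source image would also remove the slight rendering differences between my test files as a variable.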


Is there anything you can think of to fix this problem?

On Friday, 3 June 2016 18:39:38 UTC+2, Stef wrote:
>
> Here you are: SO answer. 
> 
>  
>
> Am Freitag, 3. Juni 2016 18:31:47 UTC+2 schrieb John Muccigrosso:
>>
>> On Thursday, June 2, 2016 at 5:21:51 PM UTC-4, Stef wrote:
>>>
>>> You can resolve the ambiguity using the unicharambigs file, for details 
>>> see my SO answer to your SO question.
>>>
>>> Stef
>>>
>>
>> I'm curious about this as well. Could you post a link to this discussion?
>>
>> Thanks. 
>>
>



Re: [tesseract-ocr] Re: Do we have Sanskrit training images and box files online?

2016-06-15 Thread ShreeDevi Kumar
You can check out the older version of sanskritocr from
http://learnsanskrit.org/tools/ocr

The new version is commercial software, available as a demo for free, but
requires payment for use.

- sent from my phone. excuse the brevity.
On 14-Jun-2016 3:44 am, "rohit saluja" wrote:

> Also, thanks a lot for mentioning SanskritOCR; I did not know about it.
>
> On Tue, Jun 14, 2016 at 3:35 AM, rohit saluja wrote:
>
>> Hey, thanks a lot for your reply. Using the hin data with a Sanskrit
>> wordlist seems to be a great idea.
>>
>> Still, I am interested in learning how to build things from scratch.
>> So I used some box files and images I created for the Sanskrit 2003 font,
>> took the Hindi config file from
>> https://github.com/tesseract-ocr/langdata/blob/master/hin/hin.config
>> and renamed it to san3ds.config. san3ds (3 for 2003, ds for Devanagari
>> split) is the name I am giving to my new training data.
>>
>> I was able to train san3ds without any config file before.
>>
>> I just renamed san3ds.word-dawg to san3ds.cube-word-dawg. I kept the
>> remaining files as they are.
>> I could build the san3.traineddata file, but I am getting an error during
>> recognition:
>>
>> Cube ERROR (CubeRecoContext::Load): unable to read cube language model
>> params from /usr/local/share/tessdata/san3ds.cube.lm
>> Cube ERROR (CubeRecoContext::Create): unable to init CubeRecoContext
>> object
>> init_cube_objects(true, _manager):Error:Assert failed:in file
>> tessedit.cpp, line 214
>> Segmentation fault (core dumped)
>>
>> Any help with this? Why is this happening? Is it wrong to rename
>> word-dawg? I cannot find any separate option for generating cube-word-dawg.
>>
>> Thanks in advance
>> Rohit
>>
>>
>> On Mon, Jun 13, 2016 at 7:04 PM, ShreeDevi Kumar wrote:
>>
>>> If you look at the readme files in the different subdirectories starting
>>> with OCR under
>>> https://github.com/Shreeshrii/imagessan/tree/master you will see
>>> results for character- and word-level accuracy. Depending on the font,
>>> character-level accuracy is around 80% and word-level accuracy around 60%.
>>>
>>> I have not used it for actual OCR of any text because the sanskritocr
>>> software by Dr. Oliver Hellwig gives better results.
>>>
>>> See https://sites.google.com/site/sanskritcode/ocr/1-ocr-ing
>>>
>>> - sent from my phone. excuse the brevity.
>>> On 13-Jun-2016 6:53 pm, "ShreeDevi Kumar" wrote:
>>>
>>>> Yes, hin traineddata with cube gives better results than san.
>>>>
>>>> I did some rudimentary testing with the new traineddata I made. It does
>>>> not use cube. Look at the config files; they have some options for
>>>> Devanagari processing.
>>>>
>>>> You could try to unpack the hin traineddata and then remake the DAWG
>>>> files using Sanskrit wordlists and combine them as an experiment.
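
The unpack, rebuild, and recombine experiment could be sketched roughly as follows. This is only an illustration under assumptions: the paths and the `dawg_rebuild_commands` helper are hypothetical, the commands are built but not executed, and only the plain word-dawg component is replaced. The tools named are the standard Tesseract 3.x training utilities (`combine_tessdata`, `dawg2wordlist`, `wordlist2dawg`).

```python
from pathlib import PurePosixPath


def dawg_rebuild_commands(traineddata, workdir, wordlist):
    """Return the command lines for the unpack / rebuild / repack experiment.

    All paths are hypothetical examples. Nothing is executed here; run the
    commands on a copy of the traineddata file, since -o modifies it in place.
    """
    traineddata = PurePosixPath(traineddata)
    lang = traineddata.stem                            # e.g. "hin"
    prefix = str(PurePosixPath(workdir) / lang) + "."  # e.g. "work/hin."
    unicharset = prefix + "unicharset"
    word_dawg = prefix + "word-dawg"
    return [
        # 1. Unpack every component of the traineddata into workdir.
        ["combine_tessdata", "-u", str(traineddata), prefix],
        # 2. Dump the existing dictionary as text, for comparison.
        ["dawg2wordlist", unicharset, word_dawg, prefix + "wordlist.txt"],
        # 3. Build a new word DAWG from the replacement word list.
        ["wordlist2dawg", str(wordlist), word_dawg, unicharset],
        # 4. Overwrite the word-dawg component inside the traineddata.
        ["combine_tessdata", "-o", str(traineddata), word_dawg],
    ]


if __name__ == "__main__":
    for cmd in dawg_rebuild_commands("tessdata/hin.traineddata", "work", "san.wordlist"):
        print(" ".join(cmd))
```

Every word in the replacement list has to be expressible in the unpacked unicharset, so the unicharset from the same traineddata is passed to `wordlist2dawg`.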

>>>> If you have a Unicode version of the font used for the docs you want to
>>>> OCR, then train using that.
>>>>
>>>> - sent from my phone. excuse the brevity.
>>>> On 13-Jun-2016 4:47 pm, "rohit saluja" wrote:
>>>>
>>>>> Thanks again for replying. I will surely check them out.
>>>>>
>>>>> My experience is that OCR on Sanskrit data with hin.traineddata gives
>>>>> better results than san.traineddata. I do not know whether this is due
>>>>> to cube mode or to the Devanagari preprocessing (segmentation, I guess).
>>>>>
>>>>> I wonder why such preprocessing is not applied in san.traineddata.
>>>>> Please let me know whether you are using cube mode in your traineddata
>>>>> or not, and whether you are using Devanagari preprocessing.
>>>>>
>>>>> On Mon, Jun 13, 2016 at 9:18 AM, ShreeDevi Kumar wrote:
>>>>>
>>>>>> Google has not provided images and box files for the san.traineddata
>>>>>> released for 3.04.
>>>>>>
>>>>>> I tried training using text2image with a combination of different
>>>>>> fonts and training text. Results are at
>>>>>> https://github.com/Shreeshrii/imagessan/tree/master/tessdata
>>>>>>
>>>>>> You can give these a try to see if recognition is any better.
>>>>>>
>>>>>> You can unpack any traineddata file using the -u option with
>>>>>> combine_tessdata to see the config files etc.
>>>>>>
>>>>>> http://manpages.ubuntu.com/manpages/trusty/man1/combine_tessdata.1.html
>>>>>>
>>>>>> Use dawg2wordlist to look at the various dictionary word lists
>>>>>> used.
>>>>>>
>>>>>> http://manpages.ubuntu.com/manpages/trusty/man1/dawg2wordlist.1.html
>>>>>>
>>>>>> - sent from my phone. excuse the brevity.
>>>>>> On 12-Jun-2016 11:26 am, "rohit saluja" wrote:
>>>>>>
>>>>>>> Hey, thanks for replying.
>>>>>>> Which options should I use with the text2image command? Also, is there
>>>>>>> a configuration file and a list of fonts?
>>>>>>>
>>>>>>> I tried the default options of text2image with the tesseract GitHub
>>>>>>> training data with Sanskrit 2003, but the recognition results are far
>>>>>>> away from

[tesseract-ocr] Need to understand Tesseract code

2016-06-15 Thread ravi katiyar
Hello All,

I am new to the world of OCR and image processing. I come from a Java 
background.
Can someone tell me what the prerequisites are for understanding the 
Tesseract code?
For example, the java.awt.image package, or digital image processing 
concepts? What would I need to be thorough with in order to understand the 
Tesseract code?

I want this understanding because I am aiming to modify the code so that 
Tesseract is able to extract text from a movie poster printed in a 
newspaper.
Tesseract cannot do this currently.

Thanks
Ravi Katiyar
