Hi Shree,

Thank you for the quick response.
I used the trained data downloaded from the following repositories:
https://github.com/tesseract-ocr/tessdata, 
https://github.com/tesseract-ocr/tessdata_best and 
https://github.com/tesseract-ocr/tessdata_fast.

I ran the following commands for each of these datasets and varied --psm from 1
to 13, but the output is more or less the same as the one I posted. I could not
get output like yours, with the text in the correct reading order.

tesseract --tessdata-dir tessdata_best-master "sample.tif" sample --psm 1
tesseract --tessdata-dir tessdata_fast-master "sample.tif" sample --psm 1
tesseract --tessdata-dir tessdata-master "sample.tif" sample --psm 1
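
One thing to note about the three commands above: they all write to the same output base, sample, so each run overwrites the sample.txt from the previous one. A rough sketch of a loop that keeps one output file per model directory and per mode, so the results can be compared side by side, could look like the following. The function name run_all_psm and the TESSERACT and IMAGE variables are made up for this illustration; the directory names and the 1 to 13 range are taken from the commands and description above. Some modes may fail or produce no text (for example, modes that rely on orientation detection need osd.traineddata in the same directory), so failures are only reported, not treated as fatal.

```shell
# Sketch only: run_all_psm, TESSERACT and IMAGE are hypothetical names.
run_all_psm() {
    tess="${TESSERACT:-tesseract}"   # set TESSERACT=echo for a dry run
    image="${IMAGE:-sample.tif}"     # input image from the commands above

    for dir in tessdata_best-master tessdata_fast-master tessdata-master; do
        psm=1
        while [ "$psm" -le 13 ]; do
            # The output base encodes the model directory and the mode,
            # so no run overwrites another (tesseract appends .txt).
            "$tess" --tessdata-dir "$dir" "$image" \
                "sample_${dir}_psm${psm}" --psm "$psm" \
                || echo "psm $psm failed for $dir" >&2
            psm=$((psm + 1))
        done
    done
}
```

Calling run_all_psm runs the real binary; TESSERACT=echo run_all_psm only prints the 39 commands it would run, which is a cheap way to check the file names first.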

I am not sure what I am doing wrong here and would appreciate your help.

Regards,
Giriraj

On Friday, April 26, 2019 at 3:04:34 AM UTC-4, shree wrote:
>
> Which eng.traineddata did you use?
>
> There are three options: tessdata, tessdata_best and tessdata_fast.
>
> On Fri, 26 Apr 2019, 09:19 Giriraj Bhojak <[email protected]> wrote:
>
>> Hello Shree,
>>
>> I realize this post is more than two years old now, but would appreciate 
>> any help.
>> I tried your suggestion on the same attached sample using tesseract v4, 
>> but I am unable to get the result you posted.
>> I have tried all page segmentation modes, but none of them produced the 
>> result you have posted. 
>> Could you please let me know what I might be doing wrong?
>>
>> Here are the version details for tesseract on my machine:
>>
>> tesseract 4.0.0
>>  leptonica-1.77.0
>>   libgif 5.1.4 : libjpeg 9c : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.1 : libopenjp2 2.3.0
>>  Found AVX2
>>  Found AVX
>>  Found SSE
>>
>> Here is the output I get for most of the psm modes:
>>
>>
>> 8633 0410 NO RP 1107122016 NNNNNYNN 07 000001 0001 Page 20f3
>>
>> Did you know? Did you know?
>>
>> Your Comcast Business Internet Never miss a payment with text alerts.
>> service gives you access to millions Receive text message reminders when your
>> of WiFi hotspots with the fastest WiFi bill is ready to pay or past due. Sign up at
>> and even more coverage. Find out business.comcast.com/myaccount.
>>
>> more at business.comcast.conm/wifi.
>>
>> Your bill is ready
>>
>>    
>>
>> Need help? We’re here for you.
>>
>>  
>>
>> > Visit business.comcast.com/help Please notify us immediately with any
>> Call 1-800-391-3000 questions regarding charges billed to your
>> aa account. Comcast will issue a credit or
>> Billing support refund for any verified billing error which is
>> Open 6 am-9 pm MTN, Mon through Fri brought to our attention within sixty (60) days
>> and 7 am-8 pm Sat of the bill.
>>
>> Technical support
>> Open 24 hours, 7 days a week
>>
>> TT
>>
>> Automatic payment If you’re moving, give us as much
>> Sign up at business.comcast.com/myaccount advanced notice as possible so we
>>
>> Se Online can help make a smooth transition.
>> Visit business.comcast.com/myaccount
>>
>> a By phone
>> Call 1-800-391-3000
>>
>> Call 1-800-391-3000
>>
>> IME
>>
>>  
>>
>>  
>>
>> Regards,
>> Giriraj.
>>
>> On Friday, April 21, 2017 at 4:55:03 AM UTC-4, shree wrote:
>>>
>>> If you want to OCR an invoice like the sample you posted, just use the 
>>> eng.traineddata and OCR the page. You do not need to do any training.
>>>
>>> Here is the output I get 
>>>
>>>
>>>
>>> 8633 0410 NO RP 11 07122015 NNNNNYNN 01 000001 0001 Page 2 Of 3
>>>
>>>
>>> Did you know?
>>>
>>>
>>> Your Comcast Business Internet
>>>
>>> service gives you access to millions
>>>
>>> of WiFi hotspots with the fastest WiFi
>>>
>>> and even more coverage. Find out
>>>
>>> more at businesscomcast.com/wifi.
>>>
>>>
>>>
>>> Need help? We’re here for you.
>>>
>>>
>>> 9 Visit business.comcast.com/help
>>>
>>> Call 1-800—391 -3000
>>>
>>> A
>>>
>>>
>>> Billing support
>>>
>>> Open 6 am-9 pm MTN, Mon through Fri
>>>
>>> and 7 am—8 pm Sat
>>>
>>>
>>> Technical support
>>>
>>> Open 24 hours, 7 days a week
>>>
>>>
>>>
>>> Did you know?
>>>
>>>
>>> Never miss a payment with text alerts.
>>>
>>> Receive text message reminders when your
>>>
>>> bill is ready to pay or past due. Sign up at
>>>
>>> business.comcast.com/myaccount.
>>>
>>>
>>>
>>> Your bill is ready
>>>
>>>
>>>
>>>
>>> Please notify us immediately with any
>>>
>>> questions regarding charges billed to your
>>>
>>> account. Comcast will issue a credit or
>>>
>>> refund for any verified billing error which is
>>>
>>> brought to our attention within sixty (60) days
>>>
>>> of the bill.
>>>
>>>
>>> llllllllllllllllllllllllllllllllll
>>>
>>>
>>> Additional payment options Moving? Let us help.
>>>
>>>
>>> Automatic payment
>>>
>>> Sign up at business.comcast.com/myaccount
>>>
>>>
>>> a Oniine
>>>
>>>
>>> Visit business.comcast.com/myaccount
>>>
>>>
>>> a By phone
>>>
>>> Call 1-800-391 -3000
>>>
>>>
>>> if you're moving, give us as much
>>>
>>> advanced notice as possible so we
>>>
>>> can help make a smooth transition.
>>>
>>>
>>> Call 1 -800-391 -3000
>>>
>>>
>>> |||||||llllllllllllllllllllllllll
>>>
>>>
>>>
>>>
>>> ShreeDevi
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Fri, Apr 21, 2017 at 11:34 AM, Alain Ghawi <[email protected]> 
>>> wrote:
>>>
>>>> Hello all,
>>>>
>>>> I am surprised by how many people tell me that tesseract is the best 
>>>> open-source OCR tool, yet there is no video explaining step by step the 
>>>> problems you can encounter, nor a good explanation and documentation 
>>>> for OCR.
>>>>
>>>> Even so, everyone loves a challenge! Here is the one I faced: I have 
>>>> many PDF invoices and I want to train tesseract to OCR them as 
>>>> scanned images.
>>>> First, I converted these PDF files into TIFF files with ImageMagick:
>>>> magick -density 300 -depth 4 2151.pdf -background white -fill white -alpha Off 2151%d.tif
>>>> Nothing important here, other than that the result is a 300 dpi image 
>>>> with the alpha channel off.
>>>>
>>>> Next, the .tif files must be renamed to [lang].[name_font].exp0.tif 
>>>> (com.test_font.exp0.tif in my example).
>>>>
>>>> Great! After this step you must create the box files, right? So I simply 
>>>> called: 
>>>> tesseract com.test_font.exp0.tif com.test_font.exp0 batch.nochop makebox
>>>> tesseract com.test_font.exp1.tif com.test_font.exp1 batch.nochop makebox
>>>>
>>>> Then I corrected the box files with CowBoxEditor, which did the job, since 
>>>> I could not find the famous jTessBoxEditor online (weird, right?).
>>>>
>>>> After that, I created my .tr files:
>>>> tesseract com.test_font.exp0.tif com.test_font.exp0 nobatch box.train
>>>> tesseract com.test_font.exp1.tif com.test_font.exp1 nobatch box.train
>>>>
>>>> And here come the surprises!
>>>> Once you have your .tr files, you call unicharset_extractor. 
>>>> First question: why are the glyph metrics all 0,255,0,255,0,0,0,0,0,0? 
>>>> That looks wrong according to the documentation: 
>>>> https://github.com/tesseract-ocr/tesseract/blob/a3ba11b030345d32829b1e8355afea5419978d82/doc/unicharset.5.asc
>>>> Second question: should I pass one box file at a time, or both together?
>>>> Option 1: unicharset_extractor com.test_font.exp0.box
>>>> Option 2: unicharset_extractor com.test_font.exp0.box com.test_font.exp1.box
>>>> Third question: why should I use set_unicharset_properties? It does not 
>>>> fix the metrics; it only specifies whether the script is Latin or Common. Link: 
>>>> https://github.com/tesseract-ocr/tesseract/issues/318
>>>>
>>>> Despite these unanswered questions, I ran mftraining and cntraining 
>>>> (no problems). Finally, I renamed my inttemp, normproto, pffmtable and 
>>>> shapetable files and combined them using combine_tessdata com.
>>>>
>>>> Final question: if I name the files com.inttemp1 and com.inttemp2, does 
>>>> it still work? The same question applies to shapetable, normproto and 
>>>> pffmtable.
>>>>
>>>> I think every new tesseract user asks these questions more than once. 
>>>> If any tesseract expert can answer them, it would be a great help to 
>>>> the whole community.
>>>> Please find attached the 2 tif files and the generated box files. 
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/tesseract-ocr/beb558f3-d52c-4eca-a668-501a9804ffb0%40googlegroups.com
>>>>  
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/beb558f3-d52c-4eca-a668-501a9804ffb0%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>
>
