Hi Shree,

Thank you for the quick response. I used the trained data downloaded from https://github.com/tesseract-ocr/tessdata, https://github.com/tesseract-ocr/tessdata_best and https://github.com/tesseract-ocr/tessdata_fast.
I ran the following commands for each of these datasets, changing psm from 1 to 13, but the output is more or less the same as the one I posted. I couldn't get output like the one you posted, where the text comes out in the right order.

tesseract --tessdata-dir tessdata_best-master "sample.tif" sample --psm 1
tesseract --tessdata-dir tessdata_fast-master "sample.tif" sample --psm 1
tesseract --tessdata-dir tessdata-master "sample.tif" sample --psm 1

I am not sure what I am doing wrong here and would appreciate your help with this.

Regards,
Giriraj

On Friday, April 26, 2019 at 3:04:34 AM UTC-4, shree wrote:
>
> Which eng.traineddata did you use?
>
> There are three options: from tessdata, tessdata_best and tessdata_fast.
>
> On Fri, 26 Apr 2019, 09:19 Giriraj Bhojak, <[email protected]> wrote:
>
>> Hello Shree,
>>
>> I realize this post is more than two years old now, but I would appreciate any help.
>> I tried your suggestion on the same attached sample using tesseract v4, and I am unable to get the result you posted.
>> I have tried all page segmentation modes, but none of them produced the result you posted.
>> Could you please let me know what I might be doing wrong?
>>
>> Here are the version details for tesseract on my machine:
>>
>> tesseract 4.0.0
>> leptonica-1.77.0
>> libgif 5.1.4 : libjpeg 9c : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.1 : libopenjp2 2.3.0
>> Found AVX2
>> Found AVX
>> Found SSE
>>
>> Here is the output I get for most of the psm modes:
>>
>> 8633 0410 NO RP 1107122016 NNNNNYNN 07 000001 0001 Page 20f3
>>
>> Did you know? Did you know?
>>
>> Your Comcast Business Internet Never miss a payment with text alerts.
>> service gives you access to millions Receive text message reminders when your
>> of WiFi hotspots with the fastest WiFi bill is ready to pay or past due. Sign up at
>> and even more coverage. Find out business.comcast.com/myaccount.
>>
>> more at business.comcast.conm/wifi.
>>
>> Your bill is ready
>>
>> Need help? We’re here for you.
>>
>> > Visit business.comcast.com/help Please notify us immediately with any
>> Call 1-800-391-3000 questions regarding charges billed to your
>> aa account. Comcast will issue a credit or
>> Billing support refund for any verified billing error which is
>> Open 6 am-9 pm MTN, Mon through Fri brought to our attention within sixty (60) days
>> and 7 am-8 pm Sat of the bill.
>>
>> Technical support
>> Open 24 hours, 7 days a week
>>
>> TT
>>
>> Automatic payment If you’re moving, give us as much
>> Sign up at business.comcast.com/myaccount advanced notice as possible so we
>>
>> Se Online can help make a smooth transition.
>> Visit business.comcast.com/myaccount
>>
>> a By phone
>> Call 1-800-391-3000
>>
>> Call 1-800-391-3000
>>
>> IME
>>
>> Regards,
>> Giriraj.
>>
>> On Friday, April 21, 2017 at 4:55:03 AM UTC-4, shree wrote:
>>>
>>> If you want to OCR an invoice like the sample you posted, just use the eng.traineddata and OCR the page. You do not need to do any training.
>>>
>>> Here is the output I get
>>>
>>> 8633 0410 NO RP 11 07122015 NNNNNYNN 01 000001 0001 Page 2 Of 3
>>>
>>> Did you know?
>>>
>>> Your Comcast Business Internet
>>> service gives you access to millions
>>> of WiFi hotspots with the fastest WiFi
>>> and even more coverage. Find out
>>> more at businesscomcast.com/wifi.
>>>
>>> Need help? We’re here for you.
>>>
>>> 9 Visit business.comcast.com/help
>>> Call 1-800—391 -3000
>>> A
>>>
>>> Billing support
>>> Open 6 am-9 pm MTN, Mon through Fri
>>> and 7 am—8 pm Sat
>>>
>>> Technical support
>>> Open 24 hours, 7 days a week
>>>
>>> Did you know?
>>>
>>> Never miss a payment with text alerts.
>>> Receive text message reminders when your
>>> bill is ready to pay or past due. Sign up at
>>> business.comcast.com/myaccount.
>>>
>>> Your bill is ready
>>>
>>> Please notify us immediately with any
>>> questions regarding charges billed to your
>>> account. Comcast will issue a credit or
>>> refund for any verified billing error which is
>>> brought to our attention within sixty (60) days
>>> of the bill.
>>>
>>> llllllllllllllllllllllllllllllllll
>>>
>>> Additional payment options Moving? Let us help.
>>>
>>> Automatic payment
>>> Sign up at business.comcast.com/myaccount
>>>
>>> a Oniine
>>>
>>> Visit business.comcast.com/myaccount
>>>
>>> a By phone
>>> Call 1-800-391 -3000
>>>
>>> if you're moving, give us as much
>>> advanced notice as possible so we
>>> can help make a smooth transition.
>>>
>>> Call 1 -800-391 -3000
>>>
>>> |||||||llllllllllllllllllllllllll
>>>
>>> ShreeDevi
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>> On Fri, Apr 21, 2017 at 11:34 AM, Alain Ghawi <[email protected]> wrote:
>>>
>>>> Hello all,
>>>>
>>>> I am surprised by how many people tell me that tesseract is the best open-source OCR tool, and yet there is no video explaining step by step the problems that you can encounter, nor a good explanation and documentation for OCR.
>>>>
>>>> Even so, everyone loves challenges! So here's the challenge I faced. I brought many pdf files that are invoices, and I want to train tesseract to be able to OCR them as scanned images.
>>>> So first of all, I transformed these pdf files into tif files using:
>>>> magick -density 300 -depth 4 2151.pdf -background white -fill white -alpha Off 2151%d.tif
>>>> This is ImageMagick. Nothing important here other than that we get a 300 dpi image with the alpha channel off.
>>>>
>>>> You must then rename the .tif files to [lang].[name_font].exp0.tif (com.test_font.exp0.tif in my example).
>>>>
>>>> Great! After this step you must create your box file, right? So I simply called:
>>>> tesseract com.test_font.exp0.tif com.test_font.exp0 batch.nochop makebox
>>>> tesseract com.test_font.exp0.tif com.test_font.exp1 batch.nochop makebox
>>>>
>>>> Then I fixed my files with CowBoxEditor, which did the job, as I couldn't find the famous jTessBoxEditor online (weird, right?).
>>>>
>>>> After that, I created my .tr files:
>>>> tesseract com.test_font.exp0.tif com.test_font.exp0 nobatch box.train
>>>> tesseract com.test_font.exp1.tif com.test_font.exp1 nobatch box.train
>>>>
>>>> And here come the surprises!!!
>>>> After you have your .tr files, you call unicharset_extractor.
>>>> First question: Why are the glyph metrics all 0,255,0,255,0,0,0,0,0,0? That is wrong according to the documentation:
>>>> https://github.com/tesseract-ocr/tesseract/blob/a3ba11b030345d32829b1e8355afea5419978d82/doc/unicharset.5.asc
>>>> Second question: Should I process one box file, then the other, or combine them?
>>>> Option 1: unicharset_extractor com.test_font.exp0.box
>>>> Option 2: unicharset_extractor com.test_font.exp0.box com.test_font.exp1.box
>>>> Third question: Why should I use set_unicharset_extractor? It doesn't fix the metrics; it only specifies Latin or Common! Link:
>>>> https://github.com/tesseract-ocr/tesseract/issues/318
>>>>
>>>> After all these unanswered questions, I used mftraining and cntraining (no problems). Finally, I renamed my inttemp, normproto, pffmtable and shapetable files and combined them using combine_tessdata com.
>>>>
>>>> Final question: If I name the files com.inttemp1 and com.inttemp2, does it work? Same for shapetable, normproto and pffmtable.
>>>>
>>>> I think these questions are asked more than once by all new users to tesseract.
>>>> Please, if any expert in tesseract can answer these questions, it will be a great help for the whole community.
>>>> Kindly find attached the 2 tif files and the generated boxes.
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit https://groups.google.com/d/msgid/tesseract-ocr/beb558f3-d52c-4eca-a668-501a9804ffb0%40googlegroups.com.
>>>> For more options, visit https://groups.google.com/d/optout.
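
The sweep described in the most recent message of this thread (psm 1 to 13 against each of the three downloaded tessdata directories) can be scripted instead of edited by hand. The following is only a minimal dry-run sketch: the directory names and sample.tif come from that message, while the function name print_psm_commands and the output-base naming scheme are assumptions made for illustration. It prints the commands rather than running them, and not every mode in that range necessarily produces text on every Tesseract build.

```shell
#!/bin/sh
# Dry-run sketch: print one tesseract command per (tessdata directory, psm) pair.
# The directory names and sample.tif are taken from the thread; the function
# name and the "sample.<dir>.psm<N>" output base are hypothetical choices.
print_psm_commands() {
    image=$1
    for dir in tessdata-master tessdata_best-master tessdata_fast-master; do
        psm=1
        while [ "$psm" -le 13 ]; do
            # A distinct output base per run keeps results from overwriting each other.
            printf 'tesseract --tessdata-dir %s "%s" "sample.%s.psm%s" --psm %s\n' \
                "$dir" "$image" "$dir" "$psm" "$psm"
            psm=$((psm + 1))
        done
    done
}

print_psm_commands sample.tif
```

Piping the printed lines into sh would execute them, after which the resulting text files can be compared side by side to see which mode, if any, keeps the two columns apart.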

