Hello Shree,

I realize this post is more than two years old, but I would appreciate 
any help.
I tried your suggestion on the same attached sample using Tesseract v4, 
but I am unable to reproduce the result you posted.
I have tried all page segmentation modes, and none of them produced 
that result.
Could you please let me know what I might be doing wrong?
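
For anyone who wants to repeat the comparison, here is a minimal sketch of sweeping the page segmentation modes from Python. It assumes tesseract 4.x is on the PATH and that the attached invoice is saved as sample.tif (a placeholder name); the helper names build_command and run_all_modes are hypothetical.

```python
import shutil
import subprocess

# Page segmentation modes accepted by tesseract 4.x (--psm 0 through 13).
PSM_MODES = range(14)


def build_command(image, psm, lang="eng"):
    """Return the tesseract argv for one page segmentation mode.

    Using "stdout" as the output base makes tesseract print the text
    instead of writing a .txt file.
    """
    return ["tesseract", image, "stdout", "--psm", str(psm), "-l", lang]


def run_all_modes(image):
    """Run every mode and return {psm: recognized text or error message}."""
    results = {}
    for psm in PSM_MODES:
        proc = subprocess.run(build_command(image, psm),
                              capture_output=True, text=True)
        # Some modes can fail (for example, mode 0 needs osd.traineddata),
        # so keep stderr in that case rather than aborting the sweep.
        results[psm] = proc.stdout if proc.returncode == 0 else proc.stderr
    return results


if __name__ == "__main__":
    # "sample.tif" is a placeholder for the attached invoice image.
    if shutil.which("tesseract"):
        for psm, text in run_all_modes("sample.tif").items():
            print(f"--- psm {psm} ---\n{text}")
```

Comparing the per-mode output side by side makes it easier to see which modes read the two columns separately and which interleave them line by line.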

Here are the version details for Tesseract on my machine:

tesseract 4.0.0
 leptonica-1.77.0
  libgif 5.1.4 : libjpeg 9c : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.1 : libopenjp2 2.3.0
 Found AVX2
 Found AVX
 Found SSE

Here is the output I get for most of the psm modes:


8633 0410 NO RP 1107122016 NNNNNYNN 07 000001 0001 Page 20f3

Did you know? Did you know?

Your Comcast Business Internet Never miss a payment with text alerts.
service gives you access to millions Receive text message reminders when your
of WiFi hotspots with the fastest WiFi bill is ready to pay or past due. Sign up at
and even more coverage. Find out business.comcast.com/myaccount.

more at business.comcast.conm/wifi.

Your bill is ready

   

Need help? We’re here for you.

 

> Visit business.comcast.com/help Please notify us immediately with any
Call 1-800-391-3000 questions regarding charges billed to your
aa account. Comcast will issue a credit or
Billing support refund for any verified billing error which is
Open 6 am-9 pm MTN, Mon through Fri brought to our attention within sixty (60) days
and 7 am-8 pm Sat of the bill.

Technical support
Open 24 hours, 7 days a week

TT

Automatic payment If you’re moving, give us as much
Sign up at business.comcast.com/myaccount advanced notice as possible so we

Se Online can help make a smooth transition.
Visit business.comcast.com/myaccount

a By phone
Call 1-800-391-3000

Call 1-800-391-3000

IME

 

 

Regards,
Giriraj.

On Friday, April 21, 2017 at 4:55:03 AM UTC-4, shree wrote:
>
> If you want to OCR an invoice like the sample you posted, just use the 
> eng.traineddata and OCR the page. You do not need to do any training.
>
> Here is the output I get 
>
>
>
> 8633 0410 NO RP 11 07122015 NNNNNYNN 01 000001 0001 Page 2 Of 3
>
>
> Did you know?
>
>
> Your Comcast Business Internet
>
> service gives you access to millions
>
> of WiFi hotspots with the fastest WiFi
>
> and even more coverage. Find out
>
> more at businesscomcast.com/wifi.
>
>
>
> Need help? We’re here for you.
>
>
> 9 Visit business.comcast.com/help
>
> Call 1-800—391 -3000
>
> A
>
>
> Billing support
>
> Open 6 am-9 pm MTN, Mon through Fri
>
> and 7 am—8 pm Sat
>
>
> Technical support
>
> Open 24 hours, 7 days a week
>
>
>
> Did you know?
>
>
> Never miss a payment with text alerts.
>
> Receive text message reminders when your
>
> bill is ready to pay or past due. Sign up at
>
> business.comcast.com/myaccount.
>
>
>
> Your bill is ready
>
>
>
>
> Please notify us immediately with any
>
> questions regarding charges billed to your
>
> account. Comcast will issue a credit or
>
> refund for any verified billing error which is
>
> brought to our attention within sixty (60) days
>
> of the bill.
>
>
> llllllllllllllllllllllllllllllllll
>
>
> Additional payment options Moving? Let us help.
>
>
> Automatic payment
>
> Sign up at business.comcast.com/myaccount
>
>
> a Oniine
>
>
> Visit business.comcast.com/myaccount
>
>
> a By phone
>
> Call 1-800-391 -3000
>
>
> if you're moving, give us as much
>
> advanced notice as possible so we
>
> can help make a smooth transition.
>
>
> Call 1 -800-391 -3000
>
>
> |||||||llllllllllllllllllllllllll
>
>
>
>
> ShreeDevi
> ____________________________________________________________
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>
> On Fri, Apr 21, 2017 at 11:34 AM, Alain Ghawi <ghawi...@gmail.com> wrote:
>
>> Hello all,
>>
>> I am surprised that so many people tell me Tesseract is the best 
>> open-source OCR tool, yet there is no step-by-step video explaining the 
>> problems you can encounter, nor a good explanation and documentation 
>> for OCR.
>>
>> Even so, everyone loves a challenge! Here is the one I faced. I have many 
>> PDF invoices, and I want to train Tesseract to OCR them as scanned 
>> images. 
>> First, I converted the PDF files into TIFF files with ImageMagick:
>> magick -density 300 -depth 4 2151.pdf -background white -fill white -alpha Off 2151%d.tif
>> Nothing important here, other than that we get a 300 dpi image with the 
>> alpha channel off.
>>
>> You must then rename the .tif files to 
>> [lang].[name_font].exp0.tif (com.test_font.exp0.tif in my example).
>>
>> Great! After this step you must create your box file right? So I simply 
>> called: 
>> tesseract com.test_font.exp0.tif com.test_font.exp0 batch.nochop makebox
>> tesseract com.test_font.exp1.tif com.test_font.exp1 batch.nochop makebox
>>
>> Then I fixed my box files with CowBoxEditor, which did the job, as I could 
>> not find the famous jTessBoxEditor online (weird, right?).
>>
>> After that, I created my .tr files:
>> tesseract com.test_font.exp0.tif com.test_font.exp0 nobatch box.train
>> tesseract com.test_font.exp1.tif com.test_font.exp1 nobatch box.train
>>
>> And here come the surprises!
>> After creating your .tr files, you call unicharset_extractor. 
>> First question: Why are the glyph metrics all 0,255,0,255,0,0,0,0,0,0? 
>> That looks wrong according to the documentation: 
>> https://github.com/tesseract-ocr/tesseract/blob/a3ba11b030345d32829b1e8355afea5419978d82/doc/unicharset.5.asc
>> Second question: Should I pass one box file, then the other, or pass them 
>> together? Option 1: unicharset_extractor com.test_font.exp0.box   or Option 2: 
>> unicharset_extractor com.test_font.exp0.box com.test_font.exp1.box  
>> Third question: Why should I use set_unicharset_properties? It doesn't 
>> fix the metrics; it only specifies whether the script is Latin or Common! Link: 
>> https://github.com/tesseract-ocr/tesseract/issues/318
>>
>> After all these unanswered questions, I used mftraining and cntraining 
>> (no problems). Finally, I renamed my inttemp, normproto, 
>> pffmtable, shapetable  and I combined them using combine_tessdata com.
>>
>> Final question: If I name the files com.inttemp1 and com.inttemp2, does it 
>> still work? The same question applies to shapetable, normproto, and pffmtable.
>>
>> I think every new Tesseract user asks these questions more than once. 
>> If any Tesseract expert can answer them, it would be a great help for the 
>> whole community.
>> Please find attached the 2 tif files and the generated box files. 
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to tesser...@googlegroups.com.
>> To post to this group, send email to tesser...@googlegroups.com.
>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>> To view this discussion on the web visit 
>> https://groups.google.com/d/msgid/tesseract-ocr/beb558f3-d52c-4eca-a668-501a9804ffb0%40googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>
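
For reference, the legacy training steps described in the quoted post can be sketched as an ordered list of commands. This is only a sketch under several assumptions: the arguments to mftraining and cntraining are not given in the post and are taken here from the Tesseract 3.x training instructions, the font_properties file name is likewise assumed, each .tif is assumed to be boxed under its own name, and unicharset_extractor is given both box files at once (Option 2 in the post). The function names are hypothetical, and the script only prints the commands; it does not run them.

```python
LANG = "com"        # language code used in the quoted post
FONT = "test_font"  # font name used in the quoted post
PAGES = [0, 1]      # exp0 and exp1


def base(page):
    """File stem for one training page, e.g. com.test_font.exp0."""
    return f"{LANG}.{FONT}.exp{page}"


def renamed_outputs():
    """Map each trainer output file to the LANG-prefixed name used when combining."""
    names = ("inttemp", "normproto", "pffmtable", "shapetable")
    return {name: f"{LANG}.{name}" for name in names}


def training_commands():
    """Return the command lines in the order the post describes them."""
    cmds = []
    # 1. Generate a box file per page (to be corrected by hand afterwards).
    for page in PAGES:
        cmds.append(["tesseract", f"{base(page)}.tif", base(page),
                     "batch.nochop", "makebox"])
    # 2. Produce a .tr feature file per page from the corrected boxes.
    for page in PAGES:
        cmds.append(["tesseract", f"{base(page)}.tif", base(page),
                     "nobatch", "box.train"])
    # 3. Build the unicharset from all box files in one call.
    cmds.append(["unicharset_extractor"] + [f"{base(p)}.box" for p in PAGES])
    # 4. Clustering steps (arguments assumed, not taken from the post).
    tr_files = [f"{base(p)}.tr" for p in PAGES]
    cmds.append(["mftraining", "-F", "font_properties", "-U", "unicharset",
                 "-O", f"{LANG}.unicharset"] + tr_files)
    cmds.append(["cntraining"] + tr_files)
    # 5. After renaming the outputs (see renamed_outputs), combine them.
    cmds.append(["combine_tessdata", f"{LANG}."])
    return cmds


if __name__ == "__main__":
    for cmd in training_commands():
        print(" ".join(cmd))
    for old, new in renamed_outputs().items():
        print(f"rename {old} -> {new}  (before combine_tessdata)")
```

Note that this is the older, box-file-based process; it is separate from running the stock eng.traineddata on the invoice, which needs no training at all.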

