Re: [tesseract-ocr] Training tesseract-ocr unicharset_extractor, mftraining, cntraining

ShreeDevi Kumar Fri, 21 Apr 2017 01:55:18 -0700

If you want to OCR an invoice like the sample you posted, just use the
eng.traineddata and OCR the page. You do not need to do any training.
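
As a minimal sketch of that, assuming the scanned page is saved as
invoice.tif (a hypothetical file name) and the stock English model is
installed, the call could look like this:

```shell
#!/bin/sh
# Hypothetical file names; substitute your own scanned page.
IMAGE="invoice.tif"
OUTPUT_BASE="invoice"   # tesseract appends .txt to this base name

if ! command -v tesseract >/dev/null 2>&1 || [ ! -f "$IMAGE" ]; then
    # Nothing to do when the tool or the input image is missing.
    STATUS="skipped: tesseract or $IMAGE not found"
elif tesseract "$IMAGE" "$OUTPUT_BASE" -l eng; then
    # -l eng selects the stock eng.traineddata; no custom training involved.
    STATUS="done: text written to $OUTPUT_BASE.txt"
else
    STATUS="failed: tesseract returned an error"
fi
echo "$STATUS"
```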


Here is the output I get:



8633 0410 NO RP 11 07122015 NNNNNYNN 01 000001 0001 Page 2 Of 3


Did you know?


Your Comcast Business Internet

service gives you access to millions

of WiFi hotspots with the fastest WiFi

and even more coverage. Find out

more at businesscomcast.com/wiﬁ.



Need help? We’re here for you.


9 Visit business.comcast.com/help

Call 1-800—391 -3000

A


Billing support

Open 6 am-9 pm MTN, Mon through Fri

and 7 am—8 pm Sat


Technical support

Open 24 hours, 7 days a week



Did you know?


Never miss a payment with text alerts.

Receive text message reminders when your

bill is ready to pay or past due. Sign up at

business.comcast.com/myaccount.



Your bill is ready




Please notify us immediately with any

questions regarding charges billed to your

account. Comcast will issue a credit or

refund for any verified billing error which is

brought to our attention within sixty (60) days

of the bill.


llllllllllllllllllllllllllllllllll


Additional payment options Moving? Let us help.


Automatic payment

Sign up at business.comcast.com/myaccount


a Oniine


Visit business.comcast.com/myaccount


a By phone

Call 1-800-391 -3000


if you're moving, give us as much

advanced notice as possible so we

can help make a smooth transition.


Call 1 -800-391 -3000


|||||||llllllllllllllllllllllllll




ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Apr 21, 2017 at 11:34 AM, Alain Ghawi <ghawial...@gmail.com> wrote:

> Hello all,
>
> I am surprised that so many people call tesseract the best open-source
> OCR tool, yet there is no step-by-step video covering the problems you
> can encounter, nor a good explanation and documentation for OCR.
>
> Still, everyone loves a challenge, so here is the one I faced. I have
> many PDF files that are invoices, and I want to train tesseract to OCR
> them as scanned images.
> First, I converted these PDF files into TIFF files with ImageMagick:
> magick -density 300 -depth 4 2151.pdf -background white -fill white -alpha Off 2151%d.tif
> Nothing important here, other than that we get a 300 dpi image with the
> alpha channel off.
>
> Next, you must rename the .tif files to [lang].[name_font].exp0.tif
> (com.test_font.exp0.tif in my example).
>
> Great! After this step you must create your box file right? So I simply
> called:
> tesseract com.test_font.exp0.tif com.test_font.exp0 batch.nochop makebox
> tesseract com.test_font.exp1.tif com.test_font.exp1 batch.nochop makebox
>
> Then I fixed my box files with CowBoxEditor, which did the job, as I
> couldn't find the famous jTessBoxEditor online (weird, right?).
>
> After that, I created my .tr files:
> tesseract com.test_font.exp0.tif com.test_font.exp0 nobatch box.train
> tesseract com.test_font.exp1.tif com.test_font.exp1 nobatch box.train
>
> And here come the surprises!
> After you have your .tr files, you call unicharset_extractor.
> First question: Why are the glyph metrics all 0,255,0,255,0,0,0,0,0,0?
> That looks wrong according to the documentation:
> https://github.com/tesseract-ocr/tesseract/blob/a3ba11b030345d32829b1e8355afea5419978d82/doc/unicharset.5.asc
> Second question: Should I pass one box file and then the other, or
> combine them in one call?
> Option 1: unicharset_extractor com.test_font.exp0.box
> Option 2: unicharset_extractor com.test_font.exp0.box com.test_font.exp1.box
> Third question: Why should I use set_unicharset_properties? It doesn't
> fix the metrics; it only specifies whether the script is Latin or Common.
> Link: https://github.com/tesseract-ocr/tesseract/issues/318
>
> With these questions still unanswered, I ran mftraining and cntraining (no
> problems). Finally, I renamed my inttemp, normproto, pffmtable, and
> shapetable files and combined them using: combine_tessdata com.
>
> Final question: If I name the files com.inttemp1 and com.inttemp2, does
> that work? Same for shapetable, normproto, and pffmtable.
>
> I think these questions are asked more than once by new tesseract users.
> If any tesseract expert can answer them, it would be a great help to the
> whole community.
> Please find attached the 2 tif files and the generated box files.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To post to this group, send email to tesseract-ocr@googlegroups.com.
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/beb558f3-d52c-4eca-a668-501a9804ffb0%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.
>
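
For reference, the quoted steps can be sketched as a dry-run script. This
is only a sketch of the legacy (3.x) training sequence as described in the
Tesseract training documentation, not a tested recipe for these invoices.
The font_properties file is one you must create yourself, and the DRY_RUN
switch and the run helper are hypothetical conveniences added here. It
passes all box files to unicharset_extractor in one call (Option 2 above).

```shell
#!/bin/sh
# Dry-run sketch of the legacy (Tesseract 3.x) training sequence.
# LANG_CODE and FONT mirror the names used in the quoted message.
LANG_CODE="com"
FONT="test_font"
DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: set DRY_RUN=0 to execute

run() {
    # Print each command; execute it only when DRY_RUN=0.
    echo "+ $*"
    if [ "$DRY_RUN" = "0" ]; then
        "$@"
    fi
}

BOXES="$LANG_CODE.$FONT.exp0.box $LANG_CODE.$FONT.exp1.box"
TRS="$LANG_CODE.$FONT.exp0.tr $LANG_CODE.$FONT.exp1.tr"

# The lists are left unquoted on purpose so each file is its own argument.
# One unicharset is built from all box files together.
run unicharset_extractor $BOXES
# font_properties is assumed to exist already (one line per font).
run mftraining -F font_properties -U unicharset -O "$LANG_CODE.unicharset" $TRS
run cntraining $TRS

# Prefix the outputs with the language code before combining.
for f in inttemp normproto pffmtable shapetable; do
    run mv "$f" "$LANG_CODE.$f"
done
run combine_tessdata "$LANG_CODE."
```

With the default DRY_RUN=1 the script only prints the commands, so the
sequence can be reviewed before anything is run.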

