If you want to OCR an invoice like the sample you posted, just use the eng.traineddata and OCR the page. You do not need to do any training.
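As a rough sketch of what that looks like on the command line: the file name invoice.tif below is a placeholder of mine, not a file from this thread, and the script assumes tesseract is installed with the stock eng.traineddata available.

```shell
#!/bin/sh
# Placeholder input name (an assumption): pass your own scanned page as $1.
IMAGE="${1:-invoice.tif}"
# tesseract appends ".txt" to the output base, so strip the image extension.
OUTBASE="${IMAGE%.*}"

# Run only when tesseract is installed and the image actually exists.
if command -v tesseract >/dev/null 2>&1 && [ -f "$IMAGE" ]; then
    # -l eng selects the stock English model; no custom training is involved.
    tesseract "$IMAGE" "$OUTBASE" -l eng
    cat "$OUTBASE.txt"
else
    echo "tesseract or $IMAGE not found; nothing to do" >&2
fi
```

If you start from a PDF, convert it to a 300 dpi image first, as you already did with ImageMagick, and pass the resulting tif to the script.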
Here is the output I get:

8633 0410 NO RP 11 07122015 NNNNNYNN 01 000001 0001
Page 2 Of 3
Did you know?
Your Comcast Business Internet service gives you access to millions of WiFi hotspots with the fastest WiFi and even more coverage. Find out more at businesscomcast.com/wifi.
Need help? We’re here for you.
9 Visit business.comcast.com/help
Call 1-800—391 -3000
A Billing support Open 6 am-9 pm MTN, Mon through Fri and 7 am—8 pm Sat
Technical support Open 24 hours, 7 days a week
Did you know?
Never miss a payment with text alerts. Receive text message reminders when your bill is ready to pay or past due. Sign up at business.comcast.com/myaccount.
Your bill is ready
Please notify us immediately with any questions regarding charges billed to your account. Comcast will issue a credit or refund for any verified billing error which is brought to our attention within sixty (60) days of the bill.
llllllllllllllllllllllllllllllllll
Additional payment options Moving? Let us help.
Automatic payment Sign up at business.comcast.com/myaccount
a Oniine Visit business.comcast.com/myaccount
a By phone Call 1-800-391 -3000
if you're moving, give us as much advanced notice as possible so we can help make a smooth transition. Call 1 -800-391 -3000
|||||||llllllllllllllllllllllllll

ShreeDevi
____________________________________________________________
भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Apr 21, 2017 at 11:34 AM, Alain Ghawi <ghawial...@gmail.com> wrote:

> Hello all,
>
> I am surprised by how many people tell me that tesseract is the best
> open-source OCR tool but yet there is no video explaining step-by-step the
> problems that you can encounter, or a good explanation and documentation
> for OCR.
>
> Well even though, everyone loves challenges! So here's the challenge I
> faced. I brought many pdf files that are invoices and I want to train
> tesseract to be able to ocr them as scanned images.
> So first of all, I transformed these pdf files into tif files using:
> magick -density 300 -depth 4 2151.pdf -background white -fill white -alpha Off 2151%d.tif
> This is ImageMagick. Nothing important here other than we have a 300 dpi
> image with an alpha channel off.
>
> You must rename them so : rename .tif files to:
> [lang].[name_font].exp0.tif (com.test_font.exp0.tif) This is for my example
>
> Great! After this step you must create your box file right? So I simply
> called:
> tesseract com.test_font.exp0.tif com.test_font.exp0 batch.nochop makebox
> tesseract com.test_font.exp0.tif com.test_font.exp1 batch.nochop makebox
>
> Then I fixed my files with CowBoxEditor as I wasn't finding the famous
> jTessBoxEditor online (weird right?) which did the job.
>
> After that, I created my .tr files:
> tesseract com.test_font.exp0.tif com.test_font.exp0 nobatch box.train
> tesseract com.test_font.exp1.tif com.test_font.exp1 nobatch box.train
>
> And here comes the surprises!!!
> After having your .tr files you call unicharset_extractor.
> First question: Why the glyph metrics are all 0,255,0,255,0,0,0,0,0,0?
> Which is wrong according to the documentation:
> https://github.com/tesseract-ocr/tesseract/blob/a3ba11b030345d32829b1e8355afea5419978d82/doc/unicharset.5.asc
> Second question: Should I write a box file, then the other or combine
> them?
> Option 1: unicharset_extractor com.test_font.exp0.box
> or Option 2: unicharset_extractor com.test_font.exp0.box com.test_font.exp1.box
> Third question: set_unicharset_extractor why should I use it? It doesn't
> fix the metrics only specify if Latin or Common!
> Link: https://github.com/tesseract-ocr/tesseract/issues/318
>
> After all these unanswered questions, I used mftraining and cntraining (no
> problems). Finally, I renamed my inttemp, normproto, pffmtable, shapetable
> and I combined them using combine_tessdata com.
>
> Final question: If I named com.inttemp1 com.inttemp2 does it work?
> Same for shapetable, normproto, pffmtable
>
> I think these questions are asked more than once by all new users to
> tesseract. Please if any expert in tesseract can answer these questions it
> will be a great help for all the community.
> Kindly find the attached 2 tif files and the boxes generated.
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To post to this group, send email to tesseract-ocr@googlegroups.com.
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/beb558f3-d52c-4eca-a668-501a9804ffb0%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.