@zdenko Please check this image (from the first post) with 3.0x and current 4.0x code to see if there is a regression in terms of recognition of 2 columns.
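For anyone who wants to repeat the comparison, here is a minimal sketch that generates one run per traineddata directory and page segmentation mode, using the directory names, image name and `--psm` range from the commands quoted further down in this thread. The output base name `out_<dir>_psm<N>` is my own naming, not something from the thread, and a 3.0x build may need a different spelling of the psm option. It only prints the commands unless RUN=1 is set.

```shell
#!/bin/sh
# Dry run by default: prints the commands. Set RUN=1 to execute them.
IMAGE="sample.tif"   # image name taken from the thread

# Print one tesseract command line for a tessdata directory and a psm value.
# The output base name (out_<dir>_psm<N>) is a hypothetical naming scheme.
build_cmd() {
    printf 'tesseract --tessdata-dir %s "%s" %s --psm %s\n' \
        "$1" "$IMAGE" "out_${1}_psm${2}" "$2"
}

for dir in tessdata_best-master tessdata_fast-master tessdata-master; do
    psm=1
    while [ "$psm" -le 13 ]; do
        cmd=$(build_cmd "$dir" "$psm")
        if [ "${RUN:-0}" = "1" ]; then
            eval "$cmd"      # run tesseract; a failing mode does not stop the loop
        else
            echo "$cmd"      # just show what would be run
        fi
        psm=$((psm + 1))
    done
done
```

Keeping a separate output file per run makes it easy to diff the text files and see which combination, if any, keeps the two columns apart.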

On Fri, Apr 26, 2019 at 10:25 PM Giriraj Bhojak <girira...@gmail.com> wrote:

> Thank you, I will try it out next.
> I wanted to use version 4 of tesseract since it uses an LSTM based OCR
> engine. Higher accuracy is one of the essential requirements for my use case.
> Would you know if v4 supports extracting text from a two column text
> structure image file at all?
> Thank you for your quick response Shree!
>
> Regards,
> Giriraj.
>
> On Friday, April 26, 2019 at 12:35:05 PM UTC-4, shree wrote:
>>
>> April 2017 - It is probably the 3.0x version. Try the 3.05 branch.
>>
>> https://github.com/tesseract-ocr/tesseract/releases/tag/3.05.01
>> 3.05.01 Release, released by zdenop on Jun 1, 2017
>>
>> On Fri, Apr 26, 2019 at 9:24 PM Giriraj Bhojak <giri...@gmail.com> wrote:
>>
>>> Hi Shree,
>>>
>>> Thank you for the quick response.
>>> I used the trained data by downloading the datasets at
>>> https://github.com/tesseract-ocr/tessdata,
>>> https://github.com/tesseract-ocr/tessdata_best and
>>> https://github.com/tesseract-ocr/tessdata_fast.
>>>
>>> I ran the following commands for each of these datasets and changed psm
>>> from 1 to 13, but more or less the output is like the one I posted. I
>>> couldn't get the output as you have posted it, with the data in the right
>>> order of the context.
>>>
>>> tesseract --tessdata-dir tessdata_best-master "sample.tif" sample --psm 1
>>> tesseract --tessdata-dir tessdata_fast-master "sample.tif" sample --psm 1
>>> tesseract --tessdata-dir tessdata-master "sample.tif" sample --psm 1
>>>
>>> Not sure what I am doing wrong here, appreciate your help with this.
>>>
>>> Regards,
>>> Giriraj
>>>
>>> On Friday, April 26, 2019 at 3:04:34 AM UTC-4, shree wrote:
>>>>
>>>> Which eng.traineddata did you use?
>>>>
>>>> There are three options:
>>>> from tessdata, tessdata_best and tessdata_fast.
>>>>
>>>> On Fri, 26 Apr 2019, 09:19 Giriraj Bhojak, <giri...@gmail.com> wrote:
>>>>
>>>>> Hello Shree,
>>>>>
>>>>> I realize this post is more than two years old now, but would
>>>>> appreciate any help.
>>>>> I tried your suggestion on the same attached sample using tesseract v4
>>>>> and I am unable to get the result as you have posted.
>>>>> I have tried all page segmentation modes, but none of them produced
>>>>> the result you have posted.
>>>>> Could you please let me know what I might be doing wrong?
>>>>>
>>>>> Here is the version detail for tesseract on my machine:
>>>>>
>>>>> tesseract 4.0.0
>>>>> leptonica-1.77.0
>>>>> libgif 5.1.4 : libjpeg 9c : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11 : libwebp 1.0.1 : libopenjp2 2.3.0
>>>>> Found AVX2
>>>>> Found AVX
>>>>> Found SSE
>>>>>
>>>>> Here is the output I get for most of the psm modes:
>>>>>
>>>>> 8633 0410 NO RP 1107122016 NNNNNYNN 07 000001 0001 Page 20f3
>>>>>
>>>>> Did you know? Did you know?
>>>>>
>>>>> Your Comcast Business Internet Never miss a payment with text alerts.
>>>>> service gives you access to millions Receive text message reminders when your
>>>>> of WiFi hotspots with the fastest WiFi bill is ready to pay or past due. Sign up at
>>>>> and even more coverage. Find out business.comcast.com/myaccount.
>>>>>
>>>>> more at business.comcast.conm/wifi.
>>>>>
>>>>> Your bill is ready
>>>>>
>>>>> Need help? We’re here for you.
>>>>>
>>>>> > Visit business.comcast.com/help Please notify us immediately with any
>>>>> Call 1-800-391-3000 questions regarding charges billed to your
>>>>> aa account. Comcast will issue a credit or
>>>>> Billing support refund for any verified billing error which is
>>>>> Open 6 am-9 pm MTN, Mon through Fri brought to our attention within sixty (60) days
>>>>> and 7 am-8 pm Sat of the bill.
>>>>>
>>>>> Technical support
>>>>> Open 24 hours, 7 days a week
>>>>>
>>>>> TT
>>>>>
>>>>> Automatic payment If you’re moving, give us as much
>>>>> Sign up at business.comcast.com/myaccount advanced notice as possible so we
>>>>>
>>>>> Se Online can help make a smooth transition.
>>>>> Visit business.comcast.com/myaccount
>>>>>
>>>>> a By phone
>>>>> Call 1-800-391-3000
>>>>>
>>>>> Call 1-800-391-3000
>>>>>
>>>>> IME
>>>>>
>>>>> Regards,
>>>>> Giriraj.
>>>>>
>>>>> On Friday, April 21, 2017 at 4:55:03 AM UTC-4, shree wrote:
>>>>>>
>>>>>> If you want to OCR an invoice like the sample you posted, just use
>>>>>> the eng.traineddata and OCR the page. You do not need to do any training.
>>>>>>
>>>>>> Here is the output I get
>>>>>>
>>>>>> 8633 0410 NO RP 11 07122015 NNNNNYNN 01 000001 0001 Page 2 Of 3
>>>>>>
>>>>>> Did you know?
>>>>>>
>>>>>> Your Comcast Business Internet
>>>>>> service gives you access to millions
>>>>>> of WiFi hotspots with the fastest WiFi
>>>>>> and even more coverage. Find out
>>>>>> more at businesscomcast.com/wifi.
>>>>>>
>>>>>> Need help? We’re here for you.
>>>>>>
>>>>>> 9 Visit business.comcast.com/help
>>>>>> Call 1-800—391 -3000
>>>>>> A
>>>>>>
>>>>>> Billing support
>>>>>> Open 6 am-9 pm MTN, Mon through Fri
>>>>>> and 7 am—8 pm Sat
>>>>>>
>>>>>> Technical support
>>>>>> Open 24 hours, 7 days a week
>>>>>>
>>>>>> Did you know?
>>>>>>
>>>>>> Never miss a payment with text alerts.
>>>>>> Receive text message reminders when your
>>>>>> bill is ready to pay or past due. Sign up at
>>>>>> business.comcast.com/myaccount.
>>>>>>
>>>>>> Your bill is ready
>>>>>>
>>>>>> Please notify us immediately with any
>>>>>> questions regarding charges billed to your
>>>>>> account. Comcast will issue a credit or
>>>>>> refund for any verified billing error which is
>>>>>> brought to our attention within sixty (60) days
>>>>>> of the bill.
>>>>>>
>>>>>> llllllllllllllllllllllllllllllllll
>>>>>>
>>>>>> Additional payment options Moving? Let us help.
>>>>>>
>>>>>> Automatic payment
>>>>>> Sign up at business.comcast.com/myaccount
>>>>>>
>>>>>> a Oniine
>>>>>>
>>>>>> Visit business.comcast.com/myaccount
>>>>>>
>>>>>> a By phone
>>>>>> Call 1-800-391 -3000
>>>>>>
>>>>>> if you're moving, give us as much
>>>>>> advanced notice as possible so we
>>>>>> can help make a smooth transition.
>>>>>>
>>>>>> Call 1 -800-391 -3000
>>>>>>
>>>>>> |||||||llllllllllllllllllllllllll
>>>>>>
>>>>>> ShreeDevi
>>>>>> ____________________________________________________________
>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>>
>>>>>> On Fri, Apr 21, 2017 at 11:34 AM, Alain Ghawi <ghawi...@gmail.com> wrote:
>>>>>>
>>>>>>> Hello all,
>>>>>>>
>>>>>>> I am surprised by how many people tell me that tesseract is the best
>>>>>>> open-source OCR tool, and yet there is no video explaining step by step
>>>>>>> the problems that you can encounter, or a good explanation and
>>>>>>> documentation for OCR.
>>>>>>>
>>>>>>> Well, even so, everyone loves challenges! So here's the challenge
>>>>>>> I faced. I brought many pdf files that are invoices and I want to train
>>>>>>> tesseract to be able to ocr them as scanned images.
>>>>>>> So first of all, I transformed these pdf files into tif files using:
>>>>>>> magick -density 300 -depth 4 2151.pdf -background white -fill white -alpha Off 2151%d.tif
>>>>>>> This is ImageMagick. Nothing important here other than we have a 300
>>>>>>> dpi image with the alpha channel off.
>>>>>>>
>>>>>>> You must rename the .tif files to [lang].[name_font].exp0.tif
>>>>>>> (com.test_font.exp0.tif in my example).
>>>>>>>
>>>>>>> Great! After this step you must create your box file, right? So I
>>>>>>> simply called:
>>>>>>> tesseract com.test_font.exp0.tif com.test_font.exp0 batch.nochop makebox
>>>>>>> tesseract com.test_font.exp0.tif com.test_font.exp1 batch.nochop makebox
>>>>>>>
>>>>>>> Then I fixed my files with CowBoxEditor, which did the job, as I wasn't
>>>>>>> finding the famous jTessBoxEditor online (weird, right?).
>>>>>>>
>>>>>>> After that, I created my .tr files:
>>>>>>> tesseract com.test_font.exp0.tif com.test_font.exp0 nobatch box.train
>>>>>>> tesseract com.test_font.exp1.tif com.test_font.exp1 nobatch box.train
>>>>>>>
>>>>>>> And here come the surprises!!!
>>>>>>> After having your .tr files you call unicharset_extractor.
>>>>>>> First question: Why are the glyph metrics all
>>>>>>> 0,255,0,255,0,0,0,0,0,0? Which is wrong according to the documentation:
>>>>>>> https://github.com/tesseract-ocr/tesseract/blob/a3ba11b030345d32829b1e8355afea5419978d82/doc/unicharset.5.asc
>>>>>>> Second question: Should I write a box file, then the other or
>>>>>>> combine them?
>>>>>>> Option 1: unicharset_extractor com.test_font.exp0.box
>>>>>>> or
>>>>>>> Option 2: unicharset_extractor com.test_font.exp0.box com.test_font.exp1.box
>>>>>>> Third question: set_unicharset_extractor, why should I use it? It
>>>>>>> doesn't fix the metrics, it only specifies if Latin or Common! Link:
>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/318
>>>>>>>
>>>>>>> After all these unanswered questions, I used mftraining and
>>>>>>> cntraining (no problems). Finally, I renamed my inttemp, normproto,
>>>>>>> pffmtable, shapetable and I combined them using combine_tessdata com.
>>>>>>>
>>>>>>> Final question: If I named com.inttemp1 com.inttemp2 does it work?
>>>>>>> Same for shapetable, normproto, pffmtable
>>>>>>>
>>>>>>> I think these questions are asked more than once by all new users of
>>>>>>> tesseract. Please, if any expert in tesseract can answer these
>>>>>>> questions it will be a great help for all the community.
>>>>>>> Kindly find the attached 2 tif files and the boxes generated.
>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to tesser...@googlegroups.com.
>>>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/beb558f3-d52c-4eca-a668-501a9804ffb0%40googlegroups.com
>>>>>>> For more options, visit https://groups.google.com/d/optout.