@zdenko Please check this image (from the first post) with 3.0x and current
4.0x code to see if there is a regression in terms of recognition of 2
columns.
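
A rough sketch of how that comparison could be scripted, in Python. The binary paths passed in, the `-psm` spelling for the 3.0x build, and the helper names are all assumptions for illustration, not something confirmed in this thread. It only checks whether a few known left-column lines from the sample come out on lines of their own, which is a crude proxy for "the two columns were kept apart".

```python
import shutil
import subprocess

# Left-column lines copied from the sample output quoted in this thread.
LEFT_COLUMN_LINES = [
    "Your Comcast Business Internet",
    "service gives you access to millions",
]


def build_command(binary, image, psm=1, psm_flag="--psm", tessdata_dir=None):
    """Build a tesseract command line that writes the OCR text to stdout.

    psm_flag is a parameter because older 3.0x builds may expect "-psm"
    instead of "--psm" (an assumption; check your build's --help output).
    """
    cmd = [binary]
    if tessdata_dir:
        cmd += ["--tessdata-dir", tessdata_dir]
    cmd += [image, "stdout", psm_flag, str(psm)]
    return cmd


def missing_standalone_lines(ocr_text, expected_lines):
    """Return the expected lines that do not appear on a line of their own."""
    seen = {line.strip() for line in ocr_text.splitlines()}
    return [line for line in expected_lines if line not in seen]


def compare_builds(image, builds):
    """Run each build on the image and report which left-column lines were merged.

    builds maps a label to (binary, psm_flag). A build whose binary cannot be
    found is reported as None instead of raising.
    """
    results = {}
    for label, (binary, psm_flag) in builds.items():
        if shutil.which(binary) is None:
            results[label] = None
            continue
        cmd = build_command(binary, image, psm=1, psm_flag=psm_flag)
        text = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
        results[label] = missing_standalone_lines(text, LEFT_COLUMN_LINES)
    return results
```

Calling something like `compare_builds("sample.tif", {"3.05": ("/path/to/3.05/tesseract", "-psm"), "4.0": ("tesseract", "--psm")})` (paths hypothetical) would return an empty list for a build that keeps the columns separate and the merged lines for one that interleaves them.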

On Fri, Apr 26, 2019 at 10:25 PM Giriraj Bhojak <girira...@gmail.com> wrote:

> Thank you, I will try it out next.
> I wanted to use version 4 of tesseract since it uses an LSTM-based OCR
> engine. Higher accuracy is one of the essential requirements for my use case.
> Do you know whether v4 supports extracting text from an image with a
> two-column text layout at all?
> Thank you for your quick response, Shree!
>
> Regards,
> Giriraj.
>
> On Friday, April 26, 2019 at 12:35:05 PM UTC-4, shree wrote:
>>
>> My output was from April 2017, so it was probably produced with a 3.0x
>> version. Try the 3.05 branch.
>>
>> https://github.com/tesseract-ocr/tesseract/releases/tag/3.05.01
>> 3.05.01 release, published by zdenop on Jun 1, 2017.
>>
>> On Fri, Apr 26, 2019 at 9:24 PM Giriraj Bhojak <giri...@gmail.com> wrote:
>>
>>> Hi Shree,
>>>
>>> Thank you for the quick response.
>>> I used the trained data by downloading the datasets at
>>> https://github.com/tesseract-ocr/tessdata,
>>> https://github.com/tesseract-ocr/tessdata_best and
>>> https://github.com/tesseract-ocr/tessdata_fast.
>>>
>>> I ran the following commands for each of these datasets and varied --psm
>>> from 1 to 13, but the output is more or less the same as the one I posted.
>>> I couldn't get the output you posted, which has the text in the correct
>>> reading order.
>>>
>>> tesseract --tessdata-dir tessdata_best-master "sample.tif" sample --psm 1
>>> tesseract --tessdata-dir tessdata_fast-master "sample.tif" sample --psm 1
>>> tesseract --tessdata-dir tessdata-master "sample.tif" sample --psm 1
>>>
>>> Not sure what I am doing wrong here, appreciate your help with this.
>>>
>>> Regards,
>>> Giriraj
>>>
>>> On Friday, April 26, 2019 at 3:04:34 AM UTC-4, shree wrote:
>>>>
>>>> Which eng.traineddata did you use?
>>>>
>>>> There are three options
>>>> From tessdata, tessdata_best and tessdata_fast.
>>>>
>>>> On Fri, 26 Apr 2019, 09:19 Giriraj Bhojak, <giri...@gmail.com> wrote:
>>>>
>>>>> Hello Shree,
>>>>>
>>>>> I realize this post is more than two years old now, but would
>>>>> appreciate any help.
>>>>> I tried your suggestion on the same attached sample using tesseract v4
>>>>> and I am unable to get the result as you have posted.
>>>>> I have tried all page segmentation modes, but none of them produced
>>>>> the result you have posted.
>>>>> Could you please let me know what I might be doing wrong?
>>>>>
>>>>> Here are the version details for tesseract on my machine:
>>>>>
>>>>> tesseract 4.0.0
>>>>>  leptonica-1.77.0
>>>>>   libgif 5.1.4 : libjpeg 9c : libpng 1.6.36 : libtiff 4.0.10 : zlib
>>>>> 1.2.11 : libwebp 1.0.1 : libopenjp2 2.3.0
>>>>>  Found AVX2
>>>>>  Found AVX
>>>>>  Found SSE
>>>>>
>>>>> Here is the output I get for most of the psm modes:
>>>>>
>>>>>
>>>>> 8633 0410 NO RP 1107122016 NNNNNYNN 07 000001 0001 Page 20f3
>>>>>
>>>>> Did you know? Did you know?
>>>>>
>>>>> Your Comcast Business Internet Never miss a payment with text alerts.
>>>>> service gives you access to millions Receive text message reminders
>>>>> when your
>>>>> of WiFi hotspots with the fastest WiFi bill is ready to pay or past
>>>>> due. Sign up at
>>>>> and even more coverage. Find out business.comcast.com/myaccount.
>>>>>
>>>>> more at business.comcast.conm/wifi.
>>>>>
>>>>> Your bill is ready
>>>>>
>>>>>
>>>>>
>>>>> Need help? We’re here for you.
>>>>>
>>>>>
>>>>>
>>>>> > Visit business.comcast.com/help Please notify us immediately with
>>>>> any
>>>>> Call 1-800-391-3000 questions regarding charges billed to your
>>>>> aa account. Comcast will issue a credit or
>>>>> Billing support refund for any verified billing error which is
>>>>> Open 6 am-9 pm MTN, Mon through Fri brought to our attention within
>>>>> sixty (60) days
>>>>> and 7 am-8 pm Sat of the bill.
>>>>>
>>>>> Technical support
>>>>> Open 24 hours, 7 days a week
>>>>>
>>>>> TT
>>>>>
>>>>> Automatic payment If you’re moving, give us as much
>>>>> Sign up at business.comcast.com/myaccount advanced notice as possible
>>>>> so we
>>>>>
>>>>> Se Online can help make a smooth transition.
>>>>> Visit business.comcast.com/myaccount
>>>>>
>>>>> a By phone
>>>>> Call 1-800-391-3000
>>>>>
>>>>> Call 1-800-391-3000
>>>>>
>>>>> IME
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Regards,
>>>>> Giriraj.
>>>>>
>>>>> On Friday, April 21, 2017 at 4:55:03 AM UTC-4, shree wrote:
>>>>>>
>>>>>> If you want to OCR an invoice like the sample you posted, just use
>>>>>> the eng.traineddata and OCR the page. You do not need to do any training.
>>>>>>
>>>>>> Here is the output I get
>>>>>>
>>>>>>
>>>>>>
>>>>>> 8633 0410 NO RP 11 07122015 NNNNNYNN 01 000001 0001 Page 2 Of 3
>>>>>>
>>>>>>
>>>>>> Did you know?
>>>>>>
>>>>>>
>>>>>> Your Comcast Business Internet
>>>>>>
>>>>>> service gives you access to millions
>>>>>>
>>>>>> of WiFi hotspots with the fastest WiFi
>>>>>>
>>>>>> and even more coverage. Find out
>>>>>>
>>>>>> more at businesscomcast.com/wifi.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Need help? We’re here for you.
>>>>>>
>>>>>>
>>>>>> 9 Visit business.comcast.com/help
>>>>>>
>>>>>> Call 1-800—391 -3000
>>>>>>
>>>>>> A
>>>>>>
>>>>>>
>>>>>> Billing support
>>>>>>
>>>>>> Open 6 am-9 pm MTN, Mon through Fri
>>>>>>
>>>>>> and 7 am—8 pm Sat
>>>>>>
>>>>>>
>>>>>> Technical support
>>>>>>
>>>>>> Open 24 hours, 7 days a week
>>>>>>
>>>>>>
>>>>>>
>>>>>> Did you know?
>>>>>>
>>>>>>
>>>>>> Never miss a payment with text alerts.
>>>>>>
>>>>>> Receive text message reminders when your
>>>>>>
>>>>>> bill is ready to pay or past due. Sign up at
>>>>>>
>>>>>> business.comcast.com/myaccount.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Your bill is ready
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Please notify us immediately with any
>>>>>>
>>>>>> questions regarding charges billed to your
>>>>>>
>>>>>> account. Comcast will issue a credit or
>>>>>>
>>>>>> refund for any verified billing error which is
>>>>>>
>>>>>> brought to our attention within sixty (60) days
>>>>>>
>>>>>> of the bill.
>>>>>>
>>>>>>
>>>>>> llllllllllllllllllllllllllllllllll
>>>>>>
>>>>>>
>>>>>> Additional payment options Moving? Let us help.
>>>>>>
>>>>>>
>>>>>> Automatic payment
>>>>>>
>>>>>> Sign up at business.comcast.com/myaccount
>>>>>>
>>>>>>
>>>>>> a Oniine
>>>>>>
>>>>>>
>>>>>> Visit business.comcast.com/myaccount
>>>>>>
>>>>>>
>>>>>> a By phone
>>>>>>
>>>>>> Call 1-800-391 -3000
>>>>>>
>>>>>>
>>>>>> if you're moving, give us as much
>>>>>>
>>>>>> advanced notice as possible so we
>>>>>>
>>>>>> can help make a smooth transition.
>>>>>>
>>>>>>
>>>>>> Call 1 -800-391 -3000
>>>>>>
>>>>>>
>>>>>> |||||||llllllllllllllllllllllllll
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> ShreeDevi
>>>>>> ____________________________________________________________
>>>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>>>
>>>>>> On Fri, Apr 21, 2017 at 11:34 AM, Alain Ghawi <ghawi...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello all,
>>>>>>>
>>>>>>> I am surprised by how many people tell me that tesseract is the best
>>>>>>> open-source OCR tool, yet there is no video explaining step by step the
>>>>>>> problems you can encounter, nor a good explanation or documentation
>>>>>>> for OCR.
>>>>>>>
>>>>>>> Even so, everyone loves a challenge! So here is the one I faced. I have
>>>>>>> many PDF files that are invoices, and I want to train tesseract to be
>>>>>>> able to OCR them as scanned images.
>>>>>>> First, I converted these PDF files into TIFF files using ImageMagick:
>>>>>>> magick -density 300 -depth 4   2151.pdf -background white -fill white -alpha Off  2151%d.tif
>>>>>>> Nothing important here, other than that we get a 300 dpi image with
>>>>>>> the alpha channel off.
>>>>>>>
>>>>>>> Next, you must rename the .tif files to
>>>>>>> [lang].[name_font].exp0.tif (com.test_font.exp0.tif in my example).
>>>>>>>
>>>>>>> Great! After this step you must create your box files, right? So I
>>>>>>> simply called:
>>>>>>> tesseract com.test_font.exp0.tif com.test_font.exp0 batch.nochop makebox
>>>>>>> tesseract com.test_font.exp1.tif com.test_font.exp1 batch.nochop makebox
>>>>>>>
>>>>>>> Then I fixed my box files with CowBoxEditor, which did the job, since
>>>>>>> I couldn't find the famous jTessBoxEditor online (weird, right?).
>>>>>>>
>>>>>>> After that, I created my .tr files:
>>>>>>> tesseract com.test_font.exp0.tif com.test_font.exp0 nobatch box.train
>>>>>>> tesseract com.test_font.exp1.tif com.test_font.exp1 nobatch box.train
>>>>>>>
>>>>>>> And here come the surprises!
>>>>>>> After creating the .tr files, you call unicharset_extractor.
>>>>>>> First question: Why are the glyph metrics all
>>>>>>> 0,255,0,255,0,0,0,0,0,0? That looks wrong according to the documentation:
>>>>>>> https://github.com/tesseract-ocr/tesseract/blob/a3ba11b030345d32829b1e8355afea5419978d82/doc/unicharset.5.asc
>>>>>>> Second question: Should I process one box file, then the other, or
>>>>>>> combine them?
>>>>>>> Option 1: unicharset_extractor com.test_font.exp0.box
>>>>>>> Option 2: unicharset_extractor com.test_font.exp0.box com.test_font.exp1.box
>>>>>>> Third question: Why should I use set_unicharset_properties? It
>>>>>>> doesn't fix the metrics; it only specifies whether the script is Latin
>>>>>>> or Common. Link:
>>>>>>> https://github.com/tesseract-ocr/tesseract/issues/318
>>>>>>>
>>>>>>> With these questions still unanswered, I ran mftraining and
>>>>>>> cntraining (no problems). Finally, I renamed my inttemp, normproto,
>>>>>>> pffmtable, and shapetable files and combined them using
>>>>>>> combine_tessdata com.
>>>>>>>
>>>>>>> Final question: If I name the files com.inttemp1 and com.inttemp2,
>>>>>>> does it work? The same goes for shapetable, normproto, and pffmtable.
>>>>>>>
>>>>>>> I think every new tesseract user asks these questions at some point.
>>>>>>> If any tesseract expert can answer them, it will be a great help for
>>>>>>> the whole community.
>>>>>>> Please find attached the 2 tif files and the generated box files.
>>>>>>>
>>>>>>> --
>>>>>>> You received this message because you are subscribed to the Google
>>>>>>> Groups "tesseract-ocr" group.
>>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>>> send an email to tesser...@googlegroups.com.
>>>>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>>>> To view this discussion on the web visit
>>>>>>> https://groups.google.com/d/msgid/tesseract-ocr/beb558f3-d52c-4eca-a668-501a9804ffb0%40googlegroups.com
>>>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/beb558f3-d52c-4eca-a668-501a9804ffb0%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>>>> .
>>>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>>>
>>>>>>
>>>>>
>>>
>>
>>
>>
>



