Re: [tesseract-ocr] Training tesseract-ocr unicharset_extractor, mftraining, cntraining

2019-04-28 Thread Giriraj Bhojak
Hi Shree,

Does this mean there is a bug in tesseract 4, and should I open an issue on
GitHub for two-column text with the default psm?

Also, could you please expand on what you meant by 'other means of
selecting text regions'? Is there anything in tesseract that I can try to
identify text regions?

Regards,
Giriraj
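
One way to look at the regions that layout analysis finds is tesseract's TSV
output (for example `tesseract sample.tif out tsv`), which lists one row per
page, block, paragraph, line and word with its bounding box. The sketch below
groups the word rows (level 5) by block number. It assumes the standard TSV
column names and a single-page image; `blocks_from_tsv` is a hypothetical
helper name, not part of tesseract.

```python
import csv
import io


def blocks_from_tsv(tsv_text):
    """Group word rows of tesseract TSV output by block number.

    Returns {block_num: {"box": [left, top, right, bottom], "words": [...]}}.
    Assumes a single page, so block numbers are unique.
    """
    blocks = {}
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    for row in reader:
        # Level 5 rows are individual words; other levels are containers.
        if row["level"] != "5":
            continue
        text = (row["text"] or "").strip()
        if not text:
            continue
        left, top = int(row["left"]), int(row["top"])
        right, bottom = left + int(row["width"]), top + int(row["height"])
        block = blocks.setdefault(
            int(row["block_num"]),
            {"box": [left, top, right, bottom], "words": []},
        )
        # Grow the block's bounding box to include this word.
        box = block["box"]
        block["box"] = [
            min(box[0], left),
            min(box[1], top),
            max(box[2], right),
            max(box[3], bottom),
        ]
        block["words"].append(text)
    return blocks
```

If the two columns come back as a single block whose box spans the full page
width, that would suggest layout analysis merged them, which matches the
interleaved output quoted below.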

Re: [tesseract-ocr] Training tesseract-ocr unicharset_extractor, mftraining, cntraining

2019-04-26 Thread Shree Devi Kumar
I did not post the command that I used; it was probably with the default psm
and the code as of April 2017. If you really want to investigate, check out
the commit from the master branch as of that time and test.

In theory tesseract 4 should recognize two columns with the default psm.
But there seem to be some issues with layout analysis.

You could try other means of selecting text regions and using tesseract on
those.
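
A minimal sketch of one such "other means", assuming a clean two-column scan:
count the dark pixels in each pixel column, take the widest empty vertical
gap between the page margins as the gutter, and split the page there. The
function names and the `max_ink` threshold are hypothetical; computing the
per-column counts from the image (for example with Pillow or Leptonica) is
left out.

```python
def widest_gap(ink_per_column, max_ink=0):
    """Return (start, end) of the widest run of near-empty pixel columns.

    ink_per_column[x] is the number of dark pixels in pixel column x.
    Blank page margins are skipped so only interior gaps are considered.
    Returns None if no interior gap exists.
    """
    left = 0
    while left < len(ink_per_column) and ink_per_column[left] <= max_ink:
        left += 1
    right = len(ink_per_column)
    while right > left and ink_per_column[right - 1] <= max_ink:
        right -= 1

    best = None
    run_start = None
    for x in range(left, right):
        if ink_per_column[x] <= max_ink:
            if run_start is None:
                run_start = x
        elif run_start is not None:
            # A run of empty columns just ended at x (exclusive).
            if best is None or x - run_start > best[1] - best[0]:
                best = (run_start, x)
            run_start = None
    return best


def column_boxes(width, height, gap):
    """Split a page at the gap centre into two (left, upper, right, lower) boxes."""
    split = (gap[0] + gap[1]) // 2
    return [(0, 0, split, height), (split, 0, width, height)]
```

The boxes use the (left, upper, right, lower) order that Pillow's
`Image.crop` expects. Each crop could then be saved and passed to tesseract
on its own, for instance with `--psm 6` (a single uniform block of text), and
the two results concatenated in reading order.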


Re: [tesseract-ocr] Training tesseract-ocr unicharset_extractor, mftraining, cntraining

2019-04-26 Thread Giriraj Bhojak
Hi Shree,

I just tried v3.05.02 as well with different modes, and I still couldn't
reproduce the output you posted for the image file.
I am wondering if I am doing anything wrong.
Here is the command I ran with tesseract v3.05.02, changing the psm
mode from 1 to 13:


/usr/local/Cellar/tesseract/3.05.02/bin/tesseract --tessdata-dir \
/usr/local/Cellar/tesseract/3.05.02/share/ "sample.tif" test --psm 3

It still produced the same output as earlier.
Please let me know what I might be doing incorrectly here.
Once again, thank you for your prompt responses.


Regards,
Giriraj.


Re: [tesseract-ocr] Training tesseract-ocr unicharset_extractor, mftraining, cntraining

2019-04-26 Thread Shree Devi Kumar
@zdenko Please check this image (from the first post) with 3.0x and current
4.0x code to see if there is a regression in terms of recognition of 2
columns.

Re: [tesseract-ocr] Training tesseract-ocr unicharset_extractor, mftraining, cntraining

2019-04-26 Thread Giriraj Bhojak
Thank you, I will try it out next.
I wanted to use version 4 of tesseract since it uses an LSTM-based OCR engine.
Higher accuracy is one of the essential requirements for my use case.
Would you know if v4 supports extracting text from an image with a
two-column text layout at all?
Thank you for your quick response, Shree!

Regards,
Giriraj.

Re: [tesseract-ocr] Training tesseract-ocr unicharset_extractor, mftraining, cntraining

2019-04-26 Thread Shree Devi Kumar
April 2017 - It is probably the 3.0x version. Try the 3.05 branch.

https://github.com/tesseract-ocr/tesseract/releases/tag/3.05.01
3.05.01 Release: released by zdenop on Jun 1, 2017 (26 commits to 3.05
since this release)

Re: [tesseract-ocr] Training tesseract-ocr unicharset_extractor, mftraining, cntraining

2019-04-26 Thread Giriraj Bhojak
Hi Shree,

Thank you for the quick response.
I used the trained data downloaded from
https://github.com/tesseract-ocr/tessdata,
https://github.com/tesseract-ocr/tessdata_best and
https://github.com/tesseract-ocr/tessdata_fast.

I ran the following commands for each of these datasets, changing psm from 1
to 13, but the output is more or less like the one I posted. I couldn't get
the output you posted, which has the text in the correct reading order.

tesseract --tessdata-dir tessdata_best-master "sample.tif" sample --psm 1
tesseract --tessdata-dir tessdata_fast-master "sample.tif" sample --psm 1
tesseract --tessdata-dir tessdata-master "sample.tif" sample --psm 1

I am not sure what I am doing wrong here and would appreciate your help with this.

Regards,
Giriraj
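
The sweep described above (three model directories, psm 1 to 13) can be
scripted rather than run by hand. The sketch below only builds the command
lines. It assumes the directory names from the commands above and adds
`-l eng` explicitly; `build_commands` is a hypothetical helper. Unlike the
commands above, which all write to `sample.txt`, each run gets its own
output base so results are not overwritten.

```python
from pathlib import Path


def build_commands(image, model_dirs, psm_modes, lang="eng"):
    """Return one tesseract argument list per (model directory, psm) pair."""
    commands = []
    for model_dir in model_dirs:
        for psm in psm_modes:
            # Distinct output base per run, e.g. "tessdata_best-master_psm3".
            out_base = f"{Path(model_dir).name}_psm{psm}"
            commands.append(
                [
                    "tesseract",
                    "--tessdata-dir", model_dir,
                    image,
                    out_base,
                    "--psm", str(psm),
                    "-l", lang,
                ]
            )
    return commands
```

Each list can be passed to `subprocess.run(cmd, check=True)`; the resulting
text files can then be compared side by side to see whether any mode keeps
the two columns apart.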

On Friday, April 26, 2019 at 3:04:34 AM UTC-4, shree wrote:
>
> Which eng.traineddata did you use?
>
> There are three options
> From tessdata, tessdata_best and tessdata_fast.
>
> On Fri, 26 Apr 2019, 09:19 Giriraj Bhojak wrote:
>
>> Hello Shree,
>>
>> I realize this post is more than two years old now, but would appreciate 
>> any help.
>> I tried your suggestion on the same attached sample using tesseract v4 
>> and I am unable to get the result as you have posted.
>> I have tried all page segmentation modes, but none of them produced the 
>> result you have posted. 
>> Could you please let me know what I might be doing wrong?
>>
>> Here is the version detail for tesseract on my machine:
>>
>> tesseract 4.0.0
>>  leptonica-1.77.0
>>   libgif 5.1.4 : libjpeg 9c : libpng 1.6.36 : libtiff 4.0.10 : zlib 
>> 1.2.11 : libwebp 1.0.1 : libopenjp2 2.3.0
>>  Found AVX2
>>  Found AVX
>>  Found SSE
>>
>> Here is the output I get for most of the psm modes:
>>
>>
>> 8633 0410 NO RP 1107122016 NYNN 07 01 0001 Page 20f3
>>
>> Did you know? Did you know?
>>
>> Your Comcast Business Internet Never miss a payment with text alerts.
>> service gives you access to millions Receive text message reminders when 
>> your
>> of WiFi hotspots with the fastest WiFi bill is ready to pay or past due. 
>> Sign up at
>> and even more coverage. Find out business.comcast.com/myaccount.
>>
>> more at business.comcast.conm/wifi.
>>
>> Your bill is ready
>>
>>
>>
>> Need help? We’re here for you.
>>
>>  
>>
>> > Visit business.comcast.com/help Please notify us immediately with any
>> Call 1-800-391-3000 questions regarding charges billed to your
>> aa account. Comcast will issue a credit or
>> Billing support refund for any verified billing error which is
>> Open 6 am-9 pm MTN, Mon through Fri brought to our attention within sixty 
>> (60) days
>> and 7 am-8 pm Sat of the bill.
>>
>> Technical support
>> Open 24 hours, 7 days a week
>>
>> TT
>>
>> Automatic payment If you’re moving, give us as much
>> Sign up at business.comcast.com/myaccount advanced notice as possible so 
>> we
>>
>> Se Online can help make a smooth transition.
>> Visit business.comcast.com/myaccount
>>
>> a By phone
>> Call 1-800-391-3000
>>
>> Call 1-800-391-3000
>>
>> IME
>>
>>  
>>
>>  
>>
>> Regards,
>> Giriraj.
>>
>> On Friday, April 21, 2017 at 4:55:03 AM UTC-4, shree wrote:
>>>
>>> If you want to OCR an invoice like the sample you posted, just use the 
>>> eng.traineddata and OCR the page. You do not need to do any training.
>>>
>>> Here is the output I get 
>>>
>>>
>>>
>>> 8633 0410 NO RP 11 07122015 NYNN 01 01 0001 Page 2 Of 3
>>>
>>>
>>> Did you know?
>>>
>>>
>>> Your Comcast Business Internet
>>>
>>> service gives you access to millions
>>>
>>> of WiFi hotspots with the fastest WiFi
>>>
>>> and even more coverage. Find out
>>>
>>> more at businesscomcast.com/wifi.
>>>
>>>
>>>
>>> Need help? We’re here for you.
>>>
>>>
>>> 9 Visit business.comcast.com/help
>>>
>>> Call 1-800—391 -3000
>>>
>>> A
>>>
>>>
>>> Billing support
>>>
>>> Open 6 am-9 pm MTN, Mon through Fri
>>>
>>> and 7 am—8 pm Sat
>>>
>>>
>>> Technical support
>>>
>>> Open 24 hours, 7 days a week
>>>
>>>
>>>
>>> Did you know?
>>>
>>>
>>> Never miss a payment with text alerts.
>>>
>>> Receive text message reminders when your
>>>
>>> bill is ready to pay or past due. Sign up at
>>>
>>> business.comcast.com/myaccount.
>>>
>>>
>>>
>>> Your bill is ready
>>>
>>>
>>>
>>>
>>> Please notify us immediately with any
>>>
>>> questions regarding charges billed to your
>>>
>>> account. Comcast will issue a credit or
>>>
>>> refund for any verified billing error which is
>>>
>>> brought to our attention within sixty (60) days
>>>
>>> of the bill.
>>>
>>>
>>> ll
>>>
>>>
>>> Additional payment options Moving? Let us help.
>>>
>>>
>>> Automatic payment
>>>
>>> Sign up at business.comcast.com/myaccount
>>>
>>>
>>> a Oniine
>>>
>>>
>>> Visit business.comcast.com/myaccount
>>>
>>>
>>> a By phone
>>>
>>> Call 1-800-391 -3000
>>>
>>>
>>> if you're moving, give us as much
>>>
>>> advanced notice as possible so we
>>>
>>> can help make a smooth transition.
>>>
>>>
>>> Call 1 -800-391 -3000
>>>
>>>

Re: [tesseract-ocr] Training tesseract-ocr unicharset_extractor, mftraining, cntraining

2019-04-26 Thread Shree Devi Kumar
Which eng.traineddata did you use?

There are three options: tessdata, tessdata_best and tessdata_fast.
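
To compare the three model sets on the same image, something like the following could generate the commands to run. This is only a sketch: the directory paths are hypothetical and assume that the eng.traineddata from each repository has been downloaded into its own folder. The options used (--tessdata-dir, -l, --psm) are the regular tesseract 4 command-line options.

```python
import shlex

# Hypothetical local directories, one per model set; adjust to your install.
TESSDATA_DIRS = {
    "tessdata": "/opt/tessdata",
    "tessdata_best": "/opt/tessdata_best",
    "tessdata_fast": "/opt/tessdata_fast",
}


def build_commands(image, psm_modes=(1, 3, 4, 6)):
    """Return one tesseract argv list per (model set, psm) combination."""
    commands = []
    for name, path in TESSDATA_DIRS.items():
        for psm in psm_modes:
            # Separate output base per run so results can be diffed afterwards.
            out_base = f"out_{name}_psm{psm}"
            commands.append([
                "tesseract", image, out_base,
                "--tessdata-dir", path,
                "-l", "eng",
                "--psm", str(psm),
            ])
    return commands


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for argv in build_commands("sample.tif"):
        print(shlex.join(argv))
```

Each run writes to its own output base, so the resulting text files can be compared side by side to see whether the model set or the psm changes how the two columns are read.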

On Fri, 26 Apr 2019, 09:19 Giriraj Bhojak,  wrote:

> Hello Shree,
>
> I realize this post is more than two years old now, but would appreciate
> any help.
> I tried your suggestion on the same attached sample using tesseract v4 and
> I am unable to get the result as you have posted.
> I have tried all page segmentation modes, but none of them produced the
> result you have posted.
> Could you please let me know what I might be doing wrong?
>
> Here are the version details for tesseract on my machine:
>
> tesseract 4.0.0
>  leptonica-1.77.0
>   libgif 5.1.4 : libjpeg 9c : libpng 1.6.36 : libtiff 4.0.10 : zlib 1.2.11
> : libwebp 1.0.1 : libopenjp2 2.3.0
>  Found AVX2
>  Found AVX
>  Found SSE
>
> Here is the output I get for most of the psm modes:
>
>
> 8633 0410 NO RP 1107122016 NYNN 07 01 0001 Page 20f3
>
> Did you know? Did you know?
>
> Your Comcast Business Internet Never miss a payment with text alerts.
> service gives you access to millions Receive text message reminders when
> your
> of WiFi hotspots with the fastest WiFi bill is ready to pay or past due.
> Sign up at
> and even more coverage. Find out business.comcast.com/myaccount.
>
> more at business.comcast.conm/wifi.
>
> Your bill is ready
>
>
>
> Need help? We’re here for you.
>
>
>
> > Visit business.comcast.com/help Please notify us immediately with any
> Call 1-800-391-3000 questions regarding charges billed to your
> aa account. Comcast will issue a credit or
> Billing support refund for any verified billing error which is
> Open 6 am-9 pm MTN, Mon through Fri brought to our attention within sixty
> (60) days
> and 7 am-8 pm Sat of the bill.
>
> Technical support
> Open 24 hours, 7 days a week
>
> TT
>
> Automatic payment If you’re moving, give us as much
> Sign up at business.comcast.com/myaccount advanced notice as possible so
> we
>
> Se Online can help make a smooth transition.
> Visit business.comcast.com/myaccount
>
> a By phone
> Call 1-800-391-3000
>
> Call 1-800-391-3000
>
> IME
>
>
>
>
>
> Regards,
> Giriraj.
>
> On Friday, April 21, 2017 at 4:55:03 AM UTC-4, shree wrote:
>>
>> If you want to OCR an invoice like the sample you posted, just use the
>> eng.traineddata and OCR the page. You do not need to do any training.
>>
>> Here is the output I get
>>
>>
>>
>> 8633 0410 NO RP 11 07122015 NYNN 01 01 0001 Page 2 Of 3
>>
>>
>> Did you know?
>>
>>
>> Your Comcast Business Internet
>>
>> service gives you access to millions
>>
>> of WiFi hotspots with the fastest WiFi
>>
>> and even more coverage. Find out
>>
>> more at businesscomcast.com/wifi.
>>
>>
>>
>> Need help? We’re here for you.
>>
>>
>> 9 Visit business.comcast.com/help
>>
>> Call 1-800—391 -3000
>>
>> A
>>
>>
>> Billing support
>>
>> Open 6 am-9 pm MTN, Mon through Fri
>>
>> and 7 am—8 pm Sat
>>
>>
>> Technical support
>>
>> Open 24 hours, 7 days a week
>>
>>
>>
>> Did you know?
>>
>>
>> Never miss a payment with text alerts.
>>
>> Receive text message reminders when your
>>
>> bill is ready to pay or past due. Sign up at
>>
>> business.comcast.com/myaccount.
>>
>>
>>
>> Your bill is ready
>>
>>
>>
>>
>> Please notify us immediately with any
>>
>> questions regarding charges billed to your
>>
>> account. Comcast will issue a credit or
>>
>> refund for any verified billing error which is
>>
>> brought to our attention within sixty (60) days
>>
>> of the bill.
>>
>>
>> ll
>>
>>
>> Additional payment options Moving? Let us help.
>>
>>
>> Automatic payment
>>
>> Sign up at business.comcast.com/myaccount
>>
>>
>> a Oniine
>>
>>
>> Visit business.comcast.com/myaccount
>>
>>
>> a By phone
>>
>> Call 1-800-391 -3000
>>
>>
>> if you're moving, give us as much
>>
>> advanced notice as possible so we
>>
>> can help make a smooth transition.
>>
>>
>> Call 1 -800-391 -3000
>>
>>
>> |||ll
>>
>>
>>
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Fri, Apr 21, 2017 at 11:34 AM, Alain Ghawi  wrote:
>>
>>> Hello all,
>>>
>>> I am surprised by how many people tell me that tesseract is the best
>>> open-source OCR tool but yet there is no video explaining step-by-step the
>>> problems that you can encounter, or a good explanation and documentation
>>> for OCR.
>>>
>>> Well even though, everyone loves challenges! So here's the challenge I
>>> faced. I brought many pdf files that are invoices and I want to train
>>> tesseract to be able to ocr them as scanned images.
>>> So first of all, I transformed these pdf files into tif files
>>> using: magick -density 300 -depth 4   2151.pdf -background white -fill
>>> white -alpha Off  2151%d.tif
>>> This is ImageMagick. Nothing important here other than we have a 300 dpi
>>> image with an alpha channel off.
>>>
>>> You must rename them so : rename .tif files to:
>>> 

Re: [tesseract-ocr] Training tesseract-ocr unicharset_extractor, mftraining, cntraining

2017-04-21 Thread ShreeDevi Kumar
If you want to OCR an invoice like the sample you posted, just use the
eng.traineddata and OCR the page. You do not need to do any training.

Here is the output I get



8633 0410 NO RP 11 07122015 NYNN 01 01 0001 Page 2 Of 3


Did you know?


Your Comcast Business Internet

service gives you access to millions

of WiFi hotspots with the fastest WiFi

and even more coverage. Find out

more at businesscomcast.com/wifi.



Need help? We’re here for you.


9 Visit business.comcast.com/help

Call 1-800—391 -3000

A


Billing support

Open 6 am-9 pm MTN, Mon through Fri

and 7 am—8 pm Sat


Technical support

Open 24 hours, 7 days a week



Did you know?


Never miss a payment with text alerts.

Receive text message reminders when your

bill is ready to pay or past due. Sign up at

business.comcast.com/myaccount.



Your bill is ready




Please notify us immediately with any

questions regarding charges billed to your

account. Comcast will issue a credit or

refund for any verified billing error which is

brought to our attention within sixty (60) days

of the bill.


ll


Additional payment options Moving? Let us help.


Automatic payment

Sign up at business.comcast.com/myaccount


a Oniine


Visit business.comcast.com/myaccount


a By phone

Call 1-800-391 -3000


if you're moving, give us as much

advanced notice as possible so we

can help make a smooth transition.


Call 1 -800-391 -3000


|||ll




ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Fri, Apr 21, 2017 at 11:34 AM, Alain Ghawi  wrote:

> Hello all,
>
> I am surprised by how many people tell me that tesseract is the best
> open-source OCR tool, and yet there is no video explaining step by step the
> problems that you can encounter, nor a good explanation and documentation
> for OCR.
>
> Even so, everyone loves a challenge! So here's the challenge I
> faced. I have many pdf files that are invoices and I want to train
> tesseract to be able to OCR them as scanned images.
> So first of all, I converted these pdf files into tif files using
> ImageMagick:
> magick -density 300 -depth 4 2151.pdf -background white -fill white -alpha Off 2151%d.tif
> Nothing important here, other than that we get a 300 dpi image with the
> alpha channel off.
>
> You must then rename the .tif files to [lang].[name_font].exp0.tif
> (com.test_font.exp0.tif in my example).
>
> Great! After this step you must create your box file right? So I simply
> called:
> tesseract com.test_font.exp0.tif com.test_font.exp0 batch.nochop makebox
> tesseract com.test_font.exp0.tif com.test_font.exp1 batch.nochop makebox
>
> Then I fixed my box files with CowBoxEditor, which did the job, as I
> couldn't find the famous jTessBoxEditor online (weird, right?).
>
> After that, I created my .tr files:
> tesseract com.test_font.exp0.tif com.test_font.exp0 nobatch box.train
> tesseract com.test_font.exp1.tif com.test_font.exp1 nobatch box.train
>
> And here come the surprises!!!
> After you have your .tr files, you call unicharset_extractor.
> First question: Why are the glyph metrics all 0,255,0,255,0,0,0,0,0,0?
> This is wrong according to the documentation:
> https://github.com/tesseract-ocr/tesseract/blob/a3ba11b030345d32829b1e8355afea5419978d82/doc/unicharset.5.asc
> Second question: Should I pass one box file and then the other, or pass
> them together? Option 1: unicharset_extractor com.test_font.exp0.box or
> Option 2: unicharset_extractor com.test_font.exp0.box com.test_font.exp1.box
> Third question: Why should I use set_unicharset_properties? It doesn't
> fix the metrics; it only specifies whether the script is Latin or Common!
> Link: https://github.com/tesseract-ocr/tesseract/issues/318
>
> After all these unanswered questions, I used mftraining and cntraining (no
> problems). Finally, I renamed my inttemp, normproto, pffmtable, shapetable
>  and I combined them using combine_tessdata com.
>
> Final question: If I name the files com.inttemp1 and com.inttemp2, does it
> work? Same for shapetable, normproto and pffmtable.
>
> I think these questions are asked more than once by all new users to
> tesseract. Please if any expert in tesseract can answer these questions it
> will be a great help for all the community.
> Kindly find the attached 2 tif files and the boxes generated.
>
>
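
For reference, the naming convention and the two tesseract invocations described in the quoted message can be sketched as below. The helper names are made up for illustration; the command arguments (batch.nochop makebox, nobatch box.train) are copied from the message. Note that this is the legacy training workflow, not the LSTM training used by tesseract 4.

```python
def training_base(lang, font, index):
    """Build the [lang].[fontname].expN base name described in the message."""
    return f"{lang}.{font}.exp{index}"


def training_commands(lang, font, count):
    """Return (makebox, box.train) argv pairs for pages 0..count-1."""
    steps = []
    for i in range(count):
        base = training_base(lang, font, i)
        tif = base + ".tif"
        # Step 1: generate a box file for the page (to be corrected by hand).
        makebox = ["tesseract", tif, base, "batch.nochop", "makebox"]
        # Step 2: after fixing the box file, produce the .tr feature file.
        boxtrain = ["tesseract", tif, base, "nobatch", "box.train"]
        steps.append((makebox, boxtrain))
    return steps


if __name__ == "__main__":
    for makebox, boxtrain in training_commands("com", "test_font", 2):
        print(" ".join(makebox))
        print(" ".join(boxtrain))
```

Generating the names in one place keeps each page's .tif, .box and .tr files on the same expN index, which is easy to get wrong when the commands are typed by hand.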