Re: [tesseract-ocr] Training stops before specified iterations

2019-07-18 Thread Shree Devi Kumar
The target character error rate may have been achieved.
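
A minimal sketch of why this can happen, assuming training continues only
while the best error rate is still above its target and the iteration cap
has not been reached. The functions below are a hypothetical Python model of
that stopping rule, not Tesseract's own code. The parameter names mirror the
target error rate and max iterations settings, treating a cap of 0 as "no
cap" is an assumption, and the numbers in the usage note are illustrative.

```python
def keep_training(best_error_rate, iteration, target_error_rate, max_iterations):
    """Hypothetical model of the stopping rule (an assumption, not Tesseract source).

    Training continues only while BOTH conditions hold:
      * the best error rate seen so far is still above the target, and
      * the iteration cap has not been reached (0 is treated as "no cap").
    """
    above_target = best_error_rate > target_error_rate
    under_cap = max_iterations == 0 or iteration < max_iterations
    return above_target and under_cap


def stop_reason(best_error_rate, iteration, target_error_rate, max_iterations):
    """Return a short label explaining why training would stop, or None."""
    if best_error_rate <= target_error_rate:
        return "target error rate reached"
    if max_iterations != 0 and iteration >= max_iterations:
        return "max iterations reached"
    return None
```

With these illustrative values, stop_reason(0.009, 4600, 0.01, 10000) returns
"target error rate reached": the run ends at iteration 4600 even though the
cap is higher.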

On Fri, 19 Jul 2019, 11:14 Pooja Kamra,  wrote:

> In the training command, max iterations given are 1. But training stops
> after 4600 iterations.
> What could be the reason for this?
>
> Regards,
> Pooja
>
> --
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to tesseract-ocr+unsubscr...@googlegroups.com.
> To post to this group, send email to tesseract-ocr@googlegroups.com.
> Visit this group at https://groups.google.com/group/tesseract-ocr.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/tesseract-ocr/2a751eec-45b1-4f85-b9cf-cebcdcb73b73%40googlegroups.com
> 
> .
> For more options, visit https://groups.google.com/d/optout.
>


Re: [tesseract-ocr] Trained data for E13B font

2019-07-18 Thread Shree Devi Kumar
Also https://github.com/tesseract-ocr/tesseract/pull/2576


[tesseract-ocr] Training stops before specified iterations

2019-07-18 Thread Pooja Kamra
In the training command, max iterations given are 1. But training stops 
after 4600 iterations.
What could be the reason for this?

Regards,
Pooja


Re: [tesseract-ocr] Trained data for E13B font

2019-07-18 Thread Shree Devi Kumar
Please check out the recent commits in master branch

https://github.com/tesseract-ocr/tesseract/pull/2554


Re: [tesseract-ocr] Trained data for E13B font

2019-07-18 Thread ElGato ElMago
Hi,

Let's call them phantom characters then.

Was psm 7 the solution for issue 1778?  None of the psm options solved my 
problem, though I see different output.

I use tesseract 5.0-alpha mostly, but 4.1 showed the same results anyway.  
How did you get a bounding box for each character?  Alto and lstmbox 
only show a bbox for a group of characters.

ElMagoElGato
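
To make the bbox observation concrete, here is a small sketch that parses
box-format lines (symbol, left, bottom, right, top, page) and groups
consecutive symbols that report the same bounding box. The helper names are
hypothetical, and the sample text in the usage note is made up for
illustration; it is not real Tesseract output.

```python
from itertools import groupby


def parse_box_lines(text):
    """Parse box-format lines: '<symbol> <left> <bottom> <right> <top> <page>'.

    rsplit is used so that a symbol which is itself a space stays intact.
    """
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip empty lines
        symbol, left, bottom, right, top, page = line.rsplit(" ", 5)
        box = (int(left), int(bottom), int(right), int(top))
        entries.append((symbol, box, int(page)))
    return entries


def group_by_box(entries):
    """Join consecutive symbols that share one bounding box."""
    return [
        ("".join(symbol for symbol, _, _ in run), box)
        for box, run in groupby(entries, key=lambda entry: entry[1])
    ]
```

For an input where "1", "2" and "3" all carry the box 10 20 200 60 and "<"
carries 210 20 240 60, group_by_box(parse_box_lines(text)) yields one entry
for "123" and one for "<", which is the "bbox for a group of characters"
situation described above.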

On Wednesday, July 17, 2019 at 18:58:31 UTC+9, Lorenzo Blz wrote:

> Phantom characters here for me too:
>
> https://github.com/tesseract-ocr/tesseract/issues/1778
>
> Are you using 4.1? Bounding boxes were fixed in 4.1; maybe this was also 
> improved.
>
> I wrote some code that uses the symbol iterator to discard symbols that 
> are clearly duplicated: too small, overlapping, etc. But it was not easy 
> to make it work decently, and it is not 100% reliable: it produces both 
> false negatives and false positives. I cannot share the code, and it is 
> quite ugly anyway.
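
A rough sketch of that kind of filter, assuming each recognised symbol is
available as a character plus a (left, top, right, bottom) box, for example
from a symbol-level result iterator. The function names and thresholds are
hypothetical and would need tuning on real data; this is not Lorenzo's code.

```python
def overlap_ratio(a, b):
    """Horizontal overlap of boxes a and b as a fraction of the narrower box."""
    shared = min(a[2], b[2]) - max(a[0], b[0])
    narrowest = min(a[2] - a[0], b[2] - b[0])
    if shared <= 0 or narrowest <= 0:
        return 0.0
    return shared / narrowest


def drop_phantoms(symbols, min_width=3, max_overlap=0.8):
    """Discard symbols that are too thin or mostly overlap the previous kept one.

    symbols: list of (char, (left, top, right, bottom)) in reading order.
    """
    kept = []
    for char, box in symbols:
        if box[2] - box[0] < min_width:
            continue  # too thin to be a real glyph
        if kept and overlap_ratio(kept[-1][1], box) > max_overlap:
            continue  # sits on top of the previous symbol: likely a duplicate
        kept.append((char, box))
    return kept
```

Note that this keeps whichever of two overlapping symbols comes first, so it
removes the extra character but cannot tell which of the pair was correct.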
>
> Here is another MRZ model with training data:
>
> https://github.com/DoubangoTelecom/tesseractMRZ
>
>
>
>
> Lorenzo
>
>
> On Wed, 17 Jul 2019 at 11:26, Claudiu wrote:
>
>> I’m getting the “phantom character” issue as well using the OCRB that 
>> Shree trained on MRZ lines. For example, for a 0 it will sometimes add 
>> both a 0 and an O to the output, thus outputting 45 characters in total 
>> instead of 44. I haven’t looked at the bounding box output yet, but I 
>> suspect a phantom thin character is added somewhere that I can discard, 
>> or maybe two chars will have the same bounding box. If anyone else has 
>> fixed this issue further up (e.g. so the output doesn’t contain the 
>> phantom characters in the first place), I’d be interested. 
>>
>> On Wed, Jul 17, 2019 at 10:01 AM ElGato ElMago wrote:
>>
>>> Hi,
>>>
>>> I'll go back to more training later.  Before doing so, I'd like to 
>>> investigate results a little bit.  The hocr and lstmbox options give some 
>>> details of positions of characters.  The results show positions that 
>>> perfectly correspond to letters in the image.  But the text output contains 
>>> a character that obviously does not exist.
>>>
>>> Then I found a config file 'lstmdebug' that generates far more 
>>> information.  I hope it explains what happened with each character.  I'm 
>>> yet to read the debug output but I'd appreciate it if someone could tell me 
>>> how to read it because it's really complex.
>>>
>>> Regards,
>>> ElMagoElGato
>>>
>>> On Friday, June 14, 2019 at 19:58:49 UTC+9, shree wrote:
>>>
>>>> See https://github.com/Shreeshrii/tessdata_MICR
>>>>
>>>> I have uploaded my files there. 
>>>>
>>>> https://github.com/Shreeshrii/tessdata_MICR/blob/master/MICR.sh
>>>> is the bash script that runs the training.
>>>>
>>>> You can modify as needed. Please note this is for legacy/base tesseract 
>>>> --oem 0.
>>>>
>>>> On Fri, Jun 14, 2019 at 1:26 PM ElGato ElMago wrote:
>>>>
>>>>> Thanks a lot, shree.  It seems you know everything.
>>>>>
>>>>> I tried the MICR0.traineddata and the first two mcr.traineddata.  The 
>>>>> last one was blocked by the browser.  Each of the traineddata had mixed 
>>>>> results.  All of them get the symbols fairly well but insert spaces 
>>>>> randomly and read some numbers wrong.
>>>>>
>>>>> MICR0 seems the best among them.  Did you suggest that you'd be able 
>>>>> to update it?  It very often outputs a triple D where there's only one, 
>>>>> and so on.
>>>>>
>>>>> Also, I tried to fine-tune from MICR0 but found that I need to 
>>>>> change language-specific.sh.  It specifies some parameters for each 
>>>>> language.  Do you have any guidance for it?
>>>>>
>>>>> On Friday, June 14, 2019 at 1:48:40 UTC+9, shree wrote:
>>>>>>
>>>>>> see 
>>>>>> http://www.devscope.net/Content/ocrchecks.aspx 
>>>>>> https://github.com/BigPino67/Tesseract-MICR-OCR
>>>>>> https://groups.google.com/d/msg/tesseract-ocr/obWI4cz8rXg/6l82hEySgOgJ
>>>>>>
>>>>>> On Mon, Jun 10, 2019 at 11:21 AM ElGato ElMago wrote:
>>>>>>
>>>>>>> It would be nice if there were traineddata out there, but I didn't 
>>>>>>> find any.  I see free fonts and commercial OCR software but no 
>>>>>>> traineddata.  The tessdata repository obviously doesn't have one, 
>>>>>>> either.
>>>>>>>
>>>>>>> On Saturday, June 8, 2019 at 1:52:10 UTC+9, shree wrote:

>>>>>>>> Please also search for existing MICR traineddata files.
>>>>>>>>
>>>>>>>> On Thu, Jun 6, 2019 at 1:09 PM ElGato ElMago wrote:
>>>>>>>>
>>>>>>>>> So I did several tests from scratch.  In the last attempt, I made 
>>>>>>>>> a training text with 4,000 lines in the following format,
>>>>>>>>>
>>>>>>>>> 110004310510<   <02 :4002=0181:801= 0008752 <00039 ;001000;
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> and combined it with eng.digits.training_text in which symbols are 
>>>>>>>>> converted to E13B symbols.  This makes about 12,000 lines of training 
>>>>>>>>> text.  It's amazing that this thing generates a good reader out of 
>>>>>>>>> nowhere.  But then it is not very good.  For example:
>>>>>>>>>
>>>>>>>>> <01 :1901=1386:021= 001<10001< 

[tesseract-ocr] Re: VietOCR 5.0 Java & .NET Releases

2019-07-18 Thread Quan Nguyen
VietOCR v5.5.0 & VietOCR.NET v5.5.0 Releases

A Java/.NET WPF GUI frontend for the Tesseract OCR engine. The releases include 
the following improvements:

- Upgrade to Tesseract 4.1.0

http://vietocr.sf.net


Re: [tesseract-ocr] tesseract produces one time bad one time good results

2019-07-18 Thread Claudiu
Can someone explain what the lstm_use_matrix option does?

On Thu, Jul 18, 2019 at 11:36 AM Shree Devi Kumar 
wrote:

> Binarize and invert the images to get black text on white. I tried with
> the latest code from the master branch on GitHub, and it gives correct
> results.
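
For readers who want to see what "binarize and invert" amounts to, here is a
dependency-free sketch on a grayscale image held as a list of rows of 0..255
values. The fixed threshold and the majority-colour test used to decide when
to invert are simplifying assumptions, and the function name is hypothetical.
In practice an image library and an adaptive threshold would be the usual
choice.

```python
def binarize_black_on_white(gray, threshold=128):
    """Return rows of 0 (black) / 255 (white) with dark text on a light background.

    gray: list of rows, each a list of ints in 0..255.
    Assumption: the background covers more than half of the image.
    """
    binary = [[255 if pixel >= threshold else 0 for pixel in row] for row in gray]
    total = sum(len(row) for row in binary)
    white = sum(1 for row in binary for pixel in row if pixel == 255)
    # If most pixels came out black, the background is dark: invert the image.
    if white * 2 < total:
        binary = [[255 - pixel for pixel in row] for row in binary]
    return binary
```

An image that is already black on white passes through unchanged apart from
the thresholding; a light-on-dark image is flipped.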
>
> tesseract 2-bw.png stdout --psm 6 --dpi 300 --tessdata-dir ~/tessdata
> --oem 1 --user-patterns ./timestamp.patterns.txt -c lstm_use_matrix=1 -c
> tessedit_char_whitelist='/0123456789.:TZ'
>
> 2019/07/04T08:45:16.250236Z
>
> tesseract 2-bw.png stdout --psm 6 --dpi 300 --tessdata-dir ~/tessdata_best
> --oem 1 --user-patterns ./timestamp.patterns.txt -c lstm_use_matrix=1 -c
> tessedit_char_whitelist='/0123456789.:TZ'
>
> 2019/07/04T08:45:16.250236Z
>
> tesseract 2-bw.png stdout --psm 6 --dpi 300 --tessdata-dir ~/tessdata_fast
> --oem 1 --user-patterns ./timestamp.patterns.txt -c lstm_use_matrix=1 -c
> tessedit_char_whitelist='/0123456789.:TZ'
>
> 2019/07/04T08:45:16.250236Z
>
> On Wed, Jul 17, 2019 at 6:55 PM Ste Die  wrote:
>
>> Hi
>>
>> I have a set of images with timestamps. On fc29 I use:
>> tesseract 3.05.02
>>  leptonica-1.78.0
>>
>> In many cases the timestamp is read properly, but in some cases it
>> produces really bad characters.
>>
>> My latest settings were these, but lots of other settings didn't change
>> much either.
>> tesseract 1.png stdout -c tessedit_char_whitelist='TZ:/0123456789.' --psm
>> 13 --oem 0
>> Warning. Invalid resolution 0 dpi. Using 70 instead.
>> 2019/07/04T0835:16.222753Z
>>
>> tesseract 2.png stdout -c tessedit_char_whitelist='TZ:/0123456789.' --psm
>> 13 --oem 0
>> Warning. Invalid resolution 0 dpi. Using 70 instead.
>> 2 119/07/ 4 383458 16.2365722
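
One way to catch bad reads like this automatically is to check each OCR line
against the expected timestamp shape. The pattern below is inferred from the
sample outputs in this thread (YYYY/MM/DDThh:mm:ss.ffffffZ); it is an
assumption about the intended format, and the function name is hypothetical.

```python
import re

# Assumed shape: 4-digit year, 2-digit month/day, 'T', hh:mm:ss, 6 fractional digits, 'Z'.
TIMESTAMP_RE = re.compile(r"\d{4}/\d{2}/\d{2}T\d{2}:\d{2}:\d{2}\.\d{6}Z")


def looks_like_timestamp(text):
    """Return True if the whole (stripped) string matches the assumed format."""
    return TIMESTAMP_RE.fullmatch(text.strip()) is not None
```

This accepts 2019/07/04T08:45:16.250236Z and rejects the garbled result for
2.png. It also rejects 2019/07/04T0835:16.222753Z as printed for 1.png,
because the colon between hours and minutes is missing there.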
>>
>> Who can help me?
>>
>> Thanks a lot.
>> Stefan
>>
>
>
> --
>
> 
> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>