Re: [tesseract-ocr] Tesseract config for simple single words text and questions about learning

2018-04-30 Thread Lorenzo Bolzani
Hello ShreeDevi,
thanks for your answer. I tried the 4.0 version but I get different
kinds of errors. And, as far as I know, the whitelist is not yet
supported in the 4.0 version, so I decided to go with 3.05 because I
think this feature can be important in my case.

I updated and built the 4.0 version just now, and this is what I get
(using the command line you provided) on some of the problematic
samples:

tesseract 4.0.0-beta.1-163-gd3f6
 leptonica-1.75.3
  libjpeg 8d (libjpeg-turbo 1.4.2) : libpng 1.6.20 : zlib 1.2.8

 Found AVX2
 Found AVX
 Found SSE

Please note how a small change in the binarization threshold
parameters influences the result on numbers-test2.png and
d.numbers.png. I realize the error on numbers-test2.png is one you do
not get.
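
For reference, here is a minimal Pillow sketch of the kind of
thresholding I mean (not my actual pipeline; the two threshold values
are just examples):

from PIL import Image

def binarize(path, threshold):
    # Global threshold: pixels brighter than `threshold` become white,
    # everything else black.
    gray = Image.open(path).convert("L")
    return gray.point(lambda px: 255 if px > threshold else 0)

# A small change here is enough to flip a digit in the OCR result:
binarize("numbers-test2.png", 120).save("numbers-test2.t120.png")
binarize("numbers-test2.png", 140).save("numbers-test2.t140.png")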

Using the 3.05 version with learning: after two "warmup" epochs on 50
samples (with 2 or 3 errors each), I get 100% accuracy for three
epochs on those same samples. After that, I sometimes still get one
wrong sample. The result is very good, but the fact that it changes
"randomly" depending on the provided data is not something I'm very
comfortable with in a production environment.
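
The measurement loop is roughly the following (a sketch: the sample
dictionary is a made-up stand-in for my 50 labeled fragments).
Reusing one API instance across epochs is what lets the adaptive
classifier keep learning between them:

from tesserocr import PyTessBaseAPI, PSM

# Hypothetical ground truth; the real set has about 50 entries.
samples = {"sample01.png": "5748788", "sample02.png": "AT"}

with PyTessBaseAPI(lang="eng+ita", psm=PSM.SINGLE_LINE) as api:
    for epoch in range(5):
        correct = 0
        for path, expected in samples.items():
            api.SetImageFile(path)
            correct += (api.GetUTF8Text().strip() == expected)
        print("epoch %d: accuracy %.2f" % (epoch, correct / len(samples)))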

It also makes it very difficult to do fine-tuning and to evaluate the
impact of changes on the final performance.

This is why I'm considering disabling the learning, even though I'm
happy with the improvements I get and would like to reproduce them in
a controlled way. Maybe I can print all the parameters after each
epoch and see if there are changes, but I suspect the adaptation is
internal and does not affect user-provided values directly.
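
Something along these lines is what I have in mind. This is only a
sketch: I'm assuming tesserocr exposes the underlying
TessBaseAPI::GetVariableAsString call, and the watched parameter
names are just candidates:

from tesserocr import PyTessBaseAPI

watched = ["classify_enable_learning", "classify_enable_adaptive_matcher",
           "matcher_good_threshold"]

with PyTessBaseAPI(lang="eng+ita") as api:
    def snapshot():
        return {name: api.GetVariableAsString(name) for name in watched}

    before = snapshot()
    api.SetImageFile("sample01.png")  # hypothetical warmup sample
    api.GetUTF8Text()                 # one recognition pass
    after = snapshot()

    # Print only the watched parameters whose value changed, if any.
    print({k: (before[k], after[k]) for k in watched if before[k] != after[k]})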


Thanks

Lorenzo



Re: [tesseract-ocr] Tesseract config for simple single words text and questions about learning

2018-04-29 Thread ShreeDevi Kumar
Try tesseract-4.0.0-beta.

I get correct results with it from the command line:


# tesseract numbers-test.png numbers-test --tessdata-dir ./tessdata_fast -l eng --oem 1 --psm 6
Tesseract Open Source OCR Engine v4.0.0-beta.1-200-g37d20 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.

# tesseract numbers-test2.png numbers-test2 --tessdata-dir ./tessdata_fast -l eng --oem 1 --psm 6
Tesseract Open Source OCR Engine v4.0.0-beta.1-200-g37d20 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.

# tesseract letters-test.png letters-test --tessdata-dir ./tessdata_fast -l eng --oem 1 --psm 6
Tesseract Open Source OCR Engine v4.0.0-beta.1-200-g37d20 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
#




ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com


[tesseract-ocr] Tesseract config for simple single words text and questions about learning

2018-04-28 Thread Lorenzo Blz

Hi, I'm using tesseract to recognize small fragments of text like this 
(actual images I'm using):

[sample fragment images: 7-digit numbers and 2-letter uppercase codes]

Numbers are fixed length (7 digits) and letters are always 2
uppercase characters. I'm using a whitelist (a different one depending
on whether the fragment is text or digits; I know this in advance),
and it works reasonably well. The size of these fragments is fixed: I
rescale them to the same height (54 pixels, though I could change it
or add some borders). They are extracted from smartphone pictures, so
the original resolution varies a lot.

I'm using lang "eng+ita" because this way I get better results. I'm
also using user-patterns, but they are not helping much. I'm using the
API through the tesserocr Python bindings.
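
Stripped down, my setup looks roughly like this (a simplified sketch;
the file names are placeholders for the attached fragments):

from tesserocr import PyTessBaseAPI, PSM

with PyTessBaseAPI(lang="eng+ita", psm=PSM.SINGLE_LINE) as api:
    # Digit fragments: restrict the character set to digits.
    api.SetVariable("tessedit_char_whitelist", "0123456789")
    api.SetImageFile("numbers-test.png")
    print(api.GetUTF8Text().strip(), api.AllWordConfidences())

    # Letter fragments: always 2 uppercase characters.
    api.SetVariable("tessedit_char_whitelist", "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
    api.SetImageFile("letters-test.png")
    print(api.GetUTF8Text().strip(), api.AllWordConfidences())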

I think there are many parameters I could fine-tune. I tried a few
(load_system_dawg, load_freq_dawg, textord_min_linesize) but none of
them improved the results (a very small textord_min_linesize=0.2 made
them worse, so they are being used). I've read the FAQ and the docs,
but there are really too many parameters to understand what to change
and how.

In particular, my current problem is adaptive learning: when I
process a large batch of pictures, the result varies depending on the
other fragments. Fragments that are perfectly readable and correctly
classified when processed individually give different, wrong results
when processed in a batch (I mean reusing the same API instance for
multiple images).
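
One workaround I'm considering is resetting the adaptive state
between images, so each fragment starts from the same baseline.
ClearAdaptiveClassifier is a TessBaseAPI method; I'm assuming
tesserocr wraps it under the same name:

from tesserocr import PyTessBaseAPI, PSM

paths = ["numbers-test.png", "numbers-test2.png", "letters-test.png"]

with PyTessBaseAPI(lang="eng+ita", psm=PSM.SINGLE_LINE) as api:
    for path in paths:
        api.SetImageFile(path)
        print(path, api.GetUTF8Text().strip())
        # Discard whatever was adapted on this image so the next one
        # is not influenced by it.
        api.ClearAdaptiveClassifier()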

I tried to disable it, but it looks like it cannot be disabled when
using multiple languages(?).

If I use only "ita" (and no whitelist, no learning) the first image in this 
post is recognized as (text [confidence]):

('5748788\n\n', [81])
('5748788\n\n', [81])
('5748788\n\n', [81])
('5748788\n\n', [81])

With learning (multiple calls, no whitelist, lang: ita):

('5748788\n\n', [81])
('5748788\n\n', [81])
('5748788\n\n', [90])
('5748788\n\n', [90])

so it improves to a higher confidence (I do not know how much the 
confidence value matters in real life). It looks like learning is doing 
something good even with no whitelist (I could use the whitelist anyway, 
just to be sure, but the starting point looks better).

I'm wondering if I can do some kind of "warmup" with learning enabled
and turn it off later (I'll try this today). But how many samples do I
need? And it seems a little hacky.
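
In code the warmup idea would look something like this, assuming
classify_enable_learning can be toggled at runtime via SetVariable
(which I have not verified):

from tesserocr import PyTessBaseAPI, PSM

with PyTessBaseAPI(lang="eng+ita", psm=PSM.SINGLE_LINE) as api:
    # Warmup: let the adaptive classifier learn on known-good samples.
    for path in ["warmup01.png", "warmup02.png"]:  # hypothetical set
        api.SetImageFile(path)
        api.GetUTF8Text()

    # Then freeze the adapted state before the production images.
    api.SetVariable("classify_enable_learning", "0")
    api.SetImageFile("numbers-test.png")
    print(api.GetUTF8Text().strip())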

Or maybe there is some way to print debug information from the
learning part, to see which parameters change, and then set them
manually later (I tried a few debug params but got no output).

Or maybe it is quite easy to manually find good parameters for this kind of 
regular text to get close to 90 confidence.

On the "AT" fragment I get 89 confidence, which seems quite low for
this kind of simple, clean text.

What I need are (good) consistent results in all situations for the same 
image. What do you think?


Thanks, bye

Lorenzo
