Re: [tesseract-ocr] tesseract performs wrong auto-correction sometimes : how to disable it?

2018-04-29 Thread ShreeDevi Kumar
Please provide a sample image to test.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Thu, Apr 26, 2018 at 1:35 PM, Youcef wrote:

>
> I'm using master branch with tessdata_fast models
>
> On Wednesday, April 25, 2018 at 18:49:22 UTC+2, shree wrote:
>
>> Which version of tesseract are you using?
>>
>> ShreeDevi
>> 
>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>
>> On Wed, Apr 25, 2018 at 8:29 PM, Youcef wrote:
>>
>>> Hi,
>>>
>>>
>>> Tesseract seems to post-process its predictions.
>>>
>>> Here is what I get after running OCR on images (same font, same size,
>>> generated with text2image):
>>>
>>> - an image containing "12345678I" => `123456781`
>>> - an image containing "GLOTHUVFI" => `GLOTHUVFI`
>>> - an image containing "12345678H" => `12345678H`
>>> - an image containing "GLOTHUVFH" => `GLOTHUVFH`
>>> - an image containing "12345678A" => `123456784`
>>> - an image containing "GLOTHUVFA" => `GLOTHUVFA`
>>>
>>> It looks like Tesseract doesn't like a word made of digits with a single
>>> letter at the end. If that letter looks like a digit ("I" and "A" look
>>> like "1" and "4" respectively), it is replaced by the closest digit.
>>> I have tried tuning the following parameters without any change in the
>>> result:
>>>
>>> - segment_penalty_dict_frequent_word
>>> - language_model_penalty_chartype
>>>
>>> Thanks for any help.
>>>
>>> Regards
>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to tesseract-oc...@googlegroups.com.
>>> To post to this group, send email to tesser...@googlegroups.com.
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/4722674d-27a1-4b8e-8c5a-9e07dbe3ca7d%40googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
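
The six results reported in the quoted message can be compared mechanically. Below is a minimal sketch in plain Python (no Tesseract call) that lists where a letter in the source text came back as a digit. The function names are hypothetical, and the sample pairs are copied from the report.

```python
# Expected text vs. OCR output, copied from the report in this thread.
SAMPLES = [
    ("12345678I", "123456781"),
    ("GLOTHUVFI", "GLOTHUVFI"),
    ("12345678H", "12345678H"),
    ("GLOTHUVFH", "GLOTHUVFH"),
    ("12345678A", "123456784"),
    ("GLOTHUVFA", "GLOTHUVFA"),
]


def substitutions(expected, actual):
    """Return (index, expected_char, actual_char) for each differing position."""
    if len(expected) != len(actual):
        raise ValueError("this sketch only compares strings of equal length")
    return [(i, e, a) for i, (e, a) in enumerate(zip(expected, actual)) if e != a]


def letter_to_digit_swaps(samples):
    """Collect cases where a letter in the image text came back as a digit."""
    swaps = []
    for expected, actual in samples:
        for index, e, a in substitutions(expected, actual):
            if e.isalpha() and a.isdigit():
                swaps.append((expected, index, e, a))
    return swaps


if __name__ == "__main__":
    for expected, index, letter, digit in letter_to_digit_swaps(SAMPLES):
        print(f"{expected}: position {index}: {letter!r} read as {digit!r}")
```

Only the two digit-prefixed words with "I" and "A" are flagged, which matches the observation that the all-letter words are read correctly.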



Re: [tesseract-ocr] Tesseract config for simple single words text and questions about learning

2018-04-29 Thread ShreeDevi Kumar
Try tesseract-4.0.0-beta.

I get correct results with it from the command line:


# tesseract numbers-test.png numbers-test --tessdata-dir ./tessdata_fast -l eng --oem 1 --psm 6
Tesseract Open Source OCR Engine v4.0.0-beta.1-200-g37d20 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.

# tesseract numbers-test2.png numbers-test2 --tessdata-dir ./tessdata_fast -l eng --oem 1 --psm 6
Tesseract Open Source OCR Engine v4.0.0-beta.1-200-g37d20 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.

# tesseract letters-test.png letters-test --tessdata-dir ./tessdata_fast -l eng --oem 1 --psm 6
Tesseract Open Source OCR Engine v4.0.0-beta.1-200-g37d20 with Leptonica
Warning. Invalid resolution 0 dpi. Using 70 instead.
#
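
For scripting the same three runs, here is a minimal sketch that builds the command lines shown above from Python. The helper names are hypothetical. The flags (--tessdata-dir, -l, --oem, --psm) and their values are taken from the commands in this message, and the sketch assumes the tesseract binary is on the PATH.

```python
import shlex
import subprocess


def build_tesseract_cmd(image, output_base, tessdata_dir="./tessdata_fast",
                        lang="eng", oem=1, psm=6):
    """Build the argument list for the tesseract CLI, mirroring the commands above."""
    return [
        "tesseract", image, output_base,
        "--tessdata-dir", tessdata_dir,
        "-l", lang,
        "--oem", str(oem),
        "--psm", str(psm),
    ]


def run_tesseract(image, output_base, **options):
    """Run tesseract; the recognized text is written to <output_base>.txt."""
    cmd = build_tesseract_cmd(image, output_base, **options)
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return output_base + ".txt"


if __name__ == "__main__":
    # Printing only; call run_tesseract(...) where tesseract 4 is installed.
    for name in ("numbers-test", "numbers-test2", "letters-test"):
        print(shlex.join(build_tesseract_cmd(name + ".png", name)))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.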




ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sat, Apr 28, 2018 at 5:03 PM, Lorenzo Blz wrote:

>
> Hi, I'm using tesseract to recognize small fragments of text like this
> (actual images I'm using):
>
> Numbers are fixed length (7 digits) and letters are always 2 uppercase
> characters. I'm using a whitelist (a different one depending on whether the
> fragment is text or digits, which I know in advance), and it works
> reasonably well. The size of these fragments is fixed: I rescale them to the
> same height (54 pixels; I could change it or add some borders). They are
> extracted from smartphone pictures, so the original resolution varies a lot.
>
> I'm using lang "eng+ita" because I get better results this way. I'm
> also using user-patterns but they are not helping much. I'm using the API
> through the tesserocr Python bindings.
>
> I think there are many parameters I can fine-tune. I tried a few
> (load_system_dawg, load_freq_dawg, textord_min_linesize) but none of these
> improved the results (a very small textord_min_linesize=0.2 made them
> worse, so they are being used). I've read the FAQ and the docs but there
> are really too many parameters to understand what to change and how.
>
> In particular my current problem is adaptive learning: when I process a
> large batch of pictures the result varies depending on other fragments.
> Fragments that are perfectly readable and correctly classified when
> processed individually, give different, wrong, results when processed in a
> batch (I mean reusing the same api instance for multiple images).
>
> I tried to disable it but it looks like it cannot be disabled when using
> multiple languages(?).
>
> If I use only "ita" (and no whitelist, no learning) the first image in
> this post is recognized as (text [confidence]):
>
> ('5748788\n\n', [81])
> ('5748788\n\n', [81])
> ('5748788\n\n', [81])
> ('5748788\n\n', [81])
>
> With learning (multiple calls, no whitelist, lang: ita):
>
> ('5748788\n\n', [81])
> ('5748788\n\n', [81])
> ('5748788\n\n', [90])
> ('5748788\n\n', [90])
>
> so it improves to a higher confidence (I do not know how much the
> confidence value matters in real life). It looks like learning is doing
> something good even with no whitelist (I could use the whitelist anyway,
> just to be sure, but the starting point looks better).
>
> I'm wondering if I can do some kind of "warmup" with learning enabled and
> later turn it off (I'll try this today). But how many samples do I need?
> And it seems a little hacky.
>
> Or maybe there is some way to print debug information from the learning
> part to see which parameters are changed and set them manually later (I
> tried a few debug params but got no output).
>
> Or maybe it is quite easy to manually find good parameters for this kind
> of regular text to get close to 90 confidence.
>
> On the "AT" fragment I get 89 confidence and I think it may be quite low
> for this kind of simple clean text.
>
> What I need are (good) consistent results in all situations for the same
> image. What do you think?
>
>
> Thanks, bye
>
> Lorenzo
>
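
The batch-versus-individual difference described in the quoted message can be measured before changing any parameters. Below is a minimal sketch of such a check. It takes the OCR step as a plain callable, so it does not depend on any particular binding. With tesserocr one could pass a function that reuses one API instance, then one that creates a fresh instance per image, and compare the two reports. That wiring is an assumption and is not shown. `DriftingRecognizer` is a made-up stand-in for a stateful engine, used only to exercise the check.

```python
from collections import defaultdict


def find_inconsistent(recognize, images, passes=2):
    """Run `recognize` over `images` several times; report images whose text changes.

    `recognize` is any callable mapping an image identifier to recognized text.
    Returns {image: sorted list of distinct texts} for unstable images only.
    """
    seen = defaultdict(set)
    for _ in range(passes):
        for image in images:
            seen[image].add(recognize(image))
    return {image: sorted(texts) for image, texts in seen.items() if len(texts) > 1}


class DriftingRecognizer:
    """Hypothetical stand-in for a reused OCR instance whose answers drift."""

    def __init__(self):
        self.calls = 0

    def __call__(self, image):
        self.calls += 1
        # Made-up drift: from the third call on, "a.png" is misread.
        if image == "a.png" and self.calls > 2:
            return "5748789"
        return {"a.png": "5748788", "b.png": "AT"}[image]


if __name__ == "__main__":
    report = find_inconsistent(DriftingRecognizer(), ["a.png", "b.png"])
    for image, texts in report.items():
        print(image, "->", texts)
```

An empty report for a given setup means every image produced the same text on every pass, which is the consistency property asked for above.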


Re: [tesseract-ocr] tesseract 4 beta: openCL useage

2018-04-29 Thread shree

>
> Please see https://github.com/tesseract-ocr/tesseract/issues/837
>

This discussion is better held there. 



Re: [tesseract-ocr] Trained font - always one letter wrong

2018-04-29 Thread ShreeDevi Kumar
Check that your training text has enough samples for d.

ShreeDevi

भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

On Sun, Apr 29, 2018 at 1:51 PM,  wrote:

> I did. Unfortunately they don't answer...
> Do you have any advice for me to improve the
> training process? How many training texts should I use? Or is it possible
> that there is a problem with this font itself? It would help very much to
> find that out.
>
> Best regards Dave
>
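
The suggestion above (check that the training text has enough samples for d) can be verified with a character count. This is a minimal sketch; the function names are hypothetical, and the `min_count` threshold is an arbitrary illustration, not a value recommended anywhere in this thread.

```python
from collections import Counter


def char_counts(text):
    """Count every non-whitespace character in the training text."""
    return Counter(ch for ch in text if not ch.isspace())


def underrepresented(text, min_count):
    """Return characters that appear fewer than `min_count` times, rarest first."""
    counts = char_counts(text)
    return sorted(
        (ch for ch, n in counts.items() if n < min_count),
        key=lambda ch: (counts[ch], ch),
    )


if __name__ == "__main__":
    # Placeholder text; in practice, read the real training text file instead.
    sample = "add a dad\nbad bead\n"
    print("count of 'd':", char_counts(sample)["d"])
    print("rare characters:", underrepresented(sample, min_count=2))
```

If "d" shows up in the rare list for the real training text, adding more lines that contain it is a cheap first thing to try.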



Re: [tesseract-ocr] Trained font - always one letter wrong

2018-04-29 Thread dave . hardy
I did. Unfortunately they don't answer...
Do you have any advice for me to improve the
training process? How many training texts should I use? Or is it possible that
there is a problem with this font itself? It would help very much to find that
out.

Best regards Dave
