Re: [tesseract-ocr] Problem with using two trained.data files in combination for a better result.

damon Fri, 10 Aug 2018 05:04:40 -0700

I just realised that some of the output underneath "Trying word using lang fo, 
oem 0" might be useful information! Here it is:
Running NoDangerousAmbig() for 5 [35 ]0 3 [33 ]0 . [2e ]p 
Looking for replaceable ngrams starting with 5 [35 ]0:
Looking for replaceable ngrams starting with 3 [33 ]0:
Looking for replaceable ngrams starting with . [2e ]p:
Looking for ambiguous ngrams starting with 5 [35 ]0:
Looking for ambiguous ngrams starting with 3 [33 ]0:
Looking for ambiguous ngrams starting with . [2e ]p:
53. ViterbiStateEntry(NEW) with ratings_sum=43.4269 length=3 cost=54.283619 
top_choice_flags=0x19 XH_GOOD
New Best Word Choice : 53. : R=54.2836, C=-5.08463, F=1.5, Perm=2, 
xht=[0,256], ambig=0
pos NORM NORM NORM
str 5 3 .
state: 1 1 1 
C -5.085 -3.497 -1.978


Stopper:  53. (word=n, case=y, xht_ok=NORMAL=[0,256])

Running NoDangerousAmbig() for 5 [35 ]0 3 [33 ]0 n [6e ]a 
Looking for replaceable ngrams starting with 5 [35 ]0:
Looking for replaceable ngrams starting with 3 [33 ]0:
Looking for replaceable ngrams starting with n [6e ]a:
candidate ngram: n ( 20 )
current ngram from spec: n p C ( 20 6 24 )
comparison result: -1
Looking for ambiguous ngrams starting with 5 [35 ]0:
Looking for ambiguous ngrams starting with 3 [33 ]0:
Looking for ambiguous ngrams starting with n [6e ]a:
candidate ngram: n ( 20 )
current ngram from spec: n ( 20 )
comparison result: 0
fixpt+=(2 3 0 1 ri)
found ambiguity: ri ( 85 )
candidate ngram: n ( 20 )
current ngram from spec: n ( 20 )
comparison result: 0
fixpt+=(2 3 0 1 tr)
found ambiguity: tr ( 114 )
candidate ngram: n ( 20 )
current ngram from spec: n ( 20 )
comparison result: 0
fixpt+=(2 3 0 1 ij)
found ambiguity: ij ( 116 )
candidate ngram: n ( 20 )
current ngram from spec: n i ( 20 16 )
comparison result: -1

Resulting ambig_blob_choices:
r0.00 c0.00 x[0,1]: 3 5 [35 ]0

r0.00 c0.00 x[0,1]: 27 3 [33 ]0

r0.00 c0.00 x[0,1]: 20 n [6e ]a
r-1.00 c0.00 x[0,1]: 85 ri [72 69 ]
r-1.00 c0.00 x[0,1]: 114 tr [74 72 ]
r-1.00 c0.00 x[0,1]: 116 ij [69 6a ]

53n ViterbiStateEntry(NEW) with ratings_sum=43.4676 length=3 cost=67.374825 
top_choice_flags=0x2 inconsistent=(punc 0 case 0 chartype 1 script 0 font 
0) XH_GOOD
New Secondary Word Choice : 53n : R=67.3748, C=-5.08463, F=1.5, Perm=2, 
xht=[0,256], ambig=0
pos NORM NORM NORM
str 5 3 n
state: 1 1 1 
C -5.085 -3.497 -2.159

Running NoDangerousAmbig() for 5 [35 ]0 3 [33 ]0 H [48 ]A 
Looking for replaceable ngrams starting with 5 [35 ]0:
Looking for replaceable ngrams starting with 3 [33 ]0:
Looking for replaceable ngrams starting with H [48 ]A:
candidate ngram: H ( 51 )
current ngram from spec: H p p ( 51 6 6 )
comparison result: -1
Looking for ambiguous ngrams starting with 5 [35 ]0:
Looking for ambiguous ngrams starting with 3 [33 ]0:
Looking for ambiguous ngrams starting with H [48 ]A:
53H ViterbiStateEntry(NEW) with ratings_sum=43.4944 length=3 cost=67.416374 
top_choice_flags=0x4 inconsistent=(punc 0 case 0 chartype 1 script 0 font 
0) XH_GOOD
New Secondary Word Choice : 53H : R=67.4164, C=-5.08463, F=1.5, Perm=2, 
xht=[0,256], ambig=0
pos NORM NORM NORM
str 5 3 H
state: 1 1 1 
C -5.085 -3.497 -2.279

Filtering against best choice : 53. : R=54.2836, C=-5.08463, F=1.5, Perm=2, 
xht=[0,256], ambig=0
pos NORM NORM NORM
str 5 3 .
state: 1 1 1 
C -5.085 -3.497 -1.978

Best Raw Choice : 53. : R=43.4269, C=-5.08463, F=1, Perm=2, xht=[0,256], 
ambig=0
pos NORM NORM NORM
str 5 3 .
state: 1 1 1 
C -5.085 -3.497 -1.978

Cooked Choice #0 : 53. : R=54.2836, C=-5.08463, F=1.5, Perm=2, xht=[0,256], 
ambig=0
pos NORM NORM NORM
str 5 3 .
state: 1 1 1 
C -5.085 -3.497 -1.978

Cooked Choice #1 : 53n : R=67.3748, C=-5.08463, F=1.5, Perm=2, xht=[0,256], 
ambig=0
pos NORM NORM NORM
str 5 3 n
state: 1 1 1 
C -5.085 -3.497 -2.159

Cooked Choice #2 : 53H : R=67.4164, C=-5.08463, F=1.5, Perm=2, xht=[0,256], 
ambig=0
pos NORM NORM NORM
str 5 3 H
state: 1 1 1 
C -5.085 -3.497 -2.279

Rejecter: 5 [35 ]0 3 [33 ]0 . [2e ]p  (word=n, case=y, unambig=y, 
multiple=y)
Best choice: accepted=0, adaptable=0, done=0 : Lang result : 53. : 
R=54.2836, C=-5.08463, F=1.5, Perm=2, xht=[0,256], ambig=0
pos NORM NORM NORM
str 5 3 .
state: 1 1 1 
C -5.085 -3.497 -1.978
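
A log like this is easier to reason about once it is tallied. The following is only a rough sketch, not part of Tesseract: it counts how often each language was tried and collects the "new words better/worse than old words" comparisons. The function name summarize and the dictionary keys are made up for illustration, and reading "r" and "c" as rating and certainty is an assumption based on the R= and C= values in the "Lang result" lines. It also assumes the real tesseract.log has no email quote markers in it.

```python
import re
from collections import Counter

# Patterns copied from the log lines shown above.
TRY_RE = re.compile(r"Trying word using lang ([^,\s]+), oem (\d+)")
CMP_RE = re.compile(
    r"(\d+) new words (better|worse) than (\d+) old words: "
    r"r: (\S+) v (\S+) c: (\S+) v (\S+) valid dict: (\d+) v (\d+)"
)


def summarize(log_text):
    """Return (attempts per (lang, oem), list of new-vs-old comparisons)."""
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    tries = Counter((m.group(1), int(m.group(2))) for m in TRY_RE.finditer(flat))
    comparisons = [
        {
            "verdict": m.group(2),            # "better" or "worse"
            "new_rating": float(m.group(4)),  # assumed meaning of "r: X v Y"
            "old_rating": float(m.group(5)),
            "new_certainty": float(m.group(6)),  # assumed meaning of "c: X v Y"
            "old_certainty": float(m.group(7)),
        }
        for m in CMP_RE.finditer(flat)
    ]
    return tries, comparisons
```

Calling summarize(open("tesseract.log").read()) in the directory where the log was written should show how many words were attempted with fo and how many of those attempts were judged worse than the existing eng result.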

On Friday, 10 August 2018 11:31:28 UTC+1, [email protected] 
wrote:
>
> Hi Shree, thanks for your patience and help!
>
> I have managed to produce the tesseract.log file with your help. Now I'm 
> trying to understand it a bit more. Here is a quick snippet of the output I 
> want to show you:
> *Rejecter: 5 [35 ]0 3 [33 ]0 . [2e ]p  (word=n, case=y, unambig=y, 
> multiple=y)*
> *Best choice: accepted=0, adaptable=0, done=0 : Lang result : 53. : 
> R=54.2836, C=-5.08463, F=1.5, Perm=2, xht=[0,256], ambig=0*
> *pos NORM NORM NORM*
> *str 5 3 .*
> *state: 1 1 1 *
> *C -5.085 -3.497 -1.978*
> *1 new words worse than 1 old words: r: 54.2836 v 1.81739 c: -5.08463 v 
> -3.90478 valid dict: 0 v 0*
> *Already done word with lang eng at:Bounding box=(499,2)->(514,1361)*
> *Processing word with lang eng at:Bounding box=(672,1253)->(762,1288)*
> *Trying word using lang eng, oem 1*
> *Best choice: accepted=1, adaptable=0, done=1 : Lang result : Date : 
> R=2.05422, C=-0.662761, F=1, Perm=8, xht=[0,3.40282e+038], ambig=0*
> *pos NORM NORM NORM NORM*
> *str D a t e*
> *state: 1 1 1 1 *
> *C -0.085 -0.095 -0.088 -0.085*
> *1 new words better than 0 old words: r: 2.05422 v 0 c: -0.662761 v 0 
> valid dict: 1 v 0*
> *Processing word with lang eng at:Bounding box=(521,1084)->(842,1156)*
> *Trying word using lang eng, oem 1*
> *Best choice: accepted=1, adaptable=0, done=1 : Lang result : May : 
> R=1.64554, C=-0.733805, F=1, Perm=8, xht=[0,3.40282e+038], ambig=0*
> *pos NORM NORM NORM*
> *str M a y*
> *state: 1 1 1 *
> *C -0.092 -0.085 -0.105*
> *Best choice: accepted=0, adaptable=0, done=1 : Lang result : 182.2. : 
> R=4.51301, C=-4.37332, F=1, Perm=6, xht=[0,3.40282e+038], ambig=0*
> *pos NORM NORM NORM NORM NORM NORM*
> *str 1 8 2 . 2 .*
> *state: 1 1 1 1 1 1 *
> *C -0.116 -0.204 -0.176 -0.612 -0.210 -0.625*
> *1 new words better than 0 old words: r: 1.64554 v 0 c: -0.733805 v 0 
> valid dict: 1 v 0*
> *1 new words better than 0 old words: r: 4.51301 v 0 c: -4.37332 v 0 valid 
> dict: 0 v 0*
> *Trying word using lang fo, oem 0*
>
> As you can see on the very last line, it says "Trying word using lang fo," 
> I can see this line repeated about 5 times, so it seems that it sometimes 
> does use the fo dictionary. However, I wonder how it works. How does it 
> know when to use fo after looking at eng? Does it only look at fo when it 
> sees a box coordinate for a letter/word but is unable to find letters to 
> assign to it, and so uses the next dictionary? If so, how can it be that 
> entering "fo+eng" in the command instead of "eng+fo" makes no 
> difference to which dictionary is searched first?
>
> On Thursday, 9 August 2018 11:29:11 UTC+1, shree wrote:
>>
>> The output tesseract.log file should be produced in the directory from which 
>> you are running the command, usually where your OCR output is created. 
>>
>> On Thu, Aug 9, 2018 at 3:48 PM <[email protected]> wrote:
>>
>>> Hello Shree, thank you for your prompt reply.
>>>
>>> I have now changed the logfile as instructed. Where can I find the 
>>> output tesseract.log file? Will it be produced in the same location as the 
>>> logfile, in C:\Program Files (x86)\Tesseract-OCR\tessdata\configs? I'm 
>>> guessing the tesseract.log file will be produced once I've used logfile in 
>>> the commands.
>>>
>>> Kind Regards,
>>>
>>> Damon
>>>
>>>
>>> On Wednesday, 8 August 2018 19:07:02 UTC+1, shree wrote:
>>>>
>>>> I think this could be because your new traineddata is not trained to as 
>>>> high an accuracy level as the eng traineddata.
>>>>
>>>> You can set up a debug log to verify this. See 
>>>> https://github.com/tesseract-ocr/tesseract/issues/1275#issuecomment-360367865
>>>> for details.
>>>>
>>>> On Wed, Aug 8, 2018 at 6:04 PM <[email protected]> wrote:
>>>>
>>>>> I'm trying to use two traineddata files in combination because one of 
>>>>> them recognises specific numbers better than the other.
>>>>>
>>>>> Here is an example of the code line.
>>>>>
>>>>>     $codeLine .= '<br>magick convert "'.$filePath.'" -quality 90 -density 300x300 -units PixelsPerInch "'.$output.'.jpg"';
>>>>>     $codeLine .= '<br>tesseract "'.$output.'.jpg" "'.$output.'" -l fo+eng txt pdf';
>>>>>
>>>>> Despite the fact that I put "fo" first (this is the one that recognises 
>>>>> the number 4 better), it still gives me an output text file that is 
>>>>> identical to the "eng" dictionary output when I run that on its own.
>>>>>
>>>>> For some reason, it not only prioritises eng but also ignores the fo 
>>>>> traineddata file completely.
>>>>>
>>>>> The "fo" file definitely works, as I've tested it on its own.
>>>>>
>>>>> I have attached an image example of the text I'd like to OCR and the 
>>>>> two relevant traineddata files.
>>>>>
>>>>> -- 
>>>>> You received this message because you are subscribed to the Google 
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>>> an email to [email protected].
>>>>> To post to this group, send email to [email protected].
>>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>>> To view this discussion on the web visit 
>>>>> https://groups.google.com/d/msgid/tesseract-ocr/1a5a6768-baeb-4ba9-9cbd-adda6cba957c%40googlegroups.com
>>>>>  
>>>>> <https://groups.google.com/d/msgid/tesseract-ocr/1a5a6768-baeb-4ba9-9cbd-adda6cba957c%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>>> .
>>>>> For more options, visit https://groups.google.com/d/optout.
>>>>>
>>>>
>>>>
>>>> -- 
>>>>
>>>> ____________________________________________________________
>>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>>
>>
>
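
On the question quoted above of whether "fo+eng" and "eng+fo" really give the same result, one way to check is to run both orders on the same image and compare the text output. This is a minimal sketch under some assumptions: tesseract is on the PATH, the file and output names are placeholders, and the helper names (build_command, run_order, orders_differ) are invented for illustration. Only the -l option and the txt config from the command in the original message are used.

```python
import subprocess
from pathlib import Path


def build_command(image, output_base, langs):
    """Build the tesseract command line for a given language order."""
    return ["tesseract", str(image), str(output_base), "-l", "+".join(langs), "txt"]


def run_order(image, output_base, langs):
    """Run tesseract once and return the recognised text."""
    subprocess.run(build_command(image, output_base, langs), check=True)
    # The txt config writes its result to <output_base>.txt.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


def orders_differ(image, langs, out_dir="."):
    """Return True if reversing the language order changes the output text."""
    forward = run_order(image, Path(out_dir) / "order_a", langs)
    reverse = run_order(image, Path(out_dir) / "order_b", list(reversed(langs)))
    return forward != reverse
```

For example, orders_differ("scan.jpg", ["fo", "eng"]) would return False if both orders produce identical text, which is the behaviour described in this thread.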

