Re: [tesseract-ocr] Problem with using two trained.data files in combination for a better result.

damon Fri, 10 Aug 2018 03:31:45 -0700

Hi Shree, thanks for your patience and help!

I have managed to produce the tesseract.log file with your help. Now I'm 
trying to understand it a bit more. Here is a quick snippet of the output I 
want to show you:
Rejecter: 5 [35 ]0 3 [33 ]0 . [2e ]p  (word=n, case=y, unambig=y, multiple=y)
Best choice: accepted=0, adaptable=0, done=0 : Lang result : 53. : R=54.2836, C=-5.08463, F=1.5, Perm=2, xht=[0,256], ambig=0
pos NORM NORM NORM
str 5 3 .
state: 1 1 1
C -5.085 -3.497 -1.978
1 new words worse than 1 old words: r: 54.2836 v 1.81739 c: -5.08463 v -3.90478 valid dict: 0 v 0
Already done word with lang eng at:Bounding box=(499,2)->(514,1361)
Processing word with lang eng at:Bounding box=(672,1253)->(762,1288)
Trying word using lang eng, oem 1
Best choice: accepted=1, adaptable=0, done=1 : Lang result : Date : R=2.05422, C=-0.662761, F=1, Perm=8, xht=[0,3.40282e+038], ambig=0
pos NORM NORM NORM NORM
str D a t e
state: 1 1 1 1
C -0.085 -0.095 -0.088 -0.085
1 new words better than 0 old words: r: 2.05422 v 0 c: -0.662761 v 0 valid dict: 1 v 0
Processing word with lang eng at:Bounding box=(521,1084)->(842,1156)
Trying word using lang eng, oem 1
Best choice: accepted=1, adaptable=0, done=1 : Lang result : May : R=1.64554, C=-0.733805, F=1, Perm=8, xht=[0,3.40282e+038], ambig=0
pos NORM NORM NORM
str M a y
state: 1 1 1
C -0.092 -0.085 -0.105
Best choice: accepted=0, adaptable=0, done=1 : Lang result : 182.2. : R=4.51301, C=-4.37332, F=1, Perm=6, xht=[0,3.40282e+038], ambig=0
pos NORM NORM NORM NORM NORM NORM
str 1 8 2 . 2 .
state: 1 1 1 1 1 1
C -0.116 -0.204 -0.176 -0.612 -0.210 -0.625
1 new words better than 0 old words: r: 1.64554 v 0 c: -0.733805 v 0 valid dict: 1 v 0
1 new words better than 0 old words: r: 4.51301 v 0 c: -4.37332 v 0 valid dict: 0 v 0
Trying word using lang fo, oem 0


As you can see on the very last line, it says "Trying word using lang fo". 
I can see this line repeated about 5 times, so it seems that it does 
sometimes use the fo dictionary. However, I wonder how it works. How does it 
know when to use fo after looking at eng? Does it only look at fo when it 
finds box coordinates for a letter/word but is unable to assign letters to 
them, and so falls back to the next dictionary? If so, how can it be that 
entering "fo+eng" in the command instead of "eng+fo" makes no 
difference to which dictionary is searched first?
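
To get a rough picture of how often each language is actually tried, and how 
often a new result is judged better or worse than the old one, the relevant 
lines in tesseract.log can be tallied. Below is a minimal Python sketch. It 
assumes the log lines have exactly the shape shown in the snippet above, and 
the function name summarize_log is made up for this example (it is not part 
of Tesseract). It only counts lines; it does not explain how Tesseract 
decides which language to try.

```python
import re
from collections import Counter

# Patterns are taken from the log lines quoted in this message.
# search() is used so that any extra characters around a line are ignored.
TRY_RE = re.compile(r"Trying word using lang (\w+), oem (\d+)")
CMP_RE = re.compile(r"(\d+) new words (better|worse) than (\d+) old words")


def summarize_log(lines):
    """Tally language attempts and better/worse comparisons in a debug log.

    `lines` is any iterable of strings, e.g. an open file object.
    Returns (tries, outcomes): tries is keyed by (lang, oem),
    outcomes is keyed by "better" or "worse".
    """
    tries = Counter()
    outcomes = Counter()
    for line in lines:
        match = TRY_RE.search(line)
        if match:
            tries[(match.group(1), int(match.group(2)))] += 1
            continue
        match = CMP_RE.search(line)
        if match:
            outcomes[match.group(2)] += 1
    return tries, outcomes
```

It could be called as summarize_log(open("tesseract.log", errors="replace")), 
with the path adjusted to wherever the log was written. In the snippet above, 
eng is tried with oem 1 and fo with oem 0, so keeping the oem value in the 
key makes that difference visible in the counts.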

On Thursday, 9 August 2018 11:29:11 UTC+1, shree wrote:
>
> The tesseract.log output file should be produced in the directory from 
> which you run the command, usually where your OCR output is created. 
>
> On Thu, Aug 9, 2018 at 3:48 PM <[email protected]> wrote:
>
>> Hello Shree, thank you for your prompt reply.
>>
>> I have now changed the logfile as instructed. Where can I find the output 
>> tesseract.log file? Will it be produced in the same location as the 
>> logfile, in C:\Program Files (x86)\Tesseract-OCR\tessdata\configs? I'm 
>> guessing the tesseract.log file will be produced once I've used logfile in 
>> the commands.
>>
>> Kind Regards,
>>
>> Damon
>>
>>
>> On Wednesday, 8 August 2018 19:07:02 UTC+1, shree wrote:
>>>
>>> I think this could happen if your new traineddata is not trained to as 
>>> high an accuracy level as the eng traineddata.
>>>
>>> You can set up a debug log to verify this. See 
>>> https://github.com/tesseract-ocr/tesseract/issues/1275#issuecomment-360367865
>>>  
>>> for details.
>>>
>>> On Wed, Aug 8, 2018 at 6:04 PM <[email protected]> wrote:
>>>
>>>> I'm trying to use two traineddata dictionaries in combination 
>>>> because one of them recognises specific numbers better 
>>>> than the other.
>>>>
>>>> Here is an example of the code line.
>>>>
>>>>                  $codeLine .= '<br>magick convert "'.$filePath.'" -quality 90 -density 300x300 -units PixelsPerInch "'.$output.'.jpg"';
>>>>                  $codeLine .= '<br>tesseract "'.$output.'.jpg" "'.$output.'" -l fo+eng txt pdf';
>>>>
>>>> Despite the fact that I put "fo" in front (this is the one that recognises 
>>>> the number 4 better), it still gives me an output text file that is 
>>>> identical to the "eng" dictionary output when I run that on its own. 
>>>>
>>>> For some reason, it not only prioritises eng but also 
>>>> ignores the fo traineddata file completely.
>>>>
>>>> The "fo" file definitely works, as I've tested it on its own.
>>>>
>>>> I have attached an image example of the text I'd like to OCR and the 
>>>> two relevant traineddata files.
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> To post to this group, send email to [email protected].
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/tesseract-ocr/1a5a6768-baeb-4ba9-9cbd-adda6cba957c%40googlegroups.com
>>>>  
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/1a5a6768-baeb-4ba9-9cbd-adda6cba957c%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>>
>>> -- 
>>>
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>
>>
>
>
>
