Re: [tesseract-ocr] Problem with using two trained.data files in combination for a better result.

damon Fri, 10 Aug 2018 08:09:42 -0700

Hi Shree, just a quick update.

I've now looked further into the tesseract.log output and understand how 
it works: it goes through different choices and eventually decides on a 
"best choice". However, the output doesn't explain how it then decides 
which result takes priority. Even after it goes through the "fo" 
dictionary and settles on a best choice for it, it immediately moves on to 
the eng dictionary and seems to use the eng output because (I'm guessing) 
it regards it as more accurate. This suggests your theory that our custom 
"fo" dictionary is not trained to a high enough accuracy level is correct. 
Is there any way I can train either eng or fo to improve its accuracy, or 
to override the other dictionary on the specific characters it gets wrong? 
For example, in our case, the eng.traineddata dictionary sometimes gets 3s 
and 5s mixed up, and it has a lot of trouble with 4s.
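
One rough way to see exactly where the two models disagree on digits is to run fo and eng separately (each with its own -l value) and compare the two text outputs. This is only a post-processing sketch, not a Tesseract feature: the helper name digit_disagreements and the sample strings are made up for illustration, and the code assumes the two outputs have already been read into strings.

```python
from difflib import SequenceMatcher


def digit_disagreements(fo_text, eng_text):
    """Return (fo_fragment, eng_fragment) pairs where the two OCR outputs
    differ and at least one side of the difference contains a digit."""
    matcher = SequenceMatcher(None, fo_text, eng_text, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # both models agree on this stretch of text
        fo_part, eng_part = fo_text[i1:i2], eng_text[j1:j2]
        if any(ch.isdigit() for ch in fo_part + eng_part):
            pairs.append((fo_part, eng_part))
    return pairs


if __name__ == "__main__":
    # Hypothetical outputs from two separate runs, one per traineddata file.
    fo_out = "Invoice 4431 dated 15 May"
    eng_out = "Invoice A431 dated 13 May"
    for fo_part, eng_part in digit_disagreements(fo_out, eng_out):
        print(f"fo={fo_part!r}  eng={eng_part!r}")
```

A list like this shows which digits each model gets wrong most often, which may help when deciding what extra training data a model needs.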


Your help on this would be greatly appreciated!

Kind Regards,

Damon 

On Thursday, 9 August 2018 11:29:11 UTC+1, shree wrote:
>
> The output tesseract.log file should be produced in the directory from 
> which you run the command, usually where your OCR output is created. 
>
> On Thu, Aug 9, 2018 at 3:48 PM <da...@maxcommunications.co.uk> wrote:
>
>> Hello Shree, thank you for your prompt reply.
>>
>> I have now changed the logfile as instructed. Where can I find the output 
>> tesseract.log file? Will it be produced in the same location as the 
>> logfile, in C:\Program Files (x86)\Tesseract-OCR\tessdata\configs? I'm 
>> guessing the tesseract.log file will be produced once I've used logfile in 
>> the commands.
>>
>> Kind Regards,
>>
>> Damon
>>
>>
>> On Wednesday, 8 August 2018 19:07:02 UTC+1, shree wrote:
>>>
>>> I think this could be because your new traineddata is not trained to as 
>>> high an accuracy level as the eng traineddata.
>>>
>>> You can set up a debug log to verify this. See 
>>> https://github.com/tesseract-ocr/tesseract/issues/1275#issuecomment-360367865
>>> for details.
>>>
>>> On Wed, Aug 8, 2018 at 6:04 PM <da...@maxcommunications.co.uk> wrote:
>>>
>>>> I'm trying to use two traineddata dictionaries in combination because 
>>>> one of them recognises specific numbers better than the other.
>>>>
>>>> Here is an example of the code line.
>>>>
>>>>     $codeLine .= '<br>magick convert "'.$filePath.'" -quality 90 -density 300x300  -units PixelsPerInch "'.$output.'.jpg"';
>>>>     $codeLine .= '<br>tesseract "'.$output.'.jpg" "'.$output.'" -l fo+eng txt pdf';
>>>>
>>>> Despite the fact that I put "fo" first (this is the one that recognises 
>>>> the number 4 better), it still gives me an output text file that is 
>>>> identical to the "eng" dictionary output when I run that on its own. 
>>>>
>>>> For some reason, it not only prioritises eng but also ignores the fo 
>>>> traineddata file completely.
>>>>
>>>> The "fo" file definitely works, as I've tested it on its own.
>>>>
>>>> I have attached an image example of the text I'd like to OCR and the 
>>>> two relevant traineddata files.
>>>>
>>>> -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to tesseract-oc...@googlegroups.com.
>>>> To post to this group, send email to tesser...@googlegroups.com.
>>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>>> To view this discussion on the web visit 
>>>> https://groups.google.com/d/msgid/tesseract-ocr/1a5a6768-baeb-4ba9-9cbd-adda6cba957c%40googlegroups.com
>>>>  
>>>> <https://groups.google.com/d/msgid/tesseract-ocr/1a5a6768-baeb-4ba9-9cbd-adda6cba957c%40googlegroups.com?utm_medium=email&utm_source=footer>
>>>> .
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>>
>>> -- 
>>>
>>> ____________________________________________________________
>>> भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
>>>

