Re: [tesseract-ocr] Problem with using two trained.data files in combination for a better result.

2018-08-10 Thread Shree Devi Kumar
I do not know about the internal algorithms used by tesseract.

If you are having accuracy issues with certain letters and digits, I suggest
that you fine-tune for impact using the images or a similar font.

Please see the wiki page on training 4.0 for the command - look for fine tuning
for new font/impact. Use eng.traineddata as the base, 50-100 lines of training
text and 300-400 iterations max.
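
As a rough illustration of the fine-tuning step, the sketch below only builds the two command lines (it does not run them). It assumes the `combine_tessdata` and `lstmtraining` tools from the Tesseract 4 training setup are installed. Every path, the `finetuned` output name, and the helper name `build_finetune_commands` are hypothetical placeholders, not values taken from this thread.

```python
from pathlib import Path
from typing import List


def build_finetune_commands(base_traineddata: Path, work_dir: Path,
                            train_listfile: Path,
                            max_iterations: int = 400) -> List[List[str]]:
    """Return the command lines for a fine-tuning run without executing them.

    All paths are hypothetical; adjust them to your own training layout.
    """
    lstm_path = work_dir / "eng.lstm"
    # Step 1: extract the LSTM model from the base traineddata file.
    extract = ["combine_tessdata", "-e", str(base_traineddata), str(lstm_path)]
    # Step 2: continue training from that model for a small number of iterations.
    train = [
        "lstmtraining",
        "--model_output", str(work_dir / "finetuned"),
        "--continue_from", str(lstm_path),
        "--traineddata", str(base_traineddata),
        "--train_listfile", str(train_listfile),
        "--max_iterations", str(max_iterations),
    ]
    return [extract, train]


if __name__ == "__main__":
    commands = build_finetune_commands(
        Path("tessdata_best/eng.traineddata"),  # hypothetical location
        Path("finetune_out"),                   # hypothetical work directory
        Path("train/eng.training_files.txt"),   # hypothetical list file
    )
    for cmd in commands:
        print(" ".join(cmd))
```

The iteration cap defaults to 400 here only to stay within the 300-400 range suggested above.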

Re: [tesseract-ocr] Problem with using two trained.data files in combination for a better result.

2018-08-10 Thread damon
Hi Shree, just a quick update.

I've now looked into the tesseract.log output further and now understand
how it works: it goes through different choices and eventually decides on a
"best choice". However, the output doesn't explain how it then decides which
language has overriding priority for the final result. Even after it scours
the "fo" dictionary and decides on a best choice for it, it immediately moves
on to the eng dictionary and seems to use the eng output because (I'm
guessing) it regards it as more accurate. This means your theory about our
custom "fo" dictionary not being trained to a high enough accuracy level
seems to be correct. Is there any way I can train either eng or fo to improve
its accuracy, or make one dictionary override the other on specific
characters it's getting wrong? For example, in our case, eng.traineddata
sometimes gets 3's and 5's mixed up and has a lot of trouble with 4's.

Your help on this would be greatly appreciated!

Kind Regards,

Damon 

Re: [tesseract-ocr] Problem with using two trained.data files in combination for a better result.

2018-08-10 Thread damon
I just realised some of the output underneath "Trying word using lang fo,
oem 0" might be useful information. Here it is:
Running NoDangerousAmbig() for 5 [35 ]0 3 [33 ]0 . [2e ]p 
Looking for replaceable ngrams starting with 5 [35 ]0:
Looking for replaceable ngrams starting with 3 [33 ]0:
Looking for replaceable ngrams starting with . [2e ]p:
Looking for ambiguous ngrams starting with 5 [35 ]0:
Looking for ambiguous ngrams starting with 3 [33 ]0:
Looking for ambiguous ngrams starting with . [2e ]p:
53. ViterbiStateEntry(NEW) with ratings_sum=43.4269 length=3 cost=54.283619 
top_choice_flags=0x19 XH_GOOD
New Best Word Choice : 53. : R=54.2836, C=-5.08463, F=1.5, Perm=2, 
xht=[0,256], ambig=0
pos NORM NORM NORM
str 5 3 .
state: 1 1 1 
C -5.085 -3.497 -1.978

Stopper:  53. (word=n, case=y, xht_ok=NORMAL=[0,256])

Running NoDangerousAmbig() for 5 [35 ]0 3 [33 ]0 n [6e ]a 
Looking for replaceable ngrams starting with 5 [35 ]0:
Looking for replaceable ngrams starting with 3 [33 ]0:
Looking for replaceable ngrams starting with n [6e ]a:
candidate ngram: n ( 20 )
current ngram from spec: n p C ( 20 6 24 )
comparison result: -1
Looking for ambiguous ngrams starting with 5 [35 ]0:
Looking for ambiguous ngrams starting with 3 [33 ]0:
Looking for ambiguous ngrams starting with n [6e ]a:
candidate ngram: n ( 20 )
current ngram from spec: n ( 20 )
comparison result: 0
fixpt+=(2 3 0 1 ri)
found ambiguity: ri ( 85 )
candidate ngram: n ( 20 )
current ngram from spec: n ( 20 )
comparison result: 0
fixpt+=(2 3 0 1 tr)
found ambiguity: tr ( 114 )
candidate ngram: n ( 20 )
current ngram from spec: n ( 20 )
comparison result: 0
fixpt+=(2 3 0 1 ij)
found ambiguity: ij ( 116 )
candidate ngram: n ( 20 )
current ngram from spec: n i ( 20 16 )
comparison result: -1

Resulting ambig_blob_choices:
r0.00 c0.00 x[0,1]: 3 5 [35 ]0

r0.00 c0.00 x[0,1]: 27 3 [33 ]0

r0.00 c0.00 x[0,1]: 20 n [6e ]a
r-1.00 c0.00 x[0,1]: 85 ri [72 69 ]
r-1.00 c0.00 x[0,1]: 114 tr [74 72 ]
r-1.00 c0.00 x[0,1]: 116 ij [69 6a ]

53n ViterbiStateEntry(NEW) with ratings_sum=43.4676 length=3 cost=67.374825 
top_choice_flags=0x2 inconsistent=(punc 0 case 0 chartype 1 script 0 font 
0) XH_GOOD
New Secondary Word Choice : 53n : R=67.3748, C=-5.08463, F=1.5, Perm=2, 
xht=[0,256], ambig=0
pos NORM NORM NORM
str 5 3 n
state: 1 1 1 
C -5.085 -3.497 -2.159

Running NoDangerousAmbig() for 5 [35 ]0 3 [33 ]0 H [48 ]A 
Looking for replaceable ngrams starting with 5 [35 ]0:
Looking for replaceable ngrams starting with 3 [33 ]0:
Looking for replaceable ngrams starting with H [48 ]A:
candidate ngram: H ( 51 )
current ngram from spec: H p p ( 51 6 6 )
comparison result: -1
Looking for ambiguous ngrams starting with 5 [35 ]0:
Looking for ambiguous ngrams starting with 3 [33 ]0:
Looking for ambiguous ngrams starting with H [48 ]A:
53H ViterbiStateEntry(NEW) with ratings_sum=43.4944 length=3 cost=67.416374 
top_choice_flags=0x4 inconsistent=(punc 0 case 0 chartype 1 script 0 font 
0) XH_GOOD
New Secondary Word Choice : 53H : R=67.4164, C=-5.08463, F=1.5, Perm=2, 
xht=[0,256], ambig=0
pos NORM NORM NORM
str 5 3 H
state: 1 1 1 
C -5.085 -3.497 -2.279

Filtering against best choice : 53. : R=54.2836, C=-5.08463, F=1.5, Perm=2, 
xht=[0,256], ambig=0
pos NORM NORM NORM
str 5 3 .
state: 1 1 1 
C -5.085 -3.497 -1.978

Best Raw Choice : 53. : R=43.4269, C=-5.08463, F=1, Perm=2, xht=[0,256], 
ambig=0
pos NORM NORM NORM
str 5 3 .
state: 1 1 1 
C -5.085 -3.497 -1.978

Cooked Choice #0 : 53. : R=54.2836, C=-5.08463, F=1.5, Perm=2, xht=[0,256], 
ambig=0
pos NORM NORM NORM
str 5 3 .
state: 1 1 1 
C -5.085 -3.497 -1.978

Cooked Choice #1 : 53n : R=67.3748, C=-5.08463, F=1.5, Perm=2, xht=[0,256], 
ambig=0
pos NORM NORM NORM
str 5 3 n
state: 1 1 1 
C -5.085 -3.497 -2.159

Cooked Choice #2 : 53H : R=67.4164, C=-5.08463, F=1.5, Perm=2, xht=[0,256], 
ambig=0
pos NORM NORM NORM
str 5 3 H
state: 1 1 1 
C -5.085 -3.497 -2.279

Rejecter: 5 [35 ]0 3 [33 ]0 . [2e ]p  (word=n, case=y, unambig=y, 
multiple=y)
Best choice: accepted=0, adaptable=0, done=0 : Lang result : 53. : 
R=54.2836, C=-5.08463, F=1.5, Perm=2, xht=[0,256], ambig=0
pos NORM NORM NORM
str 5 3 .
state: 1 1 1 
C -5.085 -3.497 -1.978
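
To compare the candidates in a log like this without reading it by eye, a small parser can pull out the "Cooked Choice" entries. This is only a sketch against the line format shown above. Reading `R` as a rating and `C` as a certainty is an assumption, and the names `CookedChoice` and `parse_cooked_choices` are made up for illustration. In this excerpt, the candidate with the lowest `R` is the one reported as the best choice.

```python
import re
from typing import List, NamedTuple


class CookedChoice(NamedTuple):
    rank: int
    word: str
    rating: float      # the "R=" field (assumed to be a rating)
    certainty: float   # the "C=" field (assumed to be a certainty)


_NUM = r"[-+]?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?"
_CHOICE = re.compile(rf"Cooked Choice #(\d+) : (\S+) : R=({_NUM}), C=({_NUM})")

# Lines copied from the log excerpt above (re-joined onto single lines).
SAMPLE = """\
Cooked Choice #0 : 53. : R=54.2836, C=-5.08463, F=1.5, Perm=2, xht=[0,256], ambig=0
Cooked Choice #1 : 53n : R=67.3748, C=-5.08463, F=1.5, Perm=2, xht=[0,256], ambig=0
Cooked Choice #2 : 53H : R=67.4164, C=-5.08463, F=1.5, Perm=2, xht=[0,256], ambig=0
"""


def parse_cooked_choices(log_text: str) -> List[CookedChoice]:
    """Extract every 'Cooked Choice' entry found in the log text."""
    return [
        CookedChoice(int(rank), word, float(rating), float(certainty))
        for rank, word, rating, certainty in _CHOICE.findall(log_text)
    ]


if __name__ == "__main__":
    for choice in parse_cooked_choices(SAMPLE):
        print(choice.rank, choice.word, choice.rating, choice.certainty)
```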

Re: [tesseract-ocr] Problem with using two trained.data files in combination for a better result.

2018-08-10 Thread damon
Hi Shree, thanks for your patience and help!

I have managed to produce the tesseract.log file with your help. Now I'm
trying to understand it a bit more. Here is a quick snippet of the output I
want to show you:
Rejecter: 5 [35 ]0 3 [33 ]0 . [2e ]p  (word=n, case=y, unambig=y, multiple=y)
Best choice: accepted=0, adaptable=0, done=0 : Lang result : 53. : R=54.2836, C=-5.08463, F=1.5, Perm=2, xht=[0,256], ambig=0
pos NORM NORM NORM
str 5 3 .
state: 1 1 1 
C -5.085 -3.497 -1.978
1 new words worse than 1 old words: r: 54.2836 v 1.81739 c: -5.08463 v -3.90478 valid dict: 0 v 0
Already done word with lang eng at:Bounding box=(499,2)->(514,1361)
Processing word with lang eng at:Bounding box=(672,1253)->(762,1288)
Trying word using lang eng, oem 1
Best choice: accepted=1, adaptable=0, done=1 : Lang result : Date : R=2.05422, C=-0.662761, F=1, Perm=8, xht=[0,3.40282e+038], ambig=0
pos NORM NORM NORM NORM
str D a t e
state: 1 1 1 1 
C -0.085 -0.095 -0.088 -0.085
1 new words better than 0 old words: r: 2.05422 v 0 c: -0.662761 v 0 valid dict: 1 v 0
Processing word with lang eng at:Bounding box=(521,1084)->(842,1156)
Trying word using lang eng, oem 1
Best choice: accepted=1, adaptable=0, done=1 : Lang result : May : R=1.64554, C=-0.733805, F=1, Perm=8, xht=[0,3.40282e+038], ambig=0
pos NORM NORM NORM
str M a y
state: 1 1 1 
C -0.092 -0.085 -0.105
Best choice: accepted=0, adaptable=0, done=1 : Lang result : 182.2. : R=4.51301, C=-4.37332, F=1, Perm=6, xht=[0,3.40282e+038], ambig=0
pos NORM NORM NORM NORM NORM NORM
str 1 8 2 . 2 .
state: 1 1 1 1 1 1 
C -0.116 -0.204 -0.176 -0.612 -0.210 -0.625
1 new words better than 0 old words: r: 1.64554 v 0 c: -0.733805 v 0 valid dict: 1 v 0
1 new words better than 0 old words: r: 4.51301 v 0 c: -4.37332 v 0 valid dict: 0 v 0
Trying word using lang fo, oem 0

As you can see on the very last line, it says "Trying word using lang fo".
I can see this line repeated about 5 times, so it seems that it does
sometimes use the fo dictionary. However, I wonder how it works. How does it
know when to use fo after looking at eng? Does it only look at fo when it
sees a box coordinate for a letter/word but is unable to find letters to
assign to it, and so uses the next dictionary? If so, how can it be that
entering "fo+eng" in the command instead of "eng+fo" makes no difference to
which dictionary is given priority in the search?
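
One way to see how often a second-language result wins or loses is to tally the "new words better/worse than old words" lines from the log. The sketch below only parses that line format as it appears in the snippet above. It makes no claim about how Tesseract decides internally, and the names `Comparison` and `parse_comparisons` are hypothetical. Matching on whitespace lets it tolerate lines that were wrapped by the mail client.

```python
import re
from typing import List, NamedTuple


class Comparison(NamedTuple):
    verdict: str          # "better" or "worse", as printed in the log
    new_rating: float     # first value after "r:"
    old_rating: float     # second value after "r:"
    new_certainty: float  # first value after "c:"
    old_certainty: float  # second value after "c:"


_COMPARISON = re.compile(
    r"\d+\s+new\s+words\s+(better|worse)\s+than\s+\d+\s+old\s+words:\s+"
    r"r:\s+(\S+)\s+v\s+(\S+)\s+c:\s+(\S+)\s+v\s+(\S+)"
)

# Two lines from the snippet above, wrapped the way they arrived by email.
SAMPLE = """\
1 new words worse than 1 old words: r: 54.2836 v 1.81739 c: -5.08463 v
-3.90478 valid dict: 0 v 0
1 new words better than 0 old words: r: 2.05422 v 0 c: -0.662761 v 0 valid
dict: 1 v 0
"""


def parse_comparisons(log_text: str) -> List[Comparison]:
    """Extract every new-versus-old comparison line from the log text."""
    return [
        Comparison(verdict, float(nr), float(old_r), float(nc), float(old_c))
        for verdict, nr, old_r, nc, old_c in _COMPARISON.findall(log_text)
    ]


if __name__ == "__main__":
    results = parse_comparisons(SAMPLE)
    worse = sum(1 for r in results if r.verdict == "worse")
    print(f"{worse} of {len(results)} comparisons were judged worse")
```

In the snippet, the one "worse" line pairs a new result of r=54.2836 with an old one of r=1.81739, and the surrounding log lines show fo being tried with oem 0 while eng was tried with oem 1.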

Re: [tesseract-ocr] Problem with using two trained.data files in combination for a better result.

2018-08-10 Thread Damon Kwong
Hi Shree, I've tried to run my commands again with logfile as the last
argument. The logfile config has been changed to:
debug_file tesseract.log
multilang_debug_level 3
stopper_debug_level 3
When I entered the command with logfile at the end, it gave this output in
cmd: http://puu.sh/BbTla/a34624a9a4.png

The problem is that the files do exist, because when I ran the command
again without logfile the files were produced... very confusing. Any idea
why it can't find the files? As you can see, the directories are in speech
marks too.
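
For reference, here is a minimal sketch of the setup described above: the three debug settings as config-file text, and a command line with the config name as the last argument. It assumes a config file named `logfile` sits in the tessdata `configs` directory, as in this thread. The helper names and the `scan.jpg`/`scan` arguments are hypothetical. Building the command as an argument list avoids manual quoting of paths, which is one thing worth ruling out; whether quoting is related to the error in the screenshot is not known.

```python
from typing import Dict, List

# The three settings quoted in the message above.
DEBUG_SETTINGS: Dict[str, str] = {
    "debug_file": "tesseract.log",
    "multilang_debug_level": "3",
    "stopper_debug_level": "3",
}


def logfile_config_text(settings: Dict[str, str] = DEBUG_SETTINGS) -> str:
    """Render the settings in 'name value' form, one per line."""
    return "".join(f"{name} {value}\n" for name, value in settings.items())


def build_command(image: str, output_base: str, langs: str = "fo+eng",
                  config_name: str = "logfile") -> List[str]:
    """Build the tesseract argument list with the debug config placed last."""
    return ["tesseract", image, output_base, "-l", langs, "txt", "pdf", config_name]


if __name__ == "__main__":
    print(logfile_config_text(), end="")
    # "scan.jpg" and "scan" are placeholder names for illustration only.
    print(build_command("scan.jpg", "scan"))
```

A list like this can be handed to a process launcher directly, so paths containing spaces (such as `C:\Program Files (x86)\...`) do not need extra quote characters.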

On 9 August 2018 at 11:55, Damon Kwong 
wrote:

> Ahh i see, i will report back once i have the output file if i can't
> figure out the reason why. You've been very helpful, thanks again :)

Re: [tesseract-ocr] Problem with using two trained.data files in combination for a better result.

2018-08-09 Thread Shree Devi Kumar
The output tesseract.log file should be produced in the directory from which
you are running the command, usually where your OCR output is created.
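
Put another way, a relative `debug_file` name ends up relative to the working directory of the process that launches tesseract, not to the `configs` directory. The sketch below just spells out that path logic. It assumes the behaviour described in this reply, and `expected_log_path` is a hypothetical helper name.

```python
from pathlib import Path


def expected_log_path(working_dir: Path, debug_file: str = "tesseract.log") -> Path:
    """Return where the debug log should appear for a given working directory.

    Assumes a relative debug_file is resolved against the working directory
    of the process that runs tesseract; an absolute name is used as-is.
    """
    candidate = Path(debug_file)
    return candidate if candidate.is_absolute() else working_dir / candidate


if __name__ == "__main__":
    # When run from a script, the working directory is that of the script's process.
    print(expected_log_path(Path.cwd()))
```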

-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to tesseract-ocr+unsubscr...@googlegroups.com.
To post to this group, send email to tesseract-ocr@googlegroups.com.
Visit this group at https://groups.google.com/group/tesseract-ocr.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/tesseract-ocr/CAG2NduWK2gdGYGq_BX21YAAo5tuAFcs_eFkaLho9Hz0T4OegpQ%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.


Re: [tesseract-ocr] Problem with using two trained.data files in combination for a better result.

2018-08-09 Thread damon
Hello Shree, thank you for your prompt reply.

I have now changed the logfile as instructed. Where can I find the output
tesseract.log file? Will it be produced in the same location as the
logfile, in C:\Program Files (x86)\Tesseract-OCR\tessdata\configs? I'm
guessing the tesseract.log file will be produced once I've used logfile in
the commands.

Kind Regards,

Damon



Re: [tesseract-ocr] Problem with using two trained.data files in combination for a better result.

2018-08-08 Thread Shree Devi Kumar
I think this could be because your new traineddata is not trained to as high
an accuracy level as the eng traineddata.

You can set up a debug log to verify this. See
https://github.com/tesseract-ocr/tesseract/issues/1275#issuecomment-360367865
for details.

On Wed, Aug 8, 2018 at 6:04 PM  wrote:

> I'm trying to use two traineddata dictionaries in combination because one
> of them recognises specific numbers better than the other.
>
> Here is an example of the code line.
>
>  $codeLine .= 'magick convert "'.$filePath.'" -quality 90 -density 300x300  -units PixelsPerInch "'.$output.'.jpg"'; //
>  $codeLine .= 'tesseract "'.$output.'.jpg" "'.$output.'" -l fo+eng txt pdf';
>
> Despite the fact that I put "fo" in front (this is the one that recognises
> the number 4 better), it still gives me an output text file that is
> identical to the "eng" dictionary output when I run that on its own.
>
> For some reason, it not only prioritises eng but also ignores the fo
> traineddata file completely.
>
> The "fo" file definitely works, as I've tested it on its own.
>
> I have attached an image example of the text I'd like to OCR and the two
> relevant traineddata files.


-- 


भजन - कीर्तन - आरती @ http://bhajans.ramparivar.com
