Thanks for your reply. I tried the example with different values for language_model_penalty_non_freq_dict_word and language_model_penalty_non_dict_word, varying them from the defaults of 0.1 and 0.15 respectively, down to 0.01 and 0.015, and even up to 0.9 and 0.95. The result was always the same, i.e. most words in the different languages were recognised.

BTW, I was wondering how non-English words are recognised, as it appears that I only have the English dictionary: I used the default installer and did not download any other dictionaries. I guess I have a misconception of how Tesseract works; could you point me to any documentation that explains the overall strategy that is employed?

Cheers,
Uwe
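P.S. For anyone else who runs into the line-termination problem quoted below, here is a minimal Python sketch of the workaround: rewrite a config or word-list file so that it uses Unix (LF) line terminations only. The function name to_unix_line_endings is my own illustration and not part of Tesseract, and I am assuming the files are small enough to be read into memory in one go.

```python
from pathlib import Path


def to_unix_line_endings(path):
    """Rewrite the file at `path` so that every line ends in LF only.

    Hypothetical helper, not part of Tesseract. Returns True if the
    file was changed, False if it already used LF terminations.
    """
    p = Path(path)
    data = p.read_bytes()
    # Convert DOS-style CRLF first, then any stray lone CR.
    fixed = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    if fixed != data:
        p.write_bytes(fixed)
    return fixed != data
```

Running it on a file such as C:\Program Files (x86)\Tesseract-OCR\tessdata/eng.user-words should have the same effect as saving the file with Unix line terminations in Notepad2; the exact location depends on where the installer put the tessdata directory.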
On Friday, April 12, 2013 6:14:30 PM UTC+9:30, zdenop wrote:
>
> On Fri, Apr 12, 2013 at 3:10 AM, <[email protected]> wrote:
>
>> Note that there still appears to be a problem with the bazaar example:
>>
>> Even though the normal dictionary is supposed to be suppressed and the
>> user wordlist used instead, the whole text in eurotext.tif is still
>> returned, including words that are not contained in the user wordlist - it
>> appears that disabling of the default language resources in bazaar did not
>> work.
>>
>> Would it be possible for you to run the example in the manual on a MS
>> Windows machine and check if it works for you?
>>
>
> Does it mean you expect to get in the result only words from the (user)
> dictionary? I do not think this is possible. Anyway, you can increase the
> strength of dictionaries (see FAQ), but I was never successful in reaching
> 100% of the OCR result just by putting words into the dictionary and
> increasing its strength.
>
>> Thanks,
>> Uwe
>>
>> On Thursday, April 4, 2013 7:31:18 PM UTC+10:30, [email protected] wrote:
>>
>>> I found the problem:
>>>
>>> When I created the bazaar file etc., the editor appended the standard MS
>>> Windows (DOS) line terminations, i.e. <CR><LF>; whereas all other files
>>> that come with Tesseract have Unix style line terminations, i.e. <LF>.
>>> That's why they are ill-formatted in the standard Notepad.exe - the, much
>>> better, Notepad2 displays everything correctly and also allows saving the
>>> files with Unix line terminations.
>>>
>>> Doing this eliminated the problem. It appears that the code does not
>>> handle line terminations in a way that makes it platform independent -
>>> noting that the output files are also written with Unix line terminations
>>> in a DOS environment. So, it may have tried to open C:\Program Files
>>> (x86)\Tesseract-OCR\tessdata/eng.user-words<CR>, which obviously does
>>> not exist.
>>> *I wonder why this was not a problem for anyone else?*
>>>
>>> This can typically be overcome by opening text files explicitly as text
>>> files (which then recognises the different terminations on the different
>>> platforms) and using things like fgetl, which removes the line
>>> termination. Conversely, when such files are written, the \n is handled as
>>> expected by the platform.
>>>
>>> So, the files are now found; but the whole text in eurotext.tif is still
>>> returned - it appears that disabling of the default language resources in
>>> bazaar did not work. Could you please run the example in the manual on a
>>> MS Windows machine and check if it works for you?
>>>
>>> Thanks,
>>> Uwe
>>>
>>> On Friday, March 22, 2013 9:51:21 AM UTC+10:30, [email protected] wrote:
>>>
>>>> This is already set - looks like this was done by the installer.
>>>>
>>>> Uwe
>>>>
>>>> On Thursday, March 21, 2013 8:37:47 PM UTC+10:30, zdenop wrote:
>>>>
>>>>> Did you use the environment setting TESSDATA_PREFIX? If not, can you
>>>>> set it (to "C:\Program Files (x86)\Tesseract-OCR\")?
>>>>>
>>>>> Zdenko
>>>>>
>>>>> On Thu, Mar 21, 2013 at 2:08 AM, <[email protected]> wrote:
>>>>>
>>>>>> Thanks for the reply.
>>>>>> Yes, the file does exist, I can open it from my working directory
>>>>>> using fopen('C:\Program Files (x86)\Tesseract-OCR\tessdata/eng.user-words','rt')
>>>>>> and read the content using fgetl and the like.
>>>>>>
>>>>>> I also tried the -l eng and -l eng1 test and it behaved as you have
>>>>>> described.
>>>>>>
>>>>>> --
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To post to this group, send email to [email protected]
>>>>>> To unsubscribe from this group, send email to
>>>>>> tesseract-oc...@googlegroups.com
>>>>>> For more options, visit this group at
>>>>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>>>>
>>>>>> ---
>>>>>> You received this message because you are subscribed to the Google
>>>>>> Groups "tesseract-ocr" group.
>>>>>> To unsubscribe from this group and stop receiving emails from it,
>>>>>> send an email to tesseract-oc...@googlegroups.com.
>>>>>> For more options, visit
>>>>>> https://groups.google.com/groups/opt_out.

