On Fri, Apr 12, 2013 at 3:10 AM, <[email protected]> wrote:

> Note that there still appears to be a problem with the bazaar example:
>
> Even though the normal dictionary is supposed to be suppressed and the user
> wordlist used instead, the whole text in eurotext.tif is still returned,
> including words that are not contained in the user wordlist - it appears
> that disabling of the default language resources in bazaar did not work.
>
> Would it be possible for you to run the example in the manual on an MS
> Windows machine and check if it works for you?
>

Do you mean that you expect the result to contain only words from the (user)
dictionary? I do not think this is possible. You can increase the strength of
the dictionaries (see the FAQ), but I have never managed to reach a 100% OCR
result just by adding words to a dictionary and increasing its strength.

> Thanks,
> Uwe
>
> On Thursday, April 4, 2013 7:31:18 PM UTC+10:30, [email protected] wrote:
>
>> I found the problem:
>>
>> When I created the bazaar file etc., the editor appended the standard MS
>> Windows (DOS) line terminations, i.e. <CR><LF>, whereas all other files
>> that come with Tesseract have Unix-style line terminations, i.e. <LF>.
>> That's why they are ill-formatted in the standard Notepad.exe - the much
>> better Notepad2 displays everything correctly and also allows saving the
>> files with Unix line terminations.
>>
>> Doing this eliminated the problem. It appears that the code does not
>> handle line terminations in a way that makes it platform independent -
>> noting that the output files are also written with Unix line terminations
>> in a DOS environment. So, it may have tried to open C:\Program Files
>> (x86)\Tesseract-OCR\tessdata/eng.user-words<CR>, which obviously does
>> not exist. I wonder why this was not a problem for anyone else?
>>
>> This can typically be overcome by opening text files explicitly as text
>> files (which then recognises the different terminations on the different
>> platforms) and using things like fgetl, which removes the line
>> termination. Conversely, when such files are written, the \n is handled as
>> expected by the platform.
>>
>> So, the files are now found; but the whole text in eurotext.tif is still
>> returned - it appears that disabling of the default language resources in
>> bazaar did not work.
>> Could you please run the example in the manual on an MS Windows machine
>> and check if it works for you?
>>
>> Thanks,
>> Uwe
>>
>> On Friday, March 22, 2013 9:51:21 AM UTC+10:30, [email protected] wrote:
>>
>>> This is already set - looks like this was done by the installer.
>>>
>>> Uwe
>>>
>>> On Thursday, March 21, 2013 8:37:47 PM UTC+10:30, zdenop wrote:
>>>
>>>> Did you set the TESSDATA_PREFIX environment variable? If not, can you
>>>> set it (to "C:\Program Files (x86)\Tesseract-OCR\")?
>>>>
>>>> Zdenko
>>>>
>>>> On Thu, Mar 21, 2013 at 2:08 AM, <[email protected]> wrote:
>>>>
>>>>> Thanks for the reply.
>>>>>> Yes, the file does exist, I can open it from my working directory
>>>>>> using
>>>>>> fopen('C:\Program Files (x86)\Tesseract-OCR\tessdata/eng.user-words','rt')
>>>>>> and read the content using fgetl and the like.
>>>>>>
>>>>>> I also tried the -l eng and -l eng1 test and it behaved as you have
>>>>>> described.
>>>>>>
>>>>> --
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To post to this group, send email to [email protected]
>>>>> To unsubscribe from this group, send email to
>>>>> tesseract-oc...@googlegroups.com
>>>>> For more options, visit this group at
>>>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>>>
>>>>> ---
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to tesseract-oc...@googlegroups.com.
>>>>> For more options, visit https://groups.google.com/groups/opt_out.
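
For anyone hitting the same <CR><LF> issue described in the thread, here is a minimal sketch of the workaround (saving the config and word-list files with Unix line terminations) done from a script instead of an editor. It is not part of Tesseract; the function names normalize_line_endings and read_words are made up for illustration, and the file name used in the usage note is only an example.

```python
from pathlib import Path


def normalize_line_endings(path):
    """Rewrite the file with LF-only line endings.

    Returns True if the file was changed, False if it was already clean.
    Hypothetical helper, not a Tesseract API.
    """
    p = Path(path)
    data = p.read_bytes()
    fixed = data.replace(b"\r\n", b"\n")
    if fixed == data:
        return False
    p.write_bytes(fixed)
    return True


def read_words(path):
    """Read a word list in text mode, tolerating LF or CRLF terminators.

    Opening in text mode with universal newlines is the same idea as the
    fopen(..., 'rt') plus fgetl approach mentioned in the thread.
    """
    with open(path, "r", newline=None, encoding="utf-8") as fh:
        return [line.rstrip("\r\n") for line in fh if line.strip()]
```

Calling normalize_line_endings on each file you created by hand (for example the bazaar config and eng.user-words) before running tesseract should have the same effect as re-saving them with Unix line terminations in Notepad2.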

