Note that there still appears to be a problem with the bazaar example: even though the normal dictionary is supposed to be suppressed and the user wordlist used instead, the whole text in eurotext.tif is still returned, including words that are not contained in the user wordlist. It appears that disabling the default language resources in bazaar did not work.
Would it be possible for you to run the example in the manual on an MS Windows machine and check whether it works for you?

Thanks,
Uwe

On Thursday, April 4, 2013 7:31:18 PM UTC+10:30, [email protected] wrote:
> I found the problem:
>
> When I created the bazaar file etc., the editor appended the standard MS
> Windows (DOS) line terminations, i.e. <CR><LF>, whereas all other files
> that come with Tesseract have Unix-style line terminations, i.e. <LF>.
> That's why they are ill-formatted in the standard Notepad.exe - the much
> better Notepad2 displays everything correctly and also allows saving the
> files with Unix line terminations.
>
> Doing this eliminated the problem. It appears that the code does not
> handle line terminations in a platform-independent way - noting that
> the output files are also written with Unix line terminations in a DOS
> environment. So, it may have tried to open C:\Program Files
> (x86)\Tesseract-OCR\tessdata/eng.user-words<CR>, which obviously does not
> exist. *I wonder why this was not a problem for anyone else?*
>
> This can typically be overcome by opening text files explicitly as text
> files (which then recognises the different terminations on the different
> platforms) and using things like fgetl, which removes the line
> termination. Conversely, when such files are written, the \n is handled as
> expected by the platform.
>
> So, the files are now found; but the whole text in eurotext.tif is still
> returned - it appears that disabling of the default language resources in
> bazaar did not work.
> Could you please run the example in the manual on an MS Windows machine and
> check if it works for you?
>
> Thanks,
> Uwe
>
> On Friday, March 22, 2013 9:51:21 AM UTC+10:30, [email protected] wrote:
>
>> This is already set - looks like this was done by the installer.
>>
>> Uwe
>>
>> On Thursday, March 21, 2013 8:37:47 PM UTC+10:30, zdenop wrote:
>>
>>> Did you use the environment setting TESSDATA_PREFIX? If not, can you set it
>>> (to "C:\Program Files (x86)\Tesseract-OCR\")?
>>>
>>> Zdenko
>>>
>>> On Thu, Mar 21, 2013 at 2:08 AM, <[email protected]> wrote:
>>>
>>>> Thanks for the reply.
>>>>> Yes, the file does exist, I can open it from my working directory
>>>>> using fopen('C:\Program Files (x86)\Tesseract-OCR\tessdata/eng.user-words','rt')
>>>>> and read the content using fgetl and the like.
>>>>>
>>>>> I also tried the -l eng and -l eng1 test and it behaved as you have
>>>>> described.
>>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To post to this group, send email to [email protected]
>>>> To unsubscribe from this group, send email to
>>>> [email protected]
>>>> For more options, visit this group at
>>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>>
>>>> ---
>>>> You received this message because you are subscribed to the Google
>>>> Groups "tesseract-ocr" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> For more options, visit https://groups.google.com/groups/opt_out.
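For anyone who hits the same <CR><LF> issue without Notepad2 at hand, here is a minimal Python sketch of the workaround described above: rewrite a config or wordlist file with Unix line terminations, and read lines in text mode so the terminator is stripped. The function names are made up for illustration and are not part of Tesseract; the sketch only assumes the files are plain text.

```python
from pathlib import Path


def to_unix_line_endings(data: bytes) -> bytes:
    """Return data with DOS (<CR><LF>) and bare <CR> terminators replaced by <LF>."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def convert_file(path: Path) -> bool:
    """Rewrite the file in place with Unix terminators; return True if it changed."""
    original = path.read_bytes()
    converted = to_unix_line_endings(original)
    if converted != original:
        path.write_bytes(converted)
        return True
    return False


def read_lines(path: Path) -> list[str]:
    """Read a text file with universal newlines, dropping the terminator (like fgetl)."""
    # newline=None makes Python translate \r\n and \r to \n while reading.
    with open(path, "r", newline=None) as handle:
        return [line.rstrip("\n") for line in handle]
```

Calling convert_file on the bazaar config file and on the user wordlist should have the same effect as re-saving them with Unix line terminations in an editor.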

