Re: Problem re-creating user-words example in tesseract1 doc

zdenko podobny Fri, 12 Apr 2013 01:45:04 -0700

On Fri, Apr 12, 2013 at 3:10 AM, <[email protected]> wrote:

> Note that there still appears to be a problem with the bazaar example:
>
> Even though the normal dictionary is supposed to be suppressed and the user
> wordlist used instead, the whole text in eurotext.tif is still returned,
> including words that are not contained in the user wordlist - it appears
> that disabling of the default language resources in bazaar did not work.
>
> Would it be possible for you to run the example in the manual on an MS
> Windows machine and check if it works for you?
>


Do you mean that you expect the result to contain only words from the (user)
dictionary? I do not think this is possible. You can increase the strength of
the dictionaries (see the FAQ), but I have never succeeded in reaching a 100%
OCR result just by adding words to the dictionary and increasing its strength.



> Thanks,
>    Uwe
>
>
>
> On Thursday, April 4, 2013 7:31:18 PM UTC+10:30, [email protected] wrote:
>
>> I found the problem:
>>
>> When I created the bazaar file etc., the editor appended the standard MS
>> Windows (DOS) line terminations, i.e. <CR><LF>, whereas all other files
>> that come with Tesseract have Unix-style line terminations, i.e. <LF>.
>> That's why they are ill-formatted in the standard Notepad.exe; the much
>> better Notepad2 displays everything correctly and also allows saving the
>> files with Unix line terminations.
>>
>> Doing this eliminated the problem.  It appears that the code does not
>> handle line terminations in a way that makes it platform independent -
>> noting that the output files are also written with Unix line terminations
>> in a DOS environment.  So, it may have tried to open C:\Program Files
>> (x86)\Tesseract-OCR\tessdata/eng.user-words<CR>, which obviously does
>> not exist.  I wonder why this was not a problem for anyone else?
>>
>> This can typically be overcome by opening text files explicitly as text
>> files (which then recognises the different terminations at the different
>> platforms) and using things like fgetl, which removes the line
>> termination.  Conversely, when such files are written, the \n is handled as
>> expected by the platform.
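
The line-termination problem described above can be illustrated with a small
Python sketch. This is not Tesseract's actual parsing code. It only assumes
that a config file holds one "name value" pair per line; the helper name
read_config_values and the sample contents (modelled on the manual's bazaar
example) are made up for illustration. A parser that splits on <LF> alone
leaves the <CR> attached to the last value on each line, while stripping both
terminators gives the same result for Unix and DOS files.

```python
import os
import tempfile

# Hypothetical sample: two config lines saved with DOS (CRLF) terminations.
CRLF_CONFIG = b"load_system_dawg F\r\nuser_words_suffix user-words\r\n"

# Splitting on LF only keeps the CR as part of the value, so a file name
# built from it would end in "user-words<CR>" and could not be found.
naive_suffix = CRLF_CONFIG.split(b"\n")[1].split(b" ", 1)[1]


def read_config_values(path):
    """Return {name: value} for each "name value" line in the file.

    The file is read in binary mode and the terminator is removed
    explicitly, so LF and CRLF files give identical results.
    """
    values = {}
    with open(path, "rb") as handle:
        for raw_line in handle:
            # Drop the line terminator, whether it is LF or CRLF.
            line = raw_line.rstrip(b"\r\n").decode("utf-8")
            parts = line.split(None, 1)
            if len(parts) == 2:
                values[parts[0]] = parts[1].strip()
    return values


if __name__ == "__main__":
    print(repr(naive_suffix))  # b'user-words\r'

    # Write the CRLF sample to a temporary file and read it back.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(CRLF_CONFIG)
    try:
        print(read_config_values(tmp.name))
    finally:
        os.remove(tmp.name)
```

Reading in binary mode and stripping the terminator by hand is a deliberate
choice for the sketch: it behaves the same on every platform, whereas relying
on text mode depends on how the runtime translates line endings.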
>>
>> So, the files are now found; but the whole text in eurotext.tif is still
>> returned - it appears that disabling of the default language resources in
>> bazaar did not work.
>> Could you please run the example in the manual on an MS Windows machine
>> and check if it works for you.
>>
>> Thanks,
>>    Uwe
>>
>> On Friday, March 22, 2013 9:51:21 AM UTC+10:30, [email protected] wrote:
>>
>>> This is already set - looks like this was done by the installer.
>>>
>>>
>>>    Uwe
>>>
>>>
>>> On Thursday, March 21, 2013 8:37:47 PM UTC+10:30, zdenop wrote:
>>>
>>>> Did you use the environment setting TESSDATA_PREFIX? If not, can you set
>>>> it (to "C:\Program Files (x86)\Tesseract-OCR\")?
>>>>
>>>> Zdenko
>>>>
>>>>
>>>> On Thu, Mar 21, 2013 at 2:08 AM, <[email protected]> wrote:
>>>>
>>>>>
>>>>> Thanks for the reply.
>>>>>> Yes, the file does exist, I can open it from my working directory
>>>>>> using fopen('C:\Program Files (x86)\Tesseract-OCR\tessdata/eng.user-words','rt')
>>>>>> and read the content using fgetl and the like.
>>>>>>
>>>>>> I also tried the -l eng and -l eng1 test and it behaved as you have
>>>>>> described.
>>>>>>
>>>>>  --
>>>>> --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To post to this group, send email to [email protected]
>>>>> To unsubscribe from this group, send email to
>>>>> tesseract-oc...@googlegroups.com
>>>>> For more options, visit this group at
>>>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>>>
>>>>> ---
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "tesseract-ocr" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to tesseract-oc...@googlegroups.com.
>>>>> For more options, visit https://groups.google.com/groups/opt_out.
>>>>>
>>>>>
>>>>>
>>>>

