After simply rescaling the image to 300 DPI, I got a nearly perfect
result. It is interesting to note that with English data, "find" was
misclassified as "find" -- the dictionary could not get it right.
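
For anyone who wants to script the rescaling step, here is a minimal sketch in Python. It assumes the Pillow library is installed and that you already know the resolution the image was scanned at. The function names, the file paths and the choice of Lanczos resampling are illustrative assumptions, not the tool that was actually used for the result above.

```python
def scaled_size(width, height, source_dpi, target_dpi=300):
    """Return pixel dimensions that give target_dpi at the same physical size."""
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    factor = target_dpi / source_dpi
    return max(1, round(width * factor)), max(1, round(height * factor))


def rescale_to_dpi(src_path, dst_path, source_dpi, target_dpi=300):
    """Resample an image file and tag the output with the new resolution.

    Hypothetical helper: requires the third-party Pillow package.
    """
    from PIL import Image  # imported here so scaled_size works without Pillow

    with Image.open(src_path) as img:
        new_size = scaled_size(img.width, img.height, source_dpi, target_dpi)
        resized = img.resize(new_size, Image.LANCZOS)
        # The dpi argument writes the resolution tags into the output file.
        resized.save(dst_path, dpi=(target_dpi, target_dpi))
```

For example, rescale_to_dpi("scan.tif", "scan300.tif", source_dpi=100) would triple the pixel dimensions of a 100 DPI scan; both file names are placeholders.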

The Windows Search Engine
The search engine in Windows XP will automatically OCR a TIFF image,
allowing the user to find a document based on words in the document;
however, this process is repeated every time a search is performed. This
can take hours or even days, depending on the volume of documents that
have to be OCR'd. Once the document is found, it has to be OCR'd again
in order to use the find feature. If the file was created as a Text
Searchable TIFF image to begin with, the search engine does not OCR it
again; it uses the text contained in the file, allowing for quicker
retrieval and for the document itself to be searched.

On Mar 29, 10:15 am, "Robert P. J. Day" <[email protected]> wrote:
>   hoping this query isn't wildly off-topic but i have an 8-bit B/W tif
> file (attached), which gthumb shows me to be eminently readable on my
> ubuntu system.  at this point, i'd like to use one of ubuntu's
> libtiff-tools utilities to convert it to the ideal 2-level B/W tif
> file that tesseract prefers, and i was playing with the "tiffdither"
> utility, but no matter what threshold i use, i can't get a clean B/W
> representation of what appears to be a perfectly legible 8-bit file.
>
>   perhaps i just don't understand what the threshold means WRT to
> dithering.  what would the proper step be to transform the attached
> file into the obvious single-bit tif equivalent?  thanks.
>
> rday
>
> --
>
> ========================================================================
> Robert P. J. Day                               Waterloo, Ontario, CANADA
>                        http://crashcourse.ca
>
> Twitter:                                      http://twitter.com/rpjday
> LinkedIn:                              http://ca.linkedin.com/in/rpjday
> ========================================================================
>
> Attachment: tessbw.tif (146K)
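
On the threshold question in the quoted message: dithering spreads the quantization error across neighbouring pixels, which tends to speckle text, whereas a plain global threshold maps each pixel to black or white on its own. The sketch below shows the plain-threshold variant in Python. It assumes Pillow is installed; the function names and the cutoff of 128 are illustrative guesses that would need tuning for a particular scan, and this is not what tiffdither does internally.

```python
def threshold_pixels(pixels, threshold=128):
    """Map 8-bit gray values to 0 (black) or 255 (white) with a global cutoff."""
    return [255 if p >= threshold else 0 for p in pixels]


def to_bilevel_tiff(src_path, dst_path, threshold=128):
    """Write a 1-bit image using a fixed threshold instead of dithering.

    Hypothetical helper: requires the third-party Pillow package.
    """
    from PIL import Image  # imported here so threshold_pixels works without Pillow

    with Image.open(src_path) as img:
        gray = img.convert("L")
        # point() with mode="1" converts an 8-bit image straight to 1-bit
        # through the lookup function, so no dithering is applied.
        bilevel = gray.point(lambda p: 255 if p >= threshold else 0, mode="1")
        bilevel.save(dst_path)
```

If the text comes out too thin or too heavy, adjusting the threshold argument up or down is the first thing to try.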

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/tesseract-ocr?hl=en.
