Re: how to "dither" this file properly for tesseract

Quan Nguyen Tue, 29 Mar 2011 17:59:24 -0700

After simply rescaling the image to 300 DPI, I got a nearly perfect
result. It is interesting to note that with the English data, "find" was
recognized as "ﬁnd" (with the "fi" ligature); the dictionary could not
correct it. The OCR output follows.
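
For anyone who wants to reproduce the rescaling step, the arithmetic can be sketched as below. This is only an illustration under assumptions: the 72 DPI source resolution and the 612x792 pixel size are made-up example values, not measurements of the attached file, and `target_size` is a hypothetical helper name. The resampling itself would still be done with an image tool of your choice.

```python
def target_size(width_px, height_px, source_dpi, target_dpi=300):
    """Return the pixel dimensions that keep the physical page size
    unchanged when an image is resampled from source_dpi to target_dpi."""
    if source_dpi <= 0 or target_dpi <= 0:
        raise ValueError("DPI values must be positive")
    new_width = round(width_px * target_dpi / source_dpi)
    new_height = round(height_px * target_dpi / source_dpi)
    return new_width, new_height


# Assumed example: a 612x792 pixel scan at 72 DPI (an 8.5in x 11in page).
print(target_size(612, 792, 72))  # (2550, 3300)
```

The resampled file should also carry the new resolution in its metadata, since some tools read the DPI from there rather than inferring it from the pixel size.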


The Windows Search Engine
The search engine in Windows XP will automatically OCR a tiff image
allowing the user to ﬁnd a document based on words in the document;
however, this process is done every time a search is performed. This
can take hours or even days based on the volume of documents that
have to be OCR'd. Once the document is found it has to be OCR'd again
in order to use the ﬁnd feature. If the ﬁle was created as a Text
Searchable ‘ﬁff image to begin with the search engine does not OCR it
again; it uses the text that is contained in the ﬁle allowing for
quicker retrieval and the document itself to be searched.
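
If the ligatures in the output above get in the way of plain-text searching, one possible post-processing step is Unicode compatibility normalization. The sketch below uses Python's standard `unicodedata` module; `fold_ligatures` is a hypothetical helper name, and this is a suggestion rather than anything Tesseract does itself. Note that NFKC also rewrites other compatibility characters (for example, fullwidth letters), so it may change more than just ligatures.

```python
import unicodedata


def fold_ligatures(text):
    """Replace typographic ligatures such as U+FB01 (the "fi" ligature)
    with plain letters using Unicode compatibility normalization."""
    return unicodedata.normalize("NFKC", text)


print(fold_ligatures("allowing the user to \ufb01nd a document"))
# allowing the user to find a document
```

This only folds characters that were recognized as ligatures. A genuine misrecognition, such as the garbled "tiff" in the output above, is not repaired by it.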

On Mar 29, 10:15 am, "Robert P. J. Day" <[email protected]> wrote:
>   hoping this query isn't wildly off-topic but i have an 8-bit B/W tif
> file (attached), which gthumb shows me to be eminently readable on my
> ubuntu system.  at this point, i'd like to use one of ubuntu's
> libtiff-tools utilities to convert it to the ideal 2-level B/W tif
> file that tesseract prefers, and i was playing with the "tiffdither"
> utility, but no matter what threshold i use, i can't get a clean B/W
> representation of what appears to be a perfectly legible 8-bit file.
>
>   perhaps i just don't understand what the threshold means WRT to
> dithering.  what would the proper step be to transform the attached
> file into the obvious single-bit tif equivalent?  thanks.
>
> rday
>
> --
>
> ========================================================================
> Robert P. J. Day                               Waterloo, Ontario, CANADA
>                        http://crashcourse.ca
>
> Twitter:                                      http://twitter.com/rpjday
> LinkedIn:                              http://ca.linkedin.com/in/rpjday
> ========================================================================
>
>  tessbw.tif (146K)
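
On the threshold question in the quoted message: as far as I know, tiffdither performs error-diffusion dithering, which approximates grey levels with patterns of black and white dots, so it is not expected to produce clean letter shapes whatever threshold is chosen. A plain global threshold is a different technique and is usually closer to what is wanted for text. The sketch below shows that idea on a list of 8-bit grey values; `binarize` is a hypothetical name, the sample values are made up, and the cutoff of 128 is an assumption that would need tuning for a real scan.

```python
def binarize(rows, threshold=128):
    """Map 8-bit grey values (0-255) to two levels: 0 (black) for
    values below the threshold, 255 (white) otherwise."""
    return [[0 if value < threshold else 255 for value in row] for row in rows]


# Made-up 2x4 sample: dark text pixels on a light background.
sample = [
    [250, 30, 40, 245],
    [240, 120, 200, 10],
]
for row in binarize(sample):
    print(row)
```

Each pixel is decided on its own, with no error carried over to its neighbours, which is why the result has solid strokes instead of a dot pattern.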

-- 
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
