After simply rescaling the image to 300 DPI, I got a nearly perfect result. It is interesting to note that with English data, "find" was misclassified as "find" -- the dictionary could not get it right.
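For illustration, here is a minimal pure-Python sketch of the two steps discussed in this thread: scaling a scan up to 300 DPI and reducing an 8-bit grayscale image to two levels with a fixed threshold. It is not the procedure actually used above. The function names, the 150 DPI source resolution, the threshold of 128, and the nearest-neighbour resampling are all assumptions made for the example; a real workflow would more likely use an imaging tool or library.

```python
def scale_factor(source_dpi, target_dpi=300):
    """Return the resize factor that brings source_dpi up (or down) to target_dpi."""
    if source_dpi <= 0:
        raise ValueError("source_dpi must be positive")
    return target_dpi / source_dpi


def resize_nearest(pixels, factor):
    """Resize a grayscale image (list of rows of 0-255 ints) by nearest neighbour.

    Nearest neighbour is an assumption chosen for brevity; it is the crudest
    resampling method and not necessarily what any given tool uses.
    """
    src_h, src_w = len(pixels), len(pixels[0])
    dst_h = max(1, round(src_h * factor))
    dst_w = max(1, round(src_w * factor))
    return [
        [
            pixels[min(src_h - 1, int(y / factor))][min(src_w - 1, int(x / factor))]
            for x in range(dst_w)
        ]
        for y in range(dst_h)
    ]


def binarize(pixels, threshold=128):
    """Fixed-threshold binarization: 1 (white) if value >= threshold, else 0 (black)."""
    return [[1 if value >= threshold else 0 for value in row] for row in pixels]


if __name__ == "__main__":
    # Tiny made-up 2x3 grayscale "scan", assumed to be at 150 DPI.
    gray = [[250, 240, 30],
            [245, 20, 25]]
    enlarged = resize_nearest(gray, scale_factor(150))
    for row in binarize(enlarged):
        # Show black pixels as '#' and white pixels as '.'.
        print("".join("#" if v == 0 else "." for v in row))
```

A fixed threshold makes an independent decision for each pixel. Dithering instead spreads the quantization error to neighbouring pixels to preserve average brightness, which tends to produce scattered dots rather than clean glyph edges. That may be why adjusting a dithering tool's threshold does not yield a clean two-level image.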
The Windows Search Engine

The search engine in Windows XP will automatically OCR a TIFF image, allowing the user to find a document based on the words it contains; however, this OCR pass is repeated every time a search is performed. That can take hours or even days, depending on the volume of documents that have to be OCR'd. Once the document is found, it has to be OCR'd again before the find feature can be used within it. If the file was created as a text-searchable TIFF image to begin with, the search engine does not OCR it again; it uses the text already contained in the file, which allows quicker retrieval and lets the document itself be searched.

On Mar 29, 10:15 am, "Robert P. J. Day" <[email protected]> wrote:
> hoping this query isn't wildly off-topic but i have an 8-bit B/W tif
> file (attached), which gthumb shows me to be eminently readable on my
> ubuntu system. at this point, i'd like to use one of ubuntu's
> libtiff-tools utilities to convert it to the ideal 2-level B/W tif
> file that tesseract prefers, and i was playing with the "tiffdither"
> utility, but no matter what threshold i use, i can't get a clean B/W
> representation of what appears to be a perfectly legible 8-bit file.
>
> perhaps i just don't understand what the threshold means WRT to
> dithering. what would the proper step be to transform the attached
> file into the obvious single-bit tif equivalent? thanks.
>
> rday
>
> --
>
> ========================================================================
> Robert P. J. Day                               Waterloo, Ontario, CANADA
>                         http://crashcourse.ca
>
> Twitter:                                       http://twitter.com/rpjday
> LinkedIn:                               http://ca.linkedin.com/in/rpjday
> ========================================================================
>
> [attachment: tessbw.tif, 146K]

