Hi Jon,

I tried your images with VietOCR, which makes images more amenable to the Tesseract engine, and it produced fairly accurate results. I think the results could have been better if the PDF had been converted with -density 300 instead of -density 150.
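A minimal sketch of the higher-density conversion, assuming the same input file (pic32.PDF) and the same flags as the original command, with only -density changed to 300. The helper names render_cmd and ocr_cmd are hypothetical, and the script only prints the commands so they can be reviewed before anything is run. Whether 300 dpi also avoids the crash is not confirmed here; it is only expected to help recognition accuracy.

```shell
#!/bin/sh
# Sketch only: print the commands for a 300 dpi conversion and an OCR pass.
# All flags except -density are copied from the original convert command.

# Print the ImageMagick command for a given PDF ($1) and density in dpi ($2).
render_cmd() {
    printf 'convert -density %s -depth 8 -colorspace gray -verbose %s p%%02d.tif\n' "$2" "$1"
}

# Print the Tesseract command for one page image ($1).
# The output base name is the image name without its .tif suffix.
ocr_cmd() {
    printf 'tesseract %s %s -l eng\n' "$1" "${1%.tif}"
}

render_cmd pic32.PDF 300
ocr_cmd p00.tif
```

To actually run a printed command, pipe it to the shell, for example `render_cmd pic32.PDF 300 | sh`, with ImageMagick and Ghostscript on the PATH.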
You can also open PDF files directly in VietOCR if Ghostscript is installed.

http://sf.net/projects/vietocr

Regards,
Quan

On Sep 10, 1:21 pm, Jon wrote:
> I just installed the 3.0.1 version of tesseract (used the Windows
> installer for 3.0 and then added the zipped 3.0.1 to the directory.)
> Only the English training file is present, for now. I then tested
> tesseract using the phototest.tif file in the doc subdir and it worked
> just fine. (Admin privileges were set.)
>
> (I'm running on Windows 7 Professional, 64-bit, on a Lenovo T510
> laptop.)
>
> I also installed ImageMagick 6.7.2-Q16 using their installer. I then
> converted a PDF article into eight .tif page files using it. All that
> worked okay and the images look correct to me. To do that, I used the
> following command:
>
> convert -density 150 -depth 8 -colorspace gray -verbose pic32.PDF p%02d.tif
>
> This produced the p00.tif to p07.tif files without exhibiting an error
> and, as I said, they appeared to display fine using Windows Live Photo
> Gallery, for example.
>
> However, tesseract 3.0.1 crashes (Windows wants to look up possible
> solutions before killing the program) on any or all of these .tif
> files that were produced. I have placed the first two files at my web
> site at:
>
> http://www.infinitefactors.org/misc/images/tesseract/p00.tif
> http://www.infinitefactors.org/misc/images/tesseract/p01.tif
>
> (These files are each about 4 megabytes in size. The directory listing
> is disabled and only the two listed above are world readable, in a
> modest attempt to protect the copyright holder and focus on this
> problem I'm having.)
>
> I'm not sure if I need to change the ImageMagick conversion settings,
> as all of this is pretty new to me. (First time out.) It's possible
> that if I convert the PDF using different settings more to the liking
> of tesseract that I'd have better results. I will attempt a few
> changes on my own, mostly at random because of my profound ignorance,
> but I'm looking for helpful thoughts in the meantime.
>
> It's my hope to eventually learn how to convert PDF files that are
> huge scans of old documents I have from the large PDF file format into
> more compressed versions where the text is converted well and the PDF
> is much shorter and searchable, as well. But that's long term. For
> now, I'd just like to figure out how to make these tif pages work.
>
> Thanks in advance. And I apologize for my ignorance.
>
> Jon

