720 dpi seems high. Is that the native scan resolution? I'd use the native resolution unless it's less than 200 dpi or more than 400 dpi. Similarly, why are you rendering to tiffgray when the input looks like it's bitonal? tesseract is just going to have to threshold back to bitonal again, resulting in two conversions where none are needed.
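As a rough sketch of the suggestion above (not tested against 2000.pdf): the helper names `pick_dpi` and `gs_command` are hypothetical, the native resolution has to be supplied by hand, and clamping to the nearest bound when it falls outside 200-400 dpi is an assumption, since the advice only names the range. The script prints a Ghostscript command that renders straight to a bitonal Group 4 TIFF (`tiffg4`) rather than `tiffgray`. It uses `gs` as the executable name, which may be `ghostscript` on some systems, and it drops `-g` because the pixel dimensions depend on the chosen resolution.

```shell
# Hypothetical helper: keep the native scan resolution when it lies in the
# 200-400 dpi range; otherwise clamp to the nearest bound (an assumption).
pick_dpi() {
    native=$1
    if [ "$native" -lt 200 ]; then
        echo 200
    elif [ "$native" -gt 400 ]; then
        echo 400
    else
        echo "$native"
    fi
}

# Build (but do not run) a Ghostscript command that renders directly to a
# bitonal, Group 4 compressed TIFF, so no grayscale round trip is involved.
# Arguments: native dpi, input PDF, output TIFF.
gs_command() {
    dpi=$(pick_dpi "$1")
    echo "gs -o $3 -sDEVICE=tiffg4 -r$dpi $2"
}

# Print the command for a scan whose native resolution is assumed to be 720 dpi.
gs_command 720 2000.pdf 2000_gs.tif
```

Printing the command instead of running it makes it easy to check the chosen resolution before committing to a long conversion; pipe the output to `sh` once it looks right.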
Don't have time to play with it myself, but perhaps you could outline the matrix of different conversions you've tried so far, to help folks see what's already been tried and eliminated as not helpful.

Tom

On Friday, January 22, 2016 at 5:14:48 AM UTC-5, Timo Grossenbacher wrote:
> Hey,
>
> Given the input file 2000.pdf, and the following code, ...
>
> # first, conversion to TIFF with ghostscript
> ghostscript -o 2000_gs.tif -sDEVICE=tiffgray -r720x720 -g6120x7920 -sCompression=lzw 2000.pdf
> # then, rotation with imagemagick
> convert 2000_gs.tif -rotate 89.4 -background white -alpha Off 2000_rotated.tif
> # then, OCR with tesseract, using suggested parameters
> tesseract 2000_rotated.tif 2000_readable_gs_custom -c load_system_dawg=0 -c load_freq_dawg=0 -c textord_tablefind_recognize_tables=1 -c textord_tabfind_find_tables=1 pdf
>
> ...the quality of the OCR is really poor - hardly 30% of the text is searchable in 2000_readable_gs_custom.pdf.
>
> I have uploaded all the files to https://www.sendspace.com/filegroup/dGA6ojm%2BQ4tZ6gdkyuSM0xSIUD8P2vbB
>
> When I OCR the same file with Adobe Acrobat Professional, I get almost 100% accuracy. Of course I'd like to do it rather with FOSS than with a commercial product, so do you have any hints on how I could mitigate those problems?
>
> Thanks a lot,
> Timo

