> I can't agree here, and probably only few would. OCR from a pixel
> height of 2 is crazy, and can never result in anything sane.
> Second argument against it: the dots are some tens of blanks away from
> any other character. There is no isolated character in any layout,
> that has a height of a few dots only and can usably be recognized.

I was not defending tesseract's result (I agree that this is a bug); I was
merely suggesting a possible cause for it. You must however realize that
tesseract (as per the explanation in the paper I cited) does not do layout
analysis. The model used is a single column of vertically spaced text, read
top to bottom and left to right, and therefore the input must conform to
this.

> I'd love to have a larger number of options to pass to tesseract; e.g.
> minimal height of character to be recognized, ASCII/UTF-8.

If you have some time for hacking, you're in luck :)

I was looking for a list of options a little while back when I was writing
tessboxes for another project (http://www.lbreyer.com/tessboxes.html), and I
grepped the source code of tesseract 2.03 for likely variables. I created a
small document, TESSERACT.OPTIONS.txt, with the list of variables, which you
can find in the files area here or in the doc directory of the tessboxes
tarball.

There is good news and bad news. The good news is that the options can be
written in an options file, so they can be changed without recompiling. The
bad news is that there are many (probably 200+) options without detailed
explanations (unless you read the code, of course). This is why you must
read Ray's paper if you want to begin to understand what the options do.

> Thanks again for the great explanations and your efforts!

Glad to help, good luck!

Laird.
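
To illustrate the options-file idea mentioned above, here is a minimal Python
sketch that writes and reads back such a file. It assumes the file is plain
text with one "name value" pair per line, and that blank lines and lines
starting with "#" can be ignored; both are assumptions to check against your
tesseract version. The option names are hypothetical placeholders, not real
tesseract variables; take real names from TESSERACT.OPTIONS.txt or from the
source. How the file is handed to tesseract depends on the version, so check
the usage message of your build.

```python
from pathlib import Path


def write_options_file(path, options):
    """Write one 'name value' pair per line (assumed file layout)."""
    lines = [f"{name} {value}" for name, value in options.items()]
    Path(path).write_text("\n".join(lines) + "\n")


def read_options_file(path):
    """Read 'name value' pairs back into a dict of strings."""
    options = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        # Assumption: blank lines and '#' lines carry no option.
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        name = parts[0]
        value = parts[1] if len(parts) > 1 else ""
        options[name] = value
    return options


if __name__ == "__main__":
    # Hypothetical option names, for illustration only.
    write_options_file("myoptions", {"example_int_option": 3,
                                     "example_string_option": "abc"})
    print(read_options_file("myoptions"))
```

Keeping the options in a small text file like this makes it easy to try
different settings without rebuilding anything.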

