On Saturday, April 23, 2016 at 9:02:24 AM UTC-7, zdenop wrote:
> Why analyze? Don't you know in advance if you are asking to OCR page or just paragraph, line or word???

No. My user is viewing an image of a large construction blueprint. They select "Copy Text" and draw a rectangle around the part of the image that contains text. My program needs to OCR any text in that sub-image and copy it to the clipboard. I have no idea whether they selected a character, a word, a single-line sentence or a multi-line sentence.

I was tracking down a non-fatal error message that was printed to the console when running Tesseract. I found that Tesseract was calling Leptonica to segment the page, and that Leptonica was emitting an error and returning failure because the image was below a certain height. It was not even trying to segment the image. The Leptonica developer made the arbitrary decision that it didn't make sense to segment the page because it was too small.

If Leptonica makes such judgements, then Tesseract has to deal with them intelligently. If Tesseract does not want to deal with it, then I must deal with it. If I refuse to deal with it, then I can ask my user to describe what they selected and make them deal with it. But if I asked my user whether they selected a single character, a single word, a single line of words or multiple lines of words, they would conclude that my software is a steaming pile of crap. So that leaves me to solve the problem.

In my opinion it is crazy for an OCR program to return "Empty Page!" when I feed it an image with "A2.12" on it, whether because the image is below a certain size, because it lacks white space, or because I told it to expect multiple lines of text with varying heights instead of a single word. It's returning "Empty Page!" without even trying to OCR the image!

The last 6 psm options form a nice hierarchy. If you don't think it makes sense to fall back to a more primitive setting when the advanced setting fails, then I will have to create a patched version which does that. It makes no sense for me to launch Tesseract two or three times to OCR "A2.12".

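The fallback described above could be sketched roughly as follows. This is only an illustration of the logic, not Tesseract's own behavior and not a patch to it. The function names `ocr_with_fallback` and `run_tesseract` are hypothetical, and the mode numbers 6, 7, 8 and 10 are assumed to mean single block, single line, single word and single character as in Tesseract's documented page segmentation modes. The recognizer is passed in as a callable, so it could just as well be an in-process API call that reuses one engine instead of launching the program repeatedly.

```python
import subprocess
from typing import Callable, Iterable, Optional

# Assumed fallback order, from most to least structured:
# 6 = single uniform block, 7 = single line, 8 = single word, 10 = single char.
FALLBACK_PSMS = (6, 7, 8, 10)


def ocr_with_fallback(
    image_path: str,
    recognize: Callable[[str, int], str],
    psms: Iterable[int] = FALLBACK_PSMS,
) -> Optional[str]:
    """Try each page segmentation mode in turn until one yields text.

    `recognize` is any callable taking (image_path, psm) and returning the
    recognized text, or an empty string when nothing was found.
    """
    for psm in psms:
        text = recognize(image_path, psm).strip()
        if text:
            return text
    return None  # every mode came back empty


def run_tesseract(image_path: str, psm: int) -> str:
    """Hypothetical recognizer that shells out to the tesseract CLI.

    Assumes a build that accepts `--psm`; older 3.x builds spell it `-psm`.
    This launches one process per mode, which is exactly the cost the
    post objects to, so treat it as a stand-in for an in-process call.
    """
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "--psm", str(psm)],
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

With something like this, `ocr_with_fallback("selection.png", run_tesseract)` would return the first non-empty result, where `selection.png` is a placeholder for the cropped sub-image.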
TIA
scott

