Re: [tesseract-ocr] Empty page result. Bug?

S.J. Becker Sat, 23 Apr 2016 11:12:50 -0700

On Saturday, April 23, 2016 at 9:02:24 AM UTC-7, zdenop wrote:

> Why analyze? Don't you know in advance if you are asking to OCR page or
> just paragraph, line or word???

No.

My user is viewing an image of a large construction blueprint. They select
"Copy Text" and draw a rectangle around part of the image which contains
text. I need my program to OCR any text in that sub-image and copy it to
the clipboard.

I have no idea whether they selected a character, a word, a single-line
sentence or a multi-line sentence.

I was tracing down a non-fatal error message that was printed to the console
when running tesseract. I found out that tesseract was calling leptonica to
segment the page, and that leptonica was emitting an error and returning
failure because the image was below a certain height. It was not even trying
to segment the image.

The leptonica developer made the arbitrary decision that it didn't make
sense to segment the page because it was too small. If leptonica makes such
judgements, then tesseract has to deal with them intelligently. If tesseract
does not want to deal with it, then I must deal with it. If I refuse to deal
with it, then I can ask my user to describe what they selected and make them
deal with it.

If I asked my user if they selected a single character, a single word, a
single line of words or multiple lines of words, they would conclude that my
software is a steaming pile of crap. So that leaves me to solve the problem.

It's my opinion that it is crazy for an OCR program to return "Empty Page!"
when I feed it an image with "A2.12" on it, just because the image is below
a certain size, or because it lacks white space, or because I told it to
expect multiple lines of text with varying heights instead of "Expect a
single word".

It's returning "Empty Page!" without even trying to OCR the image!

The last 6 psm options are in a nice hierarchy. If you don't think it makes
sense to fall back to a more primitive setting when the advanced setting
fails, then I will have to create a patched version which does that.
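
A minimal sketch of that fallback, written in Python for brevity rather than
as a patch to tesseract itself. The mode order (6 = uniform block, 7 = single
line, 8 = single word, 10 = single character) is an assumed reading of the
hierarchy, and `recognize` is a hypothetical callable standing in for whatever
actually runs the OCR for one page segmentation mode; neither is tesseract's
own API.

```python
from typing import Callable, Iterable, Optional, Tuple

# Assumed fallback order, from the most general layout to the most primitive.
# The numbers are tesseract page segmentation modes:
# 6 = single uniform block, 7 = single line, 8 = single word, 10 = single char.
FALLBACK_MODES = (6, 7, 8, 10)


def ocr_with_fallback(
    image: object,
    recognize: Callable[[object, int], str],
    modes: Iterable[int] = FALLBACK_MODES,
) -> Optional[Tuple[int, str]]:
    """Try each mode in order and return (mode, text) for the first
    non-empty result, or None if every mode comes back empty.

    `recognize` is a hypothetical hook: it takes the image and a psm
    number and returns the recognized text ("" for an empty page).
    """
    for psm in modes:
        text = recognize(image, psm).strip()
        if text:
            return psm, text
    return None


def fake_recognize(image: object, psm: int) -> str:
    """Stand-in recognizer for demonstration only: pretends that block
    mode reports an empty page and the more primitive modes succeed."""
    return "" if psm == 6 else "A2.12\n"


if __name__ == "__main__":
    print(ocr_with_fallback("cropped-selection", fake_recognize))
```

With a wrapper such as pytesseract, `recognize` could call
`pytesseract.image_to_string(image, config="--psm 7")` with the mode
substituted in, but that still launches the tesseract binary once per mode.
Doing the same loop inside one process is what would avoid the repeated
launches.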

It makes no sense for me to launch tesseract two or three times to OCR
"A2.12".

TIA
scott
