I was privately asked about the detectAngles option for PDF text
extraction.  It is disabled by default in Tika, but users can choose to
turn it on.  There is a very slight performance hit, but it can be magical
for some PDFs that contain rotated text.

I thought I'd give an answer a try publicly.  Tilman and PDFBox colleagues,
please correct/supplement as necessary.

PDF is a presentation-based format.  There are operators that say, e.g.,
print "hello world" at x,y coordinates.  Other operators can rotate the
text so that viewers will present it at a given angle.

The challenge for text extraction is that the operators typically write
bits of text, "hel", "lo", "wor", "ld" _and_ there's no notion of what a
"line" is...just x,y coordinates...so text extractors have to do some work
to figure out which text bits are on the same line and should be stitched
together...why?  Well, PDF!
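
To make the stitching step concrete, here is a minimal Python sketch.  It
is not how PDFBox or Tika actually do it; the function name stitch_lines,
the (text, x, y) tuples and the y_tolerance value are all made up for
illustration.  It just groups text bits whose y coordinates are close and
then orders each group by x.

```python
def stitch_lines(fragments, y_tolerance=2.0):
    """Group (text, x, y) fragments into lines and join each line left to right.

    Hypothetical helper for illustration only: fragments whose y is within
    y_tolerance of the first fragment of the current line are treated as
    being on that line.
    """
    lines = []  # each entry: (reference_y, [(x, text), ...])
    # PDF y coordinates grow upward, so sort descending to read top to bottom.
    for text, x, y in sorted(fragments, key=lambda f: -f[2]):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    # Within a line, order the bits by x and concatenate them.
    return ["".join(text for _, text in sorted(bits)) for _, bits in lines]


# Text bits as an extractor might see them: out of order, slightly jittered y.
fragments = [
    ("wor", 130.0, 700.0),
    ("hel", 100.0, 700.5),
    ("second line", 100.0, 680.0),
    ("ld", 145.0, 699.8),
    ("lo ", 115.0, 700.0),
]

print(stitch_lines(fragments))  # ['hello world', 'second line']
```

This only works because all the bits of a line share (almost) the same y,
which is exactly the assumption that rotated text breaks.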

If you don't turn on detectAngles, text extractors can have trouble
stitching rotated text back together (guessing which text bits are on the
same "line"), so you might get, e.g., "hel wor lo ld".
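
Here is a rough Python sketch of why knowing the angle helps.  Again, this
is an illustration under my own assumptions, not Tika's or PDFBox's actual
code: the helpers stitch, unrotate and place_along_angle, the 45 degree
angle and the coordinates are all hypothetical.  The idea is that bits on
a rotated baseline no longer share a y coordinate, so naive line grouping
falls apart, but rotating the coordinates back by the detected angle makes
the baseline horizontal again.

```python
import math


def stitch(fragments, y_tolerance=2.0):
    """Naive stitching: group (text, x, y) bits by similar y, order by x."""
    lines = []  # each entry: (reference_y, [(x, text), ...])
    for text, x, y in sorted(fragments, key=lambda f: -f[2]):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    return ["".join(text for _, text in sorted(bits)) for _, bits in lines]


def unrotate(fragments, angle_degrees):
    """Rotate every fragment position by -angle_degrees around the origin."""
    a = math.radians(-angle_degrees)
    cos_a, sin_a = math.cos(a), math.sin(a)
    return [
        (text, x * cos_a - y * sin_a, x * sin_a + y * cos_a)
        for text, x, y in fragments
    ]


def place_along_angle(pieces, start, step, angle_degrees):
    """Build test data: put each piece `step` units further along a rotated baseline."""
    a = math.radians(angle_degrees)
    x0, y0 = start
    return [
        (text, x0 + i * step * math.cos(a), y0 + i * step * math.sin(a))
        for i, text in enumerate(pieces)
    ]


# "hello world" written in four bits along a baseline rotated by 45 degrees.
rotated = place_along_angle(["hel", "lo ", "wor", "ld"], (100.0, 100.0), 15.0, 45.0)

naive = stitch(rotated)                   # every bit lands on its own "line"
fixed = stitch(unrotate(rotated, 45.0))   # baseline is horizontal again

print(naive)  # ['ld', 'wor', 'lo ', 'hel']
print(fixed)  # ['hello world']
```

The exact garbling you get without angle handling depends on the
extractor's heuristics; the sketch only shows the general failure mode.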

Cheers,

       Tim
