Hello,

I’m currently trying to use Apache Tika to extract text from various PDF 
files.

I’ve been searching through the API but couldn’t determine whether what I 
want is possible.

The normal parsing operation outputs a list of lines:

        line 1
        line 2
        …
        line n

I was wondering whether it is possible not only to extract the lines but 
also to obtain positional information for each one, e.g. the page the line 
was parsed from and its Cartesian position on that page (if the page is 
viewed as an image):

        line 1  (metadata 1)
        line 2  (metadata 2)
        …
        line n  (metadata n)
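
To make the desired output more concrete, here is a rough sketch of the kind 
of record I have in mind for each line. The class name PositionedLine and its 
fields are hypothetical and are not part of the Tika API; the coordinate 
units and the sample values in main are made up for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical shape of the output I am after; not an existing Tika type.
public class PositionedLine {
    final String text; // the extracted line
    final int page;    // page the line came from (1-based here, by assumption)
    final float x;     // horizontal position on the page (units assumed)
    final float y;     // vertical position on the page (units assumed)

    PositionedLine(String text, int page, float x, float y) {
        this.text = text;
        this.page = page;
        this.x = x;
        this.y = y;
    }

    @Override
    public String toString() {
        // Mirrors the "line  (metadata)" layout shown above.
        return text + "  (page=" + page + ", x=" + x + ", y=" + y + ")";
    }

    public static void main(String[] args) {
        // Made-up values, only to show the intended structure.
        List<PositionedLine> lines = new ArrayList<>();
        lines.add(new PositionedLine("line 1", 1, 72.0f, 700.0f));
        lines.add(new PositionedLine("line 2", 1, 72.0f, 686.0f));
        for (PositionedLine line : lines) {
            System.out.println(line);
        }
    }
}
```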

Is this possible with Apache Tika?

Thanks,
Raul
