Parsing text from PDF while keeping positional information

[email protected] Tue, 29 Aug 2017 03:22:51 -0700

Hello,

I'm currently trying to use Apache Tika to extract text from various PDF 
files.


I've been searching through the API but couldn't determine whether what I 
want is possible.

The normal parsing operation outputs a list of lines:

        line 1
        line 2
        ...
        line n

I was curious whether it is possible not only to extract the lines, but also 
to obtain positional information for each one,

e.g. the page the line was parsed from and its Cartesian position 
on the page (if the PDF is viewed as an image):

        line 1  (metadata 1)
        line 2  (metadata 2)
        ...
        line n  (metadata n)
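
To make the desired output more concrete, here is a minimal sketch of the kind of record I have in mind for each line. Everything in it is an assumption for illustration only: the names PositionedLine, page, x and y are hypothetical and are not part of the Tika API, the coordinates are assumed to be in PDF points, and the sample values are made up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical holder for one extracted line plus its position.
// This is NOT a Tika class; it only illustrates the desired output shape.
final class PositionedLine {
    final String text; // the line's text as extracted
    final int page;    // 1-based number of the page the line came from
    final float x;     // horizontal position on the page (assumed: PDF points)
    final float y;     // vertical position on the page (assumed: PDF points)

    PositionedLine(String text, int page, float x, float y) {
        this.text = text;
        this.page = page;
        this.x = x;
        this.y = y;
    }

    @Override
    public String toString() {
        // Locale.ROOT keeps the decimal separator stable across systems.
        return String.format(Locale.ROOT, "%s  (page=%d, x=%.1f, y=%.1f)",
                text, page, x, y);
    }
}

final class PositionedLineDemo {
    // Builds two made-up entries to show what a result list could look like.
    static List<PositionedLine> sample() {
        List<PositionedLine> lines = new ArrayList<>();
        lines.add(new PositionedLine("line 1", 1, 72.0f, 720.0f));
        lines.add(new PositionedLine("line 2", 1, 72.0f, 706.0f));
        return lines;
    }
}
```

Printing each element of PositionedLineDemo.sample() would give one line of text followed by its page and coordinates, which is the "line (metadata)" layout shown above.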

Is this possible with Apache Tika?

Thanks,
Raul
