Hello,
I'm currently trying to use Apache Tika to extract text from various PDF
files.
I've been searching through the API but couldn't exactly assess if what I
want is possible.
The normal parsing operation outputs a list of lines:
line 1
line 2
...
line n
I was curious about the possibility of not only extracting the lines, but
also obtaining positional information for each one,
e.g. the page the line was parsed from and its Cartesian position
on that page of the PDF file (if viewed as an image):
line 1 (metadata 1)
line 2 (metadata 2)
...
line n (metadata n)
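To make the desired output more concrete, here is a rough Java sketch of the kind of per-line record I have in mind. It does not call any Tika API. The names PositionedLineSketch and PositionedLine, the fields, and the sample values are all hypothetical, and the coordinates are assumed to be in PDF points measured from the lower-left corner of the page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class PositionedLineSketch {

    // Hypothetical container: one extracted line plus its positional metadata.
    static final class PositionedLine {
        final String text; // the extracted line of text
        final int page;    // 1-based page number the line came from
        final float x;     // assumed: horizontal offset in points from the left edge
        final float y;     // assumed: vertical offset in points from the bottom edge

        PositionedLine(String text, int page, float x, float y) {
            this.text = text;
            this.page = page;
            this.x = x;
            this.y = y;
        }

        // Renders the line in the "line (metadata)" shape described above.
        String describe() {
            return String.format(Locale.ROOT, "%s (page=%d, x=%.1f, y=%.1f)",
                    text, page, x, y);
        }
    }

    // Formats every line, one per output row.
    static List<String> describeAll(List<PositionedLine> lines) {
        List<String> out = new ArrayList<>();
        for (PositionedLine line : lines) {
            out.add(line.describe());
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample values, purely for illustration.
        List<PositionedLine> lines = new ArrayList<>();
        lines.add(new PositionedLine("line 1", 1, 72.0f, 700.5f));
        lines.add(new PositionedLine("line 2", 1, 72.0f, 686.0f));
        for (String row : describeAll(lines)) {
            System.out.println(row);
        }
    }
}
```

Whatever the parser ends up exposing, something that can be mapped onto a structure like this would cover my use case.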
Is this possible with Apache Tika?
Thanks,
Raul