We don't currently do this, unfortunately. Wildloop has a pull request that
would add this: https://github.com/apache/tika/pull/152
If at all possible, I'd want this to use the same format as the hOCR we're
getting from Tesseract, so that consumers don't need one way of processing
our XHTML for OCR and a different one for PDFs.
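To make the idea concrete, here is a rough sketch of what an hOCR-style line
carrying positional information could look like. It is only an illustration:
the class HocrLineSketch and its methods are hypothetical and are not part of
Tika or of the pull request. It also assumes the coordinates have already been
converted to integer pixel values with the origin at the top-left of the page,
which is the convention hOCR uses for bbox.

```java
import java.util.Locale;

/** Hypothetical helper (not Tika API): renders one extracted line as an hOCR-style span. */
final class HocrLineSketch {

    /** Escapes the characters that are unsafe in XHTML text content. */
    static String escape(String s) {
        // '&' must be replaced first so already-escaped entities are not double-escaped later.
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /**
     * Builds a span in the style Tesseract emits for hOCR lines.
     * (x0, y0) is the top-left corner and (x1, y1) the bottom-right corner of the line's box.
     */
    static String toHocrLine(int pageNo, int lineNo, String text,
                             int x0, int y0, int x1, int y1) {
        return String.format(Locale.ROOT,
                "<span class=\"ocr_line\" id=\"line_%d_%d\" title=\"bbox %d %d %d %d\">%s</span>",
                pageNo, lineNo, x0, y0, x1, y1, escape(text));
    }
}
```

For example, HocrLineSketch.toHocrLine(1, 2, "line 2", 10, 20, 110, 32) would
yield a span with id line_1_2 and title "bbox 10 20 110 32", which a consumer
could then parse the same way as the Tesseract output.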
What do you think?
Best,
Tim
-----Original Message-----
From: [email protected] [mailto:[email protected]]
Sent: Tuesday, August 29, 2017 6:22 AM
To: [email protected]
Subject: Parsing text from PDF while keeping positional information
Hello,
I’m currently trying to use Apache Tika to extract text from various PDF
files.
I’ve been searching through the API but couldn’t quite tell whether what I
want is possible.
The normal parsing operation outputs a list of lines:
line 1
line 2
…
line n
I was curious whether it is possible not only to extract the lines, but also
to obtain positional information for each one,
e.g. the page the line was parsed from and its Cartesian position
on that page (if viewed as an image):
line 1 (metadata 1)
line 2 (metadata 2)
…
line n (metadata n)
Is this possible with Apache Tika?
Thanks,
Raul