We don't currently do this, unfortunately.  Wildloop has a pull request that 
would add this: https://github.com/apache/tika/pull/152

If at all possible, I'd like the output to use the same format as the hOCR 
we're getting from Tesseract, so that consumers don't need one way of 
processing our XHTML for OCR and a different one for PDFs.
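
As a rough illustration of the target format, here is a minimal sketch of 
emitting one line of text with its bounding box in hOCR style. The 
"ocr_line" class and the "bbox x0 y0 x1 y1" title property (pixels, origin 
at the top-left of the page image) come from the hOCR convention that 
Tesseract emits. Everything else is an assumption for illustration only: 
the class HocrLineSketch, its methods, the id scheme and the sample 
coordinates are hypothetical and are not part of Tika or of the pull request.

```java
import java.util.Locale;

// Hypothetical sketch: this class is NOT part of Tika's API.
final class HocrLineSketch {

    // Escape the characters that are unsafe inside XHTML text content.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Render one extracted line as an hOCR-style span.
    // The bbox is "x0 y0 x1 y1" in pixels, measured from the top-left
    // corner of the page image. The id scheme (line_<page>_<n>) is assumed.
    static String toHocrLine(int page, int lineNo,
                             int x0, int y0, int x1, int y1, String text) {
        return String.format(Locale.ROOT,
            "<span class=\"ocr_line\" id=\"line_%d_%d\" title=\"bbox %d %d %d %d\">%s</span>",
            page, lineNo, x0, y0, x1, y1, escape(text));
    }

    public static void main(String[] args) {
        // Sample values only; real coordinates would come from the PDF parser.
        System.out.println(toHocrLine(1, 1, 72, 90, 540, 104, "line 1"));
    }
}
```

A consumer that already reads the bbox out of the title attribute for OCR 
output could then handle PDF text the same way.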

What do you think?

Best,

            Tim



-----Original Message-----
From: [email protected] [mailto:[email protected]] 
Sent: Tuesday, August 29, 2017 6:22 AM
To: [email protected]
Subject: Parsing text from PDF while keeping positional information

Hello,

I’m currently trying to use Apache Tika to extract text from various PDF 
files.

I’ve been searching through the API but couldn’t exactly assess if what I 
want is possible.

The normal parsing operation outputs a list of lines:

        line 1
        line 2
        …
        line n

I was curious whether it is possible not only to extract the lines, but also 
to obtain positional information for each one,

e.g. the page the line was parsed from and its Cartesian position on that 
page (if the PDF were viewed as an image):

        line 1  (metadata 1)
        line 2  (metadata 2)
        …
        line n  (metadata n)

Is this possible with Apache Tika?

Thanks,
Raul
