RE: Parsing text from PDF while keeping positional information

Allison, Timothy B. Tue, 29 Aug 2017 06:29:33 -0700

We don't currently do this, unfortunately.  Wildloop has a pull request that 
would add this: https://github.com/apache/tika/pull/152


If at all possible, I'd want this to use the same format as the hOCR we're 
getting from Tesseract, so that consumers don't need one way of processing 
our XHTML for OCR and a different one for PDFs.
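
To make the consumer side concrete, here is a rough sketch of what reading 
hOCR-style output could look like. It assumes the usual hOCR conventions 
(class="ocr_page" and class="ocr_line" elements with a "bbox x0 y0 x1 y1" 
entry in the title attribute). The sample markup, the extract_lines name, and 
the coordinates are made up for illustration; this is not what Tika emits 
for PDFs today.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample: hOCR-style markup with one page and two lines.
# The ids and coordinates are invented for this sketch.
SAMPLE = """<div class="ocr_page" id="page_1" title="bbox 0 0 612 792">
  <span class="ocr_line" title="bbox 72 90 300 104">line 1</span>
  <span class="ocr_line" title="bbox 72 110 280 124">line 2</span>
</div>"""

# hOCR stores the bounding box as "bbox x0 y0 x1 y1" inside the title attribute.
BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


def extract_lines(xhtml):
    """Return (page_id, (x0, y0, x1, y1), text) for each ocr_line element."""
    root = ET.fromstring(xhtml)
    results = []
    for page in root.iter("div"):
        if page.get("class") != "ocr_page":
            continue
        for span in page.iter("span"):
            if span.get("class") != "ocr_line":
                continue
            match = BBOX.search(span.get("title", ""))
            if match is None:
                continue  # skip lines that carry no bounding box
            bbox = tuple(int(v) for v in match.groups())
            text = "".join(span.itertext()).strip()
            results.append((page.get("id"), bbox, text))
    return results


for page_id, bbox, text in extract_lines(SAMPLE):
    print(page_id, bbox, text)
```

If PDF output followed the same conventions, a consumer like this would work 
unchanged for both OCR and PDF text.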

What do you think?

Best,

            Tim



-----Original Message-----
From: [email protected] [mailto:[email protected]] 
Sent: Tuesday, August 29, 2017 6:22 AM
To: [email protected]
Subject: Parsing text from PDF while keeping positional information

Hello,

I'm currently trying to use Apache Tika to extract text from various PDF 
files.

I've been searching through the API but couldn't exactly assess if what I 
want is possible.

The normal parsing operation outputs a list of lines

        line 1
        line 2
        ...
        line n

I was curious about the possibility of not only extracting the lines, but 
also obtaining positional information for each one,

e.g. the page the line was parsed from and its Cartesian position on that 
page (if the PDF is viewed as an image):

        line 1  (metadata 1)
        line 2  (metadata 2)
        ...
        line n  (metadata n)

Is this possible with Apache Tika?

Thanks,
Raul
