Header extractions from PDFs (and others)

Grant Ingersoll Mon, 07 Jan 2019 05:50:33 -0800

Hi,

I'm working on a project extracting data out of user manuals (mainly PDFs)
for indexing and searching.  I want to be able to mark in the search engine
the header information.  The Tika extraction (coming from PDFBox) using an
OOTB setup strips out all identifying header information, AFAICT.  I know
the main reason is that PDF itself primarily describes layout and likely
gives no indication that something is semantically a header, but I wanted
to check here to see if anyone knows of a way to do this.


For example, the document might look something like this (we'll see how
well this comes through as HTML):

<snip>
*THIS IS A HEADER*

This is normal text.
</snip>

Tika's XML representation then comes back with something like:
<p>THIS IS A HEADER</p>
<p/>
<p>This is normal text.</p>

Ideally, it would come back with something like:
<h1>THIS IS A HEADER</h1>
<p/>
<p>This is normal text.</p>
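
To make the idea concrete, here is a rough sketch of the kind of heuristic I have in mind, not anything Tika or PDFBox does today. It assumes each extracted line already carries a font size in points (with PDFBox that could presumably be gathered by overriding PDFTextStripper.writeString and reading TextPosition.getFontSizeInPt(), which is not shown here). The class HeadingHeuristic, the Line type, and the 1.2 ratio are all made up for illustration: the most common font size is treated as body text, and any line noticeably larger is emitted as <h1>.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: not part of Tika or PDFBox.
public class HeadingHeuristic {

    /** One extracted line of text plus its font size in points (assumed input). */
    public static final class Line {
        final String text;
        final float fontSizePt;

        public Line(String text, float fontSizePt) {
            this.text = text;
            this.fontSizePt = fontSizePt;
        }
    }

    /** Guess the body font size: the size (rounded to 0.5pt) covering the most characters. */
    static float bodyFontSize(List<Line> lines) {
        Map<Float, Integer> charsPerSize = new HashMap<>();
        for (Line l : lines) {
            float key = Math.round(l.fontSizePt * 2) / 2.0f;
            charsPerSize.merge(key, l.text.length(), Integer::sum);
        }
        float best = 0f;
        int bestCount = -1;
        for (Map.Entry<Float, Integer> e : charsPerSize.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    /** Wrap lines at least `ratio` times the body size in <h1>, everything else in <p>. */
    static List<String> toXhtml(List<Line> lines, float ratio) {
        float body = bodyFontSize(lines);
        List<String> out = new ArrayList<>();
        for (Line l : lines) {
            String tag = l.fontSizePt >= body * ratio ? "h1" : "p";
            out.add("<" + tag + ">" + escape(l.text) + "</" + tag + ">");
        }
        return out;
    }

    /** Minimal XML escaping for element content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // Font sizes here are invented sample values.
        List<Line> lines = Arrays.asList(
                new Line("THIS IS A HEADER", 18f),
                new Line("This is normal text.", 10f),
                new Line("More normal text follows.", 10f));
        for (String s : toXhtml(lines, 1.2f)) {
            System.out.println(s);
        }
    }
}
```

With the sample values above, the first line comes out as <h1> and the other two as <p>. Font size alone will obviously misfire on manuals where headers are the same size as the body but bold or all caps, so font name or capitalization might need to be extra signals.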

Thanks,
Grant
