Hi,

I'm working on a project that extracts data from user manuals (mainly PDFs)
for indexing and searching.  I want to be able to mark header information
in the search engine.  The Tika extraction (which comes from PDFBox), using
an OOTB setup, strips out all identifying header information, AFAICT.  I
know the main reason is that PDF primarily provides layout information and
likely doesn't give any indication that something is semantically a header,
but I wanted to check here to see if anyone knows of a way to do this.

For example, the document might look something like this (we'll see how
well this comes through as HTML):

<snip>
*THIS IS A HEADER*

This is normal text.
</snip>

Tika's XML representation then comes back with something like:
<p>THIS IS A HEADER</p>
<p/>
<p>This is normal text</p>

Ideally, it would come back with something like:
<h1>THIS IS A HEADER</h1>
<p/>
<p>This is normal text</p>
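
To make the desired output concrete, here is a minimal sketch of one
possible heuristic, not something Tika does today.  It assumes each
extracted line arrives with a font size (as far as I know, PDFBox's
TextPosition exposes one via getFontSizeInPt(), but the default Tika output
does not carry it through).  The Line class, the HeaderHeuristic name, and
the 1.2 ratio are all hypothetical, and a header that is only bold at body
size (as the example above may be) would not be caught.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HeaderHeuristic {

    /** One extracted line plus its font size in points (hypothetical input type). */
    static final class Line {
        final String text;
        final float fontSize;

        Line(String text, float fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    /** Assumed threshold: a line is a header if its font is at least 20% larger than body text. */
    static final float HEADER_RATIO = 1.2f;

    /** Returns the font size covering the most characters, which is taken to be body text. */
    static float bodyFontSize(List<Line> lines) {
        Map<Float, Integer> charsPerSize = new HashMap<>();
        for (Line line : lines) {
            if (!line.text.trim().isEmpty()) {
                charsPerSize.merge(line.fontSize, line.text.length(), Integer::sum);
            }
        }
        float best = 0f;
        int bestCount = -1;
        for (Map.Entry<Float, Integer> e : charsPerSize.entrySet()) {
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    /** Escapes the characters that are significant in XML text content. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Emits <h1> for lines noticeably larger than body text, <p/> for blank lines, <p> otherwise. */
    static String toXhtml(List<Line> lines) {
        float threshold = bodyFontSize(lines) * HEADER_RATIO;
        List<String> out = new ArrayList<>();
        for (Line line : lines) {
            String text = line.text.trim();
            if (text.isEmpty()) {
                out.add("<p/>");
            } else if (line.fontSize >= threshold) {
                out.add("<h1>" + escape(text) + "</h1>");
            } else {
                out.add("<p>" + escape(text) + "</p>");
            }
        }
        return String.join("\n", out);
    }
}
```

Calling toXhtml on the three lines of the example, with the header at an
assumed 18pt and the body at 10pt, would mark the first line as <h1> and
leave the rest as paragraphs.  Real manuals would likely need more signals
(bold, all caps, spacing) than font size alone.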

Thanks,
Grant
