Hi, I'm working on a project that extracts data from user manuals (mainly PDFs) for indexing and searching. I want to be able to mark header information in the search engine. With an OOTB setup, the Tika extraction (which comes from PDFBox) strips out all identifying header information, AFAICT. I know the main reason is that PDF primarily provides layout information and likely gives no indication that something is semantically a header, but I wanted to check here to see whether anyone knows of a way to do this.
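The only fallback I can think of is a layout heuristic: treat any run of text whose font size is noticeably larger than the body text as a header. Below is a rough sketch of that idea in plain Java, not anything Tika does today. It assumes the font size of each run can be obtained somehow (I believe PDFBox's TextPosition carries a font size, but I haven't checked how to get at it through Tika). The Run class, the tagRuns and bodyFontSize methods, and the 1.2 ratio are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HeaderHeuristic {

    /** Hypothetical holder for one run of text and its font size in points. */
    public static final class Run {
        final String text;
        final double fontSize;

        public Run(String text, double fontSize) {
            this.text = text;
            this.fontSize = fontSize;
        }
    }

    /**
     * Estimates the body font size as the (upper) median of all run sizes.
     * Assumption: most runs in a manual are body text, not headers.
     */
    public static double bodyFontSize(List<Run> runs) {
        if (runs.isEmpty()) {
            return 0.0;
        }
        double[] sizes = new double[runs.size()];
        for (int i = 0; i < sizes.length; i++) {
            sizes[i] = runs.get(i).fontSize;
        }
        Arrays.sort(sizes);
        return sizes[sizes.length / 2];
    }

    /**
     * Wraps each run in <h1> if its font size is at least ratio times the
     * estimated body size, otherwise in <p>. No XML escaping is done here.
     */
    public static List<String> tagRuns(List<Run> runs, double ratio) {
        double body = bodyFontSize(runs);
        List<String> out = new ArrayList<>();
        for (Run r : runs) {
            String tag = r.fontSize >= body * ratio ? "h1" : "p";
            out.add("<" + tag + ">" + r.text + "</" + tag + ">");
        }
        return out;
    }

    public static void main(String[] args) {
        // Font sizes here are invented sample values.
        List<Run> runs = Arrays.asList(
                new Run("THIS IS A HEADER", 18.0),
                new Run("This is normal text.", 10.0),
                new Run("More normal text.", 10.0));
        for (String line : tagRuns(runs, 1.2)) {
            System.out.println(line);
        }
    }
}
```

This would obviously misfire on documents where headers are only bold or only capitalized rather than larger, so I'd much prefer something built in if it exists.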
For example, the document might look something like this (we'll see how well this comes through as HTML):

<snip>
*THIS IS A HEADER*
This is normal text.
</snip>

Tika's XML representation then comes back with something like:

<p>THIS IS A HEADER</p>
<p/>
<p>This is normal text</p>

Ideally, it would come back with something like:

<h1>THIS IS A HEADER</h1>
<p/>
<p>This is normal text</p>

Thanks,
Grant
