Thanks, Tim, will look into that parser. Totally agree on the need for more formatting info as attributes.
On Mon, Jan 7, 2019 at 11:30 AM Tim Allison <[email protected]> wrote:

> Grant,
>
> You might want to try the Grobid parser, which is built to process
> academic papers. [0]
>
> Generally, though, as you can imagine, building a cross-language,
> cross-genre header (and footer) extractor is going to require
> heuristics and/or ML on a tagged set...with fingers crossed that
> there's enough signal from which to learn the classification. There
> are some research papers on this topic[1], and I suspect the
> commercial extractors might do a reasonable job, but this is,
> unfortunately, beyond the scope of what Tika currently offers.
>
> One thing we could do on the Tika end is a better job of including
> font size/location/boldedness[2], etc in the xhtml output. Then
> consumers could write their own heuristics for their specific document
> sets.
>
> As you know, as nlp continues to move into production, it will only
> become more important for open source tools to be able to reconstruct
> the logical components of PDFs and image-based files.
>
> Cheers,
>
> Tim
>
> [0] https://wiki.apache.org/tika/GrobidJournalParser or just straight
> grobid: https://grobid.readthedocs.io/en/latest/Introduction/...see
> also: https://www.crossref.org/labs/pdfextract/
> [1] e.g.
> https://www.researchgate.net/publication/221253782_Header_and_Footer_Extraction_by_Page-Association
> [2] hand-waving...need to look into size/boldedness...location is
> fairly straightforward.
>
> On Mon, Jan 7, 2019 at 8:50 AM Grant Ingersoll <[email protected]>
> wrote:
> >
> > Hi,
> >
> > I'm working on a project extracting data out of user manuals (mainly
> > PDFs) for indexing and searching. I want to be able to mark in the search
> > engine the header information. The Tika extraction (coming from PDFBox)
> > using an OOTB setup strips out all identifying header information, AFAICT.
> > I know the main reason is that PDF itself is just providing layout
> > information primarily and likely doesn't give any indication that something
> > is semantically a header, but I wanted to check here to see if anyone knows
> > of a way to do this.
> >
> > For example, the document might look something like this (we'll see how
> > well this comes through as HTML):
> >
> > <snip>
> > THIS IS A HEADER
> >
> > This is normal text.
> > </snip>
> >
> > Tika's XML representation then comes back with something like:
> > <p>THIS IS A HEADER</p>
> > <p/>
> > <p>This is normal text</p>
> >
> > Ideally, it would come back with something like:
> > <h1>THIS IS A HEADER</h1>
> > <p/>
> > <p>This is normal text</p>
> >
> > Thanks,
> > Grant
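
For illustration only, here is a rough sketch of the kind of consumer-side heuristic described above. It assumes the extractor has already handed back, for each paragraph, its text, font size, and boldness; Tika does not expose these today, so the tuple input shape, the `tag_headers` name, and the `ratio` and `max_bold_words` thresholds are all hypothetical and would need tuning per document set.

```python
from statistics import median
from xml.sax.saxutils import escape


def tag_headers(runs, ratio=1.2, max_bold_words=8):
    """Tag each paragraph as <h1> or <p> from font hints.

    runs: list of (text, font_size_pt, is_bold) tuples. This input shape
    is an assumption for the sketch, not something Tika emits today.
    """
    if not runs:
        return []

    # Treat the median font size as the "body text" size.
    body_size = median(size for _, size, _ in runs)

    tagged = []
    for text, size, bold in runs:
        # Heuristic: clearly larger than body text, or short and bold.
        is_larger = size >= body_size * ratio
        is_short_bold = bold and len(text.split()) <= max_bold_words
        tag = "h1" if (is_larger or is_short_bold) else "p"
        tagged.append(f"<{tag}>{escape(text)}</{tag}>")
    return tagged


if __name__ == "__main__":
    sample = [
        ("THIS IS A HEADER", 18.0, True),
        ("This is normal text.", 10.0, False),
        ("More normal text.", 10.0, False),
    ]
    print("\n".join(tag_headers(sample)))
```

Taking the median as the body size is only one plausible baseline; a document set with many large-type pages (covers, callouts) would likely need something more robust, such as a per-document mode or a trained classifier.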
