Re: Header extractions from PDFs (and others)

Grant Ingersoll Wed, 09 Jan 2019 05:50:34 -0800

Thanks, Tim, will look into that parser.  Totally agree on the need for
more formatting info as attributes.
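
To make the "formatting info as attributes" idea concrete, here is a rough
sketch of the kind of consumer-side heuristic it would enable. It assumes each
extracted line arrives with a font size; Tika does not emit this today, so the
(text, font_size) tuples, the tag_headers name, and the 1.2 ratio are all
hypothetical choices for illustration, not anything Tika or PDFBox provides.

```python
from statistics import median


def tag_headers(lines, ratio=1.2):
    """Label each line "h1" or "p" by comparing its font size to the median.

    `lines` is a list of (text, font_size) tuples. This input shape is an
    assumption: it stands in for font attributes that Tika does not yet emit.
    """
    if not lines:
        return []
    # Treat the median font size as the body-text size for the document.
    body_size = median(size for _, size in lines)
    # Anything noticeably larger than body text is guessed to be a header.
    return [
        ("h1" if size >= ratio * body_size else "p", text)
        for text, size in lines
    ]


sample = [
    ("THIS IS A HEADER", 18.0),
    ("This is normal text.", 10.0),
    ("More body text.", 10.0),
]
for tag, text in tag_headers(sample):
    print(f"<{tag}>{text}</{tag}>")
```

The ratio would need tuning per document set, which is why leaving the
heuristic to the consumer rather than baking it into the parser seems
reasonable.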


On Mon, Jan 7, 2019 at 11:30 AM Tim Allison <[email protected]> wrote:

> Grant,
>
>   You might want to try the Grobid parser, which is built to process
> academic papers. [0]
>
>   Generally, though, as you can imagine, building a cross-language,
> cross-genre header (and footer) extractor is going to require
> heuristics and/or ML on a tagged set...with fingers crossed that
> there's enough signal from which to learn the classification. There
> are some research papers on this topic[1], and I suspect the
> commercial extractors might do a reasonable job, but this is,
> unfortunately, beyond the scope of what Tika currently offers.
>
>   One thing we could do on the Tika end is a better job of including
> font size/location/boldedness[2], etc. in the XHTML output.  Then
> consumers could write their own heuristics for their specific document
> sets.
>
>   As you know, as NLP continues to move into production, it will only
> become more important for open source tools to be able to reconstruct
> the logical components of PDFs and image-based files.
>
>      Cheers,
>
>                   Tim
>
> [0] https://wiki.apache.org/tika/GrobidJournalParser or just straight
> grobid: https://grobid.readthedocs.io/en/latest/Introduction/ ... see
> also: https://www.crossref.org/labs/pdfextract/
> [1] e.g.
> https://www.researchgate.net/publication/221253782_Header_and_Footer_Extraction_by_Page-Association
> [2] hand-waving...need to look into size/boldedness...location is
> fairly straightforward.
>
> On Mon, Jan 7, 2019 at 8:50 AM Grant Ingersoll <[email protected]>
> wrote:
> >
> > Hi,
> >
> > I'm working on a project extracting data out of user manuals (mainly
> > PDFs) for indexing and searching.  I want to be able to mark in the search
> > engine the header information.  The Tika extraction (coming from PDFBox)
> > using an OOTB setup strips out all identifying header information, AFAICT.
> > I know the main reason is that PDF itself is just providing layout
> > information primarily and likely doesn't give any indication that something
> > is semantically a header, but I wanted to check here to see if anyone knows
> > of a way to do this.
> >
> > For example, the document might look something like this (we'll see how
> > well this comes through as HTML):
> >
> > <snip>
> > THIS IS A HEADER
> >
> > This is normal text.
> > </snip>
> >
> > Tika's XML representation then comes back with something like:
> > <p>THIS IS A HEADER</p>
> > <p/>
> > <p>This is normal text</p>
> >
> > Ideally, it would come back with something like:
> > <h1>THIS IS A HEADER</h1>
> > <p/>
> > <p>This is normal text</p>
> >
> > Thanks,
> > Grant
>
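
Until richer formatting information is available, a text-only heuristic on the
existing output is one stopgap for the example above. The sketch below promotes
short, all-caps <p> elements to <h1>. It assumes a namespace-free fragment like
the one quoted in the thread (real Tika XHTML carries an XML namespace that
would need handling), and the function names and the 8-word cutoff are made up
for illustration.

```python
import xml.etree.ElementTree as ET


def looks_like_header(text, max_words=8):
    """Guess that a paragraph is a header if it is short and all uppercase."""
    text = (text or "").strip()
    if not text:
        return False
    # str.isupper() is True only when there is at least one cased character
    # and every cased character is uppercase.
    return text.isupper() and len(text.split()) <= max_words


def promote_headers(xhtml_fragment):
    """Return the fragment with header-like <p> elements renamed to <h1>."""
    root = ET.fromstring(xhtml_fragment)
    # Materialize the list first so renaming tags cannot disturb iteration.
    for p in list(root.iter("p")):
        if looks_like_header(p.text):
            p.tag = "h1"
    return ET.tostring(root, encoding="unicode")


sample = "<body><p>THIS IS A HEADER</p><p/><p>This is normal text</p></body>"
print(promote_headers(sample))
```

This will obviously misfire on manuals that set warnings or part numbers in
capitals, so it is only a starting point for a specific document set.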
