Re: trying to do better text extraction

Andreas Lehmkühler Fri, 29 Jan 2010 00:35:19 -0800

Hi,

Sent: Fri, 29 Jan 2010 From: Ted Dunning<[email protected]>


> I am working on text extraction from some PDF documents.  As you might
> expect, results are pretty good for very simple documents and very bad for
> some fancy ones.
> 
> Two-column documents with headers and footers and text insets are
> particularly ugly.  Using the -sort option to ExtractText makes things much
> worse since lines from the insets and columns are all mixed together.
> 
> I have an idea that I could build a classifier using simple machine learning
> that would quickly get the idea of what is a header and footer and would be
> able to block columns together.  Given a set of non-header blocks of text,
> it should be pretty simple to discern the text flow.
> 
> Thus my problem is how to find out the locations and rough presentation
> information about blocks of text in a PDF document.  If there is an easy way
> to hook in during the text extraction process, that would be great.  Also,
> if there is a way to get more verbose structural information out of the
> ExtractText tool, that would be great.
> 
> Does anybody have any suggestions?

Have a look at [1]. Mel and some others implemented an alternative version of
the PDFTextStripper with additional features concerning the text structure.
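
To make the header/footer and column idea from the quoted message a bit more
concrete, here is a minimal sketch in Python. It does not use PDFBox at all:
the Block type, the function names, and the thresholds are all made up for
illustration, and it assumes you already have one record per text block with
its page number and position. It uses a simple repetition heuristic instead of
a trained classifier, and it assumes exactly two columns split at the middle
of the page.

```python
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical record for one extracted text block."""
    page: int    # 1-based page number
    x: float     # left edge in points
    y: float     # top edge in points, measured from the top of the page
    text: str


def _key(block, y_tolerance):
    # Bucket the vertical position and mask digits so that
    # "Page 1" and "Page 2" count as the same running text.
    return (round(block.y / y_tolerance), re.sub(r"\d+", "#", block.text.strip()))


def find_running_blocks(blocks, num_pages, min_fraction=0.6, y_tolerance=2.0):
    """Return the set of blocks that repeat at about the same height on most pages."""
    pages_seen = defaultdict(set)
    for b in blocks:
        pages_seen[_key(b, y_tolerance)].add(b.page)
    repeated = {k for k, pages in pages_seen.items()
                if len(pages) >= min_fraction * num_pages}
    return {b for b in blocks if _key(b, y_tolerance) in repeated}


def body_text_in_order(blocks, num_pages, page_width):
    """Drop headers/footers, then read the left column before the right one."""
    running = find_running_blocks(blocks, num_pages)
    body = [b for b in blocks if b not in running]
    middle = page_width / 2  # assumption: two columns split at the page centre
    body.sort(key=lambda b: (b.page, b.x >= middle, b.y))
    return [b.text for b in body]


if __name__ == "__main__":
    sample = [
        Block(1, 72, 30, "Journal of Examples"),
        Block(1, 72, 100, "left A"),
        Block(1, 320, 100, "right A"),
        Block(1, 72, 200, "left B"),
        Block(1, 300, 760, "Page 1"),
        Block(2, 72, 30, "Journal of Examples"),
        Block(2, 72, 100, "left C"),
        Block(2, 320, 100, "right C"),
        Block(2, 300, 760, "Page 2"),
    ]
    print(body_text_in_order(sample, num_pages=2, page_width=612.0))
```

Sorting the same blocks by vertical position alone would put "right A" between
"left A" and "left B", which is the kind of mixing described for -sort above.
Real documents would need something more robust than a fixed midpoint and
exact text matching, so treat this only as a starting point.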

> Ted Dunning, CTO
> DeepDyve

BR
Andreas Lehmkühler


[1] https://issues.apache.org/jira/browse/PDFBOX-521
