Hi,

Sent: Fri, 29 Jan 2010
From: Ted Dunning <[email protected]>
> I am working on text extraction from some text. As you might expect,
> results are pretty good for very simple documents and very bad for some
> fancy ones.
>
> Two-column documents with headers and footers and text insets are
> particularly ugly. Using the -sort option to TextExtract makes things much
> worse since lines from the insets and columns are all mixed together.
>
> I have an idea that I could build a classifier using simple machine
> learning that would quickly get the idea of what is a header and footer
> and would be able to block columns together. Given a set of non-header
> blocks of text, it should be pretty simple to discern the text flow.
>
> Thus my problem is how to find out the locations and rough presentation
> information about blocks of text in a PDF document. If there is an easy
> way to hook in during the text extraction process, that would be great.
> Also, if there is a way to get more verbose structural information out of
> the textExtract system, that would be great.
>
> Does anybody have any suggestions?

Have a look at [1]. Mel and some others implemented an alternative version
of the TextStripper with additional features concerning the text structure.

> Ted Dunning, CTO
> DeepDyve

BR
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX-521
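To make the "block columns together, then discern the text flow" idea from the quoted question concrete, here is a minimal, library-independent sketch in Python. It is not PDFBox code and not the approach taken in PDFBOX-521. It assumes the text blocks and their bounding boxes have already been obtained somehow, and it swaps the proposed learned classifier for a fixed margin rule. The Block type, the reading_order function, the margin value, and the top-left coordinate origin are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Hypothetical text block: bounding box with y growing downward."""
    text: str
    x0: float  # left edge
    y0: float  # top edge
    x1: float  # right edge
    y1: float  # bottom edge


def reading_order(blocks, page_height, margin=0.08):
    """Return block texts in an estimated reading order.

    Assumption: anything lying entirely inside the top or bottom
    margin band is a header or footer. A trained classifier, as
    proposed in the question, would replace this fixed rule.
    """
    top = page_height * margin
    bottom = page_height * (1 - margin)
    body = [b for b in blocks if b.y1 > top and b.y0 < bottom]

    # Group blocks into columns: a block joins the first column
    # whose horizontal extent it overlaps, else starts a new one.
    columns = []  # each entry: [x0, x1, [blocks]]
    for b in sorted(body, key=lambda blk: blk.x0):
        for col in columns:
            if b.x0 < col[1] and b.x1 > col[0]:
                col[0] = min(col[0], b.x0)
                col[1] = max(col[1], b.x1)
                col[2].append(b)
                break
        else:
            columns.append([b.x0, b.x1, [b]])

    # Read columns left to right, and each column top to bottom.
    columns.sort(key=lambda col: col[0])
    ordered = []
    for _, _, members in columns:
        ordered.extend(sorted(members, key=lambda blk: blk.y0))
    return [b.text for b in ordered]
```

A full-width inset that spans both columns would merge them into a single column under this overlap rule, so real documents would need a smarter grouping step than this sketch provides.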

