Hi Ted,

Most of the information you need is stored in TextPosition objects, which encapsulate small text tokens, usually fragments of words. You can easily extend the PDFTextStripper class and grab the TextPosition objects by overriding the write* methods (writeCharacters/writeWordSeparator/writeLineSeparator/writePage). It might be a good idea to transform the TextPosition objects into a higher-level representation consisting of words/lines/paragraphs and then train your classifiers on that.
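
To give a rough idea of the "higher-level representation" step, here is a minimal sketch that groups positioned fragments into lines of text. It deliberately does not touch the PDFBox API: Fragment is a hypothetical stand-in for whatever you copy out of each TextPosition (text, left edge, baseline, width), and the class name, the two tolerance parameters and the assumption that y grows down the page are all mine, not anything PDFBox prescribes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for the data copied out of a TextPosition; not a PDFBox class. */
final class Fragment {
    final String text;
    final float x;      // left edge of the fragment
    final float y;      // baseline; assumed to grow down the page
    final float width;  // horizontal extent of the fragment

    Fragment(String text, float x, float y, float width) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
    }
}

final class FragmentLineGrouper {

    /**
     * Groups fragments into lines. A fragment joins the current line when its
     * baseline is within yTolerance of the line's first fragment. Inside a line,
     * a space is inserted when the horizontal gap exceeds wordGap.
     */
    static List<String> groupIntoLines(List<Fragment> fragments, float yTolerance, float wordGap) {
        // Sort a copy top to bottom so lines can be collected in one pass.
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> lines = new ArrayList<>();
        for (Fragment f : sorted) {
            List<Fragment> current = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (current == null || Math.abs(f.y - current.get(0).y) > yTolerance) {
                current = new ArrayList<>();
                lines.add(current);
            }
            current.add(f);
        }

        List<String> result = new ArrayList<>();
        for (List<Fragment> line : lines) {
            // Order each line left to right before joining the fragments.
            line.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            StringBuilder sb = new StringBuilder();
            Fragment prev = null;
            for (Fragment f : line) {
                if (prev != null && f.x - (prev.x + prev.width) > wordGap) {
                    sb.append(' ');
                }
                sb.append(f.text);
                prev = f;
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        List<Fragment> fragments = new ArrayList<>();
        fragments.add(new Fragment("world", 40f, 100f, 25f));
        fragments.add(new Fragment("Hel", 10f, 100f, 15f));
        fragments.add(new Fragment("Next", 10f, 115f, 20f));
        fragments.add(new Fragment("lo", 25f, 100.5f, 10f));
        // Prints "Hello world" and then "Next".
        for (String line : groupIntoLines(fragments, 2f, 2f)) {
            System.out.println(line);
        }
    }
}
```

Grouping purely by baseline like this will still merge two columns into one line, so for your documents you would probably want to split on large horizontal gaps first and keep the coordinates around as features for the classifier.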
Cheers,
Robert

[1] http://issues.apache.org/jira/browse/PDFBOX-521

----- Original Message -----
From: "Ted Dunning" <[email protected]>
To: [email protected]
Sent: Friday, 29 January 2010 00:24:56
Subject: trying to do better text extraction

I am working on text extraction from some text. As you might expect, results are pretty good for very simple documents and very bad for some fancy ones. Two-column documents with headers, footers and text insets are particularly ugly. Using the -sort option to TextExtract makes things much worse, since lines from the insets and columns all get mixed together.

I have an idea that I could build a classifier using simple machine learning that would quickly get the idea of what is a header or footer and would be able to block columns together. Given a set of non-header blocks of text, it should be pretty simple to discern the text flow.

Thus my problem is how to find out the locations of, and rough presentation information about, blocks of text in a PDF document. If there is an easy way to hook in during the text extraction process, that would be great. Also, if there is a way to get more verbose structural information out of the textExtract system, that would be great.

Does anybody have any suggestions?

--
Ted Dunning, CTO
DeepDyve

