Hi Ted

Most of the information you need is stored in TextPosition objects, which 
encapsulate small text tokens, usually fragments of words. You can extend 
the PDFTextStripper class and capture the TextPosition objects by overriding 
the write* methods (writeCharacters/writeWordSeparator/writeLineSeparator/ 
writePage). It may be a good idea to transform the TextPosition objects into 
a higher-level representation consisting of words/lines/paragraphs and then 
train your classifiers on that. 
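
To illustrate the second step, here is a rough sketch of how collected 
tokens could be grouped into lines and joined into text. It does not use 
PDFBox itself: Token is a hypothetical stand-in for the text, position and 
width you would read from each TextPosition, and LineGrouper, yTolerance and 
gapThreshold are names and parameters made up for this example. It also 
assumes that y grows down the page and that a simple tolerance is enough to 
decide whether two tokens share a line, which will not hold for every 
layout. 

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for the values read from one TextPosition. */
final class Token {
    final String text;
    final float x;      // left edge of the token
    final float y;      // vertical position (assumed to grow down the page)
    final float width;  // horizontal extent of the token

    Token(String text, float x, float y, float width) {
        this.text = text;
        this.x = x;
        this.y = y;
        this.width = width;
    }
}

final class LineGrouper {
    /**
     * Groups tokens into lines. A token joins the current line when its y
     * differs from the line's first token by at most yTolerance.
     */
    static List<List<Token>> groupIntoLines(List<Token> tokens, float yTolerance) {
        List<Token> sorted = new ArrayList<>(tokens);
        sorted.sort(Comparator.comparingDouble((Token t) -> t.y));

        List<List<Token>> lines = new ArrayList<>();
        for (Token t : sorted) {
            List<Token> current = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (current == null || Math.abs(t.y - current.get(0).y) > yTolerance) {
                current = new ArrayList<>();
                lines.add(current);
            }
            current.add(t);
        }
        // Within each line, order tokens left to right.
        for (List<Token> line : lines) {
            line.sort(Comparator.comparingDouble((Token t) -> t.x));
        }
        return lines;
    }

    /**
     * Joins the tokens of one line, inserting a space wherever the gap
     * between two neighbouring tokens is wider than gapThreshold.
     */
    static String joinLine(List<Token> line, float gapThreshold) {
        StringBuilder sb = new StringBuilder();
        Token prev = null;
        for (Token t : line) {
            if (prev != null && t.x - (prev.x + prev.width) > gapThreshold) {
                sb.append(' ');
            }
            sb.append(t.text);
            prev = t;
        }
        return sb.toString();
    }
}
```

In a real stripper subclass you would fill the token list from the 
overridden write* methods and call groupIntoLines once per page. The line 
boxes (min/max x and y of each group) are then natural features for a 
header/footer/column classifier. 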

Cheers,
Robert

[1] http://issues.apache.org/jira/browse/PDFBOX-521


----- Original Message -----
From: "Ted Dunning" <[email protected]>
To: [email protected]
Sent: Friday, 29 January 2010 00:24:56
Subject: trying to do better text extraction

I am working on text extraction from some text.  As you might expect,
results are pretty good for very simple documents and very bad for some
fancy ones.

Two-column documents with headers, footers, and text insets are
particularly ugly.  Using the -sort option to TextExtract makes things much
worse, since lines from the insets and columns all get mixed together.

I have an idea that I could build a classifier using simple machine learning
that would quickly get the idea of what is a header and footer and would be
able to block columns together.  Given a set of non-header blocks of text,
it should be pretty simple to discern the text flow.

Thus my problem is how to find the locations of, and rough presentation
information about, blocks of text in a PDF document.  If there is an easy way
to hook in during the text extraction process, that would be great.  Also,
if there is a way to get more verbose structural information out of the
textExtract system, that would be great.

Does anybody have any suggestions?

-- 
Ted Dunning, CTO
DeepDyve
