I am working on text extraction from some PDF documents. As you might expect, results are pretty good for very simple documents and very bad for some fancy ones.
Two-column documents with headers, footers, and text insets are particularly ugly. Using the -sort option to TextExtract makes things much worse, since lines from the insets and columns all get mixed together.

I have an idea that I could build a classifier using simple machine learning that would quickly learn what is a header or footer and would be able to group columns into blocks. Given a set of non-header blocks of text, it should be pretty simple to discern the text flow.

Thus my problem is how to find the locations of, and rough presentation information about, blocks of text in a PDF document. If there is an easy way to hook in during the text extraction process, that would be great. Also, if there is a way to get more verbose structural information out of the TextExtract system, that would be great too.

Does anybody have any suggestions?

-- Ted Dunning, CTO DeepDyve
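
To make the text-flow step concrete, here is a minimal sketch of what I have in mind once block locations are available. It is only an illustration under assumptions: the `Block` record, the `reading_order` function, and the 0.6 full-width threshold are all hypothetical and not part of any PDF library, headers and footers are assumed to be already removed, and y is assumed to grow downward on the page.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    """Hypothetical text block with a bounding box (y grows downward)."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def reading_order(blocks: List[Block], page_width: float) -> List[Block]:
    """Order non-header blocks on a two-column page.

    Blocks wider than 60% of the page (an assumed threshold) are treated
    as full-width and split the page into horizontal bands. Within each
    band, the left column is read top to bottom, then the right column.
    """
    mid = page_width / 2.0
    ordered: List[Block] = []
    left: List[Block] = []
    right: List[Block] = []
    for b in sorted(blocks, key=lambda blk: blk.y0):
        if (b.x1 - b.x0) > 0.6 * page_width:
            # A full-width block closes the current band.
            ordered += left + right
            left, right = [], []
            ordered.append(b)
        elif (b.x0 + b.x1) / 2.0 < mid:
            left.append(b)
        else:
            right.append(b)
    # Flush whatever remains in the last band.
    return ordered + left + right


if __name__ == "__main__":
    page = [
        Block(320, 100, 580, 200, "right column, first block"),
        Block(20, 100, 280, 200, "left column, first block"),
        Block(20, 20, 580, 60, "full-width title"),
    ]
    for blk in reading_order(page, page_width=600.0):
        print(blk.text)
```

The missing piece is exactly the input to this function: some way to get the bounding boxes and rough font information for each block out of the extraction process.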

