Hi,

I have a somewhat complicated PDF that I want Tika to parse. It was produced with Adobe InDesign and has multiple columns and footnotes. It looks nice, but Tika (PDFBox) does not process it linearly: the returned text starts with the second column, then come some footnotes, then the last half of the first column, and so on. As a result, the text is not returned in the order in which I read it in the PDF.

I tried setting the sortByPosition flag, but that produces output with line 1 of column 1, followed by line 1 of column 2, followed by line 2 of column 1, etc.
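
To make the interleaving concrete, here is a toy sketch in plain Java. It is not how Tika or PDFBox work internally; the Fragment class, the coordinates, and the gutterX parameter are all made up for illustration. It only shows why a pure top-to-bottom, left-to-right sort mixes the columns, and what a column-aware ordering would look like if the column boundary were known.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Toy model only: hypothetical names, not Tika or PDFBox classes.
class ColumnOrderDemo {

    // One line of text with the position of its top-left corner (y grows downwards).
    static final class Fragment {
        final double x;
        final double y;
        final String text;

        Fragment(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Two columns of two lines each, deliberately stored out of reading order.
    static List<Fragment> sample() {
        List<Fragment> fragments = new ArrayList<>();
        fragments.add(new Fragment(300, 100, "col2-line1"));
        fragments.add(new Fragment(50, 100, "col1-line1"));
        fragments.add(new Fragment(300, 120, "col2-line2"));
        fragments.add(new Fragment(50, 120, "col1-line2"));
        return fragments;
    }

    // Pure positional sort: top to bottom, then left to right.
    // Lines at the same height in different columns end up next to each other.
    static List<String> byPosition(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return texts(sorted);
    }

    // Column-aware sort: assumes a single known gutter at gutterX that
    // separates the left column from the right one.
    static List<String> byColumn(List<Fragment> fragments, double gutterX) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingInt((Fragment f) -> f.x < gutterX ? 0 : 1)
                .thenComparingDouble(f -> f.y));
        return texts(sorted);
    }

    private static List<String> texts(List<Fragment> fragments) {
        List<String> out = new ArrayList<>();
        for (Fragment f : fragments) {
            out.add(f.text);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(byPosition(sample()));
        System.out.println(byColumn(sample(), 200));
    }
}
```

With this sample data, byPosition gives col1-line1, col2-line1, col1-line2, col2-line2 (the interleaving I am seeing), while byColumn with a gutter at x = 200 gives the two lines of column 1 followed by the two lines of column 2. The catch is that the gutter position has to come from somewhere, and I don't see a way to supply that kind of layout hint.
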
As far as I can see, sortByPosition is the only parameter I can use to tune this, or am I missing something?

I would love to attach the PDF I am working with, but since it is copyrighted content, I can't post it to this list. If someone wants to have a look off-list, I can mail it.

Cheers,
Pieter
