Hi,

I have a somewhat complicated PDF that I want Tika to parse. The PDF was 
produced with Adobe InDesign and has multiple columns and footnotes. It looks 
nice, but Tika (PDFBox) does not process it linearly. The returned text 
starts with the second column, then I get some footnotes, then the last half of 
the first column, etc. As a result, the returned text is not in the order in 
which I read it in the PDF. I tried setting the sortByPosition flag, but that 
results in output with line 1 of column 1, followed by line 1 of column 2, 
followed by line 2 of column 1, etc. 
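
To illustrate the interleaving, here is a toy Java sketch. It is not how 
PDFBox implements sortByPosition; the Fragment class, the coordinates, and the 
column boundary at x = 300 are all made up for the example. It only shows why a 
pure top-to-bottom, left-to-right sort mixes the columns, and what a 
column-aware ordering would look like in comparison.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class ColumnOrderDemo {

    // Hypothetical stand-in for one positioned line of text (not a PDFBox type).
    // y is assumed to be the distance from the top of the page.
    public static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    // Pure position sort: top to bottom first, then left to right.
    public static List<String> sortByPosition(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y)
                .thenComparingDouble(f -> f.x));
        return sorted.stream().map(f -> f.text).collect(Collectors.toList());
    }

    // Column-aware sort: assign each fragment to a column using an assumed
    // boundary, then order by column and only then top to bottom.
    public static List<String> sortByColumn(List<Fragment> fragments, double boundaryX) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingInt((Fragment f) -> f.x < boundaryX ? 0 : 1)
                .thenComparingDouble(f -> f.y));
        return sorted.stream().map(f -> f.text).collect(Collectors.toList());
    }

    // A made-up two-column page, deliberately listed out of reading order.
    public static List<Fragment> samplePage() {
        List<Fragment> page = new ArrayList<>();
        page.add(new Fragment("col 2, line 1", 320, 100));
        page.add(new Fragment("col 1, line 2", 50, 120));
        page.add(new Fragment("col 2, line 2", 320, 120));
        page.add(new Fragment("col 1, line 1", 50, 100));
        return page;
    }

    public static void main(String[] args) {
        System.out.println(sortByPosition(samplePage()));
        System.out.println(sortByColumn(samplePage(), 300));
    }
}
```

The first list comes out interleaved (column 1 line 1, column 2 line 1, ...), 
which matches what I see with the flag set; the second is the reading order I 
would want. A real page would of course need the column boundaries detected 
rather than hard-coded.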

As far as I can see, sortByPosition is the only parameter I can use to tune 
this, or am I missing something? I would love to attach the PDF I am working 
with, but since it is copyrighted content, I can't post it to this list. If 
someone wants to have a look off-list, I can mail it.

Cheers,
Pieter
