Quoting Rodrigo Caniçali <[email protected]>:

Hi,
Hi Rodrigo,

I found in a mailing list post from 2012-jun-14 that this problem has already been discussed, but my case is rather different.
I think I found the discussion.

I also get the warning "Did not found XRef object at specified startxref position xxx" when executing the main function of the org.apache.pdfbox.ExtractText class. However, some of the PDF text is ignored and is not written to the output TXT file. The same text is displayed by Acrobat Reader and can be copied as text from that program.

Your document is broken. It works with Acrobat Reader only because Acrobat Reader does not enforce the specification strictly.

Many developers who write a PDF writer test it against Acrobat Reader and do not always follow the specification. In effect, the goal becomes producing documents that Acrobat Reader accepts rather than documents that conform to the specification. This leads to the problem that third-party libraries like pdfbox sometimes can't parse such documents.

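In your case the xref table isn't where the parser expects it: the offset given after the startxref keyword at the end of the file does not point to a cross-reference section. If you can provide us with such a document, we can try to find the cause of the problem and maybe fix it.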
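
If you want to check this yourself first, the following is a minimal sketch in plain Java. It is not part of the pdfbox API; the class name StartXrefCheck and both method names are made up for illustration. It reads the offset written after the last startxref keyword and tests whether the bytes at that offset begin with the xref keyword or with an object header of the form "N G obj" (which is how a cross-reference stream would start). It assumes the whole file fits in memory and does not attempt any repair.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration only; not part of the pdfbox API.
final class StartXrefCheck {

    // Returns the byte offset written after the last "startxref" keyword,
    // or -1 if the keyword or a following number cannot be found.
    static long findStartXref(byte[] pdf) {
        // ISO-8859-1 maps every byte to exactly one char,
        // so string indexes are identical to byte offsets.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int keyword = text.lastIndexOf("startxref");
        if (keyword < 0) {
            return -1;
        }
        int i = keyword + "startxref".length();
        while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
            i++;
        }
        int start = i;
        while (i < text.length() && text.charAt(i) >= '0' && text.charAt(i) <= '9') {
            i++;
        }
        if (i == start) {
            return -1; // keyword present, but no digits after it
        }
        try {
            return Long.parseLong(text.substring(start, i));
        } catch (NumberFormatException e) {
            return -1; // digit run too long to be a valid offset
        }
    }

    // True if the bytes at the given offset start with "xref" (classic table)
    // or with "N G obj" (possible cross-reference stream object).
    static boolean looksLikeXref(byte[] pdf, long offset) {
        if (offset < 0 || offset >= pdf.length) {
            return false;
        }
        int from = (int) offset;
        int length = Math.min(32, pdf.length - from);
        String head = new String(pdf, from, length, StandardCharsets.ISO_8859_1);
        return head.startsWith("xref") || head.matches("(?s)\\d+\\s+\\d+\\s+obj.*");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java StartXrefCheck <file.pdf>");
            return;
        }
        byte[] pdf = Files.readAllBytes(Paths.get(args[0]));
        long offset = findStartXref(pdf);
        System.out.println("startxref offset: " + offset);
        System.out.println("xref found there: " + looksLikeXref(pdf, offset));
    }
}
```

If the second line of output says false, the startxref offset in the file is wrong, which would match the warning you are seeing.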


If the option "-nonSeq" is selected, a "java.io.IOException: Error: Expected a long type, actual=..." appears, which stops the text extraction.
Maybe you can post the first three lines of the stack trace; that would help with debugging the problem.

Please, is there any way to make it work?
It is nearly impossible to reconstruct such cases without the file. If you can provide us with more information, or maybe the document itself, it will help us improve the parser, if that is possible.

We do our best to support as many documents as we can, but in some cases we need to be strict so that we do not break the documents that already parse fine. This problem is also one point on the agenda for the pdfbox 2.0.0 version.


Thanks,

Rodrigo

Best regards
Thomas
