Re: WARNING: Did not found XRef object at specified startxref position

Thomas Chojecki Wed, 13 Nov 2013 13:50:59 -0800

Hi Rodrigo,

It looks like the startxref position (52779) is wrong and points into a stream instead of at the beginning of an xref table or xref stream. The value inside the exception shows a compressed string, which might be the xref stream data.
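If you prefer to check this programmatically rather than by eye, here is a minimal sketch in plain Java (no PDFBox API involved). It reads the last startxref value and tests whether the data at that offset starts with the "xref" keyword or with an object header such as "80 0 obj". The class and method names (StartXrefCheck, readStartXref, looksLikeXrefStart) are made up for this illustration, and the check is only a plausibility test, not a full parse.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of PDFBox.
final class StartXrefCheck {

    // "startxref", whitespace, then the decimal byte offset.
    private static final Pattern STARTXREF = Pattern.compile("startxref\\s+(\\d{1,18})");
    // An xref stream begins with an object header such as "80 0 obj".
    private static final Pattern OBJ_HEADER = Pattern.compile("\\d+\\s+\\d+\\s+obj\\b");

    /** Returns the offset named by the last startxref keyword, or -1 if there is none. */
    static long readStartXref(String pdf) {
        Matcher m = STARTXREF.matcher(pdf);
        long offset = -1;
        while (m.find()) {
            offset = Long.parseLong(m.group(1)); // keep the last occurrence
        }
        return offset;
    }

    /** True if the data at the offset begins with "xref" or with an object header. */
    static boolean looksLikeXrefStart(String pdf, long offset) {
        if (offset < 0 || offset >= pdf.length()) {
            return false;
        }
        int pos = (int) offset;
        if (pdf.startsWith("xref", pos)) {
            return true; // classic xref table
        }
        Matcher m = OBJ_HEADER.matcher(pdf);
        m.region(pos, pdf.length());
        return m.lookingAt(); // possible xref stream object
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: StartXrefCheck <file.pdf>");
            return;
        }
        // ISO-8859-1 maps each byte to exactly one char, so string indexes equal byte offsets.
        String pdf = new String(Files.readAllBytes(Paths.get(args[0])),
                StandardCharsets.ISO_8859_1);
        long offset = readStartXref(pdf);
        System.out.println("startxref = " + offset
                + ", plausible = " + looksLikeXrefStart(pdf, offset));
    }
}
```

If "plausible" comes out false for your file, that would match the warning you are seeing.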

You can open a hex editor, jump directly to position 52779, and look nearby for an object that looks like this:


,---

80 0 obj <<
/Type /XRef
/Index [0 424]
/Size 424
/W [1 3 1]
/Root 421 0 R
/Info 422 0 R
/ID [<14895AE8C3218939710EBBFF5EAD0E28> <14895AE8C3218939710EBBFF5EAD0E28>]
/Length 1073
/Filter /FlateDecode
>>
stream
...
endstream
endobj

`---

If you find this object with the /Type /XRef entry, go to the beginning of it (in this case the 80 0 obj) and write down its byte position. Then go to the end of the file, overwrite the startxref value 52779 with the position you noted, and try to parse the document again.
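The same manual steps can be sketched in plain Java if you do not want to use a hex editor. This is only an illustration under a few assumptions: the file has a single xref stream (or the last one in the file is the one you want), the object header sits directly before the dictionary containing /Type /XRef, and the names XrefStreamLocator, findXrefStreamOffset and patchStartXref are invented for this example and are not PDFBox API. It writes the patched result to a second file so the original stays untouched.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of PDFBox.
final class XrefStreamLocator {

    // Object header such as "80 0 obj".
    private static final Pattern OBJ_HEADER = Pattern.compile("\\d+\\s+\\d+\\s+obj\\b");
    // Dictionary entry that marks a cross-reference stream.
    private static final Pattern XREF_TYPE = Pattern.compile("/Type\\s*/XRef\\b");
    // "startxref", the whitespace after it, and the old offset.
    private static final Pattern STARTXREF = Pattern.compile("startxref(\\s+)\\d+");

    /** Returns the offset of the object holding the last /Type /XRef entry, or -1. */
    static int findXrefStreamOffset(String pdf) {
        int typePos = -1;
        Matcher t = XREF_TYPE.matcher(pdf);
        while (t.find()) {
            typePos = t.start(); // keep the last occurrence
        }
        if (typePos < 0) {
            return -1;
        }
        // Take the nearest object header that starts before the /Type entry.
        int objPos = -1;
        Matcher h = OBJ_HEADER.matcher(pdf);
        while (h.find() && h.start() < typePos) {
            objPos = h.start();
        }
        return objPos;
    }

    /** Replaces the number after the last startxref keyword with newOffset. */
    static String patchStartXref(String pdf, int newOffset) {
        Matcher m = STARTXREF.matcher(pdf);
        int start = -1;
        int end = -1;
        String whitespace = null;
        while (m.find()) {
            start = m.start();
            end = m.end();
            whitespace = m.group(1);
        }
        if (start < 0) {
            return pdf; // no startxref keyword, nothing to patch
        }
        return pdf.substring(0, start) + "startxref" + whitespace + newOffset
                + pdf.substring(end);
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: XrefStreamLocator <in.pdf> <out.pdf>");
            return;
        }
        // ISO-8859-1 keeps a one-to-one mapping between bytes and chars.
        String pdf = new String(Files.readAllBytes(Paths.get(args[0])),
                StandardCharsets.ISO_8859_1);
        int offset = findXrefStreamOffset(pdf);
        if (offset < 0) {
            System.err.println("No /Type /XRef object found");
            return;
        }
        Files.write(Paths.get(args[1]),
                patchStartXref(pdf, offset).getBytes(StandardCharsets.ISO_8859_1));
        System.out.println("startxref rewritten to " + offset);
    }
}
```

Because the startxref entry sits after all objects at the end of the file, a new value with a different number of digits does not shift any of the object positions before it.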

This should work, and it would indicate that the PDF creator you are using writes wrong object positions. PDFBox can only parse documents that provide correct xref tables or streams; otherwise the parser does not know how to handle the document.


Best regards
Thomas


Quoting Rodrigo Caniçali <[email protected]>:

Hi Thomas,

Below is the stacktrace when the option “-nonSeq” is enabled:
Loading PDF D:\Documents and Settings\05215385726\Meusdocumentos\rpf_tributos.pdf
Exception in thread "main" java.io.IOException: Error: Expected a long type, actual='!@:g8lJLDX5I'H%oMioAqC?O$d[,X]%dZ#a?Wos'
at org.apache.pdfbox.pdfparser.BaseParser.readLong(BaseParser.java:1668)
at org.apache.pdfbox.pdfparser.BaseParser.readObjectNumber(BaseParser.java:1598)
at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parseXrefObjStream(NonSequentialPDFParser.java:460)
at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.initialParse(NonSequentialPDFParser.java:358)
at org.apache.pdfbox.pdfparser.NonSequentialPDFParser.parse(NonSequentialPDFParser.java:702)
at org.apache.pdfbox.pdmodel.PDDocument.loadNonSeq(PDDocument.java:1139)
at org.apache.pdfbox.ExtractText.startExtraction(ExtractText.java:208)
at org.apache.pdfbox.ExtractText.main(ExtractText.java:85)
When that option is disabled, the following warnings are printed on the Eclipse console and some text of the PDF document is not extracted:
Loading PDF D:\Documents and Settings\05215385726\Meusdocumentos\rpf_tributos.pdf
Nov 04, 2013 10:16:13 AM org.apache.pdfbox.pdfparser.XrefTrailerResolver setStartxref
WARNING: Did not found XRef object at specified startxref position 52779
Time for loading: 0.125 seconds
Starting text extraction
Writing to D:\Documents and Settings\05215385726\Meusdocumentos\rpf_tributos.txt
Nov 04, 2013 10:16:14 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: o
Nov 04, 2013 10:16:14 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: Os
Nov 04, 2013 10:16:14 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: a
Nov 04, 2013 10:16:14 AM org.apache.pdfbox.util.PDFStreamEngine processOperator
INFO: unsupported/disabled operation: su

Thanks,

Rodrigo
On Saturday, 2 November 2013 10:24, Rodrigo Caniçali <[email protected]> wrote:
Hi Thomas,

Thanks for your answer.
I am afraid the document is confidential, but I can provide the stacktrace and find out if it is possible to generate a non-confidential example on Monday, when I will be at the office again.
Best regards,
Rodrigo
On Saturday, 2 November 2013 5:50, Thomas Chojecki <[email protected]> wrote:
Quoting Rodrigo Caniçali <[email protected]>:
> Hi,

Hi Rodrigo,

> I found on a mailing list from 2012-Jun-14 that this problem has
> already been discussed, but my case is quite different.

I think I found the discussion.

> I also get the warning "Did not found XRef object at specified
> startxref position xxx" when executing the main function
> of the org.apache.pdfbox.ExtractText class. However, some PDF texts are
> ignored and are not printed to the output TXT file. These same texts
> are displayed by Acrobat Reader and can be copied by the user as
> text from that program.

Your document is broken. It works with Acrobat Reader because Acrobat
Reader is not strict about the specification.

Many developers who write a PDF writer test it against Acrobat Reader
and do not always follow the specification. So the goal becomes
creating documents that Acrobat Reader accepts, not documents that
conform to the specification. This leads to the problem that
third-party libraries like PDFBox sometimes cannot parse such
documents.

In your case the xref table is not where the parser expects it. If you
can provide us with such a document, we can try to find the cause of
the problem and maybe fix it.

> If the option "-nonSeq" is selected, then a
> "java.io.IOException: Error: Expected a long type, actual=..." appears,
> which stops the text extraction.

Maybe you can post the first three lines of the stacktrace; this will
help with debugging the problem.

> Please, is there any way to make it work?

It is nearly impossible to reconstruct such cases. If you can provide
us with more information or maybe the document, it will help us improve
the parser, if possible.

We do our best to support as many documents as we can, but in some
cases we need to be strict in order to keep supporting the documents
that already parse fine. This problem is also one point on the agenda
for PDFBox 2.0.0.

> Thanks,
>
> Rodrigo

Best regards
Thomas
