Hi,
Please upload your file to a sharehoster.
Does it happen when the file is opened with one of the command line tools, or from code? If from code, what is the smallest code that reproduces it? Is the file local or downloaded from a server? If downloaded, is the downloaded file identical to the one you have locally?
Tilman

On 10.01.2026 at 23:54, mountain the blue wrote:
hi,

With an old version of PDFBox (2.x), I have encountered an issue when
opening a PDF file generated by Word 365.

I was able to reproduce the same error on the latest version of PDFBox
(3.0.6+) with sample code that just tries to open the PDF file.

The error output is:
Jan 10, 2026 2:13:23 PM org.apache.pdfbox.pdfparser.COSParser
validateStreamLength
WARNING: The end of the stream doesn't point to the correct offset, using
workaround to read the stream, stream start position: 827917, length: 2118,
expected end position: 830035
Jan 10, 2026 2:13:23 PM org.apache.pdfbox.pdfparser.COSParser
validateStreamLength
WARNING: The end of the stream doesn't point to the correct offset, using
workaround to read the stream, stream start position: 934902, length: 1097,
expected end position: 935999
Exception in thread "main" java.io.IOException: Page tree root must be a
dictionary
at org.apache.pdfbox.pdfparser.COSParser.checkPages(COSParser.java:1416)
at org.apache.pdfbox.pdfparser.PDFParser.initialParse(PDFParser.java:120)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:171)
at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:136)
at org.apache.pdfbox.Loader.loadPDF(Loader.java:483)
at org.apache.pdfbox.Loader.loadPDF(Loader.java:359)

Looking further at the code, the parser falls back to 'brute force' parsing
and does not find the expected Pages entry.

What happens is that
org.apache.pdfbox.pdfparser.PDFXrefStreamParser#parse() is called
and, at some stage, records an xref entry with offset 0.
That entry is later checked by
org.apache.pdfbox.pdfparser.COSParser#validateXrefOffsets(), which
cannot resolve an object at that offset
(see org.apache.pdfbox.pdfparser.COSParser#findObjectKey). The parsing is
therefore reset, which triggers the 'brute force' approach.

The org.apache.pdfbox.pdfparser.PDFXrefStreamParser#parse() method
currently does ...

...

// second field holds the offset (type 1) or the object stream number (type 2)
long offset = parseValue(currLine, w[0], w[1]);
// third field may hold the generation number (type 1) or the index
// within an object stream (type 2)
int thirdValue = (int) parseValue(currLine, w[0] + w[1], w[2]);

...
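For readers unfamiliar with cross-reference streams, the following is a minimal standalone sketch of how one entry is split into its three fields. It is not the PDFBox implementation: the class name, the sample /W widths {1, 3, 1}, and the sample bytes are assumptions made for illustration, and parseValue here is a simplified stand-in that reads an unsigned big-endian integer.

```java
class XrefEntryDecodeSketch {
    // Simplified stand-in (assumption): reads `length` bytes starting at
    // `start` as an unsigned big-endian integer.
    static long parseValue(byte[] data, int start, int length) {
        long value = 0;
        for (int i = 0; i < length; i++) {
            value = (value << 8) | (data[start + i] & 0xFF);
        }
        return value;
    }

    public static void main(String[] args) {
        // Hypothetical /W array: 1 byte for the type, 3 for the second
        // field, 1 for the third field.
        int[] w = {1, 3, 1};
        // Hypothetical entry: type 1, offset 0x0003E8 (1000), generation 0.
        byte[] entry = {0x01, 0x00, 0x03, (byte) 0xE8, 0x00};

        long type = parseValue(entry, 0, w[0]);
        // second field: offset (type 1) or object stream number (type 2)
        long second = parseValue(entry, w[0], w[1]);
        // third field: generation number (type 1) or index in the object stream (type 2)
        long third = parseValue(entry, w[0] + w[1], w[2]);

        System.out.println(type + " " + second + " " + third); // prints "1 1000 0"
    }
}
```

An entry whose second field decodes to 0 with type 1 would claim that an object starts at the very beginning of the file, which is the situation described above.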


*Q1*: Can we add a check in the code to exclude recording an xref entry if
the offset is either less than 6
(org.apache.pdfbox.pdfparser.COSParser#MINIMUM_SEARCH_OFFSET) or is 0,
so that PDFBox can accept such incorrect file(s)?

i.e.:

// second field holds the offset (type 1) or the object stream number (type 2)
long offset = parseValue(currLine, w[0], w[1]);

if (0 == offset) {    // found some incorrect PDF files that had such an xref entry
    continue;
}

// third field may hold the generation number (type 1) or the index
// within an object stream (type 2)
int thirdValue = (int) parseValue(currLine, w[0] + w[1], w[2]);
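To make the idea testable outside PDFBox, here is a standalone sketch of the filtering variant that uses the minimum offset rather than only 0. Everything in it is an assumption for illustration: the class and method names are hypothetical, the constant simply mirrors the value 6 quoted above, and the check is restricted to type 1 entries because the second field of a type 2 entry is an object stream number, not an offset. Whether silently skipping such entries is the right fix is a separate question for the maintainers.

```java
import java.util.ArrayList;
import java.util.List;

class XrefOffsetFilterSketch {
    // Assumption: same value as the MINIMUM_SEARCH_OFFSET mentioned above.
    static final long MINIMUM_SEARCH_OFFSET = 6;

    // Simplified stand-in: unsigned big-endian integer of `length` bytes.
    static long parseValue(byte[] data, int start, int length) {
        long value = 0;
        for (int i = 0; i < length; i++) {
            value = (value << 8) | (data[start + i] & 0xFF);
        }
        return value;
    }

    // Returns the offsets of all type 1 entries that pass the plausibility
    // check; type 0 and type 2 entries are ignored in this sketch.
    static List<Long> plausibleOffsets(byte[] table, int[] w) {
        int entrySize = w[0] + w[1] + w[2];
        List<Long> offsets = new ArrayList<>();
        for (int pos = 0; pos + entrySize <= table.length; pos += entrySize) {
            // a type field of width 0 means the type defaults to 1
            long type = w[0] == 0 ? 1 : parseValue(table, pos, w[0]);
            if (type != 1) {
                continue;
            }
            long offset = parseValue(table, pos + w[0], w[1]);
            if (offset < MINIMUM_SEARCH_OFFSET) {
                continue; // implausible offset: do not record this entry
            }
            offsets.add(offset);
        }
        return offsets;
    }

    public static void main(String[] args) {
        int[] w = {1, 2, 1}; // hypothetical /W array
        byte[] table = {
            0x01, 0x00, 0x00, 0x00,        // type 1, offset 0: skipped
            0x01, 0x03, (byte) 0xE8, 0x00, // type 1, offset 1000: kept
            0x02, 0x00, 0x05, 0x03         // type 2: ignored here
        };
        System.out.println(plausibleOffsets(table, w)); // prints "[1000]"
    }
}
```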


If PDFBox cannot be changed to accommodate such files ...

*Q2*: Would you have any recommendation to share?

Thank you,


