Hi,

After more testing I can confirm that the issue occurs when PDFBox parses a stream in which a token is split across that stream and the next one.
That is, the whole token does not occur in the stream being parsed. Perhaps there is a way to get all the tokens in the page content, with PDFBox reading the streams as necessary, rather than parsing the individual streams one by one the way I am doing at the minute.

In this excerpt you can clearly see where the COSDictionary is split across the stream boundary:

    /Span <</Lang (en-GB)/MCID 8 >>BDC
    BT
    9 0 0 9 99.3376 555.6879 Tm
    (Some text)Tj
    ET
    EMC
    /Span <</Lang
    endstream
    endobj
    19 0 obj
    << /Length 2852 >>
    stream
    (en-GB)/MCID 9 >>BDC
    BT
    9 0 0 9 145.7323 555.6879 Tm
    (Some more text)Tj
    ET
    EMC

Best Wishes,
Malcolm.
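P.S. To illustrate what I mean by reading the page content as a whole, here is a minimal sketch in plain Java. It deliberately does not use any PDFBox API: the class name ContentStreamJoiner and both methods are hypothetical names of my own, and it assumes the streams have already been decoded to bytes. As far as I understand the PDF specification, the streams in a page's /Contents array are meant to be treated as if they were concatenated in order, so the sketch joins them (with a whitespace byte in between) before looking at any tokens. The balance check is only a crude stand-in for a real parser.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
final class ContentStreamJoiner {

    /** Joins already-decoded content streams in order, adding a newline after each part. */
    static byte[] join(List<byte[]> decodedStreams) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] part : decodedStreams) {
            out.write(part, 0, part.length);
            out.write('\n'); // whitespace so tokens on either side of a boundary cannot fuse
        }
        return out.toByteArray();
    }

    /**
     * Naive check that every "<<" has a matching ">>".
     * Assumption: no string literal or hex string contains "<<" or ">>".
     */
    static boolean dictionariesBalanced(String content) {
        int depth = 0;
        for (int i = 0; i + 1 < content.length(); i++) {
            char a = content.charAt(i);
            char b = content.charAt(i + 1);
            if (a == '<' && b == '<') {
                depth++;
                i++; // skip the second '<'
            } else if (a == '>' && b == '>') {
                if (--depth < 0) {
                    return false; // dictionary closed before it was opened
                }
                i++; // skip the second '>'
            }
        }
        return depth == 0;
    }

    public static void main(String[] args) {
        Charset cs = StandardCharsets.ISO_8859_1;
        // The two stream bodies from the excerpt, shortened.
        String first = "/Span <</Lang (en-GB)/MCID 8 >>BDC BT (Some text)Tj ET EMC /Span <</Lang";
        String second = "(en-GB)/MCID 9 >>BDC BT (Some more text)Tj ET EMC";

        String joined = new String(
                join(Arrays.asList(first.getBytes(cs), second.getBytes(cs))), cs);

        System.out.println("first alone balanced:  " + dictionariesBalanced(first));
        System.out.println("second alone balanced: " + dictionariesBalanced(second));
        System.out.println("joined balanced:       " + dictionariesBalanced(joined));
    }
}
```

With the two fragments from my excerpt, each stream on its own fails the balance check and only the joined text passes, which matches what I am seeing: the dictionary is only complete once both streams are read together.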

