Hi,

Yes, it is true that page content streams can be split. But PDFReader/PDFDebugger should be able to handle that, because deep inside, PDFStreamParser is called when a page is rendered. PDFReader/PDFDebugger do show the individual streams for debugging purposes, but they don't render them individually. So I'm wondering how it is possible that they fail but you succeed.

Re the Java warning: I get it too and never bothered to fix it. See
https://stackoverflow.com/questions/5354838/java-java-util-preferences-failing
https://stackoverflow.com/questions/16428098/groovy-shell-warning-could-not-open-create-prefs-root-node

Tilman


On 10.11.2017 at 10:04, Malcolm Vincent wrote:
Hi Tilman,

Thanks for replying. I'll see if I can get permission from the client
to upload the file.

The same behaviour occurs in PDFBox 2.0.x - I am on 2.0.8 currently.

I'm fairly sure now about what's happening.

The "issue" (if it is an issue) is that I was treating the streams the
same way that PDFReader does, and loading them one at a time.

It appears that this is not a safe thing to do, because the streams are
not complete, independently parseable entities. Some higher-level
constructs, such as a COSDictionary, can be split across stream
boundaries by the Adobe PDF generators. So although atomic tokens like
an int may generally be OK, more complex things are not.

This is why my code that uses PDFBox throws the warnings, and also
why it happens with the PDFReader / debug function in the app. Every
time I click on a stream containing a dictionary that is partly in one
stream and partly in the next, the parser logs a warning on the
console.

I am unclear exactly how this fits with the specification - a quick
"find" has not cleared it up - but I suppose in theory since PDF is a
binary format the stream could break at any byte and any token could
be split right in the middle.

Following on from that analysis this appears to be the way to get the
tokens on a page and process them ... at least it has resolved my
problem on the PDF files I am currently processing ...

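     // Parse the page as a whole so all of its content streams are read together
     PDPage page = my_pdf.getPage(i);
     PDFStreamParser parser = new PDFStreamParser(page);
     parser.parse();
     // processTokens() returns the rewritten content that replaces the page's streams
     page.setContents(processTokens(parser.getTokens()));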

where processTokens() is my worker function.

Of course this assumes that the generator has not broken atomic tokens
in the middle of the content since the PDFBox doc says streams parsed
this way are concatenated with a whitespace character between them.

For completeness here is a fragment of one of my PDFs which shows the
dictionary split across the end of one stream and the start of the
next ...

     /Span <</Lang (en-GB)/MCID 8 >>BDC
     BT
     9 0 0 9 99.3376 555.6879 Tm
     (Some text)Tj
     ET
     EMC
     /Span <</Lang
     endstream
     endobj
     19 0 obj
     <<
     /Length 2852
     >>
     stream
     (en-GB)/MCID 9 >>BDC
     BT
     9 0 0 9 145.7323 555.6879 Tm
     (Some more text)Tj
     ET
     EMC
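
To illustrate why the fragment above parses once the streams are joined, here is a minimal, stdlib-only Java sketch. It does not use PDFBox. The class SplitStreamDemo and its methods openDictionaries() and concatenate() are hypothetical names made up for this illustration, and the bracket counting is a toy check, not a real tokenizer. It only shows that each stream on its own leaves a dictionary unbalanced, while the two streams joined with a single space balance out.

```java
import java.util.Arrays;
import java.util.List;

public class SplitStreamDemo {

    // Returns the number of "<<" minus the number of ">>" in the fragment.
    // 0 means every dictionary opened in the fragment is also closed in it.
    // Toy check only: it ignores strings and comments that might contain
    // these characters.
    static int openDictionaries(String content) {
        int depth = 0;
        for (int i = 0; i + 1 < content.length(); i++) {
            String pair = content.substring(i, i + 2);
            if (pair.equals("<<")) {
                depth++;
                i++; // skip the second character of the delimiter
            } else if (pair.equals(">>")) {
                depth--;
                i++;
            }
        }
        return depth;
    }

    // Joins the streams with a single space between them (assumption: this
    // mirrors the whitespace-separated concatenation described above).
    static String concatenate(List<String> streams) {
        return String.join(" ", streams);
    }

    public static void main(String[] args) {
        // End of the first stream and start of the second, from the fragment above.
        String first = "(Some text)Tj\nET\nEMC\n/Span <</Lang";
        String second = "(en-GB)/MCID 9 >>BDC\nBT\n(Some more text)Tj\nET\nEMC";

        System.out.println("first alone:  " + openDictionaries(first));
        System.out.println("second alone: " + openDictionaries(second));
        System.out.println("concatenated: "
                + openDictionaries(concatenate(Arrays.asList(first, second))));
    }
}
```

In this fragment the split falls between /Lang and (en-GB), that is, between two tokens, so inserting whitespace at the boundary does no harm.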


Best Wishes,
Malcolm

On 9 November 2017 at 17:58, Tilman Hausherr <[email protected]> wrote:
Hi,

What PDFBox version are you using and can you upload the PDF to a
sharehoster? Splits between tokens shouldn't be a problem.

Tilman

PS: please don't start a new thread like you did today; it is confusing.
Reply to your own message on the list instead.


On 09.11.2017 at 09:53, Malcolm Vincent wrote:
Hi,

I've been using PDFBox to read and write PDFs successfully for a while
and have started running into a few issues recently.

I am getting the following warnings when loading PDFs generated
in Adobe InDesign / Acrobat Distiller (the PDFs render fine in Acrobat
Reader, pdf.js and Chrome).

The first one seems to be a UI thing for the PDFReader function so I'm
ignoring it.

The second and third are the problem. They are both related. I get
them when I use PDFBox in my own code as well as in the app, but since
they are warnings they do not flag up as runtime errors I can catch.

#1
Nov 09, 2017 8:31:45 AM java.util.prefs.WindowsPreferences <init>
WARNING: Could not open/create prefs root node Software\JavaSoft\Prefs at root 0x80000002. Windows RegCreateKeyEx(...) returned error code 5.

#2
Nov 09, 2017 8:32:03 AM org.apache.pdfbox.pdfparser.BaseParser parseCOSDictionaryNameValuePair
WARNING: Bad Dictionary Declaration org.apache.pdfbox.pdfparser.InputStreamSource@498fa7e0

#3
Nov 09, 2017 8:32:03 AM org.apache.pdfbox.pdfparser.BaseParser parseCOSDictionary
WARNING: Invalid dictionary, found: '?' but expected: '/' at offset 2861
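
Since these messages are logged rather than thrown, one way to detect them in code is to attach a handler to the logger. The sketch below assumes that PDFBox's logging ends up in java.util.logging (the output format above suggests so) and that the logger is named after the class, org.apache.pdfbox.pdfparser.BaseParser. The class ParserWarningCollector is a hypothetical helper, and the logged line in main() is only a stand-in for a real PDF load.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Hypothetical helper: collects WARNING (and higher) log records so that
// calling code can react to them after parsing.
public class ParserWarningCollector extends Handler {

    private final List<String> warnings = new ArrayList<>();

    @Override
    public void publish(LogRecord record) {
        if (record.getLevel().intValue() >= Level.WARNING.intValue()) {
            warnings.add(record.getMessage());
        }
    }

    @Override
    public void flush() {
        // nothing is buffered
    }

    @Override
    public void close() {
        // no resources to release
    }

    public List<String> getWarnings() {
        return warnings;
    }

    public static void main(String[] args) {
        // Assumption: parser loggers live below this package name, so a
        // handler on the parent logger also sees records from BaseParser.
        Logger parserLogger = Logger.getLogger("org.apache.pdfbox.pdfparser");
        ParserWarningCollector collector = new ParserWarningCollector();
        parserLogger.addHandler(collector);
        try {
            // Stand-in for the real work (loading and parsing the PDF).
            Logger.getLogger("org.apache.pdfbox.pdfparser.BaseParser")
                  .warning("Bad Dictionary Declaration");
        } finally {
            parserLogger.removeHandler(collector);
        }

        if (!collector.getWarnings().isEmpty()) {
            System.out.println("Parser warnings: " + collector.getWarnings());
        }
    }
}
```

The handler does not suppress the console output; it only records the messages so the caller can decide whether to reject or re-process the file.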

I have traced the problem to the following PDF content at the end of
Page 1 Stream 1.

/Span <</Lang (en-GB)/MCID 8 >>BDC
BT
9 0 0 9 99.3376 555.6879 Tm
(text string here)Tj
ET
EMC
/Span <</Lang
endstream
endobj

The last dictionary entry seems to be incomplete.

When I go on to process the files in my own code, I iterate over the
content stream, perform my function and replace the stream content.
The stream ends up incorrect, and the resulting PDFs will not load in
Acrobat Reader (although they do in Chrome).

My options appear to be

(a) grep the file for this and remove or overwrite it with a string
operation before using PDFBox

(b) update the source to cope with this condition

(c) kick the PDF back as invalid - difficult since the file is a
"valid" PDF that is generated in Adobe and reads ok in Adobe

I have verified this by manually overtyping <</Lang with spaces and
then everything works perfectly in my own code and in PDFReader.

Any thoughts?

Best wishes,
Malcolm.

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

