Hi Tilman,

They fail because they parse the stream every time you click on it -
rather than the whole page.
I haven't checked the code yet to find it, but you can tell because every
time you click on (an incomplete) stream in PDFReader it throws the same
exceptions again on the console.

From a debugging perspective this is a great feature to have and most of
the time streams seem to be generated as complete entities.

Like I say - now I know you can't parse streams individually in normal
use I don't have a problem. It was my logic that was at fault.

Best Wishes
Malcolm.

On 10 November 2017 at 16:46, Tilman Hausherr <[email protected]> wrote:
> Hi,
>
> Yes it is true that page content streams can be split. But
> PDFReader/PDFDebugger should be able to handle that because deep inside,
> PDFStreamParser() is called when a page is rendered. PDFReader/PDFDebugger
> do show the individual streams for debugging purpose but they're not
> rendering them individually. So I'm wondering how it is possible that they
> fail but you succeed.
>
> Re the java warning, I get it too and didn't even bother to fix it. See
> https://stackoverflow.com/questions/5354838/java-java-util-preferences-failing
> https://stackoverflow.com/questions/16428098/groovy-shell-warning-could-not-open-create-prefs-root-node
>
> Tilman
>
>
>
> On 10.11.2017 at 10:04, Malcolm Vincent wrote:
>>
>> Hi Tilman,
>>
>> Thanks for replying. I'll see if I can get permission from the client
>> to upload the file.
>>
>> The same behaviour occurs in PDFBox 2.0.x - I am on 2.0.8 currently.
>>
>> I'm pretty definite now about what's happening.
>>
>> The "issue" (if it is an issue) is that I was treating the streams the
>> same way that PDFReader does, and loading them one at a time.
>>
>> It appears that this is not a safe thing to do because the streams are
>> not fully complete parseable entities in their own right and some
>> higher level token constructs - like COSDictionary for example - can
>> be split across stream boundaries by the Adobe PDF generators.
>> So although atomic tokens like int may generally be ok, more complex
>> things are not.
>>
>> This is why my code that uses PDFbox is throwing the warnings and also
>> why it happens with the PDFReader / debug function in the app. Every
>> time I click on a stream with a dictionary that is partly in one
>> stream and partly in another the parser throws a warning on the
>> console.
>>
>> I am unclear exactly how this fits with the specification - a quick
>> "find" has not cleared it up - but I suppose in theory since PDF is a
>> binary format the stream could break at any byte and any token could
>> be split right in the middle.
>>
>> Following on from that analysis this appears to be the way to get the
>> tokens on a page and process them ... at least it has resolved my
>> problem on the PDF files I am currently processing ...
>>
>> PDPage page = my_pdf.getPage(i);
>> PDFStreamParser parser = new PDFStreamParser(page);
>> parser.parse();
>> page.setContents(processTokens(parser.getTokens()));
>>
>> where processTokens() is my worker function.
>>
>> Of course this assumes that the generator has not broken atomic tokens
>> in the middle of the content since the PDFBox doc says streams parsed
>> this way are concatenated with a whitespace character between them.
>>
>> For completeness here is a fragment of one of my PDFs which shows the
>> dictionary split across the end of one stream and the start of the
>> next ...
>>
>> /Span <</Lang (en-GB)/MCID 8 >>BDC
>> BT
>> 9 0 0 9 99.3376 555.6879 Tm
>> (Some text)Tj
>> ET
>> EMC
>> /Span <</Lang
>> endstream
>> endobj
>> 19 0 obj
>> <<
>> /Length 2852
>> >>
>> stream
>> (en-GB)/MCID 9 >>BDC
>> BT
>> 9 0 0 9 145.7323 555.6879 Tm
>> (Some more text)Tj
>> ET
>> EMC
>>
>>
>> Best Wishes,
>> Malcolm
>>
>> On 9 November 2017 at 17:58, Tilman Hausherr <[email protected]>
>> wrote:
>>>
>>> Hi,
>>>
>>> What PDFBox version are you using and can you upload the PDF to a
>>> sharehoster? Splits between tokens shouldn't be a problem.
>>>
>>> Tilman
>>>
>>> PS: please don't start a new thread like you did today, this is
>>> confusing.
>>> Answer to yourself on the list instead.
>>>
>>>
>>> On 09.11.2017 at 09:53, Malcolm Vincent wrote:
>>>>
>>>> Hi,
>>>>
>>>> I've been using PDFBox to read and write PDFs successfully for a while
>>>> and have started running into a few issues recently.
>>>>
>>>> I seem to be getting the following errors when loading PDFs generated
>>>> in Adobe InDesign / Acrobat Distiller (the PDFs render fine in Acrobat
>>>> Reader, pdf.js and chrome).
>>>>
>>>> The first one seems to be a UI thing for the PDFReader function so I'm
>>>> ignoring it.
>>>>
>>>> The second and third are the problem. They are both related. I get
>>>> them when I use PDFBox in my own code as well as in the app, but since
>>>> they are warnings they do not flag up as runtime errors I can catch.
>>>>
>>>> #1
>>>> Nov 09, 2017 8:31:45 AM java.util.prefs.WindowsPreferences <init>
>>>> WARNING: Could not open/create prefs root node Software\JavaSoft\Prefs
>>>> at root 0x80000002. Windows RegCreateKeyEx(...) returned error code 5.
>>>>
>>>> #2
>>>> Nov 09, 2017 8:32:03 AM org.apache.pdfbox.pdfparser.BaseParser
>>>> parseCOSDictionaryNameValuePair
>>>> WARNING: Bad Dictionary Declaration
>>>> org.apache.pdfbox.pdfparser.InputStreamSource@498fa7e0
>>>>
>>>> #3
>>>> Nov 09, 2017 8:32:03 AM org.apache.pdfbox.pdfparser.BaseParser
>>>> parseCOSDictionary
>>>> WARNING: Invalid dictionary, found: '?' but expected: '/' at offset 2861
>>>>
>>>> I have traced the problem to the following PDF content at the end of
>>>> Page 1 Stream 1.
>>>>
>>>> /Span <</Lang (en-GB)/MCID 8 >>BDC
>>>> BT
>>>> 9 0 0 9 99.3376 555.6879 Tm
>>>> (text string here)Tj
>>>> ET
>>>> EMC
>>>> /Span <</Lang
>>>> endstream
>>>> endobj
>>>>
>>>> The last dictionary entry seems to be incomplete.
>>>>
>>>> When I go on to process the files in my own code, I iterate over the
>>>> content stream, perform my function and replace the stream content,
>>>> the stream ends up incorrect and the resulting PDFs will not load in
>>>> Acrobat Reader (although they do in chrome).
>>>>
>>>> My options appear to be
>>>>
>>>> (a) grep the file for this and remove or overwrite it with a string
>>>> operation before using PDFBox
>>>>
>>>> (b) update the source to cope with this condition
>>>>
>>>> (c) kick the PDF back as invalid - difficult since the file is a
>>>> "valid" PDF that is generated in Adobe and reads ok in Adobe
>>>>
>>>> I have verified this by manually overtyping <</Lang with spaces and
>>>> then everything works perfectly in my own code and in PDFReader.
>>>>
>>>> Any thoughts?
>>>>
>>>> Best wishes,
>>>> Malcolm.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]
>>>>
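
The conclusion of the thread (a dictionary can be cut in two by a content stream boundary, so the page's streams have to be joined before parsing) can be illustrated without PDFBox. The following is only a minimal sketch: the class `SplitStreamDemo` and the helper `dictBalance` are hypothetical names invented for this illustration, and the bracket counter is a deliberately crude stand-in for a real content stream tokenizer (it ignores string literals, hex strings and comments). The two fragments are shortened versions of the ones quoted in the thread, joined with a single space as the thread describes.

```java
class SplitStreamDemo {
    // Hypothetical helper, not a PDFBox API: counts "<<" as +1 and ">>" as -1.
    // A result of 0 means every dictionary that was opened was also closed.
    public static int dictBalance(String content) {
        int balance = 0;
        for (int i = 0; i + 1 < content.length(); i++) {
            char a = content.charAt(i);
            char b = content.charAt(i + 1);
            if (a == '<' && b == '<') {
                balance++;
                i++; // skip the second character of the pair
            } else if (a == '>' && b == '>') {
                balance--;
                i++;
            }
        }
        return balance;
    }

    public static void main(String[] args) {
        // End of the first content stream: the last dictionary is left open.
        String first = "/Span <</Lang (en-GB)/MCID 8 >>BDC\nBT\n(Some text)Tj\nET\nEMC\n/Span <</Lang";
        // Start of the next content stream: it closes that dictionary.
        String second = "(en-GB)/MCID 9 >>BDC\nBT\n(Some more text)Tj\nET\nEMC";
        // Join the fragments with one whitespace character before inspecting them.
        String joined = first + " " + second;

        System.out.println("first:  " + dictBalance(first));
        System.out.println("second: " + dictBalance(second));
        System.out.println("joined: " + dictBalance(joined));
    }
}
```

Each fragment on its own is unbalanced (1 for the first, -1 for the second), which matches the "Bad Dictionary Declaration" warnings seen when a single stream is parsed in isolation. The joined content balances to 0.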

