I need to analyze how content is distributed across the different streams (I cannot provide additional details due to a confidentiality agreement). I may then need to change some content in the streams and rewrite them. I would also like to preserve the original structure of the (many) streams, but that is not a hard requirement.
Esteban

________________________________
From: Maruan Sahyoun <[email protected]>
Sent: Monday, February 5, 2018 04:19 p.m.
To: Esteban R
Subject: Re: Stream parsing issue in multi-stream page

Hi,

> On 05.02.2018 at 17:14, Esteban R <[email protected]> wrote:
>
> Thanks for your answer. But I really need to process the streams one by one
> (a special requirement in my project).

Could you explain why this is the case? It is possible that tokens are spanning streams - so if you process them one by one the parser wouldn't know about the continuation. So the result you posted initially is fine from that perspective.

BR
Maruan

>
> Anyways, your answer gave me an idea for detecting the issue: I can compare
> the tokens for the individual streams with the tokens from
> pdPage.getContents().... double processing, but still useful.
>
> Any other ideas are welcome.
>
> Esteban
>
> From: Maruan Sahyoun <[email protected]>
> Sent: Monday, February 5, 2018 03:25 p.m.
> To: [email protected]
> Subject: Re: Stream parsing issue in multi-stream page
>
> Hi,
>
> > On 05.02.2018 at 15:43, Esteban R <[email protected]> wrote:
> >
> > Hello. I need to rewrite a PDPage with many streams, one by one (making
> > some transformations, and there is a special need to do it one stream at a
> > time). Parsing (and pdfdebug) returns "wrong" tokens if one command begins
> > at the end of the first stream and ends at the beginning of the next one.
> > I'm using pdfbox-2.0.8.
> >
> > Rewriting the stream with those tokens produces a corrupted page.
> > How could we rewrite the page without getting a corrupted page?
> > Or, at least, how can we detect this kind of failure (or this one)?
> >
> > Please find a simplified example here:
> > http://www.filedropper.com/out3unc
> >
> > The first stream is:
> >     /F1 10 Tf
> >     BT
> >     40 764.138 Td
> >     0 -12.138 Td
> >     [
> >
> > and the second one is:
> >     (CD) ] TJ
> >     ET
> >
> > In this case, running the following code:
> >     Iterator<PDStream> itStreams = pdPage.getContentStreams();
> >     while (itStreams.hasNext()) {
> >         PDStream pdstream = itStreams.next();
> >         PDFStreamParser parser = new PDFStreamParser(pdstream.toByteArray());
> >         parser.parse();
> >         List<Object> tokens = parser.getTokens();
> >         for (Object token : tokens) {
> >             System.out.println("Token: " + token);
> >         }
> >     }
> >
>
> Instead of using pdPage.getContentStreams() and parsing the streams
> individually, use pdPage.getContents() and read all content into a byte[]. You
> can then pass that to PDFStreamParser.
>
> That will give you this output:
>
>     Token: COSName{F1}
>     Token: COSInt{10}
>     Token: PDFOperator{Tf}
>     Token: PDFOperator{BT}
>     Token: COSInt{40}
>     Token: COSFloat{764.138}
>     Token: PDFOperator{Td}
>     Token: COSInt{0}
>     Token: COSFloat{-12.138}
>     Token: PDFOperator{Td}
>     Token: COSArray{[COSString{CD}]}
>     Token: PDFOperator{TJ}
>     Token: PDFOperator{ET}
>
> BR
> Maruan
>
> > shows:
> >     Token: COSName{F1}
> >     Token: COSInt{10}
> >     Token: PDFOperator{Tf}
> >     Token: PDFOperator{BT}
> >     Token: COSInt{40}
> >     Token: COSFloat{764.138}
> >     Token: PDFOperator{Td}
> >     Token: COSInt{0}
> >     Token: COSFloat{-12.138}
> >     Token: PDFOperator{Td}
> >     Token: COSArray{[]}     !!!!! empty array detected, end of first stream
> >     Token: COSString{CD}    !!!!! beginning of second stream
> >     Token: COSNull{}        !!!!! closing "]"
> >     Token: PDFOperator{TJ}
> >     Token: PDFOperator{ET}
> >
> > Esteban
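
On the detection question raised in the original message ("how can we detect this kind of failure"), one possible approach is to check the raw bytes of each stream for an array that is opened but not closed (or closed but never opened) before handing the stream to the parser. The following is only a rough sketch under stated assumptions: the class `StreamBoundaryCheck` and the method `arrayBalance` are hypothetical names, not part of PDFBox, and the scanner is deliberately simplified. It understands literal strings, backslash escapes and comments, but it does not skip inline image data (BI ... ID ... EI), so binary image bytes could produce false positives.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, not part of PDFBox: a simplified bracket scanner. */
final class StreamBoundaryCheck {

    /**
     * Returns the net count of '[' minus ']' found outside strings and comments.
     * A positive value suggests the stream ends inside an unfinished array,
     * a negative value suggests it closes an array opened in an earlier stream,
     * and 0 suggests the arrays in this stream are balanced.
     */
    static int arrayBalance(byte[] content) {
        int arrayBalance = 0;
        int stringDepth = 0;        // literal strings may nest balanced parentheses
        boolean inComment = false;  // a comment runs from '%' to the end of the line

        for (int i = 0; i < content.length; i++) {
            char c = (char) (content[i] & 0xFF);
            if (inComment) {
                if (c == '\n' || c == '\r') {
                    inComment = false;
                }
                continue;
            }
            if (stringDepth > 0) {
                if (c == '\\') {
                    i++;            // skip the escaped character
                } else if (c == '(') {
                    stringDepth++;
                } else if (c == ')') {
                    stringDepth--;
                }
                continue;
            }
            switch (c) {
                case '%': inComment = true; break;
                case '(': stringDepth = 1; break;
                case '[': arrayBalance++; break;
                case ']': arrayBalance--; break;
                default: break;
            }
        }
        return arrayBalance;
    }

    public static void main(String[] args) {
        // The two streams from the simplified example in this thread.
        byte[] first = "/F1 10 Tf\nBT\n40 764.138 Td\n0 -12.138 Td\n[\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] second = "(CD) ] TJ\nET\n".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(arrayBalance(first));   // prints 1
        System.out.println(arrayBalance(second));  // prints -1
    }
}
```

A non-zero result for a stream would be a hint that it should not be parsed or rewritten in isolation; such streams could be concatenated with their neighbours first, as suggested above with pdPage.getContents(). A zero result does not prove the stream is safe, because other constructs (strings, dictionaries, operands separated from their operator) can also be split across streams.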

