RE: Stream parsing issue in multi-stream page

Esteban R Mon, 05 Feb 2018 10:30:36 -0800

I need to analyze the distribution of contents in the different streams (I 
cannot provide additional details due to a confidentiality agreement). Then I 
may need to change some content in the streams and rewrite them. I also wanted 
to preserve the original structure of (many) streams, but it is not a hard 
requirement.


Esteban
________________________________
From: Maruan Sahyoun <[email protected]>
Sent: Monday, 05 February 2018 04:19 p.m.
To: Esteban R
Subject: Re: Stream parsing issue in multi-stream page

Hi,
> On 05.02.2018 at 17:14, Esteban R <[email protected]> wrote:
>
> Thanks for your answer. But I really need to process the streams one by one 
> (a special requirement in my project).

Could you explain why this is the case? It is possible for tokens to span 
streams, so if you process the streams one by one the parser doesn't know 
about the continuation. The result you posted initially is therefore expected 
from that perspective.

BR
Maruan
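
One way to detect the case described above, where an array opened in one stream is only closed in a later one, could be to track delimiter nesting across the stream boundaries. The following is only a minimal sketch using the Java standard library rather than PDFBox: the class and method names (StreamBoundaryCheck, feed, streamsLeftOpen) are hypothetical, and the scanner is a simplification that does not handle inline image data (BI ... ID ... EI). It does not catch every kind of split either, for example operands in one stream and their operator in the next.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of PDFBox: flags content streams that end
// while an array, dictionary or literal string is still open.
class StreamBoundaryCheck {
    private int nesting = 0;     // currently open '[' and '<<'
    private int parenDepth = 0;  // parenthesis depth inside a literal string

    // Scans the next stream in page order. Returns true if some construct
    // is still open when the stream ends.
    boolean feed(byte[] content) {
        boolean inComment = false;   // a comment never outlives its stream here
        for (int i = 0; i < content.length; i++) {
            char c = (char) (content[i] & 0xFF);
            if (inComment) {
                if (c == '\n' || c == '\r') inComment = false;
                continue;
            }
            if (parenDepth > 0) {
                if (c == '\\') i++;                 // skip the escaped character
                else if (c == '(') parenDepth++;
                else if (c == ')') parenDepth--;
                continue;
            }
            boolean doubled = i + 1 < content.length && content[i + 1] == content[i];
            switch (c) {
                case '%': inComment = true; break;
                case '(': parenDepth = 1; break;
                case '[': nesting++; break;
                case ']': nesting--; break;
                case '<': if (doubled) { nesting++; i++; } break;  // single '<' starts a hex string
                case '>': if (doubled) { nesting--; i++; } break;
                default: break;
            }
        }
        return nesting != 0 || parenDepth != 0;
    }

    // Indices of the streams that end with something still open.
    static List<Integer> streamsLeftOpen(List<byte[]> streams) {
        StreamBoundaryCheck check = new StreamBoundaryCheck();
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < streams.size(); i++) {
            if (check.feed(streams.get(i))) result.add(i);
        }
        return result;
    }

    public static void main(String[] args) {
        // The two streams from the example quoted further down in this thread.
        byte[] first = "/F1 10 Tf\nBT\n40 764.138 Td\n0 -12.138 Td\n[\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] second = "(CD) ] TJ\nET\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(streamsLeftOpen(Arrays.asList(first, second))); // prints [0]
    }
}
```

In a real program the byte arrays would presumably come from pdstream.toByteArray(), as in the loop quoted below. A stream flagged this way could then be parsed together with the streams that follow it, up to the first one that is not flagged, instead of on its own.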

>
> Anyway, your answer gave me an idea for detecting the issue: I can compare 
> the tokens for the individual streams with the tokens from 
> pdPage.getContents(). That means processing twice, but it is still useful.
>
> Any other ideas are welcome.
>
> Esteban
> From: Maruan Sahyoun <[email protected]>
> Sent: Monday, 05 February 2018 03:25 p.m.
> To: [email protected]
> Subject: Re: Stream parsing issue in multi-stream page
>
> Hi,
>
>
>
> > On 05.02.2018 at 15:43, Esteban R <[email protected]> wrote:
> >
> > Hello. I need to rewrite a PDPage with many streams, one by one (making 
> > some transformations, and there is a special need to do it one stream at a 
> > time). Parsing (and pdfdebug) returns "wrong" tokens if one command begins 
> > at the end of the first stream and ends at the beginning of the next one. 
> > I'm using pdfbox-2.0.8.
> >
> > Rewriting the stream with those tokens produces a corrupted page.
> > How could we re-write the page without getting a corrupted page?
> > Or, at least, how can we detect this kind of failure (or this one)?
> >
> > Please find a simplified example here:
> > http://www.filedropper.com/out3unc
> >
> > The first stream is:
> > /F1 10 Tf
> > BT
> > 40 764.138 Td
> > 0 -12.138 Td
> > [
> >
> > and the second one is:
> > (CD) ] TJ
> > ET
> >
> > In this case, running the following code:
> >        Iterator<PDStream> itStreams = pdPage.getContentStreams();
> >        while (itStreams.hasNext()) {
> >            PDStream pdstream = itStreams.next();
> >            PDFStreamParser parser = new PDFStreamParser(pdstream.toByteArray());
> >            parser.parse();
> >            List<Object> tokens = parser.getTokens();
> >            for (Object token: tokens){
> >                System.out.println("Token: "+token);
> >            }
> >        }
> >
>
> Instead of using pdPage.getContentStreams() and parsing each stream 
> individually, use pdPage.getContents() and read all of the content into a 
> byte[]. You can then pass that to PDFStreamParser.
>
> That will give you this output:
>
> Token: COSName{F1}
> Token: COSInt{10}
> Token: PDFOperator{Tf}
> Token: PDFOperator{BT}
> Token: COSInt{40}
> Token: COSFloat{764.138}
> Token: PDFOperator{Td}
> Token: COSInt{0}
> Token: COSFloat{-12.138}
> Token: PDFOperator{Td}
> Token: COSArray{[COSString{CD}]}
> Token: PDFOperator{TJ}
> Token: PDFOperator{ET}
>
> BR
> Maruan
>
>
> > shows:
> > Token: COSName{F1}
> > Token: COSInt{10}
> > Token: PDFOperator{Tf}
> > Token: PDFOperator{BT}
> > Token: COSInt{40}
> > Token: COSFloat{764.138}
> > Token: PDFOperator{Td}
> > Token: COSInt{0}
> > Token: COSFloat{-12.138}
> > Token: PDFOperator{Td}
> > Token: COSArray{[]}     !!!!! empty array detected, end of first stream
> > Token: COSString{CD}    !!!!! beginning of second stream
> > Token: COSNull{}        !!!!! closing "]"
> > Token: PDFOperator{TJ}
> > Token: PDFOperator{ET}
> >
> >
> > Esteban
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
