RE: copy entire stream of a page ignoring images

Adam Mon, 25 Oct 2010 09:44:49 -0700

I see what you mean about the font being different, but I'm not sure what 
would cause that.  The only things I can recommend are testing with a 
single-page PDF, if you're not already doing this, and trying the same PDF 
with a different number of images (0, 1, 2).  See if the one without any 
images comes out okay.  If its output still differs from the original, 
compare the two PDFs in a text editor to see which objects changed, and 
remove code until you get output that looks correct.  That'll let you see 
exactly which line of code is causing the issue, and by comparing the PDFs 
you can see what it's actually doing.  It will also be interesting to see 
whether the one that has 2 images comes out twice as messed up as the one 
that has only one image.


That should at least give you enough information to determine where the 
issue lies.
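
One way to narrow this down without reading the whole file by eye is to dump 
both content streams as token lists and find the first position where they 
diverge.  This is only a minimal sketch, assuming the tokens have already 
been converted to strings; TokenDiff and firstDifference are made-up names 
for illustration, not part of PDFBox.

```java
import java.util.Arrays;
import java.util.List;

final class TokenDiff {
    /** Index of the first differing token, or -1 when both lists are identical. */
    static int firstDifference(List<String> original, List<String> edited) {
        int common = Math.min(original.size(), edited.size());
        for (int i = 0; i < common; i++) {
            if (!original.get(i).equals(edited.get(i))) {
                return i;
            }
        }
        // One list is a prefix of the other: they diverge where the shorter one ends.
        return original.size() == edited.size() ? -1 : common;
    }

    public static void main(String[] args) {
        List<String> original = Arrays.asList("q", "/Im1", "Do", "Q", "BT", "/F1", "12", "Tf");
        List<String> edited = Arrays.asList("q", "Q", "BT", "/F1", "12", "Tf");
        System.out.println("first difference at token " + firstDifference(original, edited));
    }
}
```

Running it on the 0, 1 and 2 image variants should show whether the streams 
start to differ at the image operators or somewhere else.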

---- 
Thanks,
Adam





From:
José Rodolfo Carrijo de Freitas <[email protected]>
To:
<[email protected]>
Date:
10/25/2010 05:29
Subject:
RE: copy entire stream of a page ignoring images



Adam,
You were right. With the last solution I would hit objects which were not 
guaranteed to be images.
However, the font problem is a different one...
I was wondering whether the remaining operators, which set up the graphics 
state for the image, could affect the font and text operators.
According to the spec they shouldn't. However, if they work on a stack 
(and it seems they do), they are probably interfering in some way.

I looked at both streams (the original one and the edited one) with 
pdfdebugger and compared them. The edited stream seems to be exactly as it 
should be, except for some differences in floating-point numbers and the 
image operator (with its name), which I removed. Yet the two pages render 
differently. I believe there is a problem in the graphics state.
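
On the stack question: in the PDF spec, q pushes a copy of the current 
graphics state and Q pops it, and the text state parameters (font, font 
size and so on) are part of that state. An edit that removes a q but 
keeps its Q, or the reverse, can therefore change which font is active 
afterwards. The sketch below is a rough check for that, assuming the stream 
has been reduced to a list of token strings; GraphicsStateBalance and 
finalDepth are hypothetical names, not PDFBox API.

```java
import java.util.Arrays;
import java.util.List;

final class GraphicsStateBalance {
    /**
     * Walks the tokens and tracks q/Q nesting.
     * Returns the depth left open at the end (0 means balanced),
     * or -1 if a Q appears when there is nothing to restore.
     */
    static int finalDepth(List<String> tokens) {
        int depth = 0;
        for (String token : tokens) {
            if (token.equals("q")) {
                depth++;
            } else if (token.equals("Q")) {
                if (depth == 0) {
                    return -1;
                }
                depth--;
            }
        }
        return depth;
    }

    public static void main(String[] args) {
        List<String> edited = Arrays.asList("596", "0", "0", "840", "0", "0", "cm", "Q", "BT", "ET");
        System.out.println("final q/Q depth: " + finalDepth(edited));
    }
}
```

A result other than 0 for the edited stream, when the original gives 0, 
would point at a q or Q that was cut without its partner.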

http://b.imagehost.org/view/0930/comparing-pdfs

What kind of problem does that seem to be?


Regards,
José Rodolfo Carrijo de Freitas
Systems Analyst
Softplan - Research and Development Department
ISO 9001:2008 Certified Quality System
(48) 3027 8000 Ext. 8359
http://www.softplan.com.br


-----Original message-----
From: [email protected] [mailto:[email protected]] 
Sent: Friday, October 22, 2010 14:46
To: [email protected]
Subject: Re: RES: copy entire stream of a page ignoring images

I've done very little work with images, but I remember section 8.2 
(Graphics Objects) of the PDF spec was very helpful.  Table 51 goes over 
different operators which are related to images.  I think the thing you 
will find most helpful is the information on the "Do" operator; see table 
87.  It can apply to Images, Forms, and PostScript XObjects.  You're 
probably hitting PS objects when you don't want to.  Section 8.8.2 
(PostScript XObjects) should be able to help you detect these.
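
To illustrate the point about "Do": the operator only names an XObject, and 
the kind of object is decided by the /Subtype entry of that XObject 
(Image, Form or PS in the spec). A minimal sketch of that decision follows, 
assuming the subtypes have already been read into a plain map keyed by 
resource name; XObjectKind, classify and isImageDo are hypothetical names 
for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

final class XObjectKind {
    enum Kind { IMAGE, FORM, POSTSCRIPT, UNKNOWN }

    /** Maps the /Subtype value of an XObject to a kind. */
    static Kind classify(String subtype) {
        if (subtype == null) {
            return Kind.UNKNOWN;
        }
        switch (subtype) {
            case "Image": return Kind.IMAGE;
            case "Form":  return Kind.FORM;
            case "PS":    return Kind.POSTSCRIPT;
            default:      return Kind.UNKNOWN;
        }
    }

    /** True only when the name used with "Do" resolves to an image XObject. */
    static boolean isImageDo(String name, Map<String, String> subtypesByName) {
        return classify(subtypesByName.get(name)) == Kind.IMAGE;
    }

    public static void main(String[] args) {
        Map<String, String> subtypes = new HashMap<String, String>();
        subtypes.put("Im1", "Image");
        subtypes.put("Fm1", "Form");
        System.out.println("Im1 is image: " + isImageDo("Im1", subtypes));
        System.out.println("Fm1 is image: " + isImageDo("Fm1", subtypes));
    }
}
```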

For reference, I'm looking at the 1.7 spec (ISO 32000), so my section and 
table numbers will match up with that version of the PDF spec.

Remember, referencing the official PDF specifications and looking at the 
PDF file in a good text editor is often extremely helpful in debugging 
issues.  Hope you find the above info helpful.

---- 
Thanks,
Adam





From:
José Rodolfo Carrijo de Freitas <[email protected]>
To:
<[email protected]>
Date:
10/22/2010 09:14
Subject:
RES: copy entire stream of a page ignoring images



Is there a fixed token sequence that draws an image in a PDStream?

After parsing some documents, I gathered that when a sequence starts with 
PDFOperator{q}, ends with PDFOperator{Q} and has PDFOperator{Do} in the 
middle, it is an image.
So I extract all of those tokens to remove the image from the page.
For example, if I find this set of operators in a stream:


PDFOperator{q}, COSInt{596}, COSInt{0}, COSInt{0}, COSInt{840}, COSInt{0},
COSInt{0}, PDFOperator{cm}, COSName{Im1}, PDFOperator{Do}, PDFOperator{Q}

I cut them all to remove the image.
But nothing is as easy as it seems, and this approach is ruining some of 
the fonts on the page.

Is there anyone who understands this better and can shed some light on 
this problem?
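
For comparison, a narrower filter is sketched below: it buffers operands 
until an operator arrives and drops only the name operand together with its 
"Do", leaving q, cm and Q in place so the save/restore pairs stay balanced. 
It is a toy model on plain strings, not PDFBox code; ImageDoFilter, 
dropImageDraws and the small operator set are assumptions made for the 
example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ImageDoFilter {
    // Toy operator set: only the operators that appear in the example stream.
    private static final Set<String> OPERATORS =
            new HashSet<String>(Arrays.asList("q", "Q", "cm", "Do", "BT", "ET", "Tf", "Tj"));

    /** Removes every "name Do" pair whose name is in imageNames; all other tokens are kept in order. */
    static List<String> dropImageDraws(List<String> tokens, Set<String> imageNames) {
        List<String> out = new ArrayList<String>();
        List<String> operands = new ArrayList<String>();
        for (String token : tokens) {
            if (!OPERATORS.contains(token)) {
                operands.add(token);          // operand: hold it until its operator is seen
                continue;
            }
            boolean imageDo = token.equals("Do")
                    && operands.size() == 1
                    && imageNames.contains(operands.get(0));
            if (!imageDo) {
                out.addAll(operands);         // operands are written only with their operator
                out.add(token);
            }
            operands.clear();
        }
        out.addAll(operands);                 // trailing operands, if the stream ends mid-instruction
        return out;
    }

    public static void main(String[] args) {
        List<String> page = Arrays.asList("q", "596", "0", "0", "840", "0", "0", "cm",
                "/Im1", "Do", "Q", "BT", "/F1", "12", "Tf", "(Hello)", "Tj", "ET");
        Set<String> images = new HashSet<String>(Arrays.asList("/Im1"));
        System.out.println(dropImageDraws(page, images));
    }
}
```

The design choice here is that an operator and its operands are always kept 
or dropped as a unit, so no stray operand is left in front of the next 
operator.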




Regards,
José Rodolfo Carrijo de Freitas
Systems Analyst
Softplan - Research and Development Department
ISO 9001:2008 Certified Quality System
(48) 3027 8000 Ext. 8359
http://www.softplan.com.br

-----Original message-----
From: José Rodolfo Carrijo de Freitas [mailto:[email protected]] 
Sent: Friday, October 22, 2010 09:39
To: [email protected]
Subject: copy entire stream of a page ignoring images

Hello, 

I'm trying to write a function that copies the content stream of a page to 
another page.

The problem is that PDFStreamParser does not seem to be parsing the text, 
because I'm not getting any text on my new page.

Besides that, I'm getting a warning when opening the new pages in Adobe 
Reader.

Has anyone written a similar function, or could someone give me a little 
help here?

PS: does anyone know of a PDF utility that can show the elements of a 
stream?


import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.cos.COSObject;
import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdfwriter.ContentStreamWriter;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;
import org.apache.pdfbox.util.PDFOperator;

private void copyPageWithoutImage(PDPage page, PDPage newpage) throws IOException {
    PDStream contents = page.getContents();
    PDFStreamParser parser = new PDFStreamParser(contents.getStream());
    try {
        // XObjects are looked up by name in the source page's resources.
        Map<?, ?> xobjects = page.findResources().getXObjects();
        List<Object> tokensNovos = new ArrayList<Object>();
        List<Object> arguments = new ArrayList<Object>();
        Iterator<Object> iter = parser.getTokenIterator();
        while (iter.hasNext()) {
            Object next = iter.next();
            if (next instanceof COSObject) {
                arguments.add(((COSObject) next).getObject());
            } else if (next instanceof PDFOperator) {
                PDFOperator op = (PDFOperator) next;
                boolean drawsImage = false;
                if ("Do".equals(op.getOperation()) && !arguments.isEmpty()
                        && arguments.get(0) instanceof COSName && xobjects != null) {
                    String name = ((COSName) arguments.get(0)).getName();
                    drawsImage = xobjects.get(name) instanceof PDXObjectImage;
                }
                // Operands are written only together with their operator, so a
                // dropped "Do" does not leave its name operand in the stream.
                if (!drawsImage) {
                    tokensNovos.addAll(arguments);
                    tokensNovos.add(next);
                }
                arguments.clear();
            } else {
                arguments.add(next);
            }
        }

        // Write the filtered tokens into a fresh stream for the new page.
        PDStream updatedStream = new PDStream(this.pdf);
        OutputStream out = updatedStream.createOutputStream();
        try {
            new ContentStreamWriter(out).writeTokens(tokensNovos);
        } finally {
            out.close();
        }
        newpage.setContents(updatedStream);
        // The copied operators refer to fonts by resource name, so the new
        // page is given the same resources as the source page.
        newpage.setResources(page.findResources());
    } finally {
        parser.close();
    }
}





This email and any content within or attached hereto from  Sun West 
Mortgage Company, Inc.  is confidential and/or legally privileged. The 
information is intended only for the use of the individual or entity named 
on this email. If you are not the intended recipient, you are hereby 
notified that any disclosure, copying, distribution or the taking of any 
action in reliance on the contents of this email information is strictly 
prohibited, and that the documents should be returned to this office 
immediately by email. Receipt by anyone other than the intended recipient 
is not a waiver of any privilege. Please do not include your social 
security number, account number, or any other personal or financial 
information in the content of the email. Should you have any questions, 
please call  (800) 453 7884. 



