RES: RES: copy entire stream of a page ignoring images

José Rodolfo Carrijo de Freitas Fri, 22 Oct 2010 10:39:55 -0700

Correction: I meant the 'q' and 'Q' operators, which represent the graphics *state*.

Regards,
José Rodolfo Carrijo de Freitas
Systems Analyst
Softplan - Research and Development Department
ISO 9001:2008 Certified Quality System
(48) 3027 8000 Ext. 8359
http://www.softplan.com.br



-----Original Message-----
From: José Rodolfo Carrijo de Freitas [mailto:[email protected]]
Sent: Friday, October 22, 2010 15:30
To: [email protected]
Subject: RES: RES: copy entire stream of a page ignoring images

Thank you both for the help.
The PDFDebugger is an epic win; I didn't know about it.

One question about your comment:
>You have to remove it from the content stream. Don't forget to also remove its 
>parameter the imagename (e.g. Im1).

Do I have to worry about those COSInt and COSFloat tokens as well?
They represent the scale and coordinates of the image. And what about the 'q' and
'Q' operators, which represent graphic scales? Should I get rid of them, I mean
when they surround a 'Do' operator?

I'll take a look at the spec to try to find out the answers, but if you
already know, it would be nice to "hear" this information.
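
For context on the q/Q part of the question: in the PDF spec, q saves the current graphics state, Q restores it, and cm changes the current transformation matrix. The rough sketch below replays a token list and tracks only the horizontal scale of that matrix, to show that a q ... cm ... Q group left behind after its Do is removed should not affect what follows. The class GraphicsStateSketch, the method replay, and the use of plain strings and numbers as tokens are assumptions made for illustration; none of this is PDFBox API.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

class GraphicsStateSketch {
    /**
     * Replays tokens and returns the horizontal scale in effect at the end.
     * Numbers are operands, strings are operators. Only q, Q and cm are
     * interpreted; a real graphics state holds far more than one scale value.
     */
    static double replay(List<Object> tokens) {
        Deque<Double> saved = new ArrayDeque<Double>();    // states pushed by q
        Deque<Double> operands = new ArrayDeque<Double>(); // operands since the last operator
        double scaleX = 1.0;
        for (Object token : tokens) {
            if (token instanceof Number) {
                operands.addLast(((Number) token).doubleValue());
                continue;
            }
            if ("q".equals(token)) {
                saved.push(scaleX);                 // save the current state
            } else if ("Q".equals(token) && !saved.isEmpty()) {
                scaleX = saved.pop();               // restore the saved state
            } else if ("cm".equals(token) && operands.size() == 6) {
                // Operands are a b c d e f; with b = c = 0, 'a' scales x.
                scaleX *= operands.peekFirst();
            }
            operands.clear();                       // every operator consumes its operands
        }
        return scaleX;
    }

    public static void main(String[] args) {
        // Image block from the thread with "Im1 Do" already removed.
        System.out.println(replay(Arrays.<Object>asList("q", 596, 0, 0, 840, 0, 0, "cm", "Q")));
        // Same block with q and Q also removed: the cm is no longer undone.
        System.out.println(replay(Arrays.<Object>asList(596, 0, 0, 840, 0, 0, "cm")));
    }
}
```

Under these assumptions, the numeric operands belong to cm rather than to Do, so removing only the name and the Do leaves a balanced group that changes nothing outside itself, while removing q and Q but keeping cm would let the scaling leak into later content.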




Thank you very much,
José Rodolfo Carrijo de Freitas


-----Original Message-----
From: Andreas Lehmkuehler [mailto:[email protected]]
Sent: Friday, October 22, 2010 15:01
To: [email protected]
Subject: Re: RES: copy entire stream of a page ignoring images

Hi,

On 22.10.2010 18:45, [email protected] wrote:
> I've done very little work with images, but I remember section 8.2
> (Graphics Objects) of the PDF spec was very helpful.  Table 51 goes over
> different operators which are related to images.  I think the thing you
> will find most helpful is the information on the "Do" operator; see table
> 87.  It can apply to Images, Forms, and PostScript XObjects.  You're
> probably hitting PS objects when you don't want to.  Section 8.8.2
> (PostScript XObjects) should be able to help you detect these.
Yes, that's correct, the "Do" operator will do the trick, in most cases....
You have to remove it from the content stream. Don't forget to also remove its
parameter, the image name (e.g. Im1). [1] is a good example of how to do that.
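
A minimal sketch of that removal, assuming the tokens have already been parsed into a list. The classes Op and Name are hypothetical stand-ins for the PDFBox token types, and dropImageDo and imageNames are names made up for illustration; the set of image names would have to come from the page resources.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Set;

class ImageDoFilterSketch {
    /** Stand-in for an operator token such as q, cm, Do or Q. */
    static final class Op {
        final String value;
        Op(String value) { this.value = value; }
    }

    /** Stand-in for a name token such as Im1. */
    static final class Name {
        final String value;
        Name(String value) { this.value = value; }
    }

    /**
     * Copies the tokens, leaving out each Do operator whose operand names an
     * image, together with that operand. All other tokens are kept as they are.
     */
    static List<Object> dropImageDo(List<Object> tokens, Set<String> imageNames) {
        List<Object> out = new ArrayList<Object>();
        for (Object token : tokens) {
            boolean isDo = token instanceof Op && ((Op) token).value.equals("Do");
            Object last = out.isEmpty() ? null : out.get(out.size() - 1);
            if (isDo && last instanceof Name && imageNames.contains(((Name) last).value)) {
                out.remove(out.size() - 1); // the operand was already copied; take it back out
                continue;                   // and skip the Do operator itself
            }
            out.add(token);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object> tokens = Arrays.<Object>asList(
                new Op("q"), 596, 0, 0, 840, 0, 0, new Op("cm"),
                new Name("Im1"), new Op("Do"), new Op("Q"));
        List<Object> kept = dropImageDo(tokens, Collections.singleton("Im1"));
        System.out.println(tokens.size() + " tokens in, " + kept.size() + " tokens out");
    }
}
```

Looking back at the last copied token keeps the loop to a single pass; a Do whose operand is not in imageNames (for example a form XObject) passes through untouched.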

> For reference, I'm looking at the 1.7 spec (ISO32000), so my page numbers
> will match up to this version of the PDF spec.
+1

> Remember, referencing the official PDF specifications and looking at the
> PDF file in a good text editor is often extremely helpful in debugging
> issues.  Hope you find the above info helpful.
The PDFDebugger bundled with PDFBox is also a helpful tool.

BR
Andreas Lehmkühler

[1] 
http://svn.apache.org/viewvc/pdfbox/trunk/pdfbox/src/main/java/org/apache/pdfbox/examples/util/RemoveAllText.java?view=log

>
>
>
> From: José Rodolfo Carrijo de Freitas <[email protected]>
> To: <[email protected]>
> Date: 10/22/2010 09:14
> Subject: RES: copy entire stream of a page ignoring images
>
>
>
> Is there a fixed token pattern by which an image is drawn in a PDStream?
>
> After parsing some documents, I gathered that when a run of tokens starts
> with PDFOperator{q}, ends with PDFOperator{Q}, and has PDFOperator{Do} in
> the middle, it is an image.
> So I extract all those tokens to remove the image from the page.
> For example, if I find this set of operators in a stream:
>
>
> PDFOperator{q}, COSInt{596}, COSInt{0}, COSInt{0}, COSInt{840}, COSInt{0},
> COSInt{0}, PDFOperator{cm}, COSName{Im1}, PDFOperator{Do}, PDFOperator{Q}
>
> I cut them all to remove the image.
> But nothing is as easy as it seems, and this approach is ruining some of
> the fonts on the page.
>
> Is there someone who understands this better and can shed some light on
> this problem?
>
>
>
>
>
> -----Original Message-----
> From: José Rodolfo Carrijo de Freitas [mailto:[email protected]]
> Sent: Friday, October 22, 2010 09:39
> To: [email protected]
> Subject: copy entire stream of a page ignoring images
>
> Hello,
>
> I'm trying to write a function to copy the stream of a page to another
> page.
>
> The thing is that the PDFStreamParser does not seem to be parsing text,
> because I'm not getting any text on my new page.
>
> Besides, I'm getting a warning when opening the new pages in Adobe
> Reader.
>
> Has anyone written a similar function, or could anyone give me a little
> help here?
>
> PS: does anyone know of a PDF utility that can inspect the elements of a
> stream?
>
>
>
>
>
> private void copyPageWithoutImage(PDPage page, PDPage newpage) throws IOException {
>     PDStream contents = page.getContents();
>     PDFStreamParser parser = new PDFStreamParser(contents.getStream());
>     try {
>         // Tokens that will be written to the new page.
>         List<Object> tokensNovos = new LinkedList<Object>();
>         Iterator<Object> iter = parser.getTokenIterator();
>         // Operands collected since the last operator.
>         List<Object> arguments = new ArrayList<Object>();
>         while (iter.hasNext()) {
>             boolean allowNext = true;
>             Object next = iter.next();
>             if (next instanceof COSName) {
>                 // Debug output: print every name token.
>                 System.out.println(((COSName) next).getName());
>             }
>             if (next instanceof COSObject) {
>                 arguments.add(((COSObject) next).getObject());
>             } else if (next instanceof PDFOperator) {
>                 PDFOperator op = (PDFOperator) next;
>                 if (op.getOperation().equals("Do") && arguments.size() > 0) {
>                     Object aux = arguments.get(0);
>                     if (aux instanceof COSName) {
>                         COSName objectName = (COSName) aux;
>                         PDXObject xobject = (PDXObject) page.getResources().getXObjects().get(objectName.getName());
>                         if (xobject instanceof PDXObjectImage) {
>                             // Skip the Do operator when it draws an image.
>                             // Its COSName operand has already been added above.
>                             allowNext = false;
>                         }
>                     }
>                 }
>                 // An operator ends the current operand list.
>                 arguments = new ArrayList<Object>();
>             } else {
>                 arguments.add(next);
>             }
>             if (allowNext) {
>                 tokensNovos.add(next);
>             }
>         }
>
>         // Give the new page a content stream containing an empty text object.
>         PDPageContentStream contentStream = new PDPageContentStream(this.pdf, newpage);
>         contentStream.beginText();
>         contentStream.endText();
>         contentStream.close();
>
>         // Overwrite that stream with the filtered tokens.
>         PDStream updatedStream = newpage.getContents();
>         ContentStreamWriter tokenWriter = new ContentStreamWriter(updatedStream.createOutputStream());
>         tokenWriter.writeTokens(tokensNovos);
>         newpage.setContents(updatedStream);
>     } finally {
>         if (parser != null) {
>             parser.close();
>         }
>     }
> }
>
>
>
>
>