Re: Using PDFStreamParser

Tilman Hausherr Mon, 21 Dec 2015 10:10:09 -0800

Could you retry with the current version? Either get the -SNAPSHOT through Maven, or from

https://repository.apache.org/content/groups/snapshots/org/apache/pdfbox/pdfbox-app/2.0.0-SNAPSHOT/

I can't reproduce what you mean (I tested with the trunk), so either I missed it, or (as I suspect) it is a bug that I fixed a short time ago (PDFBOX-3107). However, I'm also unable to reproduce it with RC2 and RC1.


Tilman

On 21.12.2015 at 16:02, [email protected] wrote:

Hi!

I am seeing very strange behaviour while copying a file with PDFBox's PDFStreamParser (RC2).

I modified RemoveAllText so that it does not remove any text:

    public static void main( String[] args ) throws IOException
    {
...
        PDDocument document = null;
        try
        {
            document = PDDocument.load( new File(args[0]) );
            if( document.isEncrypted() )
            {
                System.err.println( "Error: Encrypted documents are not supported for this example." );
                System.exit( 1 );
            }
            for( PDPage page : document.getPages() )
            {
                // Parse the page content stream into tokens.
                PDFStreamParser parser = new PDFStreamParser(page);
                parser.parse();
                List<Object> tokens = parser.getTokens();
                // Copy every token unchanged (nothing is removed).
                List<Object> newTokens = new ArrayList<Object>();
                for (Object token : tokens)
                {
                    newTokens.add( token );
                }
                // Write the copied tokens to a new Flate-compressed stream
                // and use it as the page contents.
                PDStream newContents = new PDStream( document );
                OutputStream out = newContents.createOutputStream(COSName.FLATE_DECODE);
                ContentStreamWriter writer = new ContentStreamWriter( out );
                writer.writeTokens( newTokens );
                out.close();
                page.setContents( newContents );
            }
            document.save( args[1] );
        }
        finally
        {
            if( document != null )
            {
                document.close();
            }
        }
    }

I opened both PDFs with PDFDebugger, and the Contents text view is identical for
both files (see the second TJ!). In the hex view there are differences in space
(20) and LF (0A) characters, where an EOL seems to have been inserted or replaced.

BT
   0 0 0 1 k
   /T1_0 1 Tf
   10 0 0 10 32.4181 265.8897 Tm
   [ (\037\036\035\034\033\032\031\030\027) -28
(\026\025\035\024\023\022\025\031\031\030\035\021) ] TJ
   /T1_1 1 Tf
   9.8 0 0 10 32.4181 253.8897 Tm
   [ (\037\036\035\034\033\032\031\030\027\026\025\024) -53 (\023\022\024)
-53 (\021\020\017\016\024) -53 (\015\023\014\013\012\011\024) -53
(\010\030\027\026\025\024) -53 (\015\007\020\017\016\024) -53
(\015\011\024) -53 (\006\025\033\005\025\004\026\003\025\002\026\024) -53
(\002\001\027\024) -53 (\177\004\025\024) -53 ... TJ
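
To check whether two decoded content streams differ only in token-separating whitespace, a small helper along the following lines could be used. This is a minimal sketch and not part of PDFBox: the class and method names (ContentStreamWhitespace, normalize, sameOutsideStrings) are made up for illustration. It assumes the streams are already decompressed and contain no comments, hex strings, or inline images. Runs of PDF whitespace outside literal strings are collapsed to one space, while bytes inside ( ... ) are kept verbatim, because there a changed end-of-line byte changes the string itself.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class ContentStreamWhitespace {
    // The six white-space characters defined by the PDF syntax.
    static boolean isPdfWhitespace(int b) {
        return b == 0x00 || b == 0x09 || b == 0x0A || b == 0x0C || b == 0x0D || b == 0x20;
    }

    // Collapses whitespace runs outside literal strings to a single space
    // and copies the bytes of literal strings unchanged.
    static byte[] normalize(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int depth = 0;                 // parenthesis nesting inside a literal string
        boolean pendingSpace = false;  // a whitespace run was seen outside a string
        for (int i = 0; i < in.length; i++) {
            int b = in[i] & 0xFF;
            if (depth > 0) {
                out.write(b);
                if (b == '\\' && i + 1 < in.length) {
                    out.write(in[++i] & 0xFF);   // escaped byte, copied verbatim
                } else if (b == '(') {
                    depth++;
                } else if (b == ')') {
                    depth--;
                }
            } else if (isPdfWhitespace(b)) {
                pendingSpace = true;
            } else {
                if (pendingSpace && out.size() > 0) {
                    out.write(' ');
                }
                pendingSpace = false;
                out.write(b);
                if (b == '(') {
                    depth = 1;
                }
            }
        }
        return out.toByteArray();
    }

    // True if the streams are equal once separator whitespace is normalized.
    static boolean sameOutsideStrings(byte[] a, byte[] b) {
        return Arrays.equals(normalize(a), normalize(b));
    }

    public static void main(String[] args) {
        byte[] original = "BT\n/T1_0 1 Tf\n(ab) Tj\nET\n".getBytes(StandardCharsets.ISO_8859_1);
        byte[] rewritten = "BT /T1_0 1 Tf (ab) Tj ET".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sameOutsideStrings(original, rewritten));
    }
}
```

The comparison is deliberately conservative: it can report a difference for harmless changes such as "[(" versus "[ (", but it should not hide a changed byte inside a string.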

Consequently, the preview in PDFDebugger (page two!) is the same too.

Übungskarte 49 (INT 1463), Karte 1/INT 1, Begleitheft für die
Kartenaufgaben im Fach Navigation für den SKS (Ausgabe 2013)



But when I open the new PDF file with Adobe Reader 11.0.10.32, the text
has changed! 1 is now ), but not in 2013!

Übungskarte 49 (INT )463), Karte )/INT ), Begleitheft für die
Kartenaufgaben im Fach Navigation für den SKS (Ausgabe 2013)

On page three Aufgabe is now Auf0abe.

I have no idea how this can happen. Is there information anywhere else
except in the TJ-Block? The file size (old 960 K, new 1041 K) is slightly
different for 81 pages.
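
On the question of where else information lives: the strings in a TJ array hold character codes, not readable text. Which glyph a code selects is decided by the font chosen with Tf (/T1_0 and /T1_1 above), which is defined in the page resources. The sketch below decodes the body of a PDF literal string into those raw codes. It is an illustration only: the class name PdfLiteralString and its decode method are hypothetical and not part of PDFBox, and the input is assumed to be Latin-1 text without the outer parentheses. It also applies the general PDF syntax rule that an unescaped end-of-line inside a literal string is read as a single LF (0A). That is why whitespace changes inside a string can alter character codes. Whether that is what happened in this file is not established here.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

final class PdfLiteralString {
    // Decodes the text between the outer parentheses of a literal string.
    static byte[] decode(String body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        int n = body.length();
        while (i < n) {
            char c = body.charAt(i++);
            if (c == '\r') {
                // Unescaped CR or CR LF is read as one LF.
                if (i < n && body.charAt(i) == '\n') {
                    i++;
                }
                out.write('\n');
            } else if (c != '\\') {
                out.write(c);
            } else if (i < n) {
                char e = body.charAt(i++);
                if (e >= '0' && e <= '7') {
                    // Octal escape with one to three digits.
                    int v = e - '0';
                    for (int k = 0; k < 2 && i < n
                            && body.charAt(i) >= '0' && body.charAt(i) <= '7'; k++) {
                        v = v * 8 + (body.charAt(i++) - '0');
                    }
                    out.write(v & 0xFF);
                } else {
                    switch (e) {
                        case 'n': out.write('\n'); break;
                        case 'r': out.write('\r'); break;
                        case 't': out.write('\t'); break;
                        case 'b': out.write('\b'); break;
                        case 'f': out.write('\f'); break;
                        case '\r':
                            // Backslash before an end-of-line: line continuation.
                            if (i < n && body.charAt(i) == '\n') {
                                i++;
                            }
                            break;
                        case '\n':
                            break;
                        default:
                            // \( \) \\ and unknown escapes: the backslash is dropped.
                            out.write(e);
                    }
                }
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // First string of the first TJ array quoted above.
        byte[] codes = decode("\\037\\036\\035\\034\\033\\032\\031\\030\\027");
        System.out.println(Arrays.toString(codes));
    }
}
```

Comparing the decoded codes of the same string in the old and the new file would show directly whether a code such as 015 (CR) has turned into 012 (LF).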

This is the pdf
https://www.elwis.de/Freizeitschifffahrt/fuehrerscheininformationen/Navigationsaufgaben-SKS.pdf


Thanks

Hans Stemmer



