Re: make git ignore the timestamp embedded in PDFs

2013-05-18 Thread Andreas Leha
Hi Hannes,

thanks for taking this up and sorry for the long delay in my answer.

Johannes Sixt <j...@kdbg.org> writes:

> On 14.05.2013 15:17, Andreas Leha wrote:
>> Hi all,
>>
>> how can I make git ignore the time stamp(s) in a PDF.  Two PDFs that
>> differ only in these time stamps should be considered identical.
>> ...
>> What I tried is a filter:
>> ,----[ ~/.gitconfig ]
>> | [filter "pdfresetdate"]
>> |     clean = pdfresetdate
>> `----
>>
>> This 'works' as far as the committed pdf indeed has the date reset to my
>> default value.
>>
>> However, when I re-checkout the files, they are marked modified by git.
>
> I'm using cleaned files every now and then, but not on Linux. I have
> never observed this behavior recently.
>
> If you 'git add' the file, does it keep its modified state? Does 'git

yes.

> diff' tell a difference?

no.

Here is a complete 'session':
,----
| $ mkdir test
| $ cd test
| $ git init
| $ echo '*.pdf filter=pdfresetdate' > .gitattributes
| $ cp ~/PDF/score_table.pdf .
| $ pdfinfo score_table.pdf
| Title:          (score_table)
| Author:         (andreas)
| Creator:        GPL Ghostscript 905 (ps2write)
| Producer:       GPL Ghostscript 9.05
| CreationDate:   Fri Feb  8 15:44:47 2013
| ModDate:        Fri Feb  8 15:44:47 2013
| Tagged:         no
| Pages:          1
| Encrypted:      no
| Page size:      595 x 842 pts (A4)
| File size:      36989 bytes
| Optimized:      no
| PDF version:    1.4
| $ git add score_table.pdf
| $ pdfinfo score_table.pdf
| Title:          (score_table)
| Author:         (andreas)
| Creator:        GPL Ghostscript 905 (ps2write)
| Producer:       GPL Ghostscript 9.05
| CreationDate:   Fri Feb  8 15:44:47 2013
| ModDate:        Fri Feb  8 15:44:47 2013
| Tagged:         no
| Pages:          1
| Encrypted:      no
| Page size:      595 x 842 pts (A4)
| File size:      36989 bytes
| Optimized:      no
| PDF version:    1.4
| $ git commit -m test
| $ pdfinfo score_table.pdf
| Title:          (score_table)
| Author:         (andreas)
| Creator:        GPL Ghostscript 905 (ps2write)
| Producer:       GPL Ghostscript 9.05
| CreationDate:   Fri Feb  8 15:44:47 2013
| ModDate:        Fri Feb  8 15:44:47 2013
| Tagged:         no
| Pages:          1
| Encrypted:      no
| Page size:      595 x 842 pts (A4)
| File size:      36989 bytes
| Optimized:      no
| PDF version:    1.4
| $ rm score_table.pdf
| $ git checkout score_table.pdf
| $ git status
| # On branch master
| # Changes not staged for commit:
| #   (use "git add <file>..." to update what will be committed)
| #   (use "git checkout -- <file>..." to discard changes in working directory)
| #
| #   modified:   score_table.pdf
| #
| # Untracked files:
| #   (use "git add <file>..." to include in what will be committed)
| #
| #   .gitattributes
| no changes added to commit (use "git add" and/or "git commit -a")
| $ pdfinfo score_table.pdf
| Title:          (score_table)
| Author:         (andreas)
| Creator:        GPL Ghostscript 905 (ps2write)
| Producer:       GPL Ghostscript 9.05
| CreationDate:   Mon Jan  1 07:26:19 1979
| ModDate:        Mon Jan  1 07:26:19 1979
| Tagged:         no
| Pages:          1
| Encrypted:      no
| Page size:      595 x 842 pts (A4)
| File size:      37126 bytes
| Optimized:      no
| PDF version:    1.4
| $ git add score_table.pdf
| $ git status
| # On branch master
| # Changes to be committed:
| #   (use "git reset HEAD <file>..." to unstage)
| #
| #   modified:   score_table.pdf
| #
| # Untracked files:
| #   (use "git add <file>..." to include in what will be committed)
| #
| #   .gitattributes
| $ git diff score_table.pdf
| $
`----
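
One way to look at the "modified" state in a session like this: git runs the clean filter on the working-tree file and compares the result with the blob it has stored. The sketch below reproduces that comparison in a throwaway repository. The filter (`sed s/^/x/`), the attribute name `demo`, and the file names are stand-ins chosen for illustration; they are not the pdfresetdate setup used above, and the sketch only assumes that git and sed are installed.

```shell
#!/bin/bash
# Illustration only: runs in a temporary repository and does not touch
# ~/.gitconfig or any existing work tree.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Stand-in for a clean filter whose output changes on every pass:
# it prepends an "x" to each line each time it runs.
git config filter.demo.clean 'sed s/^/x/'
echo '*.txt filter=demo' > .gitattributes

printf 'payload\n' > file.txt
git add file.txt                        # the index now holds the cleaned content

# Put the stored (already cleaned) content into the work tree, which is
# what "rm file; git checkout file" does when no smudge filter is set.
git cat-file blob :file.txt > file.txt

stored=$(git rev-parse :file.txt)       # blob id recorded in the index
recleaned=$(git hash-object file.txt)   # work-tree file after another clean pass
echo "stored:    $stored"
echo "recleaned: $recleaned"
```

If the two ids differ, git has reason to treat the freshly checked-out file as modified. With a filter that leaves already-cleaned input untouched, the two ids would be the same.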

Regards,
Andreas

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: make git ignore the timestamp embedded in PDFs

2013-05-18 Thread Andreas Leha
Johannes Sixt <j...@kdbg.org> writes:

> On 18.05.2013 09:42, Andreas Leha wrote:
>>> On 14.05.2013 15:17, Andreas Leha wrote:
>>>> Hi all,
>>>>
>>>> how can I make git ignore the time stamp(s) in a PDF.  Two PDFs that
>>>> differ only in these time stamps should be considered identical.
>>>> ...
>>>> What I tried is a filter:
>>>> ,----[ ~/.gitconfig ]
>>>> | [filter "pdfresetdate"]
>>>> |     clean = pdfresetdate
>>>> `----
>>>>
>>>> This 'works' as far as the committed pdf indeed has the date reset to my
>>>> default value.
>>>>
>>>> However, when I re-checkout the files, they are marked modified by git.
>>>
>>> I'm using cleaned files every now and then, but not on Linux. I have
>>> never observed this behavior recently.
>>>
>>> If you 'git add' the file, does it keep its modified state? Does 'git
>>
>> yes.
>>
>>> diff' tell a difference?
>>
>> no.
>
> I do not believe you. I'm sure that "Binary files differ" was
> reported.

You are correct, of course.  I had forgotten that I also had enabled a
special diff for pdf files, that reports the difference in the pdfinfo
output.

> The reason is that your pdfresetdate script is not idempotent. Look:
>
> $ pdfresetdate < x.pdf > y.pdf
> $ pdfresetdate < y.pdf > z.pdf
> $ md5sum x.pdf y.pdf z.pdf
> c46a7097574a035e89d1a46d93c83528  x.pdf
> 8e6d942b4cc7d8a4dfe6898867573617  y.pdf
> e6333bc0f8ab9781d3e1d811a392d516  z.pdf


Thanks for that.  I had not noticed due to the non-binary diff I had
enabled.

> A file that was already cleaned by the clean filter must not be
> modified, i.e., the y.pdf and z.pdf should be identical. But they are not.
>
> Fix your clean filter.

I will (try to) do so.  Anyway, git does not seem to be responsible for my issue.
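
The x/y/z comparison quoted above can be wrapped in a small helper that checks any stdin-to-stdout filter. This is a minimal sketch, not something posted in the thread: the function name `check_idempotent` is made up here, and it assumes the filter can be run as a shell command string.

```shell
#!/bin/bash
# check_idempotent FILTER_COMMAND INPUT_FILE
# Runs the filter twice (input -> once -> twice) and compares the two
# outputs byte for byte.
# Returns 0 if the second pass changes nothing, 1 if the outputs differ
# or the filter fails, 2 if no temporary file could be created.
check_idempotent() {
    local filter=$1 input=$2
    local once twice status
    once=$(mktemp) || return 2
    twice=$(mktemp) || { rm -f "$once"; return 2; }

    if sh -c "$filter" < "$input" > "$once" &&
       sh -c "$filter" < "$once" > "$twice" &&
       cmp -s "$once" "$twice"; then
        status=0
    else
        status=1
    fi

    rm -f "$once" "$twice"
    return $status
}
```

A call such as `check_idempotent pdfresetdate x.pdf && echo stable` would then print "stable" only for a filter that leaves already-cleaned input untouched, which is the property a clean filter needs.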


Thanks for that clear analysis!

Regards,
Andreas



make git ignore the timestamp embedded in PDFs

2013-05-14 Thread Andreas Leha
Hi all,

how can I make git ignore the time stamp(s) in a PDF.  Two PDFs that
differ only in these time stamps should be considered identical.

Here is an example:
,----
| $ pdfinfo some.pdf
| Title:          R Graphics Output
| Creator:        R
| Producer:       R 2.15.1
| CreationDate:   Thu Jan 24 13:43:31 2013   <== these entries
| ModDate:        Thu Jan 24 13:43:31 2013   <== should be ignored
| Tagged:         no
| Pages:          1
| Encrypted:      no
| Page size:      504 x 504 pts
| File size:      54138 bytes
| Optimized:      no
| PDF version:    1.4
`----


What I tried is a filter:
,----[ ~/.gitconfig ]
| [filter "pdfresetdate"]
|     clean = pdfresetdate
`----

With this filter script:
,----[ pdfresetdate ]
| #!/bin/bash
| # Reset the dates in a PDF's metadata to a fixed value.
| # With a file argument the file is rewritten in place; without one the
| # script works as a filter (stdin -> stdout), which is how git calls it.
|
| FILEASARG=true
| if [ $# -eq 0 ]; then
|     FILEASARG=false
| fi
|
| if $FILEASARG ; then
|     FILENAME=$1
| else
|     FILENAME=$(mktemp)
|     cat /dev/stdin > "$FILENAME"
| fi
|
| TMPFILE=$(mktemp)
| TMPFILE2=$(mktemp)
|
| ## dump the pdf metadata to a file and replace the dates
| pdftk "$FILENAME" dump_data | sed -e '{N;s/Date\nInfoValue: D:.*/Date\nInfoValue: D:19790101072619/}' > "$TMPFILE"
|
| ## update the pdf metadata
| pdftk "$FILENAME" update_info "$TMPFILE" output "$TMPFILE2"
|
| ## overwrite the original pdf
| mv -f "$TMPFILE2" "$FILENAME"
|
| ## clean up
| rm -f "$TMPFILE"
| ## in filter mode, write the result to stdout and drop the temporary copy
| if ! $FILEASARG ; then
|     cat "$FILENAME"
|     rm -f "$FILENAME"
| fi
`----


This 'works' as far as the committed pdf indeed has the date reset to my
default value.

However, when I re-checkout the files, they are marked modified by git.

So, my question is:  How can I make git *completely* ignore the embedded
date in the PDF?
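
One possible direction, offered as a sketch under stated assumptions and not as a tested solution for these files: overwrite only the digits of the date strings and leave every other byte of the PDF alone, so that a second pass has nothing left to change. The function name `pdf_fix_dates` is made up here. The sketch assumes that the dates appear as literal, uncompressed `/CreationDate (D:...)` and `/ModDate (D:...)` strings with a 14-digit timestamp, and that the local sed copes with binary input in the C locale (GNU sed does). It does not touch a time-zone suffix, XMP metadata, or the document ID, any of which may also differ between two builds of the same PDF.

```shell
#!/bin/bash
# pdf_fix_dates: hypothetical stdin -> stdout filter.
# Replaces the 14 timestamp digits after "D:" in /CreationDate and /ModDate
# with the fixed value 19790101072619. The replacement has the same length
# as the original digits, so byte offsets elsewhere in the file stay valid.
pdf_fix_dates() {
    LC_ALL=C sed -E 's#(/(CreationDate|ModDate)[[:space:]]*\(D:)[0-9]{14}#\119790101072619#g'
}
```

Because the replacement text matches the same pattern it was substituted for, running the function on its own output rewrites the same digits with the same value. Whether the assumptions hold for a given PDF can be checked by comparing the result of one pass and of two passes with cmp before using it as a clean filter.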

Many thanks in advance for any help!

Regards,
Andreas


PS:
I had posted this question (without much success) here:
http://stackoverflow.com/questions/16058187/make-git-ignore-the-date-in-pdf-files
and with no answer on the git-users mailing list:
https://groups.google.com/forum/#!topic/git-users/KqtecNa3cOc
