I've found the following scheme helpful for identifying pdfs with 
duplicate content and annotations. (Thanks to Christiaan for his 
guidance re pdf hashes.)

Convert each pdf with embedded notes to a pdf with separate Skim 
notes, and then to a pdfd bundle. (The skimalot shell script at 
http://sourceforge.net/mailarchive/message.php?msg_id=28528620 can 
help with this.)
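
For batch runs, here is a minimal driver, assuming the linked 
skimalot script is saved on your PATH as "skimalot" and accepts a 
single pdf path (check the linked post for its actual name and 
arguments):

#!/bin/sh
# Run the (assumed) skimalot script over every pdf under a folder,
# defaulting to the current directory.
find "${1:-.}" -type f -name '*.pdf' -exec skimalot '{}' \;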

Identify pdfs identical both in content and annotation. I found it 
efficient to do this in several steps.

Identify pdf bundles with matching hashes. (I used File Buddy to 
find duplicates; other tools are listed below.) This "strict" 
duplicate test will find some but not all duplicates. (See 
Christiaan's note below.) Select bundles for deletion based on 
location, modification date, or other metadata.
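
If you'd rather skip the GUI, here is a rough sketch of that hash 
pass in shell. It assumes that hashing a bundle's concatenated pdf 
and .skim payloads is a fair stand-in for comparing the whole 
bundle:

#!/bin/bash
# Group .pdfd bundles by an md5 digest of their payload files;
# bundles sharing a digest are "strict" duplicate candidates.
dupes_by_hash() {   # reads "key<TAB>path" lines; prints matching groups
  sort | awk -F'\t' '
    $1 == prev { if (!shown) print prevline; print; shown = 1 }
    $1 != prev { shown = 0 }
    { prev = $1; prevline = $0 }'
}
find "${1:-.}" -type d -name '*.pdfd' | while IFS= read -r b; do
  printf '%s\t%s\n' "$(cat "$b"/*.pdf "$b"/*.skim 2>/dev/null | md5)" "$b"
done | dupes_by_hash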

Identify Skim note ".skim" files with matching hashes. This will 
find some additional duplicates. As above, select the containing 
pdfd bundles for deletion. (This will also find any matching .skim 
files outside of bundles.)
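
Reusing the dupes_by_hash helper from the sketch above, the .skim 
pass is one pipeline (md5 -q prints the bare digest on OS X):

find "${1:-.}" -type f -name '*.skim' | while IFS= read -r f; do
  printf '%s\t%s\n' "$(md5 -q "$f")" "$f"
done | dupes_by_hash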

Identify duplicate Skim note ".txt" files. This will catch Skim note 
sets that are identical except for note position. It might also 
wrongly flag as duplicates pdfs with different content but identical 
annotations (e.g., a "Draft" text note), so check filenames or open 
the files. As above, select the containing pdfd bundles for 
deletion.
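
The same pipeline works for the note .txt files; the -path test 
below assumes they sit inside the bundles, so widen it if yours live 
elsewhere:

find "${1:-.}" -type f -name '*.txt' -path '*.pdfd/*' | while IFS= read -r f; do
  printf '%s\t%s\n' "$(md5 -q "$f")" "$f"
done | dupes_by_hash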

Identify pdfs with identical content (irrespective of annotation).

Identify pdfs outside of bundles with matching hashes. As above, this 
"strict" test will miss some duplicates. Select for deletion as above.

Identify pdfs outside of bundles with matching size. This might yield 
some false positives, so check filenames. It might also miss some 
pdfs that match in appearance. Select for deletion as above.
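
File size works as the grouping key too; here BSD stat's %z format 
(size in bytes) stands in for the digest:

find "${1:-.}" -type f -name '*.pdf' -not -path '*.pdfd/*' | while IFS= read -r f; do
  printf '%s\t%s\n' "$(stat -f '%z' "$f")" "$f"
done | dupes_by_hash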

Identify pdfs inside and outside bundles with matching size. This 
permits 1) deletion of some un-annotated pdfs for which there are 
annotated duplicates, and 2) identification of pdfds with different 
Skim notes. (This step will also find pdfs in non-pdfd bundles such 
as DEVONThink, OmniGraffle, Scrivener, and rtfd.)
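
For this combined pass, drop the -not -path filter from the size 
sketch above so bundled and loose pdfs land in the same groups:

find "${1:-.}" -type f -name '*.pdf' | while IFS= read -r f; do
  printf '%s\t%s\n' "$(stat -f '%z' "$f")" "$f"
done | dupes_by_hash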

Those are the basic steps. (I've used the above scheme to pare a 
collection of some 7000 pdfs down to 5000.) Beyond that, one can:

Identify pdfs that match in appearance using comparepdf 
(http://www.qtrac.eu/comparepdf.html), which does a pairwise 
comparison. The code is open source and should be extensible to an 
n-way comparison.
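
Until someone writes that n-way version, a shell loop can brute- 
force the pairwise checks over a short candidate list. This assumes 
comparepdf exits 0 when the files compare equal and accepts -ca to 
select appearance comparison; check comparepdf --help before relying 
on either:

#!/bin/bash
# Compare every pair among the pdfs given as arguments.
pdfs=("$@")
n=${#pdfs[@]}
for ((i = 0; i < n; i++)); do
  for ((j = i + 1; j < n; j++)); do
    if comparepdf -ca "${pdfs[$i]}" "${pdfs[$j]}" >/dev/null 2>&1; then
      printf 'match: %s == %s\n' "${pdfs[$i]}" "${pdfs[$j]}"
    fi
  done
done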

Identify pdfs with nearly duplicative text using a shell script based 
on pdftotext parsing and word count 
(http://us.generation-nt.com/answer/detecting-duplicate-pdf-files-word-count-approach-help-173241621.html).
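
In that spirit, a crude first screen is to group pdfs by extracted 
word count, again reusing dupes_by_hash from above; anything it 
flags still needs eyeballing:

find "${1:-.}" -type f -name '*.pdf' | while IFS= read -r f; do
  printf '%s\t%s\n' "$(pdftotext "$f" - 2>/dev/null | wc -w | tr -d ' ')" "$f"
done | dupes_by_hash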

Compare pdfs visually side-by-side using diffpdf 
(http://www.qtrac.eu/diffpdf.html) or various other tools.

Mac duplicate finders other than File Buddy include Find Duplicate 
Files, DupeGuru, and GrupaDupa.



humanengr

At 1:55 AM +0200 10/30/11, Christiaan Hofman wrote:
>PDF data is far from uniquely determined by its content of 
>information. So there is no reason why the data of the same PDF 
>saved at different times will produce the exact same data (and when 
>using different programs/libraries it will be even less unique). 
>This is very different for plain text and RTF. So there's nothing 
>odd about it, it's the way it is and what you should expect.
