I've found the following scheme helpful for identifying pdfs with 
duplicate content and annotations. (Thanks to Christiaan for his 
guidance re pdf hashes.)

Convert each pdf with embedded notes to a pdf with separate Skim 
notes, and then to a pdfd bundle. (The skimalot shell script at 
http://sourceforge.net/mailarchive/message.php?msg_id=28528620 can 
help with this.)
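
For batch runs, here is a minimal driver, assuming the linked 
skimalot script is saved on your PATH as "skimalot" and accepts a 
single pdf path (check the linked post for its actual name and 
arguments):

#!/bin/sh
# Run the (assumed) skimalot script over every pdf under a folder,
# defaulting to the current directory.
find "${1:-.}" -type f -name '*.pdf' -exec skimalot '{}' \;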

Identify pdfs identical both in content and annotation. I found it 
efficient to do this in several steps.

Identify pdf bundles with matching hashes. (I used File Buddy to 
find duplicates; other tools are listed below.) This "strict" 
duplicate test will find some but not all duplicates. (See 
Christiaan's note below.) Select bundles for deletion based on 
location, modification date, or other metadata.
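
If you'd rather skip the GUI, here is a rough sketch of that hash 
pass in shell. It assumes that hashing a bundle's concatenated pdf 
and .skim payloads is a fair stand-in for comparing the whole 
bundle:

#!/bin/bash
# Group .pdfd bundles by an md5 digest of their payload files;
# bundles sharing a digest are "strict" duplicate candidates.
dupes_by_hash() {   # reads "key<TAB>path" lines; prints matching groups
  sort | awk -F'\t' '
    $1 == prev { if (!shown) print prevline; print; shown = 1 }
    $1 != prev { shown = 0 }
    { prev = $1; prevline = $0 }'
}
find "${1:-.}" -type d -name '*.pdfd' | while IFS= read -r b; do
  printf '%s\t%s\n' "$(cat "$b"/*.pdf "$b"/*.skim 2>/dev/null | md5)" "$b"
done | dupes_by_hash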

Identify Skim note ".skim" files with matching hashes. This will 
find some additional duplicates. As above, select the containing 
pdfd bundles for deletion. (This will also find any matching .skim 
files outside of bundles.)
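
Reusing the dupes_by_hash helper from the sketch above, the .skim 
pass is one pipeline (md5 -q prints the bare digest on OS X):

find "${1:-.}" -type f -name '*.skim' | while IFS= read -r f; do
  printf '%s\t%s\n' "$(md5 -q "$f")" "$f"
done | dupes_by_hash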

Identify duplicate Skim note ".txt" files. This will catch Skim note 
sets that are identical except for note position. It might also 
wrongly flag as duplicates pdfs with different content but identical 
annotations (e.g., a "Draft" text note), so check filenames or open 
the files. As above, select the containing pdfd bundles for 
deletion.
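
The same pipeline works for the note .txt files; the -path test 
below assumes they sit inside the bundles, so widen it if yours live 
elsewhere:

find "${1:-.}" -type f -name '*.txt' -path '*.pdfd/*' | while IFS= read -r f; do
  printf '%s\t%s\n' "$(md5 -q "$f")" "$f"
done | dupes_by_hash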

Identify pdfs with identical content (irrespective of annotation).

Identify pdfs outside of bundles with matching hashes. As above, this 
"strict" test will miss some duplicates. Select for deletion as above.

Identify pdfs outside of bundles with matching size. This might yield 
some false positives, so check filenames. It might also miss some 
pdfs that match in appearance. Select for deletion as above.
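
File size works as the grouping key too; here BSD stat's %z format 
(size in bytes) stands in for the digest:

find "${1:-.}" -type f -name '*.pdf' -not -path '*.pdfd/*' | while IFS= read -r f; do
  printf '%s\t%s\n' "$(stat -f '%z' "$f")" "$f"
done | dupes_by_hash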

Identify pdfs inside and outside bundles with matching size. This 
permits 1) deletion of some un-annotated pdfs for which there are 
annotated duplicates, and 2) identification of pdfds with different 
Skim notes. (This step will also find pdfs in non-pdfd bundles such 
as DEVONThink, OmniGraffle, Scrivener, and rtfd.)
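
For this combined pass, drop the -not -path filter from the size 
sketch above so bundled and loose pdfs land in the same groups:

find "${1:-.}" -type f -name '*.pdf' | while IFS= read -r f; do
  printf '%s\t%s\n' "$(stat -f '%z' "$f")" "$f"
done | dupes_by_hash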

Those are the basic steps. (I've used the above scheme to pare a 
collection of some 7000 pdfs down to 5000.) Beyond that, one can:

Identify pdfs that match in appearance using comparepdf 
(http://www.qtrac.eu/comparepdf.html), which does a pairwise 
comparison. The code is open source and should be extensible to an 
n-way comparison.
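
Until someone writes that n-way version, a shell loop can brute- 
force the pairwise checks over a short candidate list. This assumes 
comparepdf exits 0 when the files compare equal and accepts -ca to 
select appearance comparison; check comparepdf --help before relying 
on either:

#!/bin/bash
# Compare every pair among the pdfs given as arguments.
pdfs=("$@")
n=${#pdfs[@]}
for ((i = 0; i < n; i++)); do
  for ((j = i + 1; j < n; j++)); do
    if comparepdf -ca "${pdfs[$i]}" "${pdfs[$j]}" >/dev/null 2>&1; then
      printf 'match: %s == %s\n' "${pdfs[$i]}" "${pdfs[$j]}"
    fi
  done
done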

Identify pdfs with nearly duplicative text using a shell script based 
on pdftotext parsing and word count 
(http://us.generation-nt.com/answer/detecting-duplicate-pdf-files-word-count-approach-help-173241621.html).
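
In that spirit, a crude first screen is to group pdfs by extracted 
word count, again reusing dupes_by_hash from above; anything it 
flags still needs eyeballing:

find "${1:-.}" -type f -name '*.pdf' | while IFS= read -r f; do
  printf '%s\t%s\n' "$(pdftotext "$f" - 2>/dev/null | wc -w | tr -d ' ')" "$f"
done | dupes_by_hash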

Compare pdfs visually side-by-side using diffpdf 
(http://www.qtrac.eu/diffpdf.html) or various other tools.

Mac duplicate finders other than File Buddy include Find Duplicate 
Files, DupeGuru, and GrupaDupa.



humanengr

At 1:55 AM +0200 10/30/11, Christiaan Hofman wrote:
>PDF data is far from uniquely determined by its content of 
>information. So there is no reason why the data of the same PDF 
>saved at different times will produce the exact same data (and when 
>using different programs/libraries it will be even less unique). 
>This is very different for plain text and RTF. So there's nothing 
>odd about it, it's the way it is and what you should expect.
