Re: [CODE4LIB] Checksums for objects and not embedded metadata

2015-01-28 Thread Ronald Houk
Hello,

I like Danielle's idea.  I wonder if it wouldn't be a good idea to decouple
the metadata from the data permanently.  ExifTool can export the metadata
in many different formats, including JSON.  You could export the metadata
to JSON, run the checksums, and then store the photo and the JSON file in a
single tarball. From there you could use a JSON editor to modify or add
metadata.

It would be simple to reintroduce the metadata into the file when needed.
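
A minimal sketch of that workflow in Python, assuming exiftool is on the
PATH and using a hypothetical filename (the tarball layout is just one
convention):

    import hashlib
    import subprocess
    import tarfile

    image = "photo_0001.tif"  # hypothetical filename

    # Export the embedded metadata as JSON with ExifTool
    meta = subprocess.run(["exiftool", "-json", image],
                          capture_output=True, check=True, text=True).stdout
    with open(image + ".json", "w") as f:
        f.write(meta)

    # Checksum the image itself
    with open(image, "rb") as f:
        print(image, hashlib.md5(f.read()).hexdigest())

    # Bundle the image and its metadata sidecar into one tarball
    with tarfile.open(image + ".tar", "w") as tar:
        tar.add(image)
        tar.add(image + ".json")

If I'm reading the ExifTool docs right, the reintroduction step is just as
short: exiftool -json=photo_0001.tif.json photo_0001.tif.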

-- 
Ronald Houk
Assistant Director
Ottumwa Public Library
102 W. Fourth Street
Ottumwa, IA 52501
(641)682-7563x203
rh...@ottumwapubliclibrary.org


Re: [CODE4LIB] Checksums for objects and not embedded metadata

2015-01-28 Thread Ronald Houk
Also just stumbled across this on Stack Overflow:

http://stackoverflow.com/questions/12115824/compute-the-hash-of-only-the-core-image-data-of-a-tiff
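
For what it's worth, the gist of that thread fits in a few lines with
Pillow (an assumption on my part; untested): hash only the decoded pixel
data rather than the byte stream, so metadata edits don't change the
digest, though re-encoding the image would.

    import hashlib
    from PIL import Image  # pip install Pillow

    def pixel_digest(path):
        """Hash only the decoded image payload, ignoring embedded metadata."""
        with Image.open(path) as im:
            return hashlib.sha256(im.tobytes()).hexdigest()

    print(pixel_digest("scan_0001.tif"))  # hypothetical filename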


Re: [CODE4LIB] Checksums for objects and not embedded metadata

2015-01-26 Thread Scancella, John
The Library of Congress has several tools for making and working with BagIt
bags:

A Java command-line tool and library
https://github.com/LibraryOfCongress/bagit-java

A Python command-line tool and library
https://github.com/LibraryOfCongress/bagit-python

A standalone Java desktop application (GUI-based)
https://github.com/LibraryOfCongress/bagger
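
For anyone who hasn't used it, bagit-python boils down to a couple of calls.
A sketch against its documented API (the bag-info fields here are just
examples):

    import bagit  # pip install bagit

    # Turn a directory into a bag in place, with sha256 manifests
    bag = bagit.make_bag("my_collection/",
                         {"Source-Organization": "Example Library"},
                         checksums=["sha256"])

    # Later, re-open the bag and validate the payload against the manifests
    bag = bagit.Bag("my_collection/")
    bag.validate()  # raises bagit.BagValidationError on any mismatch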

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Joe 
Hourcle
Sent: Saturday, January 24, 2015 10:07 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Checksums for objects and not embedded metadata

On Jan 23, 2015, at 5:35 PM, Kyle Banerjee wrote:

 Is there a practical/easy way to checksum only the objects themselves
 without the metadata? If the metadata in a TIFF or other kind of file
 is modified, it does nothing to the actual object.


The only file format that I'm aware of that has a provision for this is FITS
(Flexible Image Transport System), which has the concept of a 'CHECKSUM' and a
'DATASUM'.  (The 'DATASUM' is the checksum for only the payload portion; the
'CHECKSUM' includes the metadata.)[1]  It's possible that there are others, but
I suspect that most consumer file formats won't have specific provisions for
this.
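
If you work with FITS from Python, astropy exposes this pair directly.  A
sketch (assumes astropy and a hypothetical file):

    from astropy.io import fits  # pip install astropy

    # Write CHECKSUM and DATASUM keywords when saving
    hdul = fits.open("image.fits")
    hdul.writeto("image_summed.fits", checksum=True)

    # Verify both on read; astropy warns if either fails
    with fits.open("image_summed.fits", checksum=True) as verified:
        print(verified[0].header["DATASUM"], verified[0].header["CHECKSUM"])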

The problem with 'metadata' in a lot of file formats is that it's just
arbitrary segments -- you'd have to have a program that knew which segments
were considered 'headers' vs. not.  It might be easier to compute a separate
checksum for each segment, so that if modifications changed their order, the
segments would still be considered valid.
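
For a format with well-delimited segments, the per-segment idea is easy to
prototype.  A sketch for PNG, whose chunk layout (length, type, data, CRC)
can be walked with nothing but the standard library:

    import hashlib
    import struct

    def png_chunk_digests(path):
        """Return a sha256 digest per PNG chunk, keyed by chunk type."""
        digests = []
        with open(path, "rb") as f:
            assert f.read(8) == b"\x89PNG\r\n\x1a\n"  # PNG signature
            while True:
                head = f.read(8)
                if len(head) < 8:
                    break
                length, ctype = struct.unpack(">I4s", head)
                data = f.read(length)
                f.read(4)  # skip the chunk's own CRC
                digests.append((ctype.decode("ascii"),
                                hashlib.sha256(data).hexdigest()))
        return digests

    # Audit only the image-data chunks (IDAT), ignoring tEXt/iTXt metadata
    for ctype, d in png_chunk_digests("figure.png"):  # hypothetical file
        print(ctype, d)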

Of course, I personally don't like changing files if I can help it.
If it were me, I'd keep the metadata outside the file; if you're using BagIt,
you could easily add additional metadata outside of the data directory.[2]

If you're just doing this internally, and don't need the DOI to be attached to
the file when it's served, you could also look into file systems that support
arbitrary metadata.  Older Macs used to do this, with a 'data fork' and a
'resource fork', but you had to have a service that knew to send only the data
fork.  Other OSes support forks, but some also have 'extended file attributes',
which allow you to attach a few key/value pairs to a file.  (Exact limits
depend on the OS.)
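
On Linux, for instance, the Python standard library exposes extended
attributes directly (a sketch; 'user.*' is the conventional namespace, and
support depends on the filesystem):

    import os

    path = "scan_0001.tif"  # hypothetical file

    # Attach an identifier without touching the file's contents -- the
    # file's own checksum is unaffected
    os.setxattr(path, "user.doi", b"10.5555/12345678")  # example DOI

    print(os.getxattr(path, "user.doi").decode())
    print(os.listxattr(path))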

-Joe


[1] http://fits.gsfc.nasa.gov/registry/checksum.html
[2] https://tools.ietf.org/html/draft-kunze-bagit ; 
http://en.wikipedia.org/wiki/BagIt


Re: [CODE4LIB] Checksums for objects and not embedded metadata

2015-01-26 Thread Bert Lyons
Kyle --

Although my example doesn't apply to all file formats, it does give an
example of what you're looking for:

BWF MetaEdit (
http://www.digitizationguidelines.gov/guidelines/digitize-embedding.html)
is a free tool developed by federal agency groups to allow for the
reading/writing of metadata into the BWF and RIFF (BEXT and INFO,
respectively) text chunks of WAV audio files. The salient point here is
that this approach was designed with the ability to generate and embed a
checksum of the PCM audio stream within the WAV container, so that as new
metadata are added to the container, the audio can be validated against its
specific checksum rather than a checksum of the entire container. In this
practice, one can generate a checksum for the audio information (the
content) and for the entire file itself (the content and the metadata).

Take a read through that and maybe it will inspire some ideas.
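
The same idea can be approximated outside BWF MetaEdit by hashing only the
RIFF 'data' chunk.  A rough standard-library sketch, assuming a plain,
well-formed WAV:

    import hashlib
    import struct

    def wav_data_digest(path):
        """MD5 of only the 'data' chunk payload of a RIFF/WAVE file."""
        with open(path, "rb") as f:
            riff, _size, wave = struct.unpack("<4sI4s", f.read(12))
            assert riff == b"RIFF" and wave == b"WAVE"
            while True:
                head = f.read(8)
                if len(head) < 8:
                    raise ValueError("no data chunk found")
                cid, clen = struct.unpack("<4sI", head)
                if cid == b"data":
                    return hashlib.md5(f.read(clen)).hexdigest()
                f.seek(clen + (clen & 1), 1)  # chunks are word-aligned

    print(wav_data_digest("interview_0001.wav"))  # hypothetical file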

I know in the moving image field there is also much activity around
frame-by-frame checksums for moving image material, so that when a file is
found to be corrupt, you can pinpoint exactly which frame has the corruption.

Best --

Bert


Bertram Lyons, CA
AVPreserve | www.avpreserve.com
American Folklife Center | www.loc.gov/folklife
International Association of Sound and Audiovisual Archives |
www.iasa-web.org




Re: [CODE4LIB] Checksums for objects and not embedded metadata

2015-01-26 Thread danielle plumer
Kyle,

It's a bit of a hack, but you could write a script to delete all the
metadata from images with ExifTool and then run checksums on the resulting
image (see http://u88.n24.queensu.ca/exiftool/forum/index.php?topic=4902.0).
exiv2 might also work. I don't think you'd want to do that every time you
audited the files, though; generating new checksums is a faster approach.
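
Something along these lines, assuming exiftool is installed (-all= writes a
copy with every tag removed; -o leaves the original untouched) -- untested,
so treat it as a sketch:

    import hashlib
    import os
    import subprocess
    import tempfile

    def stripped_digest(image):
        """Checksum a copy of an image with all embedded metadata removed."""
        with tempfile.TemporaryDirectory() as tmp:
            out = os.path.join(tmp, "stripped" + os.path.splitext(image)[1])
            subprocess.run(["exiftool", "-all=", "-o", out, image],
                           check=True)
            with open(out, "rb") as f:
                return hashlib.md5(f.read()).hexdigest()

    print(stripped_digest("scan_0001.tif"))  # hypothetical file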

I haven't tried this, but I know that there's a program called ssdeep,
developed for the digital forensics community, that can do piecewise hashing
-- it hashes chunks of content and then compares the hashes for the
different chunks to find matches. In theory, it might be able to match
files with embedded metadata against files without; the use cases described
on the forensics wiki are finding altered (truncated) files and reuse of
source code.  http://www.forensicswiki.org/wiki/Ssdeep
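
If anyone wants to experiment, the Python binding makes the comparison
trivial.  A sketch against what I understand its API to be (hash_from_file
and compare; the filenames are hypothetical):

    import ssdeep  # pip install ssdeep (wraps libfuzzy)

    h1 = ssdeep.hash_from_file("master.tif")
    h2 = ssdeep.hash_from_file("master_with_doi.tif")

    # 0-100 similarity score; a high score suggests the same underlying content
    print(ssdeep.compare(h1, h2))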

Danielle Cunniff Plumer




Re: [CODE4LIB] Checksums for objects and not embedded metadata

2015-01-25 Thread Kyle Banerjee
On Sat, Jan 24, 2015 at 11:07 AM, Rosalyn Metz rosalynm...@gmail.com
wrote:


- How is your content packaged?
- Are you talking about the SIPs or the AIPs or both?
- Is your content in an instance of Fedora, a unix file structure, or
something else?
- Are you generating checksums on the whole package, parts of it, both?


The quick answer to this is that this is a low-tech operation. We're
currently on regular filesystems where we are limited to feeding md5
checksums into a list. I'm looking for a low-tech way that makes it easier
to keep track of resources across a variety of platforms in a decentralized
environment and which will easily adapt to future technology transitions.
For example, we have a bunch of stuff in Bepress and Omeka. Neither of
those is good for preservation, so authoritative files live elsewhere, as do
a huge number of resources that aren't in these platforms. Filenames are
terrible identifiers, and things get moved around even if people don't mess
with the files.
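
In case it helps anyone in a similar spot, the "md5 checksums into a list"
step is a few lines of portable Python (a sketch; the manifest format here
is just one convention):

    import hashlib
    import os

    def file_md5(path, bufsize=1 << 20):
        """Stream a file through md5 so large resources don't exhaust RAM."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            while chunk := f.read(bufsize):
                h.update(chunk)
        return h.hexdigest()

    def write_manifest(root, manifest="manifest-md5.txt"):
        """Walk a tree and record one 'digest  path' line per file."""
        with open(manifest, "w") as out:
            for dirpath, _dirs, files in os.walk(root):
                for name in sorted(files):
                    path = os.path.join(dirpath, name)
                    out.write(f"{file_md5(path)}  {path}\n")

    write_manifest("/data/masters")  # hypothetical root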

We're also trying to come up with something that deals with different
kinds of datasets (we're focusing on bioimaging at the moment) and fits into
the workflow of campus units, each of which needs to manage tens of
thousands of files with very little metadata on regular filesystems. Some
of the resources are enormous in size or number of members.

Simply embedding an identifier in the file is a really easy way to tell
which files have metadata and which metadata is there. In the case at hand,
I could just do that and generate new checksums. But I think the generic
problem of making better use of embedded metadata is an interesting one, as
it can make objects more usable and understandable once they're removed.
For example, just this past Friday I received a request to use an image
someone had downloaded for a book. Unfortunately, he just emailed me a copy
of the image, described what he wanted to do, and asked for permission, but
he couldn't replicate how he had found it. An identifier would have been
handy, as would embedded rights info, since rights are not the same for all
of our images. The reason we're using DOIs is that they work well for
anything and can easily be recognized by syntax wherever they may appear.
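
The embedding step itself is a one-liner per file with ExifTool.  A sketch
writing a DOI into the XMP Dublin Core identifier field (untested; the
file and DOI are examples):

    import subprocess

    def embed_doi(image, doi):
        """Write a DOI into the file's XMP dc:identifier via ExifTool."""
        subprocess.run(["exiftool", f"-XMP-dc:Identifier={doi}",
                        "-overwrite_original", image], check=True)

    embed_doi("scan_0001.tif", "10.5555/12345678")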

On Sat, Jan 24, 2015 at 7:06 PM, Joe Hourcle onei...@grace.nascom.nasa.gov
 wrote:


 The problem with 'metadata' in a lot of file formats is that it's
 just arbitrary segments -- you'd have to have a program that knew
 which segments were considered 'headers' vs. not.  It might be easier
 to compute a separate checksum for each segment, so that if
 modifications changed their order, the segments would still be
 considered valid.


This is what I seemed to be bumping up against, so I was hoping there was an
easy workaround. But this is helpful information. Thanks,

kyle


Re: [CODE4LIB] Checksums for objects and not embedded metadata

2015-01-24 Thread Rosalyn Metz
Kyle,

I think I can answer your question, but I would need to know a little bit
more about what you're doing before attempting to help: how you are
packaging up your objects, how you are storing the content, and how you are
generating checksums.  More specifically:

   - How is your content packaged?
   - Are you talking about the SIPs or the AIPs or both?
   - Is your content in an instance of Fedora, a unix file structure, or
   something else?
   - Are you generating checksums on the whole package, parts of it, both?

Without more specific information, though, the solution I would lean toward
is to decouple the content files from the metadata and checksum each
separately (because right now it doesn't sound like your system is doing
that).

Rosy





[CODE4LIB] Checksums for objects and not embedded metadata

2015-01-23 Thread Kyle Banerjee
Howdy all,

I've been toying with the idea of embedding DOIs in all our digital assets
and possibly inserting/updating other metadata as well. However, doing this
would alter checksums created using normal methods.

Is there a practical/easy way to checksum only the objects themselves
without the metadata? If the metadata in a TIFF or other kind of file is
modified, it does nothing to the actual object. Since providing more
complete metadata within objects makes them more usable/identifiable and
might simplify migrations down the road, it seems like this wouldn't be a
bad way to go.

Thanks,

kyle