Re: [CODE4LIB] Checksums for objects and not embedded metadata
Hello, I like Danielle's idea. I wonder if it wouldn't be better to decouple the metadata from the data permanently. ExifTool can export metadata in many different formats, including JSON. You could export the metadata to JSON, run the checksums, and then store the photo and the JSON file together in a single tarball. From there you could use a JSON editor to modify or add metadata, and it would be simple to reintroduce the metadata into the file when needed.

On Mon, Jan 26, 2015 at 10:27 AM, danielle plumer dcplu...@gmail.com wrote:
> It's a bit of a hack, but you could write a script to delete all the metadata from images with ExifTool and then run checksums on the resulting image.

--
Ronald Houk
Assistant Director
Ottumwa Public Library
102 W. Fourth Street
Ottumwa, IA 52501
(641) 682-7563 x203
rh...@ottumwapubliclibrary.org
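Ronald's decoupling idea -- metadata out to a JSON sidecar, checksum only the payload, bundle both in a tarball -- could be sketched in Python. The payload bytes and metadata fields below are hypothetical stand-ins; in a real workflow the JSON would come from something like `exiftool -json photo.tif` and the payload would be the metadata-stripped image:

```python
import hashlib
import io
import json
import tarfile

# Hypothetical payload and metadata -- in practice the metadata would
# come from ExifTool's JSON export and the payload would be the
# metadata-stripped image.
payload = b"...image bytes with metadata stripped..."
metadata = {"DocumentID": "doi:10.1234/example", "Rights": "CC-BY"}

# Checksum only the payload; the JSON sidecar can then change freely.
metadata["PayloadMD5"] = hashlib.md5(payload).hexdigest()

def add_bytes(tar, name, data):
    """Add an in-memory byte string to the tarball under `name`."""
    info = tarfile.TarInfo(name=name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    add_bytes(tar, "photo.tif", payload)
    add_bytes(tar, "photo.json", json.dumps(metadata, indent=2).encode())

# Later: re-open the tarball and verify the payload against the
# checksum recorded in the sidecar.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    stored = tar.extractfile("photo.tif").read()
    sidecar = json.loads(tar.extractfile("photo.json").read())

assert hashlib.md5(stored).hexdigest() == sidecar["PayloadMD5"]
```

Editing photo.json never disturbs the recorded payload checksum, and if I remember the switches right, `exiftool -all= photo.tif` strips the embedded metadata and `exiftool -json=photo.json photo.tif` reimports the sidecar when you need the file whole again.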
Re: [CODE4LIB] Checksums for objects and not embedded metadata
Also just stumbled across this on Stack Overflow: http://stackoverflow.com/questions/12115824/compute-the-hash-of-only-the-core-image-data-of-a-tiff

On Wed, Jan 28, 2015 at 10:32 AM, Ronald Houk rh...@ottumwapubliclibrary.org wrote:
> Hello, I like Danielle's idea. I wonder if it wouldn't be a good idea to decouple the metadata from the data permanently.
Re: [CODE4LIB] Checksums for objects and not embedded metadata
The Library of Congress has several tools for making and working with BagIt bags:

- a Java command line tool and library: https://github.com/LibraryOfCongress/bagit-java
- a Python command line tool and library: https://github.com/LibraryOfCongress/bagit-python
- a standalone Java desktop application (GUI based): https://github.com/LibraryOfCongress/bagger

-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTSERV.ND.EDU] On Behalf Of Joe Hourcle
Sent: Saturday, January 24, 2015 10:07 PM
To: CODE4LIB@LISTSERV.ND.EDU
Subject: Re: [CODE4LIB] Checksums for objects and not embedded metadata

On Jan 23, 2015, at 5:35 PM, Kyle Banerjee wrote:
> Is there a practical/easy way to checksum only the objects themselves without the metadata?

The only file format I'm aware of that has a provision for this is FITS (Flexible Image Transport System), which has the concepts of a 'CHECKSUM' and a 'DATASUM' (the 'DATASUM' is the checksum for only the payload portion; the 'CHECKSUM' includes the metadata) [1]. It's possible that there are others, but I suspect that most consumer file formats won't have specific provisions for this.

The problem with 'metadata' in a lot of file formats is that the files are just arbitrary segments -- you'd have to have a program that knew which segments were considered 'headers' vs. not. It might be easier to compute a separate checksum for each segment, so that should modifications change their order, the segments would still be considered valid.
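Joe's per-segment idea is easy to try on a chunked format like PNG, where every segment is self-describing (length, type, payload, CRC). A sketch that checksums each chunk independently -- PNG is just a convenient stand-in (TIFF's IFD structure would need more work), and the 1x1 image here is fabricated for the demo:

```python
import hashlib
import struct
import zlib

def png_chunk(ctype, data):
    """Serialize one PNG chunk: length, type, payload, CRC."""
    return (struct.pack(">I", len(data)) + ctype + data
            + struct.pack(">I", zlib.crc32(ctype + data)))

# Build a minimal 1x1 grayscale PNG purely for demonstration.
ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
idat = zlib.compress(b"\x00\x00")  # filter byte + one pixel
png = (b"\x89PNG\r\n\x1a\n"
       + png_chunk(b"IHDR", ihdr)
       + png_chunk(b"tEXt", b"Title\x00hello")  # a metadata chunk
       + png_chunk(b"IDAT", idat)
       + png_chunk(b"IEND", b""))

def segment_checksums(blob):
    """Return {chunk_type: md5_of_payload} for each PNG chunk."""
    sums = {}
    pos = 8  # skip the 8-byte PNG signature
    while pos < len(blob):
        (length,) = struct.unpack(">I", blob[pos:pos + 4])
        ctype = blob[pos + 4:pos + 8]
        sums[ctype.decode()] = hashlib.md5(
            blob[pos + 8:pos + 8 + length]).hexdigest()
        pos += 12 + length  # length + type + data + CRC
    return sums

sums = segment_checksums(png)
# The image payload (IDAT) gets its own checksum, independent of the
# metadata (tEXt) chunk -- editing one does not disturb the other.
```

(A real tool would also handle repeated chunk types, e.g. multiple IDATs; the single-entry dict is enough to show the principle.)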
Of course, I personally don't like changing files if I can help it. If it were me, I'd keep the metadata outside the file; if you're using BagIt, you could easily add additional metadata outside of the data directory [2].

If you're just doing this internally, and don't need the DOI to be attached to the file when it's served, you could also look into file systems that support arbitrary metadata. Older Macs used this: there was a 'data fork' and a 'resource fork', but you had to have a service that knew to send only the data fork. Other OSes support forks, and some also have 'extended file attributes', which let you attach a few key/value pairs to a file (exact limits depend on the OS).

-Joe

[1] http://fits.gsfc.nasa.gov/registry/checksum.html
[2] https://tools.ietf.org/html/draft-kunze-bagit ; http://en.wikipedia.org/wiki/BagIt
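Joe's BagIt suggestion keeps payload fixity stable because the manifest covers only files under data/; tag files at the top level can be edited or extended freely. A minimal bag-like layout sketched with the standard library alone (the bagit-python module mentioned above does all of this properly; the filenames and DOI here are hypothetical):

```python
import hashlib
import pathlib
import tempfile

root = pathlib.Path(tempfile.mkdtemp())

# Payload lives under data/ -- this is all the manifest covers.
data_dir = root / "data"
data_dir.mkdir()
(data_dir / "photo.tif").write_bytes(b"...image bytes...")

# manifest-md5.txt lists one checksum per payload file, BagIt-style.
lines = []
for path in sorted(data_dir.rglob("*")):
    if path.is_file():
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        lines.append(f"{digest}  {path.relative_to(root).as_posix()}")
(root / "manifest-md5.txt").write_text("\n".join(lines) + "\n")

# Descriptive metadata sits NEXT TO data/, not inside it, so it can
# change without touching any payload checksum.
(root / "bag-info.txt").write_text(
    "External-Identifier: doi:10.1234/example\n")

def verify(root):
    """Re-hash every payload file and compare to the manifest."""
    ok = True
    for line in (root / "manifest-md5.txt").read_text().splitlines():
        digest, rel = line.split(maxsplit=1)
        ok &= hashlib.md5((root / rel).read_bytes()).hexdigest() == digest
    return ok

assert verify(root)
(root / "bag-info.txt").write_text(
    "External-Identifier: doi:10.9999/changed\n")
assert verify(root)  # metadata edits never break payload fixity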
Re: [CODE4LIB] Checksums for objects and not embedded metadata
Kyle -- Although my example doesn't apply to all file formats, it does illustrate what you're looking for. BWFMetaEdit (http://www.digitizationguidelines.gov/guidelines/digitize-embedding.html) is a free tool developed by federal agency groups for reading and writing metadata into the BWF and RIFF text chunks (bext and INFO, respectively) of WAV audio files. The salient point is that this approach was designed with the ability to generate and embed a checksum of the PCM audio stream within the WAV container, so that as new metadata are added to the container, the audio can be validated against its own checksum rather than a checksum of the entire container. In this practice, one can generate a checksum for the audio information (the content) and for the entire file (the content plus the metadata). Take a read through that and maybe it will inspire some ideas.

I know that in the moving image field there is also much activity around frame-by-frame checksums for moving image material, so that when a file is found to be corrupt you can pinpoint which frame has the corruption.

Best -- Bert

Bertram Lyons, CA
AVPreserve | www.avpreserve.com
American Folklife Center | www.loc.gov/folklife
International Association of Sound and Audiovisual Archives | www.iasa-web.org

On Mon, Jan 26, 2015 at 6:21 AM, Scancella, John j...@loc.gov wrote:
> The library of congress has several tools for making and working with bagit bags.
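Bert's BWF approach -- checksum the PCM data chunk, not the whole container -- can be imitated for plain WAV with Python's standard library. Real BWF practice stores an MD5 of the PCM data in the bext chunk via BWFMetaEdit; this sketch instead appends a generic INFO LIST chunk, just to show the container changing around a stable data chunk:

```python
import hashlib
import io
import struct
import wave

# Write a short mono WAV purely for demonstration.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(8000)
    w.writeframes(struct.pack("<8h", *range(8)))
blob = bytearray(buf.getvalue())

def data_chunk_md5(blob):
    """Walk the RIFF chunks and hash only the 'data' (PCM) payload."""
    pos = 12  # skip 'RIFF', size, 'WAVE'
    while pos < len(blob):
        cid = bytes(blob[pos:pos + 4])
        (size,) = struct.unpack("<I", blob[pos + 4:pos + 8])
        if cid == b"data":
            return hashlib.md5(blob[pos + 8:pos + 8 + size]).hexdigest()
        pos += 8 + size + (size & 1)  # chunks are word-aligned
    raise ValueError("no data chunk")

before = data_chunk_md5(blob)

# Append an INFO LIST metadata chunk and patch the RIFF size field --
# the container changes, the PCM stream does not. (The 'doi:' comment
# payload is a hypothetical example.)
comment = b"ICMT" + struct.pack("<I", 12) + b"doi:10.1234\x00"
blob += b"LIST" + struct.pack("<I", 4 + len(comment)) + b"INFO" + comment
struct.pack_into("<I", blob, 4, len(blob) - 8)

assert data_chunk_md5(blob) == before  # audio fixity survives the edit
```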
Re: [CODE4LIB] Checksums for objects and not embedded metadata
Kyle, It's a bit of a hack, but you could write a script to delete all the metadata from images with ExifTool and then run checksums on the resulting images (see http://u88.n24.queensu.ca/exiftool/forum/index.php?topic=4902.0). exiv2 might also work. I don't think you'd want to do that every time you audited the files, though; generating new checksums is a faster approach.

I haven't tried this, but I know there's a program called ssdeep, developed for the digital forensics community, that can do piecewise hashing -- it hashes chunks of content and then compares the hashes for the different chunks to find matches, in theory. It might be able to match files with embedded metadata against files without; the use cases described on the forensics wiki are finding altered (truncated) files and reuse of source code. http://www.forensicswiki.org/wiki/Ssdeep

Danielle Cunniff Plumer

On Sun, Jan 25, 2015 at 9:44 AM, Kyle Banerjee kyle.baner...@gmail.com wrote:
> I'm looking for a low tech way that makes it easier to keep track of resources across a variety of platforms in a decentralized environment.
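ssdeep's context-triggered piecewise hashing is more sophisticated than this, but the core idea -- hash chunks, count matching hashes -- can be sketched with fixed-size blocks. The sketch also shows the limitation that motivates ssdeep's rolling hash: appended metadata leaves every block intact, while prepended metadata shifts every fixed block:

```python
import hashlib

def piecewise_hashes(data, block=64):
    """Hash fixed-size blocks. ssdeep proper picks block boundaries
    from the content itself (a rolling hash), which also survives
    insertions in the middle of a file."""
    return [hashlib.md5(data[i:i + block]).hexdigest()
            for i in range(0, len(data), block)]

def similarity(a, b):
    """Fraction of a's block hashes that also occur somewhere in b."""
    ha, hb = piecewise_hashes(a), set(piecewise_hashes(b))
    return sum(h in hb for h in ha) / len(ha)

content = bytes(range(256)) * 8                       # stand-in payload
tagged = content + b"<xmp>doi:10.1234/example</xmp>"  # appended metadata
shifted = b"HEADER" + content                         # prepended metadata

# Appended metadata leaves every original block intact...
assert similarity(content, tagged) == 1.0
# ...but prepended metadata misaligns every fixed block, which is
# exactly why ssdeep derives its block boundaries from the content.
assert similarity(content, shifted) == 0.0
```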
Re: [CODE4LIB] Checksums for objects and not embedded metadata
On Sat, Jan 24, 2015 at 11:07 AM, Rosalyn Metz rosalynm...@gmail.com wrote:
> - How is your content packaged?
> - Are you talking about the SIPs or the AIPs or both?
> - Is your content in an instance of Fedora, a unix file structure, or something else?
> - Are you generating checksums on the whole package, parts of it, or both?

The quick answer is that this is a low tech operation. We're currently on regular filesystems, where we are limited to feeding md5 checksums into a list. I'm looking for a low tech way that makes it easier to keep track of resources across a variety of platforms in a decentralized environment and that will easily adapt to future technology transitions. For example, we have a bunch of stuff in Bepress and Omeka. Neither of those is good for preservation, so authoritative files live elsewhere, as do a huge number of resources that aren't in those platforms. Filenames are terrible identifiers, and things get moved around even when people don't mess with the files.

We also are trying to come up with something that handles different kinds of datasets (we're focusing on bioimaging at the moment) and fits the workflow of campus units, each of which needs to manage tens of thousands of files with very little metadata on regular filesystems. Some of the resources are enormous in size or in number of members.

Simply embedding an identifier in the file is a really easy way to tell which files have metadata and which metadata is there. In the case at hand, I could just do that and generate new checksums. But I think the generic problem of making better use of embedded metadata is an interesting one, as it can make objects more usable and understandable once they're removed from their original context. For example, just this past Friday I received a request to use an image someone downloaded for a book. Unfortunately, he just emailed me a copy of the image, described what he wanted to do, and asked for permission, but he couldn't replicate how he found it. An identifier would have been handy, as would embedded rights info, since rights are not the same for all of our images. The reason we're using DOIs is that they work well for anything and can easily be recognized by syntax wherever they may appear.

On Sat, Jan 24, 2015 at 7:06 PM, Joe Hourcle onei...@grace.nascom.nasa.gov wrote:
> It might be easier to compute a separate checksum for each segment, so that should modifications change their order, they'd still be considered valid.

This is what I seemed to be bumping up against, so I was hoping there was an easy workaround. But this is helpful information.

Thanks, kyle
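Kyle's point that DOIs can be recognized by syntax wherever they appear is easy to exploit. A sketch of the usual pattern match -- the regex covers common modern DOIs (directory indicator "10.", a 4-9 digit registrant code, a suffix), not every historical edge case, and the sample text is invented:

```python
import re

# "10." + 4-9 digit registrant code + "/" + suffix (no whitespace,
# quotes, or angle brackets, which usually delimit DOIs in the wild).
DOI_RE = re.compile(r'\b10\.\d{4,9}/[^\s"<>]+')

text = ('Photo was downloaded from our repository; see '
        'https://doi.org/10.1234/abc.123 or the embedded '
        'XMP field "Identifier: 10.5678/xyz-99".')

found = DOI_RE.findall(text)
print(found)  # -> ['10.1234/abc.123', '10.5678/xyz-99']
```

Because the pattern is self-delimiting, the same scan works on filenames, embedded XMP, spreadsheets, or email bodies -- which is the portability argument for embedding DOIs in the first place.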
Re: [CODE4LIB] Checksums for objects and not embedded metadata
Kyle, I think I can answer your question, but I would need to know a little more about what you're doing before attempting to help: how you are packaging up your objects, how you are storing the content, and how you are generating checksums. Even more specifically:

- How is your content packaged?
- Are you talking about the SIPs or the AIPs or both?
- Is your content in an instance of Fedora, a unix file structure, or something else?
- Are you generating checksums on the whole package, parts of it, or both?

Without more specific information, though, the solution I would lean toward is to decouple the content files from the metadata and checksum each separately (because right now it doesn't sound like your system is doing that).

Rosy

On Fri, Jan 23, 2015 at 2:35 PM, Kyle Banerjee kyle.baner...@gmail.com wrote:
> Is there a practical/easy way to checksum only the objects themselves without the metadata?
[CODE4LIB] Checksums for objects and not embedded metadata
Howdy all, I've been toying with the idea of embedding DOIs in all our digital assets and possibly inserting/updating other metadata as well. However, doing this would alter checksums created using normal methods. Is there a practical/easy way to checksum only the objects themselves, without the metadata? If the metadata in a TIFF or other kind of file is modified, it does nothing to the actual object. Since providing more complete metadata within objects makes them more usable and identifiable, and might simplify migrations down the road, it seems like this wouldn't be a bad way to go.

Thanks, kyle