Re: [CODE4LIB] Image de-duping and file identification
I had a project to de-duplicate many images and other files too. I wrote a little ditty in PHP, but the idea can be used in any language.

I have a set of tables in MySQL. Give the utility a set of root directories to test and compare. It trawls the filesystems for filename, location and size, and stores them in the first table.

Then issue this SQL:

    insert into duplicatesizetable select filesize, count(filesize) as qty from nametable group by filesize having qty > 1

You now have the sizes of possible duplicates. Only now do you crc/md5sum the files of those sizes, and update the nametable with the crc/md5 values as calculated. There can be false positives if two different files have the same crc value.

Then a final bit of SQL:

    select filesize, count(crc) as qty from nametable group by filesize, crc having qty > 1

I store the results in a table so I can leave the job and come back. Join the results to the nametable, which gets the real duplicate sizes and crc values, and these can now be used to guide a human to clean up the mess.

I give the user a table showing the n files with buttons to delete, view or ignore (you may want to keep two or more copies). For safety, one can leave part of the filesystem write-protected. Also, the form only allows one button per group.

http://www.archivist.info/Screenshot_Delete_duplicates.png

It took only a few minutes to get the duplicates from a 9 GB picture directory.

Dave Caroline
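The size-first, hash-second idea above can be sketched in Python. This is only an illustration, not the PHP utility described: it swaps the MySQL tables for in-memory dictionaries, and the function names (`files_by_size`, `md5_of`, `find_duplicates`) are made up for the example.

```python
import hashlib
import os
from collections import defaultdict


def files_by_size(roots):
    """Walk each root directory and group regular file paths by size in bytes."""
    sizes = defaultdict(list)
    for root in roots:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                # Skip symlinks so the same file is not counted twice.
                if os.path.isfile(path) and not os.path.islink(path):
                    sizes[os.path.getsize(path)].append(path)
    return sizes


def md5_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large images are never loaded whole."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(roots):
    """Return {(size, md5): [paths]} for groups holding more than one file."""
    groups = defaultdict(list)
    for size, paths in files_by_size(roots).items():
        if len(paths) < 2:
            continue  # a unique size cannot be a duplicate, so never hash it
        for path in paths:
            groups[(size, md5_of(path))].append(path)
    return {key: sorted(paths) for key, paths in groups.items() if len(paths) > 1}
```

Calling `find_duplicates(["/some/picture/dir"])` (a placeholder path) returns the candidate groups for a human to review. Hashing only the files whose size occurs more than once is what keeps the run short, since most files are ruled out by size alone.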
[CODE4LIB] Job: International Community Manager at Open Knowledge Foundation
**About the role**

Someone highly articulate, enthusiastic and energetic who is willing to travel. While familiarity with email, blogs and Twitter is desirable, no specific technical knowledge is required. Being able to learn quickly, converse intelligently and evangelise convincingly are more valuable to us than detailed knowledge of open knowledge and open data policies.

Duties are negotiable, but projected to include tasks such as:

* Representing the Open Knowledge Foundation and its various projects and activities at events around the world
* Expanding and strengthening the open knowledge community around the world - including public officials, civic society organisations, developers, data journalists and others
* Organising and facilitating events, workshops and meetings about open knowledge - bringing together key stakeholders from different areas
* Following key developments on mailing lists, blogs and Twitter - and inviting people and organisations to participate in relevant projects, activities and events
* Blogging about open knowledge around the world - and soliciting guest blog posts from key stakeholders
* Connecting people, groups and projects with common interests - and encouraging them to collaborate
* Promoting key principles and values in the open knowledge community, such as legal/technical standards for open data (e.g. http://opendefinition.org/), and the importance of open source tools and infrastructure
* Building the Open Knowledge Foundation network around the world - including helping to set up and encourage others to set up local groups as well as media relations
* Developing processes and systems to help support our international network, including handbooks and governance structures (such as councils), and improving these once in place
* Continuously monitoring and analysing the needs of groups and potential groups around the world, and exploring how best to support them within the open knowledge community
* Doing unexpected stuff spontaneously - helping to organise something you have never done before, connecting people that you have never met before, or pitching something you have never thought of before

This role sits within the Network unit of the Open Knowledge Foundation.

**Person specification**

We are looking for someone self-driven, organised and a good communicator. This person should be comfortable running a number of projects at the same time, speaking at events and travelling - sometimes at short notice - and great at empathising and engaging with community members at all levels, from lawyers and civil servants to activist developers.

**Location**

We will consider applicants based anywhere in the world; however a mild preference is given to those close to one of our hubs in London, Berlin or Cambridge.

**Pay, availability and closing date**

The rate is negotiable based on experience. This full-time position is available immediately. The closing date for applications is 25th March 2013.

**How to apply**

To apply please send a cover letter highlighting relevant experience, your CV and a 30-second video explaining your interest in the role to j...@okfn.org.

Brought to you by code4lib jobs: http://jobs.code4lib.org/job/6917/
Re: [CODE4LIB] Image de-duping and file identification
On Wed, Mar 20, 2013 at 2:22 AM, chris fitzpatrick chrisfitz...@gmail.com wrote:

> Anyone please correct me if this is wrong. A md5/sha1 file hash would also not get any image derivatives, like crops or they added text or tweaked the contrast or photoshopped their cat into the shot... If you really wanted to geek out, you could look into some machine learning techniques to build a classifier that groups the images for you, which might be more a PhD project for someone

Agreed. BTW, exiftool might be very useful for detecting photos manipulated in this way because the original create time shouldn't be touched plus there are some other data points you'd be able to use for comparison. YMMV depending on software used to manipulate the images.

Picasa is very good at finding similar images. I would have suggested that earlier except I have no idea how it would perform on 300K photos. It works quite well in the 20K-30K range though it really seems designed to work with sets up to several thousand which makes sense given who they aim it at. But I hate that it mangles metadata since that makes it difficult to use for tagging unless you don't care about the original metadata and it is graphically oriented -- I'm pretty sure that it would be far more efficient to use metadata than to have picasa try to figure things out and then list out what it thought were dups.

> A less sexy but really good strategy would also be to use AWS Mechanical Turk, which I think seems like a really good way to get some basic image annotation.

Good luck! My guess is that you'd get better results faster and cheaper just going with a combination of image metadata and talking to the researcher a bit. The problem with MT is that they won't actually know what they're looking at and you're likely to just get inconsistent keywords that are all over the place (i.e. garbage). Using metadata, you can associate equipment and times with which groups, places, events, etc. You need a little back and forth to get you started, but it should be more consistent so people can do things like actually drill through the images.

kyle
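The exiftool suggestion above could look roughly like this in Python. It is a sketch under assumptions: exiftool must be installed and on the PATH, the choice of the `DateTimeOriginal` and `Model` tags as comparison points is mine, and the helper names (`read_exif`, `group_by_capture`) are hypothetical. Whether an edited copy still carries the original create time depends on the software that produced it.

```python
import json
import subprocess
from collections import defaultdict


def read_exif(paths):
    """Ask exiftool for JSON records holding the capture time and camera model."""
    result = subprocess.run(
        ["exiftool", "-json", "-DateTimeOriginal", "-Model", *paths],
        capture_output=True,
        text=True,
    )
    # exiftool prints a JSON array with one object per readable file.
    return json.loads(result.stdout) if result.stdout.strip() else []


def group_by_capture(records):
    """Group files sharing a capture time and camera model.

    Files in one group are only candidates for being derivatives of the
    same shot (crops, contrast tweaks); a human still has to confirm.
    """
    groups = defaultdict(list)
    for record in records:
        taken = record.get("DateTimeOriginal")
        if taken is None:
            continue  # no capture time recorded, nothing to compare on
        groups[(taken, record.get("Model"))].append(record["SourceFile"])
    return {key: sorted(files) for key, files in groups.items() if len(files) > 1}
```

Feeding `group_by_capture(read_exif(list_of_paths))` a batch of photos gives groups that an md5 comparison would miss, because the grouping key comes from the metadata and not from the file bytes.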
[CODE4LIB] AdaCamp in San Francisco, 8-9 June 2013
My colleague Merrilee Proffitt asked me to post this to Code4Lib, as she is going to apply to attend this event and would love to see other tech-savvy library women there.

Roy

AdaCamp[1] is an Ada Initiative event focused on increasing women’s participation in open technology and culture. It will be a 200 person unconference in San Francisco on June 8–9, 2013.

AdaCamp SF has two tracks. The main track is for significantly female-identified people, with a simultaneous workshop for allies. We use an inclusive definition of “woman” and “female” and we welcome trans women, genderqueer women, and non-binary people who are significantly female-identified.

Attendees will be selected based on experience in open tech/culture, experience or knowledge of feminism and advocacy, ability to collaborate with others, and any rare or notable experience or background that would add to AdaCamp. A limited number of travel assistance grants are available to applicants before April 12.

AdaCamp has a registration fee, but it is need-based and self-selected, with a completely free option. You do not need to go through any process to choose the free registration fee.

[1] http://sf.adacamp.org/apply/