Re: [CODE4LIB] Image de-duping and file identification

2013-03-20 Thread Dave Caroline
I had a project to de-duplicate many images and other files too.
I wrote a little ditty in PHP, but the idea can be used in any language.


I have a set of tables in MySQL.
Give the utility a set of root directories to test and compare.
Trawl the filesystems for filename, location and size, and store them in
the first table (nametable).
Issue SQL:
  insert into duplicatesizetable select filesize, count(filesize) as qty
  from nametable group by filesize having qty > 1
You now have the sizes of possible duplicates.
Only now do you crc/md5sum the files of those sizes.
Update the nametable with the crc/md5 values as calculated.
There can be false positives if two different files' crc values are the same.
Then a final bit of SQL:
  select filesize, count(crc) as qty from nametable
  group by filesize, crc having qty > 1
I store the results in a table so I can leave the job and come back.

join the results to the nametable

which gets the real duplication sizes and crc, which can now be used
to guide a human to clean the mess
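
The steps above can be sketched roughly as follows. This is a minimal
illustration in Python with SQLite rather than the PHP and MySQL utility
described here, which is not shown in this message. The table name
nametable comes from the description; the md5 column and the names
find_duplicates and md5_of are assumptions made for the example.

```python
import hashlib
import os
import sqlite3


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(roots, db_path=":memory:"):
    """Return [(filesize, md5, [paths])] for groups of likely duplicates."""
    db = sqlite3.connect(db_path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS nametable "
        "(path TEXT PRIMARY KEY, filesize INTEGER, md5 TEXT)"
    )
    # Pass 1: record only path and size; no file contents are read yet.
    for root in roots:
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                path = os.path.join(dirpath, name)
                if os.path.isfile(path) and not os.path.islink(path):
                    db.execute(
                        "INSERT OR IGNORE INTO nametable (path, filesize) "
                        "VALUES (?, ?)",
                        (path, os.path.getsize(path)),
                    )
    # Pass 2: hash only files whose size occurs more than once.
    candidates = db.execute(
        "SELECT path FROM nametable WHERE md5 IS NULL AND filesize IN "
        "(SELECT filesize FROM nametable GROUP BY filesize "
        "HAVING COUNT(*) > 1)"
    ).fetchall()
    for (path,) in candidates:
        db.execute(
            "UPDATE nametable SET md5 = ? WHERE path = ?",
            (md5_of(path), path),
        )
    db.commit()
    # Pass 3: group by size and hash, then join back to get the paths.
    groups = []
    keys = db.execute(
        "SELECT filesize, md5 FROM nametable WHERE md5 IS NOT NULL "
        "GROUP BY filesize, md5 HAVING COUNT(*) > 1 "
        "ORDER BY filesize, md5"
    ).fetchall()
    for size, md5 in keys:
        rows = db.execute(
            "SELECT path FROM nametable WHERE filesize = ? AND md5 = ? "
            "ORDER BY path",
            (size, md5),
        )
        groups.append((size, md5, [path for (path,) in rows]))
    db.close()
    return groups
```

Passing a file name as db_path instead of ":memory:" keeps the table on
disk, so an interrupted run can be picked up later. To rule out the hash
collisions mentioned above, each group could additionally be checked
byte for byte, for example with filecmp.cmp(a, b, shallow=False), before
anything is deleted.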

I give the user a table showing the n files with buttons to delete,
view, ignore (you may want to keep two/more copies)
For safety one can also leave part of the filesystem write protected.
The form only allows one button per group.

http://www.archivist.info/Screenshot_Delete_duplicates.png
That took only a few minutes to get the duplicates from a 9 GB picture directory.

Dave Caroline


Re: [CODE4LIB] Image de-duping and file identification

2013-03-20 Thread chris fitzpatrick

[CODE4LIB] Job: International Community Manager at Open Knowledge Foundation

2013-03-20 Thread jobs
**About the role**  
Someone highly articulate, enthusiastic and energetic who is willing to
travel. While familiarity with email, blogs and Twitter is desirable, no
specific technical knowledge is required. Being able to learn quickly,
converse intelligently and evangelise convincingly are more valuable to us
than detailed knowledge of open knowledge and open data policies.

  
Duties are negotiable, but projected to include tasks such as:

  * Representing the Open Knowledge Foundation and its various projects and 
activities at events around the world
  * Expanding and strengthening the open knowledge community around the world - 
including public officials, civic society organisations, developers, data 
journalists and others
  * Organising and facilitating events, workshops and meetings about open 
knowledge - bringing together key stakeholders from different areas
  * Following key developments on mailing lists, blogs and Twitter - and 
inviting people and organisations to participate in relevant projects, 
activities and events
  * Blogging about open knowledge around the world - and soliciting guest 
blog posts from key stakeholders
  * Connecting people, groups and projects with common interests - and 
encouraging them to collaborate
  * Promoting key principles and values in the open knowledge community, such 
as legal/technical standards for open data (e.g. http://opendefinition.org/), 
and the importance of open source tools and infrastructure
  * Building the Open Knowledge Foundation network around the world - including 
helping to set up and encourage others to set up local groups as well as media 
relations
  * Developing processes and systems to help support our international network, 
including handbooks and governance structures (such as councils), and improving 
these once in place
  * Continuously monitoring and analysing the needs of groups and potential 
groups around the world, and exploring how best to support them within the open 
knowledge community
  * Doing unexpected stuff spontaneously - helping to organise something you 
have never done before, connecting people that you have never met before, or 
pitching something you have never thought of before
This role sits within the Network unit of the Open Knowledge Foundation.

  
**Person specification**  
We are looking for someone self-driven, organised and a good communicator.
This person should be comfortable running a number of projects at the same
time, speaking at events and travelling - sometimes at short notice - and
great at empathising and engaging with community members at all levels, from
lawyers and civil servants to activist developers.

  
**Location**  
We will consider applicants based anywhere in the world; however a mild
preference is given to those close to one of our hubs in London, Berlin or
Cambridge.

  
**Pay, availability & closing date**  
The rate is negotiable based on experience. This full-time position is
available immediately. The closing date for applications is 25th March 2013.

  
**How to apply**  
To apply please send a cover letter highlighting relevant experience, your CV
and a 30-second video explaining your interest in the role to j...@okfn.org.



Brought to you by code4lib jobs: http://jobs.code4lib.org/job/6917/


Re: [CODE4LIB] Image de-duping and file identification

2013-03-20 Thread Kyle Banerjee
On Wed, Mar 20, 2013 at 2:22 AM, chris fitzpatrick
chrisfitz...@gmail.com wrote:

> Anyone please correct me if this is wrong. A md5/sha1 file hash would also
> not get any image derivatives, like crops or they added text or tweaked the
> contrast or photoshopped their cat into the shot...
>
> If you really wanted to geek out, you could look into some machine
> learning techniques to build a classifier that groups the images for you,
> which might be more a PhD project for someone


Agreed. BTW, exiftool might be very useful for detecting photos manipulated
in this way because the original create time shouldn't be touched plus
there are some other data points you'd be able to use for comparison. YMMV
depending on software used to manipulate the images.
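
As a rough illustration of that idea: if the metadata has already been
dumped to JSON (for instance with something like
`exiftool -json -DateTimeOriginal -Model -r DIR`), files sharing the
same original capture time and camera model can be listed as candidate
derivatives of one another. The function name and the choice of these
two tags are assumptions for this sketch, and some editing software may
rewrite or strip the tags.

```python
from collections import defaultdict


def group_by_capture(records):
    """Group files by (DateTimeOriginal, Model); keep groups of 2 or more.

    `records` is a list of dicts shaped like exiftool's JSON output,
    each with a "SourceFile" key plus whatever tags were requested.
    """
    groups = defaultdict(list)
    for record in records:
        stamp = record.get("DateTimeOriginal")
        if not stamp:
            # No capture time recorded: nothing to compare on.
            continue
        key = (stamp, record.get("Model", ""))
        groups[key].append(record["SourceFile"])
    return {
        key: sorted(paths)
        for key, paths in groups.items()
        if len(paths) > 1
    }
```

Files that land in the same group but have different checksums are the
ones worth a closer look, since identical copies are already caught by
the hash comparison.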

Picasa is very good at finding similar images. I would have suggested that
earlier except I have no idea how it would perform on 300K photos. It works
quite well in the 20K-30K range though it really seems designed to work
with sets up to several thousand which makes sense given who they aim it
at. But I hate that it mangles metadata, since that makes it difficult to
use for tagging unless you don't care about the original metadata. It is
also graphically oriented -- I'm pretty sure that it would be far more
efficient to use metadata than to have Picasa try to figure things out and
then list out what it thought were dups.


> A less sexy but really good strategy would also be to use AWS Mechanical
> Turk, which I think seems like a really good way to get some basic image
> annotation.
> Good luck!


My guess is that you'd get better results faster and cheaper just going
with a combination of image metadata and talking to the researcher a bit.
The problem with MT is that they won't actually know what they're looking
at and you're likely to just get inconsistent keywords that are all over
the place (i.e. garbage). Using metadata, you can associate equipment and
times with particular groups, places, events, etc. You need a little back and
forth to get you started, but it should be more consistent so people can do
things like actually drill through the images.
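
One way to picture the metadata approach, as a sketch only: split each
camera's photos into "sessions" wherever the gap between consecutive
capture times exceeds some threshold, then ask the researcher what each
session was. The two-hour gap, the function name and the use of the
Model and DateTimeOriginal tags are all assumptions for illustration.

```python
from datetime import datetime, timedelta
from itertools import groupby

EXIF_FORMAT = "%Y:%m:%d %H:%M:%S"  # common EXIF date layout


def sessions(records, max_gap=timedelta(hours=2)):
    """Return [(model, [paths])], one entry per camera session."""
    dated = []
    for record in records:
        try:
            taken = datetime.strptime(record["DateTimeOriginal"], EXIF_FORMAT)
        except (KeyError, ValueError):
            # Missing or oddly formatted timestamp: skip in this sketch.
            continue
        model = record.get("Model", "unknown")
        dated.append((model, taken, record["SourceFile"]))
    dated.sort()  # by camera model, then capture time, then path

    result = []
    for model, items in groupby(dated, key=lambda item: item[0]):
        current, previous = [], None
        for _model, taken, path in items:
            # Start a new session when the gap is larger than max_gap.
            if previous is not None and taken - previous > max_gap:
                result.append((model, current))
                current = []
            current.append(path)
            previous = taken
        result.append((model, current))
    return result
```

Each resulting session can then be labelled once (group, place, event)
and the label applied to every image in it.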

kyle


[CODE4LIB] AdaCamp in San Francisco, 8-9 June 2013

2013-03-20 Thread Roy Tennant
My colleague Merrilee Proffitt asked me to post this to Code4Lib, as
she is going to apply to attend this event and she would love to see
other tech-savvy library women at this event.
Roy

AdaCamp[1] is an Ada Initiative event focused on increasing women’s
participation in open technology and culture. It will be a 200 person
unconference in San Francisco on June 8–9, 2013.

AdaCamp SF has two tracks. The main track is for significantly
female-identified people, with a simultaneous workshop for allies. We
use an inclusive definition of “woman” and “female” and we welcome
trans women, genderqueer women, and non-binary people who are
significantly female-identified.

Attendees will be selected based on experience in open tech/culture,
experience or knowledge of feminism and advocacy, ability to
collaborate with others, and any rare or notable experience or
background that would add to AdaCamp. A limited number of travel
assistance grants are available to applicants before April 12. AdaCamp
has a registration fee, but it is need-based and self-selected, with a
completely free option. You do not need to go through any process to
choose the free registration fee.

[1] http://sf.adacamp.org/apply/