Guillaume,


A few thoughts on the file metadata cleanup drive:



1)      https://tools.wmflabs.org/mrmetadata/commons/commons/index.html and 
Category:Files with no machine-readable 
license<https://commons.wikimedia.org/wiki/Category:Files_with_no_machine-readable_license>
 show ~78k files without a machine-readable license. A couple of years back we 
had a big push to make sure that all files on Commons have licenses, and we 
managed to fix all the files without them (mostly files where the license was 
lost somehow or did not use one of the standard templates). Ever since, we have 
had a bot checking the database from time to time and adding files without a 
license to Category:Media without a license: needs history 
check<https://commons.wikimedia.org/wiki/Category:Media_without_a_license:_needs_history_check>.
 New uploads get the {{no license}} template and the uploader has a week to add 
a license; old uploads, which likely lost a license somehow, are processed 
manually. There are 29 files there now; all the other files on Commons either 
have a license or are tagged with {{no license}} or a similar template. So for 
the files in Category:Files with no machine-readable 
license<https://commons.wikimedia.org/wiki/Category:Files_with_no_machine-readable_license>
 the work needs to be done on the licenses, not on the files. I do not know 
what machine-readable metadata is needed, but I can help with adding it.

2)      Your number of files missing machine-readable metadata on Commons, 
~533,000, seems a bit low. According to 
Special:MostTranscludedPages<https://commons.wikimedia.org/wiki/Special:MostTranscludedPages>
 there are 24,136,218 files with licenses ({{License template 
tag<https://commons.wikimedia.org/wiki/Template:License_template_tag>}}) and 
23,452,741 files with infobox templates ({{Information}} or {{Infobox template 
tag<https://commons.wikimedia.org/wiki/Template:Infobox_template_tag>}}), so I 
would expect 683,477 files without any infobox template.
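
The estimate above is a simple difference of the two transclusion counts. A 
minimal sketch of the arithmetic, assuming (this is not stated on the special 
page) that every file with an infobox also carries a license tag:

```python
# Transclusion counts quoted above from Special:MostTranscludedPages.
files_with_license = 24_136_218   # {{License template tag}}
files_with_infobox = 23_452_741   # {{Information}} or {{Infobox template tag}}

# Assumption: infobox files are a subset of licensed files, so the difference
# approximates the number of licensed files that have no infobox template.
files_without_infobox = files_with_license - files_with_infobox

# Gap between this estimate and the ~533,000 reported by the dashboard.
reported_missing = 533_000
gap = files_without_infobox - reported_missing

print(f"expected without infobox: {files_without_infobox:,}")
print(f"difference from reported: {gap:,}")
```

If the subset assumption does not hold, the true number could differ in either 
direction, so this is only a rough cross-check.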

3)      As I mentioned on 
Commons:Bots/Work_requests#An_example_pattern<https://commons.wikimedia.org/wiki/Commons:Bots/Work_requests#An_example_pattern>
 I would like to first give the original uploaders a chance to fix the files. 
We can do that by writing a standard message which, without any threat of 
deletion, asks for help with bringing their files up to current standards. We 
should send one message per uploader, with a list of all their files that need 
infoboxes. We should also advise them on the use of the VisualFileChange gadget 
or on requesting specific tasks to be done by bots at Commons:Bots/Work 
requests. The VisualFileChange gadget by user:Rillke does have an option 
“Prepend text, notify uploaders” which does almost what I need (one message per 
uploader), but I would prefer Python code.
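
The "one message per uploader" idea could be sketched roughly as follows. This 
is only an illustration in plain Python: the function names, the message 
wording, and the sample data are all hypothetical, and the step that would 
actually post to a user talk page (for example through a bot framework) is 
deliberately left out.

```python
from collections import defaultdict


def group_by_uploader(files):
    """Map each uploader to the sorted list of their files needing an infobox.

    `files` is an iterable of (file_title, uploader) pairs.
    """
    grouped = defaultdict(list)
    for title, uploader in files:
        grouped[uploader].append(title)
    return {user: sorted(titles) for user, titles in grouped.items()}


def build_message(uploader, titles):
    """Build a single wikitext message listing all of one uploader's files."""
    lines = [
        "== Files missing an infobox template ==",
        f"Hello {uploader}, the files below, which you uploaded, do not yet "
        "have an infobox template such as {{Information}}. "
        "Could you help bring them up to current standards?",
    ]
    # The leading colon links to the file page instead of embedding the image.
    lines += [f"* [[:{title}]]" for title in titles]
    lines.append(
        "The VisualFileChange gadget or a request at "
        "[[Commons:Bots/Work requests]] may help with larger batches. ~~~~"
    )
    return "\n".join(lines)


# Hypothetical sample data: (file title, uploader) pairs.
sample = [
    ("File:Example B.jpg", "UserOne"),
    ("File:Example A.jpg", "UserOne"),
    ("File:Example C.png", "UserTwo"),
]
messages = {
    user: build_message(user, titles)
    for user, titles in group_by_uploader(sample).items()
}
print(messages["UserOne"])
```

Grouping first and formatting second keeps the notification to exactly one 
message per uploader, however many files they have on the list.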

4)      At some point I started adding such files to [[Category:Media missing 
infobox 
template<https://commons.wikimedia.org/wiki/Category:Media_missing_infobox_template>]]
 for better tracking and started sub-categorizing them into:

a.       Files with OTRS

b.      Files with the {{information}} template which have some parsing issues

c.       Files with {{PD-Art}}, which should use the {{Artwork}} template and 
where the name of the uploader, upload date, and even source might not be 
relevant

d.      Files using a PD license, like PD-old (except PD-Author or PD-User): 
for those files, too, the name of the uploader, upload date, and even source 
might not be relevant

It might be easier to add infoboxes separately for different groups of files. 
For example, Magnus' 
add_information.php<http://toolserver.org/%7Emagnus/add_information.php> tool 
does not work well for artworks. We also seem to have users who specialize in 
different subjects, and it might be easier to get their attention with smaller 
groups of files of one type.
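
The sub-categorization above could be approximated by looking at which 
templates a file page uses. The sketch below is a rough illustration only: the 
group labels, the order in which the rules are checked, and the template names 
used to detect OTRS ("OTRS", "PermissionOTRS") are assumptions, and real file 
pages will need more careful parsing than a regular expression.

```python
import re


def template_names(wikitext):
    """Return the lower-cased names of templates opened in the wikitext."""
    found = re.findall(r"\{\{\s*([^|{}]+?)\s*(?=[|}])", wikitext)
    return [name.strip().lower() for name in found]


def classify(wikitext):
    """Assign a file page to one of the sub-groups (labels are hypothetical)."""
    names = template_names(wikitext)
    # a. Files with OTRS (the template names checked here are an assumption).
    if any(n in ("otrs", "permissionotrs") for n in names):
        return "otrs"
    # b. {{Information}} is present, yet the file was flagged as missing an
    #    infobox, which suggests a parsing issue.
    if "information" in names:
        return "information-parsing-issue"
    # c. {{PD-Art}} files, which should use {{Artwork}}.
    if any(n.startswith("pd-art") for n in names):
        return "pd-art"
    # d. Other PD licenses, except PD-Author and PD-User.
    if any(n.startswith("pd-") and n not in ("pd-author", "pd-user")
           for n in names):
        return "pd-other"
    return "unsorted"


print(classify("{{PD-Art|PD-old-100}}"))
print(classify("{{PD-old}}"))
print(classify("{{PD-user|Example}}"))
```

Each resulting group could then be handed to the tool or the volunteers best 
suited to it.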



Jarek T.

(user:Jarekt)



-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Guillaume Paumier
Sent: Thursday, December 11, 2014 2:16 PM
To: Coordination of technology deployments across languages/projects; Wikimedia 
Commons Discussion List
Subject: [Commons-l] File metadata cleanup drive: We now have numbers for 
Commons



Greetings,



As many of you are aware, we're currently in the process of collectively adding 
machine-readable metadata to many files and templates that don't have them, 
both on Commons and on all other Wikimedia wikis with local uploads [1,2]. This 
makes it much easier to see and re-use multimedia files consistently with best 
practices for attribution across a variety of channels (offline, PDF exports, 
mobile platforms, MediaViewer, WikiWand, etc.)



In October, I created a dashboard to track how many files were missing the 
machine-readable markers on each wiki [3]. Unfortunately, due to the size of 
Commons, I needed to find another way to count them there.



Yesterday, I finished implementing the script for Commons and started to run 
it. As of today, we have accurate numbers for the quantity of files missing 
machine-readable metadata on Commons: ~533,000, out of ~24 million [4]. It may 
seem like a lot, but I personally think it's a great testament to the 
dedication of the Commons community.



Now that we have numbers, we can work on going through those files and fixing 
them. Many of them are missing the {{information}} template, but many of those 
are also part of a batch: either they were uploaded by the same user, or they 
were mass-uploaded by a bot. In either case, this makes it easier to parse the 
information and add the {{information}} template automatically with a bot, thus 
avoiding painful manual work.



I invite you to take a look at the list of files at 
https://tools.wmflabs.org/mrmetadata/commons/commons/index.html and see if you 
can find such groups and patterns.



Once you identify a pattern, you're encouraged to add a section to the Bot 
Requests page on Commons, so that a bot owner can fix them:

https://commons.wikimedia.org/wiki/Commons:Bots/Work_requests#Adding_the_Information_template_to_files_that_don.27t_have_it



I believe we can make a lot of progress rapidly if we dive into the list of 
files and fix all the groups we can find. The list and statistics will be 
updated daily so it'll be easy to see our progress.



Let me know if you'd like to help but are unsure how!



[1] https://meta.wikimedia.org/wiki/File_metadata_cleanup_drive

[2] 
https://blog.wikimedia.org/2014/11/07/cleaning-up-file-metadata-for-humans-and-robots/

[3] https://tools.wmflabs.org/mrmetadata/

[4] https://tools.wmflabs.org/mrmetadata/commons/commons/index.html



--

Guillaume Paumier



_______________________________________________

Commons-l mailing list

[email protected]<mailto:[email protected]>

https://lists.wikimedia.org/mailman/listinfo/commons-l
_______________________________________________
Wikitech-ambassadors mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-ambassadors
