On 7 February 2014 04:04, Samuel Klein <meta...@gmail.com> wrote:
> That's just beautiful.  Thank you, Fae & Faebot.
>
> I see that job filtered for mobile uploads without EXIF data.
> What obstacles do you envision for running such a service for all images?
>> https://commons.m.wikimedia.org/wiki/User:Faebot/SandboxM

Technically, it could probably run in real time for a subset of recently
uploaded images. For a focus on finding copyright problems, the results
would be more meaningful with a white-list/pre-filter in place to ignore
uploads from reliable sources, from well-established user accounts, or
where the EXIF data or applied templates make it highly unlikely to be a
problem file (for example, templates showing the upload was part of a
recognized wiki-project, such as WLM, which has its own review process).
From my experience with the mobile upload categories, I would expect a
"file duplicate/possible copyvio to check" tag or report to be more than
90% successful at identifying files that will get deleted as policy
violations or as unnecessary inferior duplicates/crops. With a little
more wizardry, it should be possible to "red-flag" some files as TV
screenshots, as similar to previously deleted images, or even as close
matches to black-listed files (such as accepted DMCA take-downs or known
spam files).
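To make the pre-filter idea concrete, here is a minimal sketch of the sort of check I have in mind. The field names, the edit-count threshold and the template names are all illustrative assumptions, not an existing Faebot interface:

```python
# Hypothetical pre-filter for a copyvio-checking bot. All field names,
# thresholds and template names below are assumptions for illustration.

TRUSTED_TEMPLATES = {"Wiki Loves Monuments", "GLAM batch upload"}  # assumed names

def should_check(upload):
    """Return True if an upload should be queued for duplicate/copyvio checks."""
    if upload.get("source_whitelisted"):
        # e.g. a batch upload from a known-reliable GLAM partner
        return False
    if upload.get("uploader_edit_count", 0) > 5000:
        # well-established account; skip to keep the report focused
        return False
    if TRUSTED_TEMPLATES & set(upload.get("templates", [])):
        # covered by a wiki-project's own review process (e.g. WLM)
        return False
    # everything else, including mobile uploads without EXIF, gets checked
    return True
```

A check like this keeps the report short enough that new uploads from unknown accounts dominate the output, rather than trusted batch uploads.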

Other obstacles are less technical:

1. Faebot works without using the Tineye API, as the API is quite
restrictive in the number of queries it allows. Many thousands of
queries a day would require special permission from Tineye, as even
their "commercial" access appears too limited for the volume we might
expect.
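Whatever quota we ended up with, the bot would need client-side throttling to stay inside it. A minimal sketch of a rolling 24-hour limiter (the quota number is a placeholder, not Tineye's actual limit):

```python
import time
from collections import deque

class QueryThrottle:
    """Client-side limiter to stay inside a per-day query quota.
    The default quota is an assumed placeholder, not a real Tineye limit."""

    def __init__(self, max_per_day=500):
        self.max_per_day = max_per_day
        self.calls = deque()  # timestamps of queries in the last 24 hours

    def allow(self, now=None):
        """Return True (and record the call) if another query fits the quota."""
        now = time.time() if now is None else now
        cutoff = now - 86400  # drop timestamps older than 24 hours
        while self.calls and self.calls[0] < cutoff:
            self.calls.popleft()
        if len(self.calls) < self.max_per_day:
            self.calls.append(now)
            return True
        return False
```

Queries that do not pass the throttle would simply be re-queued for the next day, so the backlog degrades gracefully rather than tripping the API limit.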

2. In reality, very few volunteers use Ogre's report of uploads from
new accounts, and I have had almost no spontaneous feedback on my
mobile uploads report. To make the output appealing, it may be better
either to build a special dashboard, or to have the bot place a "likely
copyright issue" tag at the time of upload, so that the flag gets used
by new-page patrollers in their reports and tools.

3. Volunteer time, and making this a priority -- I have an interesting
backlog of content creation, geo-location and potential GLAM projects,
which are more glamorous and fun than fiddling with image matching and
copyright checking. Making a Tineye-based 'similarityBot' work well
would probably take non-trivial research, testing, development
time/code review, community consultation, report-writing, maintenance
and bug-fixing... so this might be a candidate for a grant proposal
with an element of paid dev time. I previously thought I might put a
proposal together over the summer, along with more reading up on the
Tineye API and possibly a bit more testing, but my thoughts on this
are tentative right now.

4. Many of the highest-count matches (100+) on Tineye are for images
that are obviously public domain, such as photographs of well-known
19th-century paintings, while at the same time probably 50%+ of obvious
copyright violations have just three or fewer matches on Tineye.
Pulling the Tineye results in a more intelligent way is possible: for
example, Tineye can tell you if another version of the image is on a
Wikimedia project (with a licence that probably applies to the
uploaded image), or if it is hosted by a source we recognize and can
check the licence on, such as being on Flickr at a higher resolution
and marked All Rights Reserved. Building a more intelligent bot is
possible, but comes with an increasing maintenance headache, as
external websites continually change, including any APIs we might
connect to and Tineye itself.
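As a rough illustration of that triage, here is how the bot might sort reverse-image-search matches by host. The match-record fields ("url", "width", "height") are assumptions about a generic result format, not the actual Tineye API schema:

```python
from urllib.parse import urlparse

# Illustrative triage of reverse-image-search matches. The record fields
# and verdict names are assumptions for the sketch, not a real API schema.

WIKIMEDIA_HOSTS = ("wikimedia.org", "wikipedia.org")

def triage(upload_size, matches):
    """Return a coarse verdict: 'likely-ok', 'check-flickr', or 'unclear'.

    upload_size is a (width, height) tuple for the uploaded file;
    matches is a list of dicts with 'url', 'width' and 'height' keys.
    """
    upload_pixels = upload_size[0] * upload_size[1]
    for m in matches:
        host = urlparse(m["url"]).hostname or ""
        if host.endswith(WIKIMEDIA_HOSTS):
            # another copy already on a Wikimedia project; its licence
            # probably applies to the upload as well
            return "likely-ok"
        if "flickr.com" in host and m["width"] * m["height"] > upload_pixels:
            # a larger original exists on Flickr: check its licence,
            # e.g. a higher-resolution All Rights Reserved copy
            return "check-flickr"
    return "unclear"
```

Even a crude sort like this would separate the "100+ matches but clearly public domain" cases from the "three matches, one of them a bigger All Rights Reserved Flickr original" cases that matter most.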

Fae
-- 
fae...@gmail.com http://j.mp/faewm

_______________________________________________
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l, 
<mailto:wikimedia-l-requ...@lists.wikimedia.org?subject=unsubscribe>