Re: [Wikimedia-l] [Commons-l] Data mining for media archives

2014-02-06 Thread Fæ
On 7 February 2014 04:04, Samuel Klein  wrote:
> That's just beautiful.  Thank you, Fae & Faebot.
>
> I see that job filtered for mobile uploads without EXIF data.
> What obstacles do you envision for running such a service for all images?
>> https://commons.m.wikimedia.org/wiki/User:Faebot/SandboxM

Technically, it could probably run in real time for a subset of
recently uploaded images. For a focus on finding copyright problems,
the results would be more meaningful with a white-list/pre-filter in
place to ignore uploads from reliable sources, from well-established
user accounts, or where the EXIF data or applied templates make a
problem file highly unlikely (for example, templates showing the
upload was part of a recognized wiki-project such as WLM, which has
its own review process). From my experience with the mobile upload
categories, I would expect a "file duplicate/possible copyvio to
check" tag or report to be more than 90% successful at identifying
files that will get deleted as policy violations or as unnecessary
inferior duplicates/crops. With a little more wizardry, it should be
possible to "red-flag" some files as TV screenshots, as similar to
previously deleted images, or even as close matches to black-listed
files (such as accepted DMCA take-downs or known spam files).
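
(Illustrative only: a minimal Python sketch of the kind of
white-list/pre-filter described above. The Upload fields, template
names and thresholds are invented examples, not Faebot's actual code.)

# Sketch of the pre-filter idea: skip uploads that are very unlikely
# to be problem files, so only the rest go to a reverse-image check.
from dataclasses import dataclass, field

TRUSTED_TEMPLATES = {"Wiki Loves Monuments 2013", "Flickr review"}  # example names
MIN_TRUSTED_EDITS = 500   # assumed threshold for a "well established" account

@dataclass
class Upload:
    title: str
    uploader_edit_count: int
    templates: list = field(default_factory=list)
    has_camera_exif: bool = False

def should_check(u: Upload) -> bool:
    """True if the upload still needs a duplicate/copyvio check."""
    if u.uploader_edit_count >= MIN_TRUSTED_EDITS:
        return False              # established uploader: white-listed
    if TRUSTED_TEMPLATES & set(u.templates):
        return False              # covered by a recognized review process (e.g. WLM)
    if u.has_camera_exif:
        return False              # own-camera EXIF makes a copyvio less likely
    return True

def files_to_review(uploads):
    """Keep only uploads worth tagging 'possible copyvio/duplicate to check'."""
    return [u for u in uploads if should_check(u)]

if __name__ == "__main__":
    sample = [
        Upload("File:New phone photo.jpg", uploader_edit_count=3),
        Upload("File:WLM entry.jpg", 10, templates=["Wiki Loves Monuments 2013"]),
    ]
    print([u.title for u in files_to_review(sample)])  # -> only the first file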

Other obstacles are less technical:

1. Faebot works without using the Tineye API, which is quite
restrictive in the number of queries it allows. Many thousands of
queries a day would require special permission from Tineye, as even
their "commercial" access appears too limited for the volume we might
expect.

2. In reality, very few volunteers use Ogre's report of uploads from
new accounts, and I have had almost no spontaneous feedback on my
mobile uploads report. To make the output appealing, it may be better
either to build a special dashboard, or to use bot-placed tags for
"likely copyright issue" at the time of upload, so that the flag gets
picked up by new-page patrollers in their reports and tools.

3. Volunteer time and making this a priority -- I have an interesting
backlog of content creation, geo-location and potential GLAM projects,
which are more glamorous and fun than fiddling with image-matching and
copyright checking. Making a Tineye-based 'similarityBot' work well
would probably take non-trivial research, testing, development
time/code review, community consultation, report-writing, maintenance
and bug-fixing... so this might be a candidate for a grant proposal
with an element of paid dev time. I previously thought I might get a
proposal together over the summer, along with more reading up on the
Tineye API and possibly a bit more testing, but my thoughts on this
are tentative right now.

4. Many of the highest-match-count results (100+) in Tineye are for
images that are obviously public domain, such as photographs of
well-known 19th-century paintings; at the same time, probably 50%+ of
obvious copyright violations have just three or fewer matches on
Tineye. Pulling the Tineye results in a more intelligent way is
possible: for example, Tineye can tell you if another version of the
image is on a Wikimedia project (with a licence that probably applies
to the uploaded image), or if it is hosted by a source that we
recognize and can check the licence on, such as being on Flickr at a
higher resolution and marked All Rights Reserved (a rough sketch of
this kind of classification follows after this list). Building a more
intelligent bot is possible, but it comes with an increasing
maintenance headache as external websites continually change,
including any APIs we might connect to and Tineye itself.
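
(Illustrative only: a rough Python sketch of the classification idea
in point 4. Each match record is a hypothetical dict combining a
reverse-image hit with a separately looked-up licence flag; this is
not the real Tineye API.)

# Coarse labelling of a file based on where its matches are hosted.
from urllib.parse import urlparse

WIKIMEDIA_HOSTS = (".wikimedia.org", ".wikipedia.org", "wikimedia.org")

def classify_matches(upload_width, upload_height, matches):
    """Return a coarse label for an upload from its match records."""
    for m in matches:
        host = urlparse(m["url"]).hostname or ""
        if host.endswith(WIKIMEDIA_HOSTS):
            return "existing-wikimedia-copy"   # licence probably already applies
        if host.endswith("flickr.com"):
            larger = m["width"] > upload_width or m["height"] > upload_height
            if larger and m.get("all_rights_reserved"):
                return "likely-copyvio"        # bigger, All Rights Reserved original
    if len(matches) >= 100:
        return "probably-public-domain"        # e.g. famous 19th-century painting
    if 1 <= len(matches) <= 3:
        return "needs-human-review"            # few matches: often the real copyvios
    return "unclear"

if __name__ == "__main__":
    sample = [{"url": "https://www.flickr.com/photos/x/1", "width": 4000,
               "height": 3000, "all_rights_reserved": True}]
    print(classify_matches(1024, 768, sample))   # -> likely-copyvio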

Fae
-- 
fae...@gmail.com http://j.mp/faewm



Re: [Wikimedia-l] [Commons-l] Data mining for media archives

2014-02-06 Thread Samuel Klein
That's just beautiful.  Thank you, Fae & Faebot.

I see that job filtered for mobile uploads without EXIF data.
What obstacles do you envision for running such a service for all images?

On Thu, Feb 6, 2014 at 7:59 PM, Fæ  wrote:
> On 6 Feb 2014 22:40, "Samuel Klein"  wrote:
> ...
>> Are we doing any commons analysis like this at the moment?
>> Is any similarity-analysis done on upload to help uploaders identify
>> copies of the same image that already exist online?  Or to flag
>> potential copyvios for reviewers
>
> Yes O:-)
> Check out Faebot's work with Tineye here:
> https://commons.m.wikimedia.org/wiki/User:Faebot/SandboxM



-- 
Samuel Klein  @metasj   w:user:sj  +1 617 529 4266



Re: [Wikimedia-l] [Commons-l] Data mining for media archives

2014-02-06 Thread Fæ
On 6 Feb 2014 22:40, "Samuel Klein"  wrote:
...
> Are we doing any commons analysis like this at the moment?
> Is any similarity-analysis done on upload to help uploaders identify
> copies of the same image that already exist online?  Or to flag
> potential copyvios for reviewers

Yes O:-)
Check out Faebot's work with Tineye here:
https://commons.m.wikimedia.org/wiki/User:Faebot/SandboxM


Re: [Wikimedia-l] [Textbook-l] Textbooks Which Borrow Heavily from Wikipedia

2014-02-06 Thread Federico Leva (Nemo)

Samuel Klein, 06/02/2014 22:41:

> How could they improve attribution?


What Phoebe said. A link to each history page *might* be enough but, 
especially if they're ebooks, a full list of names costs little (even 
though it can be ugly).



> What download formats or APIs would we like to see to enable reposting
> to Wikibooks, or better cross-platform collaboration?


Making books or ebooks out of wiki pages is not a trivial task. If a 
publisher does so for us, fantastic! Even just giving "us", i.e. the 
public, a copy of said ebooks in a free format, for free, would be a 
gain. For instance, they could just upload them all as ePub on 
archive.org. Then, if they keep producing and updating these, 
Wikibooks could establish some interlinking, telling users that there is 
an ebook version at X.



> Is anyone on wikibooks currently working on importing such materials,
> in Tamil or English or other languages?


Do they have any original content? In that case it would be nice if 
their "fork" shared the sources with us, so that the content can be 
remixed. I doubt they use wikitext and at least for some time we won't 
have an integrated HTML import in MediaWiki, but having TeX or DocBook 
sources, or whatever, would be great.


Nemo



[Wikimedia-l] Data mining for media archives

2014-02-06 Thread Samuel Klein
John Resig has just published some excellent data analysis combining
TinEye, image archives, and image clustering and deduplication to
identify identical and similar images across a large corpus.

http://ejohn.org/research/computer-vision-photo-archives/
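
(Illustrative only: a small Python sketch of perceptual-hash
deduplication in the spirit of that write-up, not Resig's actual
pipeline. It assumes the third-party Pillow and ImageHash packages
are installed.)

# Group images whose perceptual hashes are within a few bits of each other.
from pathlib import Path
from PIL import Image
import imagehash

def near_duplicates(image_dir, max_distance=5):
    """Return (name, name, distance) pairs for visually similar images."""
    hashes = []
    for path in sorted(Path(image_dir).glob("*.jpg")):
        hashes.append((path, imagehash.phash(Image.open(path))))
    pairs = []
    for i, (p1, h1) in enumerate(hashes):
        for p2, h2 in hashes[i + 1:]:
            if h1 - h2 <= max_distance:   # Hamming distance between hashes
                pairs.append((p1.name, p2.name, h1 - h2))
    return pairs

if __name__ == "__main__":
    for a, b, d in near_duplicates("images"):
        print(f"{a} ~ {b} (distance {d})")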

Are we doing any commons analysis like this at the moment?
Is any similarity-analysis done on upload to help uploaders identify
copies of the same image that already exist online?  Or to flag
potential copyvios for reviewers?

I'm sure TinEye would be glad to give us high-volume API access to
enable that sort of cross-referencing.

SJ



Re: [Wikimedia-l] Textbooks Which Borrow Heavily from Wikipedia

2014-02-06 Thread phoebe ayers
On Thu, Feb 6, 2014 at 1:41 PM, Samuel Klein  wrote:

> I'm meeting with the Boundless team tomorrow.
>
> How could they improve attribution?


Looking at the Google Books copy that James linked (and without paying
for a download etc.) I don't see any particular attribution at all in
the book itself. The inside front cover should have a publisher credit
(Boundless), a date, and a) authors/editors; b) a list of sources where
they've taken info from... the Wikipedia articles (permanent URLs), and
other sources. Also, a license??

Without that (and any other useful info: place of publication, URL, ISBN,
etc) not only is it not attributed for our purposes, it's a nightmare for
any hapless library cataloger who might want to add it to a library
collection :P

-- phoebe


Re: [Wikimedia-l] Textbooks Which Borrow Heavily from Wikipedia

2014-02-06 Thread David Gerard
On 6 February 2014 21:41, Samuel Klein  wrote:

> I'm meeting with the Boundless team tomorrow.


Excellent!


> How could they improve attribution?
> What download formats or APIs would we like to see to enable reposting
> to Wikibooks, or better cross-platform collaboration?


Yeah, this is it. Our entire raison d'être is: "Use our stuff! Please!"

*But* we want collaboration and upstreamed fixes and so forth in
return. What can we do to make that easier?


- d.



Re: [Wikimedia-l] Textbooks Which Borrow Heavily from Wikipedia

2014-02-06 Thread Samuel Klein
I'm meeting with the Boundless team tomorrow.

How could they improve attribution?
What download formats or APIs would we like to see to enable reposting
to Wikibooks, or better cross-platform collaboration?

Is anyone on wikibooks currently working on importing such materials,
in Tamil or English or other languages?

SJ

On Tue, Nov 26, 2013 at 2:26 AM, James Heilman  wrote:
> I have come across a collection of basic college textbooks that appear
> to be more or less based on text from Wikipedia. There are 21 of them.
> The company claims that they are being used by more than 2 million
> students.
>
> They are under a CC BY SA license and if you follow the links seen here
> http://books.google.ca/books?id=7avpQBAJ&pg=PA2058 they do eventually
> attribute Wikipedia.
>
> They are being offered for free on amazon.com
> http://www.amazon.com/s/ref=nb_sb_noss_2?url=search-alias%3Daps&field-keywords=Boudless
> and
> are being sold for $19.99 on their website. https://www.boundless.com/
>
> So the question is: should we have a response? I think this could
> generate positive press for our movement. Attribution could be better
> (I would consider theirs to be borderline). Additionally, should we be
> adding these textbooks to Wikiversity or Wikibooks to make sure they
> stay freely available?
>
> --
> James Heilman
> MD, CCFP-EM, Wikipedian
>
> The Wikipedia Open Textbook of Medicine
> www.opentextbookofmedicine.com



-- 
Samuel Klein  @metasj   w:user:sj  +1 617 529 4266



Re: [Wikimedia-l] Botopedia?

2014-02-06 Thread Anders Wennersten

Thanks Sam, your answer warms my soul!

And you summarize my key points excellently (and more clearly than I 
managed myself).


@Gerard: Our visions are very close and I support yours in general. On 
a more concrete level it seems we have some different views; it could 
be misunderstandings on my side, it could be that we are thinking of 
different article subject segments, or even that we have different 
perspectives on what is feasible at different points in time. My 
strong belief (and life experience) is that it is in the meeting of 
different perspectives, like ours in this case, that really bright 
concepts and solutions turn up! Unfortunately a mailing list does not 
really work well for an exchange of ideas and concepts, so I wonder 
about the possibility of having an in-person (IRL) gathering at some 
point to really talk through these issues and reach new insights. I am 
open to any time and place; Wikimania could be one opportunity, if it 
does not push this too far into the future. Or could we create a 
special sub-track at Wikimania for this?


Anders

 




Samuel Klein wrote on 2014-02-06 21:29:

> @Anders: I seem to have unintentionally derailed your excellent
> thread.  My apologies; I've taken responses to that subthread
> offline.  To return to your main point: we do need 'A strategy for
> semi-automated article generation; and inclusion of Wikidata'.
>
> Anders Wennersten writes:
> < [we] will not be able to achieve our goal without... technical expertise
> < (like knowledge in Lua, how to write datainterface to external dataproviders)
>
> And it is important to attract and expand this sort of expertise.  Not
> only through local chapter support but through collaboration across
> different project-communities, as you say.
>
> @Gerard: I second your vision for Wikidata.  It is a natural place to
> cultivate tools for large-scale creation and enhancement of
> information.  And for now it seems open to experimentation, being
> bold, trying and reverting things.
>
>> Wikidata is a wiki. You indicate that the official sources
>> need work. Wikidata is a good place to work on this.
>
> +1 !
>
> Sam.







Re: [Wikimedia-l] Botopedia?

2014-02-06 Thread Samuel Klein
@Anders: I seem to have unintentionally derailed your excellent
thread.   My apologies; I've taken responses to that subthread
offline.  To return to your main point: we do need  'A strategy for
semi-automated article generation; and inclusion of Wikidata'.

Anders Wennersten writes:
< [we] will not be able to achieve our goal without... technical expertise
< (like knowledge in Lua, how to write datainterface to external dataproviders)

And it is important to attract and expand this sort of expertise.  Not
only through local chapter support but through collaboration across
different project-communities, as you say.

@Gerard: I second your vision for Wikidata.  It is a natural place to
cultivate tools for large-scale creation and enhancement of
information.  And for now it seems open to experimentation, being
bold, trying and reverting things.

> Wikidata is a wiki. You indicate that the official sources
> need work. Wikidata is a good place to work on this.

+1 !
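
(Illustrative only: a rough Python sketch of the kind of "data
interface to external data providers" mentioned above, fetching one
Wikidata item over the public wbgetentities API. The stub-sentence
template is an invented example, not a real article-generation
pipeline.)

# Fetch a Wikidata item and render one illustrative stub sentence.
import requests

API = "https://www.wikidata.org/w/api.php"

def fetch_item(qid, lang="en"):
    params = {
        "action": "wbgetentities",
        "ids": qid,
        "props": "labels|descriptions",
        "languages": lang,
        "format": "json",
    }
    return requests.get(API, params=params, timeout=30).json()["entities"][qid]

def stub_sentence(qid, lang="en"):
    item = fetch_item(qid, lang)
    label = item["labels"][lang]["value"]
    desc = item["descriptions"][lang]["value"]
    return f"{label} is {desc}."   # invented template, for illustration only

if __name__ == "__main__":
    print(stub_sentence("Q42"))    # e.g. a sentence about Douglas Adams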

Sam.



Re: [Wikimedia-l] Invitation to WMF January 2014 Metrics & Activities Meeting: Thursday, February 6, 19:00 UTC

2014-02-06 Thread Praveena Maharaj
REMINDER: This meeting starts in 30 minutes.


On Thu, Jan 30, 2014 at 3:29 PM, Praveena Maharaj wrote:

> Dear all,
>
> The next WMF metrics and activities meeting will take place on Thursday,
> February 6, 2014 at 7:00 PM UTC (11 AM PST). The IRC channel is
> #wikimedia-office on irc.freenode.net and the meeting will be broadcast
> as a live YouTube stream.
>
> The current structure of the meeting is:
>
> * Review of key metrics including the monthly report card, but also
> specialized reports and analytics
> * Review of financials
> * Welcoming recent hires
> * Brief presentations on recent projects, with a focus on highest priority
> initiatives
> * Update and Q&A with the Executive Director, if available
>
> Please review
> https://meta.wikimedia.org/wiki/Metrics_and_activities_meetings for
> further information about how to participate.
>
> We'll post the video recording publicly after the meeting.
>
> Thank you,
> Praveena
>
> --
> Praveena Maharaj
> Executive Assistant to the VP of Engineering and Product Development
> +1 (415) 839 6885 ext. 6689
> www.wikimedia.org
>



-- 
Praveena Maharaj
Executive Assistant to the VP of Engineering and Product Development
+1 (415) 839 6885 ext. 6689
www.wikimedia.org


[Wikimedia-l] Language Engineering IRC Office Hour on February 12, 2014 (Wednesday) at 1700 UTC

2014-02-06 Thread Runa Bhattacharjee
[x-posted]

Hello,

The Wikimedia Language Engineering team will be hosting the monthly IRC
office hour on February 12, 2014 (Wednesday) at 1700 UTC/ 0900 PDT on
#wikimedia-office.

This time we will be talking about the recent changes made to the
Universal Language Selector (ULS) - the MediaWiki extension that provides
unified language configuration[1] - and their impact on the Wikimedia wikis.
We look forward to addressing any questions you may have about this. Please
see below for the event details.

Questions can also be sent to me before the event. See you all at the IRC
office hour!

Thanks
Runa

[1] https://www.mediawiki.org/wiki/Universal_Language_Selector

Event Details:
==

# Date: February 12, 2014

# Time: 1700-1800 UTC, 0900-1000 PDT (
http://www.timeanddate.com/worldclock/fixedtime.html?iso=20140212T1700)

# IRC channel: #wikimedia-office on irc.freenode.net

Agenda:
==
1. Universal Language Selector (ULS) update and developments
2. Q & A

-- 
Language Engineering - Outreach and QA Coordinator
Wikimedia Foundation