Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-22 Thread Steven Walling
On Thu, Nov 21, 2013 at 12:37 AM, WereSpielChequers 
werespielchequ...@gmail.com wrote:

 Typo correction and vandalism reversion are certainly both entries to
 editing, and it isn't just anti-vandalism where the opportunities have
 declined in recent years. Typos are getting harder to find, especially in
 stable widely read articles. Yes you can find plenty of typos by checking
 new pages and recent changes, but I doubt our  5 edits a month editors are
 going to internal maintenance pages like that. I suspect they are readers
 who fix things they come across. It would be interesting to survey a sample
 of them; I suspect we'd find many who are reading Wikipedia just as much as
 they used to, but if they only edit when they spot a mistake then of course
 they will now be editing less frequently. And of course none of that is
 actually bad, any more than is the loss of large numbers of vandals who
 used to get into the 5 edits a month band for at least the month in which
 they did their spree and were blocked.

 The difficulty of getting precise measurements of community health makes
 it a fascinating topic, and with many known factors altering edit levels in
 sometimes poorly understood ways we need to be wary of oversimplifications.
 No-one really knows what would have happened if the many edit filters
 installed in the last four years had instead been coded as anti vandalism
 bots. Clearly our edit count would now be much higher, but whether it would
 currently be higher or lower than in 2009 when the edit filters were
 introduced is unknown. Nor should we fret that we shifted so much of our
 anti-vandalism work from very quick reversion to not accepting edits.
 However it isn't sensible to benchmark community health against past edit
 levels; we should really be comparing community activity against readership
 levels. If we do that there is a disconnect between our readership which
 for years has grown faster than the internet and our community which is
 broadly stable. To some extent this can be considered a success for Vector
 and the shift of our default from a skin optimised for editing to one
 optimised for reading. Of course if we want to increase editing levels we
 always have the option of defaulting new accounts to Monobook instead of
 Vector. My suspicion is also that the rise of the mobile device, especially
 amongst the young, is turning us from an interactive medium into more of a
 broadcast one. It is also likely to be contributing to the greying of the
 pedia.

 I am trying to list the major known and probable causes of the fall in the
 raw editing levels in a page on wiki,
 https://en.wikipedia.org/wiki/User:WereSpielChequers/Going_off_the_boil%3F
 Feedback welcome.


Holy smokes this thread has gotten off topic, but I'll bite. ;)

Making articles that need spelling and grammar fixes easily available to
new editors is precisely what we're doing with GettingStarted, our software
system for introducing newly-registered people to editing. (Docs at
https://en.wikipedia.org/wiki/Wikipedia:GettingStarted and
https://www.mediawiki.org/wiki/Onboarding_new_Wikipedians). Each month we're
getting thousands of new people to make their first typo fix on English
Wikipedia, and we're moving to other Wikipedias soon.

In English Wikipedia it's quite easy for us to do so, since there's a large
category of articles needing copyediting. In other Wikipedias, it's not
easy, because there is no such category. If you want to help us help
newbies, the best thing you could do is create a copyediting category on
your Wikipedia and link it to the appropriate Wikidata item
(either Q8235695 or Q9137504).

As a side point: when we examine first-time editors' contributions, these
days it's rare to find someone start out by correcting vandalism, probably
because now bots and users of tools like Huggle or Twinkle catch it all so
fast. It's so small a number that when we examine samples of new
contributors in our qualitative research,[1][2] we just put it in the "Other"
category of edit types.

Steven

1.
https://meta.wikimedia.org/wiki/Research:Onboarding_new_Wikipedians/Qualitative_analysis
2.
https://meta.wikimedia.org/wiki/Research:Onboarding_new_Wikipedians/OB6/Contribution_quality_and_type
___
Wikimedia-l mailing list
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l, 
mailto:wikimedia-l-requ...@lists.wikimedia.org?subject=unsubscribe

Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-22 Thread The Cunctator
Also, vandalism has always been a red herring, kind of like the terrorism
that justifies the TSA security theater and NSA surveillance or the Red
Scare. It's a wrong-headed obsession that weakens community.

Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-20 Thread
On 19 November 2013 20:44, Samuel Klein meta...@gmail.com wrote:
 Aside @Fae: the tineye crew are curious and quite pro-freeculture, I bet they
 would be glad to help design a bot that uses their API to check image
 copyvios.

This is an area that spins off from my little experiments with better
management of uploads to Commons from mobile devices. I would like to
look at this again and perhaps get a funding proposal together (or
partnership with Tineye if they are up for it). It is one of several
creative back-burner volunteer projects that I hope to have time to
dig into again next year.

Fae


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-20 Thread The Cunctator
Yes, let's keep on pushing for policies that drive away editors!

Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-20 Thread Martijn Hoekstra
On Nov 20, 2013 1:13 PM, The Cunctator cuncta...@gmail.com wrote:

 Yes, let's keep on pushing for policies that drive away editors!

I'm not sure exactly what kind of policy you are getting at here. Could you
elaborate a little?


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-20 Thread Marc A. Pelletier
On 11/20/2013 07:13 AM, The Cunctator wrote:
 Yes, let's keep on pushing for policies that drive away editors!

Let's be clear here: contributions that are copyright violations are not
desirable to begin with.  If someone is driven away because they cannot
cut and paste from random websites anymore, I'm not sure that this could
reasonably be taken to be a bad thing.

-- Marc



Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-20 Thread Michael Snow

On 11/20/2013 8:31 AM, Marc A. Pelletier wrote:

On 11/20/2013 07:13 AM, The Cunctator wrote:

Yes, let's keep on pushing for policies that drive away editors!

Let's be clear here: contributions that are copyright violations are not
desirable to begin with.  If someone is driven away because they cannot
cut and paste from random websites anymore, I'm not sure that this could
reasonably be taken to be a bad thing.

Not that I encourage us to be permissive about copyright infringement,
but there are two potential aspects here. You've touched on the first, 
which is contributors who do the copying - if they are willing to 
change, that's fine, although I'm skeptical about the value of editors 
who don't know any better and certainly repeat offenders should be 
highly unwelcome. But the second aspect is the loss of tasks other 
editors may be able to participate in, if there's potential for 
overautomation of the review process and corresponding loss of human 
judgment (What needs to be removed and what could be fixed just by 
citing the source? How thorough a rewrite is necessary to avoid 
plagiarizing source text?). An essential part of collaboration is, after 
all, reviewing each other's work. From the terseness of the comment, it 
might be alluding to either aspect or both.


--Michael Snow


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-20 Thread The Cunctator
There's also been discussion of automatically deleting content that
contributors copied from their own writing.

Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-20 Thread Marc A. Pelletier
On 11/20/2013 11:59 AM, Michael Snow wrote:
 An essential part of collaboration is, after all, reviewing each other's
 work. From the terseness of the comment, it might be alluding to either
 aspect or both.

That's actually an interesting question that has been lurking beneath
all the "editing is going down" nervousness.

How much of that 'editing' was, in fact, busy work made immaterial by
technical advantage (bots, extensions, abusefilter)?  The number of
antivandalism edits a /human/ has to do in a day has most certainly come
down a *lot* since c. 2006; this no doubt contributed to a large - now
diminishing - fraction of total edits.

It's not clear to me that the number of *productive* edits has been
going down all that much (if at all) in the past several years; the
proportion of edits that were tedious and repetitive clearly has.

Are you arguing that there is *value* in volunteers spending time on
work that could be automated?  Except for artificially driving up edit
counts, that is time (and effort) that would be better spent pretty much
anywhere else!

-- Marc



Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-20 Thread Richard Symonds
Not quite: I would argue that anti-vandalism work is a gateway drug to
the rest of the project. Just a hunch, though.

Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-20 Thread Michael Snow

On 11/20/2013 9:20 AM, Marc A. Pelletier wrote:

That's actually an interesting question that has been lurking beneath
all the "editing is going down" nervousness.

How much of that 'editing' was, in fact, busy work made immaterial by
technical advantage (bots, extensions, abusefilter)?  The number of
antivandalism edits a /human/ has to do in a day has most certainly come
down a *lot* since c. 2006; this no doubt contributed to a large - now
diminishing - fraction of total edits.

It's not clear to me that the number of *productive* edits has been
going down all that much (if at all) in the past several years; the
proportion of edits that were tedious and repetitive clearly has.

Are you arguing that there is *value* in volunteers spending time on
work that could be automated?  Except for artificially driving up edit
counts, that is time (and effort) that would be better spent pretty much
anywhere else!

A lot of work that gets automated is not necessarily difficult for
humans, just time-consuming. But volunteer time is not a resource we get 
to allocate or control; the volunteers do. Simple tasks can help recruit 
or retain contributors--providing a way to ease people into 
participation, or a break to prevent burnout between tackling more 
challenging projects. And while that time and effort might appear more 
valuable if spent on other tasks, there's no guarantee that it in fact 
would be.


For tasks that most contributors find unpleasant (dealing with certain 
types of vandalism, perhaps), automation is clearly the way to go. But 
repetition does not necessarily equal tedium in all circumstances or for 
all people. Nor do we need to apply some business-type evaluation of 
what constitutes productive effort, at least in the context of 
volunteer work. If a task simply makes someone feel productive, their 
own evaluation is what matters, and it can help them feel more engaged 
and part of the community.


My general point is that opportunities for automation are best 
considered with our overall mission in mind, not just the speed or 
efficiency of a particular workflow. In certain situations, automation 
that creates more work rather than removing it (such as by identifying 
potential tasks and feeding them to editors) might be preferable. And 
some of our tools already use such an approach, which is a good thing.


--Michael Snow


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-20 Thread Marc A. Pelletier
On 11/20/2013 01:06 PM, Richard Symonds wrote:
 Not quite: I would argue that anti-vandalism work is a gateway drug to
 the rest of the project. Just a hunch, though.

I'm pretty sure that typo correction fills pretty much the same niche,
though.

-- Marc



Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-20 Thread Marc A. Pelletier
On 11/20/2013 01:13 PM, Michael Snow wrote:
 My general point is that opportunities for automation are best
 considered with our overall mission in mind, not just the speed or
 efficiency of a particular workflow. In certain situations, automation
 that creates more work rather than removing it (such as by identifying
 potential tasks and feeding them to editors) might be preferable. And
 some of our tools already use such an approach, which is a good thing.

That's an interesting approach, but I'm not sure how constructive it is
in the long run.  I suppose it depends greatly on whether one considers
our mission to be 'building an encyclopedia to share in the sum[...]' or
'having an encyclopedia to share in the sum[...]' (I'm not sure if I
make the subtle distinction here clear).

Perhaps another way of putting it is to ask whether the
encyclopedia-building community is the means or the ends.  To my eyes,
having more contributors is not valuable unless it has better
encyclopedia as a direct consequence.

-- Marc



Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-20 Thread Michael Snow

On 11/20/2013 10:52 AM, Marc A. Pelletier wrote:

Perhaps another way of putting it is to ask whether the
encyclopedia-building community is the means or the ends.  To my eyes,
having more contributors is not valuable unless it has better
encyclopedia as a direct consequence.

I believe the mission is sufficiently large in scope that having more
people involved is fundamentally desirable in general. Although to 
circle back to an earlier point in the discussion, that doesn't require 
that we accept involvement that is counterproductive. Maintaining our 
standards is a way of acknowledging that the number of people involved 
is not itself the end goal.


--Michael Snow


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-19 Thread Andrew Gray
It could use abuse-filter tags, just not in an entirely standard way:

* Bot scans edit X
* Script flags it as a problem
* Bot makes edit X+1 to the page (perhaps adding a copyvio template?), which
triggers an abusefilter rule (if the editor is this bot and it makes
such-and-such an edit) and tags it.

The offending edit itself won't be tagged, but the page history will
and it can probably be spotted quite easily from there.
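
The three steps above could be sketched roughly as follows, in Python. This
is only an in-memory illustration: the bot name, the template name, the tag
text and every function here are hypothetical, the "copyvio check" is a
placeholder substring test, and nothing in it calls MediaWiki or AbuseFilter.

```python
from dataclasses import dataclass, field

BOT_NAME = "ExampleCopyvioBot"    # hypothetical bot account
MARKER = "{{copyvio-suspect}}"    # hypothetical template name


@dataclass
class Revision:
    rev_id: int
    user: str
    text: str
    tags: list = field(default_factory=list)


def looks_like_copyvio(added_text, known_sources):
    """Placeholder for the external check (web search, database lookup, ...)."""
    return any(added_text and added_text in source for source in known_sources)


def tag_rule(previous, new):
    """Stand-in for the filter rule: this bot, adding this marker."""
    if new.user == BOT_NAME and MARKER in new.text and MARKER not in previous.text:
        return ["possible copyright violation"]
    return []


def save(history, user, text):
    """Append a revision and apply the tag rule to it, as a filter would on save."""
    previous = history[-1]
    new = Revision(previous.rev_id + 1, user, text)
    new.tags = tag_rule(previous, new)
    history.append(new)
    return new


def bot_review(history, known_sources):
    """Scan the latest edit X; if it is flagged, make edit X+1 that trips the rule."""
    previous, latest = history[-2], history[-1]
    if latest.text.startswith(previous.text):
        added = latest.text[len(previous.text):]
    else:
        added = latest.text
    if looks_like_copyvio(added.strip(), known_sources):
        return save(history, BOT_NAME, MARKER + "\n" + latest.text)
    return None


# Usage: edit 2 pastes text from a "website"; the bot then makes edit 3.
history = [Revision(1, "Alice", "Original stub.")]
save(history, "Bob", "Original stub. Text pasted from a website.")
bot_review(history, ["Intro. Text pasted from a website. Outro."])
for rev in history:
    print(rev.rev_id, rev.user, rev.tags)
```

In this sketch the offending edit (2) carries no tag; only the bot's
follow-up edit (3) does, which is the limitation described above.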

A.

On 19 November 2013 01:07, Matthew Flaschen mflasc...@wikimedia.org wrote:
 On 11/16/2013 09:04 AM, Anthony Cole wrote:

 The problem of false positives from mirrors doesn't exist if we scan edits
 as they are made.


 Agreed.  However, that example is a legal, attributed (at least on the talk
 page) copy from a third-party freely licensed text, not a false positive
 copy from a Wikipedia mirror.

 Maggie says here
 https://en.wikipedia.org/wiki/Wikipedia:Administrators%27_noticeboard#Emergency_block_of_an_editor_with_which_I_have_been_previously_involved
 that copyright bots populate WP:SCV
 (https://en.wikipedia.org/wiki/Wikipedia:SCV). So a
 similarly-configured bot could scan recent changes and tag suspected
 copyvios in watchlists and page histories like suspected vandalism is
 currently tagged.


 The suspected vandalism checks that actually tag the edit (e.g. Tag:
 possible vandalism)  are based on AbuseFilter checks.  These are relatively
 fast determinations that consider the text of the edit (e.g. regexes for
 strings of curse words, or meaningless repeating characters), and
 comparisons to the previous version (blanked the section, blanked the page).

 As far as I know, regular AbuseFilter rules can not hit a database or web
 search to check for copyright violations.  An extension could in theory do
 this.  But there would possibly be performance problems, since AbuseFilter
 runs on the actual server (not just some bot's computer) on every edit.

 It is possible for a bot to scan every edit; it just can't use AbuseFilter
 tags.

 Matt Flaschen
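
The kind of fast, text-only checks described above (regexes on the edit text
plus a comparison with the previous version) might look roughly like this in
Python. It is a sketch under stated assumptions: the patterns, the word list
and the function name are made up for illustration, and real AbuseFilter
rules are written on-wiki in their own rule language, not in Python.

```python
import re

# Hypothetical patterns; real filters are maintained on-wiki and are far richer.
REPEATED_CHARS = re.compile(r"(.)\1{9,}")  # the same character 10+ times in a row
BAD_WORDS = re.compile(r"\b(?:poop|stupid)\b", re.IGNORECASE)  # placeholder word list


def quick_flags(old_text, new_text):
    """Return cheap, local flags for an edit; no database or web lookups."""
    flags = []
    if old_text.strip() and not new_text.strip():
        flags.append("page blanked")
    if REPEATED_CHARS.search(new_text) and not REPEATED_CHARS.search(old_text):
        flags.append("repeating characters")
    if BAD_WORDS.search(new_text) and not BAD_WORDS.search(old_text):
        flags.append("bad words")
    return flags


print(quick_flags("A short article.", ""))
print(quick_flags("A short article.", "A short article. aaaaaaaaaaaa"))
print(quick_flags("Fine.", "Fine. More text."))
```

Everything here runs on the two text versions alone, which is why checks of
this sort are cheap enough to run on every save, while a copyright check that
needs a web search is not.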




-- 
- Andrew Gray
  andrew.g...@dunelm.org.uk


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-19 Thread Samuel Klein
Aside @Fae: the tineye crew are curious and quite pro-freeculture, I bet they
would be glad to help design a bot that uses their API to check image
copyvios.
On Nov 13, 2013 6:48 AM, Fæ fae...@gmail.com wrote:

 On 13 November 2013 07:40, James Heilman jmh...@gmail.com wrote:
 ...
  Our biggest issue is copyright infringement.
 ...

 Thanks for raising this James.

 Yes, this is an issue but if you are gunning for elephants this month,
 I really don't think the copyright elephant is the biggest one in the
 herd.

 As a practical example of the tools we already have in place,
 yesterday I was facilitating an edit-a-thon for women in science with
 King's College London and we had one of the example stubs we had
 created on the English Wikipedia up on a projector. Within literally
 *minutes* of creation it had been (correctly) flagged by a bot as a
 possible copyright violation as some of the text had been cut and pasted
 from King's own website; one of the participants quickly re-wrote it
 using their own words. As the communications manager was sitting next
 to me at the time, no doubt she found this rather reassuring, even
 though in parallel she was asking about how best to officially
 release text. :-)

 We have a more complex problem with how images uploaded to Wikimedia
 Commons can be flagged where they match images found elsewhere on the
 internet; this is something that may be done by a future bot but we
 might need to partner with someone like Google Images or Tineye to
 make this truly effective. Having run my own experimental bots on this
 area, I would love to see this become a funded project.

 PS with regard to OTRS verification, we could do with better standards
 for verification; at the moment volunteers like myself are left to use
 our own judgement about what checks to make. I tend to double check
 text or images being released with Google, just in case, as well as
 doing whois checks on email domains. These sorts of checks could
 become part of OTRS guidelines and would make the reliability of OTRS
 tickets a notch higher.

 Cheers,
 Fae
 --
 fae...@gmail.com http://j.mp/faewm


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-19 Thread Federico Leva (Nemo)

Samuel Klein, 19/11/2013 21:44:

Aside @Fae: the tineye crew are curious and quite pro-freeculture, I bet they
would be glad to help design a bot that uses their API to check image
copyvios.


How do we get them to include the whole Commons dataset in their own, to
start with?


Nemo


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-19 Thread Matthew Flaschen

On 11/13/2013 04:57 AM, Federico Leva (Nemo) wrote:

Marco Chiesa, 13/11/2013 10:21:

There are bots that go and look whether a newly inserted block of text is
already present somewhere else, [...]


Rectius: there *used* to be a bot (RevertBot, Lusumbot). The program
https://www.mediawiki.org/wiki/Manual:Pywikibot/copyright.py was
stopped when search engines changed their limits, and Lusum has been
waiting a while for the WMF's Yahoo! BOSS key, needed to run the bot.


https://en.wikipedia.org/wiki/User:MadmanBot, which uses the same Yahoo
APIs (http://www.uberbox.org/~marc/csb.pl), is still running on English
Wikipedia.


It might be possible to run it on Italian Wikipedia as well, even 
without generating a new key.  The operator seems to be 
https://en.wikipedia.org/wiki/User:Madman


Matt Flaschen



Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-19 Thread Federico Leva (Nemo)

Matthew Flaschen, 20/11/2013 06:05:

On 11/13/2013 04:57 AM, Federico Leva (Nemo) wrote:

Marco Chiesa, 13/11/2013 10:21:

There are bots that go and look whether a newly inserted block of
text is
already present somewhere else, [...]


Rectius: there *used* to be a bot (RevertBot, Lusumbot). The program
https://www.mediawiki.org/wiki/Manual:Pywikibot/copyright.py was
stopped when search engines changed their limits and Lusum has been
waiting for the WMF's Yahoo! BOSS key, needed to run the bot, for a
while.


MadmanBot (https://en.wikipedia.org/wiki/User:MadmanBot) is still running
on English Wikipedia, and it uses the same Yahoo APIs
(http://www.uberbox.org/~marc/csb.pl).

It might be possible to run it on Italian Wikipedia as well, even
without generating a new key.  The operator seems to be
https://en.wikipedia.org/wiki/User:Madman


That bot links to code (Coren's) that requires a key. So either the
operator is paying for one himself, or he got the WMF's some time ago; in
both cases, he can't share it with other people.


Nemo


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-18 Thread Matthew Flaschen

On 11/16/2013 09:04 AM, Anthony Cole wrote:

The problem of false positives from mirrors doesn't exist if we scan edits
as they are made.


Agreed.  However, that example is a legal, attributed (at least on the 
talk page) copy from a third-party freely licensed text, not a false 
positive copy from a Wikipedia mirror.



Maggie says here
https://en.wikipedia.org/wiki/Wikipedia:Administrators%27_noticeboard#Emergency_block_of_an_editor_with_which_I_have_been_previously_involved
that copyright bots populate
WP:SCV (https://en.wikipedia.org/wiki/Wikipedia:SCV). So a
similarly-configured bot could scan recent changes and tag suspected
copyvios in watchlists and page histories like suspected vandalism is
currently tagged.


The suspected vandalism checks that actually tag the edit (e.g. Tag: 
possible vandalism)  are based on AbuseFilter checks.  These are 
relatively fast determinations that consider the text of the edit (e.g. 
regexes for strings of curse words, or meaningless repeating 
characters), and comparisons to the previous version (blanked the 
section, blanked the page).


As far as I know, regular AbuseFilter rules cannot hit a database or
web search to check for copyright violations.  An extension could in
theory do this, but there could be performance problems, since
AbuseFilter runs on the actual server (not just some bot's computer) on
every edit.


It is possible for a bot to scan every edit; it just can't use 
AbuseFilter tags.
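
To make the distinction concrete, here is a minimal Python sketch of the
kind of fast, text-only checks described above (a regex for curse words, a
regex for meaningless repeating characters, and a comparison with the
previous version to spot blanking). It is an illustration only, not
AbuseFilter's rule syntax or any existing filter: the word list, the
thresholds, the tag names and the function name quick_flags are all
assumptions made up for this example.

```python
import re

# Hypothetical word list; a real filter would use a maintained one.
BAD_WORDS = re.compile(r"\b(?:damn|crap)\b", re.IGNORECASE)
# The same character repeated 10 or more times in a row.
REPEATED_CHARS = re.compile(r"(.)\1{9,}")


def quick_flags(old_text, new_text):
    """Return a list of tags using only the old and new text of an edit."""
    flags = []
    # Comparison with the previous version: page blanked or mostly removed.
    if old_text.strip() and not new_text.strip():
        flags.append("blanking")
    elif len(old_text) > 500 and len(new_text) < 0.1 * len(old_text):
        flags.append("large removal")
    # Text of the edit: only flag patterns that the edit introduced.
    if BAD_WORDS.search(new_text) and not BAD_WORDS.search(old_text):
        flags.append("possible vandalism")
    if REPEATED_CHARS.search(new_text) and not REPEATED_CHARS.search(old_text):
        flags.append("repeating characters")
    return flags
```

Nothing here needs a network call or a database lookup, which is why checks
of this shape can run on every edit, while a copyright check that has to
search the web cannot be written this way.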


Matt Flaschen


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-16 Thread Anthony Cole
The problem of false positives from mirrors doesn't exist if we scan edits
as they are made.

Maggie says here
https://en.wikipedia.org/wiki/Wikipedia:Administrators%27_noticeboard#Emergency_block_of_an_editor_with_which_I_have_been_previously_involved
that copyright bots populate
WP:SCV (https://en.wikipedia.org/wiki/Wikipedia:SCV). So a
similarly-configured bot could scan recent changes and tag suspected
copyvios in watchlists and page histories like suspected vandalism is
currently tagged. Ideally the edit summary would contain a url to the
suspected source. Maggie points out that those copyright bots were blocked
for a while (from scanning Google I presume) due to a negative impact on
Google, and this problem was solved by someone writing a cheque.

False positives won't be a problem, unless they're more than, say, 50%. If
a recent changes patroller can't confirm the copyvio, they can let it go.
But an editor whose contributions list is peppered with such warnings
would stand out like a sore thumb.
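
A minimal Python sketch of the scan-at-save-time idea, assuming a separate
search step (not shown) has already fetched candidate pages for the newly
added text. It compares overlapping five-word sequences, skips known
mirrors and rollbacks, and returns the suspected source URLs so a human can
confirm or dismiss them. This is not how CorenSearchBot or MadmanBot work
internally; the names MIRROR_HOSTS, shingles and suspected_sources, the
example hosts and the 0.5 threshold are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical list of known Wikipedia mirrors to leave out of the comparison.
MIRROR_HOSTS = {"mirror.example.org"}


def shingles(text, n=5):
    """Return the set of overlapping n-word sequences, lowercased."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def suspected_sources(added_text, candidates, is_rollback=False, threshold=0.5):
    """Return (url, overlap) pairs that deserve a human look.

    candidates maps a URL to page text that some search step already fetched.
    overlap is the share of the added text's shingles found on that page.
    """
    if is_rollback:
        return []  # restored text was already on the wiki
    new = shingles(added_text)
    if not new:
        return []  # too short to compare
    hits = []
    for url, page_text in candidates.items():
        if urlparse(url).hostname in MIRROR_HOSTS:
            continue  # a mirror copied us, not the other way round
        overlap = len(new & shingles(page_text)) / len(new)
        if overlap >= threshold:
            hits.append((url, overlap))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

The returned URL is what could go into the edit summary or tag, and the
threshold is the knob that trades false positives against missed copies.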

https://en.wikipedia.org/wiki/Wikipedia:SCV

Anthony Cole http://en.wikipedia.org/wiki/User_talk:Anthonyhcole
Memberships secretary
Wiki Project Med Foundation http://meta.wikimedia.org/wiki/Wiki_Project_Med



Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-16 Thread Marc A. Pelletier
On 11/13/2013 04:57 AM, Federico Leva (Nemo) wrote:
 Rectius: there *used* to be a bot (RevertBot, Lusumbot). The program
 https://www.mediawiki.org/wiki/Manual:Pywikibot/copyright.py was
 stopped when search engines changed their limits and Lusum has been
 waiting for the WMF's Yahoo! BOSS key, needed to run the bot, for a while.

I haven't been in charge of that key in quite some time, but I think I
still have the appropriate credentials to generate one for a copyright
violation bot.

I can look into it if you want.

-- Marc



Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-16 Thread Marc A. Pelletier
On 11/13/2013 04:41 PM, Tobias wrote:
 I think the community has done a very good job in the past 12 years when
 it comes to copyright. It is important to see that we are a community
 site – nothing is ever going to be perfect, and certainly we are not
 free of any copyright violations. But we are dealing with them in a very
 responsible way and I would say that our current efforts are sufficient.

I think that's the best way of summing it up.  Sufficient is a vague
metric, and leaves room for improvement, but the nutshell is that the
community /does/ take copyright violations seriously and makes a very
good effort to curtail them.

Do some slip through?  Yes, without doubt.  Are they eliminated with
prejudice the second they are noticed?  Yes.

The Wikimedia projects are no worse than any other collected works when
it comes to copyright infringement, and indeed tend to handle it with
more vigilance than the other sites in the top 10 (proactively, rather
than reactively).

Could we do better?  No doubt.  Is improvement so desperately critical
that we should drop everything else to concentrate on that?  Not a
chance.  And I speak as the author and (for a long time) maintainer of
one of the most visible and widely used copyright violation detection
tools on our projects (CorenSearchBot, now handled by MadmanBot and - last
I heard - used on around a dozen projects).

-- Marc



Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-16 Thread Federico Leva (Nemo)

Marc A. Pelletier, 16/11/2013 16:34:

On 11/13/2013 04:57 AM, Federico Leva (Nemo) wrote:

Rectius: there *used* to be a bot (RevertBot, Lusumbot). The program
https://www.mediawiki.org/wiki/Manual:Pywikibot/copyright.py was
stopped when search engines changed their limits and Lusum has been
waiting for the WMF's Yahoo! BOSS key, needed to run the bot, for a while.


I haven't been in charge of that key in quite some time, but I think I
still have the appropriate credentials to generate one for a copyright
violation bot.

I can look into it if you want.


It would be awesome! Thank you.

Nemo


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-15 Thread Florence Devouard

H

Rupert,

The case you mention is unrelated to any copyright infringement (the
book is explicitly published under CC BY-SA, so there is no copyvio).
Its mention here is like hair falling in soup.


Now, I think there is a developing personal feud between you and
Iolenda. It sincerely saddens me to see two people I appreciate come to
such a situation. Would you both consider talking to each other on Skype
or something like that? Alternatively, could you find someone neutral and
nice to help fix things so that you can come to a mutual understanding?


I understand that you both see things differently, but ultimately, you 
both are here to make things move on.


Flo


On 11/14/13 2:36 PM, rupert THURNER wrote:

There is such a case in http://en.m.wikipedia.org/wiki/Education_in_Cameroon,
and the reference is on the talk page. Would you be so kind as to mark or
refer to it correctly?

rupert
On 13.11.2013 12:46, Marco Chiesa chiesa.ma...@gmail.com wrote:


On Wed, Nov 13, 2013 at 12:39 PM, Chris McKenna cmcke...@sucs.org wrote:



The problem isn't that we're waiting for perfection. We're waiting for the
proportion of false positives and false negatives to fall to a level where
they don't overwhelm the true positives.



To avoid false positives from mirrors, the best option is to compare a text
as soon as it is saved. Also, you exclude certain websites from the
comparison because you know they're the mirrors, you exclude rollbacks, ...
Then, it is better to have a human checking that it is really a copyvio (it
could well be a public domain text, or another Wikipedia article).

Marco

Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-15 Thread rupert THURNER
Salut Florence, I obviously need to improve my English :) Marco suggested
human checking to avoid false positives, and some annotation that it
happened. In my eyes the cited case is a verbatim copy of some compatibly
licensed text, which could be used as an example to demonstrate what he
meant. I have not seen such a thing up to now and would not be 100% sure
how to do it correctly. So I asked.

Rupert

Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-14 Thread rupert THURNER
There is such a case in http://en.m.wikipedia.org/wiki/Education_in_Cameroon,
and the reference is on the talk page. Would you be so kind as to mark or
refer to it correctly?

rupert
On 13.11.2013 12:46, Marco Chiesa chiesa.ma...@gmail.com wrote:

 On Wed, Nov 13, 2013 at 12:39 PM, Chris McKenna cmcke...@sucs.org wrote:

 
  The problem isn't that we're waiting for perfection. We're waiting for
  the proportion of false positives and false negatives to fall to a level
  where they don't overwhelm the true positives.
 
 
 To avoid false positives from mirrors, the best option is to compare a text
 as soon as it is saved. Also, you exclude certain websites from the
 comparison because you know they're the mirrors, you exclude rollbacks, ...
 Then, it is better to have a human checking that it is really a copyvio (it
 could well be a public domain text, or another Wikipedia article).

 Marco

Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-14 Thread Andrew Lih
FYI, on the last Wikipedia Weekly podcast, we talked with Sage Ross about
the plagiarism issue, and he walked through the study with some very
interesting insights. Video here, and the discussion started at 11 minutes,
30 seconds into the podcast.

https://www.youtube.com/watch?v=IOgYytn2JRk

-Andrew




Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-14 Thread Laura Hale
On Thu, Nov 14, 2013 at 4:47 PM, Andrew Lih andrew@gmail.com wrote:

 FYI, on the last Wikipedia Weekly podcast, we talked with Sage Ross about
 the plagiarism issue, and he walked through the study with some very
 interesting insights. Video here, and the discussion started at 11 minutes,
 30 seconds into the podcast.


I've done a study of student contributions on English Wikinews.  We've
found that only about 15% of student submissions have a copyright issue.
 The level of plagiarism and copyright problems is about the same for
regular contributors, new contributors and student contributors on English
Wikinews with that range of 10 to 15%.  This is an issue we have to be on
top of because nothing gets published on the project without being reviewed
for this issue.

Sincerely,
Laura Hale

-- 
twitter: purplepopple
blog: ozziesport.com

Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Gerard Meijssen
Hoi,
Seriously, we should never ever be ruled by panic. What you see is bad, no
doubt, but the notion that we should dump everything because of the latest
issue to come along is way overboard.

   - by stopping the flow on projects like Visual Editor you break
   dependencies for the work of many developers
   - what you have noticed is for only one Wikipedia, not all of them
   - we do need more mature discussion software; what we have is horrible
   - such dramatics only have you go away and upset others; they do not
   solve things
   - the dramatics distract me from your message
   - my hobby horse needs more attention too and I think my argument is
   better ...

Anyway, it would be nice if someone looked at the tool with an eye to
making it happen and making it scale. If it doesn't scale, it becomes a
less attractive option to pursue.
Thanks,
  GerardM


On 13 November 2013 08:40, James Heilman jmh...@gmail.com wrote:

 The Wikimedia Foundation needs to wake up and deal with the real tech
 elephant in the room. Our primary issue is not a lack of FLOW, a lack of a
 visual editor, or a lack of a rapidly expanding education program.

 Our biggest issue is copyright infringement. We have had the Indian
 program, we have had issues with the Education program, and I have today
 come across a user who has made nearly 20,000 edits to 1,742 articles since
 2006 which appear to be nearly all copied and pasted from the sources he
 has used.
 https://en.wikipedia.org/wiki/User_talk:DrMicro#Copyright_infringement
 This has seriously shaken my faith in Wikipedia.

 This is especially devastating as there is a tech solution that would have
 prevented it. The efforts are being worked on by volunteers here
 https://en.wikipedia.org/wiki/Wikipedia:Turnitin and have been since at
 least March of 2012. We NEED all tech resources at the foundation thrown at
 this project. Other less important projects like FLOW and the visual editor
 need to be put on hold to develop this tool.

 --
 James Heilman
 MD, CCFP-EM, Wikipedian

 The Wikipedia Open Textbook of Medicine
 www.opentextbookofmedicine.com

Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Matthew Flaschen

On 11/13/2013 02:40 AM, James Heilman wrote:

The Wikimedia Foundation needs to wake up and deal with the real tech
elephant in the room. Our primary issue is not a lack of FLOW, a lack of a
visual editor, or a lack of a rapidly expanding education program.

Our biggest issue is copyright infringement.


I don't really agree with that.  It is a serious issue, but I would put 
NPOV (in the face of active threats such as companies paying for 
publicity on Wikipedia) and growing the editor community higher.


We also have solutions to address it (not perfectly, true), both
preventing the problem and dealing with it after the fact:


* MadmanBot (https://en.wikipedia.org/wiki/User:MadmanBot) (mentioned at 
Wikipedia:TurnItIn, and a major technical tool against copyright 
infringement).

* Clear policies against copyright infringement
* Dealing with copyright violations 
(https://en.wikipedia.org/wiki/Wikipedia:Text_Copyright_Violations_101)
* Finally, the DMCA ensures the foundation is not liable as long as they 
promptly respond to notifications (which of course we want them to anyway).



We have had the Indian program, we have had issues with the Education
program, and I have today come across a user who has made nearly 20,000
edits to 1,742 articles since 2006 which appear to be nearly all copied and
pasted from the sources he has used.
https://en.wikipedia.org/wiki/User_talk:DrMicro#Copyright_infringement
This has seriously shaken my faith in Wikipedia.


That is indeed disturbing, and I'm glad you found it.


This is especially devastating as there is a tech solution that would have
prevented it. The efforts are being worked on by volunteers here
https://en.wikipedia.org/wiki/Wikipedia:Turnitin and have been since at
least March of 2012. We NEED all tech resources at the foundation thrown at
this project. Other less important projects like FLOW and the visual editor
need to be put on hold to develop this tool.


I don't agree that all tech resources should be used for this.  However, 
there may be room for enhancing MadmanBot (e.g. as a GSOC or OPW project).


A significant problem with TurnItIn is that it is proprietary, and cannot
be customized by anyone in the movement.  The fact that it is
proprietary also means it can never be part of the main infrastructure,
nor run on Wikimedia Labs.


Matt Flaschen


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Steven Walling
On Tue, Nov 12, 2013 at 11:40 PM, James Heilman jmh...@gmail.com wrote:

 The Wikimedia Foundation needs to wake up and deal with the real tech
 elephant in the room. Our primary issue is not a lack of FLOW, a lack of a
 visual editor, or a lack of a rapidly expanding education program.

 Our biggest issue is copyright infringement. We have had the Indian
 program, we have had issues with the Education program, and I have today
 come across a user who has made nearly 20,000 edits to 1,742 articles since
 2006 which appear to be nearly all copied and pasted from the sources he
 has used.
 https://en.wikipedia.org/wiki/User_talk:DrMicro#Copyright_infringement
 This has seriously shaken my faith in Wikipedia.

 This is especially devastating as there is a tech solution that would have
 prevented it. The efforts are being worked on by volunteers here
 https://en.wikipedia.org/wiki/Wikipedia:Turnitin and have been since at
 least March of 2012. We NEED all tech resources at the foundation thrown at
 this project. Other less important projects like FLOW and the visual editor
 need to be put on hold to develop this tool.


Relevant info on the subject of copyvio is the recent plagiarism study by
the Education Program team. They looked at different types of users
(students, newbies, experienced editors, admins) and compared them. Results
were published on Meta at
https://meta.wikimedia.org/wiki/Research:Plagiarism_on_the_English_Wikipedia
and also discussed in the last WMF Metrics and Activities meeting:
https://meta.wikimedia.org/wiki/Metrics_and_activities_meetings/2013-11-07

AFAIK this is the best data we have about how often different kinds of
editors close paraphrase or outright copy/paste.

Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Marco Chiesa
On Wed, Nov 13, 2013 at 8:40 AM, James Heilman jmh...@gmail.com wrote:


 Our biggest issue is copyright infringement. We have had the Indian
 program, we have had issues with the Education program, and I have today
 come across a user who has made nearly 20,000 edits to 1,742 articles since
 2006 which appear to be nearly all copied and pasted from the sources he
 has used.
 https://en.wikipedia.org/wiki/User_talk:DrMicro#Copyright_infringement
 This has seriously shaken my faith in Wikipedia.


Back in 2007 we discovered a user on it.wp, a former sysop with more than
40,000 edits, who used to copy-paste from his sources, often outdated. He
was banned, and the community made a great effort to clean up the articles
he contributed to (and damn it was hard, because those articles had a long
history after his edits). In the following years we had other similar
cases; you can find a selection here:
https://it.wikipedia.org/wiki/Progetto:Cococo/Controlli_conclusi
There are bots that go and look whether a newly inserted block of text is
already present somewhere else. They don't find everything (of course they
won't find things copied from a printed book), but sooner or later serial
copyright violators get caught, and the fall from hero to zero is sooo quick.

At the end of the day, I think copyvios have always been taken seriously,
so that I don't remember big problems with that, while there have always
been more problems with libel, privacy, and editor retention.


Marco (Cruccone)

Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Lodewijk
Marco: I agree, we also had issues on the Dutch Wikipedia - these have been
around for ages, the English Wikipedia is just less aware of them. Often,
copy-pasting within the same language is caught easily - copying between
different languages is much harder to catch and more persistent. There are
many people, including experienced editors, who think translating from
random sources is OK. It is not a new problem, and chapters have indeed been
working on getting this understanding of what free licenses really mean
more widely accepted among the general audience. Not something that is
easily measured, of course. Technical solutions sound great, but only catch
a small amount, within the same language.

Steven: I understand this research was limited to the English Wikipedia
(where most of the plagiarism will be in the same language). It would not
strike me as unrealistic to assume the picture is very different for
languages other than English. It also says little about the problem in
general, of course.

For those who don't want to click on links to get information, it basically
says (simplification alert) that they don't have any indication that the US
& Canada education program makes the plagiarism problem on the English
Wikipedia any worse than it already is.

Anyway: I think this problem is more prominent in non-English
communities, and that technical solutions are not going to be the answer
there. An educational answer is more likely to be successful, focusing on
explaining to people how Wikipedia works and doesn't work, and what the dos
and don'ts are. This doesn't have to be an education program like the one
run in the US; basically all outreach programs run by chapters, user
groups, thematic organizations or groups of volunteers can contribute to
this. This is already happening in most countries.

In some countries (like Germany ;-) ) politicians are doing the work for
us, explaining how evil plagiarism is and how it works by firing government
ministers over it :)

Best,
Lodewijk





Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Federico Leva (Nemo)

Marco Chiesa, 13/11/2013 10:21:

There are bots that go and look whether a newly inserted block of text is
already present somewhere else, [...]


Rectius: there *used* to be a bot (RevertBot, Lusumbot). The program
https://www.mediawiki.org/wiki/Manual:Pywikibot/copyright.py was
stopped when search engines changed their limits, and Lusum has been
waiting for a while for the WMF's Yahoo! BOSS key, which is needed to run
the bot.


Nemo


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Philippe Beaudette
On Wed, Nov 13, 2013 at 2:37 AM, Matthew Flaschen 
matthew.flasc...@gatech.edu wrote:

 A significant problem with TurnItIn is that it is proprietary, and cannot be
 customized by anyone in the movement.  The fact that it is proprietary also
 means it can never be part of the main infrastructure, nor run on Wikimedia
 Labs.


Another significant issue is the False Positive factor that is created by
our overwhelming popularity.  Frankly, we're mirrored all over the place.
And tools like Turnitin find the mirrors too.  It's not an easy problem to
solve.  I was on the team that looked at this a couple of years back - it's
just not simple, and there are complex challenges.


*Philippe Beaudette * \\  Director, Community Advocacy \\ Wikimedia
Foundation, Inc.
 T: 1-415-839-6885 x6643 |  phili...@wikimedia.org  |
 @Philippewiki https://twitter.com/Philippewiki

Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Matthew Flaschen

On 11/13/2013 05:16 AM, Philippe Beaudette wrote:




Another significant issue is the False Positive factor that is created by
our overwhelming popularity.  Frankly, we're mirrored all over the place.
And tools like Turnitin find the mirrors too.  It's not an easy problem to
solve.  I was on the team that looked at this a couple of years back - it's
just not simple, and there are complex challenges.


Yes, an intelligent solution would take into account when the mirror was 
first indexed (or ideally first published), and when the Wikipedia 
article was edited, to reduce false positives requiring manual intervention.
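
As a rough sketch of that date comparison (the `Match` record and its `first_seen` field are hypothetical; in practice a reliable first-indexed or first-published date may be hard to obtain):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Match:
    url: str
    # Hypothetical field: earliest known index or publication date
    # of the external page that matched the article text.
    first_seen: datetime

def likely_copied_from(matches, edit_time):
    """Keep only matches whose external page predates the Wikipedia edit.

    Pages first seen after the edit are more likely mirrors of Wikipedia
    than sources for it, so they are dropped before manual review.
    """
    return [m for m in matches if m.first_seen < edit_time]

edit_time = datetime(2013, 11, 1)
matches = [
    Match("https://example.org/original", datetime(2010, 5, 1)),
    Match("https://mirror.example.com/copy", datetime(2013, 12, 1)),
]
suspects = likely_copied_from(matches, edit_time)
print([m.url for m in suspects])
```

This only reduces false positives: a page can be indexed late even though it was published before the edit, which is why the publication date would be the better signal when it is available.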


Matt Flaschen



Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Gerard Meijssen
Hoi,
I know several authors who publish their original text on Wikipedia as well
as elsewhere. This is another source of false positives, because they hold
the copyright to the original source. To recognise this you have to be even
more sophisticated.

The point I want to make is that having a tool that is KNOWN to be
deficient in specific ways can still be a huge advantage over not having a
tool at all. So PLEASE let's not make perfection the enemy of the good.
Thanks,
   GerardM



Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Marco Chiesa
On Wed, Nov 13, 2013 at 11:44 AM, Gerard Meijssen gerard.meijs...@gmail.com
 wrote:

 Hoi
 I know several authors who publish and use their original text to publish
 on Wikipedia as well.. This is another source of false positives because
 they have the copyright to the original source... To recognise this you
 have to be even more sophisticated.


Actually, we treat these as copyvios: we delete the text straight away
and tell the editor, "if you're the author, write to OTRS". Of course, if
the text is already available elsewhere under a compatible free license, we
don't need this. As long as we can't be sure that User:MrX is actually the
physical person MrX, we need to protect the author's rights.

Marco

Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Chris McKenna

On Wed, 13 Nov 2013, Marco Chiesa wrote:





Actually, we consider these as copyvios, we delete the text straight away,
and we tell the editor if you're the author write to OTRS. Of course, if
the text is already somewhere else under a compatible free-license, we
don't need this. Until you can't be sure that User:MrX is actually the
physical person MrX, we need to protect the author's right.



But an automated tool can not know whether OTRS verification has happened 
or not.



Chris McKenna

cmcke...@sucs.org
www.sucs.org/~cmckenna


The essential things in life are seen not with the eyes,
but with the heart

Antoine de Saint Exupery



Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Chris McKenna

On Wed, 13 Nov 2013, Gerard Meijssen wrote:


The point I want to make is that having a tool that is KNOWN to be
deficient in specific ways can still be a huge advantage over not having a
tool at all. So PLEASE lets not make perfection the enemy of the good.


The problem isn't that we're waiting for perfection. We're waiting for the
proportion of false positives and false negatives to fall to a level where
they don't overwhelm the true positives.
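
To make the "overwhelm" point concrete, here is a small worked calculation. All rates are made-up illustrative numbers, not measurements of any real tool:

```python
def precision(prevalence, true_positive_rate, false_positive_rate):
    """Share of flagged edits that are real copyvios.

    prevalence: fraction of all edits that are copyvios
    true_positive_rate: fraction of copyvios the tool flags
    false_positive_rate: fraction of clean edits the tool flags
    """
    flagged_true = prevalence * true_positive_rate
    flagged_false = (1 - prevalence) * false_positive_rate
    return flagged_true / (flagged_true + flagged_false)

# Assumed: 1% of edits are copyvios and the tool catches 90% of them.
noisy = precision(0.01, 0.90, 0.05)     # tool wrongly flags 5% of clean edits
cleaner = precision(0.01, 0.90, 0.005)  # tool wrongly flags 0.5% of clean edits
print(round(noisy, 3), round(cleaner, 3))
```

Under these assumed rates, roughly 15% of flags are real in the first case and roughly 65% in the second. Because copyvios are rare among all edits, even a small false positive rate can leave reviewers wading through mostly false alarms.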



Chris McKenna

cmcke...@sucs.org
www.sucs.org/~cmckenna


The essential things in life are seen not with the eyes,
but with the heart

Antoine de Saint Exupery



Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Marco Chiesa
On Wed, Nov 13, 2013 at 12:36 PM, Chris McKenna cmcke...@sucs.org wrote:


 But an automated tool can not know whether OTRS verification has happened
 or not.

We put something like {{OTRS verified}} on the article's talk page,
with a note saying: "Part of the text comes from website X, ticket 1234567890."
And if the author wants to use his/her work in many articles, we tell him/her
to put the template on the talk pages of all those articles.
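
A tool could, in principle, look for that marker before flagging a page. The sketch below assumes the talk page wikitext has already been fetched; the template name is taken from the example above, and the `ticket=` parameter in the usage lines is hypothetical:

```python
import re

def has_otrs_template(talk_wikitext, template_name="OTRS verified"):
    """Return True if the talk page wikitext transcludes the given template.

    Matches '{{OTRS verified}}' and '{{OTRS verified|...}}', ignoring case
    and surrounding whitespace. It does not resolve redirects or aliases.
    """
    pattern = r"\{\{\s*" + re.escape(template_name) + r"\s*(\||\}\})"
    return re.search(pattern, talk_wikitext, flags=re.IGNORECASE) is not None

print(has_otrs_template("{{OTRS verified|ticket=1234567890}}"))
print(has_otrs_template("Just a normal discussion page."))
```

Finding the template only shows that someone placed it; whether the ticket really covers the text in question still needs an OTRS volunteer.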
Marco

Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Marco Chiesa
On Wed, Nov 13, 2013 at 12:39 PM, Chris McKenna cmcke...@sucs.org wrote:


 The problem isn't that we're waiting for perfection. We're waiting for the
 proportion of false positives and false negatives to fall to a level where
 they don't overwhelm the true positives.


To avoid false positives from mirrors, the best option is to compare a text
as soon as it is saved. You also exclude certain websites from the
comparison because you know they are mirrors, you exclude rollbacks, and so
on. Then it is better to have a human check that it really is a copyvio (it
could well be a public domain text, or text from another Wikipedia article).
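
Those filtering steps could be sketched roughly as follows. The mirror list, the `is_rollback` flag and the shape of the `edit` record are all assumptions made for illustration:

```python
from urllib.parse import urlparse

# Hypothetical list of hosts known to republish Wikipedia content.
KNOWN_MIRRORS = {"mirror.example.com"}

def needs_human_review(edit, match_urls, mirrors=KNOWN_MIRRORS):
    """Return the matching URLs a human should check, or [] to skip the edit.

    edit: dict with an 'is_rollback' flag (assumed to be set by the caller)
    match_urls: pages where the newly saved text was also found
    """
    if edit.get("is_rollback"):
        # A rollback restores older text, so matches are not new copying.
        return []
    # Drop matches hosted on known mirrors; everything else goes to a human.
    return [u for u in match_urls if urlparse(u).hostname not in mirrors]

urls = ["https://mirror.example.com/a", "https://example.org/b"]
print(needs_human_review({"is_rollback": False}, urls))
print(needs_human_review({"is_rollback": True}, urls))
```

The function deliberately stops at producing a review list: deciding whether a remaining match is public domain, freely licensed or a real copyvio is left to the human check.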

Marco

Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Fæ
On 13 November 2013 07:40, James Heilman jmh...@gmail.com wrote:
...
 Our biggest issue is copyright infringement.
...

Thanks for raising this James.

Yes, this is an issue but if you are gunning for elephants this month,
I really don't think the copyright elephant is the biggest one in the
herd.

As a practical example of the tools we already have in place,
yesterday I was facilitating an edit-a-thon for women in science with
King's College London, and we had one of the example stubs we had
created on the English Wikipedia up on a projector. Within literally
*minutes* of creation it had been (correctly) flagged by a bot as a
possible copyright violation, as some of the text had been cut & pasted
from King's own website; one of the participants quickly rewrote it
in their own words. As the communications manager was sitting next
to me at the time, no doubt she found this rather reassuring, even
though in parallel she was asking about how best to officially
release text. :-)

We have a more complex problem with how images uploaded to Wikimedia
Commons can be flagged when they match images found elsewhere on the
internet. This is something that may be done by a future bot, but we
might need to partner with someone like Google Images or Tineye to
make this truly effective. Having run my own experimental bots in this
area, I would love to see this become a funded project.

PS with regard to OTRS verification, we could do with better standards
for verification; at the moment volunteers like myself are left to use
our own judgement about what checks to make. I tend to double check
text or images being released with Google, just in case, as well as
doing whois checks on email domains. These sorts of checks could
become part of OTRS guidelines and would make the reliability of OTRS
tickets a notch higher.

Cheers,
Fae
-- 
fae...@gmail.com http://j.mp/faewm


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Quim Gil
On 11/13/2013 12:37 AM, Matthew Flaschen wrote:
 However,
 there may be room for enhancing MadmanBot (e.g. as a GSOC or OPW project).

Any technical project able to identify small tasks and available mentors
is welcome to join Wikimedia's Google Code-in team at

https://www.mediawiki.org/wiki/Google_Code-In

GCI will start next week and will last until the beginning of January.
Hundreds of young students will scan our tasks and will eventually
complete some of them.

It is an ideal program for small projects, like the bots or gadgets used
by editors.

-- 
Quim Gil
Technical Contributor Coordinator @ Wikimedia Foundation
http://www.mediawiki.org/wiki/User:Qgil


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Nathan
On Wed, Nov 13, 2013 at 4:53 AM, Lodewijk lodew...@effeietsanders.org wrote:

 Marco: I agree, we had also issues on the Dutch Wikipedia - these have been
 around for ages, the English Wikipedia is just less aware of them.



Not sure if you meant this how it sounds, but the English Wikipedia
community is acutely aware of copyright problems and has undertaken many,
many large and complicated cleanup tasks of the sort Marco described.

Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread George Herbert
On Wed, Nov 13, 2013 at 3:48 AM, Fæ fae...@gmail.com wrote:

 ...
 PS with regard to OTRS verification, we could do with better standards
 for verification,


We are not attempting to perform a complete and unassailable verification;
imagining that we can is folly.

The point is, we need someone who credibly is the author or rightsholder,
and with whom we have an audit trail of their claims and identity (email
address we corresponded with, etc).

When it comes down to it, we have no idea if an email is associated with
the given person, that the alleged sender of a certified letter really is
that person, or that the John Doe that came in to the office and showed
valid government issued ID with a claim of copyright violation is the same
John Doe who wrote the original material.  There's no way for us to confirm
in any reasonable manner.

If there is an attempt at identity theft that is discovered, that audit
trail is available to investigators with proper legal authorization etc.


-- 
-george william herbert
george.herb...@gmail.com

Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Michael Snow

On 11/13/2013 10:39 AM, Nathan wrote:

On Wed, Nov 13, 2013 at 4:53 AM, Lodewijk lodew...@effeietsanders.org wrote:

Marco: I agree, we had also issues on the Dutch Wikipedia - these have been
around for ages, the English Wikipedia is just less aware of them.

Not sure if you meant this how it sounds, but the English Wikipedia
community is acutely aware of copyright problems and have undertaken many,
many large and complicated cleanup tasks of the sort Marco described.

I think he meant that the English Wikipedia community is less aware of
the fact that we face these sorts of large-scale challenges in many
other languages as well. In other words, the antecedent to "them" is
"issues on the Dutch/Italian/etc. Wikipedia", rather than "copyright
issues" generally. Most people participating in other languages are
reasonably aware when major concerns surface from the English Wikipedia;
people participating only in English often haven't a clue about the
concerns being dealt with in other languages.


--Michael Snow


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Nathan


That makes sense, thanks for clearing that up for me.

Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Tobias

On 11/13/2013 08:40 AM, James Heilman wrote:

Our biggest issue is copyright infringement.


When it comes to copyright infringement, among all community sites on
the Internet, Wikipedia is one of the best at handling it. Many websites
don't even bother with copyright unless they get a DMCA takedown notice.
We, on the other hand, have volunteer contributors checking pages and
raising flags whenever there is even a suspicion of a copyright violation.


This seems to be highly effective in many cases. A few days ago, I wrote 
an email to a photographer, whose photos had been uploaded to Commons. 
He said I was the third to ask him whether he really had uploaded those 
images (which he had).


Unquestionably, there are also many instances where the system fails
and lots of copyrighted material gets uploaded. Back in 2005, we had a
case in German Wikipedia similar to the one you described, where
various IPs copied content from old books. It is a big mess to clean up,
but it can be done. And luckily the cases of massive copyvios are quite
rare.


I think the community has done a very good job in the past 12 years when 
it comes to copyright. It is important to see that we are a community 
site – nothing is ever going to be perfect, and certainly we are not 
free of any copyright violations. But we are dealing with them in a very 
responsible way and I would say that our current efforts are sufficient.



Tobias



Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Martin Rulsch
 Unquestionably, there are also many instances where the systems fails and
 where lots of copyrighted material gets uploaded. Back in 2005, we had a
 case similar to the one you described in German Wikipedia, where various
 IPs copied content from old books. It is a big mess to clean up, but it can
 be done. And luckily the cases of massive copyvios are quite rare.


For further information see
https://de.wikipedia.org/wiki/Wikipedia:Archiv/DDR-URV/Presseinfo (German).

Cheers
Martin

[Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-12 Thread James Heilman
The Wikimedia Foundation needs to wake up and deal with the real tech
elephant in the room. Our primary issue is not a lack of FLOW, a lack of a
visual editor, or a lack of a rapidly expanding education program.

Our biggest issue is copyright infringement. We have had the Indian
program, we have had issues with the Education program, and I have today
come across a user who has made nearly 20,000 edits to 1,742 articles since
2006, which appear to be nearly all copied and pasted from the sources he
has used:
https://en.wikipedia.org/wiki/User_talk:DrMicro#Copyright_infringement
This has seriously shaken my faith in Wikipedia.

This is especially devastating as there is a tech solution that would have
prevented it. Volunteers have been working on it here,
https://en.wikipedia.org/wiki/Wikipedia:Turnitin, since at
least March of 2012. We NEED all tech resources at the foundation thrown at
this project. Other less important projects like FLOW and the visual editor
need to be put on hold to develop this tool.

-- 
James Heilman
MD, CCFP-EM, Wikipedian

The Wikipedia Open Textbook of Medicine
www.opentextbookofmedicine.com