Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-22 Thread The Cunctator
Also, vandalism has always been a red herring, kind of like the terrorism
that justifies the TSA security theater and NSA surveillance, or the Red
Scare. It's a wrong-headed obsession that weakens community.
On Nov 22, 2013 2:06 PM, "Steven Walling"  wrote:

> On Thu, Nov 21, 2013 at 12:37 AM, WereSpielChequers <
> werespielchequ...@gmail.com> wrote:
>
> > Typo correction and vandalism reversion are certainly both entries to
> > editing, and it isn't just anti-vandalism where the opportunities have
> > declined in recent years. Typos are getting harder to find, especially in
> > stable, widely read articles. Yes, you can find plenty of typos by checking
> > new pages and recent changes, but I doubt our 5-edits-a-month editors are
> > going to internal maintenance pages like that. I suspect they are readers
> > who fix things they come across. It would be interesting to survey a sample
> > of them; I suspect we'd find many who are reading Wikipedia just as much as
> > they used to, but if they only edit when they spot a mistake then of course
> > they will now be editing less frequently. And of course none of that is
> > actually bad, any more than is the loss of large numbers of vandals who
> > used to get into the 5-edits-a-month band for at least the month in which
> > they did their spree and were blocked.
> >
> > The difficulty of getting precise measurements of "community health" makes
> > it a fascinating topic, and with many known factors altering edit levels in
> > sometimes poorly understood ways we need to be wary of oversimplifications.
> > No-one really knows what would have happened if the many edit filters
> > installed in the last four years had instead been coded as anti-vandalism
> > bots. Clearly our edit count would now be much higher, but whether it would
> > currently be higher or lower than in 2009, when the edit filters were
> > introduced, is unknown. Nor should we fret that we shifted so much of our
> > anti-vandalism work from very quick reversion to not accepting edits.
> > However, it isn't sensible to benchmark community health against past edit
> > levels; we should really be comparing community activity against readership
> > levels. If we do that, there is a disconnect between our readership, which
> > for years has grown faster than the internet, and our community, which is
> > broadly stable. To some extent this can be considered a success for Vector
> > and the shift of our default from a skin optimised for editing to one
> > optimised for reading. Of course, if we want to increase editing levels we
> > always have the option of defaulting new accounts to Monobook instead of
> > Vector. My suspicion is also that the rise of the mobile device, especially
> > amongst the young, is turning us from an interactive medium into more of a
> > broadcast one. It is also likely to be contributing to the greying of the
> > pedia.
> >
> > I am trying to list the major known and probable causes of the fall in
> > raw editing levels on a wiki page
> > <https://en.wikipedia.org/wiki/User:WereSpielChequers/Going_off_the_boil%3F>;
> > feedback welcome.
> >
>
> Holy smokes this thread has gotten off topic, but I'll bite. ;)
>
> Making articles that need spelling and grammar fixes easily available to
> new editors is precisely what we're doing with GettingStarted, our software
> system for introducing newly-registered people to editing. (Docs at
> https://en.wikipedia.org/wiki/Wikipedia:GettingStarted and
> https://www.mediawiki.org/wiki/Onboarding_new_Wikipedians). We're currently
> getting thousands of new people a month to make their first typo fix on
> English Wikipedia, and we're moving to other Wikipedias soon.
>
> In English Wikipedia it's quite easy for us to do so, since there's a large
> category of articles needing copyediting. In other Wikipedias, it's not
> easy, because there is no such category. If you want to help us help
> newbies, the best thing you could do is create a copyediting category on
> your Wikipedia and link it to the appropriate Wikidata item
> (either Q8235695 or Q9137504).
>
> As a side point: when we examine first-time editors' contributions, these
> days it's rare to find someone starting out by correcting vandalism, probably
> because now bots and users of tools like Huggle or Twinkle catch it all so
> fast. It's so small a number that when we examine samples of new
> contributors in our qualitative research,[1][2] we just put it in the Other
> category of edit types.
>
> Steven
>
> 1. https://meta.wikimedia.org/wiki/Research:Onboarding_new_Wikipedians/Qualitative_analysis
> 2. https://meta.wikimedia.org/wiki/Research:Onboarding_new_Wikipedians/OB6/Contribution_quality_and_type
> ___
> Wikimedia-l mailing list
> Wikimedia-l@lists.wikimedia.org
> Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l,
> 

Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-22 Thread Steven Walling
On Thu, Nov 21, 2013 at 12:37 AM, WereSpielChequers <
werespielchequ...@gmail.com> wrote:

> Typo correction and vandalism reversion are certainly both entries to
> editing, and it isn't just anti-vandalism where the opportunities have
> declined in recent years. Typos are getting harder to find, especially in
> stable, widely read articles. Yes, you can find plenty of typos by checking
> new pages and recent changes, but I doubt our 5-edits-a-month editors are
> going to internal maintenance pages like that. I suspect they are readers
> who fix things they come across. It would be interesting to survey a sample
> of them; I suspect we'd find many who are reading Wikipedia just as much as
> they used to, but if they only edit when they spot a mistake then of course
> they will now be editing less frequently. And of course none of that is
> actually bad, any more than is the loss of large numbers of vandals who
> used to get into the 5-edits-a-month band for at least the month in which
> they did their spree and were blocked.
>
> The difficulty of getting precise measurements of "community health" makes
> it a fascinating topic, and with many known factors altering edit levels in
> sometimes poorly understood ways we need to be wary of oversimplifications.
> No-one really knows what would have happened if the many edit filters
> installed in the last four years had instead been coded as anti-vandalism
> bots. Clearly our edit count would now be much higher, but whether it would
> currently be higher or lower than in 2009, when the edit filters were
> introduced, is unknown. Nor should we fret that we shifted so much of our
> anti-vandalism work from very quick reversion to not accepting edits.
> However, it isn't sensible to benchmark community health against past edit
> levels; we should really be comparing community activity against readership
> levels. If we do that, there is a disconnect between our readership, which
> for years has grown faster than the internet, and our community, which is
> broadly stable. To some extent this can be considered a success for Vector
> and the shift of our default from a skin optimised for editing to one
> optimised for reading. Of course, if we want to increase editing levels we
> always have the option of defaulting new accounts to Monobook instead of
> Vector. My suspicion is also that the rise of the mobile device, especially
> amongst the young, is turning us from an interactive medium into more of a
> broadcast one. It is also likely to be contributing to the greying of the
> pedia.
>
> I am trying to list the major known and probable causes of the fall in
> raw editing levels on a wiki page
> <https://en.wikipedia.org/wiki/User:WereSpielChequers/Going_off_the_boil%3F>;
> feedback welcome.
>

Holy smokes this thread has gotten off topic, but I'll bite. ;)

Making articles that need spelling and grammar fixes easily available to
new editors is precisely what we're doing with GettingStarted, our software
system for introducing newly-registered people to editing. (Docs at
https://en.wikipedia.org/wiki/Wikipedia:GettingStarted and
https://www.mediawiki.org/wiki/Onboarding_new_Wikipedians). We're currently
getting thousands of new people a month to make their first typo fix on
English Wikipedia, and we're moving to other Wikipedias soon.

In English Wikipedia it's quite easy for us to do so, since there's a large
category of articles needing copyediting. In other Wikipedias, it's not
easy, because there is no such category. If you want to help us help
newbies, the best thing you could do is create a copyediting category on
your Wikipedia and link it to the appropriate Wikidata item
(either Q8235695 or Q9137504).

As a side point: when we examine first-time editors' contributions, these
days it's rare to find someone starting out by correcting vandalism, probably
because now bots and users of tools like Huggle or Twinkle catch it all so
fast. It's so small a number that when we examine samples of new
contributors in our qualitative research,[1][2] we just put it in the Other
category of edit types.

Steven

1. https://meta.wikimedia.org/wiki/Research:Onboarding_new_Wikipedians/Qualitative_analysis
2. https://meta.wikimedia.org/wiki/Research:Onboarding_new_Wikipedians/OB6/Contribution_quality_and_type


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-21 Thread WereSpielChequers
Typo correction and vandalism reversion are certainly both entries to
editing, and it isn't just anti-vandalism where the opportunities have
declined in recent years. Typos are getting harder to find, especially in
stable, widely read articles. Yes, you can find plenty of typos by checking
new pages and recent changes, but I doubt our 5-edits-a-month editors are
going to internal maintenance pages like that. I suspect they are readers
who fix things they come across. It would be interesting to survey a sample
of them; I suspect we'd find many who are reading Wikipedia just as much as
they used to, but if they only edit when they spot a mistake then of course
they will now be editing less frequently. And of course none of that is
actually bad, any more than is the loss of large numbers of vandals who
used to get into the 5-edits-a-month band for at least the month in which
they did their spree and were blocked.

The difficulty of getting precise measurements of "community health" makes
it a fascinating topic, and with many known factors altering edit levels in
sometimes poorly understood ways we need to be wary of oversimplifications.
No-one really knows what would have happened if the many edit filters
installed in the last four years had instead been coded as anti-vandalism
bots. Clearly our edit count would now be much higher, but whether it would
currently be higher or lower than in 2009, when the edit filters were
introduced, is unknown. Nor should we fret that we shifted so much of our
anti-vandalism work from very quick reversion to not accepting edits.
However, it isn't sensible to benchmark community health against past edit
levels; we should really be comparing community activity against readership
levels. If we do that, there is a disconnect between our readership, which
for years has grown faster than the internet, and our community, which is
broadly stable. To some extent this can be considered a success for Vector
and the shift of our default from a skin optimised for editing to one
optimised for reading. Of course, if we want to increase editing levels we
always have the option of defaulting new accounts to Monobook instead of
Vector. My suspicion is also that the rise of the mobile device, especially
amongst the young, is turning us from an interactive medium into more of a
broadcast one. It is also likely to be contributing to the greying of the
pedia.

I am trying to list the major known and probable causes of the fall in
raw editing levels on a wiki page
<https://en.wikipedia.org/wiki/User:WereSpielChequers/Going_off_the_boil%3F>;
feedback welcome.


Jonathan


> --
>
> Message: 6
> Date: Wed, 20 Nov 2013 13:45:17 -0500
> From: "Marc A. Pelletier" 
> To: wikimedia-l@lists.wikimedia.org
> Subject: Re: [Wikimedia-l] Copyright infringement - The real elephant
> in the room
> Message-ID: <528d033d.6060...@uberbox.org>
> Content-Type: text/plain; charset=UTF-8
>
> On 11/20/2013 01:06 PM, Richard Symonds wrote:
> > Not quite: I would argue that anti-vandalism work is a "gateway drug" to
> > the rest of the project. Just a hunch, though.
>
> I'm pretty sure that typo correction fills pretty much the same niche,
> though.
>
> -- Marc
>
>
>
> End of Wikimedia-l Digest, Vol 116, Issue 32
> 
>

Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-20 Thread Michael Snow

On 11/20/2013 10:52 AM, Marc A. Pelletier wrote:

> Perhaps another way of putting it is to ask whether the
> encyclopedia-building community is the means or the ends.  To my eyes,
> having "more contributors" is not valuable unless it has "better
> encyclopedia" as a direct consequence.

I believe the mission is sufficiently large in scope that having more
people involved is fundamentally desirable in general. Although to 
circle back to an earlier point in the discussion, that doesn't require 
that we accept involvement that is counterproductive. Maintaining our 
standards is a way of acknowledging that the number of people involved 
is not itself the end goal.


--Michael Snow



Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-20 Thread David Gerard
On 20 November 2013 18:52, Marc A. Pelletier  wrote:

> Perhaps another way of putting it is to ask whether the
> encyclopedia-building community is the means or the ends.  To my eyes,
> having "more contributors" is not valuable unless it has "better
> encyclopedia" as a direct consequence.


I think it's not a sufficient condition, but that it is a necessary one.

Think LibreOffice and their Easy Hacks list, for example - simple
things a C++ coder could achieve even if unfamiliar with the (huge,
hideous) code base:

https://wiki.documentfoundation.org/Development/Easy_Hacks

I know the standard en:wp {{welcome}} message used to suggest things
that needed attention ... though frankly, many of them don't get
attention because they're stultifyingly boring (most things in
[[Category:Cleanup]] are never leaving it).


- d.



Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-20 Thread Marc A. Pelletier
On 11/20/2013 01:13 PM, Michael Snow wrote:
> My general point is that opportunities for automation are best
> considered with our overall mission in mind, not just the speed or
> efficiency of a particular workflow. In certain situations, automation
> that creates more work rather than removing it (such as by identifying
> potential tasks and feeding them to editors) might be preferable. And
> some of our tools already use such an approach, which is a good thing.

That's an interesting approach, but I'm not sure how constructive it is
in the long run.  I suppose it depends greatly on whether one considers
our mission to be 'building an encyclopedia to share in the sum[...]' or
'having an encyclopedia to share in the sum[...]' (I'm not sure if I
make the subtle distinction here clear).

Perhaps another way of putting it is to ask whether the
encyclopedia-building community is the means or the ends.  To my eyes,
having "more contributors" is not valuable unless it has "better
encyclopedia" as a direct consequence.

-- Marc




Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-20 Thread Marc A. Pelletier
On 11/20/2013 01:06 PM, Richard Symonds wrote:
> Not quite: I would argue that anti-vandalism work is a "gateway drug" to
> the rest of the project. Just a hunch, though.

I'm pretty sure that typo correction fills pretty much the same niche,
though.

-- Marc




Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-20 Thread Michael Snow

On 11/20/2013 9:20 AM, Marc A. Pelletier wrote:

> That's actually an interesting question that has been lurking beneath
> all the "editing is going down" nervousness.
>
> How much of that 'editing' was, in fact, busy work made immaterial by
> technical advantage (bots, extensions, abusefilter)?  The number of
> antivandalism edits a /human/ has to do in a day has most certainly come
> down a *lot* since c. 2006; this no doubt contributed to a large - now
> diminishing - fraction of total edits.
>
> It's not clear to me that the number of *productive* edits has been
> going down all that much (if at all) in the past several years; the
> proportion of edits that were tedious and repetitive clearly has.
>
> Are you arguing that there is *value* in volunteers spending time on
> work that could be automated?  Except for artificially driving up edit
> counts, that is time (and effort) that would be better spent pretty much
> anywhere else!

A lot of work that gets automated is not necessarily difficult for
humans, just time-consuming. But volunteer time is not a resource we get 
to allocate or control; the volunteers do. Simple tasks can help recruit 
or retain contributors--providing a way to ease people into 
participation, or a break to prevent burnout between tackling more 
challenging projects. And while that time and effort might appear more 
"valuable" if spent on other tasks, there's no guarantee that it in fact 
would be.


For tasks that most contributors find unpleasant (dealing with certain 
types of vandalism, perhaps), automation is clearly the way to go. But 
repetition does not necessarily equal tedium in all circumstances or for 
all people. Nor do we need to apply some business-type evaluation of 
what constitutes "productive" effort, at least in the context of 
volunteer work. If a task simply makes someone feel productive, their 
own evaluation is what matters, and it can help them feel more engaged 
and part of the community.


My general point is that opportunities for automation are best 
considered with our overall mission in mind, not just the speed or 
efficiency of a particular workflow. In certain situations, automation 
that creates more work rather than removing it (such as by identifying 
potential tasks and feeding them to editors) might be preferable. And 
some of our tools already use such an approach, which is a good thing.


--Michael Snow



Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-20 Thread Richard Symonds
Not quite: I would argue that anti-vandalism work is a "gateway drug" to
the rest of the project. Just a hunch, though.
On Nov 20, 2013 5:21 PM, "Marc A. Pelletier"  wrote:

> On 11/20/2013 11:59 AM, Michael Snow wrote:
> > An essential part of collaboration is, after all, reviewing each other's
> > work. From the terseness of the comment, it might be alluding to either
> > aspect or both.
>
> That's actually an interesting question that has been lurking beneath
> all the "editing is going down" nervousness.
>
> How much of that 'editing' was, in fact, busy work made immaterial by
> technical advantage (bots, extensions, abusefilter)?  The number of
> antivandalism edits a /human/ has to do in a day has most certainly come
> down a *lot* since c. 2006; this no doubt contributed to a large - now
> diminishing - fraction of total edits.
>
> It's not clear to me that the number of *productive* edits has been
> going down all that much (if at all) in the past several years; the
> proportion of edits that were tedious and repetitive clearly has.
>
> Are you arguing that there is *value* in volunteers spending time on
> work that could be automated?  Except for artificially driving up edit
> counts, that is time (and effort) that would be better spent pretty much
> anywhere else!
>
> -- Marc
>
>


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-20 Thread Marc A. Pelletier
On 11/20/2013 11:59 AM, Michael Snow wrote:
> An essential part of collaboration is, after all, reviewing each other's
> work. From the terseness of the comment, it might be alluding to either
> aspect or both.

That's actually an interesting question that has been lurking beneath
all the "editing is going down" nervousness.

How much of that 'editing' was, in fact, busy work made immaterial by
technical advantage (bots, extensions, abusefilter)?  The number of
antivandalism edits a /human/ has to do in a day has most certainly come
down a *lot* since c. 2006; this no doubt contributed to a large - now
diminishing - fraction of total edits.

It's not clear to me that the number of *productive* edits has been
going down all that much (if at all) in the past several years; the
proportion of edits that were tedious and repetitive clearly has.

Are you arguing that there is *value* in volunteers spending time on
work that could be automated?  Except for artificially driving up edit
counts, that is time (and effort) that would be better spent pretty much
anywhere else!

-- Marc




Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-20 Thread The Cunctator
There's also been discussion of automatically deleting content that
contributors copy from their own writing.
On Nov 20, 2013 8:31 AM, "Marc A. Pelletier"  wrote:

> On 11/20/2013 07:13 AM, The Cunctator wrote:
> > Yes, let's keep on pushing for policies that drive away editors!
>
> Let's be clear here: contributions that are copyright violations are not
> desirable to begin with.  If someone is driven away because they cannot
> cut and paste from random websites anymore, I'm not sure that this could
> reasonably be taken to be a bad thing.
>
> -- Marc
>
>


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-20 Thread Michael Snow

On 11/20/2013 8:31 AM, Marc A. Pelletier wrote:

> On 11/20/2013 07:13 AM, The Cunctator wrote:
> > Yes, let's keep on pushing for policies that drive away editors!
>
> Let's be clear here: contributions that are copyright violations are not
> desirable to begin with.  If someone is driven away because they cannot
> cut and paste from random websites anymore, I'm not sure that this could
> reasonably be taken to be a bad thing.

Not that I encourage us to be permissive about copyright infringement,
but there are two potential aspects here. You've touched on the first, 
which is contributors who do the copying - if they are willing to 
change, that's fine, although I'm skeptical about the value of editors 
who "don't know any better" and certainly repeat offenders should be 
highly unwelcome. But the second aspect is the loss of tasks other 
editors may be able to participate in, if there's potential for 
overautomation of the review process and corresponding loss of human 
judgment (What needs to be removed and what could be fixed just by 
citing the source? How thorough a rewrite is necessary to avoid 
plagiarizing source text?). An essential part of collaboration is, after 
all, reviewing each other's work. From the terseness of the comment, it 
might be alluding to either aspect or both.


--Michael Snow



Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-20 Thread Marc A. Pelletier
On 11/20/2013 07:13 AM, The Cunctator wrote:
> Yes, let's keep on pushing for policies that drive away editors!

Let's be clear here: contributions that are copyright violations are not
desirable to begin with.  If someone is driven away because they cannot
cut and paste from random websites anymore, I'm not sure that this could
reasonably be taken to be a bad thing.

-- Marc




Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-20 Thread Martijn Hoekstra
On Nov 20, 2013 1:13 PM, "The Cunctator"  wrote:
>
> Yes, let's keep on pushing for policies that drive away editors!

I'm not sure exactly what kind of policy you are getting at here. Could you
elaborate a little?

> On Nov 20, 2013 2:10 AM, "Fæ"  wrote:
>
> > On 19 November 2013 20:44, Samuel Klein  wrote:
> > > Aside @Fae: the tineye crew are curious & quite pro-freeculture, I bet
> > > they would be glad to help design a bot that uses their API to check
> > > image copyvios.
> >
> > This is an area that spins off from my little experiments with better
> > management of uploads to Commons from mobile devices. I would like to
> > look at this again and perhaps get a funding proposal together (or
> > partnership with Tineye if they are up for it). It is one of several
> > creative back-burner volunteer projects that I hope to have time to
> > dig into again next year.
> >
> > Fae
> >


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-20 Thread The Cunctator
Yes, let's keep on pushing for policies that drive away editors!
On Nov 20, 2013 2:10 AM, "Fæ"  wrote:

> On 19 November 2013 20:44, Samuel Klein  wrote:
> > Aside @Fae: the tineye crew are curious & quite pro-freeculture, I bet
> > they would be glad to help design a bot that uses their API to check
> > image copyvios.
>
> This is an area that spins off from my little experiments with better
> management of uploads to Commons from mobile devices. I would like to
> look at this again and perhaps get a funding proposal together (or
> partnership with Tineye if they are up for it). It is one of several
> creative back-burner volunteer projects that I hope to have time to
> dig into again next year.
>
> Fae
>


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-20 Thread Fæ
On 19 November 2013 20:44, Samuel Klein  wrote:
> Aside @Fae: the tineye crew are curious & quite pro-freeculture, I bet they
> would be glad to help design a bot that uses their API to check image
> copyvios.

This is an area that spins off from my little experiments with better
management of uploads to Commons from mobile devices. I would like to
look at this again and perhaps get a funding proposal together (or
partnership with Tineye if they are up for it). It is one of several
creative back-burner volunteer projects that I hope to have time to
dig into again next year.

Fae



Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-19 Thread Federico Leva (Nemo)

Matthew Flaschen, 20/11/2013 06:05:

> On 11/13/2013 04:57 AM, Federico Leva (Nemo) wrote:
> > Marco Chiesa, 13/11/2013 10:21:
> > > There are bots that go and look whether a newly inserted block of
> > > text is already present somewhere else, [...]
> >
> > Rectius: there *used* to be a bot (RevertBot, Lusumbot). The program has
> > been stopped when search engines changed their limits and Lusum has been
> > waiting for the WMF's Yahoo! BOSS key, needed to run the bot, for a
> > while.
>
> https://en.wikipedia.org/wiki/User:MadmanBot is still running on English
> Wikipedia, which uses the same Yahoo APIs
> (http://www.uberbox.org/~marc/csb.pl).
>
> It might be possible to run it on Italian Wikipedia as well, even
> without generating a new key.  The operator seems to be
> https://en.wikipedia.org/wiki/User:Madman


That bot links to code (Coren's) that asks for a key. So either the
operator is paying for one himself, or he got the WMF's some time ago; in
both cases, he can't share it with more people.


Nemo



Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-19 Thread Matthew Flaschen

On 11/13/2013 04:57 AM, Federico Leva (Nemo) wrote:

> Marco Chiesa, 13/11/2013 10:21:
> > There are bots that go and look whether a newly inserted block of text is
> > already present somewhere else, [...]
>
> Rectius: there *used* to be a bot (RevertBot, Lusumbot). The program has
> been stopped when search engines changed their limits and Lusum has been
> waiting for the WMF's Yahoo! BOSS key, needed to run the bot, for a while.


https://en.wikipedia.org/wiki/User:MadmanBot is still running on English 
Wikipedia, which uses the same Yahoo APIs 
(http://www.uberbox.org/~marc/csb.pl).


It might be possible to run it on Italian Wikipedia as well, even 
without generating a new key.  The operator seems to be 
https://en.wikipedia.org/wiki/User:Madman


Matt Flaschen




Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-19 Thread Federico Leva (Nemo)

Samuel Klein, 19/11/2013 21:44:

> Aside @Fae: the tineye crew are curious & quite pro-freeculture, I bet they
> would be glad to help design a bot that uses their API to check image
> copyvios.


How do we get them to include the whole Commons dataset in their own, to
start with?


Nemo



Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-19 Thread Samuel Klein
Aside @Fae: the tineye crew are curious & quite pro-freeculture, I bet they
would be glad to help design a bot that uses their API to check image
copyvios.
On Nov 13, 2013 6:48 AM, "Fæ"  wrote:

> On 13 November 2013 07:40, James Heilman  wrote:
> ...
> > Our biggest issue is copyright infringement.
> ...
>
> Thanks for raising this James.
>
> Yes, this is an issue but if you are gunning for elephants this month,
> I really don't think the copyright elephant is the biggest one in the
> herd.
>
> As a practical example of the tools we already have in place,
> yesterday I was facilitating an edit-a-thon for women in science with
> King's College London and we had one of the example stubs we had
> created on the English Wikipedia up on a projector. Within literally
> *minutes* of creation it had been (correctly) flagged by a bot as a
> possible copyright violation as some of the text had been cut & pasted
> from King's own website; one of the participants quickly re-wrote it
> using their own words. As the communications manager was sitting next
> to me at the time, no doubt she found this rather reassuring, even
> though in parallel she was asking about how best to "officially"
> release text. :-)
>
> We have a more complex problem with how images uploaded to Wikimedia
> Commons can be flagged where they match images found elsewhere on the
> internet, this is something that may be done by a future bot but we
> might need to partner with someone like Google Images or Tineye to
> make this truly effective. Having run my own experimental bots in this
> area, I would love to see this become a funded project.
>
> PS with regard to OTRS verification, we could do with better standards
> for verification, at the moment volunteers like myself are left to use
> our own judgement about what checks to make. I tend to double check
> text or images being released with Google, just in case, as well as
> doing "whois" checks on email domains. These sorts of checks could
> become part of OTRS guidelines and would make the reliability of OTRS
> tickets a notch higher.
>
> Cheers,
> Fae
> --
> fae...@gmail.com http://j.mp/faewm
>


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-19 Thread Andrew Gray
It could use abuse-filter tags, just not in an entirely standard way:

* Bot scans edit X
* Script flags it as a problem
* Bot makes edit X+1 to page (perhaps adding copyvio template?) which
triggers an abusefilter rule (if the editor is this bot and it makes
such-and-such an edit) and tags it.

The offending edit itself won't be tagged, but the page history will
and it can probably be spotted quite easily from there.
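
The steps above can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the bot account name, the marker template name and the `find_source` callback are hypothetical, nothing here talks to a real wiki, and `filter_would_tag` merely stands in for an AbuseFilter rule, which would really be written in AbuseFilter's own rule language.

```python
from dataclasses import dataclass
from typing import Callable, Optional

BOT_USER = "ExampleCopyvioBot"   # hypothetical bot account name
MARKER = "{{copyvio-suspect}}"   # hypothetical template name


@dataclass
class Edit:
    rev_id: int
    page: str
    added_text: str


@dataclass
class FollowUp:
    page: str
    user: str
    summary: str
    prepend: str


def plan_follow_up(edit: Edit,
                   find_source: Callable[[str], Optional[str]]) -> Optional[FollowUp]:
    """Scan edit X; if the script flags it, prepare edit X+1."""
    source = find_source(edit.added_text)
    if source is None:
        return None  # nothing suspicious, the bot stays silent
    return FollowUp(
        page=edit.page,
        user=BOT_USER,
        summary=(f"Possible copyvio in revision {edit.rev_id}; "
                 f"suspected source: {source}"),
        prepend=MARKER + "\n",
    )


def filter_would_tag(follow_up: FollowUp) -> bool:
    """Stand-in for the filter rule: this bot, making this kind of edit."""
    return follow_up.user == BOT_USER and follow_up.prepend.startswith(MARKER)
```

The tag lands on the bot's follow-up edit rather than on the offending one, so the summary carries the revision number to make the link back explicit.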

A.

On 19 November 2013 01:07, Matthew Flaschen  wrote:
> On 11/16/2013 09:04 AM, Anthony Cole wrote:
>>
>> The problem of false positives from mirrors doesn't exist if we scan edits
>> as they are made.
>
>
> Agreed.  However, that example is a legal, attributed (at least on the talk
> page) copy from a third-party freely licensed text, not a false positive
> copy from a Wikipedia mirror.
>
>> Maggie says here that copyright bots populate WP:SCV. So a
>> similarly-configured bot could scan recent changes and tag suspected
>> copyvios in watchlists and page histories like suspected vandalism is
>> currently tagged.
>
>
> The suspected vandalism checks that actually tag the edit (e.g. "Tag:
> possible vandalism")  are based on AbuseFilter checks.  These are relatively
> fast determinations that consider the text of the edit (e.g. regexes for
> strings of curse words, or meaningless repeating characters), and
> comparisons to the previous version (blanked the section, blanked the page).
>
> As far as I know, regular AbuseFilter rules can not hit a database or web
> search to check for copyright violations.  An extension could in theory do
> this.  But there would possibly be performance problems, since AbuseFilter
> runs on the actual server (not just some bot's computer) on every edit.
>
> It is possible for a bot to scan every edit; it just can't use AbuseFilter
> tags.
>
> Matt Flaschen
>



-- 
- Andrew Gray
  andrew.g...@dunelm.org.uk



Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-18 Thread Matthew Flaschen

On 11/16/2013 09:04 AM, Anthony Cole wrote:

The problem of false positives from mirrors doesn't exist if we scan edits
as they are made.


Agreed.  However, that example is a legal, attributed (at least on the 
talk page) copy from a third-party freely licensed text, not a false 
positive copy from a Wikipedia mirror.



Maggie says here that copyright bots populate WP:SCV. So a
similarly-configured bot could scan recent changes and tag suspected
copyvios in watchlists and page histories like suspected vandalism is
currently tagged.


The suspected vandalism checks that actually tag the edit (e.g. "Tag: 
possible vandalism")  are based on AbuseFilter checks.  These are 
relatively fast determinations that consider the text of the edit (e.g. 
regexes for strings of curse words, or meaningless repeating 
characters), and comparisons to the previous version (blanked the 
section, blanked the page).
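
To make the "relatively fast" point concrete, here is a rough Python sketch of checks of this kind. It is not AbuseFilter syntax (real filters are written in AbuseFilter's own rule language); the word list, the threshold of ten repeated characters and the function name are all assumptions chosen for illustration. Note that everything is decided from the old and new text alone, with no database or web lookup.

```python
import re
from typing import List

# Assumed threshold: the same character ten or more times in a row.
REPEATED_CHARS = re.compile(r"(.)\1{9,}")
# Placeholder word list; a real filter would use a maintained list.
BAD_WORDS = re.compile(r"\b(?:darn|heck)\b", re.IGNORECASE)


def cheap_flags(old_text: str, new_text: str) -> List[str]:
    """Return labels for an edit using only the two versions of the text."""
    flags = []
    # Comparison to the previous version: the page had content, now it has none.
    if old_text.strip() and not new_text.strip():
        flags.append("blanking")
    # Text-of-the-edit checks: only flag patterns the edit introduced.
    if REPEATED_CHARS.search(new_text) and not REPEATED_CHARS.search(old_text):
        flags.append("repeating characters")
    if BAD_WORDS.search(new_text) and not BAD_WORDS.search(old_text):
        flags.append("bad words")
    return flags
```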


As far as I know, regular AbuseFilter rules can not hit a database or 
web search to check for copyright violations.  An extension could in 
theory do this.  But there would possibly be performance problems, since 
AbuseFilter runs on the actual server (not just some bot's computer) on 
every edit.


It is possible for a bot to scan every edit; it just can't use 
AbuseFilter tags.


Matt Flaschen



Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-16 Thread Federico Leva (Nemo)

Marc A. Pelletier, 16/11/2013 16:34:

On 11/13/2013 04:57 AM, Federico Leva (Nemo) wrote:

Rectius: there *used* to be a bot (RevertBot, Lusumbot). The program has been
stopped when search engines changed their limits and Lusum has been
waiting for the WMF's Yahoo! BOSS key, needed to run the bot, for a while.


I haven't been "in charge" of that key in quite some time, but I think I
still have the appropriate credentials to generate one for a copyright
violation bot.

I can look into it if you want.


It would be awesome! Thank you.

Nemo



Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-16 Thread Marc A. Pelletier
On 11/13/2013 04:41 PM, Tobias wrote:
> I think the community has done a very good job in the past 12 years when
> it comes to copyright. It is important to see that we are a community
> site – nothing is ever going to be perfect, and certainly we are not
> free of any copyright violations. But we are dealing with them in a very
> responsible way and I would say that our current efforts are sufficient.

I think that's the best way of summing it up.  "Sufficient" is a vague
metric, and leaves room for improvement, but the nutshell is that the
community /does/ take copyright violations seriously and makes a very
good effort to curtail them.

Do some slip through?  Yes, without doubt.  Are they eliminated with
prejudice the second they are noticed?  Yes.

The Wikimedia projects are no worse than any other collected works when
it comes to copyright infringement and indeed tend to handle it with
more vigilance than the other sites in the top 10 (proactively, rather
than reactively).

Could we do better?  No doubt.  Is improvement so desperately critical
that we should drop everything else to concentrate on that?  Not a
chance.  And I speak as the author and (for a long time) maintainer of
one of the most visible and widely used copyright violation detection tools
on our project (CorenSearchBot, now handled by MadmanBot and - last I
heard - used on around a dozen projects).

-- Marc




Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-16 Thread Marc A. Pelletier
On 11/13/2013 04:57 AM, Federico Leva (Nemo) wrote:
> Rectius: there *used* to be a bot (RevertBot, Lusumbot). The program has
> been stopped when search engines changed their limits and Lusum has been
> waiting for the WMF's Yahoo! BOSS key, needed to run the bot, for a while.

I haven't been "in charge" of that key in quite some time, but I think I
still have the appropriate credentials to generate one for a copyright
violation bot.

I can look into it if you want.

-- Marc




Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-16 Thread Anthony Cole
The problem of false positives from mirrors doesn't exist if we scan edits
as they are made.

Maggie says here that copyright bots populate WP:SCV. So a
similarly-configured bot could scan recent changes and tag suspected
copyvios in watchlists and page histories like suspected vandalism is
currently tagged. Ideally the edit summary would contain a url to the
suspected source. Maggie points out that those copyright bots were blocked
for a while (from scanning Google I presume) due to a negative impact on
Google, and this problem was solved by someone writing a cheque.

False positives won't be a problem, unless they're more than, say, 50%. If
a recent changes patroller can't confirm the copyvio, they can let it go.
But an editor whose "contributions" list is peppered with such warnings
would stand out like a sore thumb.



Anthony Cole 
Memberships secretary
Wiki Project Med Foundation


On Sat, Nov 16, 2013 at 3:36 AM, rupert THURNER wrote:

> Salut florence, i obviously need to improve my English :) Marco suggested
> human checking to avoid false positives and some annotation that it
> happened. In my eyes the cited case is a verbatim copy of some compatible
> license text which could be used as an example to demonstrate what he ment.
> I did not see such a thing up to now and would not be 100% sure how to do
> it correctly. So i asked.
>
> Rupert
> Am 15.11.2013 13:44 schrieb "Florence Devouard" :
>
> > H
> >
> > Rupert,
> >
> > The case you mention is unrelated to any copyright infringement (the book
> > is explicitely published under cc by sa. So there is no copyvio). Its
> > mention here is like hair falling in soup.
> >
> > Now, I think there is a developing personal feud between you and Iolenda.
> > It sincerely saddens me to see two people I appreciate come to such a
> > situation. Would you both consider talking to each other on Skype or
> > something like this ? Alternatively, find someone neutral and nice to
> > help fix things so that you can come to a mutual understanding ?
> >
> > I understand that you both see things differently, but ultimately, you
> > both are here to make things move on.
> >
> > Flo
> >
> >
> > On 11/14/13 2:36 PM, rupert THURNER wrote:
> >
> >> There is such a case in
> >> http://en.m.wikipedia.org/wiki/Education_in_Cameroon,
> >> reference is on the talk page. would you be so kind to mark or refer to
> >> it correctly?
> >>
> >> rupert
> >> Am 13.11.2013 12:46 schrieb "Marco Chiesa" :
> >>
> >>> On Wed, Nov 13, 2013 at 12:39 PM, Chris McKenna wrote:
> >>>
> >>>> The problem isn't that we're waiting for perfection. We're waiting for
> >>>> the proportion of false positives and false negatives to fall to a
> >>>> level where don't overwhelm the true positives.
> >>>
> >>> To avoid false positives from mirrors, the best option is to compare a
> >>> text as soon as it is saved. Also, you exclude certain websites from the
> >>> comparison because you know they're the mirrors, you exclude rollbacks,
> >>> ...
> >>> Then, it is better to have a human checking that it is really a copyvio
> >>> (it could well be a public domain text, or another Wikipedia article).
> >>>
> >>> Marco


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-15 Thread rupert THURNER
Salut florence, i obviously need to improve my English :) Marco suggested
human checking to avoid false positives and some annotation that it
happened. In my eyes the cited case is a verbatim copy of some compatible
license text which could be used as an example to demonstrate what he meant.
I did not see such a thing up to now and would not be 100% sure how to do
it correctly. So i asked.

Rupert
Am 15.11.2013 13:44 schrieb "Florence Devouard" :

> H
>
> Rupert,
>
> The case you mention is unrelated to any copyright infringement (the book
> is explicitely published under cc by sa. So there is no copyvio). Its
> mention here is like hair falling in soup.
>
> Now, I think there is a developing personal feud between you and Iolenda.
> It sincerely saddens me to see two people I appreciate come to such a
> situation. Would you both consider talking to each other on Skype or
> something like this ? Alternatively, find someone neutral and nice to help
> fix things so that you can come to a mutual understanding ?
>
> I understand that you both see things differently, but ultimately, you
> both are here to make things move on.
>
> Flo
>
>
> On 11/14/13 2:36 PM, rupert THURNER wrote:
>
>> There is such a case in
>> http://en.m.wikipedia.org/wiki/Education_in_Cameroon,
>> reference is on the talk page. would you be so kind to mark or refer to it
>> correctly?
>>
>> rupert
>> Am 13.11.2013 12:46 schrieb "Marco Chiesa" :
>>
>>> On Wed, Nov 13, 2013 at 12:39 PM, Chris McKenna wrote:
>>>
>>>> The problem isn't that we're waiting for perfection. We're waiting for
>>>> the proportion of false positives and false negatives to fall to a level
>>>> where don't overwhelm the true positives.
>>>
>>> To avoid false positives from mirrors, the best option is to compare a
>>> text as soon as it is saved. Also, you exclude certain websites from the
>>> comparison because you know they're the mirrors, you exclude rollbacks,
>>> ...
>>> Then, it is better to have a human checking that it is really a copyvio
>>> (it could well be a public domain text, or another Wikipedia article).
>>>
>>> Marco


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-15 Thread Florence Devouard

H

Rupert,

The case you mention is unrelated to any copyright infringement (the 
book is explicitly published under CC BY-SA, so there is no copyvio). 
Its mention here is like hair falling in soup.


Now, I think there is a developing personal feud between you and 
Iolenda. It sincerely saddens me to see two people I appreciate come to 
such a situation. Would you both consider talking to each other on Skype 
or something like this ? Alternatively, find someone neutral and nice to 
help fix things so that you can come to a mutual understanding ?


I understand that you both see things differently, but ultimately, you 
both are here to make things move on.


Flo


On 11/14/13 2:36 PM, rupert THURNER wrote:

There is such a case in http://en.m.wikipedia.org/wiki/Education_in_Cameroon,
reference is on the talk page. would you be so kind to mark or refer to it
correctly?

rupert
Am 13.11.2013 12:46 schrieb "Marco Chiesa" :


On Wed, Nov 13, 2013 at 12:39 PM, Chris McKenna  wrote:



The problem isn't that we're waiting for perfection. We're waiting for the
proportion of false positives and false negatives to fall to a level where
don't overwhelm the true positives.



To avoid false positives from mirrors, the best option is to compare a text
as soon as it is saved. Also, you exclude certain websites from the
comparison because you know they're the mirrors, you exclude rollbacks, ...
Then, it is better to have a human checking that it is really a copyvio (it
could well be a public domain text, or another Wikipedia article).

Marco


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-14 Thread Laura Hale
On Thu, Nov 14, 2013 at 4:47 PM, Andrew Lih  wrote:

> FYI, on the last Wikipedia Weekly podcast, we talked with Sage Ross about
> the plagiarism issue, and he walked through the study with some very
> interesting insights. Video here, and the discussion started at 11 minutes,
> 30 seconds into the podcast.
>

I've done a study of student contributions on English Wikinews.  We've
found that only about 15% of student submissions have a copyright issue.
 The level of plagiarism and copyright problems is about the same for
regular contributors, new contributors and student contributors on English
Wikinews with that range of 10 to 15%.  This is an issue we have to be on
top of because nothing gets published on the project without being reviewed
for this issue.

Sincerely,
Laura Hale

-- 
twitter: purplepopple
blog: ozziesport.com


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-14 Thread Andrew Lih
FYI, on the last Wikipedia Weekly podcast, we talked with Sage Ross about
the plagiarism issue, and he walked through the study with some very
interesting insights. Video here, and the discussion started at 11 minutes,
30 seconds into the podcast.

https://www.youtube.com/watch?v=IOgYytn2JRk

-Andrew



On Wed, Nov 13, 2013 at 4:03 AM, Steven Walling wrote:

> On Tue, Nov 12, 2013 at 11:40 PM, James Heilman  wrote:
>
> > The Wikimedia Foundation needs to wake up and deal with the "real tech
> > elephant in the room". Our primary issue is not a lack of FLOW, a lack
> > of a visual editor, or a lack of a rapidly expanding education program.
> >
> > Our biggest issue is copyright infringement. We have had the Indian
> > program, we have had issues with the Education program, and I have today
> > come across a user who has made nearly 20,000 edits to 1,742 articles
> > since 2006 which appear to be nearly all copy and pasted from the sources
> > he has used.
> > https://en.wikipedia.org/wiki/User_talk:DrMicro#Copyright_infringement
> > This has seriously shaken my faith in Wikipedia.
> >
> > This is especially devastating as there is a tech solution that would
> > have prevented it. The efforts are being worked on by volunteers here
> > https://en.wikipedia.org/wiki/Wikipedia:Turnitin and has been since at
> > least March of 2012. We NEED all tech resources at the foundation thrown
> > at this project. Other less important projects like FLOW and the visual
> > editor need to be put on hold to develop this tool.
> >
>
> Relevant info on the subject of copyvio is the recent plagiarism study by
> the Education Program team. They looked at different types of users
> (students, newbies, experienced editors, admins) and compared them. Results
> were published on Meta at
>
> https://meta.wikimedia.org/wiki/Research:Plagiarism_on_the_English_Wikipedia
> and also discussed in the last WMF Metrics & Activities meeting:
> https://meta.wikimedia.org/wiki/Metrics_and_activities_meetings/2013-11-07
>
> AFAIK this is the best data we have about how often different kinds of
> editors close paraphrase or outright copy/paste.


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-14 Thread rupert THURNER
There is such a case in http://en.m.wikipedia.org/wiki/Education_in_Cameroon,
reference is on the talk page. would you be so kind to mark or refer to it
correctly?

rupert
Am 13.11.2013 12:46 schrieb "Marco Chiesa" :

> On Wed, Nov 13, 2013 at 12:39 PM, Chris McKenna  wrote:
>
> >
> > The problem isn't that we're waiting for perfection. We're waiting for
> > the proportion of false positives and false negatives to fall to a level
> > where don't overwhelm the true positives.
> >
> >
> To avoid false positives from mirrors, the best option is to compare a text
> as soon as it is saved. Also, you exclude certain websites from the
> comparison because you know they're the mirrors, you exclude rollbacks, ...
> Then, it is better to have a human checking that it is really a copyvio (it
> could well be a public domain text, or another Wikipedia article).
>
> Marco


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Martin Rulsch
> Unquestionably, there are also many instances where the system fails and
> where lots of copyrighted material gets uploaded. Back in 2005, we had a
> case similar to the one you described in German Wikipedia, where various
> IPs copied content from old books. It is a big mess to clean up, but it can
> be done. And luckily the cases of massive copyvios are quite rare.
>
>
For further information see
https://de.wikipedia.org/wiki/Wikipedia:Archiv/DDR-URV/Presseinfo (German).

Cheers
Martin


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Tobias

On 11/13/2013 08:40 AM, James Heilman wrote:

Our biggest issue is copyright infringement.


When it comes to copyright infringement, among all community sites on 
the Internet, Wikipedia is one of the best at handling it. Many websites 
don't even bother with copyright unless they get a DMCA Takedown notice. 
We on the other hand have voluntary contributors checking pages and 
raising flags whenever there is even a suspicion of a copyright violation.


This seems to be highly effective in many cases. A few days ago, I wrote 
an email to a photographer, whose photos had been uploaded to Commons. 
He said I was the third to ask him whether he really had uploaded those 
images (which he had).


Unquestionably, there are also many instances where the system fails 
and where lots of copyrighted material gets uploaded. Back in 2005, we 
had a case similar to the one you described in German Wikipedia, where 
various IPs copied content from old books. It is a big mess to clean up, 
but it can be done. And luckily the cases of massive copyvios are quite 
rare.


I think the community has done a very good job in the past 12 years when 
it comes to copyright. It is important to see that we are a community 
site – nothing is ever going to be perfect, and certainly we are not 
free of any copyright violations. But we are dealing with them in a very 
responsible way and I would say that our current efforts are sufficient.



Tobias




Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Nathan
On Wed, Nov 13, 2013 at 1:48 PM, Michael Snow wrote:

> On 11/13/2013 10:39 AM, Nathan wrote:
>
>> On Wed, Nov 13, 2013 at 4:53 AM, Lodewijk 
>> wrote:
>>
>>> Marco: I agree, we had also issues on the Dutch Wikipedia - these have
>>> been
>>> around for ages, the English Wikipedia is just less aware of them.
>>>
>> Not sure if you meant this how it sounds, but the English Wikipedia
>> community is acutely aware of copyright problems and have undertaken many,
>> many large and complicated cleanup tasks of the sort Marco described.
>>
> I think he meant that the English Wikipedia community is less aware of the
> fact that we face these sorts of large-scale challenges in many other
> languages as well. In other words, the antecedent to "them" is "issues on
> the Dutch/Italian/etc. Wikipedia", rather than "copyright issues"
> generally. Most people participating in other languages are reasonably
> aware when major concerns surface from the English Wikipedia; people
> participating only in English often haven't a clue about the concerns being
> dealt with in other languages.
>
> --Michael Snow
>
>
That makes sense, thanks for clearing that up for me.


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Michael Snow

On 11/13/2013 10:39 AM, Nathan wrote:

> On Wed, Nov 13, 2013 at 4:53 AM, Lodewijk wrote:
>
>> Marco: I agree, we had also issues on the Dutch Wikipedia - these have been
>> around for ages, the English Wikipedia is just less aware of them.
>
> Not sure if you meant this how it sounds, but the English Wikipedia
> community is acutely aware of copyright problems and have undertaken many,
> many large and complicated cleanup tasks of the sort Marco described.

I think he meant that the English Wikipedia community is less aware of 
the fact that we face these sorts of large-scale challenges in many 
other languages as well. In other words, the antecedent to "them" is 
"issues on the Dutch/Italian/etc. Wikipedia", rather than "copyright 
issues" generally. Most people participating in other languages are 
reasonably aware when major concerns surface from the English Wikipedia; 
people participating only in English often haven't a clue about the 
concerns being dealt with in other languages.


--Michael Snow



Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread George Herbert
On Wed, Nov 13, 2013 at 3:48 AM, Fæ  wrote:

> ...
> PS with regard to OTRS verification, we could do with better standards
> for verification,


We are not attempting to perform a complete and unassailable verification;
imagining that we can is folly.

The point is, we need someone who credibly is the author or rightsholder,
and with whom we have an audit trail of their claims and identity (email
address we corresponded with, etc).

When it comes down to it, we have no idea if an email is associated with
the given person, that the alleged sender of a certified letter really is
that person, or that the "John Doe" that came in to the office and showed
valid government issued ID with a claim of copyright violation is the same
John Doe who wrote the original material.  There's no way for us to confirm
in any reasonable manner.

If there is an attempt at identity theft that is discovered, that audit
trail is available to investigators with proper legal authorization etc.


-- 
-george william herbert
george.herb...@gmail.com


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Nathan
On Wed, Nov 13, 2013 at 4:53 AM, Lodewijk wrote:

> Marco: I agree, we had also issues on the Dutch Wikipedia - these have been
> around for ages, the English Wikipedia is just less aware of them.
>


Not sure if you meant this how it sounds, but the English Wikipedia
community is acutely aware of copyright problems and have undertaken many,
many large and complicated cleanup tasks of the sort Marco described.


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Quim Gil
On 11/13/2013 12:37 AM, Matthew Flaschen wrote:
> However,
> there may be room for enhancing MadmanBot (e.g. as a GSOC or OPW project).

Any technical project able to identify small tasks and available mentors
is welcome to join Wikimedia's Google Code-in team at

https://www.mediawiki.org/wiki/Google_Code-In

GCI will start next week and will last until the beginning of January.
Hundreds of young students will scan our tasks and will eventually
complete some of them.

It is a program ideal for small projects, like the bots or gadgets used
by editors.

-- 
Quim Gil
Technical Contributor Coordinator @ Wikimedia Foundation
http://www.mediawiki.org/wiki/User:Qgil



Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Fæ
On 13 November 2013 07:40, James Heilman  wrote:
...
> Our biggest issue is copyright infringement.
...

Thanks for raising this James.

Yes, this is an issue but if you are gunning for elephants this month,
I really don't think the copyright elephant is the biggest one in the
herd.

As a practical example of the tools we already have in place,
yesterday I was facilitating an edit-a-thon for women in science with
King's College London and we had one of the example stubs we had
created on the English Wikipedia up on a projector. Within literally
*minutes* of creation it had been (correctly) flagged by a bot as a
possible copyright violation as some of the text had been cut & pasted
from King's own website; one of the participants quickly re-wrote it
using their own words. As the communications manager was sitting next
to me at the time, no doubt she found this rather reassuring, even
though in parallel she was asking about how best to "officially"
release text. :-)

We have a more complex problem with how images uploaded to Wikimedia
Commons can be flagged where they match images found elsewhere on the
internet. This is something that may be done by a future bot, but we
might need to partner with someone like Google Images or Tineye to
make this truly effective. Having run my own experimental bots in this
area, I would love to see this become a funded project.

PS with regard to OTRS verification, we could do with better standards
for verification, at the moment volunteers like myself are left to use
our own judgement about what checks to make. I tend to double check
text or images being released with Google, just in case, as well as
doing "whois" checks on email domains. These sorts of checks could
become part of OTRS guidelines and would make the reliability of OTRS
tickets a notch higher.

Cheers,
Fae
-- 
fae...@gmail.com http://j.mp/faewm



Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Marco Chiesa
On Wed, Nov 13, 2013 at 12:39 PM, Chris McKenna  wrote:

>
> The problem isn't that we're waiting for perfection. We're waiting for the
> proportion of false positives and false negatives to fall to a level where
> they don't overwhelm the true positives.
>
>
To avoid false positives from mirrors, the best option is to compare a text
as soon as it is saved. Also, you exclude certain websites from the
comparison because you know they're the mirrors, you exclude rollbacks, ...
Then, it is better to have a human checking that it is really a copyvio (it
could well be a public domain text, or another Wikipedia article).
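
A minimal sketch of the filtering steps just described, assuming a
checker that already returns the URLs where a newly saved text was found.
The mirror domains, the shape of the `edit` dict and both function names
are hypothetical; this is not how any existing bot is known to work.

```python
from urllib.parse import urlparse

# Hypothetical mirror list; a real deployment would maintain its own.
KNOWN_MIRRORS = {"mirror.example.org", "wikicopy.example.com"}


def is_known_mirror(url, mirrors=KNOWN_MIRRORS):
    """Return True if the URL's host is a listed mirror or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == m or host.endswith("." + m) for m in mirrors)


def needs_human_review(edit, match_urls, mirrors=KNOWN_MIRRORS):
    """Return the external matches a human should still check for this edit.

    `edit` is assumed to be a dict with a boolean "is_rollback" key.
    """
    if edit.get("is_rollback"):
        # A rollback restores older text, so it is not a new insertion.
        return []
    # Drop matches on known mirrors; everything else goes to a person.
    return [u for u in match_urls if not is_known_mirror(u, mirrors)]
```

Whatever survives the filter is only a candidate: as noted above, it
could still be public domain text or another Wikipedia article.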

Marco


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Marco Chiesa
On Wed, Nov 13, 2013 at 12:36 PM, Chris McKenna  wrote:

> But an automated tool can not know whether OTRS verification has happened
> or not.

We put something like {{OTRS verified}} in the article's talk page,
something saying: Part of the text comes from website X, ticket 1234567890.
And if the author wants to use his work for many articles, we tell him/her
to put the template on the talk pages of all his/her articles.
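
A rough sketch of how an automated tool could pick up such a marker,
assuming the talk page wikitext has already been fetched. The template
name is the placeholder used above and the `ticket` parameter is
hypothetical; each wiki would have its own template.

```python
import re

# Matches e.g. {{OTRS verified}} or {{OTRS verified|ticket=1234567890}}.
# Template name and parameters are placeholders, not a real wiki's template.
OTRS_TEMPLATE = re.compile(
    r"\{\{\s*OTRS verified\s*(?:\|[^{}]*)?\}\}",
    re.IGNORECASE,
)


def has_otrs_verification(talk_page_wikitext):
    """Return True if the talk page carries the verification template."""
    return OTRS_TEMPLATE.search(talk_page_wikitext) is not None
```

A copyvio checker could call this before flagging an article and skip
matches against the website named in the template.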
Marco


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Chris McKenna

On Wed, 13 Nov 2013, Gerard Meijssen wrote:


> The point I want to make is that having a tool that is KNOWN to be
> deficient in specific ways can still be a huge advantage over not having a
> tool at all. So PLEASE let's not make perfection the enemy of the good.

The problem isn't that we're waiting for perfection. We're waiting for the
proportion of false positives and false negatives to fall to a level where
they don't overwhelm the true positives.
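
To make "overwhelm" concrete, one common measure is precision: the share
of flagged edits that are real violations. The sketch below is only an
illustration and the function name is made up for it.

```python
def precision(true_positives, false_positives):
    """Share of flagged edits that are real violations (0.0 if nothing flagged)."""
    flagged = true_positives + false_positives
    return true_positives / flagged if flagged else 0.0
```

With purely hypothetical counts of 20 real copyvios among 200 flags,
only one flag in ten is worth a reviewer's time.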



Chris McKenna

cmcke...@sucs.org
www.sucs.org/~cmckenna


The essential things in life are seen not with the eyes,
but with the heart

Antoine de Saint Exupery




Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Chris McKenna

On Wed, 13 Nov 2013, Marco Chiesa wrote:


> On Wed, Nov 13, 2013 at 11:44 AM, Gerard Meijssen wrote:
>
>> Hoi
>> I know several authors who publish and use their original text to publish
>> on Wikipedia as well.. This is another source of false positives because
>> they have the copyright to the original source... To recognise this you
>> have to be even more sophisticated.
>
> Actually, we consider these as copyvios, we delete the text straight away,
> and we tell the editor "if you're the author write to OTRS". Of course, if
> the text is already somewhere else under a compatible free-license, we
> don't need this. As long as you can't be sure that User:MrX is actually the
> physical person MrX, we need to protect the author's right.

But an automated tool can not know whether OTRS verification has happened
or not.



Chris McKenna

cmcke...@sucs.org
www.sucs.org/~cmckenna


The essential things in life are seen not with the eyes,
but with the heart

Antoine de Saint Exupery




Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Marco Chiesa
On Wed, Nov 13, 2013 at 11:44 AM, Gerard Meijssen  wrote:

> Hoi
> I know several authors who publish and use their original text to publish
> on Wikipedia as well.. This is another source of false positives because
> they have the copyright to the original source... To recognise this you
> have to be even more sophisticated.
>

Actually, we consider these as copyvios, we delete the text straight away,
and we tell the editor "if you're the author write to OTRS". Of course, if
the text is already somewhere else under a compatible free-license, we
don't need this. As long as you can't be sure that User:MrX is actually the
physical person MrX, we need to protect the author's right.

Marco


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Gerard Meijssen
Hoi
I know several authors who publish and use their original text to publish
on Wikipedia as well.. This is another source of false positives because
they have the copyright to the original source... To recognise this you
have to be even more sophisticated.

The point I want to make is that having a tool that is KNOWN to be
deficient in specific ways can still be a huge advantage over not having a
tool at all. So PLEASE let's not make perfection the enemy of the good.
Thanks,
   GerardM


On 13 November 2013 11:23, Matthew Flaschen wrote:

> On 11/13/2013 05:16 AM, Philippe Beaudette wrote:
>
>> On Wed, Nov 13, 2013 at 2:37 AM, Matthew Flaschen <
>> matthew.flasc...@gatech.edu> wrote:
>>
>>> A significant problem with TurnItIn is that it is proprietary, and can not
>>> be customized by anyone in the movement.  The fact that it is proprietary
>>> also means it can never be part of the main infrastructure, nor run on
>>> Wikimedia Labs.
>>
>> Another significant issue is the "False Positive" factor that is created
>> by our overwhelming popularity.  Frankly, we're mirrored all over the place.
>> And tools like Turnitin find the mirrors too.  It's not an easy problem to
>> solve.  I was on the team that looked at this a couple of years back -
>> it's just not simple, and there are complex challenges.
>>
>
> Yes, an intelligent solution would take into account when the mirror was
> first indexed (or ideally first published), and when the Wikipedia article
> was edited, to reduce false positives requiring manual intervention.
>
> Matt Flaschen
>
>
>


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Matthew Flaschen

On 11/13/2013 05:16 AM, Philippe Beaudette wrote:

> On Wed, Nov 13, 2013 at 2:37 AM, Matthew Flaschen <
> matthew.flasc...@gatech.edu> wrote:
>
>> A significant problem with TurnItIn is that it is proprietary, and can not be
>> customized by anyone in the movement.  The fact that it is proprietary also
>> means it can never be part of the main infrastructure, nor run on Wikimedia
>> Labs.
>
> Another significant issue is the "False Positive" factor that is created by
> our overwhelming popularity.  Frankly, we're mirrored all over the place.
> And tools like Turnitin find the mirrors too.  It's not an easy problem to
> solve.  I was on the team that looked at this a couple of years back - it's
> just not simple, and there are complex challenges.


Yes, an intelligent solution would take into account when the mirror was 
first indexed (or ideally first published), and when the Wikipedia 
article was edited, to reduce false positives requiring manual intervention.
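
The date comparison could look roughly like this, assuming both
timestamps are already known. The function name and arguments are
hypothetical, and getting a trustworthy "first seen" date for an
external page is the hard part that this sketch does not address.

```python
from datetime import datetime


def likely_mirror(article_text_added, source_first_seen):
    """Return True if the Wikipedia text predates the external page.

    article_text_added: datetime of the edit that introduced the text.
    source_first_seen: datetime the external page was first indexed or
    published, or None if unknown.
    """
    if source_first_seen is None:
        # Unknown date: a copyvio cannot be ruled out, so leave it for review.
        return False
    return article_text_added < source_first_seen
```

For example, `likely_mirror(datetime(2010, 5, 1), datetime(2012, 1, 1))`
is true, so that match would not need manual intervention.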


Matt Flaschen




Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Philippe Beaudette
On Wed, Nov 13, 2013 at 2:37 AM, Matthew Flaschen <
matthew.flasc...@gatech.edu> wrote:

> A significant problem with TurnItIn is that it is proprietary, and can not be
> customized by anyone in the movement.  The fact that it is proprietary also
> means it can never be part of the main infrastructure, nor run on Wikimedia
> Labs.


Another significant issue is the "False Positive" factor that is created by
our overwhelming popularity.  Frankly, we're mirrored all over the place.
And tools like Turnitin find the mirrors too.  It's not an easy problem to
solve.  I was on the team that looked at this a couple of years back - it's
just not simple, and there are complex challenges.


*Philippe Beaudette * \\  Director, Community Advocacy \\ Wikimedia
Foundation, Inc.
 T: 1-415-839-6885 x6643 |  phili...@wikimedia.org  |  :
@Philippewiki


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Federico Leva (Nemo)

Marco Chiesa, 13/11/2013 10:21:

> There are bots that go and look whether a newly inserted block of text is
> already present somewhere else, [...]


Rectius: there *used* to be a bot (RevertBot, Lusumbot). The program 
 has been 
stopped when search engines changed their limits and Lusum has been 
waiting for the WMF's Yahoo! BOSS key, needed to run the bot, for a while.


Nemo



Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Lodewijk
Marco: I agree, we also had issues on the Dutch Wikipedia - these have been
around for ages, the English Wikipedia is just less aware of them. Often,
copypasting in the same language is caught easily - between different
languages is much harder and persistent. There are many people, including
experienced editors, that think translating from random sources is OK. It
is no new problem, and chapters have indeed been working on getting this
understanding of what free licenses really mean more widely accepted in the
general audience. Not something that is easily measured of course.
Technical solutions sound great, but are only catching a small amount
inside the same language.

Steven: I understand this research was limited to the English Wikipedia
(where most of the plagiarism will be in the same language). It would not
strike me out of the realm of realism to assume this might be very
different for other languages than English. It also says little about the
problem in general of course.

For those who don't want to click on links to get information, it basically
says (simplification alert) that they don't have any indication that the US
& Canada education program makes the plagiarism problem on the English
Wikipedia any worse than it already is.

Anyway: I think this problem is more prominently there in non-English
communities, and that technical solutions are not going to be the answer
there. An educational answer is more likely to be successful, focusing on
explaining to people how Wikipedia works and doesn't work, and what the do's
and don'ts are. This doesn't have to be an education program like executed in
the US, but basically all outreach programs as executed by chapters, user
groups, thematic organizations or groups of volunteers can contribute to
this. This is already happening in most countries.

In some countries (like Germany ;-) ) politicians are doing the work for
us, explaining how evil plagiarism is and how it works by firing government
ministers over it :)

Best,
Lodewijk




2013/11/13 Marco Chiesa 

> On Wed, Nov 13, 2013 at 8:40 AM, James Heilman  wrote:
>
> >
> > Our biggest issue is copyright infringement. We have had the Indian
> > program, we have had issues with the Education program, and I have today
> > come across a user who has made nearly 20,000 edits to 1,742 article
> since
> > 2006 which appear to be nearly all copy and pasted from the sources he
> has
> > used.
> > https://en.wikipedia.org/wiki/User_talk:DrMicro#Copyright_infringement
> > This
> > has seriously shaken my faith in Wikipedia.
> >
>
> Back in 2007 we found out a user on it.wp, a former sysop, with more than
> 40,000 edits that used to copy-paste from his sources, often outdated. He
> was banned, and the community made a great effort to cleanup the articles
> he contributed to (and damn it was hard, because those articles had a long
> history after his edits). And in the following years, we had other similar
> cases, you can find a selection here:
> https://it.wikipedia.org/wiki/Progetto:Cococo/Controlli_conclusi
> There are bots that go and look whether a newly inserted block of text is
> already present somewhere else, it doesn't find everything  (of course it
> won't find things copied from a printed book), but sooner or later serial
> copyviolers get caught, and the fall from hero to zero is sooo quick.
>
> At the end of the day, I think copyvios have always been taken seriously,
> so that I don't remember big problems with that, while there have always
> been more problems with libel, privacy, and editor retention.
>
>
> Marco (Cruccone)


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Marco Chiesa
On Wed, Nov 13, 2013 at 8:40 AM, James Heilman  wrote:

>
> Our biggest issue is copyright infringement. We have had the Indian
> program, we have had issues with the Education program, and I have today
> come across a user who has made nearly 20,000 edits to 1,742 article since
> 2006 which appear to be nearly all copy and pasted from the sources he has
> used.
> https://en.wikipedia.org/wiki/User_talk:DrMicro#Copyright_infringement
> This
> has seriously shaken my faith in Wikipedia.
>

Back in 2007 we found out a user on it.wp, a former sysop, with more than
40,000 edits that used to copy-paste from his sources, often outdated. He
was banned, and the community made a great effort to cleanup the articles
he contributed to (and damn it was hard, because those articles had a long
history after his edits). And in the following years, we had other similar
cases, you can find a selection here:
https://it.wikipedia.org/wiki/Progetto:Cococo/Controlli_conclusi
There are bots that go and look whether a newly inserted block of text is
already present somewhere else, it doesn't find everything  (of course it
won't find things copied from a printed book), but sooner or later serial
copyviolers get caught, and the fall from hero to zero is sooo quick.

At the end of the day, I think copyvios have always been taken seriously,
so that I don't remember big problems with that, while there have always
been more problems with libel, privacy, and editor retention.


Marco (Cruccone)


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Steven Walling
On Tue, Nov 12, 2013 at 11:40 PM, James Heilman  wrote:

> The Wikimedia Foundation needs to wake up and deal with the "real tech
> elephant in the room". Our primary issue is not a lack of FLOW, a lack of a
> visual editor, or a lack of a rapidly expanding education program.
>
> Our biggest issue is copyright infringement. We have had the Indian
> program, we have had issues with the Education program, and I have today
> come across a user who has made nearly 20,000 edits to 1,742 article since
> 2006 which appear to be nearly all copy and pasted from the sources he has
> used.
> https://en.wikipedia.org/wiki/User_talk:DrMicro#Copyright_infringement
> This
> has seriously shaken my faith in Wikipedia.
>
> This is especially devastating as there is a tech solution that would have
> prevented it. The efforts are being worked on by volunteers here
> https://en.wikipedia.org/wiki/Wikipedia:Turnitin and has been since at
> least March of 2012. We NEED all tech resource at the foundation thrown at
> this project. Other less important project like FLOW and the visual editor
> need to be put on hold to develop this tool.
>

Relevant info on the subject of copyvio is the recent plagiarism study by
the Education Program team. They looked at different types of users (students,
newbies, experienced editors, admins) and compared them. Results were
published on Meta at
https://meta.wikimedia.org/wiki/Research:Plagiarism_on_the_English_Wikipedia
and also discussed in the last WMF Metrics & Activities meeting:
https://meta.wikimedia.org/wiki/Metrics_and_activities_meetings/2013-11-07

AFAIK this is the best data we have about how often different kinds of
editors close paraphrase or outright copy/paste.


Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Matthew Flaschen

On 11/13/2013 02:40 AM, James Heilman wrote:

> The Wikimedia Foundation needs to wake up and deal with the "real tech
> elephant in the room". Our primary issue is not a lack of FLOW, a lack of a
> visual editor, or a lack of a rapidly expanding education program.
>
> Our biggest issue is copyright infringement.


I don't really agree with that.  It is a serious issue, but I would put 
NPOV (in the face of active threats such as companies paying for 
publicity on Wikipedia) and growing the editor community higher.


We also have solutions to address it (not perfectly, true), both
preventing the problem and dealing with it after the fact:

* MadmanBot (https://en.wikipedia.org/wiki/User:MadmanBot) (mentioned at
Wikipedia:TurnItIn, and a major technical tool against copyright
infringement).
* Clear policies against copyright infringement
* Dealing with copyright violations
(https://en.wikipedia.org/wiki/Wikipedia:Text_Copyright_Violations_101)
* Finally, the DMCA ensures the foundation is not liable as long as they
promptly respond to notifications (which of course we want them to anyway).



> We have had the Indian program, we have had issues with the Education
> program, and I have today come across a user who has made nearly 20,000
> edits to 1,742 article since 2006 which appear to be nearly all copy and
> pasted from the sources he has used.
> https://en.wikipedia.org/wiki/User_talk:DrMicro#Copyright_infringement
> This has seriously shaken my faith in Wikipedia.


That is indeed disturbing, and I'm glad you found it.


> This is especially devastating as there is a tech solution that would have
> prevented it. The efforts are being worked on by volunteers here
> https://en.wikipedia.org/wiki/Wikipedia:Turnitin and has been since at
> least March of 2012. We NEED all tech resource at the foundation thrown at
> this project. Other less important project like FLOW and the visual editor
> need to be put on hold to develop this tool.


I don't agree that all tech resources should be used for this.  However, 
there may be room for enhancing MadmanBot (e.g. as a GSOC or OPW project).


A significant problem with TurnItIn is that it is proprietary, and can not
be customized by anyone in the movement.  The fact that it is
proprietary also means it can never be part of the main infrastructure,
nor run on Wikimedia Labs.


Matt Flaschen



Re: [Wikimedia-l] Copyright infringement - The real elephant in the room

2013-11-13 Thread Gerard Meijssen
Hoi,
Seriously, we should never ever be ruled by panic. What you see is bad, no
doubt, but the notion that we should dump everything because of the latest
issue to come along is way overboard.

   - by stopping the flow on projects like Visual Editor you break
   dependencies for the work of many developers
   - what you have noticed is for only one Wikipedia not all of them
   - we do need more mature discussion software; what we have is horrible
   - such dramatics only make you go away and upset others; they do not
   solve things
   - the dramatics distract me from your message
   - my hobby horse needs more attention too and I think my argument is
   better ...

Anyway, it would be nice if someone looked at the tool with an eye to
making it happen and making it scale. If it doesn't scale, it becomes a less
attractive option to pursue.
Thanks,
  GerardM


On 13 November 2013 08:40, James Heilman  wrote:

> The Wikimedia Foundation needs to wake up and deal with the "real tech
> elephant in the room". Our primary issue is not a lack of FLOW, a lack of a
> visual editor, or a lack of a rapidly expanding education program.
>
> Our biggest issue is copyright infringement. We have had the Indian
> program, we have had issues with the Education program, and I have today
> come across a user who has made nearly 20,000 edits to 1,742 article since
> 2006 which appear to be nearly all copy and pasted from the sources he has
> used.
> https://en.wikipedia.org/wiki/User_talk:DrMicro#Copyright_infringement
> This
> has seriously shaken my faith in Wikipedia.
>
> This is especially devastating as there is a tech solution that would have
> prevented it. The efforts are being worked on by volunteers here
> https://en.wikipedia.org/wiki/Wikipedia:Turnitin and has been since at
> least March of 2012. We NEED all tech resource at the foundation thrown at
> this project. Other less important project like FLOW and the visual editor
> need to be put on hold to develop this tool.
>
> --
> James Heilman
> MD, CCFP-EM, Wikipedian
>
> The Wikipedia Open Textbook of Medicine
> www.opentextbookofmedicine.com