Re: [Wikimedia-l] Copy and Paste Detection Bot

2015-04-04 Thread Rui Correia
Thanks James

Just out of curiosity, the other day I found two articles with a long
section of identical wording; only the names and numbers had been
changed. Example:
The town of ... has a population of ... . The town is known for
its challenges in fighting poverty. According to local authorities,
they have undertaken housing and sanitation projects bla bla bla.

When I queried it, the author of the earlier article responded that
'it was acceptable' so that beginners could find it easier to start
writing articles. From there I dug deeper and discovered that he had
tutored the writer of the derived article.

Regards, and a great weekend,

Rui

2015-04-04 3:49 GMT+02:00 James Heilman jmh...@gmail.com:
 1) Yes the source code is available. User:Eran has posted it here
 https://github.com/valhallasw/plagiabot

 2) This bot ONLY works on new edits within a couple of hours of them
 occurring. This reduces the number of false positives. It DOES NOT look at
 old edits.

 3) This requires human follow-up and common sense. One needs to make sure
 that a) the source is not PD or CC BY-SA, b) it is not wiki text that has
 been moved around, and c) the authors of both are not the same, etc.

 4) The true positive rate is around 50%, which from my perspective is good /
 useful. This bot has flagged a lot of copyright issues that would have been
 missed otherwise.

 --
 James Heilman
 MD, CCFP-EM, Wikipedian

 The Wikipedia Open Textbook of Medicine
 www.opentextbookofmedicine.com



-- 
_
Rui Correia
Advocacy, Human Rights, Media and Language Work Consultant
Bridge to Angola - Angola Liaison Consultant

Mobile Number in South Africa +27 74 425 4186
Número de Telemóvel na África do Sul +27 74 425 4186

[Wikimedia-l] Copy and Paste Detection Bot

2015-04-03 Thread James Heilman
The new and improved version of the copy and paste detection bot that we at
[[WP:MED]] have been using for nearly a year [
https://en.wikipedia.org/wiki/User:EranBot/Copyright here] is nearly ready
to be expanded to other topic areas.

It can be found here [
https://en.wikipedia.org/wiki/User:EranBot/Copyright/rc]. If you install
the common.js code, it will give you buttons to click to indicate follow-up
of concerns. Additionally, one can sort the edits in question by
WikiProject. We are working to set up auto-archiving so that once
concerns are dealt with they are removed from the main list.
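
To give a rough idea of that auto-archiving step, here is a minimal pywikibot
sketch; the resolved marker and the archive subpage name are placeholders,
since the actual format of the /rc page may differ:

    import pywikibot

    SOURCE = 'User:EranBot/Copyright/rc'           # the main report page
    ARCHIVE = 'User:EranBot/Copyright/rc/Archive'  # hypothetical archive subpage
    RESOLVED = '{{resolved}}'  # placeholder for whatever the review buttons write

    site = pywikibot.Site('en', 'wikipedia')
    source = pywikibot.Page(site, SOURCE)
    archive = pywikibot.Page(site, ARCHIVE)

    keep, move = [], []
    for line in source.text.splitlines():
        (move if RESOLVED in line else keep).append(line)

    if move:
        # Append the resolved reports to the archive, then drop them from the list.
        archive.text = (archive.text + '\n' + '\n'.join(move)).strip()
        archive.save(summary='Archiving resolved copy-and-paste reports (sketch)')
        source.text = '\n'.join(keep)
        source.save(summary='Removing archived reports (sketch)')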

We also want to have automatic compilation of data such as the frequency of
true positives and false positives generated by the bot. A blacklist of
sites that are known mirrors of Wikipedia is here [
https://en.wikipedia.org/wiki/User:EranBot/Copyright/Blacklist]. As this
list is improved / expanded, the accuracy of the bot will improve. Many
thanks to [[User:ערן]] for his amazing work.
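
For illustration, the way the blacklist feeds into the bot could be as simple
as a domain check before a match is reported; a minimal Python sketch, with
made-up blacklist entries and match records:

    from urllib.parse import urlsplit

    # Made-up entries; the real list lives at [[User:EranBot/Copyright/Blacklist]].
    MIRROR_BLACKLIST = {'example-mirror.org', 'another-mirror.example'}

    def is_mirror(url):
        """True if the matched source URL belongs to a known Wikipedia mirror."""
        host = urlsplit(url).hostname or ''
        return any(host == d or host.endswith('.' + d) for d in MIRROR_BLACKLIST)

    # Hypothetical matches produced by the detector; only non-mirrors get reported.
    matches = [
        {'diff': 111, 'source': 'https://www.example-mirror.org/wiki/Pneumonia'},
        {'diff': 222, 'source': 'https://some-news-site.example/story'},
    ]
    to_report = [m for m in matches if not is_mirror(m['source'])]
    print(to_report)  # only the second match survives

Presumably the real bot does something along these lines before a match ever
reaches the /rc page, which is why expanding the blacklist improves accuracy.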

The bot also has the potential to work in other languages.

-- 
James Heilman
MD, CCFP-EM, Wikipedian

The Wikipedia Open Textbook of Medicine
www.opentextbookofmedicine.com

Re: [Wikimedia-l] Copy and Paste Detection Bot

2015-04-03 Thread rubin.happy
Hi, James.

Is the source code available anywhere?
If you want to try your bot in other languages, I could help you with
testing on the Russian Wikipedia :)

Best regards.
rubin16

2015-04-03 12:07 GMT+03:00 James Heilman jmh...@gmail.com:

 The new and improved version of the copy and paste detection bot that we at
 [[WP:MED]] have been using for nearly a year [
 https://en.wikipedia.org/wiki/User:EranBot/Copyright here] is nearly ready
 to be expanded to other topic areas.

 It can be found here [
 https://en.wikipedia.org/wiki/User:EranBot/Copyright/rc]. If you install
 the common.js code, it will give you buttons to click to indicate follow-up
 of concerns. Additionally, one can sort the edits in question by
 WikiProject. We are working to set up auto-archiving so that once
 concerns are dealt with they are removed from the main list.

 We also want to have automatic compilation of data such as the frequency of
 true positives and false positives generated by the bot. A blacklist of
 sites that are known mirrors of Wikipedia is here [
 https://en.wikipedia.org/wiki/User:EranBot/Copyright/Blacklist]. As this
 list is improved / expanded, the accuracy of the bot will improve. Many
 thanks to [[User:ערן]] for his amazing work.

 The bot also has the potential to work in other languages.

 --
 James Heilman
 MD, CCFP-EM, Wikipedian

 The Wikipedia Open Textbook of Medicine
 www.opentextbookofmedicine.com

Re: [Wikimedia-l] Copy and Paste Detection Bot

2015-04-03 Thread Rui Correia
Hi James

I often suspect copy-paste and find exact matches of the text
elsewhere. However, whereas one can painstakingly (unless there is a
trick that I am not aware of) ascertain when text was entered into
an article, it is not always possible to know when the other text
first appeared on the internet, so as to know for sure who copied whom. From
my limited knowledge, I believe that some trace of the date of upload
must be retained somewhere in the code - will this bot be able to pick
up on that and provide a date?
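
For the Wikipedia side of the question the revision history already carries a
date; here is a rough sketch (not part of the bot) against the current
MediaWiki action API that walks an article's revisions oldest-first and
reports when a given phrase first appeared. The title and phrase are just
placeholders:

    import requests

    API = 'https://en.wikipedia.org/w/api.php'

    def first_revision_containing(title, phrase):
        """Return the timestamp of the earliest revision whose wikitext contains phrase."""
        params = {
            'action': 'query', 'format': 'json', 'formatversion': 2,
            'titles': title, 'prop': 'revisions', 'rvdir': 'newer',
            'rvlimit': 50, 'rvprop': 'timestamp|content', 'rvslots': 'main',
        }
        while True:
            data = requests.get(API, params=params).json()
            for rev in data['query']['pages'][0].get('revisions', []):
                if phrase in rev['slots']['main']['content']:
                    return rev['timestamp']
            if 'continue' not in data:
                return None  # phrase never appeared in this article
            params.update(data['continue'])

    print(first_revision_containing('Pneumonia', 'some suspect sentence'))

A binary search over revisions (the approach tools like WikiBlame take) would
be faster, but the idea is the same; dating the external copy is indeed the
harder half of the problem.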

Thanks and congratulations to all involved, and thanks for sharing.

Regards,

Rui

2015-04-03 11:07 GMT+02:00 James Heilman jmh...@gmail.com:
 The new and improved version of the copy and paste detection bot that we at
 [[WP:MED]] have been using for nearly a year [
 https://en.wikipedia.org/wiki/User:EranBot/Copyright here] is nearly ready
 to be expanded to other topic areas.

 It can be found here [
 https://en.wikipedia.org/wiki/User:EranBot/Copyright/rc]. If you install
 the common.js code, it will give you buttons to click to indicate follow-up
 of concerns. Additionally, one can sort the edits in question by
 WikiProject. We are working to set up auto-archiving so that once
 concerns are dealt with they are removed from the main list.

 We also want to have automatic compilation of data such as the frequency of
 true positives and false positives generated by the bot. A blacklist of
 sites that are known mirrors of Wikipedia is here [
 https://en.wikipedia.org/wiki/User:EranBot/Copyright/Blacklist]. As this
 list is improved / expanded, the accuracy of the bot will improve. Many
 thanks to [[User:ערן]] for his amazing work.

 The bot also has the potential to work in other languages.

 --
 James Heilman
 MD, CCFP-EM, Wikipedian

 The Wikipedia Open Textbook of Medicine
 www.opentextbookofmedicine.com



-- 
_
Rui Correia
Advocacy, Human Rights, Media and Language Work Consultant
Bridge to Angola - Angola Liaison Consultant

Mobile Number in South Africa +27 74 425 4186
Número de Telemóvel na África do Sul +27 74 425 4186

[Wikimedia-l] Copy and Paste Detection Bot

2015-04-03 Thread James Heilman
1) Yes the source code is available. User:Eran has posted it here
https://github.com/valhallasw/plagiabot

2) This bot ONLY works on new edits within a couple of hours of them
occurring. This reduces the number of false positives. It DOES NOT look at
old edits.

3) This requires human follow-up and common sense. One needs to make sure
that a) the source is not PD or CC BY-SA, b) it is not wiki text that has
been moved around, and c) the authors of both are not the same, etc.

4) The true positive rate is around 50%, which from my perspective is good /
useful. This bot has flagged a lot of copyright issues that would have been
missed otherwise.
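
On the "frequency of true positives and false positives" mentioned earlier,
the bookkeeping is trivial once reviewers record a verdict per flagged edit; a
toy Python sketch with made-up verdicts:

    from collections import Counter

    # Made-up verdicts; in practice these would come from the review buttons
    # on [[User:EranBot/Copyright/rc]].
    verdicts = ['copyvio', 'false positive', 'copyvio', 'mirror', 'copyvio',
                'false positive', 'backwards copy', 'copyvio']

    counts = Counter(verdicts)
    true_positive_rate = counts['copyvio'] / len(verdicts)
    print(counts)
    print('true positive rate: {:.0%}'.format(true_positive_rate))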

-- 
James Heilman
MD, CCFP-EM, Wikipedian

The Wikipedia Open Textbook of Medicine
www.opentextbookofmedicine.com

Re: [Wikimedia-l] Copy and paste

2012-10-20 Thread Ray Saintonge

On 10/17/12 10:26 PM, James Heilman wrote:

We really need a plagiarism detection tool so that we can make sure our
sources are not simply copy-and-pastes of older versions of Wikipedia.
Today I was happily improving our article on pneumonia as I have a day off.
I came across a recommendation that babies should be suctioned at birth to
decrease their risk of pneumonia, with a {{cn}} tag. So I went to Google
Books and up came a book that supported it perfectly. And then I noticed
that this book supported the previous and next few sentences as well. It
also supported a number of other sections we had in the article but was
missing our references. The book was selling for $340 a copy. Our articles
have improved a great deal since 2007, and yet schools are buying copy-edited
versions of Wikipedia from 5 years ago. The bit about suctioning babies at
birth was wrong and I have corrected it. I think we need to get this
news out. Support Wikipedia and use the latest version online!

Further details / discussion are here:
http://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_Medicine#Can_we_still_use_books_as_refs.3F



This situation was entirely predictable, even if its particular 
circumstances weren't. I ran into something of the sort as far back as 
2003. I have long since lost track of references to the incident; it had 
to do with literary biographies of long-dead authors, and was thus much less 
critical than in a medical article. The broader question goes well 
beyond simple matters of plagiarism or copyright infringement. The 
passages will often be short enough that a fair dealing claim is 
available, and the moral right to be credited for one's work has no 
meaningful legal enforcement to back it up. To those familiar with these 
things, that right isn't even controversial.


The disputed version in this case is a mere five years old. Over a 
longer time, one that could encompass the entire validity period of a 
copyright, we could easily see such a thing bounce back and forth many 
times over without ever being discovered. A bot could do some of the 
searching for infringing material; it might even look through archived and 
archaic versions of a document. I believe that at some point any such 
process reaches a limit. The broader solution will need to be more 
imaginative than more police work.


Ray



Re: [Wikimedia-l] Copy and paste

2012-10-18 Thread Federico Leva (Nemo)
How hard would it be to set up a tool like the software that, as far as I 
know, MIT uses to automatically check for plagiarism among theses etc. 
submitted to its digital library: checking the text of all Wikimedia 
projects against, e.g., newspaper websites and Google Books, and then 
publishing the results in some visually appealing way to show how much 
newspapers copy from Wikipedia and from each other?
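
The core comparison such a tool needs is not exotic; a rough Python sketch of
word n-gram ("shingle") overlap between two texts, with placeholder strings
standing in for a crawled newspaper page and a Wikipedia article:

    def shingles(text, n=5):
        """Set of overlapping n-word sequences, lowercased."""
        words = text.lower().split()
        return {' '.join(words[i:i + n]) for i in range(len(words) - n + 1)}

    def overlap(candidate, source, n=5):
        """Fraction of the candidate's shingles that also occur in the source."""
        cand, src = shingles(candidate, n), shingles(source, n)
        return len(cand & src) / len(cand) if cand else 0.0

    news_story = 'text of a newspaper article ...'    # placeholder
    wiki_article = 'text of a Wikipedia article ...'  # placeholder
    print(overlap(news_story, wiki_article))  # values near 1.0 suggest copying

The hard part would be the crawling and the scale, not the comparison itself.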
On it.wiki we regularly see complaints and unhappy discussions about 
newspaper articles which are just a copy-and-paste from Wikipedia and 
still feature a COPY RESERVED warning without citing any source... 
Newspapers are by definition arrogant, so nothing can be done to stop 
them, but an informative tool would be useful and might be as effective 
as WikiScanner was with regard to IP editing from organizations.


Nemo

James Heilman, 18/10/2012 07:26:

We really need a plagiarism detection tool so that we can make sure our
sources are not simply copy-and-pastes of older versions of Wikipedia.
Today I was happily improving our article on pneumonia as I have a day off.
I came across a recommendation that babies should be suctioned at birth to
decrease their risk of pneumonia, with a {{cn}} tag. So I went to Google
Books and up came a book that supported it perfectly. And then I noticed
that this book supported the previous and next few sentences as well. It
also supported a number of other sections we had in the article but was
missing our references. The book was selling for $340 a copy. Our articles
have improved a great deal since 2007, and yet schools are buying copy-edited
versions of Wikipedia from 5 years ago. The bit about suctioning babies at
birth was wrong and I have corrected it. I think we need to get this
news out. Support Wikipedia and use the latest version online!

Further details / discussion are here:
http://en.wikipedia.org/wiki/Wikipedia_talk:WikiProject_Medicine#Can_we_still_use_books_as_refs.3F





Re: [Wikimedia-l] Copy and paste

2012-10-18 Thread Tom Morris
On Thu, Oct 18, 2012 at 6:26 AM, James Heilman jmh...@gmail.com wrote:
 We really need a plagiarism detection tool so that we can make sure our
 sources are not simply copy-and-pastes of older versions of Wikipedia.
 Today I was happily improving our article on pneumonia as I have a day off.
 I came across a recommendation that babies should be suctioned at birth to
 decrease their risk of pneumonia, with a {{cn}} tag. So I went to Google
 Books and up came a book that supported it perfectly. And then I noticed
 that this book supported the previous and next few sentences as well. It
 also supported a number of other sections we had in the article but was
 missing our references. The book was selling for $340 a copy. Our articles
 have improved a great deal since 2007, and yet schools are buying copy-edited
 versions of Wikipedia from 5 years ago. The bit about suctioning babies at
 birth was wrong and I have corrected it. I think we need to get this
 news out. Support Wikipedia and use the latest version online!


It's sort of unrelated, but there's a project called Common Crawl:

http://commoncrawl.org/

It is trying to produce an open crawl of the web (much as Google,
Bing etc. have for their search engines).

Now that the copyvio bot is down, I'm wondering if someone would be
interested in building something that used the Common Crawl database,
or whether that'd be practical.
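
For what it's worth, here is a rough sketch of how one might pull candidate
pages out of the Common Crawl URL index service at index.commoncrawl.org; the
collection name below is only an example (current ones are listed on that
site) and the domain is a placeholder:

    import json
    import requests

    # Example collection name; check https://index.commoncrawl.org/ for current ones.
    INDEX = 'https://index.commoncrawl.org/CC-MAIN-2015-14-index'

    def crawled_captures(url_pattern):
        """Yield index records for captures matching the URL pattern."""
        resp = requests.get(INDEX, params={'url': url_pattern, 'output': 'json'})
        for line in resp.text.splitlines():
            if line.strip():
                yield json.loads(line)

    for rec in crawled_captures('some-news-site.example/*'):
        # Each record points at the WARC file and offset holding the captured page,
        # which a comparison job could fetch and check against Wikipedia text.
        print(rec.get('timestamp'), rec.get('url'))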

-- 
Tom Morris
http://tommorris.org/
