Re: [Wikimedia-l] Copy and Paste Detection Bot

2015-04-04 Thread Rui Correia
Thanks James

Just out of curiosity: the other day I found two articles with a long
section of identical wording in which only the names and numbers had been
changed. Example:
The town of ... has a population of ... . The town is known for
its challenges in fighting poverty. According to local authorities,
they have undertaken housing and sanitation projects bla bla bla.

When I queried it, the author of the earlier article responded to say
that 'it was acceptable' so that beginners could find it easier to
start writing articles. From that I dug deeper and discovered that he
had tutored the writer of the derived article.
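
A rough way to spot this kind of "same template, different names and numbers" text is to mask the names and numbers before comparing. The sketch below is only an illustration and is not part of the bot discussed in this thread; the function names `mask` and `template_similarity` are made up for the example, and treating every capitalised word as a name is a crude assumption.

```python
import re
from difflib import SequenceMatcher


def mask(text):
    """Tokenise text, replacing numbers and capitalised words with placeholders.

    Assumption: any capitalised word is treated as a name. This also masks
    sentence-initial words, which is crude but affects both texts equally.
    """
    tokens = re.findall(r"[^\W\d_]+|\d[\d,.]*", text)
    masked = []
    for tok in tokens:
        if tok[0].isdigit():
            masked.append("<num>")
        elif tok[0].isupper():
            masked.append("<name>")
        else:
            masked.append(tok)
    return masked


def template_similarity(a, b):
    """Return a 0..1 similarity ratio between the masked token sequences."""
    return SequenceMatcher(None, mask(a), mask(b)).ratio()
```

A ratio close to 1 would suggest that two sections share a template; a person still has to decide whether that is a problem.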

Regards, and a great weekend,

Rui

2015-04-04 3:49 GMT+02:00 James Heilman jmh...@gmail.com:
 1) Yes the source code is available. User:Eran has posted it here
 https://github.com/valhallasw/plagiabot

 2) This bot ONLY works on new edits within a couple of hours of them
 occurring. This reduces the number of false positives. It DOES NOT look at
 old edits.

 3) This requires human follow up and common sense. One needs to make sure
 that a) the source is not PD/CCBYSA b) that it is not wiki text that has
 been moved around c) that the authors of both are not the same, etc

 4) True positive rate is around 50% which is from my perspective good /
 useful. This bot has flagged a lot of copyright issues that would have been
 missed otherwise.

 --
 James Heilman
 MD, CCFP-EM, Wikipedian

 The Wikipedia Open Textbook of Medicine
 www.opentextbookofmedicine.com
 ___
 Wikimedia-l mailing list, guidelines at: 
 https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines
 Wikimedia-l@lists.wikimedia.org
 Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l, 
 mailto:wikimedia-l-requ...@lists.wikimedia.org?subject=unsubscribe



-- 
_
Rui Correia
Advocacy, Human Rights, Media and Language Work Consultant
Bridge to Angola - Angola Liaison Consultant

Mobile Number in South Africa +27 74 425 4186
Número de Telemóvel na África do Sul +27 74 425 4186
___


[Wikimedia-l] Copy and Paste Detection Bot

2015-04-03 Thread James Heilman
The new and improved version of the copy and paste detection bot that we at
[[WP:MED]] have been using for nearly a year
[https://en.wikipedia.org/wiki/User:EranBot/Copyright here] is nearly ready
to be expanded to other topic areas.

It can be found here:
[https://en.wikipedia.org/wiki/User:EranBot/Copyright/rc]. If you install
the common.js code it will give you buttons to click to indicate follow-up
on concerns. Additionally, one can sort the edits in question by
WikiProject. We are working to set up auto-archiving so that once
concerns are dealt with they will be removed from the main list.

We also want to have automatic compilation of data such as the frequency of
true positives and false positives generated by the bot. A blacklist of
sites that are known mirrors of Wikipedia is here:
[https://en.wikipedia.org/wiki/User:EranBot/Copyright/Blacklist]. As this
list is improved / expanded the accuracy of the bot will improve. Many
thanks to [[User:ערן]] for his amazing work.
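
To illustrate why a mirror blacklist improves accuracy, here is a minimal sketch of dropping matches whose source is a known mirror. It is not the bot's actual code: the host names in `MIRROR_HOSTS`, the `source_url` key and the function names are all hypothetical, and the real list is the wiki page linked above.

```python
from urllib.parse import urlparse

# Hypothetical mirror hosts, for illustration only.
MIRROR_HOSTS = {"mirror.example.org", "wiki-copy.example.com"}


def is_mirror(url, mirror_hosts=MIRROR_HOSTS):
    """Return True if the URL's host is a listed mirror or a subdomain of one."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == m or host.endswith("." + m) for m in mirror_hosts)


def drop_mirror_matches(matches, mirror_hosts=MIRROR_HOSTS):
    """Keep only matches whose source URL is not a known mirror.

    Each match is assumed to be a dict with a "source_url" key.
    """
    return [m for m in matches if not is_mirror(m["source_url"], mirror_hosts)]
```

A match against a mirror only shows that the mirror copied Wikipedia, so filtering these out before human review removes one class of false positives.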

The bot also has the potential to work in other languages.

-- 
James Heilman
MD, CCFP-EM, Wikipedian

The Wikipedia Open Textbook of Medicine
www.opentextbookofmedicine.com

Re: [Wikimedia-l] Copy and Paste Detection Bot

2015-04-03 Thread rubin.happy
Hi, James.

Is the source code available anywhere?
If you want to try your bot in other languages, I could help you with
testing in Russian Wikipedia :)

Best regards.
rubin16


Re: [Wikimedia-l] Copy and Paste Detection Bot

2015-04-03 Thread Rui Correia
Hi James

I often suspect copy-paste and find exact matches of the text
elsewhere. However, whereas one can painstakingly (unless there is a
trick that I am not aware of) ascertain when text was entered into
an article, it is not always possible to know when the other text
first appeared on the internet, and so to know for sure who copied whom. From
my limited knowledge, I believe that some trace of the date of upload
must be retained somewhere in the code - will this bot be able to pick
up on that and provide a date?

Thanks and congratulations to all involved and for sharing.

Regards,

Rui




-- 
_
Rui Correia
Advocacy, Human Rights, Media and Language Work Consultant
Bridge to Angola - Angola Liaison Consultant

Mobile Number in South Africa +27 74 425 4186
Número de Telemóvel na África do Sul +27 74 425 4186
___


[Wikimedia-l] Copy and Paste Detection Bot

2015-04-03 Thread James Heilman
1) Yes the source code is available. User:Eran has posted it here
https://github.com/valhallasw/plagiabot

2) This bot ONLY works on new edits within a couple of hours of them
occurring. This reduces the number of false positives. It DOES NOT look at
old edits.

3) This requires human follow up and common sense. One needs to make sure
that a) the source is not PD/CCBYSA b) that it is not wiki text that has
been moved around c) that the authors of both are not the same, etc

4) True positive rate is around 50% which is from my perspective good /
useful. This bot has flagged a lot of copyright issues that would have been
missed otherwise.
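
Points 2 and 4 above can be sketched in a few lines. This is an illustration only, not code from the plagiabot repository: the two-hour `WINDOW` is an assumed reading of "a couple of hours", and the function names are made up. Note that the "true positive rate" here is the share of flagged edits that a reviewer confirmed, which is strictly speaking the precision of the bot.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "a couple of hours" is taken to mean two hours.
WINDOW = timedelta(hours=2)


def is_recent(edit_time, now, window=WINDOW):
    """Return True if the edit happened within `window` before `now`."""
    return timedelta(0) <= now - edit_time <= window


def true_positive_rate(verdicts):
    """Share of flagged edits confirmed as real problems by a human reviewer.

    `verdicts` is an iterable of booleans, True meaning "confirmed".
    Returns None when nothing has been reviewed yet.
    """
    verdicts = list(verdicts)
    if not verdicts:
        return None
    return sum(verdicts) / len(verdicts)
```

For example, if reviewers confirm two of four flagged edits, `true_positive_rate([True, False, True, False])` gives 0.5, in line with the roughly 50% figure quoted above.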

-- 
James Heilman
MD, CCFP-EM, Wikipedian

The Wikipedia Open Textbook of Medicine
www.opentextbookofmedicine.com