Re: [Wikimedia-l] Copy and Paste Detection Bot
In reply to James Heilman jmh...@gmail.com, 2015-04-04 3:49 GMT+02:00:

Thanks, James.

Just out of curiosity: the other day I found two articles with a long section of identical wording; only the names and numbers had been changed. Example: "The town of ... has a population of ... . The town is known for its challenges in fighting poverty. According to local authorities, they have undertaken housing and sanitation projects", and so on.

When I queried it, the author of the earlier article responded that "it was acceptable", so that beginners could find it easier to start writing articles. From that I dug deeper and discovered that he had tutored the writer of the derived article.

Regards, and have a great weekend,
Rui
--
Rui Correia
Advocacy, Human Rights, Media and Language Work Consultant
Bridge to Angola - Angola Liaison Consultant
Mobile Number in South Africa +27 74 425 4186
___
Wikimedia-l mailing list, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines
Wikimedia-l@lists.wikimedia.org
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l, mailto:wikimedia-l-requ...@lists.wikimedia.org?subject=unsubscribe
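The kind of reuse Rui describes, identical wording with only names and numbers swapped, could in principle be spotted by masking those parts before comparing. The following is a rough sketch only, and nothing in this thread says the bot works this way: the function names and the crude masking rules (every number and every capitalized word is replaced by a placeholder) are assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher


def mask(text: str) -> str:
    """Replace numbers and capitalized words with placeholders (crude heuristic)."""
    # Numbers such as 12,000 or 3.5 become <NUM>.
    text = re.sub(r"\d+(?:[.,]\d+)*", "<NUM>", text)
    # Any capitalized word becomes <NAME>; this also hits sentence openers,
    # which is harmless because both texts are masked the same way.
    return re.sub(r"\b[A-Z][a-z]+\b", "<NAME>", text)


def similarity(a: str, b: str, masked: bool = True) -> float:
    """Word-level similarity ratio between two texts, optionally after masking."""
    if masked:
        a, b = mask(a), mask(b)
    return SequenceMatcher(None, a.split(), b.split()).ratio()


first = "The town of Alfa has a population of 12,000. The town is known for its challenges in fighting poverty."
second = "The town of Beta has a population of 3,450. The town is known for its challenges in fighting poverty."

print(similarity(first, second, masked=False))  # below 1.0: names and numbers differ
print(similarity(first, second))                # 1.0: identical once masked
```

A high masked score only says that two texts share a template; whether that is a problem remains a human judgment, as in the case above.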
[Wikimedia-l] Copy and Paste Detection Bot
The new and improved version of the copy and paste detection bot that we at [[WP:MED]] have been using for nearly a year [https://en.wikipedia.org/wiki/User:EranBot/Copyright here] is nearly ready to be expanded to other topic areas. It can be found here [https://en.wikipedia.org/wiki/User:EranBot/Copyright/rc].

If you install the common.js code, it will give you buttons to click to indicate follow-up of concerns. Additionally, one can sort the edits in question by WikiProject.

We are working to set up auto-archiving so that once concerns are dealt with, they are removed from the main list. We also want automatic compilation of data such as the frequency of true positives and false positives generated by the bot.

A blacklist of sites that are known mirrors of Wikipedia is here [https://en.wikipedia.org/wiki/User:EranBot/Copyright/Blacklist]. As this list is improved and expanded, the accuracy of the bot will improve.

Many thanks to [[User:ערן]] for his amazing work. The bot also has the potential to work in other languages.

--
James Heilman MD, CCFP-EM, Wikipedian
The Wikipedia Open Textbook of Medicine
www.opentextbookofmedicine.com
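To illustrate why a better mirror blacklist improves accuracy, here is a minimal Python sketch of filtering matched sources against such a list. It is not the bot's actual code: the function names, the sample domains, and the assumption that the blacklist is a plain set of domain names are all made up for the example.

```python
from urllib.parse import urlparse

# Hypothetical entries; the real blacklist is the wiki page linked above.
MIRROR_BLACKLIST = {"mirror.example.org", "wikicopy.example.com"}


def domain_of(url: str) -> str:
    """Return the lower-cased host name of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_blacklisted(url: str, blacklist: set[str] = MIRROR_BLACKLIST) -> bool:
    """True if the host is a blacklisted domain or a subdomain of one."""
    host = domain_of(url)
    return any(host == d or host.endswith("." + d) for d in blacklist)


def drop_mirrors(matches: list[str], blacklist: set[str] = MIRROR_BLACKLIST) -> list[str]:
    """Keep only matched source URLs that are not known Wikipedia mirrors."""
    return [url for url in matches if not is_blacklisted(url, blacklist)]


matches = [
    "https://www.mirror.example.org/wiki/Aspirin",
    "https://journal.example.net/articles/aspirin-review",
]
print(drop_mirrors(matches))  # only the journal URL is left for human review
```

Every mirror added to the list removes a class of matches in which the external site copied Wikipedia rather than the other way round.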
Re: [Wikimedia-l] Copy and Paste Detection Bot
In reply to James Heilman jmh...@gmail.com, 2015-04-03 12:07 GMT+03:00:

Hi, James.

Is the source code available anywhere? If you want to try your bot in other languages, I could help you with testing on the Russian Wikipedia :)

Best regards,
rubin16
Re: [Wikimedia-l] Copy and Paste Detection Bot
In reply to James Heilman jmh...@gmail.com, 2015-04-03 11:07 GMT+02:00:

Hi James,

I often suspect copy-paste and find exact matches of the text elsewhere. However, whereas one can painstakingly (unless there is a trick that I am not aware of) ascertain when text was entered into an article, it is not always possible to know when the other text first appeared on the internet, and so to know for sure who copied whom. From my limited knowledge, I believe that some trace of the date of upload must be retained somewhere in the code. Will this bot be able to pick up on that and provide a date?

Thanks and congratulations to all involved, and thanks for sharing.

Regards,
Rui
[Wikimedia-l] Copy and Paste Detection Bot
1) Yes, the source code is available. User:Eran has posted it here: https://github.com/valhallasw/plagiabot

2) This bot ONLY works on new edits, within a couple of hours of them occurring. This reduces the number of false positives. It DOES NOT look at old edits.

3) This requires human follow-up and common sense. One needs to make sure that a) the source is not PD/CC BY-SA, b) it is not wiki text that has been moved around, c) the authors of both are not the same, etc.

4) The true positive rate is around 50%, which from my perspective is good / useful. This bot has flagged a lot of copyright issues that would have been missed otherwise.

--
James Heilman MD, CCFP-EM, Wikipedian
The Wikipedia Open Textbook of Medicine
www.opentextbookofmedicine.com
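Points 2 to 4 can be sketched in Python as follows. This is an illustration under stated assumptions, not code from plagiabot: the `Flag` record, its field names, the license labels, and the 2-hour window standing in for "a couple of hours" are all hypothetical, and "true positive rate" is taken to mean the share of flags that reviewers confirmed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

RECENT_WINDOW = timedelta(hours=2)   # assumption: "a couple of hours"
FREE_LICENSES = {"PD", "CC BY-SA"}   # assumption: illustrative labels


@dataclass
class Flag:
    """One flagged edit as a reviewer might record it (hypothetical fields)."""
    edit_time: datetime
    source_license: str
    moved_wiki_text: bool
    same_author: bool


def is_recent(edit_time: datetime, now: datetime,
              window: timedelta = RECENT_WINDOW) -> bool:
    """Point 2: only edits made within the window are checked at all."""
    return timedelta(0) <= now - edit_time <= window


def needs_action(flag: Flag) -> bool:
    """Point 3: a flag is a real concern only if no exemption applies."""
    exempt = (flag.source_license in FREE_LICENSES
              or flag.moved_wiki_text
              or flag.same_author)
    return not exempt


def true_positive_rate(outcomes: list[bool]) -> float:
    """Point 4: share of reviewed flags confirmed as copyright problems."""
    if not outcomes:
        raise ValueError("no reviewed flags yet")
    return sum(outcomes) / len(outcomes)


now = datetime(2015, 4, 4, 12, 0, tzinfo=timezone.utc)
flags = [
    Flag(now - timedelta(hours=1), "all rights reserved", False, False),
    Flag(now - timedelta(hours=1), "PD", False, False),
]
outcomes = [needs_action(f) for f in flags if is_recent(f.edit_time, now)]
print(true_positive_rate(outcomes))  # 0.5 for this made-up sample
```

In practice the three checks in point 3 are made by a person reading the edit and the source; the booleans here only stand for the outcome of that review.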