Re: [Wiki-research-l] Kill the bots

2014-05-21 Thread Oliver Keyes
Okay. Methodology:

*Take the last 5 days of request logs;
*Filter them down to text/html requests as a heuristic for non-API requests;
*Run them through the UA parser we use;
*Exclude spiders and things which reported valid browsers;
*Aggregate the user agents left;
*???
*Profit
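A minimal sketch of the pipeline above in Python. Everything here is a stand-in: the record schema (`content_type`, `user_agent`) and both regexes are hypothetical simplifications of the real sampled-log format and the UA parser Oliver mentions.

```python
import re
from collections import Counter

# Hypothetical patterns; the real Wikimedia logs use a dedicated UA parser.
SPIDER_RE = re.compile(r"spider|crawler|slurp", re.IGNORECASE)
BROWSER_RE = re.compile(r"^(?:Mozilla|Opera)/\d")

def aggregate_bot_agents(records):
    """Count user agents of text/html requests (heuristic for non-API
    traffic) that are neither known spiders nor valid browsers."""
    counts = Counter()
    for rec in records:
        if rec["content_type"] != "text/html":
            continue  # treat non-HTML responses as API traffic
        ua = rec["user_agent"]
        if SPIDER_RE.search(ua) or BROWSER_RE.match(ua):
            continue  # exclude spiders and things reporting valid browsers
        counts[ua] += 1
    return counts

sample = [
    {"content_type": "text/html", "user_agent": "DotNetWikiBot/3.0"},
    {"content_type": "text/html", "user_agent": "Mozilla/5.0 (X11; Linux x86_64)"},
    {"content_type": "application/json", "user_agent": "SomeBot/1.0"},
]
print(aggregate_bot_agents(sample).most_common())
# [('DotNetWikiBot/3.0', 1)]
```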

It looks like there are a relatively small number of bots that
browse/interact via the web - ones I can identify include WPCleaner[0],
which is semi-automated; something I can't find through WP or Google, called
DigitalsmithsBot (could be internal, could be external); and Hoo Bot (run
by User:Hoo man). My biggest concern is DotNetWikiBot, which is a general
framework that could be masking multiple underlying bots and has ~ 7.4m
requests through the web interface in that time period.

Obvious caveat is obvious; the edits from these tools may actually come
through the API, but they're choosing to request content through the web
interface for some weird reason. I don't know enough about the software
behind each bot to comment on that. I can try explicitly looking for
web-based edit attempts, but there would be far fewer observations in which
the bots might appear, because the underlying dataset is sampled at a 1:1000
rate.

[0] https://en.wikipedia.org/wiki/User:NicoV/Wikipedia_Cleaner/Documentation


On 20 May 2014 07:50, Oliver Keyes oke...@wikimedia.org wrote:

 Actually, belay that, I have a pretty good idea. I'll fire the log parser
 up now.


 On 20 May 2014 01:21, Oliver Keyes oke...@wikimedia.org wrote:

 I think a *lot* of them use the API, but I don't know off the top of my
 head if it's *all* of them. If only we knew somebody who has spent the
 last 3 months staring into the cthulian nightmare of our request logs and
 could look this up...

 More seriously; drop me a note off-list so that I can try to work out
 precisely what you need me to find out, and I'll write a quick-and-dirty
 parser of our sampled logs to drag the answer kicking and screaming into
 the light.

 (sorry, it's annual review season. That always gets me blithe.)


 On 19 May 2014 13:03, Scott Hale computermacgy...@gmail.com wrote:

 Thanks all for the comments on my paper, and even more thanks to
 everyone sharing these super helpful ideas on filtering bots: this is why I
 love the Wikipedia research committee.

 I think Oliver is definitely right that

  this would be a useful topic for some piece of method-comparing
 research, if anyone is looking for paper ideas.

 "Citation goldmine", as one friend called it, I think.

 This won't address edit logs to date, but do we know if most bots and
 automated tools use the API to make edits? If so, would it be feasible
 to add a flag to each edit indicating whether it came through the API or not?
 This won't stop determined users, but might be a nice way to identify
 cyborg edits from those made manually by the same user for many of the
 standard tools going forward.

 The closest thing I found in the bug tracker is [1], but it doesn't
 address the issue of 'what is a bot', which this thread has clearly shown is
 quite complex. An API-edit vs. non-API-edit flag might be a way forward unless
 there are automated tools/bots that don't use the API.


 1. https://bugzilla.wikimedia.org/show_bug.cgi?id=11181


 Cheers,
 Scott

 ___
 Wiki-research-l mailing list
 Wiki-research-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l








-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation


Re: [Wiki-research-l] Kill the bots

2014-05-21 Thread Scott Hale
Thank you, Oliver,

This is really interesting and gives some credibility to the idea that the
ability to track API/non-API edits could address the bot problem in part,
but definitely could miss some bots. Thank you very much for your time to
check this and share the results. Does anyone think it would be worth
requesting a feature for logging API/non-API edits in the issue tracker?

I'm also very interested in continuing to hear suggestions on how everyone
is identifying bots in existing data (or if they think this is
unnecessary). Apologies if I hijacked the thread slightly by asking about
API/non-API bots.

Best wishes,
Scott



On Wed, May 21, 2014 at 11:06 PM, Oliver Keyes oke...@wikimedia.org wrote:


 Okay. Methodology:

 *Take the last 5 days of request logs;
 *Filter them down to text/html requests as a heuristic for non-API
 requests;
 *Run them through the UA parser we use;
 *Exclude spiders and things which reported valid browsers;
 *Aggregate the user agents left;
 *???
 *Profit

 It looks like there are a relatively small number of bots that
 browse/interact via the web - ones I can identify include WPCleaner[0],
 which is semi-automated; something I can't find through WP or Google, called
 DigitalsmithsBot (could be internal, could be external); and Hoo Bot (run
 by User:Hoo man). My biggest concern is DotNetWikiBot, which is a general
 framework that could be masking multiple underlying bots and has ~ 7.4m
 requests through the web interface in that time period.

 Obvious caveat is obvious; the edits from these tools may actually come
 through the API, but they're choosing to request content through the web
 interface for some weird reason. I don't know enough about the software
 behind each bot to comment on that. I can try explicitly looking for
 web-based edit attempts, but there would be far fewer observations in which
 the bots might appear, because the underlying dataset is sampled at a 1:1000
 rate.

 [0]
 https://en.wikipedia.org/wiki/User:NicoV/Wikipedia_Cleaner/Documentation


 On 20 May 2014 07:50, Oliver Keyes oke...@wikimedia.org wrote:

 Actually, belay that, I have a pretty good idea. I'll fire the log parser
 up now.


 On 20 May 2014 01:21, Oliver Keyes oke...@wikimedia.org wrote:

 I think a *lot* of them use the API, but I don't know off the top of my
 head if it's *all* of them. If only we knew somebody who has spent the
 last 3 months staring into the cthulian nightmare of our request logs and
 could look this up...

 More seriously; drop me a note off-list so that I can try to work out
 precisely what you need me to find out, and I'll write a quick-and-dirty
 parser of our sampled logs to drag the answer kicking and screaming into
 the light.

 (sorry, it's annual review season. That always gets me blithe.)


 On 19 May 2014 13:03, Scott Hale computermacgy...@gmail.com wrote:

 Thanks all for the comments on my paper, and even more thanks to
 everyone sharing these super helpful ideas on filtering bots: this is why I
 love the Wikipedia research committee.

 I think Oliver is definitely right that

  this would be a useful topic for some piece of method-comparing
 research, if anyone is looking for paper ideas.

 "Citation goldmine", as one friend called it, I think.

 This won't address edit logs to date, but do we know if most bots and
 automated tools use the API to make edits? If so, would it be feasible
 to add a flag to each edit indicating whether it came through the API or not?
 This won't stop determined users, but might be a nice way to identify
 cyborg edits from those made manually by the same user for many of the
 standard tools going forward.

 The closest thing I found in the bug tracker is [1], but it doesn't
 address the issue of 'what is a bot', which this thread has clearly shown is
 quite complex. An API-edit vs. non-API-edit flag might be a way forward unless
 there are automated tools/bots that don't use the API.


 1. https://bugzilla.wikimedia.org/show_bug.cgi?id=11181


 Cheers,
 Scott





-- 
Scott Hale
Oxford Internet Institute
University of Oxford
http://www.scotthale.net/
scott.h...@oii.ox.ac.uk


Re: [Wiki-research-l] Kill the bots

2014-05-20 Thread Oliver Keyes
I think a *lot* of them use the API, but I don't know off the top of my
head if it's *all* of them. If only we knew somebody who has spent the last
3 months staring into the cthulian nightmare of our request logs and could
look this up...

More seriously; drop me a note off-list so that I can try to work out
precisely what you need me to find out, and I'll write a quick-and-dirty
parser of our sampled logs to drag the answer kicking and screaming into
the light.

(sorry, it's annual review season. That always gets me blithe.)


On 19 May 2014 13:03, Scott Hale computermacgy...@gmail.com wrote:

 Thanks all for the comments on my paper, and even more thanks to everyone
 sharing these super helpful ideas on filtering bots: this is why I love the
 Wikipedia research committee.

 I think Oliver is definitely right that

  this would be a useful topic for some piece of method-comparing
 research, if anyone is looking for paper ideas.

 "Citation goldmine", as one friend called it, I think.

 This won't address edit logs to date, but do we know if most bots and
 automated tools use the API to make edits? If so, would it be feasible
 to add a flag to each edit indicating whether it came through the API or not?
 This won't stop determined users, but might be a nice way to identify
 cyborg edits from those made manually by the same user for many of the
 standard tools going forward.

 The closest thing I found in the bug tracker is [1], but it doesn't
 address the issue of 'what is a bot', which this thread has clearly shown is
 quite complex. An API-edit vs. non-API-edit flag might be a way forward unless
 there are automated tools/bots that don't use the API.


 1. https://bugzilla.wikimedia.org/show_bug.cgi?id=11181


 Cheers,
 Scott





-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation


Re: [Wiki-research-l] Kill the bots

2014-05-19 Thread WereSpielChequers
If your bot is only running automated reports in its own userspace then it
doesn't need a bot flag. But it probably won't be a very active bot, so it may
not be a problem for your stats.

On the English language Wikipedia you are going to be fairly close if you
exclude all accounts which currently have a bot flag, plus this list of former
bots: https://en.wikipedia.org/wiki/Wikipedia:List_of_Wikipedians_by_number_of_edits/Unflagged_bots
(I occasionally maintain this in order for the list of editors by edit count
to work; as of a couple of weeks ago, when I last checked, I believe it to be
a comprehensive list of retired bots with 6,000 or more edits), and perhaps
the individual with a very high edit count who has in the past been blocked
for running unauthorised bots on his user account. (I won't name that
account on list, but since it also contains a large number of manual edits,
the true answer is that you can't get an exact divide between bots and
non-bots by classifying every account as either a bot or a human.)

If you are minded to treat all accounts containing the syllable bot as
bots, then you might want to tweak that to count anyone on these two lists as
human, even if their name includes bot:
https://en.wikipedia.org/wiki/Wikipedia:List_of_Wikipedians_by_number_of_edits
https://en.wikipedia.org/wiki/Wikipedia:List_of_Wikipedians_by_number_of_edits/5001%E2%80%931
I check those lists occasionally and make sure that the only 'bots' included
are actually human editors.
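For illustration, the combination of signals described above could be sketched as follows; the set contents and the trailing-"bot" name heuristic are assumptions for the sketch, not the actual lists:

```python
def probably_bot(username, flagged_bots, former_bots, human_whitelist):
    """Exclude an account if it has a bot flag, appears on the
    former-bots list, or has a bot-like name -- unless it is on the
    known-human lists described above."""
    if username in human_whitelist:
        return False
    return (username in flagged_bots
            or username in former_bots
            or username.lower().endswith("bot"))

# Illustrative sets, not the real lists:
flagged = {"ClueBot NG"}
former = {"SmackBot"}
humans = {"PauCabot"}  # human editors whose names end in "bot"

print(probably_bot("SmackBot", flagged, former, humans))  # True: former bot
print(probably_bot("PauCabot", flagged, former, humans))  # False: whitelisted
print(probably_bot("Abbot", flagged, former, humans))     # True: the name
# heuristic misfires on names like "Abbot" unless they too are whitelisted
```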


On 18 May 2014 20:33, R.Stuart Geiger sgei...@gmail.com wrote:

 Tsk tsk tsk, Brian. When the revolution comes, bot discriminators will get
 no mercy. :-)

 But seriously, my tl;dr: instead of asking if an account is or isn't a
 bot, ask whether a set of edits is or is not automated.

 Great responses so far: searching usernames for *bot will exclude non-bot
 users who were registered before the username policy change (although *Bot
 is a bit better), and the logging table is a great way to collect bot
 flags. However, Scott is right -- the bot flag (or *Bot username) doesn't
 signify a bot, it signifies a bureaucrat recognizing that a user account
 successfully went through the Bot Approval Group process. If I see an
 account with a bot flag, I can generally assume the edits that account
 makes are initiated by an automated software agent. This is especially the
 case in the main namespace. The inverse assumption is not nearly as easy: I
 can't assume that every edit made from an account *without* a bot flag was
 *not* an automated edit.

 About unauthorized bots: yes, there are a relatively small number of
 Wikipedians who, on occasion, run fully-automated, continuously-operating
 bots without approval. Complicating this, if someone is going to take the
 time to build and run a bot, but isn't going to create a separate account
 for it, then it is likely that they are also using that account to do
 non-automated edits. Sometimes new bot developers will run an unauthorized
 bot under their own account during the initial stages of development, and
 only later in the process will they create a separate bot account and seek
 formal approval and flagging. It can get tricky when you exclude all the
 edits from an account for being automated based on a single suspicious set
 of edits.

 More commonly, there are many more people who use automated batch tools
 like AutoWikiBrowser to support one-off tasks, like mass find-and-replace
 or category cleanup. Accounts powered by AWB are technically not bots,
 only because a human has to sit there and click save for every batch edit
 that is made. Some people will create a separate bot account for AWB work
 and get it approved and flagged, but many more will not bother. Then
 there are people using semi-automated, human-in-the-loop tools like Huggle
 to do vandal fighting. I find that the really hard question is whether
 you include or exclude these different kinds of 'cyborgs', because it
 really makes you think hard about what exactly you're measuring. Is
 someone who does a mass find-and-replace on all articles in a category a
 co-author of each article they edit? Is a vandal fighter patrolling the
 recent changes feed with Huggle a co-author of all the articles they edit
 when they revert vandalism and then move on to the next diff? What about
 somebody using rollback in the web browser? If so, what is it that makes
 these entities authors and ClueBot NG not an author?

 When you think about it, user accounts are actually pretty remarkable in
 that they allow such a diverse set of uses and agents to be attributed to a
 single entity. So when it comes to identifying automation, I personally
 think it is better to shift the unit of analysis from the user account to
 the individual edit. A bot flag lets you assume all edits from an account
 are automated, but you can use a range of approaches to identifying sets of
 automated edits from non-flagged accounts. Then I have a set of regex SQL
 queries in the Query Library 

Re: [Wiki-research-l] Kill the bots

2014-05-19 Thread Oliver Keyes
That would cover most of them, but runs into the problem that you're only
including the unauthorised bots written poorly enough that we've caught the
operator ;). It seems like this would be a useful topic for some piece of
method-comparing research, if anyone is looking for paper ideas.


On 19 May 2014 03:30, WereSpielChequers werespielchequ...@gmail.com wrote:

 If your bot is only running automated reports in its own userspace then it
 doesn't need a bot flag. But it probably won't be a very active bot, so it may
 not be a problem for your stats.

 On the English language Wikipedia you are going to be fairly close if you
 exclude all accounts which currently have a bot flag, plus this list of former
 bots: https://en.wikipedia.org/wiki/Wikipedia:List_of_Wikipedians_by_number_of_edits/Unflagged_bots
 (I occasionally maintain this in order for the list of editors by edit count
 to work; as of a couple of weeks ago, when I last checked, I believe it to be
 a comprehensive list of retired bots with 6,000 or more edits), and perhaps
 the individual with a very high edit count who has in the past been blocked
 for running unauthorised bots on his user account. (I won't name that
 account on list, but since it also contains a large number of manual edits,
 the true answer is that you can't get an exact divide between bots and
 non-bots by classifying every account as either a bot or a human.)

 If you are minded to treat all accounts containing the syllable bot as
 bots, then you might want to tweak that to count anyone on these two lists as
 human, even if their name includes bot:
 https://en.wikipedia.org/wiki/Wikipedia:List_of_Wikipedians_by_number_of_edits
 https://en.wikipedia.org/wiki/Wikipedia:List_of_Wikipedians_by_number_of_edits/5001%E2%80%931
 I check those lists occasionally and make sure that the only 'bots' included
 are actually human editors.


 On 18 May 2014 20:33, R.Stuart Geiger sgei...@gmail.com wrote:

 Tsk tsk tsk, Brian. When the revolution comes, bot discriminators will
 get no mercy. :-)

 But seriously, my tl;dr: instead of asking if an account is or isn't a
 bot, ask whether a set of edits is or is not automated.

 Great responses so far: searching usernames for *bot will exclude non-bot
 users who were registered before the username policy change (although *Bot
 is a bit better), and the logging table is a great way to collect bot
 flags. However, Scott is right -- the bot flag (or *Bot username) doesn't
 signify a bot, it signifies a bureaucrat recognizing that a user account
 successfully went through the Bot Approval Group process. If I see an
 account with a bot flag, I can generally assume the edits that account
 makes are initiated by an automated software agent. This is especially the
 case in the main namespace. The inverse assumption is not nearly as easy: I
 can't assume that every edit made from an account *without* a bot flag was
 *not* an automated edit.

 About unauthorized bots: yes, there are a relatively small number of
 Wikipedians who, on occasion, run fully-automated, continuously-operating
 bots without approval. Complicating this, if someone is going to take
 the time to build and run a bot, but isn't going to create a separate
 account for it, then it is likely that they are also using that account to
 do non-automated edits. Sometimes new bot developers will run an
 unauthorized bot under their own account during the initial stages of
 development, and only later in the process will they create a separate bot
 account and seek formal approval and flagging. It can get tricky when you
 exclude all the edits from an account for being automated based on a single
 suspicious set of edits.

 More commonly, there are many more people who use automated batch tools
 like AutoWikiBrowser to support one-off tasks, like mass find-and-replace
 or category cleanup. Accounts powered by AWB are technically not bots,
 only because a human has to sit there and click save for every batch edit
 that is made. Some people will create a separate bot account for AWB
 work and get it approved and flagged, but many more will not bother. Then
 there are people using semi-automated, human-in-the-loop tools like Huggle
 to do vandal fighting. I find that the really hard question is whether
 you include or exclude these different kinds of 'cyborgs', because it
 really makes you think hard about what exactly you're measuring. Is
 someone who does a mass find-and-replace on all articles in a category a
 co-author of each article they edit? Is a vandal fighter patrolling the
 recent changes feed with Huggle a co-author of all the articles they edit
 when they revert vandalism and then move on to the next diff? What about
 somebody using rollback in the web browser? If so, what is it that makes
 these entities authors and ClueBot NG not an author?

 When you think about it, user accounts are actually pretty remarkable in
 that they allow such a diverse set of uses and agents to be attributed to a
 single 

Re: [Wiki-research-l] Kill the bots

2014-05-19 Thread Federico Leva (Nemo)

Brian Keegan, 18/05/2014 18:10:

Is there a way to retrieve a canonical list of bots on enwiki or elsewhere?


A Bots.csv list exists. https://meta.wikimedia.org/wiki/Wikistat_csv
In general: please edit 
https://meta.wikimedia.org/wiki/Research:Identifying_bot_accounts


Nemo



Re: [Wiki-research-l] Kill the bots

2014-05-19 Thread Brian Keegan
Thanks for all the references and excellent advice so far!

I've looked into the Hale Anti-Bot Method™, but because I've sampled my
corpus on articles (based on category co-membership), grouping by user gives
these semi-automated users more normal-looking distributions, since their
other contributions are censored. In other words, I see only a fraction of
these users' contributions, and thus the time intervals I observe are spaced
farther apart (more typical) than they actually are. It's not feasible for me
to get 100k+ users' histories just for the purposes of cleaning up ~6k
articles' histories.

Another thought I had: because many semi-automated tools such as Twinkle and
AWB leave parenthetical annotations in their revision comments, would this be
a relatively inexpensive way to filter out revisions rather than users? Some
caveats I'd like to get domain experts' feedback on are below. I'm not
expecting settled research, just input from others' experiences munging
the data.

1. Is the inclusion of this markup in revision comments optional? This is a
concern that some users may enable or disable it, so I may end up biasing
inclusion based on users' preferences.
2. How have these flags or markup changed over time? This is a concern that
Twinkle/AWB/etc. may have started/stopped including flags or changed what
they included over time.
3. Are there other API queries or data elsewhere I could use to identify
(semi-)automated revisions?
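As a hypothetical sketch of the comment-based filter, with the caveats above in mind (the tag strings here are illustrative; as questions 1-2 note, the real annotations are optional and have changed over time):

```python
import re

# Illustrative tag patterns; real annotations vary by tool and version,
# which is exactly the concern raised in caveats 1 and 2.
TOOL_TAGS = re.compile(
    r"\((?:TW|HG|AWB)\)"        # Twinkle/Huggle/AWB-style suffix tags
    r"|using \[\[Project:AWB"   # older AWB-style wiki-link annotation
)

def is_tool_assisted(comment):
    """True if a revision comment carries a recognized tool annotation."""
    return bool(TOOL_TAGS.search(comment or ""))

print(is_tool_assisted("Reverted edits by Example to last version (TW)"))  # True
print(is_tool_assisted("typo fixing, using [[Project:AWB|AWB]]"))          # True
print(is_tool_assisted("copyedit"))                                        # False
```

This filters at the revision level rather than the account level, which is the shift Stuart advocates elsewhere in the thread.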


On Mon, May 19, 2014 at 10:35 AM, Federico Leva (Nemo)
nemow...@gmail.com wrote:

 Brian Keegan, 18/05/2014 18:10:

  Is there a way to retrieve a canonical list of bots on enwiki or
 elsewhere?


 A Bots.csv list exists. https://meta.wikimedia.org/wiki/Wikistat_csv
 In general: please edit https://meta.wikimedia.org/wiki/Research:Identifying_bot_accounts

 Nemo


 ___
 Wiki-research-l mailing list
 Wiki-research-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wiki-research-l




-- 
Brian C. Keegan, Ph.D.
Post-Doctoral Research Fellow, Lazer Lab
College of Social Sciences and Humanities, Northeastern University
Fellow, Institute for Quantitative Social Sciences, Harvard University
Affiliate, Berkman Center for Internet & Society, Harvard Law School

b.kee...@neu.edu
www.brianckeegan.com
M: 617.803.6971
O: 617.373.7200
Skype: bckeegan


Re: [Wiki-research-l] Kill the bots

2014-05-19 Thread Ann Samoilenko

 the Hale Anti-Bot Method™

That's a good one.  =)

I'm a big fan of Scott's method

I second that. Again, great paper, Scott!


On Mon, May 19, 2014 at 5:31 PM, Aaron Halfaker aaron.halfa...@gmail.com wrote:

 Another thought I had was that because many semi-automated tools such as
 Twinkle and AWB leave parenthetical annotations in their revision comments


 See Stuart's comments above, and also the queries he linked to:
 https://wiki.toolserver.org/view/MySQL_queries#Automated_tool_and_bot_edits 
 It would be nice if we could get these queries in version control and
 share them.

 Maybe there is potential for building a hand-curated list of bot user_ids
 in version control as well.

 -Aaron


 On Mon, May 19, 2014 at 10:17 AM, Brian Keegan b.kee...@neu.edu wrote:

 Thanks for all the references and excellent advice so far!

 I've looked into the Hale Anti-Bot Method™, but because I've sampled my
 corpus on articles (based on category co-membership), grouping by user gives
 these semi-automated users more normal-looking distributions, since their
 other contributions are censored. In other words, I see only a fraction of
 these users' contributions, and thus the time intervals I observe are spaced
 farther apart (more typical) than they actually are. It's not feasible for me
 to get 100k+ users' histories just for the purposes of cleaning up ~6k
 articles' histories.

 Another thought I had: because many semi-automated tools such as Twinkle and
 AWB leave parenthetical annotations in their revision comments, would this be
 a relatively inexpensive way to filter out revisions rather than users? Some
 caveats I'd like to get domain experts' feedback on are below. I'm not
 expecting settled research, just input from others' experiences munging
 the data.

 1. Is the inclusion of this markup in revision comments optional? This is
 a concern that some users may enable or disable it, so I may end up biasing
 inclusion based on users' preferences.
 2. How have these flags or markup changed over time? This is a concern
 that Twinkle/AWB/etc. may have started/stopped including flags or changed
 what they included over time.
 3. Are there other API queries or data elsewhere I could use to identify
 (semi-)automated revisions?


 On Mon, May 19, 2014 at 10:35 AM, Federico Leva (Nemo) 
 nemow...@gmail.com wrote:

 Brian Keegan, 18/05/2014 18:10:

  Is there a way to retrieve a canonical list of bots on enwiki or
 elsewhere?


 A Bots.csv list exists. https://meta.wikimedia.org/wiki/Wikistat_csv
 In general: please edit https://meta.wikimedia.org/wiki/Research:Identifying_bot_accounts

 Nemo






 --
 Brian C. Keegan, Ph.D.
 Post-Doctoral Research Fellow, Lazer Lab
 College of Social Sciences and Humanities, Northeastern University
 Fellow, Institute for Quantitative Social Sciences, Harvard University
 Affiliate, Berkman Center for Internet & Society, Harvard Law School

 b.kee...@neu.edu
 www.brianckeegan.com
 M: 617.803.6971
 O: 617.373.7200
 Skype: bckeegan





-- 
-
Kind regards,
Ann Samoilenko, MSc

Oxford Internet Institute
University of Oxford

Adventures can change your life

e-mail: ann.samoile...@gmail.com
Skype: ann.samoilenko


Re: [Wiki-research-l] Kill the bots

2014-05-19 Thread Scott Hale
Thanks all for the comments on my paper, and even more thanks to everyone
sharing these super helpful ideas on filtering bots: this is why I love the
Wikipedia research committee.

I think Oliver is definitely right that

  this would be a useful topic for some piece of method-comparing research,
 if anyone is looking for paper ideas.

"Citation goldmine", as one friend called it, I think.

This won't address edit logs to date, but do we know if most bots and
automated tools use the API to make edits? If so, would it be feasible
to add a flag to each edit indicating whether it came through the API or not?
This won't stop determined users, but might be a nice way to identify
cyborg edits from those made manually by the same user for many of the
standard tools going forward.

The closest thing I found in the bug tracker is [1], but it doesn't address
the issue of 'what is a bot', which this thread has clearly shown is quite
complex. An API-edit vs. non-API-edit flag might be a way forward unless there
are automated tools/bots that don't use the API.


1. https://bugzilla.wikimedia.org/show_bug.cgi?id=11181


Cheers,
Scott


Re: [Wiki-research-l] Kill the bots

2014-05-18 Thread Amir E. Aharoni
People whose last name is Abbot will be discriminated against.

And a true story: A prominent human Catalan Wikipedia editor whose name is
PauCabot skewed the results of an actual study.

So don't trust just the user names.
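The pitfall is easy to reproduce with a small sketch (usernames illustrative): a naive substring match flags every name below, while anchoring on a trailing "bot" still catches the humans Abbot and PauCabot, and now misses ClueBot NG as well:

```python
import re

naive = re.compile(r"bot", re.IGNORECASE)
anchored = re.compile(r"bot$", re.IGNORECASE)  # bot names conventionally *end* in "bot"

names = ["ClueBot NG", "Abbot", "PauCabot", "SmackBot"]
print([n for n in names if naive.search(n)])
# ['ClueBot NG', 'Abbot', 'PauCabot', 'SmackBot'] -- the humans match too
print([n for n in names if anchored.search(n)])
# ['Abbot', 'PauCabot', 'SmackBot'] -- still catches the humans, loses ClueBot NG
```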
On 18 May 2014 19:34, Andrew G. West west.andre...@gmail.com wrote:

 User name policy states that *bot* names are reserved for bots. Thus,
 such a regex shouldn't be too hacky, but I cannot comment whether some
 non-automated cases might slip through new user patrol. I do think dumps
 make the 'users' table available, and I know for sure one could get a full
 list via the API.

 As a check on this, you could check that when these usernames edit,
 whether or not they set the bot flag. -AW

 --
 Andrew G. West, PhD
 Research Scientist
 Verisign Labs - Reston, VA
 Website: http://www.andrew-g-west.com


 On 05/18/2014 12:10 PM, Brian Keegan wrote:

 Is there a way to retrieve a canonical list of bots on enwiki or
 elsewhere? I'm interested in omitting automated revisions (sorry
 Stuart!) for the purposes of building co-authorship networks.

 Grabbing everything under 'Category:All Wikipedia bots' excludes some
 major ones like SmackBot, Cydebot, VIAFbot, Full-date unlinking bot,
 etc. because these bots have changed names but the redirect is not
 categorized, the account has been removed/deprecated, or a user appears
 to have removed the relevant bot categories from the page.

 Can anyone advise me on how to kill all the bots in my data without
 having to resort to manual cleaning or hacky regex?


 --
 Brian C. Keegan, Ph.D.
 Post-Doctoral Research Fellow, Lazer Lab
 College of Social Sciences and Humanities, Northeastern University
 Fellow, Institute for Quantitative Social Sciences, Harvard University
 Affiliate, Berkman Center for Internet & Society, Harvard Law School

 b.kee...@neu.edu
 www.brianckeegan.com
 M: 617.803.6971
 O: 617.373.7200
 Skype: bckeegan





Re: [Wiki-research-l] Kill the bots

2014-05-18 Thread Scott Hale
Very helpful, Lukas, I didn't know about the logging table.

In some recent work [1] I found many users that appeared to be bots but
whose edits did not have the bot flag set. My approach was to exclude users
who didn't have a break of more than 6 hours between edits over the entire
month I was studying. I was interested in users who had multiple edit
sessions in the month and so went with a straight threshold. A way to keep
users with only one editing session would be to exclude users who have no
break longer than X hours in an edit session lasting at least Y hours
(e.g., a user who doesn't break for more than 6 hours in 5-6 days is
probably not human).
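Scott's session-gap heuristic can be sketched as follows (a minimal
illustration; the 6-hour break and multi-day span thresholds are the ones
from the text, and the function name is mine):

```python
from datetime import datetime, timedelta

def looks_automated(timestamps, max_gap_hours=6, min_span_hours=120):
    """Flag an account as bot-like if, over an editing run spanning at
    least `min_span_hours`, no two consecutive edits are more than
    `max_gap_hours` apart (e.g. no 6-hour break across ~5 days)."""
    ts = sorted(timestamps)
    if len(ts) < 2:
        return False
    if ts[-1] - ts[0] < timedelta(hours=min_span_hours):
        return False  # run too short to judge
    longest_gap = max(b - a for a, b in zip(ts, ts[1:]))
    return longest_gap <= timedelta(hours=max_gap_hours)
```

Applied per user to their edit timestamps, this keeps humans (who sleep)
and flags round-the-clock editors.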

Cheers,
Scott

[1] Multilinguals and Wikipedia Editing
http://www.scotthale.net/pubs/?websci2014


-- 
Scott Hale
Oxford Internet Institute
University of Oxford
http://www.scotthale.net/
scott.h...@oii.ox.ac.uk



On Sun, May 18, 2014 at 5:45 PM, Lukas Benedix lbene...@l3q.de wrote:

 Here is a list of currently flagged bots:

 https://en.wikipedia.org/w/index.php?title=Special:ListUsers&offset=&limit=2000&username=&group=bot

 Another good point to look for bots is here:

 https://en.wikipedia.org/w/index.php?title=Special%3APrefixIndex&prefix=Bots%2FRequests_for_approval&namespace=4

 You should also have a look at these pages to find former bots:
 https://en.wikipedia.org/wiki/Wikipedia:Bots/Status/inactive_bots_1
 https://en.wikipedia.org/wiki/Wikipedia:Bots/Status/inactive_bots_2

 And last but not least the logging table you can access via tool labs:
 SELECT DISTINCT(log_title)
 FROM logging
 WHERE log_action = 'rights'
 AND log_params LIKE '%bot%';
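 The same list of currently flagged bots can also be pulled without
 database access, via the public MediaWiki API's `list=allusers` module
 (a sketch; function names are mine, and the continuation shape assumes
 the current JSON API):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikipedia.org/w/api.php"

def bot_list_params(aufrom=None):
    """Build query parameters for one page of the flagged-bot listing."""
    params = {
        "action": "query",
        "list": "allusers",
        "augroup": "bot",     # only accounts in the 'bot' user group
        "aulimit": "500",
        "format": "json",
    }
    if aufrom:
        params["aufrom"] = aufrom  # continuation cursor
    return params

def flagged_bots():
    """Yield user names of all accounts currently in the 'bot' group."""
    aufrom = None
    while True:
        url = API + "?" + urlencode(bot_list_params(aufrom))
        data = json.load(urlopen(url))
        for user in data["query"]["allusers"]:
            yield user["name"]
        cont = data.get("continue")
        if not cont:
            return
        aufrom = cont["aufrom"]
```

 Note this only catches *currently* flagged bots; the logging-table query
 above is still needed for accounts whose flag was later removed.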

 Lukas

 Am So 18.05.2014 18:34, schrieb Andrew G. West:
  User name policy states that *bot* names are reserved for bots.
  Thus, such a regex shouldn't be too hacky, but I cannot comment
  whether some non-automated cases might slip through new user patrol. I
  do think dumps make the 'users' table available, and I know for sure
  one could get a full list via the API.
 
  As a check on this, you could look at whether these accounts set the
  bot flag when they edit. -AW
 








Re: [Wiki-research-l] Kill the bots

2014-05-18 Thread Brian Keegan
How does one cite emails in ACM proceedings format? :)

On Sunday, May 18, 2014, R.Stuart Geiger sgei...@gmail.com wrote:

 Tsk tsk tsk, Brian. When the revolution comes, bot discriminators will get
 no mercy. :-)

 But seriously, my tl;dr: instead of asking if an account is or isn't a
 bot, ask whether a set of edits is or is not automated.

 Great responses so far: searching usernames for *bot will exclude non-bot
 users who were registered before the username policy change (although *Bot
 is a bit better), and the logging table is a great way to collect bot
 flags. However, Scott is right -- the bot flag (or *Bot username) doesn't
 signify a bot, it signifies a bureaucrat recognizing that a user account
 successfully went through the Bot Approval Group process. If I see an
 account with a bot flag, I can generally assume the edits that account
 makes are initiated by an automated software agent. This is especially the
 case in the main namespace. The inverse assumption is not nearly as easy: I
 can't assume that every edit made from an account *without* a bot flag was
 *not* an automated edit.

 About unauthorized bots: yes, there are a relatively small number of
 Wikipedians who, on occasion, run fully-automated, continuously-operating
 bots without approval. Complicating this, if someone is going to take the
 time to build and run a bot, but isn't going to create a separate account
 for it, then it is likely that they are also using that account to do
 non-automated edits. Sometimes new bot developers will run an unauthorized
 bot under their own account during the initial stages of development, and
 only later in the process will they create a separate bot account and seek
 formal approval and flagging. It can get tricky when you exclude all the
 edits from an account for being automated based on a single suspicious set
 of edits.

 More commonly, there are many more people who use automated batch tools
 like AutoWikiBrowser to support one-off tasks, like mass find-and-replace
 or category cleanup. Accounts powered by AWB are technically not bots,
 only because a human has to sit there and click save for every batch edit
 that is made. Some people will create a separate bot account for AWB work
 and get it approved and flagged, but many more will not bother. Then
 there are people using semi-automated, human-in-the-loop tools like Huggle
 to do vandal fighting. I find that the really hard question is whether
 you include or exclude these different kinds of 'cyborgs', because it
 really makes you think hard about what exactly you're measuring. Is
 someone who does a mass find-and-replace on all articles in a category a
 co-author of each article they edit? Is a vandal fighter patrolling the
 recent changes feed with Huggle a co-author of all the articles they edit
 when they revert vandalism and then move on to the next diff? What about
 somebody using rollback in the web browser? If so, what is it that makes
 these entities authors and ClueBot NG not an author?

 When you think about it, user accounts are actually pretty remarkable in
 that they allow such a diverse set of uses and agents to be attributed to a
 single entity. So when it comes to identifying automation, I personally
 think it is better to shift the unit of analysis from the user account to
 the individual edit. A bot flag lets you assume all edits from an account
 are automated, but you can use a range of approaches to identifying sets of
 automated edits from non-flagged accounts. For example, I have a set of
 regex SQL queries in the Query Library [1] that parse edit summaries for
 the traces that AWB, Huggle, Twinkle, rollback, etc. leave by default.
 You can also use the edit session approach like Scott has suggested -- Aaron
 and I found a few unauthorized bots in our edit session study [2], and we
 were even using a more aggressive break, with no more than a 60 minute gap
 between edits. To catch short bursts of bulk edits, you could look at large
 numbers of edits made in a short period of time -- I'd say more than 7 main
 namespace edits a minute for 10 minutes would be a hard rate for even a
 very aggressive vandal fighter to maintain with Huggle.
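 The edit-summary approach can be sketched like this (illustrative only:
 the exact default signatures vary by tool and version, so treat these
 patterns as a starting point rather than a definitive list, and the
 function name is mine):

```python
import re

# Illustrative default-summary signatures for common (semi-)automated
# tools on English Wikipedia. Real summaries vary by tool version.
TOOL_SIGNATURES = {
    "AWB":      re.compile(r"\busing \[\[Project:AWB", re.I),
    "Huggle":   re.compile(r"\(\[\[WP:HG", re.I),
    "Twinkle":  re.compile(r"\(\[\[WP:TW", re.I),
    "rollback": re.compile(r"^Reverted edits by", re.I),
}

def detect_tool(edit_summary):
    """Return the name of the tool whose default signature appears in
    the edit summary, or None if no known signature matches."""
    for tool, pattern in TOOL_SIGNATURES.items():
        if pattern.search(edit_summary):
            return tool
    return None
```

 Note that users can customize or strip these summaries, so this catches
 default behavior only; combining it with the session-gap and burst-rate
 heuristics gives better coverage.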

 I'll conclude by saying that different kinds of automated editing
 techniques are different ways of participating in and contributing to
 Wikipedia. To systematically exclude automated edits is to remove a very
 important, meaningful, and heterogeneous kind of activity from view. These
 activities constitute a core part of what Wikipedia is, particularly
 those forms of automation which the community has explicitly authorized and
 recognized. Now, we researchers inevitably have to selectively reveal
 and occlude -- a co-authorship network based on main namespace edits also
 excludes talk page discussions and conflict resolution, and this also
 constitutes a core part of what Wikipedia is. It isn't wrong per se to
 exclude automated edits, and it is certainly much worse to not