Okay. Methodology:
*Take the last 5 days of request logs;
*Filter them down to text/html requests as a heuristic for non-API requests;
*Run them through the UA parser we use;
*Exclude spiders and things which reported valid browsers;
*Aggregate the user agents left;
*???
*Profit
It looks like
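A minimal Python sketch of the filtering steps listed above, for illustration only. The record fields (`content_type`, `user_agent`) and the keyword-based `classify_user_agent` helper are assumptions standing in for the real request-log schema and the UA parser mentioned in the message, neither of which is shown in this thread.

```python
from collections import Counter

# Hypothetical keyword lists; a real UA parser would be far more thorough.
SPIDER_HINTS = ("bot", "spider", "crawler")
BROWSER_HINTS = ("firefox", "chrome", "safari", "msie", "opera")


def classify_user_agent(ua):
    """Crude stand-in for a UA parser: 'spider', 'browser', or 'other'."""
    lowered = ua.lower()
    if any(hint in lowered for hint in SPIDER_HINTS):
        return "spider"
    if any(hint in lowered for hint in BROWSER_HINTS):
        return "browser"
    return "other"


def leftover_user_agents(requests):
    """Count user agents that are neither spiders nor recognised browsers.

    `requests` is an iterable of dicts with the assumed keys
    'content_type' and 'user_agent'.
    """
    counts = Counter()
    for req in requests:
        # Heuristic from the message: text/html responses approximate non-API traffic.
        if not req["content_type"].startswith("text/html"):
            continue
        # Drop self-identified spiders and anything reporting a valid browser.
        if classify_user_agent(req["user_agent"]) != "other":
            continue
        counts[req["user_agent"]] += 1
    # Most frequent leftover user agents first.
    return counts.most_common()
```

Whatever remains in the output is only a list of candidates for unflagged automated clients; it still needs a human look.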
Thank you, Oliver,
This is really interesting and gives some credibility to the idea that the
ability to track API/non-API edits could address the bot problem in part,
but definitely could miss some bots. Thank you very much for your time to
check this and share the results. Anyone think it would
I think a *lot* of them use the API, but I don't know off the top of my
head if it's *all* of them. If only we knew somebody who has spent the last
3 months staring into the Cthulhian nightmare of our request logs and could
look this up...
More seriously; drop me a note off-list so that I can try
If your bot is only running automated reports in its own userspace then it
doesn't need a bot flag. But it probably won't be a very active bot so may
not be a problem for your stats
On the English language Wikipedia you are going to be fairly close if you
exclude all accounts which currently have
That would cover most of them, but runs into the problem that you're only
including the unauthorised bots written poorly enough that we've caught the
operator ;). It seems like this would be a useful topic for some piece of
method-comparing research, if anyone is looking for paper ideas.
On 19 May
Brian Keegan, 18/05/2014 18:10:
Is there a way to retrieve a canonical list of bots on enwiki or elsewhere?
A Bots.csv list exists. https://meta.wikimedia.org/wiki/Wikistat_csv
In general: please edit
https://meta.wikimedia.org/wiki/Research:Identifying_bot_accounts
Nemo
Thanks for all the references and excellent advice so far!
I've looked into the Hale Anti-Bot Method™, but because I've sampled my
corpus on articles (based on category co-membership), the resulting groupby
users gives these semi-automated users more normal distributions since
their other
the Hale Anti-Bot Method™
That's a good one. =)
I'm a big fan of Scott's method
I second that. Again, great paper, Scott!
On Mon, May 19, 2014 at 5:31 PM, Aaron Halfaker aaron.halfa...@gmail.com wrote:
Another thought I had was that because many semi-automated tools such as
Twinkle and
Thanks all for the comments on my paper, and even more thanks to everyone
sharing these super helpful ideas on filtering bots: this is why I love the
Wikipedia research committee.
I think Oliver is definitely right that
this would be a useful topic for some piece of method-comparing research,
Is there a way to retrieve a canonical list of bots on enwiki or elsewhere?
I'm interested in omitting automated revisions (sorry Stuart!) for the
purposes of building co-authorship networks.
Grabbing everything under 'Category:All Wikipedia bots' excludes some major
ones like SmackBot, Cydebot,
People whose last name is Abbot will be discriminated against.
And a true story: A prominent human Catalan Wikipedia editor whose name is
PauCabot skewed the results of an actual study.
So don't trust just the user names.
On 18 May 2014 at 19:34, Andrew G. West west.andre...@gmail.com wrote:
User name
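To make the warning about user names concrete, here is a small Python sketch. The suffix rule in `looks_like_bot_name` and the `confirmed_bots` set are illustrative assumptions, not a method anyone in this thread proposed. Name matches are treated only as candidates and are cross-checked against an independently confirmed list.

```python
def looks_like_bot_name(username):
    """Naive heuristic (assumed for illustration): name ends in 'bot'."""
    return username.lower().endswith("bot")


def triage(usernames, confirmed_bots):
    """Split name-based candidates into confirmed bots and names needing review.

    `confirmed_bots` is a hypothetical set built from some independent
    source, such as a bot flag or a maintained list.
    """
    candidates = [u for u in usernames if looks_like_bot_name(u)]
    confirmed = [u for u in candidates if u in confirmed_bots]
    needs_review = [u for u in candidates if u not in confirmed_bots]
    return confirmed, needs_review
```

With this split, human editors such as PauCabot or someone named Abbot end up in the review pile instead of being silently dropped from the data.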
Very helpful, Lukas, I didn't know about the logging table.
In some recent work [1] I found many users that appeared to be bots but
whose edits did not have the bot flag set. My approach was to exclude users
who didn't have a break of more than 6 hours between edits over the entire
month I was
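A rough Python sketch of the break-based rule described in this message, under stated assumptions. The function name `never_rests` is hypothetical. Counting the quiet time before the first edit and after the last edit in the observation window as breaks is also an assumption made here, so that an account with only a couple of edits is not flagged; the message does not say how that case was handled.

```python
from datetime import datetime, timedelta


def never_rests(edit_times, window_start, window_end,
                min_break=timedelta(hours=6)):
    """Return True if the account never paused longer than `min_break`.

    Gaps are measured between consecutive edits and, by assumption,
    between the window boundaries and the first and last edit.
    """
    points = [window_start] + sorted(edit_times) + [window_end]
    longest_gap = max(later - earlier
                      for earlier, later in zip(points, points[1:]))
    return longest_gap <= min_break
```

Accounts for which this returns True over the whole study period would be excluded as probable bots. The 6-hour threshold comes from the message; everything else is a guess at one reasonable implementation.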
How does one cite emails in ACM proceedings format? :)
On Sunday, May 18, 2014, R.Stuart Geiger sgei...@gmail.com wrote:
Tsk tsk tsk, Brian. When the revolution comes, bot discriminators will get
no mercy. :-)
But seriously, my tl;dr: instead of asking if an account is or isn't a
bot, ask