Okay. Methodology (a rough Python sketch of the pipeline follows the list):

* Take the last 5 days of request logs;
* Filter them down to text/html requests as a heuristic for non-API requests;
* Run them through the UA parser we use;
* Exclude spiders and anything that reported a valid browser;
* Aggregate the user agents left;
* ???
* Profit
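
For concreteness, a minimal sketch of that pipeline. The TSV layout, the
column indices, and the use of the ua-parser package are all assumptions
for illustration; this is not the parser we actually run:

    # Sketch only: assumes TSV sampled logs with the content type in column 8
    # and the user agent in column 13 (hypothetical indices), and the
    # `ua-parser` package (pip install ua-parser).
    import csv
    from collections import Counter

    from ua_parser import user_agent_parser

    CONTENT_TYPE_COL = 8   # assumed position of the content-type field
    USER_AGENT_COL = 13    # assumed position of the user-agent field

    def aggregate_non_browser_agents(log_path):
        counts = Counter()
        with open(log_path, newline="") as f:
            for row in csv.reader(f, delimiter="\t"):
                if len(row) <= USER_AGENT_COL:
                    continue  # skip malformed rows
                if "text/html" not in row[CONTENT_TYPE_COL]:
                    continue  # heuristic: keep only non-API page requests
                ua = row[USER_AGENT_COL]
                parsed = user_agent_parser.Parse(ua)
                # uap-core tags many crawlers with device family "Spider";
                # anything recognised as a real browser gets a UA family
                # other than "Other". Drop both and keep the residue.
                if parsed["device"]["family"] == "Spider":
                    continue
                if parsed["user_agent"]["family"] != "Other":
                    continue
                counts[ua] += 1
        return counts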

It looks like there are a relatively small number of bots that
browse/interact via the web. Ones I can identify include WPCleaner [0],
which is semi-automated; something called "DigitalsmithsBot" that I can't
find through WP or Google (it could be internal or external); and Hoo Bot
(run by User:Hoo man). My biggest concern is DotNetWikiBot, which is a
general framework that could be masking multiple underlying bots and has
~7.4m requests through the web interface in that time period.
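
To put numbers against those names, one could run something like the
following over the aggregated counts from the sketch above. The substrings
are guesses at how each tool advertises itself in its user agent, not
verified UA strings:

    # Hypothetical follow-up to aggregate_non_browser_agents(): tally the
    # aggregated counts against substrings for the bots named above.
    KNOWN_BOTS = ["WPCleaner", "DigitalsmithsBot", "Hoo Bot", "DotNetWikiBot"]

    def tally_known_bots(counts):
        totals = dict.fromkeys(KNOWN_BOTS, 0)
        for ua, n in counts.items():
            for name in KNOWN_BOTS:
                if name.lower() in ua.lower():
                    totals[name] += n
        return totals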

Obvious caveat is obvious: the edits from these tools may actually come
through the API, with the tools choosing to request content through the web
interface for some weird reason. I don't know enough about the software
behind each bot to comment on that. I can try explicitly looking for
web-based edit attempts, but there would be far fewer observations in which
the bots might appear, because the underlying dataset is sampled at a
1:1000 rate.
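
If it's useful, that web-edit check would look something like this: a save
from the desktop edit form posts to index.php with action=submit, so one
can filter on that and scale by the sampling rate. The URL column index is
again an assumption:

    # Sketch of estimating web-based edit attempts from the sampled logs.
    URL_COL = 9            # assumed position of the request URL field
    SAMPLING_RATE = 1000   # the logs are sampled at a 1:1000 rate

    def estimate_web_edit_attempts(rows):
        hits = sum(1 for row in rows if "action=submit" in row[URL_COL])
        return hits * SAMPLING_RATE  # rough scale-up to total attempts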

[0] https://en.wikipedia.org/wiki/User:NicoV/Wikipedia_Cleaner/Documentation


On 20 May 2014 07:50, Oliver Keyes <oke...@wikimedia.org> wrote:

> Actually, belay that, I have a pretty good idea. I'll fire the log parser
> up now.
>
>
> On 20 May 2014 01:21, Oliver Keyes <oke...@wikimedia.org> wrote:
>
>> I think a *lot* of them use the API, but I don't know off the top of my
>> head if it's *all* of them. If only we knew somebody who has spent the
>> last 3 months staring into the Cthulhian nightmare of our request logs and
>> could look this up...
>>
>> More seriously: drop me a note off-list so that I can try to work out
>> precisely what you need me to find out, and I'll write a quick-and-dirty
>> parser of our sampled logs to drag the answer kicking and screaming into
>> the light.
>>
>> (sorry, it's annual review season. That always gets me blithe.)
>>
>>
>> On 19 May 2014 13:03, Scott Hale <computermacgy...@gmail.com> wrote:
>>
>>> Thanks all for the comments on my paper, and even more thanks to
>>> everyone sharing these super helpful ideas on filtering bots: this is why I
>>> love the Wikipedia research committee.
>>>
>>> I think Oliver is definitely right that
>>>
>>>>  this would be a useful topic for some piece of method-comparing
>>>> research, if anyone is looking for paper ideas.
>>>
>>> "Citation goldmine" as one friend called it, I think.
>>>
>>> This won't address edit logs to date, but do we know if most bots and
>>> automated tools use the API to make edits? If so, would it be feasible
>>> to add a flag to each edit indicating whether or not it came through the
>>> API? This won't stop determined users, but might be a nice way to
>>> distinguish cyborg edits from those made manually by the same user for
>>> many of the standard tools going forward.
>>>
>>> The closest thing I found in the bug tracker is [1], but it doesn't
>>> address the issue of 'what is a bot', which this thread has clearly
>>> shown is quite complex. An API-edit vs. non-API-edit flag might be a way
>>> forward unless there are automated tools/bots that don't use the API.
>>>
>>>
>>> 1. https://bugzilla.wikimedia.org/show_bug.cgi?id=11181
>>>
>>>
>>> Cheers,
>>> Scott
>>>
>>>
>>
>>
>> --
>> Oliver Keyes
>> Research Analyst
>> Wikimedia Foundation
>>
>
>
>
> --
> Oliver Keyes
> Research Analyst
> Wikimedia Foundation
>



-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation
_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
