>
> The algorithm has been imperfect for a long time.  How long and how
> imperfect doesn't matter.  Analytics is all about making good use of
> imperfect algorithms to provide reasonable approximations.
> However, I expect it is the role of Analytics to improve the
> definitions and implementation over time, not force a bad algorithm
> into policy.


I don't think it is a bad algorithm. Using 'bot' in the user-agent is a
widely adopted convention, so analytics code needs to implement this (even
if it is an approximation). Because of that, Wikimedia bots having the word
'bot' in their user-agents have been tagged as bots for a long time now.
And it seems to make sense to have a line that refers to that fact in the
user-agent policy.
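To make the convention concrete, here is a minimal sketch (a hypothetical illustration, not the actual refinery code) of the substring check that analytics code typically implements:

```python
def is_bot_user_agent(user_agent):
    """Tag a request as bot-originated if its user-agent contains the
    substring 'bot' (case-insensitive).  This is the widely adopted
    convention referred to above; it is an approximation, not an
    exact classifier."""
    return "bot" in user_agent.lower()

# External crawlers and Wikimedia bots alike match the convention:
assert is_bot_user_agent("Googlebot/2.1 (+http://www.google.com/bot.html)")
assert is_bot_user_agent("Pywikibot/2.0")
# A plain browser user-agent does not:
assert not is_bot_user_agent("Mozilla/5.0 (X11; Linux x86_64) Firefox/45.0")
```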

> It is no different from a web browser in how it *may* be used,
> although of course typically the primary goal of using Pywikibot
> instead of a Web browser is to reduce the amount of human consumption
> and decision making needed to perform a task.


That is also Analytics' view on the subject. As you said, it is an
approximation that won't fit all cases. But in general, it makes sense to
make that approximation and tag such requests as non-human.
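To show the limits of the approximation (again a hypothetical sketch, not the production pageview code): classifying on the 'bot' substring alone will also tag Pywikibot sessions that a human is driving interactively, which is exactly the false positive being discussed:

```python
def classify_agent_type(user_agent):
    """Approximate a requester as 'bot' or 'user' from the user-agent
    string alone.  Interactive Pywikibot use is misclassified as 'bot'
    because the product name happens to contain the substring."""
    return "bot" if "bot" in user_agent.lower() else "user"

# An automated crawler and a human driving Pywikibot interactively
# are indistinguishable to this heuristic:
assert classify_agent_type("MyCrawlerBot/1.0") == "bot"
assert classify_agent_type("Pywikibot/2.0") == "bot"
# Only browser-like user-agents are tagged as human:
assert classify_agent_type("Mozilla/5.0 (X11; Linux x86_64) Firefox/45.0") == "user"
```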




On Tue, Mar 22, 2016 at 5:18 AM, John Mark Vandenberg <[email protected]>
wrote:

> On Tue, Mar 22, 2016 at 12:44 AM, Marcel Ruiz Forns
> <[email protected]> wrote:
> > ...
> > I think adding the word bot to the user-agent of bot-like programs is a
> > widely adopted convention. Actually, the word bot is already (for a long
> > time now) being parsed and used to tag requests as bot-originated in our
> > jobs that process requests into pageviews stats, because many external
> bots
> > include it in their user-agent. See:
> > http://www.useragentstring.com/pages/Crawlerlist/
>
> The algorithm has been imperfect for a long time.  How long and how
> imperfect doesnt matter.  Analytics is all about making good use of
> imperfect algorithms to provide reasonable approximations.
>
> However, I expect it is the role of Analytics to improve the
> definitions and implementation over time, not force a bad algorithm
> into policy.
>
> Pywiki*bot* has the string 'bot' in its user-agent, because it is part
> of the product name.
> However, not all usage of Pywikibot is a crawler or even a bot, in any
> sensible definition of those concepts.
> Pywikibot is a *user agent* that knows how to be a client of the
> *MediaWiki API*.  It can be used for "in-situ human consumption" or
> not.
>
> It is no different from a web browser in how it *may* be used,
> although of course typically the primary goal of using Pywikibot
> instead of a Web browser is to reduce the amount of human consumption
> and decision making needed to perform a task.  But that is no
> different to Gadgets written using the JavaScript libraries that run
> in the Web browser.
>
> It can function *exactly* like a web browser reading a special:search
> results page, viewing some of those page in the search results, and
> making edits to some of them.  Each page may be viewed by a real
> human, who is making decisions throughout the entire process about
> which pages to view and which pages to edit.
>
> Or it can function *exactly* like a crawler, spider, bot, etc., with
> zero human consumption.
>
> Almost every script that is packaged with Pywikibot has an automatic
> and non-automatic mode of operation.
> Should we change our user-agent to "Pywikihuman" when in non-automatic
> mode of operation, so that it isn't considered to be a bot by
> Analytics?
>
> Using the string 'bot' in the user-agent may be a useful approximation
> for Analytics to use circa 2010, but it is bad policy, and Analytics
> can and should do much better than that in 2016 now that API usage is
> in focus.
>
> > There is very little information at
> >> https://meta.wikimedia.org/wiki/Research:Page_view or elsewhere (that
> >> I can see) regarding what use of the API is considered to be a
> >> **page** view.  For example, is it a page view when I ask the API for
> >> metadata only of the last revision of a page -- i.e. the page/revision
> >> text is not included in the response?
> >
> > You're right, and this is a very good question. I fear the only ways to
> > look into this are browsing the actual code in:
> >
> https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java
>
> I am not very interested in the code, which is at best an attempt at
> implementing the API page view definition.  I'd like to understand the
> high level goal.
>
> However, having read that file, and the accompanying test suite, it is
> my understanding that there is no definition of an API page view.
> i.e. all requests to api.php, excepting api.php usage by the
> Wikipedia App (i.e. with user-agent "WikipediaApp", used by the iOS
> and Android Apps), are classified as *not a page view*.
>
> fwiw, rather than reading the source, this test data file with
> expected results is a simpler way to see the current status.
>
>
> https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/test/resources/pageview_test_data.csv
>
> > or asking the Research team, who owns the definition.
>
> Could the Research team please publish their definition of API
> (api.php) page views, like they do for Web (index.php) page views.
>
> Without this, it is hard to have a serious conversation about how
> changing the user-agent policy might be helpful to achieve the goal of
> better classifying API page views.
>
> --
> John Vandenberg
>
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>



-- 
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
