On Mon, Mar 21, 2016 at 12:37 PM, Marcel Ruiz Forns
<[email protected]> wrote:
> Hi wikitech-l,
>
> After the discussion in analytics-l [1][2] and Phabricator [3], the
> Analytics team added a small amendment [4] to Wikimedia's user-agent policy
> [5] with the intention of improving the quality of WMF's pageview
> statistics.
>
> The amendment asks Wikimedia bot/framework maintainers to optionally add
> the word *bot* (case insensitive) to their user-agents. With that, the
> analytical jobs that process request data into pageview statistics will be
> capable of better identifying traffic generated by bots, and thus of better
> isolating traffic originated by humans (corresponding code is already in
> production [6]). The convention is optional, because modifications to the
> user-agent can be a breaking change.

As asked on the talk page over a month ago with no response...
https://meta.wikimedia.org/wiki/Talk:User-Agent_policy#bot

How does adding 'bot' help over and above including email addresses
and URLs in the User-Agent?
Are there significant cases of human traffic browsers including email
addresses and URLs in the User-Agent?

If not, I am struggling to understand how the addition of 'bot'
assists better isolating traffic originated by humans.

Or, is adding 'bot' an alternative to including email addresses and
URLs?  This will also introduce some false positives, as 'bot' is a
word and word-part with meanings other than the English meaning. See
https://en.wiktionary.org/wiki/bot ,
https://en.wiktionary.org/wiki/Special:Search/intitle:bot and
https://en.wiktionary.org/wiki/Talk:-bot#Fish_suffix
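
To make the false-positive concern concrete, here is a minimal sketch in
Python. It assumes the check is a plain case-insensitive substring match
on 'bot', which is my reading of the amendment and not a copy of the
production code [6]. The function names, the contact-info pattern and the
sample user-agents are all made up for illustration.

```python
import re

# Assumption: the convention is checked as a case-insensitive substring match.
BOT_WORD = re.compile(r"bot", re.IGNORECASE)

# Assumption: "contact info" means an email address or an http(s) URL.
CONTACT_INFO = re.compile(
    r"[\w.+-]+@[\w-]+(\.[\w-]+)+|https?://\S+", re.IGNORECASE
)


def has_bot_word(user_agent):
    """True if 'bot' appears anywhere in the user-agent, in any case."""
    return BOT_WORD.search(user_agent) is not None


def has_contact_info(user_agent):
    """True if the user-agent carries an email address or a URL."""
    return CONTACT_INFO.search(user_agent) is not None


# Hypothetical user-agents, not taken from real traffic.
SAMPLES = [
    "ExampleBot/1.0 (https://example.org/bot; ops@example.org)",
    "BotanyReader/2.1",
    "Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0",
]

for ua in SAMPLES:
    print(has_bot_word(ua), has_contact_info(ua), ua)
```

The second sample is the interesting one: it matches 'bot' as a word-part
but carries no contact details, so the two signals disagree.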

> Targets of this convention are: bots/frameworks that can generate Wikimedia
> pageviews [7] to Wikimedia sites and/or API and are not for in-situ human
> consumption. Not targets: bots/frameworks used to assist in-situ human
> consumption, and bots/frameworks that are otherwise well known and
> recognizable like WordPress, Scrapy, etc. Note that there are many editing
> bots that also generate pageviews, like when trying to copy content from
> one page to another the source page is requested and the corresponding
> pageview is generated.

I appreciate this attempt to devise a clearer "target" for
when a client needs to follow this new convention from the analytics
team, as requested during the discussion on the analytics list.

Regarding "Wikimedia pageviews [7] to Wikimedia sites and/or API .."
[7] https://meta.wikimedia.org/wiki/Research:Page_view

There is very little information at
https://meta.wikimedia.org/wiki/Research:Page_view or elsewhere (that
I can see) regarding what use of the API is considered to be a
**page** view.  For example, is it a page view when I ask the API for
metadata only of the last revision of a page -- i.e. the page/revision
text is not included in the response?
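
For concreteness, the kind of request I mean could be built like this. It
is only a sketch: the helper name and the default endpoint are my own
choices for illustration, and the code only constructs the URL without
sending anything. The point is that rvprop lists metadata fields and
omits the revision text.

```python
from urllib.parse import urlencode


def last_revision_metadata_url(title, api="https://en.wikipedia.org/w/api.php"):
    """Build a query URL for metadata of the latest revision of a page.

    Hypothetical helper: rvprop names metadata fields only, so the
    page/revision text is not part of the response.
    """
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvlimit": 1,  # latest revision only
        "rvprop": "ids|timestamp|user|comment",  # no revision text requested
        "format": "json",
    }
    return api + "?" + urlencode(params)


print(last_revision_metadata_url("Main Page"))
```

Is a request like this one a **page** view or not?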

"in-situ human consumption" is an interesting formula.
"in situ human" strongly implies a human is directly accessing the
content that caused the page view.

But how much 'consumption' is required?  This was briefly discussed
during the analytics list discussion, and it would be good to bring
the wider audience into this discussion.

'Navigation popups'/Hovercards is clearly "in-situ human
consumption".

But what about gadgets such as Twinkle's "unlink" feature and Cat-a-lot
(on Wikimedia Commons)?  They do batch modifications to pages, and the
in-situ human does not see the pages fetched by the JavaScript.  Based
on your responses in the analytics mailing list discussion, and this new
terminology "in-situ human consumption", I believe that these gadgets
would be considered subject to the bot user-agent policy.
It would be good to identify a list of gadgets which need to be
updated to comply with the new user agent policy.

--
John Vandenberg

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
