John, thanks for your ideas!
> As asked on the talk page over a month ago with no response...
> https://meta.wikimedia.org/wiki/Talk:User-Agent_policy#bot

The responses are there now. Sorry for that, I forgot to watch the page.

> How does adding 'bot' help over and above including email addresses
> and URLs in the User-Agent?
> Are there significant cases of human traffic browsers including email
> addresses and URLs in the User-Agent?
> If not, I am struggling to understand how the addition of 'bot'
> assists better isolating traffic originated by humans.

No, I don't think there are cases of humans with such user-agents :] I understand, though, that the policy asks bot maintainers to add "some way of contacting them" to the user-agent, and some examples are given: "(e.g. a userpage on the local wiki, a userpage on a related wiki using interwiki linking syntax, a URI for a relevant external website, or an email address)". I assume the example list is not exhaustive, meaning maintainers may also use other kinds of contact info. Also, parsing the word "bot" is less error-prone and cheaper than parsing long, heterogeneous strings.

> Or, is adding 'bot' an alternative to including email addresses and
> URLs?

Adding "bot" is not intended to be an alternative to, or a replacement for, the current policy at all. It is only intended as an optional way bot maintainers can help us.

> This will also introduce some false positives, as 'bot' is a
> word and word-part with meanings other than the English meaning.

I think adding the word "bot" to the user-agent of bot-like programs is a widely adopted convention. Actually, the word "bot" has already been parsed for a long time now, and used to tag requests as bot-originated, in our jobs that process requests into pageview stats, because many external bots include it in their user-agents.
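To illustrate the kind of check being described: a minimal sketch of a cheap, case-insensitive "bot" substring test on a user-agent string. This is only an illustration of the idea; the actual production logic lives in the PageviewDefinition Java code linked below, and `looks_like_bot` is a hypothetical name, not a real function in that codebase.

```python
import re

# Case-insensitive match of the word-part "bot" anywhere in the string.
# Hypothetical helper for illustration only; not the production logic.
BOT_PATTERN = re.compile(r"bot", re.IGNORECASE)

def looks_like_bot(user_agent: str) -> bool:
    """Return True if the user-agent contains 'bot' in any case."""
    return bool(BOT_PATTERN.search(user_agent))

print(looks_like_bot("ExampleTool/1.0 (https://example.org; [email protected]) bot"))  # True
print(looks_like_bot("Mozilla/5.0 (Windows NT 10.0; rv:44.0) Gecko/20100101 Firefox/44.0"))  # False
```

As noted, a substring test like this is cheap but can produce false positives for user-agents that happen to contain "bot" for unrelated reasons.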
See: http://www.useragentstring.com/pages/Crawlerlist/

> There is very little information at
> https://meta.wikimedia.org/wiki/Research:Page_view or elsewhere (that
> I can see) regarding what use of the API is considered to be a
> **page** view. For example, is it a page view when I ask the API for
> metadata only of the last revision of a page -- i.e. the page/revision
> text is not included in the response?

You're right, and this is a very good question. I fear the only ways to look into this are browsing the actual code in:
https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java
or asking the Research team, who own the definition.

> But how much 'consumption' is required? This was briefly discussed
> during the analytics list discussion, and it would be good to bring
> the wider audience into this discussion.

Another good question. It is very difficult, though, to find the perfect wording that makes it totally clear to every bot maintainer whether their bot is a target of the convention. I guess it will come down to the good sense and willingness to help of the bot maintainers to decide, in cases where the bot's behavior sits on the frontier between "in-situ human consumption" and "not in-situ human consumption".

> It would be good to identify a list of gadgets which need to be
> updated to comply with the new user agent policy.

I'd say all gadgets already comply with the amendment to the user-agent policy, because the amendment is optional. Nevertheless, our next step is reaching out to the main bot maintainers to present them with that option.

Thanks again for the discussion!
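For bot maintainers who want to opt in, here is a hypothetical sketch of what a user-agent following both the existing policy (contact info) and the optional amendment (the word "bot") might look like. The tool name, userpage, and email address are invented for illustration.

```python
import urllib.request

# Hypothetical opted-in user-agent: contact info per the existing
# policy, plus the optional word "bot" from the amendment.
USER_AGENT = (
    "ExampleSyncBot/0.3 "
    "(https://meta.wikimedia.org/wiki/User:Example; [email protected]) bot"
)

req = urllib.request.Request(
    "https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&format=json",
    headers={"User-Agent": USER_AGENT},
)
# response = urllib.request.urlopen(req)  # actual network call omitted here
```

Since the convention is case-insensitive, names that already end in "Bot" (like the example above) would also be matched without any change.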
On Mon, Mar 21, 2016 at 4:28 PM, John Mark Vandenberg <[email protected]> wrote:

> On Mon, Mar 21, 2016 at 12:37 PM, Marcel Ruiz Forns
> <[email protected]> wrote:
> > Hi wikitech-l,
> >
> > After the discussion in analytics-l [1][2] and Phabricator [3], the
> > Analytics team added a small amendment [4] to Wikimedia's user-agent
> > policy [5] with the intention of improving the quality of WMF's
> > pageview statistics.
> >
> > The amendment asks Wikimedia bot/framework maintainers to optionally add
> > the word *bot* (case insensitive) to their user-agents. With that, the
> > analytical jobs that process request data into pageview statistics will
> > be capable of better identifying traffic generated by bots, and thus of
> > better isolating traffic originated by humans (corresponding code is
> > already in production [6]). The convention is optional, because
> > modifications to the user-agent can be a breaking change.
>
> As asked on the talk page over a month ago with no response...
> https://meta.wikimedia.org/wiki/Talk:User-Agent_policy#bot
>
> How does adding 'bot' help over and above including email addresses
> and URLs in the User-Agent?
> Are there significant cases of human traffic browsers including email
> addresses and URLs in the User-Agent?
>
> If not, I am struggling to understand how the addition of 'bot'
> assists better isolating traffic originated by humans.
>
> Or, is adding 'bot' an alternative to including email addresses and
> URLs? This will also introduce some false positives, as 'bot' is a
> word and word-part with meanings other than the English meaning. See
> https://en.wiktionary.org/wiki/bot ,
> https://en.wiktionary.org/wiki/Special:Search/intitle:bot and
> https://en.wiktionary.org/wiki/Talk:-bot#Fish_suffix
>
> > Targets of this convention are: bots/frameworks that can generate
> > Wikimedia pageviews [7] to Wikimedia sites and/or API and are not for
> > in-situ human consumption.
> > Not targets: bots/frameworks used to assist in-situ human
> > consumption, and bots/frameworks that are otherwise well known and
> > recognizable like WordPress, Scrapy, etc. Note that there are many
> > editing bots that also generate pageviews, like when trying to copy
> > content from one page to another the source page is requested and the
> > corresponding pageview is generated.
>
> I appreciate this attempt to devise a clearer "target" for
> when a client needs to follow this new convention from the analytics
> team, as requested during the discussion on the analytics list.
>
> Regarding "Wikimedia pageviews [7] to Wikimedia sites and/or API ..
> [7] https://meta.wikimedia.org/wiki/Research:Page_view"
>
> There is very little information at
> https://meta.wikimedia.org/wiki/Research:Page_view or elsewhere (that
> I can see) regarding what use of the API is considered to be a
> **page** view. For example, is it a page view when I ask the API for
> metadata only of the last revision of a page -- i.e. the page/revision
> text is not included in the response?
>
> "in-situ human consumption" is an interesting formula.
> "in situ human" strongly implies a human is directly accessing the
> content that caused the page view.
>
> But how much 'consumption' is required? This was briefly discussed
> during the analytics list discussion, and it would be good to bring
> the wider audience into this discussion.
>
> Obviously 'Navigation popups'/Hovercards is definitely "in-situ human
> consumption".
>
> But what about gadgets like Twinkle's "unlink" feature and Cat-a-lot (on
> Wikimedia Commons)? They do batch modifications to pages, and the
> in-situ human does not see the pages fetched by the JavaScript. Based
> on your responses in the analytics mailing list discussion, and this new
> terminology "in-situ human consumption", I believe that these gadgets
> would be considered subject to the bot user-agent policy.
> It would be good to identify a list of gadgets which need to be
> updated to comply with the new user agent policy.
>
> --
> John Vandenberg
>
> _______________________________________________
> Wikitech-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l

--
*Marcel Ruiz Forns*
Analytics Developer
Wikimedia Foundation
