Re: [Wikidata] [Analytics] SPARQL power users and developers

2016-10-04 Thread Leila Zia
Hi Nuria and others,

For context: Stas and I are points of contact in the WMF for Markus et
al.'s project. That's why I'm commenting here. :)


* The project and its goals at the proposal level are described at
https://meta.wikimedia.org/wiki/Research:Understanding_Wikidata_Queries .

* As Markus said, they are not looking for global solutions. They're trying
to increase the signal in the data, and comments seem to be a natural and
relatively cheap place to begin: query owners who are aware of this
conversation can add them, and that already helps.

* I suggest that we move discussions about possible changes to the X-Analytics
header to a new thread, if there is a need for it (long term or short term),
given that we don't need those changes for this research, at least for now.

Thanks,
Leila


On Tue, Oct 4, 2016 at 7:56 AM, Nuria Ruiz  wrote:

> mmm...There are several things here that are already taken care of by our
> user agent policy, for example: if you are using a bot or automated tool we
> already ask you to please include bot in the user agent plus contact info.
>
> Please see:
> https://meta.wikimedia.org/wiki/User-Agent_policy
>
> Now, we do not keep this information long term, after 60 days it gets
> deleted.
>
> X-Analytics is used for bits of info of analytics value, and the contact
> info of a tool developer doesn't seem to be one of those. Can we backtrack
> a little bit? What is the goal of this project? To keep a tally of who is
> querying the Wikidata Query Service? Anything else?
>
> Thanks,
>
> Nuria
>
>
>
>
> On Mon, Oct 3, 2016 at 10:05 PM, Yuri Astrakhan 
> wrote:
>
>> For consistency between all possible clients, we seem to have only two
>> options:  either part of the query, or the X-Analytics header.   The
>> user-agent header is not really an option because it is not available for
>> all types of clients, and we want to have just one way for everyone.
>> Headers other than X-Analytics will need custom handling, whereas we
>> already have plenty of Varnish code to deal with X-Analytics header, split
>> it into parts, and for Hive to parse it. Yes it will be an extra line of
>> code in JS ($.ajax instead of $.get), but I am sure this is not such a big
>> deal if we provide cookie-cutter code. Parsing the query string in Varnish/Hive
>> is also complex extra work, so let's keep X-Analytics. Proposed
>> required values (semicolon-separated):
>> * tool=
>> * toolver=
>> * contact=<em...@example.com, +1.212.555.1234, ...>
>>
>> Bikeshedding ?   See also:  https://wikitech.wikimedia.org/wiki/X-Analytics
>>
>> On Tue, Oct 4, 2016 at 12:45 AM Stas Malyshev 
>> wrote:
>>
>>> Hi!
>>>
>>> > Using custom HTTP headers would, of course, complicate calls for the
>>> > tool authors (i.e., myself). $.ajax instead of $.get and all that. I
>>> > would be less inclined to change to that.
>>>
>>> Yes, if you're using browser, you probably can't change user agent. In
>>> that case I guess we need either X-Analytics or put it in the query. Or
>>> maybe Referer header would be fine then - it is also recorded. If
>>> Referer is distinct enough it can be used then.
>>>
>>> --
>>> Stas Malyshev
>>> smalys...@wikimedia.org
>>>
>>> ___
>>> Analytics mailing list
>>> analyt...@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>
>> ___
>> Analytics mailing list
>> analyt...@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>
>
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] [Analytics] SPARQL power users and developers

2016-10-04 Thread Nuria Ruiz
mmm...There are several things here that are already taken care of by our
user agent policy, for example: if you are using a bot or automated tool we
already ask you to please include bot in the user agent plus contact info.

Please see:
https://meta.wikimedia.org/wiki/User-Agent_policy

Note that we do not keep this information long term: it is deleted after
60 days.

X-Analytics is used for bits of info of analytics value, and the contact
info of a tool developer doesn't seem to be one of those. Can we backtrack
a little bit? What is the goal of this project? To keep a tally of who is
querying the Wikidata Query Service? Anything else?

Thanks,

Nuria




On Mon, Oct 3, 2016 at 10:05 PM, Yuri Astrakhan 
wrote:

> For consistency between all possible clients, we seem to have only two
> options:  either part of the query, or the X-Analytics header.   The
> user-agent header is not really an option because it is not available for
> all types of clients, and we want to have just one way for everyone.
> Headers other than X-Analytics will need custom handling, whereas we
> already have plenty of Varnish code to deal with X-Analytics header, split
> it into parts, and for Hive to parse it. Yes it will be an extra line of
> code in JS ($.ajax instead of $.get), but I am sure this is not such a big
> deal if we provide cookie-cutter code. Parsing the query string in Varnish/Hive
> is also complex extra work, so let's keep X-Analytics. Proposed
> required values (semicolon-separated):
> * tool=
> * toolver=
> * contact=<em...@example.com, +1.212.555.1234, ...>
>
> Bikeshedding ?   See also:  https://wikitech.wikimedia.org/wiki/X-Analytics
>
> On Tue, Oct 4, 2016 at 12:45 AM Stas Malyshev 
> wrote:
>
>> Hi!
>>
>> > Using custom HTTP headers would, of course, complicate calls for the
>> > tool authors (i.e., myself). $.ajax instead of $.get and all that. I
>> > would be less inclined to change to that.
>>
>> Yes, if you're using browser, you probably can't change user agent. In
>> that case I guess we need either X-Analytics or put it in the query. Or
>> maybe Referer header would be fine then - it is also recorded. If
>> Referer is distinct enough it can be used then.
>>
>> --
>> Stas Malyshev
>> smalys...@wikimedia.org
>>


Re: [Wikidata] [Analytics] SPARQL power users and developers

2016-10-04 Thread Markus Kroetzsch

Hi again,

The solutions discussed here seem to be quite a bit more general than 
what I was thinking about. Of course it would be nice to have a uniform, 
cross-client way to indicate tools in any MW Web service or API, but 
this is a slightly bigger (and probably more long-term) goal than what I 
had in mind. It is a good idea to suggest a standard approach to tool 
developers there and to have a documentation page on that, but it would 
take some time until this is adopted by enough tools to work.


For our present task, we just need some more signals we can use. 
Analysing SPARQL queries requires us to parse them anyway, so comments 
are fine. In general, the data we are looking at has a lot of noise, so 
we cannot rely on a single field. We will combine user agents,
X-Analytics, query comments, and also query shapes (if you get 1M+
similar-looking queries in one hour, you know it's a bot). With the
current data, the query shape is often our main clue, so comments would 
already be a big step forward.
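
As a rough illustration of two of the signals mentioned above (query comments and query shapes), here is a minimal Python sketch. It is not the project's actual pipeline: the "#TOOL:" comment convention is the one floated elsewhere in this thread, the function names are made up for the example, and the regular expressions are a crude stand-in for real SPARQL parsing.

```python
import re
from collections import Counter

# Assumed tagging convention: a SPARQL comment line such as "#TOOL: my-tool".
TOOL_COMMENT = re.compile(r"^\s*#\s*TOOL:\s*(.+?)\s*$", re.IGNORECASE | re.MULTILINE)


def tool_from_comment(query):
    """Return the tool name from a '#TOOL:' comment, or None if there is none."""
    match = TOOL_COMMENT.search(query)
    return match.group(1) if match else None


def query_shape(query):
    """Crude shape: drop comments, mask literals, IDs and numbers, collapse whitespace."""
    # Naive comment stripping: a '#' inside an <IRI> or a string literal is cut too.
    text = re.sub(r"#[^\n]*", "", query)
    text = re.sub(r'"[^"\n]*"', '"S"', text)        # string literals
    text = re.sub(r"\b[QP]\d+\b", "ID", text)       # Wikidata item/property IDs
    text = re.sub(r"\b\d+(?:\.\d+)?\b", "N", text)  # plain numbers
    return " ".join(text.split())


def shape_counts(queries):
    """Count how many queries share each shape."""
    return Counter(query_shape(q) for q in queries)
```

Two queries that differ only in the item ID and the LIMIT value end up with the same shape, so a very large count for one shape within an hour hints at an automated client even when no comment or distinctive user agent is present.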


Best,

Markus


On 04.10.2016 07:05, Yuri Astrakhan wrote:

For consistency between all possible clients, we seem to have only two
options:  either part of the query, or the X-Analytics header.   The
user-agent header is not really an option because it is not available
for all types of clients, and we want to have just one way for everyone.
Headers other than X-Analytics will need custom handling, whereas we
already have plenty of Varnish code to deal with X-Analytics header,
split it into parts, and for Hive to parse it. Yes it will be an extra
line of code in JS ($.ajax instead of $.get), but I am sure this is not
such a big deal if we provide cookie-cutter code. Parsing the query string
in Varnish/Hive is also complex extra work, so let's keep
X-Analytics. Proposed required values (semicolon-separated):
* tool=
* toolver=
* contact=<em...@example.com, +1.212.555.1234, ...>

Bikeshedding? See also: https://wikitech.wikimedia.org/wiki/X-Analytics

On Tue, Oct 4, 2016 at 12:45 AM Stas Malyshev wrote:

Hi!

> Using custom HTTP headers would, of course, complicate calls for the
> tool authors (i.e., myself). $.ajax instead of $.get and all that. I
> would be less inclined to change to that.

Yes, if you're using a browser, you probably can't change the user agent. In
that case I guess we need either X-Analytics or put it in the query. Or
maybe Referer header would be fine then - it is also recorded. If
Referer is distinct enough it can be used then.

--
Stas Malyshev
smalys...@wikimedia.org 



Re: [Wikidata] [Analytics] SPARQL power users and developers

2016-10-03 Thread Guillaume Lederrey
On Mon, Oct 3, 2016 at 11:55 AM, Magnus Manske
 wrote:
> Using custom HTTP headers would, of course, complicate calls for the tool
> authors (i.e., myself). $.ajax instead of $.get and all that. I would be
> less inclined to change to that.

Yes, the limitation of HTTP headers is that they make things a bit more
complicated for tool authors. At the same time, it is a limitation
that tool authors using the MediaWiki APIs already have to deal with.
Having a WDQS-specific way of doing things increases the overall
complexity of our infrastructure. As I am involved in the general
infrastructure and not only in WDQS, I am of course biased toward a
globally standardized solution rather than a WDQS-specific one. I am
not absolutely against a WDQS-specific solution if it makes
things sufficiently easier for tool authors; I just want to make sure
we don't take this decision lightly...

> On Mon, Oct 3, 2016 at 10:42 AM Guillaume Lederrey 
> wrote:
>>
>> On Mon, Oct 3, 2016 at 12:40 AM, Stas Malyshev 
>> wrote:
>> > Hi!
>> >
>> >> This thread is missing some background context info as to what the
>> >> issues are,  if you could forward it it will be great.
>> >
>> > Well, I'm not talking about specific issues, except for the general need
>> > of identifying which tool is responsible for which queries. Basically,
>> > there are several ways of doing it:
>> >
>> > 1. Adding comments to the query itself
>> > 2. Adding query parameters
>> > 3. Adding query headers, specifically:
>> > a) distinct User-Agent
>> > b) distinct X-Analytics header
>> > c) custom headers
>> >
>> > I think that 3a is good for statistics purposes, though 1 could be more
>> > efficient when we need to find out who sent a particular query. 3b may
>> > be superior to 3a, but I admit I don't know enough about it :)
>>
>> I'm a bit late to the discussion, but still...
>>
>> I think that as much as possible metadata about a query should be done
>> via HTTP headers. This way, they are not coupled to SPARQL itself and
>> can be analysed with generic tools already in place. Setting a
>> user-agent is a standard best practice and seems to be part of the
>> Mediawiki API guidelines [1], we should use the same guidelines, no
>> reason to reinvent them.
>>
>> X-Analytics header might allow for more fine grained information, but
>> I'm not sure this is actually needed (and using X-Analytics should not
>> preclude from having a sensible user-agent).
>>
>>
>> [1] https://www.mediawiki.org/wiki/API:Main_page#Identifying_your_client
>>
>>
>> > --
>> > Stas Malyshev
>> > smalys...@wikimedia.org
>> >
>> > ___
>> > Wikidata mailing list
>> > Wikidata@lists.wikimedia.org
>> > https://lists.wikimedia.org/mailman/listinfo/wikidata
>>
>>
>>
>> --
>> Guillaume Lederrey
>> Operations Engineer, Discovery
>> Wikimedia Foundation
>> UTC+2 / CEST



-- 
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation
UTC+2 / CEST



Re: [Wikidata] [Analytics] SPARQL power users and developers

2016-10-03 Thread Magnus Manske
Using custom HTTP headers would, of course, complicate calls for the tool
authors (i.e., myself). $.ajax instead of $.get and all that. I would be
less inclined to change to that.

On Mon, Oct 3, 2016 at 10:42 AM Guillaume Lederrey 
wrote:

> On Mon, Oct 3, 2016 at 12:40 AM, Stas Malyshev 
> wrote:
> > Hi!
> >
> >> This thread is missing some background context info as to what the
> >> issues are,  if you could forward it it will be great.
> >
> > Well, I'm not talking about specific issues, except for the general need
> > of identifying which tool is responsible for which queries. Basically,
> > there are several ways of doing it:
> >
> > 1. Adding comments to the query itself
> > 2. Adding query parameters
> > 3. Adding query headers, specifically:
> > a) distinct User-Agent
> > b) distinct X-Analytics header
> > c) custom headers
> >
> > I think that 3a is good for statistics purposes, though 1 could be more
> > efficient when we need to find out who sent a particular query. 3b may
> > be superior to 3a, but I admit I don't know enough about it :)
>
> I'm a bit late to the discussion, but still...
>
> I think that as much as possible metadata about a query should be done
> via HTTP headers. This way, they are not coupled to SPARQL itself and
> can be analysed with generic tools already in place. Setting a
> user-agent is a standard best practice and seems to be part of the
> Mediawiki API guidelines [1], we should use the same guidelines, no
> reason to reinvent them.
>
> X-Analytics header might allow for more fine grained information, but
> I'm not sure this is actually needed (and using X-Analytics should not
> preclude from having a sensible user-agent).
>
>
> [1] https://www.mediawiki.org/wiki/API:Main_page#Identifying_your_client
>
>
> > --
> > Stas Malyshev
> > smalys...@wikimedia.org
> >
> > ___
> > Wikidata mailing list
> > Wikidata@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikidata
>
>
>
> --
> Guillaume Lederrey
> Operations Engineer, Discovery
> Wikimedia Foundation
> UTC+2 / CEST


Re: [Wikidata] [Analytics] SPARQL power users and developers

2016-10-03 Thread Guillaume Lederrey
On Mon, Oct 3, 2016 at 12:40 AM, Stas Malyshev  wrote:
> Hi!
>
>> This thread is missing some background context info as to what the
>> issues are,  if you could forward it it will be great.
>
> Well, I'm not talking about specific issues, except for the general need
> of identifying which tool is responsible for which queries. Basically,
> there are several ways of doing it:
>
> 1. Adding comments to the query itself
> 2. Adding query parameters
> 3. Adding query headers, specifically:
> a) distinct User-Agent
> b) distinct X-Analytics header
> c) custom headers
>
> I think that 3a is good for statistics purposes, though 1 could be more
> efficient when we need to find out who sent a particular query. 3b may
> be superior to 3a, but I admit I don't know enough about it :)

I'm a bit late to the discussion, but still...

I think that as much metadata about a query as possible should be passed
via HTTP headers. This way, it is not coupled to SPARQL itself and
can be analysed with generic tools already in place. Setting a
user-agent is a standard best practice and seems to be part of the
MediaWiki API guidelines [1]. We should use the same guidelines; there is
no reason to reinvent them.

The X-Analytics header might allow for more fine-grained information, but
I'm not sure this is actually needed (and using X-Analytics should not
preclude having a sensible user-agent).


[1] https://www.mediawiki.org/wiki/API:Main_page#Identifying_your_client


> --
> Stas Malyshev
> smalys...@wikimedia.org
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikidata



-- 
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation
UTC+2 / CEST



Re: [Wikidata] [Analytics] SPARQL power users and developers

2016-10-02 Thread Stas Malyshev
Hi!

> This thread is missing some background context info as to what the
> issues are,  if you could forward it it will be great. 

Well, I'm not talking about specific issues, except for the general need
of identifying which tool is responsible for which queries. Basically,
there are several ways of doing it:

1. Adding comments to the query itself
2. Adding query parameters
3. Adding query headers, specifically:
a) distinct User-Agent
b) distinct X-Analytics header
c) custom headers

I think that 3a is good for statistics purposes, though 1 could be more
efficient when we need to find out who sent a particular query. 3b may
be superior to 3a, but I admit I don't know enough about it :)
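
To make the options listed above concrete, here is a minimal Python sketch that builds (but does not send) one request tagged by a query comment, a query parameter, a User-Agent and an X-Analytics value. Everything specific in it is an assumption made for illustration: the endpoint URL, the "tool" parameter name, the "#TOOL:" comment convention, and the idea that a client-supplied X-Analytics header would be accepted at all.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint; it is not stated in this thread.
ENDPOINT = "https://query.wikidata.org/sparql"


def tagged_request(query, tool, version, contact):
    """Build (without sending) a GET request tagged in each of the listed ways."""
    # 1. Comment inside the query itself.
    commented = f"#TOOL: {tool}\n{query}"
    # 2. Extra query parameter ('tool' is a hypothetical parameter name).
    params = urlencode({"query": commented, "tool": tool})
    # 3a / 3b. Distinct User-Agent and an X-Analytics value.
    headers = {
        "User-Agent": f"{tool}/{version} ({contact})",
        "X-Analytics": f"tool={tool};toolver={version}",
    }
    return Request(f"{ENDPOINT}?{params}", headers=headers)
```

In practice a tool would pick one or two of these rather than all of them. Code running in a browser generally cannot override User-Agent, which is why the header options are more awkward for JavaScript clients than the comment or the query parameter.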

-- 
Stas Malyshev
smalys...@wikimedia.org



Re: [Wikidata] [Analytics] SPARQL power users and developers

2016-10-02 Thread Nuria Ruiz
Yuri/Stas:

This thread is missing some background context as to what the issues
are. If you could forward it, that would be great.

>Thanks, though using distinct User-Agent may be easier for analysis,
>since those are stored as separate fields, and doing operations on
>separate field would be much easier than extracting comments from query
>field e.g. when doing Hive data processing.

X-Analytics is a separate field in our Hive data; we like it when info
intended for analytics is put there.
Please see the docs: https://wikitech.wikimedia.org/wiki/X-Analytics
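
For readers unfamiliar with the format, here is a small Python sketch of how a semicolon-separated key=value string of this kind can be built and split back into pairs. It is only an illustration, not the Varnish or Hive code: the function names are invented, and the tool, toolver and contact keys are the ones proposed elsewhere in this thread, not established keys.

```python
def format_x_analytics(pairs):
    """Join key/value pairs into 'k1=v1;k2=v2' form."""
    parts = []
    for key, value in pairs.items():
        value = str(value)
        # ';' separates pairs and '=' separates key from value, so reject them
        # where they would make the string ambiguous.
        if ";" in key or "=" in key or ";" in value:
            raise ValueError(f"cannot encode {key!r}={value!r}")
        parts.append(f"{key}={value}")
    return ";".join(parts)


def parse_x_analytics(header):
    """Split 'k1=v1;k2=v2' into a dict, ignoring empty segments."""
    result = {}
    for part in header.split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition("=")
        result[key] = value
    return result
```

Once the header is available as key/value pairs, filtering or grouping by a single key is a plain field lookup, which is the advantage over digging the same information out of a free-form string.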



On Sun, Oct 2, 2016 at 1:32 PM, Yuri Astrakhan 
wrote:

> I would highly recommend using X-Analytics header for this, and
> establishing a "well known" key name(s). X-Analytics gets parsed into
> key-value pairs (object field) by our varnish/hadoop infrastructure,
> whereas the user agent is basically a semi-free form text string. Also,
> user agent cannot be set by any JavaScript client, so we will
> constantly have to perform two types of analysis - those that came from the
> "backend" and those that were made by the browser.
>
> On Sun, Oct 2, 2016 at 4:28 PM Stas Malyshev 
> wrote:
>
>> Hi!
>>
>> > I'll try to throw in a #TOOL: comment where I can remember using SPARQL,
>> > but I'll be bound to forget a few...
>>
>> Thanks, though using distinct User-Agent may be easier for analysis,
>> since those are stored as separate fields, and doing operations on
>> separate field would be much easier than extracting comments from query
>> field e.g. when doing Hive data processing.
>>
>> --
>> Stas Malyshev
>> smalys...@wikimedia.org
>>
>> ___
>> Wikidata mailing list
>> Wikidata@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>
>