Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-19 Thread Adam Baso
Thanks.

On Fri, Jan 19, 2018 at 12:30 PM, Nuria Ruiz  wrote:

> >Thanks, good to know - is there a report around that? I'm wondering how
> "missing requests" ought to be expressed with some margin of error.
> I think your team is best placed to quantify this. If anything, from what
> I remember of the popups experiments, the inflow of events was higher
> than the calculations predicted. Overall usage of DNT for FF users was
> about 10% last time we looked at it; overall usage across our userbase is
> quite a bit smaller, I bet.
>
> https://blog.mozilla.org/netpolicy/2013/05/03/mozillas-new-do-not-track-dashboard-firefox-users-continue-to-seek-out-and-enable-dnt/
>
> On Fri, Jan 19, 2018 at 10:09 AM, Adam Baso  wrote:
>
>>
>> >Thanks, Sam. Nuria, that's what I was getting at - if using the EL JS
>>> library would some sort of new method be needed so that these impressions
>>> aren't undercounted?
>>> If we had a lot of users with DNT, maybe, from our tests when we enabled
>>> that on EL this is not the case.
>>>
>>
>> Thanks, good to know - is there a report around that? I'm wondering how
>> "missing requests" ought to be expressed with some margin of error.
>>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-19 Thread Nuria Ruiz
> Thanks, good to know - is there a report around that? I'm wondering how
> "missing requests" ought to be expressed with some margin of error.

I think your team is best placed to quantify this. If anything, from what I
remember of the popups experiments, the inflow of events was higher than
the calculations predicted. Overall usage of DNT for FF users was about 10%
last time we looked at it; overall usage across our userbase is quite a bit
smaller, I bet.

https://blog.mozilla.org/netpolicy/2013/05/03/mozillas-new-do-not-track-dashboard-firefox-users-continue-to-seek-out-and-enable-dnt/
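
One rough way to express "missing requests" with a margin of error: if a fraction p of clients have DNT enabled and therefore send no events, an observed count N corresponds to roughly N / (1 - p) actual impressions. The sketch below is only that arithmetic. The function names and the example rates are assumptions for illustration, not measured values for our traffic, and it assumes DNT users otherwise behave like everyone else.

```python
from typing import Tuple


def impute_total(observed: int, dnt_rate: float) -> float:
    """Estimate true impressions when a fraction `dnt_rate` of clients send nothing."""
    if not 0.0 <= dnt_rate < 1.0:
        raise ValueError("dnt_rate must be in [0, 1)")
    return observed / (1.0 - dnt_rate)


def impute_range(observed: int, low_rate: float, high_rate: float) -> Tuple[float, float]:
    """Bracket the estimate using a plausible low and high DNT rate."""
    return impute_total(observed, low_rate), impute_total(observed, high_rate)


if __name__ == "__main__":
    # Hypothetical numbers: 900 logged events, DNT rate somewhere between 2% and 10%.
    low, high = impute_range(900, 0.02, 0.10)
    print(f"estimated impressions: between {low:.0f} and {high:.0f}")
```

Reporting the low and high estimates together is one simple way to state the uncertainty when the DNT rate itself is only roughly known.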

On Fri, Jan 19, 2018 at 10:09 AM, Adam Baso  wrote:

>
> >Thanks, Sam. Nuria, that's what I was getting at - if using the EL JS
>> library would some sort of new method be needed so that these impressions
>> aren't undercounted?
>> If we had a lot of users with DNT, maybe, from our tests when we enabled
>> that on EL this is not the case.
>>
>
> Thanks, good to know - is there a report around that? I'm wondering how
> "missing requests" ought to be expressed with some margin of error.
>


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-19 Thread Nuria Ruiz
> So maybe it's worth considering which approach takes us closer to that?
> AIUI the beacon puts the record into the webrequest table and from there it
> would only take some trivial preprocessing to replace the beacon URL with
> the virtual URL and add the beacon type as a "virtual_type" field or
> something, making it very easy to expose it everywhere where views are
> tracked, while EventLogging data gets stored in a different, unrelated way.

Anything that involves combing *1 terabyte of data a day and 150,000
requests per second at peak* cannot be considered "simple" or "trivial".
Rather than looking for a needle in the haystack, let's please rely on the
client to send you preselected data (events). That data can be
aggregated later in different ways, and the fact that the data comes from
EventLogging does not dictate how aggregation needs to happen.




On Wed, Jan 17, 2018 at 6:09 PM, Gergo Tisza  wrote:

> On Wed, Jan 17, 2018 at 10:54 AM, Nuria Ruiz  wrote:
>
>> Recording "preview_events" is really no different than recording any
>> other kind of UI event; the difference is going to come from scale, if
>> anything, as there are probably tens of thousands of those per second (I
>> think your team already estimated the volume; if so, please send those
>> estimates along).
>>
>
> Conceptually I think a virtual pageview is a different thing from a UI
> event (which is how e.g. Google Analytics handles it, there is a method to
> send an event for the current page and a different method to send a virtual
> pageview for a different page), and the ideal way it is exposed in an
> analytics system should be very different. (I would want to see virtual
> pageviews together with normal pageviews, with some filtering option. If I
> deploy code that shows previews and converts users from making real
> pageviews to making virtual pageviews, I want to see how the total
> pageviews changed in the normal pageview stats; I don't want to have to
> create that chart and export one dataset from pageviews and one dataset
> from eventlogging to do that. As a user, I want to see in the fileview API
> how many people looked at the photo I uploaded, I don't particularly care
> if they used MediaViewer or not. etc.)
>
> So maybe it's worth considering which approach takes us closer to that?
> AIUI the beacon puts the record into the webrequest table and from there it
> would only take some trivial preprocessing to replace the beacon URL with
> the virtual URL and add the beacon type as a "virtual_type" field or
> something, making it very easy to expose it everywhere where views are
> tracked, while EventLogging data gets stored in a different, unrelated way.
>


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-19 Thread Adam Baso
> >Thanks, Sam. Nuria, that's what I was getting at - if using the EL JS
> library would some sort of new method be needed so that these impressions
> aren't undercounted?
> If we had a lot of users with DNT, maybe, from our tests when we enabled
> that on EL this is not the case.
>

Thanks, good to know - is there a report around that? I'm wondering how
"missing requests" ought to be expressed with some margin of error.


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-19 Thread Andrew Otto
> You could join these together in a broader ‘content consumption’ dataset
> somehow, either in Hadoop with batch jobs, or more realtime with streaming
> jobs.

Hm, idea…which I think has been mentioned before:  Could we leave pageviews
as is, but make a new dataset that counts both pageviews and page
previews?  Maybe this is ‘content_views’?  We could explicitly state that
the definition of content_views is supposed to change with time, and could
possibly incorporate other future types of content views too. Eh?
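
A minimal sketch of what such a combined dataset could look like, assuming both sources can be reduced to (title, count) pairs. The function name and the input shape are hypothetical; the real pageview and preview data live in Hive tables, not Python lists.

```python
from collections import Counter
from typing import Iterable, Tuple

Row = Tuple[str, int]  # (page title, view count); an assumed, simplified shape


def content_views(pageviews: Iterable[Row], page_previews: Iterable[Row]) -> Counter:
    """Sum pageviews and page previews per title, leaving both inputs untouched."""
    total: Counter = Counter()
    for title, count in pageviews:
        total[title] += count
    for title, count in page_previews:
        total[title] += count
    return total


if __name__ == "__main__":
    views = [("Foo", 3), ("Bar", 1)]
    previews = [("Foo", 2)]
    print(content_views(views, previews))
```

Because the two inputs stay separate, the definition of content_views can change later (for example by adding a third source) without redefining pageviews.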





On Fri, Jan 19, 2018 at 12:27 PM, Nuria Ruiz  wrote:

> >Thanks, Sam. Nuria, that's what I was getting at - if using the EL JS
> library would some sort of new method be needed so that these impressions
> aren't undercounted?
> If we had a lot of users with DNT, maybe, from our tests when we enabled
> that on EL this is not the case. Your team has already run experiments on
> this functionality and they can speak as to the projection of numbers.
>
> On Fri, Jan 19, 2018 at 3:05 AM, Adam Baso  wrote:
>
>> Thanks, Sam. Nuria, that's what I was getting at - if using the EL JS
>> library would some sort of new method be needed so that these impressions
>> aren't undercounted?
>>
>> On Fri, Jan 19, 2018 at 4:49 AM, Sam Smith 
>> wrote:
>>
>>> On Thu, Jan 18, 2018 at 9:57 PM, Adam Baso  wrote:
>>>
>>>> Adding to this, one thing to consider is DNT - is there a way to invoke
>>>> EL so that such traffic is appropriately imputed or something?
>>>>
>>>
>>> The EventLogging client respects DNT [0]. When the user enables DNT,
>>> mw.eventLog.logEvent is a NOP.
>>>
>>> I don't see any mention of DNT in the Varnish VCLs around the
>>> /beacon endpoint or otherwise but it may be handled elsewhere. While it's
>>> unlikely, there's nothing stopping a client sending a well-formatted
>>> request to the /beacon/event endpoint directly [1], ignoring the user's
>>> choice.
>>>
>>> -Sam
>>>
>>> [0] https://phabricator.wikimedia.org/diffusion/EEVL/browse/master/modules/ext.eventLogging.core.js;4480f7e27140fcb8ae915c1755223fd7a5bab9b9$251
>>> [1] https://phabricator.wikimedia.org/diffusion/EEVL/browse/master/modules/ext.eventLogging.core.js;4480f7e27140fcb8ae915c1755223fd7a5bab9b9$215
>>>
>>>
>>>


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-19 Thread Nuria Ruiz
> Thanks, Sam. Nuria, that's what I was getting at - if using the EL JS
> library would some sort of new method be needed so that these impressions
> aren't undercounted?

If we had a lot of users with DNT, maybe, but from our tests when we enabled
that on EL this is not the case. Your team has already run experiments on
this functionality and they can speak as to the projection of numbers.

On Fri, Jan 19, 2018 at 3:05 AM, Adam Baso  wrote:

> Thanks, Sam. Nuria, that's what I was getting at - if using the EL JS
> library would some sort of new method be needed so that these impressions
> aren't undercounted?
>
> On Fri, Jan 19, 2018 at 4:49 AM, Sam Smith  wrote:
>
>> On Thu, Jan 18, 2018 at 9:57 PM, Adam Baso  wrote:
>>
>>> Adding to this, one thing to consider is DNT - is there a way to invoke
>>> EL so that such traffic is appropriately imputed or something?
>>>
>>
>> The EventLogging client respects DNT [0]. When the user enables DNT,
>> mw.eventLog.logEvent is a NOP.
>>
>> I don't see any mention of DNT in the Varnish VCLs around the /beacon
>> endpoint or otherwise but it may be handled elsewhere. While it's unlikely,
>> there's nothing stopping a client sending a well-formatted request to the
>> /beacon/event endpoint directly [1], ignoring the user's choice.
>>
>> -Sam
>>
>> [0] https://phabricator.wikimedia.org/diffusion/EEVL/browse/master/modules/ext.eventLogging.core.js;4480f7e27140fcb8ae915c1755223fd7a5bab9b9$251
>> [1] https://phabricator.wikimedia.org/diffusion/EEVL/browse/master/modules/ext.eventLogging.core.js;4480f7e27140fcb8ae915c1755223fd7a5bab9b9$215
>>
>>
>>


Re: [Analytics] is there an hourly pageviews API?

2018-01-19 Thread Joseph Allemandou
Hi James,
We don't have hourly resolution for pageviews-per-article in the API.
I can't think of a better option than using the dumps :(
Joseph

On Fri, Jan 19, 2018 at 12:15 PM, James Salsman  wrote:

> Hourly pageviews are in
> /public/dumps/pageviews/$year/$year-$month/pageviews-$year$month$day-[012][0-9].gz
>
> Is there an API faster than zgreping those?
>



-- 
*Joseph Allemandou*
Data Engineer @ Wikimedia Foundation
IRC: joal


Re: [Analytics] is there an hourly pageviews API?

2018-01-19 Thread Thomas Steiner
>
> Thanks, Thomas, but that has only daily and monthly granularity for
> articles.
>

True. Sorry, replied too quickly. Hourly granularity is available for
/metrics/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end},
but not per-article, which is what you want. Pardon the spam.
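
For reference, a small sketch of how a request URL for that aggregate endpoint could be assembled. The base URL is the REST API root linked earlier in the thread; the example values ("en.wikipedia.org", "all-access", "user", and the YYYYMMDDHH timestamps) are assumptions about what the endpoint accepts, so check the API documentation before relying on them. No request is made here.

```python
BASE = "https://wikimedia.org/api/rest_v1"


def aggregate_pageviews_url(project: str, access: str, agent: str,
                            granularity: str, start: str, end: str) -> str:
    """Fill in the path template for the per-project aggregate endpoint."""
    return (f"{BASE}/metrics/pageviews/aggregate/"
            f"{project}/{access}/{agent}/{granularity}/{start}/{end}")


if __name__ == "__main__":
    # Assumed example values; one day of hourly, project-wide counts.
    print(aggregate_pageviews_url("en.wikipedia.org", "all-access", "user",
                                  "hourly", "2018011900", "2018012000"))
```

This only covers project-wide totals; it does not help with per-article hourly counts.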


Re: [Analytics] is there an hourly pageviews API?

2018-01-19 Thread James Salsman
>> Hourly pageviews are in
>> /public/dumps/pageviews/$year/$year-$month/pageviews-$year$month$day-[012][0-9].gz
>>
>> Is there an API faster than zgreping those?
>
> https://wikimedia.org/api/rest_v1/#/ :-)

Thanks, Thomas, but that has only daily and monthly granularity for articles.



Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-19 Thread Andrew Otto
> For virtual pageviews, people will probably be more interested in
> reports that belong to the first group (summing them up with normal
> pageviews, breaking them down along the dimensions that are relevant for
> web traffic, counting them for a given URL etc).

Ah! Ok I get this use case now.   I might not be able to comment about this
much then.  I think this totally changes the meaning of a pageview.
Perhaps this is what you want?  If so, this is outside the realm of my
opinionatedness. :)

However, IF you do convince folks to change the meaning of ‘pageview’ to
include ‘previews’, then we might be able to compromise.  All I object to
is more filtering of webrequests :)  The rest of this email might be moot if
we don’t change the ‘pageview definition’, but I’ll continue anyway…


The page previews data could come in as events.  Augmenting the generated
pageviews table from more incoming event sources sounds more flexible than
doing more filtering logic in webrequests.  I’d defer to the Analytics team
members who would be implementing this though, I might be wrong.

In my ideal, pageviews and page_previews would both be separate event
streams.  These would be imported as is to Hive tables, but also available
in Kafka.  You could join these together in a broader ‘content consumption’
dataset somehow, either in Hadoop with batch jobs, or more realtime with
streaming jobs.  (If this is done right, you can even use the same code for
both cases.)  If we had a good stream processing system here, I might
suggest that we move pageview filtering to a more realtime setup and
generate a derived pageview stream in Kafka. We’d then use that as the source
of pageviews in Hadoop.   Anyway, this is my ideal setup, but not what we
have now!  But we might one day (in the next FY???), and intaking events
for page previews and other counters will help us migrate to this kind
of architecture later.

> Is that different from preprocessing them via EventLogging? Either way
> you take a HTTP request, and end up with a Hadoop record - is there
> something that makes that process a lot more costly for normal pageviews
> than EventLogging beacon hits?

From a hardware perspective, only in that the stream of events is much
smaller, so there’s less wasted repeated I/O.  From an engineering time
perspective, if we use the webrequest tagging system to do this, I think
we’re good, but only in the short term.  In the long term, it hides the
complexity involved in maintaining the logic of what a pageview or page
preview or any other ‘tagged’ webrequest is in complicated Java logic that is
really only usable in Hadoop.  I’m mainly objecting because we want to
draw a line to stop doing this kind of thing.  Doing this for page previews
now might be ok if we really really really have to (although Nuria might
not agree ;) ), but ultimately we need to push this kind of interaction
logic out to feature developers who have more control over it.

The Analytics team wants to build infrastructure that makes it easy for
developers to measure their product usage, not implement the measuring
logic ourselves.





On Fri, Jan 19, 2018 at 6:05 AM, Adam Baso  wrote:

> Thanks, Sam. Nuria, that's what I was getting at - if using the EL JS
> library would some sort of new method be needed so that these impressions
> aren't undercounted?
>
> On Fri, Jan 19, 2018 at 4:49 AM, Sam Smith  wrote:
>
>> On Thu, Jan 18, 2018 at 9:57 PM, Adam Baso  wrote:
>>
>>> Adding to this, one thing to consider is DNT - is there a way to invoke
>>> EL so that such traffic is appropriately imputed or something?
>>>
>>
>> The EventLogging client respects DNT [0]. When the user enables DNT,
>> mw.eventLog.logEvent is a NOP.
>>
>> I don't see any mention of DNT in the Varnish VCLs around the /beacon
>> endpoint or otherwise but it may be handled elsewhere. While it's unlikely,
>> there's nothing stopping a client sending a well-formatted request to the
>> /beacon/event endpoint directly [1], ignoring the user's choice.
>>
>> -Sam
>>
>> [0] https://phabricator.wikimedia.org/diffusion/EEVL/browse/master/modules/ext.eventLogging.core.js;4480f7e27140fcb8ae915c1755223fd7a5bab9b9$251
>> [1] https://phabricator.wikimedia.org/diffusion/EEVL/browse/master/modules/ext.eventLogging.core.js;4480f7e27140fcb8ae915c1755223fd7a5bab9b9$215
>>
>>
>>


Re: [Analytics] is there an hourly pageviews API?

2018-01-19 Thread Thomas Steiner
https://wikimedia.org/api/rest_v1/#/ :-)


[Analytics] is there an hourly pageviews API?

2018-01-19 Thread James Salsman
Hourly pageviews are in
/public/dumps/pageviews/$year/$year-$month/pageviews-$year$month$day-[012][0-9].gz

Is there an API faster than zgreping those?
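
For context, this is roughly what the zgrep approach amounts to when done in Python. It is a sketch under an assumption about the file layout: each line is taken to be space-separated as "project title count bytes". The function name is hypothetical, and the path would be one of the hourly files matching the pattern above.

```python
import gzip


def hourly_views(path: str, project: str, title: str) -> int:
    """Sum the view counts for one title in a single hourly dump file.

    Assumes each line looks like: "<project> <title> <count> <bytes>".
    """
    total = 0
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            fields = line.split(" ")
            if len(fields) >= 3 and fields[0] == project and fields[1] == title:
                total += int(fields[2])
    return total
```

Every lookup still reads a whole file, which is why an indexed API would be faster for repeated per-article queries.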



Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-19 Thread Adam Baso
Thanks, Sam. Nuria, that's what I was getting at - if using the EL JS
library would some sort of new method be needed so that these impressions
aren't undercounted?

On Fri, Jan 19, 2018 at 4:49 AM, Sam Smith  wrote:

> On Thu, Jan 18, 2018 at 9:57 PM, Adam Baso  wrote:
>
>> Adding to this, one thing to consider is DNT - is there a way to invoke
>> EL so that such traffic is appropriately imputed or something?
>>
>
> The EventLogging client respects DNT [0]. When the user enables DNT,
> mw.eventLog.logEvent is a NOP.
>
> I don't see any mention of DNT in the Varnish VCLs around the /beacon
> endpoint or otherwise but it may be handled elsewhere. While it's unlikely,
> there's nothing stopping a client sending a well-formatted request to the
> /beacon/event endpoint directly [1], ignoring the user's choice.
>
> -Sam
>
> [0] https://phabricator.wikimedia.org/diffusion/EEVL/browse/master/modules/ext.eventLogging.core.js;4480f7e27140fcb8ae915c1755223fd7a5bab9b9$251
> [1] https://phabricator.wikimedia.org/diffusion/EEVL/browse/master/modules/ext.eventLogging.core.js;4480f7e27140fcb8ae915c1755223fd7a5bab9b9$215
>
>
>


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-19 Thread Sam Smith
On Thu, Jan 18, 2018 at 9:57 PM, Adam Baso  wrote:

> Adding to this, one thing to consider is DNT - is there a way to invoke EL
> so that such traffic is appropriately imputed or something?
>

The EventLogging client respects DNT [0]. When the user enables DNT,
mw.eventLog.logEvent is a NOP.

I don't see any mention of DNT in the Varnish VCLs around the /beacon
endpoint or otherwise but it may be handled elsewhere. While it's unlikely,
there's nothing stopping a client sending a well-formatted request to the
/beacon/event endpoint directly [1], ignoring the user's choice.

-Sam

[0]
https://phabricator.wikimedia.org/diffusion/EEVL/browse/master/modules/ext.eventLogging.core.js;4480f7e27140fcb8ae915c1755223fd7a5bab9b9$251
[1]
https://phabricator.wikimedia.org/diffusion/EEVL/browse/master/modules/ext.eventLogging.core.js;4480f7e27140fcb8ae915c1755223fd7a5bab9b9$215
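
To illustrate the server-side gap described above: a check on the DNT request header would be enough to drop events from clients that bypass the JS library. The sketch below is hypothetical and written in Python purely for illustration; it is not how Varnish or EventLogging is configured, and the function names are made up.

```python
from typing import Mapping


def dnt_enabled(headers: Mapping[str, str]) -> bool:
    """Return True when the request carries 'DNT: 1' (header names are case-insensitive)."""
    for name, value in headers.items():
        if name.lower() == "dnt":
            return value.strip() == "1"
    return False


def accept_event(headers: Mapping[str, str]) -> bool:
    """Hypothetical server-side guard: only record events from clients without DNT."""
    return not dnt_enabled(headers)


if __name__ == "__main__":
    print(accept_event({"DNT": "1"}))         # request with DNT set
    print(accept_event({"User-Agent": "x"}))  # request without DNT
```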


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-19 Thread Gergo Tisza
On Thu, Jan 18, 2018 at 3:56 PM, Nuria Ruiz  wrote:

> Event logging use cases are events; as we move to a thicker client (more
> JavaScript-heavy) you will need to measure events for nearly everything,
> and whether those are to be considered "content consumption" or "UI
> interaction" is not that relevant. Example: video plays are content
> consumption and are also "UI interactions".
>

That could be an argument for not separating pageviews from events (in
which case the question of whether virtual pageviews should be more like
pageviews or more like events would be moot), but given that those *are*
separated I don't see how it applies. In the current analytics setup, and
given what kinds of frontends are currently supported, there are types of
report generation that are easier to perform on pageviews and not so easy
on events, and other types of report generation that are easier to do on
events. For virtual pageviews, people will probably be more interested in
reports that belong to the first group (summing them up with normal
pageviews, breaking them down along the dimensions that are relevant for
web traffic, counting them for a given URL etc).

On Thu, Jan 18, 2018 at 10:45 AM, Andrew Otto  wrote:

> > the beacon puts the record into the webrequest table and from there it
> would only take some trivial preprocessing
> ‘Trivial’ preprocessing that has to look through 150K requests per second!
> This is a lot of work!
>

Is that different from preprocessing them via EventLogging? Either way you
take a HTTP request, and end up with a Hadoop record - is there something
that makes that process a lot more costly for normal pageviews than
EventLogging beacon hits?

Anyway what I meant by trivial preprocessing is that you take something
like
*http://bits.wikimedia.org/beacon/page-preview?duration=123=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FFoo*,
convert it into *https://en.wikipedia.org/wiki/Foo*, tack the duration and
the type ('page-preview') into some extra fields, add those extra fields to
the dimensions along which pageviews can be inspected, and you have
integrated virtual views into your analytics APIs / UIs, almost for free.
The alternative would be that every analytics customer who wants to deal
with content consumption and does not want to automatically filter out
content consumption happening via thick clients would have to update their
interfaces and do some kind of union query to merge the data that's now
distributed between the webrequest table and one or more EventLogging
tables; surely that's less expedient?
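
A minimal sketch of the per-record transformation described above, to make the idea concrete. The name of the query parameter that carries the previewed page URL is not visible in the example URL, so `page` below is a hypothetical stand-in; the output field names other than "virtual_type" are also assumptions. This says nothing about the cost of finding such records among all webrequests.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical parameter name; the thread does not say what the beacon actually uses.
PAGE_PARAM = "page"


def beacon_to_virtual_view(beacon_url: str) -> dict:
    """Turn a beacon hit into a virtual-pageview record with extra fields."""
    parts = urlsplit(beacon_url)
    query = parse_qs(parts.query)  # percent-decodes values
    virtual_type = parts.path.rstrip("/").rsplit("/", 1)[-1]
    return {
        "url": query[PAGE_PARAM][0],
        "virtual_type": virtual_type,
        "duration": int(query["duration"][0]),
    }


if __name__ == "__main__":
    hit = ("http://bits.wikimedia.org/beacon/page-preview"
           "?duration=123&page=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FFoo")
    print(beacon_to_virtual_view(hit))
```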

> If we use webrequests+Hadoop tagging to count these, any time in the future
> there is a change to the URLs that page previews load (or the beacon URLs
> they hit), we’d have to make a patch to the tagging logic and release and
> deploy a new refinery version to account for the change.  Any time a new
> feature is added for which someone wants interactions counted, we have to
> do the same.


There doesn't seem to be much reason for the beacon URL to ever change. As
for new beacon endpoints (new virtual view types), why can't that just be a
whitelist that's offloaded to configuration?
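
As a sketch of what I mean by a whitelist in configuration: the set of recognised virtual view types is data, and the matching code never changes when a type is added. The names below are hypothetical, and in practice the set would be loaded from a config file rather than hard-coded.

```python
from typing import FrozenSet, Optional

# Hypothetical configuration value; adding a new virtual view type means editing this set only.
ALLOWED_VIRTUAL_TYPES: FrozenSet[str] = frozenset({"page-preview"})


def virtual_type_for(path: str,
                     allowed: FrozenSet[str] = ALLOWED_VIRTUAL_TYPES) -> Optional[str]:
    """Return the virtual view type for a beacon path, or None if it is not whitelisted."""
    prefix = "/beacon/"
    if not path.startswith(prefix):
        return None
    candidate = path[len(prefix):].strip("/")
    return candidate if candidate in allowed else None


if __name__ == "__main__":
    print(virtual_type_for("/beacon/page-preview"))
    print(virtual_type_for("/beacon/event"))
```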