Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-02-07 Thread Andrew Otto
Can we keep further discussion on the phablet thread?
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-02-07 Thread Nuria Ruiz
>Regarding the last few posts about the geolocation information, from the
>data analysis perspective, there is indeed another, more serious concern
>about using the GeoIP cookie: It will create significant discrepancies
>with the existing geolocation data we record for pageviews, where we have
>chosen to derive this information from the IP instead.

How did you come to the conclusion that the data will differ?

The GeoIP cookie is inferred from your IP just the same, right?
https://github.com/wikimedia/puppet/blob/production/modules/varnish/templates/geoip.inc.vcl.erb#L10






Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-02-07 Thread Andrew Otto
Gonna paste your reply on the ticket and respond there.




Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-02-07 Thread Tilman Bayer
On Wed, Feb 7, 2018 at 9:19 AM, Andrew Otto  wrote:
>> It will create significant discrepancies with the existing geolocation
>> data we record for pageviews
> If you only need country (or whatever is in the cookie), then likely
> whatever the output dataset is would only include country when selecting
> from pageviews. If you need more than country (it sounded like you didn’t),
> then we can get into doing the IP geocoding in EventLogging, but there are
> a few technical complications here, and we’d prefer not to have to do this
> if we don’t have to.

As mentioned repeatedly in this thread (see e.g. Sam's Jan 29 email),
the goal is to record metadata consistent with our existing
content consumption measurement, concretely: the fields available in
the pageview_hourly table. See
https://phabricator.wikimedia.org/T186728 for details (also regarding
other fields that are not in EL by default but are likewise generated
in a standard fashion for webrequest/pageview data).

I appreciate it will need a bit of engineering work to implement your
proposal of reusing the existing UDF that underlies the pageview data
for the new preview data. But it will serve to avoid a lot of data
limitations and headaches for years to come. To highlight just one
aspect: If we relied on the cookie, the data would be inconsistent
from the start because not all clients accept cookies. When we want to
know (say) the ratio of previews to pageviews in a particular country,
we don't want to have to embark on a research project estimating the
number of cookie-less pageviews in that country. And so on.


>
> On Wed, Feb 7, 2018 at 12:09 PM, Tilman Bayer  wrote:
>>
>> Thanks everyone! Separate from Sam's mapping out the frontend
>> instrumentation work at https://phabricator.wikimedia.org/T184793 , I have
>> created a task for the backend work at
>> https://phabricator.wikimedia.org/T186728 based on this thread.
>>
>> Regarding the last few posts about the geolocation information, from the
>> data analysis perspective, there is indeed another, more serious concern
>> about using the GeoIP cookie: It will create significant discrepancies with
>> the existing geolocation data we record for pageviews, where we have chosen
>> to derive this information from the IP instead. (Remember the overarching
>> goal here of measuring page previews the same way we measure page views
>> currently; the basic principle is that if a reader visits a page and then
>> uses the page preview feature on that page to read preview cards, all the
>> metadata that is recorded for both should have identical values for both the
>> preview and the pageview.) Therefore, we should go with the kind of solution
>> Andrew outlined above (adapting/reusing GetGeoDataUDF or such).
>>
>> On Thu, Feb 1, 2018 at 7:36 AM, Andrew Otto  wrote:
>>>
>>> Wow Sam, yeah, if this cookie works for you, it will make many things
>>> much easier for us.  Check it out and let us know.  If it doesn’t work for
>>> some reason, we can figure out the backend geocoding part.
>>>
>>>
>>>
>>> On Thu, Feb 1, 2018 at 2:43 AM, Sam Smith  wrote:

 On Tue, Jan 30, 2018 at 8:02 AM, Andrew Otto  wrote:
>
> > Using the GeoIP cookie will require reconfiguring the EventLogging
> > varnishkafka instance [0]
>
> I’m not familiar with this cookie, but, if we used it, I thought it
> would be sent back to by the client in the event. E.g. event.country =
> response.headers.country; EventLogging.emit(event);
>
> That way, there’s no additional special logic needed on the server side
> to geocode or populate the country in the event.


 Hah! I didn't think about accessing the GeoIP cookie on the client. As
 you say, the implementation is quite easy.

 My only concern with this approach is the duplication of the value
 between the cookie, which is sent in every HTTP request to the 
 /beacon/event
 endpoint, and the event itself. This duplication seems reasonable when
 balanced against capturing either: the client IP and then doing similar
 geocoding further along in the pipeline; or the cookie for all requests to
 that endpoint and then discarding them further along in the pipeline. It
 also reflects a seemingly core principle of the EventLogging system: that 
 it
 doesn't capture potentiallly PII by default.

 -Sam



 ___
 Analytics mailing list
 Analytics@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/analytics

>>>
>>>
>>> ___
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>
>>
>>
>> --
>> Tilman Bayer
>> Senior Analyst
>> Wikimedia Foundation
>> IRC (Freenode): HaeB
>>
>> 

Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-02-07 Thread Andrew Otto
> It will create significant discrepancies with the existing geolocation
> data we record for pageviews

If you only need country (or whatever is in the cookie), then likely
whatever the output dataset is would only include country when selecting
from pageviews. If you need more than country (it sounded like you
didn’t), then we can get into doing the IP geocoding in EventLogging, but
there are a few technical complications here, and we’d prefer not to have
to do this if we don’t have to.
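The country-only selection Andrew describes could look roughly like this; a JavaScript sketch with made-up row shapes (in production this would be a query over pageview_hourly, not client code):

```javascript
// Sketch: if the preview events only carry country (from the cookie), the
// pageview side can be rolled up to country granularity before comparison.
// The row fields here are illustrative, not the actual table schema.
function rollUpToCountry( rows ) {
	var byCountry = {};
	rows.forEach( function ( row ) {
		byCountry[ row.country ] = ( byCountry[ row.country ] || 0 ) + row.views;
	} );
	return byCountry;
}

// rollUpToCountry( [
//   { country: 'US', city: 'Seattle', views: 10 },
//   { country: 'US', city: 'Boston', views: 5 }
// ] )
// → { US: 15 }
```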



Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-02-07 Thread Tilman Bayer
Thanks everyone! Separate from Sam's mapping out the frontend
instrumentation work at https://phabricator.wikimedia.org/T184793 , I have
created a task for the backend work at
https://phabricator.wikimedia.org/T186728 based on this thread.

Regarding the last few posts about the geolocation information, from the
data analysis perspective, there is indeed another, more serious concern
about using the GeoIP cookie: It will create significant discrepancies with
the existing geolocation data we record for pageviews, where we have chosen
to derive this information from the IP instead. (Remember the overarching
goal here of measuring page previews the same way we measure page views
currently; the basic principle is that if a reader visits a page and then
uses the page preview feature on that page to read preview cards, all the
metadata that is recorded for both should have identical values for both
the preview and the pageview.) Therefore, we should go with the kind of
solution Andrew outlined above (adapting/reusing GetGeoDataUDF or such).



-- 
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-02-07 Thread Sam Smith
Just a quick update: I've captured details from this discussion and the
background in https://phabricator.wikimedia.org/T184793. I'd sure
appreciate your feedback.

-Sam


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-02-01 Thread Nuria Ruiz
>Wow Sam, yeah, if this cookie works for you, it will make many things much
>easier for us.

This is how it is done in the performance schemas for Navigation Timing
data per country, so there is precedent:
https://github.com/wikimedia/mediawiki-extensions-NavigationTiming/blob/master/modules/ext.navigationTiming.js#L218

In this case, because a preview request must happen after a full page
download, the cookie will always be available. Now, the cookie mappings are
of the form US:WA:Seattle, so they would need further processing to be akin
to the current pageviews split.
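The further processing Nuria mentions could be as simple as splitting the cookie value; a sketch, assuming only the country:region:city fields are needed (the real cookie may carry extra fields such as coordinates, so verify against the VCL that sets it):

```javascript
// Sketch: split a GeoIP cookie value like 'US:WA:Seattle' into named
// fields so it can be recoded to match the pageview split. Field order
// beyond the first three positions is not handled here.
function parseGeoIpCookie( value ) {
	var parts = ( value || '' ).split( ':' );
	return {
		country: parts[ 0 ] || null,
		region: parts[ 1 ] || null,
		city: parts[ 2 ] || null
	};
}

// parseGeoIpCookie( 'US:WA:Seattle' )
// → { country: 'US', region: 'WA', city: 'Seattle' }
```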



Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-02-01 Thread Andrew Otto
Wow Sam, yeah, if this cookie works for you, it will make many things much
easier for us.  Check it out and let us know.  If it doesn’t work for some
reason, we can figure out the backend geocoding part.





Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-02-01 Thread Sam Smith
On Tue, Jan 30, 2018 at 8:02 AM, Andrew Otto  wrote:

> > Using the GeoIP cookie will require reconfiguring the EventLogging
> varnishkafka instance [0]
>
> I’m not familiar with this cookie, but, if we used it, I thought it would
> be sent back by the client in the event. E.g. event.country =
> response.headers.country; EventLogging.emit(event);
>
> That way, there’s no additional special logic needed on the server side to
> geocode or populate the country in the event.
>

Hah! I didn't think about accessing the GeoIP cookie on the client. As you
say, the implementation is quite easy.
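That client-side wiring might be sketched as follows, kept as a pure function so the cookie handling is testable. The `country` field and the `mw.cookie`/`mw.eventLog` wiring in the comment are illustrative assumptions, not the actual Popups schema:

```javascript
// Sketch: copy the country from the GeoIP cookie into the event before it
// is logged, so no server-side geocoding is needed. Schema and field names
// here are hypothetical.
function withCountry( event, geoIpCookie ) {
	// The cookie value looks like 'US:WA:Seattle…'; country is the first field.
	var country = ( geoIpCookie || '' ).split( ':' )[ 0 ];
	if ( country ) {
		event.country = country;
	}
	return event;
}

// In MediaWiki client code this would be wired up roughly as:
//   mw.eventLog.logEvent( 'Popups', withCountry( event, mw.cookie.get( 'GeoIP' ) ) );
```

If the cookie is absent (e.g. the client rejects cookies), the event is simply logged without a country, which is part of the inconsistency concern raised later in the thread.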

My only concern with this approach is the duplication of the value between
the cookie, which is sent in every HTTP request to the /beacon/event
endpoint, and the event itself. This duplication seems reasonable when
balanced against capturing either the client IP and then doing similar
geocoding further along in the pipeline, or the cookie for all requests to
that endpoint and then discarding it further along in the pipeline. It
also reflects a seemingly core principle of the EventLogging system: that
it doesn't capture potential PII by default.

-Sam


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-30 Thread Nuria Ruiz
>I’m not totally sure if this works for you all, but I had pictured
>generating aggregates from the page preview events, and then joining the
>page preview aggregates with the pageview aggregates into a new table with
>an extra dimension specifying which type of content view was made.

In my opinion, the aggregated data should stay in two different tables. I
can see a future where the preview data is of different types (it might
include rich media that was/was not played, there are simple popups and
"richer" ones... whatever) and the dimensions in which you represent this
consumption are not going to match pageview_hourly, which, again, only
represents full page loads well.

> On Mon, Jan 29, 2018 at 10:30 PM, Sam Smith wrote:
>
>> Hullo all,
>>
>> It seems like we've arrived at an implementation for the client-side (JS)
>> part of this problem: use EventLogging to track a page interaction from
>> within the Page Previews code. This'll give us the flexibility to take
>> advantage of a stream processing solution if/when it becomes available,
>> to push the definition of a "Page Previews page interaction" to the
>> client, and to rely on any events that we log in the immediate future
>> ending up in tables that we're already familiar with.
>>
>> In principle, I agree with Andrew's argument that adding additional
>> filtering logic to the webrequest refinement process will make it harder
>> to change existing definitions of views or add others in future. In
>> practice though, we'll need to:
>>
>> - Ensure that the server-side EventLogging component records metadata
>> consistent with our existing content consumption measurement, concretely:
>> the fields available in the
>> https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly
>> table. In particular, that it either doesn't discard the client IP or
>> utilizes the GeoIP cookie sent by the client for this schema.
>> - Aggregate the resulting table so that it can be combined with the pageviews 

Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-30 Thread Andrew Otto
CoOOOl :)

> Using the GeoIP cookie will require reconfiguring the EventLogging
> varnishkafka instance [0]

I’m not familiar with this cookie, but, if we used it, I thought it would
be sent back by the client in the event. E.g. event.country =
response.headers.country; EventLogging.emit(event);

That way, there’s no additional special logic needed on the server side to
geocode or populate the country in the event.
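For illustration, a rough sketch of that cookie-to-event step (shown in Python for brevity; the real client code would be JavaScript, and the exact colon-separated cookie layout is an assumption based on the geoip.inc.vcl template):

```python
def country_from_geoip_cookie(cookie_value):
    # Assumed layout (per the geoip.inc.vcl template):
    # 'COUNTRY:REGION:CITY:LAT:LON:...', country code first.
    if not cookie_value:
        return None
    country = cookie_value.split(":", 1)[0]
    # Treat anything that isn't a two-letter code as unknown.
    return country if len(country) == 2 and country.isalpha() else None

# The client would then stamp the event before emitting it:
event = {"schema": "VirtualPageView", "duration": 123}
event["country"] = country_from_geoip_cookie("US:CA:San Francisco:37.77:-122.42:v4")
print(event["country"])  # US
```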

However, if y’all can’t or don’t want to use the country cookie, then yaaa,
we gotta figure out what to do about IPs and geocoding in EventLogging.
There are a few options here, but none of them are great. The options
basically are variations on ‘treat this event schema as special and make
special conditionals in EventLogging processor code’, or, 'include IP
and/or geocode all events in all schemas'. We’re not sure which we want to
do yet, but we did mention this at our offsite today. I think we’ll figure
this out and make it happen in the next week or two. Whatever the
implementation ends up being, we’ll get geocoded data into this dataset.

> Is the geocoding code that we use on webrequest_raw available as a Hive
UDF or in PySpark?
The IP is geocoded from wmf_raw.webrequest to wmf.webrequest using a Hive
UDF, which ultimately just calls this getGeocodedData function, which itself
is just a wrapper around the Maxmind API. We may end up doing geocoding in
the EventLogging server codebase (again, really not sure about this yet…),
but if we do it will use the same Maxmind databases.


> Aggregating the EventLogging data in the same way that we aggregate
webrequest data into pageviews data will require either: replicating the
process that does this and keeping the two processes in sync; or
abstracting away the source table from the aggregation process so that it
can work on both tables

I’m not totally sure if this works for you all, but I had pictured
generating aggregates from the page preview events, and then joining the
page preview aggregates with the pageview aggregates into a new table with
an extra dimension specifying which type of content view was made.
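A toy model of that join (not the real Hive job, and the dimension names here are invented for illustration) might look like:

```python
from collections import Counter

def content_views(pageview_aggs, preview_aggs):
    # Each input maps a dimension tuple, e.g. (project, country, hour),
    # to a count; the output adds a trailing 'view type' dimension so
    # both kinds of content view live in one table.
    combined = Counter()
    for dims, n in pageview_aggs.items():
        combined[dims + ("pageview",)] += n
    for dims, n in preview_aggs.items():
        combined[dims + ("preview",)] += n
    return dict(combined)

pv = {("en.wikipedia", "US", "2018-01-29T10"): 1000}
pp = {("en.wikipedia", "US", "2018-01-29T10"): 250}
print(content_views(pv, pp))
```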


>  I’d appreciate it if someone could estimate how much work it will be to
implement GeoIP information and the other fields from Pageview hourly for
EventLogging events

Ya we gotta figure this out still, but actual implementation shouldn’t be
difficult, however we decide to do it.

On Mon, Jan 29, 2018 at 10:30 PM, Sam Smith  wrote:

> Hullo all,
>
> It seems like we've arrived at an implementation for the client-side (JS)
> part of this problem: use EventLogging to track a page interaction from
> within the Page Previews code. This'll give us the flexibility to take
> advantage of a stream processing solution if/when it becomes available, to
> push the definition of a "Page Previews page interaction" to the client,
> and to rely on any events that we log in the immediate future ending up in
> tables that we're already familiar with.
>
> In principle, I agree with Andrew's argument that adding additional
> filtering logic to the webrequest refinement process will make it harder
> to change existing definitions of views or add others in future. In
> practice though, we'll need to:
>
> - Ensure that the server-side EventLogging component records metadata
> consistent with our existing content consumption measurement, concretely:
> the fields available in the
> https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly
> table. In particular, that it either doesn't discard the client IP or
> utilizes the GeoIP cookie sent by the client for this schema.
> - Aggregate the resulting table so that it can be combined with the
> pageviews table to generate reports.
> - Ensure that the events aren't recorded in MySQL.
>
> Using the GeoIP cookie will require reconfiguring the EventLogging
> varnishkafka instance [0], and raises questions about the compatibility
> with the corresponding field in the pageviews data. Retaining the client
> IP will require a similar change but will also require that we share the
> geocoding code with whatever process we use to refine the data that we're
> capturing via EventLogging. Is the geocoding code that we use on
> webrequest_raw available as a Hive UDF or in PySpark?
>
> Aggregating the EventLogging data in the same way that we aggregate
> webrequest data into pageviews data will require either: replicating the
> process that does this and keeping the two processes in sync; or
> abstracting away the source table from the aggregation process so that it
> can work on both tables

Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-19 Thread Adam Baso
Thanks.

On Fri, Jan 19, 2018 at 12:30 PM, Nuria Ruiz  wrote:

> >Thanks, good to know - is there a report around that? I'm wondering how
> "missing requests" ought to be expressed with some margin of error.
> The ones who can quantify this best are your team. If anything, from what
> I remember from the Popups experiments, the inflow of events was higher
> than expected. Overall usage of DNT for Firefox users was about ~10% last
> time we looked at it; overall usage across our userbase is quite a bit
> smaller, I bet.
>
> https://blog.mozilla.org/netpolicy/2013/05/03/mozillas-new-do-not-track-dashboard-firefox-users-continue-to-seek-out-and-enable-dnt/
>
> On Fri, Jan 19, 2018 at 10:09 AM, Adam Baso  wrote:
>
>>
>> >Thanks, Sam. Nuria, that's what I was getting at - if using the EL JS
>>> library would some sort of new method be needed so that these impressions
>>> aren't undercounted?
>>> If we had a lot of users with DNT, maybe, from our tests when we enabled
>>> that on EL this is not the case.
>>>
>>
>> Thanks, good to know - is there a report around that? I'm wondering how
>> "missing requests" ought to be expressed with some margin of error.
>>


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-19 Thread Nuria Ruiz
>Thanks, good to know - is there a report around that? I'm wondering how
"missing requests" ought to be expressed with some margin of error.
The ones who can quantify this best are your team. If anything, from what I
remember from the Popups experiments, the inflow of events was higher than
expected. Overall usage of DNT for Firefox users was about ~10% last time we
looked at it; overall usage across our userbase is quite a bit smaller, I
bet.

https://blog.mozilla.org/netpolicy/2013/05/03/mozillas-new-do-not-track-dashboard-firefox-users-continue-to-seek-out-and-enable-dnt/

On Fri, Jan 19, 2018 at 10:09 AM, Adam Baso  wrote:

>
> >Thanks, Sam. Nuria, that's what I was getting at - if using the EL JS
>> library would some sort of new method be needed so that these impressions
>> aren't undercounted?
>> If we had a lot of users with DNT, maybe, from our tests when we enabled
>> that on EL this is not the case.
>>
>
> Thanks, good to know - is there a report around that? I'm wondering how
> "missing requests" ought to be expressed with some margin of error.
>


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-19 Thread Nuria Ruiz
>So maybe it's worth considering which approach takes us closer to that?
AIUI the beacon puts the record into the webrequest table and from there it
would only take some >trivial preprocessing to replace the beacon URL with
the virtual URL and and add the beacon type as a "virtual_type" field or
something, making it very easy to expose it >everywhere where views are
tracked, while EventLogging data gets stored in a different, unrelated way.
Anything that involves combing through *1 terabyte of data a day and 150,000
requests per second at peak* cannot be considered "simple" or "trivial".
Rather than looking for a needle in the haystack, let's please rely on the
client to send you preselected data (events). That data can be
aggregated later in different ways, and the fact that the data comes from
event logging does not dictate how aggregation needs to happen.




On Wed, Jan 17, 2018 at 6:09 PM, Gergo Tisza  wrote:

> On Wed, Jan 17, 2018 at 10:54 AM, Nuria Ruiz  wrote:
>
>> Recording "preview_events" is really no different that recording any
>> other kind of UI event, difference is going to come from scale if anything,
>> as they are probably tens of thousands of those per second (I think your
>> team already estimated volume, if so please send those estimates along)
>>
>
> Conceptually I think a virtual pageview is a different thing from a UI
> event (which is how e.g. Google Analytics handles it, there is a method to
> send an event for the current page and a different method to send a virtual
> pageview for a different page), and the ideal way it is exposed in an
> analytics system should be very different. (I would want to see virtual
> pageviews together with normal pageviews, with some filtering option. If I
> deploy code that shows previews and converts users from making real
> pageviews to making virtual pageviews, I want to see how the total
> pageviews changed in the normal pageview stats; I don't want to have to
> create that chart and export one dataset from pageviews and one dataset
> from eventlogging to do that. As a user, I want to see in the fileview API
> how many people looked at the photo I uploaded, I don't particularly care
> if they used MediaViewer or not. etc.)
>
> So maybe it's worth considering which approach takes us closer to that?
> AIUI the beacon puts the record into the webrequest table and from there it
> would only take some trivial preprocessing to replace the beacon URL with
> the virtual URL and and add the beacon type as a "virtual_type" field or
> something, making it very easy to expose it everywhere where views are
> tracked, while EventLogging data gets stored in a different, unrelated way.
>


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-19 Thread Adam Baso
> >Thanks, Sam. Nuria, that's what I was getting at - if using the EL JS
> library would some sort of new method be needed so that these impressions
> aren't undercounted?
> If we had a lot of users with DNT, maybe, from our tests when we enabled
> that on EL this is not the case.
>

Thanks, good to know - is there a report around that? I'm wondering how
"missing requests" ought to be expressed with some margin of error.


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-19 Thread Andrew Otto
> You could join these together in a broader ‘content consumption’ dataset
somehow, either in Hadoop with batch jobs, or more realtime with streaming
jobs.

Hm, idea…which I think has been mentioned before:  Could we leave pageviews
as is, but make a new dataset that counts both pageviews and page
previews?  Maybe this is ‘content_views’?  We could explicitly state that
the definition of content_views is supposed to change with time, and could
possibly incorporate other future types of content views too. Eh?





On Fri, Jan 19, 2018 at 12:27 PM, Nuria Ruiz  wrote:

> >Thanks, Sam. Nuria, that's what I was getting at - if using the EL JS
> library would some sort of new method be needed so that these impressions
> aren't undercounted?
> If we had a lot of users with DNT, maybe, from our tests when we enabled
> that on EL this is not the case. Your team has already run experiments on
> this functionality and they can speak as to the projection of numbers.
>
> On Fri, Jan 19, 2018 at 3:05 AM, Adam Baso  wrote:
>
>> Thanks, Sam. Nuria, that's what I was getting at - if using the EL JS
>> library would some sort of new method be needed so that these impressions
>> aren't undercounted?
>>
>> On Fri, Jan 19, 2018 at 4:49 AM, Sam Smith 
>> wrote:
>>
>>> On Thu, Jan 18, 2018 at 9:57 PM, Adam Baso  wrote:
>>>
 Adding to this, one thing to consider is DNT - is there a way to invoke
 EL so that such traffic is appropriately imputed or something?

>>>
>>> The EventLogging client respects DNT [0]. When the user enables DNT,
>>> mw.eventLog.logEvent is a NOP.
>>>
>>> I don't see any mention of DNT in the Varnish VCLs around the
>>> /beacon endpoint or otherwise but it may be handled elsewhere. While it's
>>> unlikely, there's nothing stopping a client sending a well-formatted
>>> request to the /beacon/event endpoint directly [1], ignoring the user's
>>> choice.
>>>
>>> -Sam
>>>
>>> [0] https://phabricator.wikimedia.org/diffusion/EEVL/browse/master/modules/ext.eventLogging.core.js;4480f7e27140fcb8ae915c1755223fd7a5bab9b9$251
>>> [1] https://phabricator.wikimedia.org/diffusion/EEVL/browse/master/modules/ext.eventLogging.core.js;4480f7e27140fcb8ae915c1755223fd7a5bab9b9$215
>>>
>>>
>>>


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-19 Thread Nuria Ruiz
>Thanks, Sam. Nuria, that's what I was getting at - if using the EL JS
library would some sort of new method be needed so that these impressions
aren't undercounted?
If we had a lot of users with DNT, maybe; from our tests when we enabled
that on EL, this is not the case. Your team has already run experiments on
this functionality and they can speak as to the projection of numbers.

On Fri, Jan 19, 2018 at 3:05 AM, Adam Baso  wrote:

> Thanks, Sam. Nuria, that's what I was getting at - if using the EL JS
> library would some sort of new method be needed so that these impressions
> aren't undercounted?
>
> On Fri, Jan 19, 2018 at 4:49 AM, Sam Smith  wrote:
>
>> On Thu, Jan 18, 2018 at 9:57 PM, Adam Baso  wrote:
>>
>>> Adding to this, one thing to consider is DNT - is there a way to invoke
>>> EL so that such traffic is appropriately imputed or something?
>>>
>>
>> The EventLogging client respects DNT [0]. When the user enables DNT,
>> mw.eventLog.logEvent is a NOP.
>>
>> I don't see any mention of DNT in the Varnish VCLs around the /beacon
>> endpoint or otherwise but it may be handled elsewhere. While it's unlikely,
>> there's nothing stopping a client sending a well-formatted request to the
>> /beacon/event endpoint directly [1], ignoring the user's choice.
>>
>> -Sam
>>
>> [0] https://phabricator.wikimedia.org/diffusion/EEVL/browse/master/modules/ext.eventLogging.core.js;4480f7e27140fcb8ae915c1755223fd7a5bab9b9$251
>> [1] https://phabricator.wikimedia.org/diffusion/EEVL/browse/master/modules/ext.eventLogging.core.js;4480f7e27140fcb8ae915c1755223fd7a5bab9b9$215
>>
>>
>>


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-19 Thread Andrew Otto
>  For virtual pageviews, people will probably be more interested in
reports that belong to the first group (summing them up with normal
pageviews, breaking them down along the dimensions that are relevant for
web traffic, counting them for a given URL etc).

Ah! Ok I get this use case now.   I might not be able to comment about this
much then.  I think this totally changes the meaning of a pageview.
Perhaps this is what you want?  If so, this is outside the realm of my
opinionatedness. :)

However, IF you do convince folks to change the meaning of ‘pageview’ to
include ‘previews’, then we might be able to compromise.  All I object to is
more filtering of webrequests :)  The rest of this email might be moot if
we don’t change the ‘pageview definition’, but I’ll continue anyway…


The page previews data could come in as events.  Augmenting the generated
pageviews table from more incoming event sources sounds more flexible than
doing more filtering logic in webrequests.  I’d defer to the Analytics team
members who would be implementing this though, I might be wrong.

In my ideal, pageviews and page_previews would both be separate event
streams.  These would be imported as is to Hive tables, but also available
in Kafka.  You could join these together in a broader ‘content consumption’
dataset somehow, either in Hadoop with batch jobs, or more realtime with
streaming jobs.  (If this is done right, you can even use the same code for
both cases.)  If we had a good stream processing system here, I might
suggest that we move pageview filtering to a more realtime setup and
generate a derived pageview stream in Kafka. We’d then use that as the source
of pageviews in Hadoop.   Anyway, this is my ideal setup, but not what we
have now!  But we might one day (in the next FY???), and intaking events
for page previews and other counters will help us migrate to this kind
of architecture later.

> Is that different from preprocessing them via EventLogging? Either way
you take a HTTP request, and end up with a Hadoop record - is there
something that makes that process a lot more costly for normal pageviews
than EventLogging beacon hits?

From a hardware perspective, only in that the stream of events is much
smaller, so there’s less wasted repeated I/O.  From a engineering time
perspective, if we use the webrequest tagging system to do this, I think
we’re good, but only in the short term.  In the long term, it hides the
complexity involved in maintaining the logic of what a pageview or page
preview or any other ‘tagged’ webrequest is in complicated Java logic that is
really only usable in Hadoop.  I’m mainly objecting because we want to
draw a line to stop doing this kind of thing.  Doing this for page previews
now might be ok if we really really really have to (although Nuria might
not agree ;) ), but ultimately we need to push this kind of interaction
logic out to feature developers who have more control over it.

The Analytics team wants to build infrastructure that make it easy for
developers to measure their product usage, not implement the measuring
logic ourselves.





On Fri, Jan 19, 2018 at 6:05 AM, Adam Baso  wrote:

> Thanks, Sam. Nuria, that's what I was getting at - if using the EL JS
> library would some sort of new method be needed so that these impressions
> aren't undercounted?
>
> On Fri, Jan 19, 2018 at 4:49 AM, Sam Smith  wrote:
>
>> On Thu, Jan 18, 2018 at 9:57 PM, Adam Baso  wrote:
>>
>>> Adding to this, one thing to consider is DNT - is there a way to invoke
>>> EL so that such traffic is appropriately imputed or something?
>>>
>>
>> The EventLogging client respects DNT [0]. When the user enables DNT,
>> mw.eventLog.logEvent is a NOP.
>>
>> I don't see any mention of DNT in the Varnish VCLs around the /beacon
>> endpoint or otherwise but it may be handled elsewhere. While it's unlikely,
>> there's nothing stopping a client sending a well-formatted request to the
>> /beacon/event endpoint directly [1], ignoring the user's choice.
>>
>> -Sam
>>
>> [0] https://phabricator.wikimedia.org/diffusion/EEVL/browse/master/modules/ext.eventLogging.core.js;4480f7e27140fcb8ae915c1755223fd7a5bab9b9$251
>> [1] https://phabricator.wikimedia.org/diffusion/EEVL/browse/master/modules/ext.eventLogging.core.js;4480f7e27140fcb8ae915c1755223fd7a5bab9b9$215
>>
>>
>>


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-19 Thread Adam Baso
Thanks, Sam. Nuria, that's what I was getting at - if using the EL JS
library would some sort of new method be needed so that these impressions
aren't undercounted?

On Fri, Jan 19, 2018 at 4:49 AM, Sam Smith  wrote:

> On Thu, Jan 18, 2018 at 9:57 PM, Adam Baso  wrote:
>
>> Adding to this, one thing to consider is DNT - is there a way to invoke
>> EL so that such traffic is appropriately imputed or something?
>>
>
> The EventLogging client respects DNT [0]. When the user enables DNT,
> mw.eventLog.logEvent is a NOP.
>
> I don't see any mention of DNT in the Varnish VCLs around the /beacon
> endpoint or otherwise but it may be handled elsewhere. While it's unlikely,
> there's nothing stopping a client sending a well-formatted request to the
> /beacon/event endpoint directly [1], ignoring the user's choice.
>
> -Sam
>
> [0] https://phabricator.wikimedia.org/diffusion/EEVL/browse/master/modules/ext.eventLogging.core.js;4480f7e27140fcb8ae915c1755223fd7a5bab9b9$251
> [1] https://phabricator.wikimedia.org/diffusion/EEVL/browse/master/modules/ext.eventLogging.core.js;4480f7e27140fcb8ae915c1755223fd7a5bab9b9$215
>
>
>


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-19 Thread Sam Smith
On Thu, Jan 18, 2018 at 9:57 PM, Adam Baso  wrote:

> Adding to this, one thing to consider is DNT - is there a way to invoke EL
> so that such traffic is appropriately imputed or something?
>

The EventLogging client respects DNT [0]. When the user enables DNT,
mw.eventLog.logEvent is a NOP.

I don't see any mention of DNT in the Varnish VCLs around the /beacon
endpoint or otherwise but it may be handled elsewhere. While it's unlikely,
there's nothing stopping a client sending a well-formatted request to the
/beacon/event endpoint directly [1], ignoring the user's choice.

-Sam

[0]
https://phabricator.wikimedia.org/diffusion/EEVL/browse/master/modules/ext.eventLogging.core.js;4480f7e27140fcb8ae915c1755223fd7a5bab9b9$251
[1]
https://phabricator.wikimedia.org/diffusion/EEVL/browse/master/modules/ext.eventLogging.core.js;4480f7e27140fcb8ae915c1755223fd7a5bab9b9$215
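Hypothetically, a server-side guard for that gap could be as small as checking the DNT header before accepting a /beacon hit; a Python sketch follows (nothing like this currently exists in the VCLs, as noted above):

```python
def accept_beacon_hit(headers):
    # Drop events whose request carries DNT: 1, mirroring the
    # client-side NOP for hand-crafted requests that bypass it.
    # (Hypothetical -- no such server-side check exists today.)
    return headers.get("DNT") != "1"

print(accept_beacon_hit({"DNT": "1"}))  # False -> discard
print(accept_beacon_hit({}))            # True  -> keep
```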


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-19 Thread Gergo Tisza
On Thu, Jan 18, 2018 at 3:56 PM, Nuria Ruiz  wrote:

> Event logging use cases are events, as we move to a thicker client -more
> javascript heavy- you will be needing to measure events for -nearly-
> everything, whether those are to be considered "content consumption" or "ui
> interaction" is not that relevant. Example: video plays are content
> consumption and are also "ui interactions".
>

That could be an argument for not separating pageviews from events (in
which the question of whether virtual pageviews should be more like
pageviews or more like events would be moot), but given that those *are*
separated I don't see how it applies. In the current analytics setup, and
given what kinds of frontends are currently supported, there are types of
report generation that are easier to perform on pageviews and not so easy
on events, and other types of report generation that are easier to do on
events. For virtual pageviews, people will probably be more interested in
reports that belong to the first group (summing them up with normal
pageviews, breaking them down along the dimensions that are relevant for
web traffic, counting them for a given URL etc).

On Thu, Jan 18, 2018 at 10:45 AM, Andrew Otto  wrote:

> > the beacon puts the record into the webrequest table and from there it
> would only take some trivial preprocessing
> ‘Trivial’ preprocessing that has to look through 150K requests per second!
> This is a lot of work!
>

Is that different from preprocessing them via EventLogging? Either way you
take a HTTP request, and end up with a Hadoop record - is there something
that makes that process a lot more costly for normal pageviews than
EventLogging beacon hits?

Anyway what I meant by trivial preprocessing is that you take something
like *http://bits.wikimedia.org/beacon/page-preview?duration=123=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FFoo*,
convert it into *https://en.wikipedia.org/wiki/Foo*, tack the duration and the type
('page-preview') into some extra fields, add those extra fields to the
dimensions along which pageviews can be inspected, and you have integrated
virtual views into your analytics APIs / UIs, almost for free. The
alternative would be that every analytics customer who wants to deal
with content
consumption and does not want to automatically filter out content
consumption happening via thick clients would have to update their
interfaces and do some kind of union query to merge the data that's now
distributed between the webrequest table and one or more EventLogging
tables; surely that's less expedient?

If we use webrequests+Hadoop tagging to count these, any time in the future
> there is a change to the URLs that page previews load (or the beacon URLs
> they hit), we’d have to make a patch to the tagging logic and release and
> deploy a new refinery version to account for the change.  Any time a new
> feature is added for which someone wants interactions counted, we have to
> do the same.


There doesn't seem to be much reason for the beacon URL to ever change. As
for new beacon endpoints (new virtual view types), why can't that just be a
whitelist that's offloaded to configuration?


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-18 Thread Nuria Ruiz
> I don't see how this addresses Gergo's larger point about the difference
between consistently tallying content consumption (pageviews, previews,
mediaviewer image views) >and analyzing UI interactions (which is the main
use case that EventLogging has been developed and used for).

Event logging use cases are events, as we move to a thicker client -more
javascript heavy- you will be needing to measure events for -nearly-
everything, whether those are to be considered "content consumption" or "ui
interaction" is not that relevant. Example: video plays are content
consumption and are also "ui interactions".

We are the only major website that does not have a thick client and this
notion of joining UI interactions and consumption is new to us but really
it is not that new at all.


On Thu, Jan 18, 2018 at 3:17 PM, Tilman Bayer  wrote:

>
> On Thu, Jan 18, 2018 at 8:16 AM, Nuria Ruiz  wrote:
>
>> Gergo,
>>
>> >while EventLogging data gets stored in a different, unrelated way
>> Not really, This has changed quite a bit as of the last two quarters.
>> Eventlogging data as of recent gets preprocessed and refined similar to how
>> webrequest data is preprocessed and refined. You can have a dashboard on
>> top of some eventlogging schemas on superset in the same way you have a
>> dashboard that displays pageview data on superset.
>>
>
> I don't see how this addresses Gergo's larger point about the difference
> between consistently tallying content consumption (pageviews, previews,
> mediaviewer image views) and analyzing UI interactions (which is the main
> use case that EventLogging has been developed and used for). There are
> really quite a few differences between these two. For example, UI
> instrumentations on the web are almost always sampled, because that yields
> enough data to answer UI questions - but on the other hand tend to record
> much more detail about the individual interaction. In contrast, we register
> all pageviews unsampled, but don't keep a permanent record of every single
> one of them with precise timestamps - rather, we have aggregated tables
> (pageview_hourly in particular). Our EventLogging backend is not tailored
> to that.
>
>
>
>>
>> See dashboards on superset (user required).
>>
>> https://superset.wikimedia.org/superset/dashboard/7/?preselect_filters=%7B%7D
>>
>> And (again, user required) EL data on druid, this very same data we are
>> talking about, page previews:
>>
>> https://pivot.wikimedia.org/#tbayer_popups
>>
>
> That's actually not the "very same data we are talking about". You can
> rest assured that the web team (and Sam in particular) has already been
> aware of the existence of the Popups instrumentation for page previews. The
> team spent considerable effort building it in order to understand how users
> interact with the feature's UI. Now comes the separate effort of
> systematically tallying content consumption from this new channel. Superset
> and Pivot are great, but are nowhere near providing all the ways that WMF
> analysts and community members currently have to study pageview data.
> Storing data about seen previews in the same way as we do for pageviews,
> for example in the pageview_hourly (suitably tagged, perhaps giving that
> table a more general name) would facilitate that a lot, by allowing us to
> largely reuse the work that during the past few years went into getting
> pageview aggregation right.
>
>
>>
>> >I was going to make the point that #2 already has a processing pipeline
>> established whereas #1 doesn't.
>> This is incorrect, we mark as "preview" data that we want to exclude
>> from processing, see:
>> https://github.com/wikimedia/analytics-refinery-source/blob/
>> master/refinery-core/src/main/java/org/wikimedia/analytics/r
>> efinery/core/PageviewDefinition.java#L144
>> Naming is unfortunate but previews are really "preloads" as in requests
>> we make (and cache locally) and maybe shown to users or not.
>>
>>
>> But again, tracking of events is better done on an event based system and
>> EL is such a system.
>>
>
> Again, tracking of individual events is not the ultimate goal here.
>
>
> --
> Tilman Bayer
> Senior Analyst
> Wikimedia Foundation
> IRC (Freenode): HaeB
>


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-18 Thread Andrew Otto
> Are you saying that the server load generated by such an additional
aggregation query would be a blocker? If yes, how about we combine the two
(for pageviews and previews) into one?

Sorry, no it isn’t a blocker.   The tagging logic that Nuria and others
have been working on for a while now makes this a little easier, since the
webrequests only need to be read once to add all tags.  It is separate from
pageviews (for now), but we might use tagging for pageviews eventually too.

>  I assume it could be quite analogous to the one your team has
implemented for pageviews
If we did it like the linked Hive query, it would be quite a lot of work.
We don’t want to read every webrequest from disk for every aggregate dataset.
Tagging helps, since we define the set of tags and filters once, and the
job that adds tags reads all webrequests once and adds all tags.

But anyway, yes, it can be done.
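The single-pass tagging Andrew describes could be sketched roughly as follows. This is a toy Python stand-in, not the actual refinery Java logic; the field names and tag predicates are illustrative assumptions:

```python
# Toy sketch of single-pass webrequest tagging: each request is read once,
# and every registered tagger gets a chance to attach its tag. The real
# implementation lives in the analytics-refinery Java code; the predicates
# below are simplified assumptions.

def tag_pageview(req):
    # Stand-in for the real pageview definition.
    if req.get("uri_path", "").startswith("/wiki/") and \
            "preview=1" not in req.get("x_analytics", ""):
        return "pageview"
    return None

def tag_preview(req):
    # Stand-in: requests flagged as previews in the x_analytics header.
    if "preview=1" in req.get("x_analytics", ""):
        return "preview"
    return None

TAGGERS = [tag_pageview, tag_preview]

def tag_requests(requests):
    """Single pass over webrequests; all tags are added in one read."""
    for req in requests:
        req["tags"] = [t for tagger in TAGGERS if (t := tagger(req)) is not None]
        yield req
```

Downstream aggregation jobs would then filter on the precomputed tags instead of re-scanning the raw webrequest logs for each dataset.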

I’m mostly objecting and recommending EventLogging because we really
shouldn’t be searching webrequests to measure interactions over and over
again.  It's fragile and monolithic and not very portable.
Events are better :)



On Thu, Jan 18, 2018 at 6:44 PM, Tilman Bayer  wrote:

> On Thu, Jan 18, 2018 at 10:45 AM, Andrew Otto  wrote:
>
>> > the beacon puts the record into the webrequest table and from there it
>> would only take some trivial preprocessing
>> ‘Trivial’ preprocessing that has to look through 150K requests per
>> second! This is a lot of work!
>>
>
> I think Gergo may have been referring to the human work involved in
> implementing that preprocessing step. I assume it could be quite analogous
> to the one your team has implemented for pageviews: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/pageview/hourly/pageview_hourly.hql
>
> Are you saying that the server load generated by such an additional
> aggregation query would be a blocker? If yes, how about we combine the two
> (for pageviews and previews) into one?
>
>
> --
> Tilman Bayer
> Senior Analyst
> Wikimedia Foundation
> IRC (Freenode): HaeB
>
>
>


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-18 Thread Tilman Bayer
On Thu, Jan 18, 2018 at 10:45 AM, Andrew Otto  wrote:

> > the beacon puts the record into the webrequest table and from there it
> would only take some trivial preprocessing
> ‘Trivial’ preprocessing that has to look through 150K requests per second!
> This is a lot of work!
>

I think Gergo may have been referring to the human work involved in
implementing that preprocessing step. I assume it could be quite analogous
to the one your team has implemented for pageviews: https://github.com/wikimedia/analytics-refinery/blob/master/oozie/pageview/hourly/pageview_hourly.hql

Are you saying that the server load generated by such an additional
aggregation query would be a blocker? If yes, how about we combine the two
(for pageviews and previews) into one?


-- 
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-18 Thread Andrew Otto
> For example, UI instrumentations on the web are almost always sampled,
because that yields enough data to answer UI questions - but on the other
hand tend to record much more detail about the individual interaction. In
contrast, we register all pageviews unsampled, but don't keep a permanent
record of every single one of them with precise timestamps - rather, we
have aggregated tables (pageview_hourly in particular). Our EventLogging
backend is not tailored to that.

When you say “Our EventLogging backend here”, what are you referring to?
If MySQL, then for sure. :)


> Storing data about seen previews in the same way as we do for pageviews,
for example in the pageview_hourly (suitably tagged, perhaps giving that
table a more general name) would facilitate that a lot, by allowing us to
largely reuse the work that during the past few years went into getting
pageview aggregation right.

I’m not totally opposed to doing it this way, but at some point we need to
realize that this isn’t a scalable (human and CPU resource-wise) way to
measure user feature interaction.

I don’t think a pageview is inherently different from any other kind of
impression; it’s just that we didn’t have the ability in the past (or now?)
for pageviews to be collected and measured like they should.  If we were
designing an interaction measurement system now, it wouldn’t look exactly
like EventLogging, but it would look like something close to it.  And if it
did everything I’d want it to, we would use it to measure pageviews and
everything else you’ve mentioned.

Making events the source of truth is more accurate than implementing
custom batch logic in Hadoop to comb through webrequests and filter out
what you are looking for.  It pushes control of the definition of what
counts as a ‘pageview’ or ‘page preview’ to the folks who are developing
the app/website/feature.  If we use webrequests+Hadoop tagging to count
these, any time in the future there is a change to the URLs that page
previews load (or the beacon URLs they hit), we’d have to make a patch to
the tagging logic and release and deploy a new refinery version to account
for the change.  Any time a new feature is added for which someone wants
interactions counted, we have to do the same.

Heck, if you use events, you could very easily consume, aggregate, or emit
them anywhere you wanted: your own datastore, a grafana dashboard, a
monitoring system, etc. :)  It will also help us to standardize this
type of thing, so that in the future creation of new dashboards can be more
automated.

On Thu, Jan 18, 2018 at 6:17 PM, Tilman Bayer  wrote:

>
> On Thu, Jan 18, 2018 at 8:16 AM, Nuria Ruiz  wrote:
>
>> Gergo,
>>
>> >while EventLogging data gets stored in a different, unrelated way
>> Not really, This has changed quite a bit as of the last two quarters.
>> Eventlogging data as of recent gets preprocessed and refined similar to how
>> webrequest data is preprocessed and refined. You can have a dashboard on
>> top of some eventlogging schemas on superset in the same way you have a
>> dashboard that displays pageview data on superset.
>>
>
> I don't see how this addresses Gergo's larger point about the difference
> between consistently tallying content consumption (pageviews, previews,
> mediaviewer image views) and analyzing UI interactions (which is the main
> use case that EventLogging has been developed and used for). There are
> really quite a few differences between these two. For example, UI
> instrumentations on the web are almost always sampled, because that yields
> enough data to answer UI questions - but on the other hand tend to record
> much more detail about the individual interaction. In contrast, we register
> all pageviews unsampled, but don't keep a permanent record of every single
> one of them with precise timestamps - rather, we have aggregated tables
> (pageview_hourly in particular). Our EventLogging backend is not tailored
> to that.
>
>
>
>>
>> See dashboards on superset (user required).
>>
>> https://superset.wikimedia.org/superset/dashboard/7/?preselect_filters=%7B%7D
>>
>> And (again, user required) EL data on druid, this very same data we are
>> talking about, page previews:
>>
>> https://pivot.wikimedia.org/#tbayer_popups
>>
>
> That's actually not the "very same data we are talking about". You can
> rest assured that the web team (and Sam in particular) has already been
> aware of the existence of the Popups instrumentation for page previews. The
> team spent considerable effort building it in order to understand how users
> interact with the feature's UI. Now comes the separate effort of
> systematically tallying content consumption from this new channel. Superset
> and Pivot are great, but are nowhere near providing all the ways that WMF
> analysts and community members currently have to study pageview data.
> Storing data about seen previews in the same way as we do for pageviews,
> for example in the pageview_hourly (suitably tagged, perhaps giving that
> table a more general name) would facilitate that a lot, by allowing us to
> largely reuse the work that during the past few years went into getting
> pageview aggregation right.

Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-18 Thread Tilman Bayer
On Thu, Jan 18, 2018 at 8:16 AM, Nuria Ruiz  wrote:

> Gergo,
>
> >while EventLogging data gets stored in a different, unrelated way
> Not really, This has changed quite a bit as of the last two quarters.
> Eventlogging data as of recent gets preprocessed and refined similar to how
> webrequest data is preprocessed and refined. You can have a dashboard on
> top of some eventlogging schemas on superset in the same way you have a
> dashboard that displays pageview data on superset.
>

I don't see how this addresses Gergo's larger point about the difference
between consistently tallying content consumption (pageviews, previews,
mediaviewer image views) and analyzing UI interactions (which is the main
use case that EventLogging has been developed and used for). There are
really quite a few differences between these two. For example, UI
instrumentations on the web are almost always sampled, because that yields
enough data to answer UI questions - but on the other hand tend to record
much more detail about the individual interaction. In contrast, we register
all pageviews unsampled, but don't keep a permanent record of every single
one of them with precise timestamps - rather, we have aggregated tables
(pageview_hourly in particular). Our EventLogging backend is not tailored
to that.



>
> See dashboards on superset (user required).
>
> https://superset.wikimedia.org/superset/dashboard/7/?preselect_filters=%7B%7D
>
> And (again, user required) EL data on druid, this very same data we are
> talking about, page previews:
>
> https://pivot.wikimedia.org/#tbayer_popups
>

That's actually not the "very same data we are talking about". You can rest
assured that the web team (and Sam in particular) has already been aware of
the existence of the Popups instrumentation for page previews. The team
spent considerable effort building it in order to understand how users
interact with the feature's UI. Now comes the separate effort of
systematically tallying content consumption from this new channel. Superset
and Pivot are great, but are nowhere near providing all the ways that WMF
analysts and community members currently have to study pageview data.
Storing data about seen previews in the same way as we do for pageviews,
for example in the pageview_hourly (suitably tagged, perhaps giving that
table a more general name) would facilitate that a lot, by allowing us to
largely reuse the work that during the past few years went into getting
pageview aggregation right.


>
> >I was going to make the point that #2 already has a processing pipeline
> established whereas #1 doesn't.
> This is incorrect, we mark as "preview" data that we want to exclude from
> processing, see:
> https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java#L144
> Naming is unfortunate but previews are really "preloads" as in requests we
> make (and cache locally) and maybe shown to users or not.
>
>
> But again, tracking of events is better done on an event based system and
> EL is such a system.
>
>
> Again, tracking of individual events is not the ultimate goal here.


-- 
Tilman Bayer
Senior Analyst
Wikimedia Foundation
IRC (Freenode): HaeB


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-18 Thread Nuria Ruiz
>Adding to this, one thing to consider is DNT - is there a way to invoke EL
so that such traffic is appropriately imputed or something?

I am not sure what you are asking ...

On Thu, Jan 18, 2018 at 1:57 PM, Adam Baso  wrote:

> (I'd defer to the Readers Web team with Tilman on whether country
> extracted from the cookie would be sufficient.)
>
> Adding to this, one thing to consider is DNT - is there a way to invoke EL
> so that such traffic is appropriately imputed or something?
>
> -Adam
>
> On Thu, Jan 18, 2018 at 2:13 PM, Andrew Otto  wrote:
>
>> >  In particular, will we be able to sort by country, OS, Browser, etc?
>> OS, Browser, yes.  User Agent parsing is done by the EventLogging
>> processors.
>>
>> Country not quite as easily, as EventLogging does not include client
>> IP addresses.  We could consider putting this back in somehow, or, I’ve
>> also heard that there is a geocoded country cookie that varnish will set
>> that the browser could send back as part of the event.  Is country enough
>> geo detail?
>>
>>
>>
>> On Thu, Jan 18, 2018 at 2:30 PM, Olga Vasileva 
>> wrote:
>>
>>> Hi all,
>>>
>>> I just want to confirm that the proposed method using Eventlogging will
>>> allow us to gather data in a similar fashion to the web request table.  In
>>> particular, will we be able to sort by country, OS, Browser, etc?  Our goal
>>> here is to be able to consider the new page interactions metric on the same
>>> level and with the same depth as pageviews.
>>>
>>> Thanks!
>>>
>>> - Olga
>>>
>>> On Thu, Jan 18, 2018 at 12:46 PM Andrew Otto  wrote:
>>>
 > the beacon puts the record into the webrequest table and from there
 it would only take some trivial preprocessing
 ‘Trivial’ preprocessing that has to look through 150K requests per
 second! This is a lot of work!

 > tracking of events is better done on an event based system and EL is
 such a system.
 I agree with this too.  We really want to discourage people from trying
 to measure things by searching through the huge haystack of all
 webrequests.  To measure something, you should emit an event if you can.
 If it were practical, I’d prefer that we did this for pageviews as well.
 Currently, we need a complicated definition of what a pageview is, which
 really only exists in the Java implementation in the Hadoop cluster.  It’d
 be much clearer if app developers had a way to define themselves what
 counts as a pageview, and emit that as an event.

 This should be the approach that people take when they want to measure
 something new.  Emit an event!  This event will get its own Kafka topic
 (you can consume this to do whatever you like with it), and be refined into
 its own Hive table.

 >  I don’t want to have to create that chart and export one dataset
 from pageviews and one dataset from eventlogging to do that.
 If you also design your schema nicely, it will be easily importable into
 Druid and usable in Pivot and Superset, alongside of pageviews.  We’re
 working on getting nice schemas automatically imported into druid.




 On Thu, Jan 18, 2018 at 11:16 AM, Nuria Ruiz 
 wrote:

> Gergo,
>
> >while EventLogging data gets stored in a different, unrelated way
> Not really, This has changed quite a bit as of the last two quarters.
> Eventlogging data as of recent gets preprocessed and refined similar to 
> how
> webrequest data is preprocessed and refined. You can have a dashboard on
> top of some eventlogging schemas on superset in the same way you have a
> dashboard that displays pageview data on superset.
>
> See dashboards on superset (user required).
>
> https://superset.wikimedia.org/superset/dashboard/7/?preselect_filters=%7B%7D
>
> And (again, user required) EL data on druid, this very same data we
> are talking about, page previews:
>
> https://pivot.wikimedia.org/#tbayer_popups
>
>
> >I was going to make the point that #2 already has a processing
> pipeline established whereas #1 doesn't.
> This is incorrect, we mark as "preview" data that we want to exclude
> from processing, see:
> https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java#L144
> Naming is unfortunate but previews are really "preloads" as in
> requests we make (and cache locally) and maybe shown to users or not.
>
>
> But again, tracking of events is better done on an event based system
> and EL is such a system.
>
>
>
>
> 

Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-18 Thread Andrew Otto
>  In particular, will we be able to sort by country, OS, Browser, etc?
OS, Browser, yes.  User Agent parsing is done by the EventLogging
processors.

Country not quite as easily, as EventLogging does not include client
IP addresses.  We could consider putting this back in somehow, or, I’ve
also heard that there is a geocoded country cookie that varnish will set
that the browser could send back as part of the event.  Is country enough
geo detail?
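For what it's worth, extracting the country from that Varnish-set GeoIP cookie could look roughly like the sketch below. The assumed cookie format ("COUNTRY:REGION:CITY:LAT:LON:VERSION", e.g. "US:CA:San Francisco:37.78:-122.42:v4") should be verified against the actual geoip VCL before relying on it:

```python
# Sketch: pull the two-letter country code out of a GeoIP cookie value.
# The "COUNTRY:REGION:CITY:LAT:LON:VERSION" layout is an assumption to
# check against the Varnish geoip VCL that sets the cookie.

def country_from_geoip_cookie(value):
    """Return the two-letter country code, or None if absent or malformed."""
    if not value:
        return None
    country = value.split(":", 1)[0].strip()
    return country if len(country) == 2 and country.isalpha() else None
```

The browser would send the cookie value back as an event field, and this parsing would happen server-side during refinement.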



On Thu, Jan 18, 2018 at 2:30 PM, Olga Vasileva 
wrote:

> Hi all,
>
> I just want to confirm that the proposed method using Eventlogging will
> allow us to gather data in a similar fashion to the web request table.  In
> particular, will we be able to sort by country, OS, Browser, etc?  Our goal
> here is to be able to consider the new page interactions metric on the same
> level and with the same depth as pageviews.
>
> Thanks!
>
> - Olga
>
> On Thu, Jan 18, 2018 at 12:46 PM Andrew Otto  wrote:
>
>> > the beacon puts the record into the webrequest table and from there it
>> would only take some trivial preprocessing
>> ‘Trivial’ preprocessing that has to look through 150K requests per
>> second! This is a lot of work!
>>
>> > tracking of events is better done on an event based system and EL is
>> such a system.
>> I agree with this too.  We really want to discourage people from trying
>> to measure things by searching through the huge haystack of all
>> webrequests.  To measure something, you should emit an event if you can.
>> If it were practical, I’d prefer that we did this for pageviews as well.
>> Currently, we need a complicated definition of what a pageview is, which
>> really only exists in the Java implementation in the Hadoop cluster.  It’d
>> be much clearer if app developers had a way to define themselves what
>> counts as a pageview, and emit that as an event.
>>
>> This should be the approach that people take when they want to measure
>> something new.  Emit an event!  This event will get its own Kafka topic
>> (you can consume this to do whatever you like with it), and be refined into
>> its own Hive table.
>>
>> >  I don’t want to have to create that chart and export one dataset from
>> pageviews and one dataset from eventlogging to do that.
>>  If you also design your schema nicely, it will be easily importable into
>> Druid and usable in Pivot and Superset, alongside of pageviews.  We’re
>> working on getting nice schemas automatically imported into druid.
>>
>>
>>
>>
>> On Thu, Jan 18, 2018 at 11:16 AM, Nuria Ruiz  wrote:
>>
>>> Gergo,
>>>
>>> >while EventLogging data gets stored in a different, unrelated way
>>> Not really, This has changed quite a bit as of the last two quarters.
>>> Eventlogging data as of recent gets preprocessed and refined similar to how
>>> webrequest data is preprocessed and refined. You can have a dashboard on
>>> top of some eventlogging schemas on superset in the same way you have a
>>> dashboard that displays pageview data on superset.
>>>
>>> See dashboards on superset (user required).
>>>
>>> https://superset.wikimedia.org/superset/dashboard/7/?preselect_filters=%7B%7D
>>>
>>> And (again, user required) EL data on druid, this very same data we are
>>> talking about, page previews:
>>>
>>> https://pivot.wikimedia.org/#tbayer_popups
>>>
>>>
>>> >I was going to make the point that #2 already has a processing
>>> pipeline established whereas #1 doesn't.
>>> This is incorrect, we mark as "preview" data that we want to exclude
>>> from processing, see:
>>> https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java#L144
>>> Naming is unfortunate but previews are really "preloads" as in requests
>>> we make (and cache locally) and maybe shown to users or not.
>>>
>>>
>>> But again, tracking of events is better done on an event based system
>>> and EL is such a system.
>>>
>>>
>>>
>>>
>>>
>>>
>>
>
>
> --
> Olga Vasileva // Product Manager // Reading Web Team
> https://wikimediafoundation.org/
>
>
>


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-18 Thread Olga Vasileva
Hi all,

I just want to confirm that the proposed method using Eventlogging will
allow us to gather data in a similar fashion to the web request table.  In
particular, will we be able to sort by country, OS, Browser, etc?  Our goal
here is to be able to consider the new page interactions metric on the same
level and with the same depth as pageviews.

Thanks!

- Olga

On Thu, Jan 18, 2018 at 12:46 PM Andrew Otto  wrote:

> > the beacon puts the record into the webrequest table and from there it
> would only take some trivial preprocessing
> ‘Trivial’ preprocessing that has to look through 150K requests per second!
> This is a lot of work!
>
> > tracking of events is better done on an event based system and EL is
> such a system.
> I agree with this too.  We really want to discourage people from trying to
> measure things by searching through the huge haystack of all webrequests.
> To measure something, you should emit an event if you can.  If it were
> practical, I’d prefer that we did this for pageviews as well.  Currently,
> we need a complicated definition of what a pageview is, which really only
> exists in the Java implementation in the Hadoop cluster.  It’d be much
> clearer if app developers had a way to define themselves what counts as a
> pageview, and emit that as an event.
>
> This should be the approach that people take when they want to measure
> something new.  Emit an event!  This event will get its own Kafka topic
> (you can consume this to do whatever you like with it), and be refined into
> its own Hive table.
>
> >  I don’t want to have to create that chart and export one dataset from
> pageviews and one dataset from eventlogging to do that.
>  If you also design your schema nicely, it will be easily importable into
> Druid and usable in Pivot and Superset, alongside of pageviews.  We’re
> working on getting nice schemas automatically imported into druid.
>
>
>
>
> On Thu, Jan 18, 2018 at 11:16 AM, Nuria Ruiz  wrote:
>
>> Gergo,
>>
>> >while EventLogging data gets stored in a different, unrelated way
>> Not really, This has changed quite a bit as of the last two quarters.
>> Eventlogging data as of recent gets preprocessed and refined similar to how
>> webrequest data is preprocessed and refined. You can have a dashboard on
>> top of some eventlogging schemas on superset in the same way you have a
>> dashboard that displays pageview data on superset.
>>
>> See dashboards on superset (user required).
>>
>>
>> https://superset.wikimedia.org/superset/dashboard/7/?preselect_filters=%7B%7D
>>
>> And (again, user required) EL data on druid, this very same data we are
>> talking about, page previews:
>>
>> https://pivot.wikimedia.org/#tbayer_popups
>>
>>
>> >I was going to make the point that #2 already has a processing pipeline
>> established whereas #1 doesn't.
>> This is incorrect, we mark as "preview" data that we want to exclude
>> from processing, see:
>>
>> https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java#L144
>> Naming is unfortunate but previews are really "preloads" as in requests
>> we make (and cache locally) and maybe shown to users or not.
>>
>>
>> But again, tracking of events is better done on an event based system and
>> EL is such a system.
>>
>>
>>
>>
>>
>>
>


-- 
Olga Vasileva // Product Manager // Reading Web Team
https://wikimediafoundation.org/


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-18 Thread Andrew Otto
> the beacon puts the record into the webrequest table and from there it
would only take some trivial preprocessing
‘Trivial’ preprocessing that has to look through 150K requests per second!
This is a lot of work!

> tracking of events is better done on an event based system and EL is such
a system.
I agree with this too.  We really want to discourage people from trying to
measure things by searching through the huge haystack of all webrequests.
To measure something, you should emit an event if you can.  If it were
practical, I’d prefer that we did this for pageviews as well.  Currently,
we need a complicated definition of what a pageview is, which really only
exists in the Java implementation in the Hadoop cluster.  It’d be much
clearer if app developers had a way to define for themselves what counts as
a pageview, and emit that as an event.

This should be the approach that people take when they want to measure
something new.  Emit an event!  This event will get its own Kafka topic
(you can consume this to do whatever you like with it), and be refined into
its own Hive table.
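As a concrete illustration, "emit an event" for a preview that was actually shown might look like the sketch below. The schema name and field names here are hypothetical, not an existing EventLogging schema:

```python
import json
import time

# Hypothetical virtual-pageview event for a preview the reader actually saw.
# "VirtualPageView" and the field names are illustrative assumptions.

def make_preview_event(wiki, page_title, source_page_title):
    return {
        "schema": "VirtualPageView",                 # hypothetical schema
        "wiki": wiki,
        "event": {
            "page_title": page_title,                # page whose preview was seen
            "source_page_title": source_page_title,  # page the reader was on
            "dt": int(time.time()),                  # event time, epoch seconds
        },
    }

def serialize(event):
    # The client would POST (or navigator.sendBeacon) this payload to the
    # event intake; from there it gets its own Kafka topic and Hive table.
    return json.dumps(event, sort_keys=True)
```

The point is that the feature itself decides what counts as a seen preview and says so explicitly, rather than downstream jobs inferring it from request URLs.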

>  I don’t want to have to create that chart and export one dataset from
pageviews and one dataset from eventlogging to do that.
 If you also design your schema nicely, it will be easily importable into
Druid and usable in Pivot and Superset, alongside of pageviews.  We’re
working on getting nice schemas automatically imported into druid.




On Thu, Jan 18, 2018 at 11:16 AM, Nuria Ruiz  wrote:

> Gergo,
>
> >while EventLogging data gets stored in a different, unrelated way
> Not really, This has changed quite a bit as of the last two quarters.
> Eventlogging data as of recent gets preprocessed and refined similar to how
> webrequest data is preprocessed and refined. You can have a dashboard on
> top of some eventlogging schemas on superset in the same way you have a
> dashboard that displays pageview data on superset.
>
> See dashboards on superset (user required).
>
> https://superset.wikimedia.org/superset/dashboard/7/?preselect_filters=%7B%7D
>
> And (again, user required) EL data on druid, this very same data we are
> talking about, page previews:
>
> https://pivot.wikimedia.org/#tbayer_popups
>
>
> >I was going to make the point that #2 already has a processing pipeline
> established whereas #1 doesn't.
> This is incorrect, we mark as "preview" data that we want to exclude from
> processing, see:
> https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java#L144
> Naming is unfortunate but previews are really "preloads" as in requests we
> make (and cache locally) and maybe shown to users or not.
>
>
> But again, tracking of events is better done on an event based system and
> EL is such a system.
>
>
>
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-18 Thread Nuria Ruiz
Gergo,

>while EventLogging data gets stored in a different, unrelated way
Not really; this has changed quite a bit over the last two quarters.
EventLogging data now gets preprocessed and refined similarly to how
webrequest data is preprocessed and refined. You can have a dashboard on
top of some eventlogging schemas on superset in the same way you have a
dashboard that displays pageview data on superset.

See dashboards on superset (user required).

https://superset.wikimedia.org/superset/dashboard/7/?preselect_filters=%7B%7D

And (again, user required) EL data on druid, this very same data we are
talking about, page previews:

https://pivot.wikimedia.org/#tbayer_popups


>I was going to make the point that #2 already has a processing pipeline
established whereas #1 doesn't.
This is incorrect; we mark as "preview" data that we want to exclude from
processing, see:
https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java#L144
Naming is unfortunate, but previews are really "preloads": requests we
make (and cache locally) that may or may not be shown to users.
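The exclusion works roughly like the following sketch, a simplified Python rendering of the linked PageviewDefinition check; the x_analytics parsing here is deliberately minimal:

```python
# Simplified sketch of the preload exclusion: requests tagged preview=1 in
# the x_analytics header do not count as pageviews (cf. the linked
# PageviewDefinition.java). Real header parsing is more involved.

def counts_as_pageview(x_analytics):
    # x_analytics is a semicolon-delimited key=value list, e.g. "ns=0;preview=1"
    tags = dict(kv.split("=", 1) for kv in x_analytics.split(";") if "=" in kv)
    return tags.get("preview") != "1"
```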


But again, tracking of events is better done on an event based system and
EL is such a system.


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-18 Thread Sam Smith
On Wed, Jan 17, 2018 at 6:46 PM, Leila Zia  wrote:

> On Wed, Jan 17, 2018 at 1:51 AM, Sam Smith  wrote:
>
> > IMO #1 is preferable from the operations and performance perspectives as
> the
> > response is always served from the edge and includes very few headers,
> > whereas the request in #2 may be served by the application servers if the
> > user is logged in (or in the mobile site's beta cohort). However, the
> > requests in #2 are already
>
> It seems the sentence above is cut, can you resend it?
>

Hah! I should've cut the whole sentence.

I was going to make the point that #2 already has a processing pipeline
established whereas #1 doesn't. AIUI there'd have to be a refinement step
added to Oozie to process the requests in #1 whereas the requests in #2
make it into the webrequest table with the appropriate value in the
x_analytics column.

Thanks,

-Sam


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-17 Thread Gergo Tisza
On Wed, Jan 17, 2018 at 10:54 AM, Nuria Ruiz  wrote:

> Recording "preview_events" is really no different than recording any other
> kind of UI event, difference is going to come from scale if anything, as
> they are probably tens of thousands of those per second (I think your team
> already estimated volume, if so please send those estimates along)
>

Conceptually I think a virtual pageview is a different thing from a UI
event (which is how e.g. Google Analytics handles it, there is a method to
send an event for the current page and a different method to send a virtual
pageview for a different page), and the ideal way it is exposed in an
analytics system should be very different. (I would want to see virtual
pageviews together with normal pageviews, with some filtering option. If I
deploy code that shows previews and converts users from making real
pageviews to making virtual pageviews, I want to see how the total
pageviews changed in the normal pageview stats; I don't want to have to
create that chart and export one dataset from pageviews and one dataset
from eventlogging to do that. As a user, I want to see in the fileview API
how many people looked at the photo I uploaded; I don't particularly care
whether they used MediaViewer or not, etc.)

So maybe it's worth considering which approach takes us closer to that?
AIUI the beacon puts the record into the webrequest table and from there it
would only take some trivial preprocessing to replace the beacon URL with
the virtual URL and add the beacon type as a "virtual_type" field or
something, making it very easy to expose it everywhere where views are
tracked, while EventLogging data gets stored in a different, unrelated way.
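The preprocessing Gergo describes could be sketched like this; the beacon URL shape (`/beacon/view?target=...&type=...`) is a made-up example, not the actual beacon format:

```python
from urllib.parse import parse_qs, urlparse

# Sketch: rewrite a beacon webrequest row into a virtual-pageview row by
# swapping the beacon URL for the target page URL and recording the beacon
# type in a "virtual_type" field. The beacon URL shape is hypothetical.

def beacon_to_virtual_pageview(row):
    qs = parse_qs(urlparse(row["uri"]).query)
    out = dict(row)
    out["uri"] = qs["target"][0]                    # the page that was previewed
    out["virtual_type"] = qs.get("type", ["preview"])[0]
    return out
```

After this rewrite, virtual pageviews carry the same URL field as real ones and can flow through the existing pageview aggregation, filterable by virtual_type.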