Re: [Analytics] EventLogging blocked by ad blockers

2020-09-22 Thread Nuria Ruiz
Hello,

What are the problems you see with the beacon being blocked when it comes
to extracting value from data?

In most instances, what we look at when deriving insights are ratios. For
example: "of the people that saw the red link, how many clicked it?" In this
scenario, with adequate sample sizes, insights can be extracted without
any issues.
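To make that concrete, here is a minimal sketch with invented counts (the 30% blocking rate is purely an assumption) of why a ratio survives beacon blocking, as long as blocking is independent of the behaviour being measured:

```python
# Hypothetical counts, purely for illustration: suppose 30% of clients block
# the EventLogging beacon, independently of whether they click the red link.

def ctr(impressions: int, clicks: int) -> float:
    """Click-through ratio: fraction of impressions that led to a click."""
    return clicks / impressions

# True population behaviour (never fully observed once beacons are blocked).
true_impressions, true_clicks = 10_000, 1_200

# With 30% of clients blocking, only 70% of each event type gets logged.
logged_impressions = int(true_impressions * 0.7)
logged_clicks = int(true_clicks * 0.7)

# Absolute counts shrink, but the ratio we care about is unchanged.
assert ctr(logged_impressions, logged_clicks) == ctr(true_impressions, true_clicks)
```

The assumption doing the work is independence: if blocking correlated with clicking (say, power users both block beacons and click more), the ratio would be biased.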

>Is it reasonable to say that ad blockers should not be blocking
EventLogging (since it's just an internal logging system)?
Ad blockers prevent requests to beacons whether they are used for internal
stats or for something else (ad serving), so yes, it is pretty reasonable. A
beacon does not necessarily imply it is used for ads [1].


>If the answer to #1 is "yes", could we change the URL that EventLogging
uses so that it is no longer blacklisted by ad blockers
Any URL we use is likely to be blacklisted by blockers that use lists,
unless it is explicitly whitelisted. So a naming change would have only a
short-lived effect.

Another way to proceed is the opposite: whitelist the URL so it is not
blocked by the blockers that, like Adblock, rely on lists. That said, since
not all blockers work with lists (Privacy Badger, for example), this is by
no means a 100% guarantee that data would not be blocked.
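To make the list mechanics concrete, here is a toy sketch of how a list-based blocker decides. Real lists such as EasyPrivacy use a much richer filter syntax (anchors, wildcards, exception rules); plain substring matching and the allowlisted host below are assumptions made only to keep the example short:

```python
# Toy model of a list-based blocker: blacklist substrings, overridden by an
# allowlist. Both lists here are illustrative, not real filter-list contents.

BLOCK_PATTERNS = ["beacon/event", "/pixel.gif"]     # assumed blacklist entries
ALLOWED_HOSTS = ["intake-analytics.wikimedia.org"]  # hypothetical allowlist entry

def is_blocked(url: str) -> bool:
    # Allowlist (whitelist) entries override blocking rules.
    if any(host in url for host in ALLOWED_HOSTS):
        return False
    return any(pattern in url for pattern in BLOCK_PATTERNS)

# The current EventLogging URL matches the "beacon/event" entry...
assert is_blocked("https://meta.wikimedia.org/beacon/event?schema=Test")
# ...a renamed URL dodges the list only until the list is updated...
assert not is_blocked("https://meta.wikimedia.org/telemetry/event?schema=Test")
# ...while an explicit allowlist entry survives blacklist updates.
assert not is_blocked("https://intake-analytics.wikimedia.org/beacon/event")
```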

Thanks,

Nuria


[1] https://en.wikipedia.org/wiki/Web_beacon

On Tue, Sep 22, 2020 at 10:50 AM Ryan Kaldari 
wrote:

> As mentioned at T251464 ,
> EventLogging  is
> currently blocked by EasyPrivacy  (a popular add-on
> for ad blocking software) due to EventLogging sending its data to a URL
> that includes the blacklisted string "beacon/event". In some cases, this
> makes it difficult or impossible for us to get the analytics data we need
> to make product decisions, e.g. T240697
> . Two questions:
>
>1. Is it reasonable to say that ad blockers should not be blocking
>EventLogging (since it's just an internal logging system)?
>2. If the answer to #1 is "yes", could we change the URL that
>EventLogging uses so that it is no longer blacklisted by ad blockers?
>
>
> --
> *Ryan Kaldari* (they/them)
> Director of Engineering, Product Department
> Wikimedia Foundation 
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Translations in wikistats

2020-08-31 Thread Nuria Ruiz
Ruben:

>Can I help in any way with that review or is it something that depends on
other people?
Any registered user can contribute and review translations.

Thanks,

Nuria


On Mon, Aug 31, 2020 at 12:48 AM Rubén Ojeda 
wrote:

> Hello Nuria
>
> Thank you very much for the information. As far as I read on the page, the
> Spanish translation is 99%, but only 18% is reviewed. Can I help in any way
> with that review or is it something that depends on other people?
>
> Best,
>
> Rubén Ojeda de la Roza
> Project Manager
> *Wikimedia España*
>
> Tlf: +34 722 61 47 98
> rubenoj...@wikimedia.es
> @rubojeda <https://twitter.com/rubojeda>
> <http://wikimedia.es/>
>
>
>
>
> El vie., 28 ago. 2020 a las 19:45, Nuria Ruiz ()
> escribió:
>
>> Ruben:
>>
>> Thanks for your question about translations in wikistats (
>> http://stats.wikimedia.org). You can contribute translations to
>> wikistats via translate wiki.
>>
>> https://translatewiki.net/wiki/Translating:Wikistats_2.0
>>
>> I think on our end we need to do a bit better at making it obvious that
>> this is the case, so we will add a link to translatewiki to the footer of
>> wikistats. [1]
>>
>> Thanks,
>>
>>
>> Nuria
>>
>>
>> [1] https://phabricator.wikimedia.org/T261502
>>
>>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] Translations in wikistats

2020-08-28 Thread Nuria Ruiz
Ruben:

Thanks for your question about translations in wikistats (
http://stats.wikimedia.org). You can contribute translations to wikistats
via translate wiki.

https://translatewiki.net/wiki/Translating:Wikistats_2.0

I think on our end we need to do a bit better at making it obvious that this
is the case, so we will add a link to translatewiki to the footer of
wikistats. [1]

Thanks,


Nuria


[1] https://phabricator.wikimedia.org/T261502
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] nefarious bot/automated traffic analysis

2020-06-16 Thread Nuria Ruiz
Scott:

A good place to start reading about "bot spam" and its impact on the data is:
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/BotDetection
We recently released a new classification for traffic. Besides classifying
traffic as "user" or "spider", we now also have "automated", which tags
traffic from a number of entities (but not all) that can be described as
"high-volume spammers". You will probably have some questions after reading
the doc, and for those we can set up a meeting.
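As a purely illustrative sketch (this is NOT the classifier Wikimedia actually runs, and the request-rate threshold is invented), a rule-based three-way split in the spirit of the doc above might look like:

```python
# Illustrative only: self-reported crawlers become "spider", undisclosed
# high-volume clients become "automated", everything else stays "user".
# The threshold is an assumed cutoff, not a real production value.

RATE_THRESHOLD = 800  # requests/hour; invented for this sketch

def classify(user_agent: str, requests_last_hour: int) -> str:
    if any(tok in user_agent.lower() for tok in ("bot", "crawler", "spider")):
        return "spider"       # self-reported automation
    if requests_last_hour > RATE_THRESHOLD:
        return "automated"    # undisclosed high-volume traffic
    return "user"

assert classify("Googlebot/2.1 (+http://www.google.com/bot.html)", 50) == "spider"
assert classify("Mozilla/5.0 (X11; Linux x86_64)", 5_000) == "automated"
assert classify("Mozilla/5.0 (X11; Linux x86_64)", 12) == "user"
```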

Thanks,

Nuria





On Tue, Jun 16, 2020 at 9:55 AM Scott Bassett 
wrote:

> Hello Analytics Team-
>
> The Security Team has recently spent some cycles investigating improved
> anti-automation (bad bots, high-volume spammers, etc.) solutions,
> particularly around an improved Wikimedia captcha.  We were curious if your
> team has any methods or advice regarding the analysis of nefarious
> automated traffic within the context of raw web requests or any other
> relevant analytics data.  If the answer is "not really", that's fine.  But
> if there are some relevant tools, methods, research, etc. your team has
> performed that you would like to share with us, that would be much
> appreciated.  If it makes sense to discuss this further during a quick
> call, I can try to find some time for a few of us over the next couple of
> weeks.  We also have an extremely barebones task where we are attempting to
> document various methods of measurement which might be helpful:
> https://phabricator.wikimedia.org/T255208.
>
> Thanks,
>
> --
> Scott Bassett
> sbass...@wikimedia.org
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Clickstream: mobile vs. desktop, empty referrers

2020-06-09 Thread Nuria Ruiz
Hello,

See https://phabricator.wikimedia.org/T195880 for info on "none" referrers.

Thanks,

Nuria

On Tue, Jun 9, 2020 at 6:10 AM Joseph Allemandou 
wrote:

> Hi Robert
>
> From the `WHERE` clause here:
>
> https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-job/src/main/scala/org/wikimedia/analytics/refinery/job/ClickstreamBuilder.scala#L283
> you can see that we bundle together desktop, mobile-web and mobile-app
> rows.
>
> As for the reason of so many empty referrers, I don't know myself, but I
> know it's been a topic already discussed.
> IIRC Nuria and Jon Katz looked into it a bit, and more recently Isaac
> Johnson (all in direct copy).
>
> Cheers
> Joseph
>
> On Tue, Jun 9, 2020 at 9:03 AM Robert West  wrote:
>
>> Hi Analytics team,
>>
>> Quick question:
>> Does the Clickstream data
>>  lump
>> together *mobile and desktop?*
>> It seems to be hinted at here
>> , but
>> it's not mentioned explicitly. It just says that the 2015 data is for
>> desktop only, which seems to imply that after that it's desktop + mobile.
>>
>> Also, I was wondering if anyone has any insights into what might cause 
>> *referrers
>> to be empty?* I tried googling, but the issue is clouded in mystery and
>> seems to depend a lot on browser and website specificities. Any insights
>> (small or big) would be appreciated!
>>
>> Thanks a lot!
>> Bob
>
>
> --
> Joseph Allemandou (joal) (he / him)
> Sr Data Engineer
> Wikimedia Foundation
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] "automated" marker added to pageview data

2020-05-18 Thread Nuria Ruiz
Neil:

Some of the rules used to identify automated traffic have been used by the
community for a couple of years now. See for example [1] and [2]. For more
information you can always ping us.

Thanks,

Nuria

[1] https://tools.wmflabs.org/topviews/faq/#false_positive
[2] https://en.wikipedia.org/wiki/Wikipedia:2018_Top_50_Report#Exclusions



On Wed, May 13, 2020 at 7:44 AM Neil Shah-Quinn 
wrote:

> Nuria,
>
> Thank you for this update! I'm very excited about this new system.
>
> I did notice that there's not much explanation of the particular rules or
> strategies that are used to identify automated traffic, or a link to the
> implementing code. I can imagine this might be intentional, to make it
> harder for the spammers and vandals to evade the system. If so, it would be
> helpful to update the page to say that explicitly and explain how people
> can request more details if they have a legitimate need for them.
>
> On Tue, 5 May 2020 at 02:40, Nuria Ruiz  wrote:
>
>> Hello:
>>
>> We have added the 'automated' marker to Wikimedia's pageview data. Up to
>> now, pageview agents were classified as either 'spider' (self-reported
>> bots like Googlebot or Bingbot) or 'user'.
>>
>> We have known for a while that some requests classified as 'user' were,
>> in fact, coming from automated agents not disclosed as such. This was
>> well known to our community, which for a couple of years now has been
>> applying filtering rules to any "Top X" list compiled [1]. We have
>> incorporated some of these filters (and others) into our automated
>> traffic detection and, as of this week, traffic that meets the filtering
>> criteria is automatically excluded from being counted towards the "top"
>> lists reported by the pageview API.
>>
>> The effect of removing pageviews marked as 'automated' from the overall
>> user traffic is about a 5.6% reduction of pageviews labeled as "user" [2]
>> in the course of  a month. Not all projects are affected equally when it
>> comes to reduction of "user pageviews". The biggest effect is on English
>> Wikipedia (8-10%). However, projects like the Japanese Wikipedia are mildly
>> affected (< 1%).
>>
>> If you are curious as to what problems this type of traffic causes in the
>> data, this ticket for Hungarian Wikipedia is a good example of issues
>> inflicted by what we call "bot vandalism/bot spam":
>> https://phabricator.wikimedia.org/T237282
>>
>> Given the delicate nature of this data, we have worked for many months on
>> vetting the algorithms we are using. We would appreciate reports, via
>> Phabricator ticket, of any issues you might find.
>>
>> Thanks,
>>
>> Nuria
>>
>> [1] https://en.wikipedia.org/wiki/Wikipedia:2018_Top_50_Report#Exclusions
>> [2]
>> https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/BotDetection#Global_Impact_-_All_wikimedia_projects
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] "automated" marker added to pageview data

2020-05-04 Thread Nuria Ruiz
Hello:

We have added the 'automated' marker to Wikimedia's pageview data. Up to now,
pageview agents were classified as either 'spider' (self-reported bots like
Googlebot or Bingbot) or 'user'.

We have known for a while that some requests classified as 'user' were, in
fact, coming from automated agents not disclosed as such. This was well
known to our community, which for a couple of years now has been applying
filtering rules to any "Top X" list compiled [1]. We have incorporated some
of these filters (and others) into our automated traffic detection and, as
of this week, traffic that meets the filtering criteria is automatically
excluded from being counted towards the "top" lists reported by the
pageview API.

The effect of removing pageviews marked as 'automated' from the overall
user traffic is about a 5.6% reduction of pageviews labeled as "user" [2]
in the course of  a month. Not all projects are affected equally when it
comes to reduction of "user pageviews". The biggest effect is on English
Wikipedia (8-10%). However, projects like the Japanese Wikipedia are mildly
affected (< 1%).
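For readers wondering how per-project figures combine into the global one: the overall reduction is a pageview-weighted average. The traffic shares below are invented for illustration; only the per-project reduction ranges (8-10% for English Wikipedia, <1% for Japanese) come from the paragraph above.

```python
# The global figure is a pageview-weighted average of per-project reductions.
# Shares are assumed, not real Wikimedia traffic numbers.

projects = {
    # project: (assumed share of global "user" pageviews, fraction removed)
    "en.wikipedia": (0.45, 0.09),
    "ja.wikipedia": (0.08, 0.005),
    "everything_else": (0.47, 0.032),
}

global_reduction = sum(share * removed for share, removed in projects.values())
# With these assumed shares, the weighted average lands near the reported ~5.6%.
assert abs(global_reduction - 0.056) < 0.001
```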

If you are curious as to what problems this type of traffic causes in the
data, this ticket for Hungarian Wikipedia is a good example of issues
inflicted by what we call "bot vandalism/bot spam":
https://phabricator.wikimedia.org/T237282

Given the delicate nature of this data, we have worked for many months on
vetting the algorithms we are using. We would appreciate reports, via
Phabricator ticket, of any issues you might find.

Thanks,

Nuria

[1] https://en.wikipedia.org/wiki/Wikipedia:2018_Top_50_Report#Exclusions
[2]
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/BotDetection#Global_Impact_-_All_wikimedia_projects
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Research-Internal] Kerberos ticket expiry, Jupyterhub on stat1004/1006 and new memory/cpu limits for stat/notebook hosts

2020-03-12 Thread Nuria Ruiz
Hello,

>We deployed jupyterhub on stat1004 and stat1006,
So we are all clear on what this implies: disk space constraints in Jupyter
notebooks are no longer an issue, since the stat machines have much more
disk available than the notebook hosts. That being said, the answer to
larger workloads on Jupyter is to run them in Hadoop rather than locally.
We have done some work to facilitate running distributed jobs from Jupyter
in Hadoop, and that work will continue next quarter.




On Thu, Mar 12, 2020 at 11:29 AM Luca Toscano 
wrote:

> Hi everybody,
>
> some news from the Analytics team:
>
> - The Kerberos ticket expiry time has been bumped to 48h. You can
> kdestroy/kinit to get the new settings :)
>
> - There are new memory and cpu limits on all stat/notebook hosts, that
> should automatically kill big jobs that cause too much memory pressure. CPU
> cores are also limited to 90% of the available ones to leave space for
> system daemons. This should help a lot in avoiding recurrent alarms to the
> SRE team (and me reaching out to some of you as consequence!) and it should
> be a more fair system for everybody. In order to apply these new settings
> I'd need to shutdown/start all the notebooks running on notebook1003/1004,
> but I didn't do it since I didn't want to impact any work. If you could
> please take care of stopping/starting your notebooks it would be really
> appreciated :)
>
> - We deployed jupyterhub on stat1004 and stat1006, ready for general use!
> This should help in avoiding the small home size problem that many of you
> are experiencing on notebook1003/1004. We are also working on setting up
> jupyterhub on stat1005, with updated dependencies (jupyterhub 1.1.0, toree
> 0.3.0, etc.. full list in
> https://gerrit.wikimedia.org/r/#/c/analytics/jupyterhub/deploy/+/577761/1/frozen-requirements.txt).
> The plan is to eventually have the same version on all stat boxes (no
> timeline yet). We didn't deploy jupyterhub on stat1007 due to some puppet
> code refactoring in progress, but we hope to do it next quarter.
>
> - A new stat host (stat1008) will be ready for general use soon. It hosts
> a GPU like stat1005.
>
> If you have questions/doubts/etc.. please feel free to follow up with me
> or any member of the Analytics team on #wikimedia-analytics :)
>
> Luca (on behalf of the Analytics team)
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] SparkContext stopped and cannot be restarted

2020-02-25 Thread Nuria Ruiz
Hello:

Following up on this issue: we think many of Neil's issues come from the
fact that a Kerberos ticket expires after 24 hours, and once it does, your
Spark session no longer works. We will be extending ticket expiration
somewhat, to 2-3 days, but the main point to take home is that Jupyter
notebooks do not live forever in the state you leave them in; a restart of
the kernel might be needed.
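The failure mode can be sketched like this: a credential obtained once at login stops working after its lifetime elapses, so a notebook session that outlives the ticket must re-authenticate (kinit) and may need a kernel restart. The 24h lifetime matches the pre-change value from this thread; the timestamps are illustrative.

```python
# Sketch of ticket-lifetime expiry. Not the real Kerberos machinery, just the
# time arithmetic that makes long-lived notebook sessions go stale.

from datetime import datetime, timedelta

TICKET_LIFETIME = timedelta(hours=24)  # pre-change expiry from this thread

def ticket_is_valid(issued_at: datetime, now: datetime) -> bool:
    return now - issued_at < TICKET_LIFETIME

issued = datetime(2020, 2, 24, 9, 0)
assert ticket_is_valid(issued, datetime(2020, 2, 24, 18, 0))      # same day: OK
assert not ticket_is_valid(issued, datetime(2020, 2, 25, 10, 0))  # >24h: expired
```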

Please take a look at ticket:
https://phabricator.wikimedia.org/T246132

If anybody has been having similar problems, please chime in.

Thanks,

Nuria

On Thu, Feb 20, 2020 at 2:27 AM Luca Toscano  wrote:

> Hi Neil,
>
> I added the Analytics tag to https://phabricator.wikimedia.org/T245097,
> and also thanks for filing https://phabricator.wikimedia.org/T245713. We
> periodically review tasks in our incoming queue, so we should be able to
> help soon, but it may depend on priorities.
>
> Luca
>
> Il giorno gio 20 feb 2020 alle ore 06:21 Neil Shah-Quinn <
> nshahqu...@wikimedia.org> ha scritto:
>
>> Another update: I'm continuing to encounter these Spark errors and have
>> trouble recovering from them, even when I use proper settings. I've filed
>> T245713  to discuss this
>> further. The specific errors and behavior I'm seeing (for example, whether
>> explicitly calling session.stop allows a new functioning session to be
>> created) are not consistent, so I'm still trying to make sense of it.
>>
>> I would greatly appreciate any input or help, even if it's identifying
>> places where my description doesn't make sense.
>> 
>> 
>>
>> On Wed, 19 Feb 2020 at 13:35, Neil Shah-Quinn 
>> wrote:
>>
>>> Bump!
>>>
>>> Analytics team, I'm eager to have input from y'all about the best Spark
>>> settings to use.
>>>
>>> On Fri, 14 Feb 2020 at 18:30, Neil Shah-Quinn 
>>> wrote:
>>>
 I ran into this problem again, and I found that neither session.stop nor
 newSession got rid of the error. So it's still not clear how to recover
 from a crashed(?) Spark session.

 On the other hand, I did figure out why my sessions were crashing in
 the first place, so hopefully recovering from that will be a rare need. The
 reason is that wmfdata doesn't modify
 
 the default Spark when it starts a new session, which was (for example)
 causing it to start executors with only ~400 MiB of memory each.

 I'm definitely going to change that, but it's not completely clear what
 the recommended settings for our cluster are. I cataloged the different
 recommendations at https://phabricator.wikimedia.org/T245097, and it
 would very helpful if one of y'all could give some clear recommendations
 about what the settings should be for local SWAP, YARN, and "large" YARN
 jobs. For example, is it important to increase spark.sql.shuffle.partitions
 for YARN jobs? Is it reasonable to use 8 GiB of driver memory for a local
 job when the SWAP servers only have 64 GiB total?

 Thank you!




 On Fri, 7 Feb 2020 at 06:53, Andrew Otto  wrote:

> Hm, interesting!  I don't think many of us have used
> SparkSession.builder.getOrCreate repeatedly in the same process.
> What happens if you manually stop the spark session first, (
> session.stop()
> ?)
> or maybe try to explicitly create a new session via newSession()
> 
> ?
>
> On Thu, Feb 6, 2020 at 7:31 PM Neil Shah-Quinn <
> nshahqu...@wikimedia.org> wrote:
>
>> Hi Luca!
>>
>> Those were separate Yarn jobs I started later. When I got this error,
>> I found that the Yarn job corresponding to the SparkContext was marked as
>> "successful", but I still couldn't get SparkSession.builder.getOrCreate 
>> to
>> open a new one.
>>
>> Any idea what might have caused that or how I could recover without
>> restarting the notebook, which could mean losing a lot of in-progress 
>> work?
>> I had already restarted that kernel so I don't know if I'll encounter 
>> this
>> problem again. If I do, I'll file a task.
>>
>> On Wed, 5 Feb 2020 at 23:24, Luca Toscano 
>> wrote:
>>
>>> Hey Neil,
>>>
>>> there were two Yarn jobs running related to your notebooks, I just
>>> killed them, let's see if it solves the problem (you might need to 
>>> restart
>>> again your notebook). If not, let's open a task and investigate :)
>>>
>>> Luca
>>>
>>> Il giorno gio 6 feb 2020 alle ore 02:08 Neil Shah-Quinn <
>>> nshahqu...@wikimedia.org> ha 

Re: [Analytics] Announcement - Mediawiki History Dumps

2020-02-17 Thread Nuria Ruiz
Hello,

We have added a footer to dumps pages with the CC-0 note. Please see:
https://dumps.wikimedia.org/other/analytics/

For other changes that you think are needed please do file a phab ticket.

Thanks,

Nuria

On Tue, Feb 11, 2020 at 2:50 PM Nuria Ruiz  wrote:

> Regarding Licensing, there is already a ticket:
> https://phabricator.wikimedia.org/T244685
>
> If you take a look at the bottom of wikistats (https://stats.wikimedia.org/v2)
> you will see that the dedication is CC0. The data in both systems is the
> same but, of course, it can be made more explicit.
>
> Thanks,
>
> Nuria
>
>
>
> On Tue, Feb 11, 2020 at 12:48 PM Leila Zia  wrote:
>
>> Hi Joseph and team,
>>
>> summary: congratulations and some suggestions/requests.
>>
>> I second and third Nate and Neil. Congratulations on meeting this
>> milestone. This effort can empower the research community to spend
>> less time on joining datasets and trying to resolve existing, known
>> (to some) and complex issues with mediawiki history data and instead
>> spend time doing the research. Nice! :)
>>
>> I'm eager to see what the dataset(s) will be used for by others. On my
>> end, I am looking forward to seeing more research on how Wiki(m|p)edia
>> projects have evolved over the past almost 2 decades now that this
>> data is more readily available for studying. What we learn from the
>> Wikimedia projects and their evolution can be helpful in understanding
>> the broader web ecosystem and its evolution as well (as the Web is
>> only 30 years old now).
>>
>> I have some requests if I may:
>>
>> * Pine brings up a good point about licenses. It would be great to
>> make that clear in the documentation page(s). There are many examples
>> of this (that you know better than I), just in case, I find the
>> License section of
>> https://iccl.inf.tu-dresden.de/web/Wikidata/Maps-06-2015/en
>> informative, for example.
>>
>> * The other request I have is that you make the template for citing
>> this data-set clear to the end-user in your documentation pages
>> (including readme). You can do this in a few different ways:
>>
>> ** In the documentation pages, put a suggested citation link. For
>> example (for bibtex):
>>
>> @misc{wmfanalytics2020mediawikihistory,
>>   title = {MediaWiki History},
>>   author = {nameoftheauthors},
>>   howpublished = "\url{https://dumps.wikimedia.org/other/mediawiki_history/}",
>>   note = {Accessed on date x},
>>   year = {2020}
>> }
>>
>> ** Upload a paper about the work on arxiv.org. This way, your work
>> gets a DOI that you can use in your documentation pages for folks to
>> use for citation. Note that this step can be relatively light-weight.
>> (no peer-review in this case and it's relatively quick.)
>>
>> ** Submit the paper to a conference. Some conferences have a data-set
>> paper track where you publish about the dataset you release. Research
>> is happy to support you with guidance if you need it and if you choose
>> to go down this path. This takes some more time and in return it will
>> give you a "peer-review" stamp and more experience in publishing if
>> you like that.
>>
>> Unless you like publishing your work in a peer-reviewed venue, I
>> suggest one of the first two approaches.
>>
>> * I'm not sure if you intend to make the dataset more discoverable
>> through places such as https://datasetsearch.research.google.com/ .
>> You may want to consider that.
>>
>> Thanks,
>> Leila
>>
>> --
>> Leila Zia
>> Head of Research
>> Wikimedia Foundation
>>
>> On Mon, Feb 10, 2020 at 9:28 PM Pine W  wrote:
>> >
>> > I was thinking about the licensing issue some more. Apparently there
>> > was a relevant United States court case regarding metadata several
>> > years ago in the United States, but it's unclear to me from my brief
>> > web search whether this holding would apply to metadata from every
>> > nation. Also, I don't know if the underlying statues have changed
>> > since the time of that ruling. I think that WMF Legal should be
>> > consulted regarding the copyright status of the metadata. Also, I
>> > think that the licensing of metadata should be explicitly addressed in
>> > the Terms of Use or a similar document which is easily accessible to
>> > all contributors to Wikimedia sites.
>> >
>> > Pine
>> > ( https://meta.wikimedia.org/wiki/User:Pine )
>> >
>> > On Tue, Feb 11, 2020 at 12:17 AM Pine W  wrote

Re: [Analytics] Announcement - Mediawiki History Dumps

2020-02-11 Thread Nuria Ruiz
Regarding Licensing, there is already a ticket:
https://phabricator.wikimedia.org/T244685

If you take a look at the bottom of wikistats (https://stats.wikimedia.org/v2)
you will see that the dedication is CC0. The data in both systems is the
same but, of course, it can be made more explicit.

Thanks,

Nuria



On Tue, Feb 11, 2020 at 12:48 PM Leila Zia  wrote:

> Hi Joseph and team,
>
> summary: congratulations and some suggestions/requests.
>
> I second and third Nate and Neil. Congratulations on meeting this
> milestone. This effort can empower the research community to spend
> less time on joining datasets and trying to resolve existing, known
> (to some) and complex issues with mediawiki history data and instead
> spend time doing the research. Nice! :)
>
> I'm eager to see what the dataset(s) will be used for by others. On my
> end, I am looking forward to seeing more research on how Wiki(m|p)edia
> projects have evolved over the past almost 2 decades now that this
> data is more readily available for studying. What we learn from the
> Wikimedia projects and their evolution can be helpful in understanding
> the broader web ecosystem and its evolution as well (as the Web is
> only 30 years old now).
>
> I have some requests if I may:
>
> * Pine brings up a good point about licenses. It would be great to
> make that clear in the documentation page(s). There are many examples
> of this (that you know better than I), just in case, I find the
> License section of
> https://iccl.inf.tu-dresden.de/web/Wikidata/Maps-06-2015/en
> informative, for example.
>
> * The other request I have is that you make the template for citing
> this data-set clear to the end-user in your documentation pages
> (including readme). You can do this in a few different ways:
>
> ** In the documentation pages, put a suggested citation link. For
> example (for bibtex):
>
> @misc{wmfanalytics2020mediawikihistory,
>   title = {MediaWiki History},
>   author = {nameoftheauthors},
>   howpublished = "\url{https://dumps.wikimedia.org/other/mediawiki_history/}",
>   note = {Accessed on date x},
>   year = {2020}
> }
>
> ** Upload a paper about the work on arxiv.org. This way, your work
> gets a DOI that you can use in your documentation pages for folks to
> use for citation. Note that this step can be relatively light-weight.
> (no peer-review in this case and it's relatively quick.)
>
> ** Submit the paper to a conference. Some conferences have a data-set
> paper track where you publish about the dataset you release. Research
> is happy to support you with guidance if you need it and if you choose
> to go down this path. This takes some more time and in return it will
> give you a "peer-review" stamp and more experience in publishing if
> you like that.
>
> Unless you like publishing your work in a peer-reviewed venue, I
> suggest one of the first two approaches.
>
> * I'm not sure if you intend to make the dataset more discoverable
> through places such as https://datasetsearch.research.google.com/ .
> You may want to consider that.
>
> Thanks,
> Leila
>
> --
> Leila Zia
> Head of Research
> Wikimedia Foundation
>
> On Mon, Feb 10, 2020 at 9:28 PM Pine W  wrote:
> >
> > I was thinking about the licensing issue some more. Apparently there
> > was a relevant United States court case regarding metadata several
> > years ago in the United States, but it's unclear to me from my brief
> > web search whether this holding would apply to metadata from every
> > nation. Also, I don't know if the underlying statues have changed
> > since the time of that ruling. I think that WMF Legal should be
> > consulted regarding the copyright status of the metadata. Also, I
> > think that the licensing of metadata should be explicitly addressed in
> > the Terms of Use or a similar document which is easily accessible to
> > all contributors to Wikimedia sites.
> >
> > Pine
> > ( https://meta.wikimedia.org/wiki/User:Pine )
> >
> > On Tue, Feb 11, 2020 at 12:17 AM Pine W  wrote:
> > >
> > > Hi Joseph,
> > >
> > > Thanks for this announcement.
> > >
> > > I am looking for license information regarding the dumps, and I'm not
> > > finding it in the pages that you linked at [1] or [2]. The license
> > > that applies to text on Wikimedia sites is often CC-BY-SA 3.0, and the
> > > WMF Terms of Use at https://foundation.wikimedia.org/wiki/Terms_of_Use
> > > do not appear to provide any exception for metadata. In the absence of
> > > a specific license, I think that the CC-BY-SA or other relevant
> > > licenses would apply to the metadata, and that the licensing
> > > information should be prominently included on relevant pages and in
> > > the dumps themselves.
> > >
> > > What do you think?
> > >
> > > Pine
> > > ( https://meta.wikimedia.org/wiki/User:Pine )
> > >
> > > On Mon, Feb 10, 2020 at 4:28 PM Joseph Allemandou
> > >  wrote:
> > > >
> > > > Hi Analytics People,
> > > >
> > > > The Wikimedia Analytics Team is pleased to announce the release of
> the most complete 

Re: [Analytics] SparkContext stopped and cannot be restarted

2020-02-07 Thread Nuria Ruiz
> and the verdict (supported by you) was that we should use this list or
the public IRC channel.
Indeed, eh? I suggest we revisit that and send questions to
analytics-internal, but if others disagree, I am fine with either.


On Fri, Feb 7, 2020 at 12:17 PM Neil Shah-Quinn 
wrote:

> Good suggestions, Andrew! I'll try those if I encounter this again.
>
> Nuria, we had a discussion about the appropriate places to ask questions
> about internal systems in October 2018, and the verdict (supported by you)
> was that we should use this list or the public IRC channel.
>
> If you want to revisit that decision, I'd suggest you consult that thread
> first (the subject was "Where to ask questions about internal analytics
> tools") because I included a detailed list of pros and cons of different
> channels to start the discussion. In that list, I even mentioned that such
> discussions on this channel could annoy subscribers who don't have access
> to these systems 
>
> If you still want us to use a different list, we can certainly do that. If
> so, please send my team a message and update the docs I added
> <https://wikitech.wikimedia.org/wiki/Analytics#Contact> so it stays clear.
>
> On Fri, 7 Feb 2020 at 07:48, Nuria Ruiz  wrote:
>
>> Hello,
>>
>> Probably this discussion is not of wide interest to this public list; I
>> suggest we move it to analytics-internal?
>>
>> Thanks,
>>
>> Nuria
>>
>> On Fri, Feb 7, 2020 at 6:53 AM Andrew Otto  wrote:
>>
>>> Hm, interesting!  I don't think many of us have used
>>> SparkSession.builder.getOrCreate repeatedly in the same process.  What
>>> happens if you manually stop the spark session first, (session.stop()
>>> <https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=sparksession#pyspark.sql.SparkSession.stop>?)
>>> or maybe try to explicitly create a new session via newSession()
>>> <https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=sparksession#pyspark.sql.SparkSession.newSession>
>>> ?
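
[Editor's note: the pitfall Andrew points at is that getOrCreate returns the
existing active session, so an explicit stop() is needed before a genuinely
new one can be built. The StubSession class below stands in for pyspark's
SparkSession purely for illustration; with real pyspark the recovery would
be session.stop() followed by SparkSession.builder.getOrCreate():]

```python
class StubSession:
    """Stands in for pyspark.sql.SparkSession in this sketch."""
    _active = None  # mirrors Spark's notion of the active session

    def __init__(self):
        self.stopped = False

    def stop(self):
        # Mark the session dead and clear the active-session slot,
        # so the next get_or_create() builds a fresh one.
        self.stopped = True
        StubSession._active = None

    @classmethod
    def get_or_create(cls):
        # Like SparkSession.builder.getOrCreate(): reuse the active
        # session if there is one, otherwise create a new one.
        if cls._active is None:
            cls._active = cls()
        return cls._active


def restart_session(old):
    """Stop the (possibly already dead) session, then build a fresh one."""
    if old is not None:
        old.stop()
    return StubSession.get_or_create()
```

Without the explicit stop(), get_or_create() keeps handing back the same
object, which matches the "stopped SparkContext" symptom described in the
thread.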
>>>
>>> On Thu, Feb 6, 2020 at 7:31 PM Neil Shah-Quinn 
>>> wrote:
>>>
>>>> Hi Luca!
>>>>
>>>> Those were separate Yarn jobs I started later. When I got this error, I
>>>> found that the Yarn job corresponding to the SparkContext was marked as
>>>> "successful", but I still couldn't get SparkSession.builder.getOrCreate to
>>>> open a new one.
>>>>
>>>> Any idea what might have caused that or how I could recover without
>>>> restarting the notebook, which could mean losing a lot of in-progress work?
>>>> I had already restarted that kernel so I don't know if I'll encounter this
>>>> problem again. If I do, I'll file a task.
>>>>
>>>> On Wed, 5 Feb 2020 at 23:24, Luca Toscano 
>>>> wrote:
>>>>
>>>>> Hey Neil,
>>>>>
>>>>> there were two Yarn jobs running related to your notebooks, I just
>>>>> killed them, let's see if it solves the problem (you might need to restart
>>>>> again your notebook). If not, let's open a task and investigate :)
>>>>>
>>>>> Luca
>>>>>
>>>>> Il giorno gio 6 feb 2020 alle ore 02:08 Neil Shah-Quinn <
>>>>> nshahqu...@wikimedia.org> ha scritto:
>>>>>
>>>>>> Whoa—I just got the same stopped SparkContext error on the query even
>>>>>> after restarting the notebook, without an intermediate Java heap space
>>>>>> error. That seems very strange to me.
>>>>>>
>>>>>> On Wed, 5 Feb 2020 at 16:09, Neil Shah-Quinn <
>>>>>> nshahqu...@wikimedia.org> wrote:
>>>>>>
>>>>>>> Hey there!
>>>>>>>
>>>>>>> I was running SQL queries via PySpark (using the wmfdata package
>>>>>>> <https://github.com/neilpquinn/wmfdata/blob/master/wmfdata/hive.py>)
>>>>>>> on SWAP when one of my queries failed with "java.lang.OutOfMemoryError:
>>>>>>> Java heap space".
>>>>>>>
>>>>>>> After that, when I tried to call the spark.sql function again (via
>>>>>>> wmfdata.hive.run), it failed with "java.lang.IllegalStateException: 
>>>>>>> Cannot
>>>>>>> call methods on a stopped SparkContext."
>>>>>>>
>>>>>>> When I tried to create a new Spark c

Re: [Analytics] SparkContext stopped and cannot be restarted

2020-02-07 Thread Nuria Ruiz
Hello,

This discussion is probably not of wide interest to this public list;
shall we move it to analytics-internal?

Thanks,

Nuria

On Fri, Feb 7, 2020 at 6:53 AM Andrew Otto  wrote:

> Hm, interesting!  I don't think many of us have used 
> SparkSession.builder.getOrCreate
> repeatedly in the same process.  What happens if you manually stop the
> spark session first (session.stop()?)
> or maybe try to explicitly create a new session via newSession()?
>
> On Thu, Feb 6, 2020 at 7:31 PM Neil Shah-Quinn 
> wrote:
>
>> Hi Luca!
>>
>> Those were separate Yarn jobs I started later. When I got this error, I
>> found that the Yarn job corresponding to the SparkContext was marked as
>> "successful", but I still couldn't get SparkSession.builder.getOrCreate to
>> open a new one.
>>
>> Any idea what might have caused that or how I could recover without
>> restarting the notebook, which could mean losing a lot of in-progress work?
>> I had already restarted that kernel so I don't know if I'll encounter this
>> problem again. If I do, I'll file a task.
>>
>> On Wed, 5 Feb 2020 at 23:24, Luca Toscano  wrote:
>>
>>> Hey Neil,
>>>
>>> there were two Yarn jobs running related to your notebooks, I just
>>> killed them, let's see if it solves the problem (you might need to restart
>>> again your notebook). If not, let's open a task and investigate :)
>>>
>>> Luca
>>>
>>> Il giorno gio 6 feb 2020 alle ore 02:08 Neil Shah-Quinn <
>>> nshahqu...@wikimedia.org> ha scritto:
>>>
 Whoa—I just got the same stopped SparkContext error on the query even
 after restarting the notebook, without an intermediate Java heap space
 error. That seems very strange to me.

 On Wed, 5 Feb 2020 at 16:09, Neil Shah-Quinn 
 wrote:

> Hey there!
>
> I was running SQL queries via PySpark (using the wmfdata package
> )
> on SWAP when one of my queries failed with "java.lang.OutOfMemoryError:
> Java heap space".
>
> After that, when I tried to call the spark.sql function again (via
> wmfdata.hive.run), it failed with "java.lang.IllegalStateException: Cannot
> call methods on a stopped SparkContext."
>
> When I tried to create a new Spark context using
> SparkSession.builder.getOrCreate (whether using wmfdata.spark.get_session
> or directly), it returned a SparkSession object properly, but calling the
> object's sql function still gave the "stopped SparkContext" error.
>
> Any idea what's going on? I assume restarting the notebook kernel
> would take care of the problem, but it seems like there has to be a better
> way to recover.
>
> Thank you!
>
>
> ___
 Analytics mailing list
 Analytics@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/analytics

>>> ___
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Hourly projectviews by country

2020-01-13 Thread Nuria Ruiz
>Is there any way I can get an hourly time series of which countries are
viewing which Wikipedias? Even a (country x project) resolution summary of
average views
> for the 24 hours of the day would be helpful, if that data exists
anywhere.

The public data that exists in this regard is aggregated pageviews per
project per country. Please see:
https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews#Pageviews_split_by_country

Due to privacy concerns, pageviews per country per article are not released
at this time. We plan to release those eventually, but it will not happen
in the near term.
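
[Editor's note: the per-country split is served by the public AQS REST API.
A small sketch of building the request URL, assuming the endpoint path
documented on the wikitech page linked above:]

```python
AQS = "https://wikimedia.org/api/rest_v1/metrics/pageviews"


def top_by_country_url(project, year, month, access="all-access"):
    """URL for one month of pageviews split by country for a project.

    Country-level counts in this dataset are bucketed for privacy.
    `project` is e.g. "az.wikipedia"; `month` is a 1-12 integer.
    """
    return f"{AQS}/top-by-country/{project}/{access}/{year}/{month:02d}"
```

Fetching the URL (for example with urllib or requests) returns JSON with
one entry per country for that month.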

Thanks,

Nuria


On Mon, Jan 13, 2020 at 9:20 AM Dakota Killpack 
wrote:

> Hi all,
>
> I'm attempting to do some research into how different cultures consume
> information. I'm focusing specifically on how this varies by time of day
> and time of year. I had the idea of using the Wikipedia projectviews data
> as a proxy for overall information, since Wikipedia is usually the first or
> second search result for most interesting bits of information from pop
> culture to geopolitics to science. Unfortunately, after looking at WiViVi,
> it seems like my naive assumption of separating out Wikipedias by language
> doesn't actually resolve that cleanly into countries. Since I'm
> particularly interested in the effects of seasonality (e.g. different
> academic calendars and holidays across countries, different lunchtimes
> between northern and southern European countries in the same timezones), I
> can't make the assumption that the % of traffic to a project from each
> country is constant.
>
> Is there any way I can get an hourly time series of which countries are
> viewing which Wikipedias? Even a (country x project) resolution summary of
> average views for the 24 hours of the day would be helpful, if that data
> exists anywhere.
>
> Thanks!
>
> Dakota Killpack
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Wiki-research-l] Active meta users v active wikimedia users

2020-01-06 Thread Nuria Ruiz
>I was looking to try and work out what percent of the active Wikimedia
community are participating on meta and comparing to another wiki farm. Any
thoughts on that?
I think it would help to give a bit of an example of why you are looking to
find this information and why it is important. Participating in a wiki
includes other things besides edits (translations of, or software commits
to, features used on that one wiki, for example), so a precise comparison
of one wiki ecosystem to the rest is quite a task.

On Mon, Jan 6, 2020 at 9:58 PM Jonathan Morgan 
wrote:

> (Last reply to both lists; sorry for the spam)
>
> This sounds like it'd be a bit of work to build, and I don't think there
> are curated datasets to help out. I think you would need to...
>
> 1. get the count of active editors on Meta for [PERIOD OF TIME]. Easy.
> 2. perform a query or parse dumps to get the *list* of active editors
> from every individual Wikimedia project for the same [PERIOD OF TIME]. Hard.
> 3. de-duplicate that list (since many people edit multiple wikis in a
> given month, say, and you don't want to overcount). Pretty easy.
> 4. compare the resulting all-projects count with the Meta-only count. Easy.
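
[Editor's note: the four steps above boil down to a set union followed by a
ratio. A minimal sketch with made-up editor lists standing in for the
per-wiki query results (wiki names and users are illustrative only):]

```python
def global_active_editors(per_wiki_editors):
    """De-duplicate active-editor lists gathered per wiki (steps 2-3)."""
    union = set()
    for editors in per_wiki_editors.values():
        union.update(editors)
    return union


# Made-up sample data: the same person is often active on several wikis.
sample = {
    "metawiki": {"Alice", "Bob"},
    "enwiki": {"Alice", "Carol"},
    "dewiki": {"Bob", "Dave"},
}
meta_count = len(sample["metawiki"])                # step 1
global_count = len(global_active_editors(sample))   # steps 2-3
share = meta_count / global_count                   # step 4, here 0.5
```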
>
> This sounds like a lot of work to me! Again, there might be tools or
> resources for this that already exist, but I'm not aware of them.
>
> It seems like having topline/platform-level counts for active editors
> could be useful, as a dashboard or a public dataset. You might try requesting
> this as a feature
> 
> for WikiStats. The worst they can say is "no", or "not yet" :)
>
> - J
>
>
> On Mon, Jan 6, 2020 at 12:34 PM RhinosF1 -  wrote:
>
>> Hi,
>>
>> I’ve just seen the replies and thanks to everyone whose replied.
>>
>> I was looking to try and work out what percent of the active Wikimedia
>> community are participating on meta and comparing to another wiki farm.
>> Any
>> thoughts on that?
>>
>> RhinosF1
>>
>> On Mon, 6 Jan 2020 at 20:31, Aaron Halfaker 
>> wrote:
>>
>> > It doesn't look like Active Editors works for all wikis.  I think you'd
>> > have to merge activity across all wikis to get a stat like that. I'm not
>> > sure I know of a good data strategy to get that.
>> >
>> > If you were to query it with quarry, you'd need to write a query for
>> every
>> > wiki and then write some code to merge the results.  Oof.
>> >
>> > If you to extract it from the XML dumps, you'd need to process each Wiki
>> > separately and then merge the results.  Oof.
>> >
>> > The best solution to this is to have a common table/relation across all
>> > Wikis and to aggregate from there.  I don't think there's any such
>> > cross-wiki table/relation available.
>> >
>> > On Mon, Jan 6, 2020 at 1:38 PM Jonathan Morgan 
>> > wrote:
>> >
>> > > Same dashboard, but for "All wikis":
>> > > https://stats.wikimedia.org/v2/#/all-projects
>> > >
>> > > That work?
>> > >
>> > > - J
>> > >
>> > > On Mon, Jan 6, 2020 at 11:32 AM RhinosF1 - 
>> wrote:
>> > >
>> > > > Hi,
>> > > >
>> > > > That provides active users for meta but not globally. Anything for
>> > > global?
>> > > >
>> > > > RhinosF1
>> > > >
>> > > > On Mon, 6 Jan 2020 at 18:10, Jonathan Morgan > >
>> > > > wrote:
>> > > >
>> > > > > RhinosF1,
>> > > > >
>> > > > > Are you looking for information like this
>> > > > > , or
>> something
>> > > > > different?
>> > > > >
>> > > > > - J
>> > > > >
>> > > > > On Mon, Jan 6, 2020 at 8:51 AM RhinosF1 - 
>> > wrote:
>> > > > >
>> > > > > > Hi,
>> > > > > >
>> > > > > > Does anyone know a way to find out how many  wikimedia users are
>> > > active
>> > > > > > globally compared to active on metawiki?
>> > > > > >
>> > > > > > This mean they've made more than 5 edits in the last 30 days for
>> > > this.
>> > > > > >
>> > > > > > Thanks,
>> > > > > > RhinosF1
>> > > > > > ___
>> > > > > > Analytics mailing list
>> > > > > > Analytics@lists.wikimedia.org
>> > > > > > https://lists.wikimedia.org/mailman/listinfo/analytics
>> > > > > >
>> > > > >
>> > > > >
>> > > > > --
>> > > > > Jonathan T. Morgan
>> > > > > Senior Design Researcher
>> > > > > Wikimedia Foundation
>> > > > > User:Jmorgan (WMF) <
>> > https://meta.wikimedia.org/wiki/User:Jmorgan_(WMF)
>> > > >
>> > > > > (Uses He/Him)
>> > > > > ___
>> > > > > Wiki-research-l mailing list
>> > > > > wiki-researc...@lists.wikimedia.org
>> > > > > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>> > > > >
>> > > > ___
>> > > > Wiki-research-l mailing list
>> > > > wiki-researc...@lists.wikimedia.org
>> > > > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>> > > >
>> > >
>> > >
>> > > --
>> > > Jonathan T. Morgan
>> > > Senior Design Researcher
>> > > Wikimedia Foundation
>> > > 

Re: [Analytics] Pageviews anomaly‏

2019-12-22 Thread Nuria Ruiz
Hello,

This spike is probably caused by bot traffic. I would disregard it
entirely. Please see, for example, a similar problem affecting all top
pageviews in Hungarian Wikipedia last month.

https://phabricator.wikimedia.org/T237282

Thanks,

Nuria

On Sun, Dec 22, 2019 at 2:42 PM Brian Keegan 
wrote:

> Webmasters sometimes design their 404 pages to link to Wikipedia articles,
> so if their website goes down all their users (human and bot) start getting
> referred to Wikipedia articles. I could easily imagine there being a “This
> page isn’t available, go grab a cup of coffee” kind of placeholder page
> being up.
>
>
>
> *From: *Analytics  on behalf of
> Jan Ainali 
> *Reply-To: *"A mailing list for the Analytics Team at WMF and everybody
> who has an interest in Wikipedia and analytics." <
> analytics@lists.wikimedia.org>
> *Date: *Sunday, December 22, 2019 at 3:01 PM
> *To: *"A mailing list for the Analytics Team at WMF and everybody who has
> an interest in Wikipedia and analytics." 
> *Subject: *Re: [Analytics] Pageviews anomaly‏
>
>
>
> Another observation is that it only spiked from desktop and not from
> mobile which suggests it was not because of a general interest (which would
> cause spikes on all platforms).
>
>
> Best regards,
>
> Jan Ainali
>
> http://ainali.com
>
>
>
>
>
> Den sön 22 dec. 2019 kl 22:01 skrev effe iets anders <
> effeietsand...@gmail.com>:
>
> I agree this is odd - especially the fact that both the day before and the
> day after, the article had less than 100 visits.
> Usually there seems to be some spillover at the very least into the next
> day.
>
>
>
> Lodewijk
>
>
>
> On Sun, Dec 22, 2019 at 5:17 AM Keren WMIL  wrote:
>
> Dear all,
>
> It's almost Christmas and the new year is coming around. At the end of
> each year we publish a list of the most viewed Hebrew Wikipedia articles in
> the past year.
>
> We have a data point that appears to be anomalous: the article caffeine
> received more than 450K views on one day: the 26th of September 2019. We
> can't see any reason for such a surge and it is completely
> disproportionate. Even on English Wikipedia caffeine hasn't received so
> many views on one day - not even on the 8th of February, when Friedlieb
> Ferdinand Runge, who identified caffeine, was featured in the daily
> Google Doodle.
>
> It seems this data point is erroneous. Is there any way to verify that, or
> inquire where the error stems from?
>
>
>
> Kind regards and seasons greetings,
>
>
>
> Dr. Keren Shatzman
>
> Senior Coordinator, Academia & Projects
> Wikimedia Israel
>
>
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Availability of hourly pagecounts files

2019-12-16 Thread Nuria Ruiz
> thought that the hourly files were the source of data for the tool. Is
there any estimate of when the missing files will be available?
The source of data for the tool is the pageview API:
https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews#Pageview_counts_by_article
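
[Editor's note: a sketch of building a per-article request against that API.
The path format follows the docs linked above; the project, article, and
dates below are examples only:]

```python
from urllib.parse import quote

AQS = "https://wikimedia.org/api/rest_v1/metrics/pageviews"


def per_article_url(project, article, start, end,
                    access="all-access", agent="user", granularity="daily"):
    """URL for view counts of one article; start/end are YYYYMMDD strings.

    Article titles must be URL-encoded, with spaces as underscores and
    slashes escaped too (safe="").
    """
    title = quote(article.replace(" ", "_"), safe="")
    return (f"{AQS}/per-article/{project}/{access}/{agent}/"
            f"{title}/{granularity}/{start}/{end}")
```

The Pageviews tool queries this same API, which is why it can show data
even when the hourly dump files are delayed.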

Thanks,

Nuria

On Mon, Dec 16, 2019 at 8:33 AM Dan Andreescu 
wrote:

> Collin, we are in the middle of a big upgrade today, changing the whole
> Analytics cluster to use Kerberos.  So we're expecting delays on all
> datasets/jobs/services today.  If you watch this list, there will be a
> message going out later to let everyone know things are back to normal.
> Thanks for your patience.
>
> On Mon, Dec 16, 2019 at 11:28 AM Collin Stedman 
> wrote:
>
>> Hello,
>>
>> It appears that hourly pagecount files are missing for the past 24 hours
>> or so. I'm looking for the dumps in these two places:
>> https://dumps.wikimedia.org/other/pageviews/2019/2019-12/
>> https://dumps.wikimedia.org/other/pagecounts-ez/merged/2019/2019-12/
>>
>> Interestingly, the pageview tool at https://tools.wmflabs.org/pageviews/
>> is showing data for 2019-12-15, and I thought that the hourly files were
>> the source of data for the tool. Is there any estimate of when the missing
>> files will be available? Also, is the pageview tool pulling data from
>> another source?
>>
>> Thank you very much,
>> -CS
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] Releasing a dataset for caching research and tuning

2019-12-05 Thread Nuria Ruiz
Hello,

The Analytics team would like to announce the release of a new dataset for
caching research and tuning. Please take a look; these datasets are used
by the research community to evaluate caching algorithms.

https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Caching

More context can be found on phab ticket:
https://phabricator.wikimedia.org/T225538

Thanks,

Nuria
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Statistics

2019-08-27 Thread Nuria Ruiz
Emin:

You can see identified bot traffic versus user traffic in this graph:
https://stats.wikimedia.org/v2/#/az.wikipedia.org/reading/total-page-views/normal|bar|2-year|agent~user*spider|monthly
; sometimes bot traffic is about 30% of the total.

As the prior reply said, we know some of the user traffic is actually bots
(that do not identify as such), so the total bot traffic is actually bigger
than 30%.
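
[Editor's note: the 30% figure reads straight off the user-versus-spider
agent split. As a toy check of that ratio (the totals below are made up,
not actual azwiki numbers):]

```python
def bot_share(spider_views, user_views):
    """Fraction of identified traffic coming from self-declared bots."""
    return spider_views / (spider_views + user_views)


# Hypothetical monthly totals, for illustration only.
share = bot_share(spider_views=30_000_000, user_views=70_000_000)
```

Note this only counts self-identified bots; unidentified bots are still
hiding inside the user bucket.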

Thanks,

Nuria


On Tue, Aug 27, 2019 at 11:13 AM Francisco Dans  wrote:

> Hi Emin, thank you for your email.
>
> Given that the virtual totality of the pageviews from NL are made via
> Mobile Web,
> and there are virtually no pageviews from either desktop or mobile app,
> this seems like unidentified bot traffic coming from NL, which is
> problematic as views reported from this country are about 1/4 of all
> pageviews to azwiki.
>
> We're working on improving bot traffic identification. Please follow this
> task on Phabricator  if you're
> interested in reading more.
>
> Thanks,
> Francisco
>
> On Tue, Aug 27, 2019 at 10:00 AM Emin Allahverdi 
> wrote:
>
>> Hello!
>>
>> Sorry to disturb you. I looked at stats.wikimedia.org and some data
>> seems interesting to me. The site shows that (
>> https://stats.wikimedia.org/v2/#/az.wikipedia.org/reading/page-views-by-country/normal||2019-07-01~2019-07-01|~total|
>> ) last month (in August) pages in Azerbaijani Wikipedia had about 3 million
>> views from the Netherlands. These statistics have been about the same in
>> recent months. Even in (
>> https://stats.wikimedia.org/v2/#/az.wikipedia.org/reading/page-views-by-country/normal||2018-07-01~2018-07-01|~total|
>> ) July 2018, views from the Netherlands exceeded those from Azerbaijan.
>> When we compare the Netherlands with other countries, it seems something
>> is wrong, because not many Azerbaijani people live in the Netherlands,
>> and I don't believe local people read Azerbaijani Wikipedia ))
>>
>> Thank you for your attention.
>>
>> Kind regards,
>> User:Eminn
>> Emin Allahverdi
>>
>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>
>
> --
> *Francisco Dans*
> Software Engineer, Analytics Team
> Wikimedia Foundation
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Wiki-research-l] Analytics clients (stat/notebook hosts) and backups of home directories

2019-07-10 Thread Nuria Ruiz
>I have one question for you: As you allow/encourage for more copies of
>the files to exist
To be extra clear, we do not encourage data to be on the notebook hosts at
all; they have the capacity neither to process nor to host large amounts of
data. Data that you are working with is best placed in your
/user/your-username directory in Hadoop, so far from encouraging multiple
copies, we are rather encouraging you to keep the data outside the notebook
machines.

Thanks,

Nuria

On Wed, Jul 10, 2019 at 11:13 AM Kate Zimmerman 
wrote:

> I second Leila's question. The issue of how we flag PII data and ensure
> it's appropriately scrubbed came up in our team meeting yesterday. We're
> discussing team practices for data/project backups tomorrow and plan to
> come out with some proposals, at least for the short term.
>
> Are there any existing processes or guidelines I should be aware of?
>
> Thanks!
> Kate
>
> --
>
> Kate Zimmerman (she/they)
> Head of Product Analytics
> Wikimedia Foundation
>
>
> On Wed, Jul 10, 2019 at 9:00 AM Leila Zia  wrote:
>
>> Hi Luca,
>>
>> Thanks for the heads up. Isaac is coordinating a response from the
>> Research side.
>>
>> I have one question for you: As you allow/encourage for more copies of
>> the files to exist, what is the mechanism you'd like to put in place
>> for reducing the chances of PII to be copied in new folders that then
>> will be even harder (for your team) to keep track of? Having an
>> explicit process/understanding about this will be very helpful.
>>
>> Thanks,
>> Leila
>>
>>
>> On Thu, Jul 4, 2019 at 3:14 AM Luca Toscano 
>> wrote:
>> >
>> > Hi everybody,
>> >
>> > as part of https://phabricator.wikimedia.org/T201165 the Analytics team
>> > thought to reach out to everybody to make it clear that all the home
>> > directories on the stat/notebook nodes are not backed up periodically.
>> They
>> > run on a software RAID configuration spanning multiple disks of course,
>> so
>> > we are resilient to a disk failure, but even if unlikely it might happen
>> > that a host could lose all its data. Please keep this in mind when
>> working
>> > on important projects and/or handling important data that you care
>> about.
>> >
>> > I just added a warning to
>> >
>> https://wikitech.wikimedia.org/wiki/Analytics/Data_access#Analytics_clients
>> .
>> > If you have really important data that is too big to backup, keep in
>> mind
>> > that you can use your home directory (/user/your-username) on HDFS (that
>> > replicates data three times across multiple nodes).
>> >
>> > Please let us know if you have comments/suggestions/etc.. in the
>> > aforementioned task.
>> >
>> > Thanks in advance!
>> >
>> > Luca (on behalf of the Analytics team)
>> > ___
>> > Wiki-research-l mailing list
>> > wiki-researc...@lists.wikimedia.org
>> > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>>
>>
>> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases

2019-07-09 Thread Nuria Ruiz
Marc:

>We'd like to start the formal process to have an active collaboration, as
it seems there is no other solution available

Given that formal collaborations are somewhat hard to obtain (the research
team has only so many resources), my recommendation would be to import the
public data into another computing platform that is not as constrained as
Labs in terms of space and do your calculations there.

Thanks,

Nuria



On Tue, Jul 9, 2019 at 3:50 AM Marc Miquel  wrote:

> Thanks for your clarification Nuria.
>
> The categorylinks table is working better lately. Computing counts at the
> pagelinks table is critical. I'm afraid there is no solution for this one.
>
> I thought about creating a temporary table pagelinks with data from the
> dumps for each language edition. But to replicate the pagelinks database in
> the sever local disk would be so costful in terms of time and space. The
> magnitude of the enwiki table for pagelinks must be more than 50GB. The
> entire process would run during many many days considering the other
> language editions too.
>
> Another count I need to do is the number of editors per article, which also
> gets stuck on the revision table. For the rest of the data, as you said, it
> is more about retrieval, and I can use alternatives.
>
> The queries to obtain counts for pagelinks are something that worked before
> with the database replicas, and a database with more power like Hadoop would
> handle them with ease. The problem is a mixture of both retrieval and
> computing power.
>
> We'd like to start the formal process to have an active collaboration, as
> it seems there is no other solution available and we cannot be stuck and
> not deliver the work promised. I'll let you know when I have more info.
>
> Thanks again.
> Best,
>
> Marc Miquel
>
>
> Missatge de Nuria Ruiz  del dia dt., 9 de jul. 2019
> a les 1:44:
>
>> >Will there be a release for these two tables?
>> No, sorry, there will not be. The dataset release is about pages and
>> users. To be extra clear though, it is not tables but a denormalized
>> reconstruction of the edit history.
>>
>> > Could I connect to the Hadoop to see if the queries on pagelinks and
>> categorylinks run faster?
>> It is a bit more complicated than just "connecting", but I do not think
>> we have to dwell on that because, as far as I know, there is no
>> categorylink info in Hadoop.
>>
>
>> Hadoop has the set of data from mediawiki that we use to create the
>> dataset I pointed you to:
>> https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history
>>  and
>> a bit more.
>>
>> Is it possible to extract some of this information from the xml dumps?
>> Perhaps somebody in the list has other ideas?
>>
>> Thanks,
>>
>> Nuria
>>
>> P.S. So you know in order to facilitate access to our computing resources
>> and private data (there is no way for us to give access to only "part" of
>> the data we hold in hadoop)  we require an active collaboration with our
>> research team. We cannot support ad-hoc access to hadoop for community
>> members.
>> Here is some info:
>> https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations
>>
>>
>>
>>
>>
>>
>> On Mon, Jul 8, 2019 at 4:14 PM Marc Miquel  wrote:
>>
>>> Hello Nuria,
>>>
>>> This seems like an interesting alternative for some data (page, users,
>>> revision). It can really help and make some processes faster (at the moment
>>> we gave up re-running the revision query, as the new user_agent change made it
>>> also slower). So we will take a look at it as soon as it is ready.
>>>
>>> However, the scripts are struggling with other tables: pagelinks and
>>> category graph.
>>>
>>> For instance, we need to count the percentage of links an article
>>> directs to other pages or the percentage of links it receives from a group
>>> of pages. Likewise, we need to run down the category graph starting from a
>>> specific group of categories. At the moment, the query that uses pagelinks
>>> is not really working for counting, whether passing parameters for the entire
>>> table or for specific parts (using batches).
>>>
>>> Will there be a release for these two tables? Could I connect to the
>>> Hadoop to see if the queries on pagelinks and categorylinks run faster?
>>>
>>> If there is any other alternative we'd be happy to try as we cannot
>>> progress for several weeks.
>>> Thanks again,
>>>
>>> Marc

Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases

2019-07-08 Thread Nuria Ruiz
>Will there be a release for these two tables?
No, sorry, there will not be. The dataset release is about pages and users.
To be extra clear though, it is not tables but a denormalized
reconstruction of the edit history.

> Could I connect to the Hadoop to see if the queries on pagelinks and
categorylinks run faster?
It is a bit more complicated than just "connecting", but I do not think we
have to dwell on that because, as far as I know, there is no categorylink
info in Hadoop.

Hadoop has the set of data from mediawiki that we use to create the dataset
I pointed you to:
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history
and
a bit more.

Is it possible to extract some of this information from the xml dumps?
Perhaps somebody in the list has other ideas?

Thanks,

Nuria

P.S. So you know in order to facilitate access to our computing resources
and private data (there is no way for us to give access to only "part" of
the data we hold in hadoop)  we require an active collaboration with our
research team. We cannot support ad-hoc access to hadoop for community
members.
Here is some info:
https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations






On Mon, Jul 8, 2019 at 4:14 PM Marc Miquel  wrote:

> Hello Nuria,
>
> This seems like an interesting alternative for some data (page, users,
> revision). It can really help and make some processes faster (at the moment
> we gave up re-running the revision query, as the new user_agent change made it
> also slower). So we will take a look at it as soon as it is ready.
>
> However, the scripts are struggling with other tables: pagelinks and
> category graph.
>
> For instance, we need to count the percentage of links an article directs
> to other pages or the percentage of links it receives from a group of
> pages. Likewise, we need to run down the category graph starting from a
> specific group of categories. At the moment, the query that uses pagelinks
> is not really working for counting, whether passing parameters for the entire
> table or for specific parts (using batches).
>
> Will there be a release for these two tables? Could I connect to the
> Hadoop to see if the queries on pagelinks and categorylinks run faster?
>
> If there is any other alternative we'd be happy to try as we cannot
> progress for several weeks.
> Thanks again,
>
> Marc
>
> Missatge de Nuria Ruiz  del dia dt., 9 de jul. 2019
> a les 0:56:
>
>> Hello,
>>
>> From your description it seems that your problem is not one of computation
>> (well, not your main one) but rather data extraction. The Labs replicas
>> are not meant for big data extraction jobs, as you have just found out.
>> Neither is Hadoop. Now, our team will soon be releasing a dataset of
>> denormalized edit data that you can probably use. It is still up for
>> discussion whether the data will be released as a JSON dump or otherwise,
>> but basically it is a denormalized version of all the data held in the
>> replicas that will be created monthly.
>>
>> Please take a look at the documentation of the dataset:
>>
>> https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history
>>
>> This is the phab ticket:
>> https://phabricator.wikimedia.org/T208612
>>
>> So, to sum up, once this dataset is out (we hope late this quarter or
>> early next) you can probably build your own datasets from it thus rendering
>> your usage of the replicas obsolete. Hopefully this makes sense.
>>
>> Thanks,
>>
>> Nuria
>>
>>
>>
>>
>> On Mon, Jul 8, 2019 at 3:34 PM Marc Miquel  wrote:
>>
>>> To whom it may concern,
>>>
>>> I am writing regarding the project *Cultural Diversity Observatory*
>>> and the data we are collecting. In short, this project aims at bridging the
>>> content gaps between language editions that relate to cultural and
>>> geographical aspects. For this we need to retrieve data from all language
>>> editions and Wikidata, and run some scripts in order to crawl down the
>>> category and the link graph, in order to create some datasets and
>>> statistics.
>>>
>>> The reason I am writing is that we are stuck: we cannot automate the
>>> scripts that retrieve data from the Replicas. We could create the
>>> datasets a few months ago, but during the past months it has been impossible.
>>>
>>> We are concerned because it is one thing to create the dataset once for
>>> research purposes and another to create it on a monthly basis.
>>> This is what we promised in the project grant
>>> <https://meta.wikimedia.org/wiki/Grants:Project/WCD

Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases

2019-07-08 Thread Nuria Ruiz
Hello,

From your description it seems that your problem (well, your main problem)
is not one of computation but rather data extraction. The labs replicas
are not meant for big data extraction jobs as you have just found out.
Neither is Hadoop. Now, our team will soon be releasing a dataset of
denormalized edit data that you can probably use. It is still up for
discussion whether the data will be released as a JSON dump or in another
format, but it is basically a denormalized version of all the data held in
the replicas, and it will be regenerated monthly.

Please take a look at the documentation of the dataset:
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history

This is the phab ticket:
https://phabricator.wikimedia.org/T208612

So, to sum up, once this dataset is out (we hope late this quarter or early
next) you can probably build your own datasets from it thus rendering your
usage of the replicas obsolete. Hopefully this makes sense.
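As a hypothetical illustration of how such a denormalized dump could be consumed once released (the format was still undecided at the time, so the JSON-lines shape and field names below, borrowed from the mediawiki_history documentation, are assumptions):

```python
import json
from collections import Counter

def edits_per_user(lines):
    """Tally revision events by user from a denormalized-history dump
    in JSON-lines form (field names are assumptions)."""
    tally = Counter()
    for line in lines:
        event = json.loads(line)
        # Only revision events count as edits; other entities (page,
        # user) describe different kinds of history events.
        if event.get("event_entity") == "revision":
            tally[event.get("event_user_text", "?")] += 1
    return tally

sample = [
    '{"event_entity": "revision", "event_user_text": "Alice"}',
    '{"event_entity": "revision", "event_user_text": "Alice"}',
    '{"event_entity": "page", "event_user_text": "Bob"}',
]
by_user = edits_per_user(sample)
```

Deriving per-wiki datasets this way would replace the long-running replica queries with a single pass over a monthly dump.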

Thanks,

Nuria




On Mon, Jul 8, 2019 at 3:34 PM Marc Miquel  wrote:

> To whom it may concern,
>
> I am writing regarding the project *Cultural Diversity Observatory*
> and the data we are collecting. In short, this project aims at bridging the
> content gaps between language editions that relate to cultural and
> geographical aspects. For this we need to retrieve data from all language
> editions and Wikidata, and run some scripts in order to crawl down the
> category and the link graph, in order to create some datasets and
> statistics.
>
> The reason I am writing is that we are stuck: we cannot automate the
> scripts that retrieve data from the Replicas. We could create the
> datasets a few months ago, but during the past months it has been impossible.
>
> We are concerned because it is one thing to create the dataset once for
> research purposes and another to create it on a monthly basis.
> This is what we promised in the project grant
> 
> details and now we cannot do it because of the infrastructure. It is
> important to do it on a monthly basis because the data visualizations and
> statistics Wikipedia communities will receive need to be updated.
>
> Lately there had been some changes in the Replicas databases and the
> queries that used to take several hours are getting stuck completely. We
> tried to code them in multiple ways: a) using complex queries, b) doing the
> joins as code logics and in-memory, c) downloading the parts of the table
> that we require and storing them in a local database. *None is an option
> now *considering the current performance of the replicas.
>
> Bryan Davis suggested that this might be a moment to consult the Analytics
> team, considering the Hadoop environment is designed to run long, complex
> queries and has massively more compute power than the Wiki Replicas
> cluster. We would certainly be relieved if you considered letting us connect
> to these Analytics databases (Hadoop).
>
> Let us know if you need more information on the specific queries or the
> processes we are running. The server we are using is wcdo.eqiad.wmflabs. We
> will be happy to explain in detail anything you require.
>
> Thanks.
> Best regards,
>
> Marc Miquel
>
> PS: You can read about the method we follow to retrieve data and create
> the dataset here:
>
> *Miquel-Ribé, M., & Laniado, D. (2019). Wikipedia Cultural Diversity
> Dataset: A Complete Cartography for 300 Language Editions. Proceedings of
> the 13th International AAAI Conference on Web and Social Media. ICWSM. ACM.
> 2334-0770 *
> www.aaai.org/ojs/index.php/ICWSM/article/download/3260/3128/
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Superset 0.32 upgrade coming tomorrow (May 15th, early EU morning)

2019-05-15 Thread Nuria Ruiz
Hello,

Superset has now been upgraded. There are notable fixes in this version, and
you can now go crazy creating histograms because they actually work.

An example: histogram of response sizes as reported by varnish last week:
https://bit.ly/2vYB966

Also, there is a new dataset available called edit_hourly that is the edit
equivalent of pageview_hourly.

Examples of graphs on top of this data:

Pages created per project last month (in content namespaces):
https://bit.ly/2LUdBua

Edits per platform in Indonesia and Arabic Wikipedia last month:
https://bit.ly/2JlKNIG

Thanks,

Nuria

On Tue, May 14, 2019 at 10:57 AM Luca Toscano 
wrote:

> Hi everybody,
>
> as FYI I am going to upgrade Superset tomorrow (May 15th) to 0.32. This
> will involve moving to a new host based on Debian Buster and Python 3.7, so
> the move will require some time and it will be hopefully fully done early
> during the EU morning.
>
> Tracking task: https://phabricator.wikimedia.org/T211706
>
> Luca (on behalf of the Analytics team)
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [ISSUE] dumps.wikimedia.org stop working

2019-04-04 Thread Nuria Ruiz
Hello,

This issue should be corrected by now. Please check.

Thanks,

Nuria

On Wed, Apr 3, 2019 at 9:18 AM Nuria Ruiz  wrote:

>
> Sorry this has broken. Erik Z. retired recently, and we are moving some of
> the work he did to run somewhat differently. You can follow this issue:
>
> https://phabricator.wikimedia.org/T220012
>
>
> On Wed, Apr 3, 2019 at 6:36 AM Mauro Mascia 
> wrote:
>
>> Hi,
>>
>> it seems that the daily dumps of pagecounts, i.e.
>> https://dumps.wikimedia.org/other/pagecounts-ez/merged/2019/2019-03/
>>
>> have not been working since March 25th.
>> I can't find on the net whether this is a temporary issue or whether they
>> will be permanently abandoned.
>>
>> Do you have any information about that?
>>
>> Please let me know and thanks in advance
>>
>> ---
>> Mauro Mascia
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [ISSUE] dumps.wikimedia.org stop working

2019-04-03 Thread Nuria Ruiz
Sorry this has broken. Erik Z. retired recently, and we are moving some of
the work he did to run somewhat differently. You can follow this issue:

https://phabricator.wikimedia.org/T220012


On Wed, Apr 3, 2019 at 6:36 AM Mauro Mascia 
wrote:

> Hi,
>
> it seems that the daily dumps of pagecounts, i.e.
> https://dumps.wikimedia.org/other/pagecounts-ez/merged/2019/2019-03/
>
> have not been working since March 25th.
> I can't find on the net whether this is a temporary issue or whether they
> will be permanently abandoned.
>
> Do you have any information about that?
>
> Please let me know and thanks in advance
>
> ---
> Mauro Mascia
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Trouble getting yesterday's pageviews data

2019-04-02 Thread Nuria Ruiz
Outage docs now available:
https://wikitech.wikimedia.org/wiki/Incident_documentation/20190402-0401KafkaJumbo
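For reference, the per-article endpoint path used in the request quoted below can be assembled like this (a sketch; the path-segment order matches the REST v1 URL in the quoted email, and the endpoint simply returns 404 until a day's data has been loaded, so callers should retry later rather than treat it as permanent):

```python
def per_article_url(project: str, article: str, start: str, end: str,
                    access: str = "all-access", agent: str = "user",
                    granularity: str = "daily") -> str:
    """Assemble a Pageview API per-article URL (REST v1 path order:
    project/access/agent/article/granularity/start/end)."""
    base = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"
    return (f"{base}/{project}/{access}/{agent}/{article}"
            f"/{granularity}/{start}/{end}")

# Same request as in the email below: daily desktop user views of
# article "1" on en.wikipedia for April 1st, 2019.
url = per_article_url("en.wikipedia", "1", "2019040100", "2019040200",
                      access="desktop")
```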

On Tue, Apr 2, 2019 at 6:15 AM Luca Toscano  wrote:

> Hi Collin,
>
> you have anticipated my email :) We are tracking the issue in
> https://phabricator.wikimedia.org/T219842, we had a Kafka outage
> yesterday and we are still fixing jobs that didn't run.
>
> Luca
>
> Il giorno mar 2 apr 2019 alle ore 14:47 Collin Stedman 
> ha scritto:
>
>> Hello,
>>
>> I'm having trouble getting yesterday's pageviews data. There is no hourly
>> dump file for 23:00 4/1/19 (though all other hours are accounted for), and
>> the pageviews API is returning "not found" errors for requests like the
>> following:
>> https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia/desktop/user/1/daily/2019040100/2019040200.
>>
>>
>> In the past I've been told that it's not unusual for there to be
>> occasional long delays in pageviews data becoming available at the start of
>> the month. Does this explain the outage? Is there any way to predict when
>> the data will be available, or do I just have to continue checking back?
>> And does anybody know what caused the slowdown, and if I should expect it
>> to continue?
>>
>> Thank you very much,
>> -CS
>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] Easier mapping from Wikistats1 to Wikistats2 metrics

2019-03-28 Thread Nuria Ruiz
Hello!

The Analytics team would like to announce a couple of changes. We are working
towards an easier way to navigate metrics that appear in both Wikistats1
and Wikistats2 and to compare numbers. Please take a look at the changes
deployed today for (for example) Italian Wikipedia:
https://stats.wikimedia.org/v2/#/metrics/it.wikipedia.org

This is an example of a definition of active editors that matches
Wikistats1:

https://stats.wikimedia.org/v2/#/it.wikipedia.org/contributing/active-editors/normal|line|2-Year|~total

As always please file bugs (suggestions are fine too) on Phab using this
handy link:

https://phabricator.wikimedia.org/maniphest/task/edit/?title=Wikistats%20Bug=Analytics-Wikistats,Analytics

Thanks,

Nuria
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Availability of data on Wikipedia Zero rollout

2019-03-25 Thread Nuria Ruiz
Sneha,

Some of the data that would be key to estimate the  "increase of
participation"  you mention has either never been collected ("Whether those
edits were being made using a device that accessed WP through WP Zero") or
it was only retained short term, 90 days (" The kind of device being used
for editing").

Thanks,

Nuria



On Mon, Mar 25, 2019 at 9:38 AM Sneha Narayan 
wrote:

> Hello everyone!
>
> I'm a CS professor at Carleton College (formerly a PhD student at
> Northwestern), and I've collaborated with folks at WMF on WP research in
> the past. Most notably, I was the lead author on a paper written with
> Jonathan Morgan and Jake Orlowitz evaluating the Wikipedia Adventure
> . I hope to continue having
> productive collaborations with other people who care about WP, and keep
> producing research that supports the future of the project.
>
> A potential research idea I'd like to explore is understanding whether and
> how Wikipedia Zero impacted the amount and nature of participation on WP in
> the projects that were affected by its rollout. Since the Wikipedia Zero
> program lasted during a particular time period and then ended, it also sets
> up a potentially good avenue for a comparative study.
>
> I was wondering if any of you were aware of any datasets that log
> information about the rollout of/participation in Wikipedia Zero.
> Specifically, some of the data I'm interested in include:
>
> - Countries that had access to WP Zero, including dates/times that this
> was turned on and off
> - Any information about whether access through WP Zero meant that you
> could only visit/edit particular parts or language editions of Wikipedia (I
> don't think this was the case, but I wanted to make sure)
> - Edits made to any WP language edition by IP addresses from those
> countries during that time period
> - Whether those edits were being made using a device that accessed WP
> through WP Zero
> - The kind of device being used for editing
> - Other generic features of the edit (number of characters, namespace,
> registered/unregistered etc)
>
> I'm aware that the data may not be recorded exactly along those lines, but
> I was still curious to know what data about Wikipedia Zero is out there,
> and whether or not it was publicly available.
>
> Thank you for your help!
> -Sneha
>
>
> *Sneha Narayan*
> Department of Computer Science
> Carleton College
> snehanarayan.com
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] R: Analytics Digest, Vol 85, Issue 3

2019-03-11 Thread Nuria Ruiz
>but I'd like to follow the behaviour flow when a user accesses some
Wikipedia page following a link from my website.
>I don't know if that is possible somehow and if it makes sense for you.
I see. It does make sense, but that is not data we have.

Thanks,

Nuria

On Sat, Mar 9, 2019 at 5:19 AM viviana paga  wrote:

> Hi Nuria,
> thanks for your reply and tips!
> As you propose, I use matomo to get data from my client, but I'd like to
> follow the behaviour flow when a user accesses some Wikipedia page following
> a link by my website.
> I don't know if that is possible somehow and if it makes sense for you.
> Many thanks,
> Viviana
>
>
> --
> *Da:* Analytics  per conto di
> analytics-requ...@lists.wikimedia.org <
> analytics-requ...@lists.wikimedia.org>
> *Inviato:* venerdì 8 marzo 2019 17:00
> *A:* analytics@lists.wikimedia.org
> *Oggetto:* Analytics Digest, Vol 85, Issue 3
>
> Send Analytics mailing list submissions to
> analytics@lists.wikimedia.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> https://lists.wikimedia.org/mailman/listinfo/analytics
> or, via email, send a message with subject or body 'help' to
> analytics-requ...@lists.wikimedia.org
>
> You can reach the person managing the list at
> analytics-ow...@lists.wikimedia.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Analytics digest..."
>
>
> Today's Topics:
>
>1. R: Analytics Digest, Vol 85, Issue 2 (viviana paga)
>2. Re: R: Analytics Digest, Vol 85, Issue 2 (Nuria Ruiz)
>
>
> --
>
> Message: 1
> Date: Fri, 8 Mar 2019 13:21:34 +
> From: viviana paga 
> To: "analytics@lists.wikimedia.org" 
> Subject: [Analytics] R: Analytics Digest, Vol 85, Issue 2
> Message-ID:
> <
> pr1pr06mb4698d7f323ae49a1cc429c23e4...@pr1pr06mb4698.eurprd06.prod.outlook.com
> >
>
> Content-Type: text/plain; charset="utf-8"
>
> Hi Dan,
> thanks for your reply !
> I agree with you and in fact I do that in my front-end, but I think that
> it would be interesting to have some general stats from Wikimedia too; in
> particular to understand which impact could my project have on general
> Wikimedia stats and which will be the behaviour of the users arriving to
> Wikimedia from my site (if the attended one or not).
> I thought having some stats by api-user-agent from backend could help me
> to understand these points and improve in the future my project in the best
> way. What do you think ? Is there a procedure that can I follow to have
> these stats?
> Many thanks,
> Viviana
>
> 
> Da: Analytics  per conto di
> analytics-requ...@lists.wikimedia.org <
> analytics-requ...@lists.wikimedia.org>
> Inviato: venerdì 8 marzo 2019 13:00
> A: analytics@lists.wikimedia.org
> Oggetto: Analytics Digest, Vol 85, Issue 2
>
> Send Analytics mailing list submissions to
> analytics@lists.wikimedia.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> https://lists.wikimedia.org/mailman/listinfo/analytics
> or, via email, send a message with subject or body 'help' to
> analytics-requ...@lists.wikimedia.org
>
> You can reach the person managing the list at
> analytics-ow...@lists.wikimedia.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Analytics digest..."
>
>
> Today's Topics:
>
>1. Stats of mediawiki API / Access to non-public data (viviana paga)
>2. Re: Stats of mediawiki API / Access to non-public data
>   (Dan Andreescu)
>
>
> --
>
> Message: 1
> Date: Thu, 7 Mar 2019 14:15:27 +
> From: viviana paga 
> To: "analytics@lists.wikimedia.org" 
> Subject: [Analytics] Stats of mediawiki API / Access to non-public
> data
> Message-ID:
> <
> pr1pr06mb4698b179051d012c970a8928e4...@pr1pr06mb4698.eurprd06.prod.outlook.com
> >
>
> Content-Type: text/plain; charset="utf-8"
>
> Hi all,
>
> I’m working on a project about the sharing of the cultural heritage and,
> more in general, about the sharing of the open knowledges.
> In particular, I'm developing a webservice that use the Mediawiki API and
> I'd like to have some stats about the traffic of my api calls to the
> commons.wikipedia.org domain.
>
> More specifically,  I'd like to have:
> - the number of GET Requests by Api-User-Agent
> - the number of views/edit by A

Re: [Analytics] R: Analytics Digest, Vol 85, Issue 2

2019-03-08 Thread Nuria Ruiz
>I thought having some stats by api-user-agent from backend could help me
to understand these points and improve in the future my project in the best
way. What do you >think ? Is there a procedure that can I follow to have
these stats?
The stats would be the same, Viviana: raw counts of calls from your client
to the API. We do not have the ability to provide those upon request, but
it would be easy for you to gather the data from your client; there are
open-source solutions like https://matomo.org/ that you can use to keep
track of stats on your end.
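A client-side tally of the kind suggested here might look like this (a minimal sketch; the log-record shape and user-agent string are hypothetical, and in practice a tool like Matomo would do this for you):

```python
from collections import Counter

def calls_by_user_agent(log_records):
    """Count outgoing API calls per Api-User-Agent value, using whatever
    request log your own client or proxy produces (record shape is a
    hypothetical example)."""
    return Counter(r.get("api_user_agent", "unknown") for r in log_records)

records = [
    {"path": "/w/api.php", "api_user_agent": "CulturalHeritageBeta/0.1"},
    {"path": "/w/api.php", "api_user_agent": "CulturalHeritageBeta/0.1"},
    {"path": "/w/api.php"},  # a call logged without the header
]
tally = calls_by_user_agent(records)
```

Because the counting happens entirely on your side, it preserves user privacy exactly as described in the original request.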

Thanks,

Nuria

On Fri, Mar 8, 2019 at 5:21 AM viviana paga  wrote:

> Hi Dan,
> thanks for your reply !
> I agree with you and in fact I do that in my front-end, but I think that
> it would be interesting to have some general stats from Wikimedia too; in
> particular to understand which impact could my project have on general
> Wikimedia stats and which will be the behaviour of the users arriving to
> Wikimedia from my site (if the attended one or not).
> I thought having some stats by api-user-agent from backend could help me
> to understand these points and improve in the future my project in the best
> way. What do you think ? Is there a procedure that can I follow to have
> these stats?
> Many thanks,
> Viviana
>
> --
> *Da:* Analytics  per conto di
> analytics-requ...@lists.wikimedia.org <
> analytics-requ...@lists.wikimedia.org>
> *Inviato:* venerdì 8 marzo 2019 13:00
> *A:* analytics@lists.wikimedia.org
> *Oggetto:* Analytics Digest, Vol 85, Issue 2
>
> Send Analytics mailing list submissions to
> analytics@lists.wikimedia.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> https://lists.wikimedia.org/mailman/listinfo/analytics
> or, via email, send a message with subject or body 'help' to
> analytics-requ...@lists.wikimedia.org
>
> You can reach the person managing the list at
> analytics-ow...@lists.wikimedia.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Analytics digest..."
>
>
> Today's Topics:
>
>1. Stats of mediawiki API / Access to non-public data (viviana paga)
>2. Re: Stats of mediawiki API / Access to non-public data
>   (Dan Andreescu)
>
>
> --
>
> Message: 1
> Date: Thu, 7 Mar 2019 14:15:27 +
> From: viviana paga 
> To: "analytics@lists.wikimedia.org" 
> Subject: [Analytics] Stats of mediawiki API / Access to non-public
> data
> Message-ID:
> <
> pr1pr06mb4698b179051d012c970a8928e4...@pr1pr06mb4698.eurprd06.prod.outlook.com
> >
>
> Content-Type: text/plain; charset="utf-8"
>
> Hi all,
>
> I’m working on a project about the sharing of cultural heritage and,
> more in general, about the sharing of open knowledge.
> In particular, I'm developing a webservice that uses the Mediawiki API and
> I'd like to have some stats about the traffic of my api calls to the
> commons.wikipedia.org domain.
>
> More specifically,  I'd like to have:
> - the number of GET Requests by Api-User-Agent
> - the number of views/edit by Api-User-Agent
> - the stats of the Wikipedia traffic from inbound links by a specific domain
> or url
>
> Is this possible somehow to access to these limited non-public data?
> Is there a procedure that I can follow?
>
> The project is still in development, but next April we will release a beta
> version for a limited range of user-testers.
> The project is completely non-profit and it would provide maximum freedom,
> independence and privacy for its users.
> That’s why, I’d like to have from backend some stats by api-user-agent:
> that would guarantee the total privacy of the user and, at the same time,
> the project could have some general stats about the traffic, the
> utilisation and its impact on the general Wikimedia stats.
>
> If someone among you is interested in these issues (open-shared-cultural
> heritage, open linked data), I’d like to keep in touch and, even, to
> propose that you participate as a tester in April.
>
> Thank you in advance,
> Kind regards,
> Viviana Paga
> https://www.linkedin.com/in/viviana-paga-42bb8b44/
>
> -- next part --
> An HTML attachment was scrubbed...
> URL: <
> https://lists.wikimedia.org/pipermail/analytics/attachments/20190307/7f195cb4/attachment-0001.html
> >
>
> --
>
> Message: 2
> Date: Thu, 7 Mar 2019 11:14:32 -0500
> From: Dan Andreescu 
> To: "A mailing list for the Analytics Team at WMF and everybody who
> has an interest in Wikipedia and analytics."
> 
> Subject: Re: [Analytics] Stats of mediawiki API / Access to non-public
> data
> Message-ID:
>  qby1k8+n47zmh...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Hi Viviana,
>
> Great project!  The first thought I had looking at your question is that
> you can collect all the data you're asking about.  If your 

Re: [Analytics] Further Development of Wikipedia statistics

2019-02-07 Thread Nuria Ruiz
Hello,

Several things come to mind:

Topviews provides much of this info already digested, in a way from which it
would not be hard to calculate what you want; it gets data from the pageview
API and does some useful filtering:
https://tools.wmflabs.org/topviews/?project=de.wikipedia.org=all-access=last-month=

You probably know this, but let me point out that there are pageview dumps
from which you could calculate 1, 2 and 3. We understand that perhaps your
request is about not having to do this calculation yourself for dewiki, but
just in case you do not know: all data about detailed pageviews for a
project is available:
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageviews#Differences_to_the_pagecounts-raw_dataset
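Once per-article counts are parsed out of those dumps, reports 1 and 2 reduce to a small coverage computation, sketched here on toy data (function and variable names are illustrative only):

```python
def articles_for_share(views, share):
    """Smallest set of top articles (by views) whose summed views reach
    `share` of all views."""
    total = sum(views.values())
    ranked = sorted(views.items(), key=lambda kv: kv[1], reverse=True)
    covered, picked = 0, []
    for title, count in ranked:
        # Stop as soon as the already-picked articles cover the share.
        if covered >= share * total:
            break
        picked.append(title)
        covered += count
    return picked

# Toy stand-in for one month of per-article counts parsed from the dumps.
counts = {"Main_Page": 900, "Berlin": 50, "Hamburg": 30, "Kiel": 15, "X": 5}
```

Report 3 (pages under x views per year) is then just a filter over the same per-article counts.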

If you think top views is not sufficient for your use case, please do file
a phab ticket with your request:
https://phabricator.wikimedia.org/maniphest/task/edit/?title=Wikistats%20New%20Feature=Analytics-Wikistats,Analytics

Finally, I think any pageview-focused effort needs to take into account the
very significant number of bots that are not marked as such in our system
(and the top-1000 list is an example of data skewed by bot traffic). Topviews
does some "crowdsourced" filtering, and again, I think for your use case it
will be very useful.

Thanks,

Nuria

On Thu, Feb 7, 2019 at 1:47 AM WikiPeter-HH  wrote:

> Hi,
>
> in light of the current switch from Wikistats 1 to Wikistats 2 I would
> like to express a strong desire to get some additional features for the
> statistcs. The rationale for this request is described below:
>
> 1. How many articles make up 90 / 95 / 99 percent of all page views
> over a certain period? Which articles are these?  (live articles excluding
> WP:xx and Spezial:xxx etc.)
>
> 2. Which share of page views goes to the Top 5% / Top 10% of our pages?
>
> 3. List of pages with less than x views per year (x = e.g. 12, 25, 100)
>
>
> *Reasoning / Rationale*:
>
> I am limiting myself to the German Wikipedia, assuming that the situation
> is similar in other language WP's.
>
> In the German Wikipedia we meanwhile have more than 2.2 million articles.
> This produces an enormous maintenance workload, to keep up the quality of
> articles. We clearly have too few people to do that maintenance work.
>
> This means that we have to focus our maintenance efforts on those
> articles, which really matter and those are much more than the list of 1000
> we get from the Top-Views stats. The reports suggested above will give us
> exactly the information required to have an informed discussion about the
> articles we should focus on. And the report under #3 will give us a means
> to identify articles which we could either delete or clearly label as 'out
> of maintenance'.
>
> We do know that certain articles are 'en vogue' for a short period, as
> they relate to current news topics. Therefore we must be able to have above
> reports over a longer period (at least a year, maybe two) to identify those
> articles which are really the long term favourites.
>
> Best regards
>
> Peter
>
> my user page: Wikipeter-HH
> 
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Research-Internal] Article about ML in production woes

2019-02-07 Thread Nuria Ruiz
Team,

Since everyone is here, we will be working on a machine learning
infrastructure program this year. I will set up meetings with everyone on
this thread and some others in SRE and Audiences to get a "bag of requests"
of things that are missing. The first round of talks, which I hope to finish
next week, is to hear what everyone's requests and ideas are. I will be
sending meeting invites today and tomorrow. I think some themes will emerge
from those.
Thus far, it is pretty clear that we need a better way to deploy models to
production (right now we deploy those to Elasticsearch in very crafty
manners, for example); we need an answer to the GPU issues around training
models; we need a "recommended way" in which we train and compute, and some
unified system for tracking models+data+tests; and finally, there are
probably many learnings from the work done on ORES thus far.

Thanks,

Nuria


On Thu, Feb 7, 2019 at 8:40 AM Miriam Redi  wrote:

> Hey Andrew!
>
> Thank you so much for sharing this and start this conversation. We had a
> meeting at All Hands with all people interested in "Image Classification"
> https://phabricator.wikimedia.org/T215413 , and one of the open questions
> was exactly how to find a "common repository" for ML models that different
> groups and products within the organization can use. So, please, count me
> in!
>
> Thanks,
>
> M
>
>
> On Thu, Feb 7, 2019 at 4:38 PM Aaron Halfaker 
> wrote:
>
>> Just gave the article a quick read.  I think this article pushes on some
>> key issues for sure.  I definitely agree with the focus on python/jupyter
>> as essential for a productive workflow that leverages the best from
>> research scientists.  We've been thinking about what ORES 2.0 would look
>> like and event streams are the dominant proposal for improving on the
>> limitations of our queue-based worker pool.
>>
>> One of the nice things about ORES/revscoring is that it provides a nice
>> framework for operating using the *exact same code* no matter the
>> environment.  E.g. it doesn't matter if we're calling out to an API to get
>> data for feature extraction or providing it via a stream.  By investing in
>> a dependency injection strategy, we get that flexibility.  So to me, the
>> hardest problem -- the one I don't quite know how to solve -- is how we'll
>> mix and merge streams to get all of the data we want available for feature
>> extraction.  If I understand correctly, that's where Kafka shines.  :)
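The dependency-injection idea described above can be reduced to a tiny sketch (all names hypothetical; this is not the revscoring API): feature extraction takes the data-access function as a parameter, so the same code runs against an API client, a stream payload, or a test stub:

```python
def extract_features(rev_id, get_text):
    """Feature extraction that is agnostic to where the text comes from:
    `get_text` may wrap an API call, a Kafka event payload, or a stub."""
    text = get_text(rev_id)
    return {"chars": len(text), "words": len(text.split())}

# Inject a stub data source; in production this could be an API client
# or a function reading an already-delivered stream message.
features = extract_features(42, lambda rid: "some wiki text")
```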
>>
>> I'm definitely interested in fleshing out this proposal.  We should
>> probably be exploring the processes for training new types of models (e.g.
>> image processing) using different strategies than ORES.  In ORES, we're
>> almost entirely focused on using sklearn but we have some basic
>> abstractions for other estimator libraries.  We also make some strong
>> assumptions about running on a single CPU that could probably be broken for
>> some performance gains using real concurrency.
>>
>> -Aaron
>>
>> On Thu, Feb 7, 2019 at 10:05 AM Goran Milovanovic <
>> goran.milovanovic_...@wikimedia.de> wrote:
>>
>>> Hi Andrew,
>>>
>>> I have recently started a six month AI/Machine Learning Engineering
>>> course which focuses exactly on the topics that you've shown interest in.
>>>
>>> So,
>>>
>>> >>>  I'd love it if we had a working group (or whatever) that focused
>>> on how to standardize how we train and deploy ML for production use.
>>>
>>> Count me in.
>>>
>>> Regards,
>>> Goran
>>>
>>>
>>> Goran S. Milovanović, PhD
>>> Data Scientist, Software Department
>>> Wikimedia Deutschland
>>>
>>> 
>>> "It's not the size of the dog in the fight,
>>> it's the size of the fight in the dog."
>>> - Mark Twain
>>> 
>>>
>>>
>>> On Thu, Feb 7, 2019 at 4:16 PM Andrew Otto  wrote:
>>>
 Just came across

 https://www.confluent.io/blog/machine-learning-with-python-jupyter-ksql-tensorflow

 In it, the author discusses some of what he calls the 'impedance
 mismatch' between data engineers and production engineers.  The links to
 Uber's Michelangelo (which as far
 as I can tell has not been open sourced) and the Hidden Technical Debt
 in Machine Learning Systems paper are
 also very interesting!

 At All hands I've been hearing more and more about using ML in
 production, so these things seem very relevant to us.  I'd love it if we
 had a working group (or whatever) that focused on how to standardize how we
 train and deploy ML for production use.

 :)
 ___
 Analytics mailing list
 Analytics@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/analytics

>>>
>>
>> --
>>
>> Aaron Halfaker
>>
>> Principal Research Scientist
>>
>> Head of the Scoring 

Re: [Analytics] Does prefetch count as a pageview?

2018-12-20 Thread Nuria Ruiz
> in Safari and Chrome with the default settings, there is a native browser
> feature that, when searching through the address bar (Google powered), by
> default silently starts loading the url of the top result shown below the
> address bar.
Ah, I see, you mean searches happening "outside" the application, just
through the search bar. To avoid confusion, let's call those "eagerly
loading" a page (to distinguish them from the link directives for
prefetching). If those are happening, would they be counted as pageviews?
Yes. Now, are they happening? That is a harder question to answer because,
in the absence of a header, you cannot distinguish those requests from
everything else. In any case, they do not happen deterministically for
everyone because they rely on the prediction system in Chrome; see
chrome://predictors/ [1]. The fact that you have to enable them in your
browser tells you usage is minimal, so is this a concern for pageview
numbers? I do not think so.

>According to https://stackoverflow.com/a/9852667/4746236
>Some browsers do send a header that could be detected for these prefetches.
Please note that the answer to that question is from 2012 and that it mixes
"pre-fetches [of] content (at the behest of the referrer page’s markup)"
(which are link directives) [2] with "eagerly loading" a page (done
through "network action predictions") [1]

[1]
https://www.igvita.com/posa/high-performance-networking-in-google-chrome/#predictor
[2] https://developer.mozilla.org/en-US/docs/Web/HTTP/Link_prefetching_FAQ
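For anyone who wants to experiment server-side, a minimal sketch of header-based prefetch detection might look like the following. The header/value pairs are assumptions drawn from browser behavior at various points in time, not an authoritative list, and, as noted above, Chrome's predictor-driven "eager loads" may send no distinguishing header at all, so absence of these headers proves nothing.

```python
# A sketch of server-side prefetch detection via request headers.
# ASSUMPTIONS: the pairs below ("X-Moz: prefetch", "Purpose: prefetch",
# "Sec-Purpose: prefetch") reflect browser behavior at various times and
# are not a guaranteed list; predictor-driven loads may send nothing.

PREFETCH_HINTS = {
    ("x-moz", "prefetch"),
    ("purpose", "prefetch"),
    ("sec-purpose", "prefetch"),
}

def looks_like_prefetch(headers):
    """True if any known prefetch-indicating header/value pair is present."""
    normalized = {k.lower(): v.strip().lower() for k, v in headers.items()}
    return any(normalized.get(name) == value for name, value in PREFETCH_HINTS)
```

At best this yields a lower bound on prefetch traffic, never a clean split.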


On Thu, Dec 20, 2018 at 6:41 AM Isaac Johnson  wrote:

> Yes, I checked Chrome/Mozilla as well and saw no evidence of prefetching
> for Wikipedia pages.
>
> Related point: it does appear that, for media counts, prefetching by
> the Media Viewer might change the counts:
> https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Mediacounts#Corner_cases
>
> On Thu, Dec 20, 2018 at 4:19 AM Addshore  wrote:
>
>> I just quickly tried this out live looking for a prefetch webrequest from
>> my chrome browser for a enwiki, enwiktionary and wikidata page but no
>> prefetching happened.
>>
>> After a quick search I found
>> https://www.technipages.com/google-chrome-prefetch and confirmed that
>> the prefetch setting is ON in my browser.
>>
>> Maybe prefetching doesn't happen for our sites? Or maybe something else
>> in my environment makes Chrome decide that it doesn't need to?
>>
>> According to https://stackoverflow.com/a/9852667/4746236
>> Some browsers do send a header that could be detected for these
>> prefetches.
>> But apparently Chrome is no longer one of those browsers,
>> https://bugs.chromium.org/p/chromium/issues/detail?id=86175
>>
>>
>> On Thu, 20 Dec 2018 at 02:58, Timo Tijhof  wrote:
>>
>>> Ugh, this wasn't meant to go on-list (obviously). Sorry!
>>>
>>> -- Timo
>>>
>>> On Wed, 19 Dec 2018 at 18:55, Timo Tijhof  wrote:
>>>
>>>> Hi Nuria,
>>>>
>>>> As I understand it, in Safari and Chrome with the default settings,
>>>> there is a native browser feature that, when searching through the address
>>>> bar (Google powered) by default silently starts loading the url of the top
>>>> result shown below the address bar. Maybe there's a way we opted out, but I
>>>> think it applies to Wikipedia as well.
>>>>
>>>> I'm replying privately because I didn't understand the last part of
>>>> your email, and maybe we are saying the same thing :)
>>>>
>>>> -- Timo
>>>>
>>>>
>>>>
>>>> On Wed, 19 Dec 2018 at 14:14, Nuria Ruiz  wrote:
>>>>
>>>>> > I think that's for the Page Previews feature (i.e., when a user
>>>>> hovers over a link on desktop Wikipedia) or
>>>>> > its corresponding feature in the the Wikipedia for Android
>>>>> (triggered by default on link tap)
>>>>> The code that Fran pointed to only discounts "previews" by Android app
>>>>> as we established that convention a while back. Page previews (hovers over
>>>>> wikipedia links that display a short popup) are not counted as pageviews
>>>>> at all at this time.
>>>>>
>>>>> >By "prefetching", I meant X's Wikipedia page shows up in the search
>>>>> results and the browser prefetches/preloads the search results but I do 
>>>>> not
>>>>> click on X's Wikipedia page. If so, the >pageview data seem to over-count
>>>>> the number of visits to 

Re: [Analytics] Does prefetch count as a pageview?

2018-12-19 Thread Nuria Ruiz
> I think that's for the Page Previews feature (i.e., when a user hovers
over a link on desktop Wikipedia) or
> its corresponding feature in the the Wikipedia for Android (triggered by
default on link tap)
The code that Fran pointed to only discounts "previews" by the Android app, as
we established that convention a while back. Page previews (hovers over
Wikipedia links that display a short popup) are not counted as pageviews at
all at this time.

>By "prefetching", I meant X's Wikipedia page shows up in the search
results and the browser prefetches/preloads the search results but I do not
click on X's Wikipedia page. If so, the >pageview data seem to over-count
the number of visits to X's Wikipedia page.
This functionality needs to be implemented by the client (it is not
automagically implemented by the browser) and it is not implemented on
Wikipedia. Searches trigger requests to the API, which return pageview URLs,
not pageview prefetches. You can follow these workflows in the dev tools of
whatever browser you are using. chrome://net-export/ is a new addition to
the toolset that gives you a readable dump.

>Are you saying that browser-based prefetch activity (e.g., with resource
hinting like https://www.w3.org/TR/resource-hints/ ) is also tagged the
same way?
No. Browser prefetches cannot be tagged; they are initiated by the browser.
Wikipedia's pages do not do any prefetches or pre-renderings of content
using those directives, as far as I can see. DNS prefetches are done for two
domains, login and meta, neither of which counts as a pageview because a DNS
prefetch does not receive an HTTP response; it just instantiates a
connection and resolves TLS, if any. Prefetches might be indicated by a
standard set of headers like "Link:" in the future (this would be initiated
by the browser), but that seems to be in the works.

Thanks,

Nuria


On Wed, Dec 19, 2018 at 9:43 AM Adam Baso  wrote:

> Fran, the preview to which you refer, I think that's for the Page Previews
> feature (i.e., when a user hovers over a link on desktop Wikipedia) or its
> corresponding feature in the the Wikipedia for Android (triggered by
> default on link tap) and Wikipedia for iOS (force press) apps, is that
> right? Are you saying that browser-based prefetch activity (e.g., with
> resource hinting like https://www.w3.org/TR/resource-hints/ ) is also
> tagged the same way?
>
> Chenqi Zhu, I think what you're suggesting is the possibility that
> browsers might be issuing HTTP prefetches for Wikimedia-hosted pages and
> that could inflate pageviews. I'm not sure, but have you happened to
> observe user agents making prefetches when resource hinting (
> https://www.w3.org/TR/resource-hints/ ) is absent? I'm not sure how
> often, if at all, discovery platforms like search engines are actually
> placing resource hints into markup (which is mostly deterministic as far as
> browser behavior) for Wikimedia content...nor to what degree there might be
> heuristics being used for prefetching independently of any resource hints.
> Do you have any data or field observations to help clarify?
>
> Browser settings allude to this sort of behavior (e.g., in Chrome, there's
> "Use a prediction service to load pages more quickly", which is described
> at https://support.google.com/chrome/answer/114836 ), although I think
> without digging into browser source code it's a bit hard to know for
> certain what's going on "under the hood". We do use preconnect and prefetch
> semantics and the like in different contexts (cf.
> https://phabricator.wikimedia.org/search/query/G2tr8i0YZii9 ,
> https://phabricator.wikimedia.org/search/query/.dtx_hqaj3wj , ...there
> may be more).
>
> -Adam
>
>
>
> On Wed, Dec 19, 2018 at 11:00 AM Francisco Dans 
> wrote:
>
>> Hi Chenqi,
>>
>> You can find out more about what constitutes a pageview in its Metawiki
>> article:
>>
>> https://meta.wikimedia.org/wiki/Research:Page_view#Definition
>>
>> As you can see, one of the conditions is whether the request being
>> examined is a preview or not, in which case it is not counted as a page
>> view. Hope this helps!
>>
>> Fran
>>
>> On Wed, Dec 19, 2018 at 5:30 PM Chenqi Zhu  wrote:
>>
>>> Hi everyone,
>>>
>>> I am trying to better understand the pageview data. I have a quick
>>> question. I apologize if the question has been asked or it is so naive.
>>>
>>> If the web browser prefetches a Wikipedia page, does it count as one
>>> pageview in the pageview data? By "prefetching", I meant X's Wikipedia page
>>> shows up in the search results and the browser prefetches/preloads the
>>> search results but I do not click on X's Wikipedia page. If so, the
>>> pageview data seem to over-count the number of visits to X's Wikipedia page.
>>>
>>> Thanks in advance for any insight.
>>>
>>>
>>> Chenqi Zhu
>>> New York University
>>> 44 W 4th St., Suite 10-185(B),
>>> New York, NY 10012, U.S.A.
>>> ___
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> 

Re: [Analytics] Superset going down for a few hours

2018-12-13 Thread Nuria Ruiz
Superset is back up (I should have said "going down for a few minutes"). We
have rolled back the upgrade in progress.

Thanks,

Nuria

On Thu, Dec 13, 2018 at 1:00 PM Nuria Ruiz  wrote:

> Team:
>
> Superset will be going down for a few hours today as we rollback the
> update we were trying to do. It turns out that the newest versions of
> superset are VERY non-backwards-compatible: they use Python 3.6, which is
> not available on our Debian distro, and they introduce a bunch of other
> bugs. We will be working on our fork from now on so we have a more stable
> basis for changes: https://github.com/wikimedia/incubator-superset
>
> More updates here: https://phabricator.wikimedia.org/T211605
>
> Thanks,
>
> Nuria
>


[Analytics] Superset going down for a few hours

2018-12-13 Thread Nuria Ruiz
Team:

Superset will be going down for a few hours today as we rollback the update
we were trying to do. It turns out that the newest versions of superset are
VERY non-backwards-compatible: they use Python 3.6, which is not available
on our Debian distro, and they introduce a bunch of other bugs. We will be
working on our fork from now on so we have a more stable basis for changes:
https://github.com/wikimedia/incubator-superset

More updates here: https://phabricator.wikimedia.org/T211605

Thanks,

Nuria


[Analytics] Wikistats2 - Metrics available for project families

2018-12-12 Thread Nuria Ruiz
Hello!

The Analytics team would like to announce that we now have, in Wikistats2,
metrics available for what we are calling (for lack of a better name)
"project families". That is, "all wikipedias", "all wikibooks", etc.

See, for example, bytes added by users to all wikibooks in the last month:
https://stats.wikimedia.org/v2/#/all-wikibooks-projects/content/net-bytes-difference/normal|bar|1-Month|editor_type~user

And "all wikibooks top editors" [1]:

https://stats.wikimedia.org/v2/#/all-wikibooks-projects/contributing/top-editors/normal|table|1-Month|editor_type~user

Not all metrics are available per project family; most notably, we do not
(yet) have pageviews. As always, please file bugs [2] if you find any, and
let us know what we can do better.

Thanks,

Nuria

[1] https://meta.wikimedia.org/wiki/Research:Wikistats_metrics/Top_editors
[2]
https://phabricator.wikimedia.org/maniphest/task/edit/?title=Wikistats%20Bug=Analytics-Wikistats,Analytics


Re: [Analytics] EventLogging Hive Refine currently stalled for some Schemas

2018-11-19 Thread Nuria Ruiz
Following up on this: data has been re-ingested into Druid, so Turnilo
should not show any holes related to this event.

Thanks,

Nuria

On Thu, Nov 15, 2018 at 8:45 AM Marcel Ruiz Forns 
wrote:

> Not all data sources are populated at the same time, the data on Druid is
>> ingested twice, once per hour and once daily looking 4 days back. Data
>> should appear once daily job runs for the "holes" missing.
>
> +1  The EL2Druid daily loading job will cover up the holes for the 12th
> and 13th in 1 or 2 days.
>
> On Thu, Nov 15, 2018 at 5:03 PM Nuria Ruiz  wrote:
>
>> Hello,
>>
>> Not all data sources are populated at the same time, the data on Druid is
>> ingested twice, once per hour and once daily looking 4 days back. Data
>> should appear once daily job runs for the "holes" missing.
>>
>> Thanks,
>>
>> Nuria
>>
>> On Thu, Nov 15, 2018 at 7:49 AM Andrew Otto  wrote:
>>
>>> > Does "fixed" mean that the missing data has already been backfilled? I’m
>>> seeing gaps (zero events) in Turnilo for Druid-ingested EL data, for the
>>> timespans between around 6am-16pm on November 13, and 7am-10am on November
>>> 12.
>>>
>>> Hm.  Fixed means the data has been refined to Hive.  I didn’t check on
>>> Druid EL data.  Marcel, can you check up on that?  This sounds related, we
>>> might need to figure out how to re run the Druid ingestion if/when base
>>> data partitions change.
>>>
>>>
>>>
>>> On Thu, Nov 15, 2018 at 10:17 AM Tilman Bayer 
>>> wrote:
>>>
>>>> Hi Andrew,
>>>>
>>>> PS, offtopic: reminder that data-analy...@wikimedia.org has been
>>>> deprecated/retired a while ago; our team can be reached at
>>>> product-analyt...@wikimedia.org (CCed)
>>>>
>>>> On Thu, Nov 15, 2018 at 7:13 AM Tilman Bayer 
>>>> wrote:
>>>>
>>>>> Does "fixed" mean that the missing data has already been backfilled? I'm
>>>>> seeing gaps (zero events) in Turnilo for Druid-ingested EL data, for the
>>>>> timespans between around 6am-16pm on November 13, and 7am-10am on November
>>>>> 12.
>>>>>
>>>>> On Thu, Nov 15, 2018 at 6:51 AM Andrew Otto 
>>>>> wrote:
>>>>>
>>>>>> OH I’m sorry!  There is a Phab task, and this is fixed.
>>>>>> https://phabricator.wikimedia.org/T209407
>>>>>>
>>>>>> Very sorry, I should have updated and linked the task.
>>>>>>
>>>>>> On Thu, Nov 15, 2018 at 9:50 AM Gilles Dubuc 
>>>>>> wrote:
>>>>>>
>>>>>>> Is this issue still ongoing? Is there a corresponding Phabricator
>>>>>>> task to follow?
>>>>>>>
>>>>>>> On Tue, Nov 13, 2018 at 6:27 PM Andrew Otto 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> Yesterday we upgraded the Hadoop cluster to a newer version.  It
>>>>>>>> seems that along the way the job that imports EventLogging data into 
>>>>>>>> Hive
>>>>>>>> has started failing for some EventLogging datasets.  I’m still
>>>>>>>> investigating, but it seems that tables (I believe ones that need to 
>>>>>>>> have
>>>>>>>> their schema modified during refinement) are failing refinement.  The 
>>>>>>>> list
>>>>>>>> of affected tables at this moment is large, and will likely only grow 
>>>>>>>> as
>>>>>>>> time goes on.
>>>>>>>>
>>>>>>>> We’re looking into this and will try to have things fixed ASAP.
>>>>>>>> For now you might expect Hive EventLogging data to be delayed.
>>>>>>>>
>>>>>>>> Sorry for the trouble, and we’ll keep you updated!
>>>>>>>> -Andrew Otto
>>>>>>>>  Systems Engineer, WMF
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> The list of currently affected tables is:
>>>>>>>>
>>>>>>>> AdvancedSearchRequest
>>>>>>>> CentralAuth
>>>>>>>> CentralNoticeBannerHistory

Re: [Analytics] EventLogging Hive Refine currently stalled for some Schemas

2018-11-15 Thread Nuria Ruiz
Hello,

Not all data sources are populated at the same time; the data on Druid is
ingested twice, once per hour and once daily looking 4 days back. Data
should appear once the daily job runs for the missing "holes".

Thanks,

Nuria

On Thu, Nov 15, 2018 at 7:49 AM Andrew Otto  wrote:

> > Does "fixed" mean that the missing data has already been backfilled? I’m
> seeing gaps (zero events) in Turnilo for Druid-ingested EL data, for the
> timespans between around 6am-16pm on November 13, and 7am-10am on November
> 12.
>
> Hm.  Fixed means the data has been refined to Hive.  I didn’t check on
> Druid EL data.  Marcel, can you check up on that?  This sounds related, we
> might need to figure out how to re-run the Druid ingestion if/when base
> data partitions change.
>
>
>
> On Thu, Nov 15, 2018 at 10:17 AM Tilman Bayer 
> wrote:
>
>> Hi Andrew,
>>
>> PS, offtopic: reminder that data-analy...@wikimedia.org has been
>> deprecated/retired a while ago; our team can be reached at
>> product-analyt...@wikimedia.org (CCed)
>>
>> On Thu, Nov 15, 2018 at 7:13 AM Tilman Bayer 
>> wrote:
>>
>>> Does "fixed" mean that the missing data has already been backfilled? I'm
>>> seeing gaps (zero events) in Turnilo for Druid-ingested EL data, for the
>>> timespans between around 6am-16pm on November 13, and 7am-10am on November
>>> 12.
>>>
>>> On Thu, Nov 15, 2018 at 6:51 AM Andrew Otto  wrote:
>>>
 OH I’m sorry!  There is a Phab task, and this is fixed.
 https://phabricator.wikimedia.org/T209407

 Very sorry, I should have updated and linked the task.

 On Thu, Nov 15, 2018 at 9:50 AM Gilles Dubuc 
 wrote:

> Is this issue still ongoing? Is there a corresponding Phabricator task
> to follow?
>
> On Tue, Nov 13, 2018 at 6:27 PM Andrew Otto 
> wrote:
>
>> Hi all,
>>
>> Yesterday we upgraded the Hadoop cluster to a newer version.  It
>> seems that along the way the job that imports EventLogging data into Hive
>> has started failing for some EventLogging datasets.  I’m still
>> investigating, but it seems that tables (I believe ones that need to have
>> their schema modified during refinement) are failing refinement.  The 
>> list
>> of affected tables at this moment is large, and will likely only grow as
>> time goes on.
>>
>> We’re looking into this and will try to have things fixed ASAP.  For
>> now you might expect Hive EventLogging data to be delayed.
>>
>> Sorry for the trouble, and we’ll keep you updated!
>> -Andrew Otto
>>  Systems Engineer, WMF
>>
>>
>>
>> The list of currently affected tables is:
>>
>> AdvancedSearchRequest
>> CentralAuth
>> CentralNoticeBannerHistory
>> CentralNoticeImpression
>> CentralNoticeTiming
>> ChangesListFilterGrouping
>> ChangesListFilters
>> CitationUsage
>> CitationUsagePageLoad
>> ContentTranslation
>> ContentTranslationAbuseFilter
>> ContentTranslationCTA
>> ContentTranslationError
>> ContentTranslationSuggestion
>> CpuBenchmark
>> EchoInteraction
>> EchoMail
>> EditAttemptStep
>> EditConflict
>> EditorActivation
>> EUCCVisit
>> EventError
>> FlowReplies
>> GettingStartedRedirectImpression
>> GuidedTourButtonClick
>> GuidedTourExited
>> GuidedTourExternalLinkActivation
>> GuidedTourGuiderHidden
>> GuidedTourGuiderImpression
>> LandingPageImpression
>> MediaViewer
>> MediaWikiPingback
>> MobileAppCategorizationAttempts
>> MobileAppUploadAttempts
>> MobileWebMainMenuClickTracking
>> MobileWebSearch
>> MobileWebUIClickTracking
>> MobileWikiAppAppearanceSettings
>> MobileWikiAppArticleSuggestions
>> MobileWikiAppCreateAccount
>> MobileWikiAppDailyStats
>> MobileWikiAppEdit
>> MobileWikiAppFeed
>> MobileWikiAppFeedConfigure
>> MobileWikiAppFindInPage
>> MobileWikiAppInstallReferrer
>> MobileWikiAppIntents
>> MobileWikiAppiOSFeed
>> MobileWikiAppiOSLoginAction
>> MobileWikiAppiOSReadingLists
>> MobileWikiAppiOSSearch
>> MobileWikiAppiOSSessions
>> MobileWikiAppiOSSettingAction
>> MobileWikiAppiOSUserHistory
>> MobileWikiAppLangSelect
>> MobileWikiAppLanguageSearching
>> MobileWikiAppLanguageSettings
>> MobileWikiAppLinkPreview
>> MobileWikiAppLogin
>> MobileWikiAppMediaGallery
>> MobileWikiAppNavMenu
>> MobileWikiAppNotificationInteraction
>> MobileWikiAppNotificationPreferences
>> MobileWikiAppOfflineLibrary
>> MobileWikiAppOnboarding
>> MobileWikiAppOnThisDay
>> MobileWikiAppPageScroll
>> MobileWikiAppProtectedEditAttempt
>> MobileWikiAppRandomizer
>> MobileWikiAppReadingLists
>> MobileWikiAppSavedPages
>> MobileWikiAppSearch
>> MobileWikiAppSessions
>> MobileWikiAppShareAFact
>> MobileWikiAppStuffHappens
>> MobileWikiAppTabs
>> MobileWikiAppToCInteraction
>> 

Re: [Analytics] Pageviews by agent for May 18-21 2015

2018-11-13 Thread Nuria Ruiz
Hello,

> One question we have is whether the pageviews we observe are driven by
bots and spiders. We know that the wikimedia rest api provides this
information going back to July 1 2015.
Please bear in mind that these are only self-identified bots; there is
probably about 1-5% of bot pageview traffic that gets wrongly labeled as
"user". A project is underway to better label this traffic as coming from
bots.
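As an illustration, here is a rough sketch of pulling the agent breakdown from the public pageview API and estimating the self-identified spider share. Note the breakdown only goes back to 2015-07-01, so the May 2015 block dates themselves are out of range; the endpoint layout and the "items"/"views" field names follow the public pageview API, but treat them as assumptions and verify against a live response.

```python
import json
from urllib.request import Request, urlopen

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate"

def daily_views_url(project, agent, start, end):
    """Aggregate pageviews for one agent type ('user', 'spider', ...)."""
    return f"{BASE}/{project}/all-access/{agent}/daily/{start}/{end}"

def fetch_views(url):
    """Total the 'views' field over the response's 'items' list."""
    req = Request(url, headers={"User-Agent": "research-script/0.1"})
    with urlopen(req) as resp:
        return sum(item["views"] for item in json.load(resp)["items"])

def spider_share(user_views, spider_views):
    """Fraction of user+spider traffic from self-identified spiders."""
    total = user_views + spider_views
    return spider_views / total if total else 0.0

# Usage sketch (live network call, so not executed here):
# u = fetch_views(daily_views_url("zh.wikipedia", "user", "2015070100", "2015070700"))
# s = fetch_views(daily_views_url("zh.wikipedia", "spider", "2015070100", "2015070700"))
# print(f"spider share: {spider_share(u, s):.1%}")
```

Given the mislabeling caveat above, the computed share is a lower bound on actual bot traffic.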




On Tue, Nov 13, 2018 at 6:41 AM Jennifer Pan  wrote:

> Hi there,
>
>
> I'm an assistant professor in the Department of Communication at Stanford.
> My co-author, Molly Roberts (Political Science, UCSD), and I are working on
> a paper examining the effect of China's 2015 block of Chinese language
> wikipedia on pageviews, which builds on our previous work on censorship in
> China.
>
> We are using the block to conduct a interrupted time series design to
> measure the effect of censorship on Chinese users. Our main finding is that
> Chinese users were using Wikipedia to browse (starting at the home page),
> and the block influenced users' ability to explore and encounter unexpected
> information. One question we have is whether the pageviews we observe are
> driven by bots and spiders. We know that the wikimedia rest api provides
> this information going back to July 1 2015. Since the China block of
> Wikipedia was on May 19, 2015, we are wondering if there is pageview data
> by agent type for zh.wikipedia.org pages (all or some subset like most
> popular) going back to May 2015 (specifically May 18-21, 2015)? From
> https://meta.wikimedia.org/wiki/Research:Timeline_of_Wikimedia_analytics,
> it says that pageview data is available in bulk starting on May 1, 2015,
> so we thought maybe there was some chance this data exists.
>
> Any suggestions would be greatly appreciated, and if this is not possible,
> please let us know.
>
> Thank you!
> Jennifer Pan
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>


Re: [Analytics] Wiktionary word page views?

2018-10-23 Thread Nuria Ruiz
The pageview API has that data as long as "individual words" are considered
"articles". See sample query:

https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wiktionary/all-access/all-agents/table/daily/2017100100/2017103100

Docs: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews
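A small sketch of building such per-article queries programmatically. The path layout follows the sample URL above (the word itself is the article title); percent-encoding the word is an extra precaution for entries with spaces or non-ASCII characters.

```python
from urllib.parse import quote

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def wiktionary_views_url(word, start, end, project="en.wiktionary",
                         access="all-access", agent="all-agents",
                         granularity="daily"):
    """Per-article pageview query for a Wiktionary entry (the word IS the
    article title; percent-encoding covers spaces and non-ASCII entries)."""
    return (f"{BASE}/{project}/{access}/{agent}/"
            f"{quote(word, safe='')}/{granularity}/{start}/{end}")

# Views of the entry "table" in October 2017, as in the sample query:
print(wiktionary_views_url("table", "2017100100", "2017103100"))
```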

On Tue, Oct 23, 2018 at 3:43 PM Goran Milovanovic <
goran.milovanovic_...@wikimedia.de> wrote:

> @James Salsman I am not sure if we have a tool somewhere designed
> specifically for that purpose, but you can get many important statistics on
> Wiktionary from http://wdcm.wmflabs.org/Wiktionary_CognateDashboard/
>
> Regards,
> Goran
>
> Goran S. Milovanović, PhD
> Data Scientist, Software Department
> Wikimedia Deutschland
>
> 
> "It's not the size of the dog in the fight,
> it's the size of the fight in the dog."
> - Mark Twain
> 
>
>
> On Wed, Oct 24, 2018 at 12:34 AM James Salsman  wrote:
>
>> How can I get pageview statistics for individual words in the English
>> Wiktionary?
>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>


Re: [Analytics] Academic paper of Wikimedia' statistics v2?

2018-10-23 Thread Nuria Ruiz
Abel,

If you are talking about http://stats.wikimedia.org/v2, the metric
definitions have not changed from the (now-called) "legacy Wikistats 1"
(http://stats.wikimedia.org). In the V2 system, metrics are surfaced over a
new UI and also new APIs, so they are available programmatically.

Some docs:
Metric definition:
https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2/Metrics_Definition
Vetting of metric calculations using the "legacy" metrics as a baseline:
https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2/Data_Quality
The edit metrics API:
https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2
The UI: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Wikistats_2

Thanks,

Nuria


On Tue, Oct 23, 2018 at 5:08 AM ABEL SERRANO JUSTE  wrote:

> Hello!
>
> Is there any academic paper published about Wikimedia' statistics v2?
>
> Thank you.
> --
> Saludos,
> Abel.
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>


Re: [Analytics] Community health metrics kit: Input needed!

2018-10-22 Thread Nuria Ruiz
This seems like a start toward a way to convey "community health" in terms
anyone can grasp:
https://meta.m.wikimedia.org/wiki/Grants:IdeaLab/Health_rating_radio_button_template_on_talk_pages

On Mon, Oct 22, 2018 at 4:10 AM ABEL SERRANO JUSTE  wrote:

> Thank you for opening the discussion. In our research group
> , we are working
> specifically on this.
>
> You can get a lot of good ideas and pointers to other research from the 
> Wikimedia
> research page  and
> from the last Inspire Campaign of Wikimedia, which was precisely about
> Measuring editing community health:
> https://meta.m.wikimedia.org/wiki/Grants:IdeaLab/Inspire
>
> Thank you also Marc for sharing your ideas, they are very interesting. We
> have already been working with inequality metrics
> 
> .
>
> El vie., 19 oct. 2018 a las 16:35, Marc Miquel ()
> escribió:
>
>> Hi Joe,
>>
>> I think this project is fundamental. I'm glad you are working on it.
>>
>> I have researched this topic in my PhD thesis and I went through a review
>> of the online communities engagement literature.
>>
>> Few ideas for metrics:
>> - Contributions inequality measurements (gini coefficients as a start).
>> - Multilingual editors contributions (to see whether they see Wikipedia
>> as a global project or prefer to focus on one language).
>> - Core-periphery social interactions (admins-newbies, in order to detect
>> communities more prone to mentoring)
>> - Rate of newbies completing the first article, rate of newbies
>> completing the first translation, etc.
>> - Recency measures for newbies (different measures on editor retention).
>> - Community/functional roles renewal (admin, autopatrolled, etc. to see
>> how good a community is at renewing its core along the years).
>>
>> I'd be happy to further discuss the topic. At your disposal.
>> Best regards,
>>
>> Marc Miquel
>>
>>
>> El dv., 5 d’oct. 2018 a les 23:29, Joe Sutherland (<
>> jsutherl...@wikimedia.org>) va escriure:
>>
>>> Hello everyone - apologies for cross-posting! *TL;DR*: We would like
>>> your feedback on our Metrics Kit project. Please have a look and comment on
>>> Meta-Wiki:
>>> https://meta.wikimedia.org/wiki/Community_health_initiative/Metrics_kit
>>>
>>>
>>> The Wikimedia Foundation's Trust and Safety team, in collaboration with
>>> the Community Health Initiative, is working on a Metrics Kit designed to
>>> measure the relative "health"[1] of various communities that make up the
>>> Wikimedia movement:
>>> https://meta.wikimedia.org/wiki/Community_health_initiative/Metrics_kit
>>>
>>> The ultimate outcome will be a public suite of statistics and data
>>> looking at various aspects of Wikimedia project communities. This could be
>>> used by both community members to make decisions on their community
>>> direction and Wikimedia Foundation staff to point anti-harassment tool
>>> development in the right direction.
>>>
>>> We have a set of metrics we are thinking about including in the kit,
>>> ranging from the ratio of active users to active administrators,
>>> administrator confidence levels, and off-wiki factors such as freedom to
>>> participate. It's ambitious, and our methods of collecting such data will
>>> vary.
>>>
>>> Right now, we'd like to know:
>>> * Which metrics make sense to collect? Which don't? What are we missing?
>>> * Where would such a tool ideally be hosted? Where would you normally
>>> look for statistics like these?
>>> * We are aware of the overlap in scope between this and Wikistats <
>>> https://stats.wikimedia.org/v2/#/all-projects> — how might these tools
>>> coexist?
>>>
>>> Your opinions will help to guide this project going forward. We'll be
>>> reaching out at different stages of this project, so if you're interested
>>> in direct messaging going forward, please feel free to indicate your
>>> interest by signing up on the consultation page.
>>>
>>> Looking forward to reading your thoughts.
>>>
>>> best,
>>> Joe
>>>
>>> P.S.: Please feel free to CC me in conversations that might happen on
>>> this list!
>>>
>>> [1] What do we mean by "health"? There is no standard definition of what
>>> makes a Wikimedia community "healthy", but there are many indicators that
>>> highlight where a wiki is doing well, and where it could improve. This
>>> project aims to provide a variety of useful data points that will inform
>>> community decisions that will benefit from objective data.
>>>
>>> --
>>> *Joe Sutherland* (he/him or they/them)
>>> Trust and Safety Specialist
>>> Wikimedia Foundation
>>> joesutherland.rocks
>>> ___
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> 

[Analytics] New reports in wikistats2: "top editors" (a.k.a most prolific contributors) and "top edited articles"

2018-10-11 Thread Nuria Ruiz
Hello,

The analytics team would like to announce two new metrics available in
wikistats2:

1. Top editors (a.k.a most prolific contributors)
See example for Italian wikipedia:

https://stats.wikimedia.org/v2/#/it.wikipedia.org/contributing/top-editors/normal|table|1-Month|~total

2. Top edited articles (pages with most edits, not most contributors):
Again, example for Italian wikipedia:

https://stats.wikimedia.org/v2/#/it.wikipedia.org/contributing/top-edited-pages/normal|table|1-Month|~total

Please take a look and, as always, send feedback via phab or irc
(#wikimedia-analytics)

The tasks we have on our radar for wikistats2 are metrics "per family",
that is, "edits for all wiktionary projects" or "unique devices for all
wikipedias".

Some of these metrics are already available in the API. See, for example,
the daily number of edits for all wiktionary.org projects for August 2018,
made by registered users on articles (pages in the content namespace):

https://wikimedia.org/api/rest_v1/metrics/edits/aggregate/all-wiktionary-projects/user/content/daily/20180801/20180901

More info about edit data apis here:
https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2
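A brief sketch of using this edits API programmatically. The URL builder reproduces the sample query's path layout; the "edits" field name in the response helper is an assumption about the response shape, so verify it against a live response before relying on it.

```python
BASE = "https://wikimedia.org/api/rest_v1/metrics/edits/aggregate"

def edits_url(project, editor_type, page_type, granularity, start, end):
    """Build an edits-aggregate query following the sample URL's layout."""
    return f"{BASE}/{project}/{editor_type}/{page_type}/{granularity}/{start}/{end}"

def total_edits(results):
    """Sum per-interval counts; the 'edits' field name is an assumption
    about the response shape - check a live response before relying on it."""
    return sum(point["edits"] for point in results)

# Reproduces the sample query from the announcement above:
print(edits_url("all-wiktionary-projects", "user", "content",
                "daily", "20180801", "20180901"))
```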

Thanks,

Nuria


Re: [Analytics] When is the new pages API updated?

2018-10-10 Thread Nuria Ruiz
>Wikistats 1 generates data on content pages with a delay of 10-15 days
after the end of the month
This is true for full snapshots (for the reasons we have discussed before
and that Dan has described on this thread). You can expect data to be
available on the API soon after the 10th, but it is unlikely that it will
be there before the 10th as we do not start the process until the 5th.

Now, data - as you know - is streamed in real time, every second. So it is
only the full reconstruction of events, the full snapshot, that takes
several days to build. Have you looked into using the real-time events when
the next month's snapshot is not yet available?
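For reference, a minimal sketch of tapping the real-time events via the public EventStreams server-sent-events endpoint. The stream URL is the public recentchange stream; the "wiki"/"type"/"title" field names are assumptions based on that stream's published schema, so check a few live events before building on them.

```python
import json

# Public EventStreams SSE endpoint (server-sent events).
STREAM_URL = "https://stream.wikimedia.org/v2/stream/recentchange"

def parse_sse_data(line):
    """Return the decoded JSON payload of an SSE 'data:' line, else None."""
    if line.startswith("data: "):
        return json.loads(line[len("data: "):])
    return None

def is_new_content_page(event, wiki="commonswiki"):
    # In the recentchange schema, type == "new" marks page creations
    # (field names assumed from the published stream schema).
    return event.get("wiki") == wiki and event.get("type") == "new"

# Usage sketch (live network, so not executed here):
# from urllib.request import urlopen
# for raw in urlopen(STREAM_URL):
#     event = parse_sse_data(raw.decode("utf-8").rstrip("\n"))
#     if event and is_new_content_page(event):
#         print("new page:", event.get("title"))
```

Counting creations this way gives an interim number while the monthly snapshot is still being built.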


On Wed, Oct 10, 2018 at 7:48 PM Dan Andreescu 
wrote:

> It should be updated soon, the jobs are all done successfully.  But
> currently we do expect this kind of lag, I'll explain why.
>
> When we started we were sqooping at the beginning of the month and the
> processing takes something like 4 days total, most of it sqooping.  But
> this put too much load on the database servers too close to the beginning of
> the month when a bunch of other stuff is running.  So we had to move it
> back to the 5th of the month [1].  Add 4 days onto that and we end up
> finishing around the 9th of the month.  We don't like this at all and we're
> trying to figure out a better way to import the data incrementally so we
> can just start processing when we have all of it.  It's unfortunate but we
> couldn't foresee the infrastructure limitation, too much was up in the air
> about even where we would sqoop from when we started this work.  Joseph and
> I have a weekly meeting to discuss moving towards a more incremental
> approach, and this task is the parent task to watch for now:
> https://phabricator.wikimedia.org/T193650 (priority is low because we
> have too many other commitments, but it's something I'd love to see before
> we call wikistats 2 "production" quality)
>
> [1]
> https://github.com/wikimedia/puppet/blob/28b78985d3612a6e19720be1fe8eef5f0dfc2ed7/modules/profile/manifests/analytics/refinery/job/sqoop_mediawiki.pp#L43
>
> On Wed, Oct 10, 2018 at 10:00 PM Neil Patel Quinn 
> wrote:
>
>> Hey there!
>>
>> I just wrote a script that fetches data from the AQS new pages endpoint
>> 
>> in order to prepare our monthly health metrics (T199459
>> ).
>>
>> However, it seems like that endpoint doesn't yet have monthly data for
>> September. For example, a query for Commons with a start of July 1 and
>> an end of October 1
>> 
>> returns only data for July and August. What's the schedule for updating
>> this data?
>>
>> To be honest, I feel pretty frustrated by this. Wikistats 1 generates
>> data on content pages with a delay of 10-15 days after the end of the
>> month, which has made it difficult for us to provide timely metrics to
>> executives and the board. I had assumed (to a degree that I didn't even
>> check) that by switching to this API, we would instead only have to deal
>> with the delay in generating the mediawiki_history snapshot (5-7 days after
>> the end of the month). But that doesn't seem to be the case.
>> --
>> Neil Patel Quinn 
>> (he/him/his)
>> product analyst, Wikimedia Foundation


[Analytics] Wikistats2 Better maps and new metric: Legacy Pageviews (a.k.a Pagecounts)

2018-07-11 Thread Nuria Ruiz
Hello!

Just a brief note to announce that we have two new things in Wikistats 2
this quarter. We have reworked the maps and we now report more precise
pageviews per country.

Check, for example, pageviews for the Portuguese Wikipedia around the world
for last month:

https://stats.wikimedia.org/v2/#/pt.wikipedia.org/reading/page-views-by-country/normal|map|2-Year~2016060100~2018071100|~total

Also, we have included legacy pageviews in the UI. We used to call these
"pagecounts", and prior to June 2015 this is the metric that we reported as
pageviews for all Wikimedia sites.

See, for example, pagecounts for the Portuguese Wikipedia from 2008 to 2016:

https://stats.wikimedia.org/v2/#/pt.wikipedia.org/reading/legacy-page-views/normal|bar|All~1980010100~2018071100|~total


Info about the metric:
https://wikitech.wikimedia.org/wiki/Analytics/Archive/Data/Pagecounts-raw

Also, all urls are now bookmarkable.

As always, suggestions are welcome; please file bug reports on Phabricator.

Thanks,

Nuria


Re: [Analytics] most popular articles per country

2018-07-09 Thread Nuria Ruiz
Amir:

FYI that this data has a couple of caveats:

1) The "-" is pageviews for a page for which we cannot extract a title.

2) Data is very much affected by bot spikes (you can mitigate that by
filtering by agent_type="user", but still, a significant portion of bot
traffic is not labeled as such).
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly#Changes_and_known_problems_since_2015-06-16

3) There are privacy considerations when the number of views is small:
https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews/Pageviews_by_country#Is_Pageviews_by_country_privacy_sensitive


>Is anything like this already published anywhere? If it isn't, it may be
nice to publish such a thing, similarly to Google Zeitgeist.
We do not have immediate plans to do so due to privacy considerations. Now,
Dario's team has a project in this regard that might produce datasets to be
published this year:
https://meta.wikimedia.org/wiki/Research:Quantifying_the_global_attention_to_public_health_threats_through_Wikipedia_pageview_data

See also:
https://phabricator.wikimedia.org/T189339

Thanks,

Nuria

On Mon, Jul 9, 2018 at 5:41 AM, Amir E. Aharoni <
amir.ahar...@mail.huji.ac.il> wrote:

> Thanks. Another question: For some countries, the result is "-", for
> example Germany:
>
> Germany | - | en.wikipedia | 1275634
>
> Any idea why?
>
> (I modified the query a bit and added the "project" column. And yes, the
> fact that en.wikipedia is at the top in Germany is also quite odd.)
>
>
> --
> Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
> http://aharoni.wordpress.com
> ‪“We're living in pieces,
> I want to live in peace.” – T. Moore‬
>
> 2018-07-09 15:17 GMT+03:00 Francisco Dans :
>
>> I think as long as you put in a filter so that the minimum pageviews is
>> maybe 1000, you should be fine privacy wise. I can't speak too much to your
>> second question.
>>
>> On Mon, Jul 9, 2018 at 1:59 PM, Amir E. Aharoni <
>> amir.ahar...@mail.huji.ac.il> wrote:
>>
>>> Thank you so much! In many countries it's
>>>
>>> A couple of questions:
>>> 1. Are any of the results of this query private? Or can I talk about
>>> them to people?
>>> 2. Is anything like this already published anywhere? If it isn't, it may
>>> be nice to publish such a thing, similarly to Google Zeitgeist.
>>>
>>>
>>> --
>>> Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
>>> http://aharoni.wordpress.com
>>> ‪“We're living in pieces,
>>> I want to live in peace.” – T. Moore‬
>>>
>>> 2018-07-09 13:19 GMT+03:00 Francisco Dans :
>>>
 Hi Amir,

 As Tilman has suggested, your best bet is to query the pageview_hourly
 table. I was going to be lazy and give you a query to just find out the
 most viewed article for a given country, but then I made a few experiments
 and this is the query I came up with to generate a list of countries and
 their respective most viewed articles and view counts. It takes a few
 minutes to run for a single day, so I'm sure someone here could suggest a
 better approach.

 WITH articles_countries AS (
>   SELECT country, page_title, SUM(view_count) AS views
>   FROM pageview_hourly
>   WHERE year = 2018 AND month = 3 AND day = 15
>   GROUP BY country, page_title
> )
> SELECT s.country AS country, s.page_title AS page_title, s.views AS views
> FROM (
>   SELECT MAX(named_struct('views', views, 'country', country,
>              'page_title', page_title)) AS s
>   FROM articles_countries
>   GROUP BY country
> ) t;


 Cheers / see you in ZA,
 Fran


 On Mon, Jul 9, 2018 at 10:18 AM, Amir E. Aharoni <
 amir.ahar...@mail.huji.ac.il> wrote:

> Hi,
>
> Is there a way to find what are the most popular articles per country?
>
> Finding the most popular articles per language is easy with the
> Pageviews tool, but languages and countries are of course not the same.
>
> One thing I tried is going to Turnilo, webrequest_sampled_128, and
> filtering by country. But here it gets troublesome:
> * Splitting can be done by Uri host, which is *more or less* the
> project, or by Uri path, which is *more or less* the article (but see
> below), and I couldn't find a convenient way to combine them.
> * Mobile (.m.) and desktop hosts are separate. It may actually
> sometimes be useful to see differences (or lack thereof) between desktop
> and mobile, but combining them is often useful, too. This can probably be
> done with regular expressions, but this brings us to the biggest problem:
> * Filtering by Uri path would be useful if it didn't have so many
> paths for images, beacons, etc. Filtering using the regular expression
> "\/wiki\/.+" may be the right thing functionally, but in practice it's 
> very
> slow or doesn't work at all.
> * I don't know what exactly is logged in webrequest_sampled_128, but
> the name hints that it doesn't include 

[Analytics] Backfilling some eventlogging data on hadoop

2018-07-06 Thread Nuria Ruiz
Hello:

An FYI that we are rerunning some of our jobs to backfill some EventLogging
data on Hadoop. The job should take about a day. The affected schemas are
listed on the ticket:

https://phabricator.wikimedia.org/T198906

Thanks,

Nuria


Re: [Analytics] EventLogging MariaDB indexes

2018-05-27 Thread Nuria Ruiz
You can open a ticket and either our team or the DBAs might be able to do
it. Best might be looking at the data in Hadoop, where you can query large
amounts of it more easily.

EventLogging data can be found in the "events" db on Hive.

Thanks,

Nuria




On Fri, May 25, 2018 at 11:22 AM Gilles Dubuc  wrote:

> Hi,
>
> I see that some EventLogging tables have custom indexes. What's the
> process to get indexes added to a couple of schemas I need extra DB indexes
> for?
>
> The "research" user on the analytics slave doesn't have ALTER rights and I
> couldn't find any documentation about that topic.


Re: [Analytics] Content of wmf.wdqs_extract

2018-05-08 Thread Nuria Ruiz
Adrian:

Please note that this table might disappear soon, as the research it was
created for has finished. Also, we will (hopefully) be rolling out next
quarter similar tables that split our large dataset into smaller ones. That
work is still WIP.

Thanks,

Nuria


On Tue, May 8, 2018 at 12:22 AM, Leila Zia  wrote:

> A couple of pointers as Stas was not involved in the details of the
> extraction.
>
> Adrian: you can dig the history behind the extraction at
> https://phabricator.wikimedia.org/T146064
>
> Please also check the code at https://gerrit.wikimedia.org/r/#/c/311964/
> for details, specifically wdqs_extract.hql .
>
> Best,
> Leila
>
>
>
> On Mon, May 7, 2018, 18:15 Andrew Otto  wrote:
>
>> CCing Stas, he might know more.
>>
>> On Sun, May 6, 2018 at 9:58 AM, Adrian Bielefeldt <
>> adrian.bielefe...@mailbox.tu-dresden.de> wrote:
>>
>>> Hello everyone,
>>>
>>> I wanted to ask if anyone can tell me what wmf.wdqs_extract contains. I
>>> know generally that it is the query log of the SPARQL endpoint. However,
>>> I do not know if it is all requests, only uncached requests etc.
>>>
>>> If anyone knows or knows where I can read up on it that would be great.
>>>
>>> Greetings,
>>>
>>> Adrian
>>>
>>>


[Analytics] Wikistats Data Outage issues

2018-04-23 Thread Nuria Ruiz
Hello!

We are investigating a recent outage affecting data in Wikistats.
We will report more as our understanding of the issue progresses.


Thanks,

Nuria


Re: [Analytics] How to get the traces of requests to the Wikipedia site in each web server

2018-04-18 Thread Nuria Ruiz
> Is there any download link available for  the  *webrequest *datasets ?
No, sorry, there is no download of webrequest data nor is it kept long
term.

As I mentioned before, the best dataset that might fit your needs is this
one: https://analytics.wikimedia.org/datasets/archive/public-datasets/analytics/caching/
which is a different dataset than webrequest and does not include the same
fields, just a subset.



On Wed, Apr 18, 2018 at 8:25 AM, Ta-Yuan Hsu  wrote:

> Hi, Nuria:
>
>   I reviewed the closest data to what I am looking for, phabricator
> T128132, from https://analytics.wikimedia.org/datasets/archive/public-datasets/analytics/caching/
> and the *webrequest* datasets: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest.
> I still have a few questions.
>
> 1. Is `hashed_host_path' (in the cache dataset) the `hostname' or
> `uri_host'?  Phabricator T128132 shows the two fields. However, the
> available data only shows `hashed_host_path'.
>
> 2. There are 6 fields - hashed_host_path, uri_query,
> content_type, response_size, time_firstbyte, and x_cache - in the caching
> dataset, as shown in the attachment screen snapshot.
>  Does the caching dataset not include page_id?   The *webrequest* dataset
> seems to contain page_id.
> 3. I didn't find the sequence field in the caching dataset. I learned
> that sequence replaces the timestamp. Is `sequence' the file name of
> downloads in the caching dataset?
> 4. Does `dt' (in the *webrequest* dataset) mean a timestamp in ISO
> 8601 format? Probably, the *webrequest* dataset might be what I am
> looking for, if it can provide access traces per second.
>
> 5. According the the descriptions in  the *webrequest* webpage, the  
> *webrequest
> *datasets should contain at least `hostname', `page_id', and `dt'. If
> true,  the  *webrequest *datasets  seem to cover most of my requirements.
> Is there any download link available for  the  *webrequest *datasets ?
>
> --
> Sincerely,
> TA-YUAN
>


Re: [Analytics] Licensing for screenshots of pageviews data

2018-04-13 Thread Nuria Ruiz
My 2 cents:

Data on the pageviews endpoint is available under:
https://creativecommons.org/publicdomain/zero/1.0/ (you need to expand each
endpoint to see this; sorry, that UX could be better).

You can add to the pageview tool a note about the licensing of the features
it provides. For example, see the footer on Wikistats:

"All data, charts, and other content is available under the Creative
Commons CC0 dedication."


Thanks,

Nuria

On Fri, Apr 13, 2018 at 11:48 AM, Leon Ziemba 
wrote:

> Hello Analytics!
>
> I have a licensing question. If someone were to share a screenshot of the 
> Pageviews
> Analysis  tool (or similar), are
> they bound by the REST API licenses described at
> https://wikimedia.org/api/rest_v1/ ?
>
> I assume so but wanted to make sure. I was prompted with this question
> because a user was considering using screenshots in a scholarly article.
> They wish to publish it under a CC-BY license.
>
> Please tell me if this statement is accurate:
>
>
> *The underlying pageviews data shown in Pageviews Analysis is provided by
> the Wikimedia RESTBase API , released
> under the CC-BY-SA 3.0 
> and GFDL  licenses. Any use of this
> data, including screenshots, is bound by these licenses, and you
> irrevocably agree to release modifications or additions under these
> licenses.*
>
>
> Thanks!
>
> ~Leon
>


Re: [Analytics] [Research-Internal] Spark2 upgraded to Spark 2.3.0, Spark 1 on the way out

2018-04-10 Thread Nuria Ruiz
FYI that this is happening today. Users may see slowness and paused jobs.
We will send a note when the upgrade is complete.

Thanks,

Nuria

On Thu, Apr 5, 2018 at 1:22 PM, Andrew Otto  wrote:

> Hi all!
>
> I just upgraded spark2 across the cluster to Spark 2.3.0.  If you are
> using the pyspark2*, spark2-*, etc. executables, you will now be using
> Spark 2.3.0.
>
> We are moving towards making Spark 2 the default Spark for all Analytics
> production jobs.  We don’t have a deprecation plan for Spark 1 yet, so you
> should be able to continue using Spark 1 for the time being.  However, in
> order to support large yarn Spark 2 jobs, we need to upgrade the default
> Yarn Spark Shuffle Service to Spark 2.  This means that large Spark 1 jobs
> may no longer work properly.  We don’t know of any large productionized
> Spark 1 jobs other than the ones the Analytics team manages, but if you
> have any that you are worried about, please let us know ASAP.
>
> -Andrew & Analytics Engineering
>
>
>
> ___
> Research-Internal mailing list
> research-inter...@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/research-internal
>


Re: [Analytics] How to get the traces of requests to the Wikipedia site in each web server

2018-04-09 Thread Nuria Ruiz
Hello,

I do not think our downloads or API provide a dataset like the one you are
interested in. From your question I get the feeling that your assumptions
about how our system works do not match reality; Wikipedia might not be the
best fit for your study.

The closest data to what you are asking might be this one:
https://analytics.wikimedia.org/datasets/archive/public-datasets/analytics/caching/README,
I would read this ticket to understand the inner workings of the dataset:
https://phabricator.wikimedia.org/T128132

Thanks,

Nuria



On Mon, Apr 9, 2018 at 10:48 AM, Ta-Yuan Hsu  wrote:

> Dear all,
>
>Since we are studying workloads including a sample of Wikipedia's
> traffic over a certain period of time, what we need is patterns of user
> access to web servers  in a decentralized hosting environment. The access
> patterns need to include real hits on their servers per time for one
> language. In other words, one trace record we require should contain at
> least four features - timestamp (like MM:DD:SS), web server id, page size,
> and operations (e.g., create, read, or update a page).
>
>We already reviewed some available downloaded datasets, such as
> https://dumps.wikimedia.org/other/pagecounts-raw/. However, they do not
> match our requirement. Does anyone know if it is possible to download a
> dataset with four features from Wikimedia website? Or should we use REST
> API to acquire it?   Thank you!
> --
> Sincerely,
> TA-YUAN
>


Re: [Analytics] Monitor the number of Wikipedia sites and the number of articles in each site

2018-04-03 Thread Nuria Ruiz
Zainan:

Labs is our cloud environment for volunteers; you can direct questions
about that to the cloud e-mail list.

https://wikitech.wikimedia.org/wiki/Help:Cloud_Services_Introduction

Thanks,

Nuria

On Mon, Apr 2, 2018 at 7:44 PM, Zainan Zhou (a.k.a Victor) 
wrote:

> Thanks Dan, that's very helpful, I asked two follow-up questions inline
> below
>
>
> * •  **Zainan Zhou(**周载南**) a.k.a. "Victor" * 
> * •  *Software Engineer, Data Engine
> * •*  Google Inc.
> * •  *z...@google.com  - 650.336.5691
> * • * 1600 Amphitheathre Pkwy, LDAP zzn, Mountain View 94043
>
> On Sat, Mar 31, 2018 at 12:34 AM, Dan Andreescu 
> wrote:
>
>> Thanks to Tilman for pointing out that this data is still being worked
>> on.  So, yes, there are lots of subtleties in how we count articles,
>> redirects, content vs. non-content, etc.  I don't have the answer to all of
>> the discrepancies that Tilman found, but if you need a very accurate
>> answer, the only way is to get an account on labs and start digging into
>> how exactly you want to count the articles.
>>
>
> What's the best way to sign up for the labs account? (does it require
> certain qualifications?)
> And could you point us to the code or entry of the code repository?
>
>
>
>> As our datasets and APIs get more mature, we're hoping to give as much
>> flexibility as everyone needs, but not so much as to drive people crazy.
>> Until then, we're slowly improving our docs.
>>
>> And yes, don't read some of this stuff alone at night, the buddy system
>> works well for data analysis, lol
>>
>> On Fri, Mar 30, 2018 at 6:43 AM, Zainan Zhou (a.k.a Victor) <
>> z...@google.com> wrote:
>>
>>> Thank you very much Dan, this turns out to be very helpful. My teammates
>>> has started looking into it.
>>>
>>>
>>>
>>> On Fri, Mar 30, 2018 at 5:12 AM, Dan Andreescu >> > wrote:
>>>
 Forwarding this question to the public Analytics list, where it's good
 to have these kinds of discussions.  If you're interested in this data and
 how it changes over time, do subscribe and watch for updates, notices of
 outages, etc.

 Ok, so on to your question.  You'd like the *total # of articles for
 each wiki*.  I think the simplest way right now is to query the AQS
 (Analytics Query Service) API, documented here:
 https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2

 To get the # of articles for a wiki, let's say en.wikipedia.org, you
 can get the timeseries of new articles per month since the beginning of
 time:

 https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/en.wikipedia.org/all-editor-types/all-page-types/monthly/2001010100/2018032900

 And to get a list of all wikis, to plug into that URL instead of "
 en.wikipedia.org", the most up-to-date information is here:
 https://meta.wikimedia.org/wiki/Special:SiteMatrix in table form or
 via the mediawiki API: https://meta.wikimedia.org/w/api.php?action=sitematrix=2=json
 xage=3600=3600.  Sometimes new sites won't have data in the
 AQS API for a month or two until we add them and start crunching their
 stats.
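The sitematrix response Dan points at can be flattened to a plain list of
site URLs for exactly this kind of monitoring. A rough Python sketch (the
response shape - numeric keys for language groups, each with a "site" list,
plus a "specials" list - is my reading of the sitematrix API output; verify
against a live response before relying on it):

```python
def site_urls(sitematrix):
    """Pull every site URL out of an action=sitematrix JSON response.

    Assumes numeric keys map to language groups carrying a "site" list,
    and "specials" is a flat list of sites; other keys (e.g. "count")
    are skipped.
    """
    sm = sitematrix.get("sitematrix", {})
    urls = []
    for key, group in sm.items():
        if key.isdigit():
            urls.extend(site["url"] for site in group.get("site", []))
        elif key == "specials":
            urls.extend(site["url"] for site in group)
    return urls
```

Each URL can then be plugged into the AQS endpoint above to fetch per-wiki
article counts.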

 The way I figured this out is to look at how our UI uses the API:
 https://stats.wikimedia.org/v2/#/en.wikipedia.org/contributing/new-pages.  So if you were interested in something else, you
 can browse around there and take a look at the XHR requests in the browser
 console.  Have fun!

 On Thu, Mar 29, 2018 at 12:54 AM, Zainan Zhou (a.k.a Victor) <
 z...@google.com> wrote:

> Hi Dan,
>
> How are you! This is Victor, It's been a while since we meet at the
> 2018 Wikimedia Dev Summit. I hope you are doing great.
>
> As I mentioned to you, my team works on extracting the knowledge from
> Wikipedia. Currently it's undergoing a project that expands language
> coverage. My teammate Yuan Gao (cc'ed here) is the tech lead of this
> project. She plans to *monitor the list of all the currently available
> Wikipedia sites and the number of articles for each language*, so
> that we can compare with our extraction system's output to sanity-check if
> there is a massive breakage of the extraction logic, or if we need to
> add/remove languages in the event that a new wikipedia site is introduced
> to/remove from the wikipedia family.
>
> I think your team at Analytics at Wikimedia probably knows best
> where we can find this data. Here 

Re: [Analytics] [Services] Getting more than just 1000 top articles from REST API

2018-04-02 Thread Nuria Ruiz
>are trying to rebuild our stale encyclopedia apps for offline usage but
are space-limited and would only like to include the most likely pages that
would be looked at that can fit within a size envelope that varies with
the device in question (up to 100k article limit probably)
For this use case I would be careful treating page ranks as true
popularity, as the top data is regularly affected by bot spikes (that is a
known issue that we intend to fix). After you have your list of most
popular pages, please take a second look; some - but not all - of the pages
that are artificially high due to bot traffic are pretty obvious (many
special pages).

On Mon, Apr 2, 2018 at 8:54 AM, Leila Zia  wrote:

>
>
> On Mon, Apr 2, 2018 at 7:47 AM, Dan Andreescu 
> wrote:
>
>> Hi Srdjan,
>>
>> The data pipeline behind the API can't handle arbitrary skip or limit
>> parameters, but there's a better way for the kind of question you have.  We
>> publish all the pageviews at https://dumps.wikimedia.org/other/pagecounts-ez/,
>> look at the "Hourly page views per article"
>> section.  I would imagine for your use case one month of data is enough,
>> and you can get the top N articles for all wikis this way, where N is
>> anything you want.
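Once one of those files is downloaded, the top N can be kept with a heap. A
rough Python sketch (the three-column "project page_title count" line
layout is an assumption based on the pagecounts-ez description above; the
real files carry extra encoded columns, so adapt the parsing to the file's
README):

```python
import heapq

def top_n(lines, n):
    """Keep the N highest-count (count, project, title) rows.

    Assumes whitespace-separated lines of "project page_title count ...";
    malformed or non-numeric rows are skipped.
    """
    parsed = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # not enough columns; skip
        try:
            count = int(parts[2])
        except ValueError:
            continue  # count column not a plain integer; skip
        parsed.append((count, parts[0], parts[1]))
    return heapq.nlargest(n, parsed)

sample = ["en.z Main_Page 120", "en.z Cat 40", "de.z Hund 300"]
top_n(sample, 2)  # → [(300, 'de.z', 'Hund'), (120, 'en.z', 'Main_Page')]
```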
>>
>
> ​One suggestion here is that if you want to find articles that are
> consistently high-page-view (and not part of spike/trend-views), you
> increase the time-window to 6 months or longer.
>
> Best,
> Leila​
>
> --
> Leila Zia
> Senior Research Scientist, Lead
> Wikimedia Foundation
>
> ​
>
>


Re: [Analytics] Migrated Reportcard with Updated Data

2018-03-11 Thread Nuria Ruiz
The data as it was sent to us by comScore can be found here:
https://github.com/wikimedia/analytics-reportcard-data/blob/master/datafiles/rc_comscore_region_reach.csv
(caveats are many: as years go by, the data is less and less representative)



On Sun, Mar 11, 2018 at 6:02 PM, Tilman Bayer <tba...@wikimedia.org> wrote:

> The old report card at https://reportcard.wmflabs.org also included
> historical unique visitors data from comScore (at
> https://reportcard.wmflabs.org/graphs/unique_visitors ; the global number was one of WMF's
> core metrics for many years, being highlighted e.g. in our monthly reports,
> and IIRC that report card dashboard also included regional numbers).
>
> Have we preserved this data somewhere?
>
> On Fri, Apr 7, 2017 at 11:30 AM, Nuria Ruiz <nu...@wikimedia.org> wrote:
>
>> Hello!
>>
>> The Analytics team would like to announce that we have migrated the
>> reportcard to a new domain:
>>
>> https://analytics.wikimedia.org/dashboards/reportcard/#pageviews-july-2015-now
>>
>> The migrated reportcard includes both legacy and current pageview data,
>> daily unique devices and new editors data. Pageview and devices data is
>> updated daily but editor data is still updated ad-hoc.
>>
>> The team is working at this time on revamping the way we compute edit
>> data and we hope to be able to provide monthly updates for the main edit
>> metrics this quarter. Some of those will be visible in the reportcard but
>> the new wikistats will have more detailed reports.
>>
>> You can follow the new wikistats project here: https://phabricator.wikimedia.org/T130256
>>
>> Thanks,
>>
>> Nuria
>>
>>
>
>
> --
> Tilman Bayer
> Senior Analyst
> Wikimedia Foundation
> IRC (Freenode): HaeB
>


Re: [Analytics] Wikipedia internal search clickstream

2018-03-05 Thread Nuria Ruiz
Short answer: no, this data is not publicly available in a way that lets
you compute the dataset yourself, as it is private data.
Thanks,

Nuria

On Mon, Mar 5, 2018 at 11:31 AM, Georg Sorst  wrote:

> Hi all,
>
> sorry for this messy post - I forgot to subscribe to the list so I can't
> directly reply to your responses.
>
> Nuria:
>
> Datasets do not include simple wiki; they are calculated for a few wikis,
> some of which are not very large, so you might be able to use them.
>
> Is the raw data available? Can I compute the clickstream myself?
>
> Erik:
>
> > This is actually how our production search ranking is built for around
> the
> top 20 sites by search volume that we host. Simple wikipedia isn't one of
> those we currently use machine ranking for though.
>
> Awesome! Is there more info available somewhere? Algorithms used etc.
> maybe even source code?
>
> > Because of that we do have the data you need, but the problem will be
> that the actual search
> queries are considered PII (Personally Identifiable Information) and not
> something I can release publicly. It may be possible to release aggregated
> data sets that don't include the actual search terms, but at that point I
> don't think the data will be useful to you anymore.
>
> I think I'm fine with query-document pairs. Isn't that sufficiently
> aggregated to not be considered PII?
>
> Thank you!
> Georg
>
>
> Georg Sorst  schrieb am Mi., 28. Feb. 2018 um
> 12:17 Uhr:
>
>> Hi list,
>>
>> as part of a lecture on Information Retrieval I am giving we work a lot
>> with Simple Wikipedia articles. It's a great data set because it's
>> comprehensive and not domain specific so when building search on top of it
>> humans can easily judge result quality, and it's still small enough to be
>> handled by a regular computer.
>>
>> This year I want to cover the topic of Machine Learning for search. The
>> idea is to look at result clicks from an internal search search engine,
>> feed that into the Machine Learning and adjust search accordingly so that
>> the top-clicked results actually rank best. We will be using Solr LTR for
>> this purpose.
>>
>> I would love to base this on Simple Wikipedia data since it would fit
>> well into the rest of the lecture. Unfortunately, I could not find that
>> data. The closest I came is
>> https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream but this covers neither Simple
>> Wikipedia nor does it specify internal search queries.
>>
>> Did I miss something? Is this data available somewhere? Can I produce it
>> myself from raw data? Ideally I would need (query-document) pairs with the
>> number of occurrences.
>>
>> Thank you!
>> Georg
>> --
>> *Georg M. Sorst I CTO*
>> [image: FINDOLOGIC Logo]
>>
>> Jakob-Haringer-Str. 5a | 5020
>> 
>>  Salzburg
>> 
>> I T.: +43 662 456708 <+43%20662%20456708>
>> E.: g.so...@findologic.com
>> www.findologic.com Folgen Sie uns auf: XING
>>  facebook
>>  Twitter
>> 
>>
>> Wir sehen uns auf der* Internet World* - am 06.03. & 07.03.2018 in *Halle
>> A6 Stand E130 in München*! Hier
>>  Termin
>> vereinbaren!
>> Wir sehen uns auf der *SHOPTALK* von 18. bis 21. März in *Las Vegas*!
>> Hier  Termin
>> vereinbaren!
>> Wir sehen uns auf der *SOM* am 18.04. & 19.04.2018 in *Halle 7 Stand
>> G.17 in Zürich*! Hier  
>> Termin
>> vereinbaren!
>> Hier  geht es zu unserer *Homepage*!
>>

Re: [Analytics] PageView

2018-03-02 Thread Nuria Ruiz
>Or is there another method you also count that is gathered for other
companies that collect views?
Companies that do this, such as comScore, do it by having their participants
install software (normally desktop software) on their machines and tracking
the page views those participants make. Until recently, comScore would use
only desktop statistics (which seems to agree with your findings) when
reporting, for example, unique users. Because of this, their numbers (which
did not include mobile usage) were largely incorrect.

Please see:
https://meta.wikimedia.org/wiki/ComScore/Announcement

On Fri, Mar 2, 2018 at 8:41 AM, Marcel Ruiz Forns 
wrote:

> Sorry, forwarding to Analytics...
>
> Hi Angelina,
>
> I don't think there's any (legal) way of tracking Wikipedia traffic.
> All Wikipedia traffic data is protected by WMF's privacy policy[1]
> and handled accordingly.
>
> We do, however, provide public sanitized high-level statistics on page
> views for Wikipedia in various ways (not to specific companies or
> organizations, but rather to the world at large). What "Next Big Sound"
> is probably doing, is consuming one of those public sources, but we
> don't know which one.
>
> These are 2 of the main sources this company might be grabbing stats from:
> https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews
> https://dumps.wikimedia.org/
>
> Cheers!
>
> [1] https://wikimediafoundation.org/wiki/Privacy_policy
>
>
>
>>
>> On Fri, Mar 2, 2018 at 4:19 PM, Marcel Ruiz Forns 
>> wrote:
>>
>>> Oh, forgot the subscribe link, here:
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>> Cheers!
>>>
>>> On Fri, Mar 2, 2018 at 4:18 PM, Marcel Ruiz Forns 
>>> wrote:
>>>
 Hi Angelina,

 I'm the administrator of this mailing-list. Just to let you know that
 your email was automatically filtered out by the mailing-list bot because
 your address is not subscribed to it. I just unblocked it, so you will
 receive a response shortly. However, please subscribe to send further
 emails to the list.

 Thanks!


 On Wed, Feb 28, 2018 at 5:04 PM, BTShasSTOLENmyHEART <
 zangeli...@gmail.com> wrote:

> Hello,
>
> I recently spoke with "Next Big Sound" which is a company that tracks
> Wikipedia page views on certain artists. They informed me that they got
> details of the views directly from Wikipedia (because I had emailed them
> that the View counts mentioned on Wikipedia and Next Big Sound show a 
> major
> discrepancy). There are rumors flying about saying that the information
> gathered is only from Desktop Views, in which case the counts are extremely
> similar. Is there any way you can confirm this as true? Or is there 
> another
> method you also count that is gathered for other companies that collect
> views? I know you have no idea of what Next Big Sound is presenting to the
> world wide audience, but I wanted to know if you can explain what
> information is given to Next Big Sound in terms of data. Thank you
>
>
> Sincerely,
>
> Angelina Zamora
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


 --
 *Marcel Ruiz Forns*
 Analytics Developer
 Wikimedia Foundation

> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Wikipedia internal search clickstream

2018-03-02 Thread Nuria Ruiz
>Did I miss something? Is this data available somewhere?
You can find more information about the clickstream datasets here:
https://blog.wikimedia.org/2018/01/16/wikipedia-rabbit-hole-clickstream/

The datasets do not include Simple Wikipedia; they are calculated for a few
wikis, some of which are not very large, so you might still be able to use them.
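For reference, the released clickstream files are tab-separated with columns prev, curr, type, n. A minimal sketch (with made-up sample rows, in the documented layout) of pulling out externally search-referred pairs; note that, as discussed above, internal search queries are not part of the dataset:

```python
import csv
import io

def search_referred_pairs(tsv_text):
    """Sum clickstream counts for rows whose referer class is 'other-search'
    (external search engines; internal search queries are not in the data)."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    pairs = {}
    for prev, curr, link_type, n in reader:
        if prev == "other-search":
            pairs[curr] = pairs.get(curr, 0) + int(n)
    return pairs

# Hypothetical sample rows in the published column order: prev, curr, type, n
sample = (
    "other-search\tHannah_Arendt\texternal\t7347\n"
    "other-empty\tHannah_Arendt\texternal\t1000\n"
    "Philosophy\tHannah_Arendt\tlink\t517\n"
)
print(search_referred_pairs(sample))  # {'Hannah_Arendt': 7347}
```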







On Wed, Feb 28, 2018 at 3:17 AM, Georg Sorst  wrote:

> Hi list,
>
> as part of a lecture on Information Retrieval I am giving we work a lot
> with Simple Wikipedia articles. It's a great data set because it's
> comprehensive and not domain specific so when building search on top of it
> humans can easily judge result quality, and it's still small enough to be
> handled by a regular computer.
>
> This year I want to cover the topic of Machine Learning for search. The
> idea is to look at result clicks from an internal search search engine,
> feed that into the Machine Learning and adjust search accordingly so that
> the top-clicked results actually rank best. We will be using Solr LTR for
> this purpose.
>
> I would love to base this on Simple Wikipedia data since it would fit well
> into the rest of the lecture. Unfortunately, I could not find that data.
> The closest I came is
> https://meta.wikimedia.org/wiki/Research:Wikipedia_clickstream
> but this covers neither Simple Wikipedia nor does it specify
> internal search queries.
>
> Did I miss something? Is this data available somewhere? Can I produce it
> myself from raw data? Ideally I would need (query-document) pairs with the
> number of occurrences.
>
> Thank you!
> Georg
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] How to get old page views data?

2018-02-22 Thread Nuria Ruiz
Peter:

Do submit a phabricator tasks with your request, it'll be easier to follow
on it than it is via e-mail.  Our backlog:
https://phabricator.wikimedia.org/tag/analytics/

I assume you know that per-article views are available since 2015; one way to
see those: https://tools.wmflabs.org/pageviews/

Per-project views are available from early on, in either downloadable
files or programmatic form:
https://wikitech.wikimedia.org/wiki/Analytics/AQS/Legacy_Pagecounts
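The same per-article data behind that tool is also exposed by the Wikimedia Pageviews REST API. A small sketch that just builds the request URL (fetching and error handling are left out to keep it self-contained; the endpoint pattern is from the public API docs):

```python
from urllib.parse import quote

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def per_article_url(project, article, start, end,
                    access="all-access", agent="user", granularity="daily"):
    """Build a per-article pageviews request URL.

    Article titles use underscores and must be URL-encoded; start/end are
    YYYYMMDD strings."""
    title = quote(article.replace(" ", "_"), safe="")
    return f"{BASE}/{project}/{access}/{agent}/{title}/{granularity}/{start}/{end}"

url = per_article_url("en.wikipedia.org", "Albert Einstein", "20150701", "20150731")
print(url)
# https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wikipedia.org/all-access/user/Albert_Einstein/daily/20150701/20150731
```

Note the `agent` parameter, which lets callers exclude self-identified spider traffic from the counts.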

Thanks,

Nuria

On Thu, Feb 22, 2018 at 1:44 PM, Peter Meissner 
wrote:

> Like dumps at article-day level? That would already be super awesome, much
> better than the current state.
>
> Best, Peter
>
> Am 22.02.2018 22:23 schrieb "Dan Andreescu" :
>
>> Peter, the data you mention here is quite large, and storage is cheap but
>> not free.  For now, we don't have capacity to serve that kind of timespan
>> from the API, but we will work to improve the dumps version so it's more
>> comprehensive.
>>
>> On Thu, Feb 22, 2018 at 4:12 PM, Peter Meissner > > wrote:
>>
>>> Dear List-eners,
>>>
>>>
>>> I write in to argue the case for a Wikipedia effort to make something
>>> like stats.grok.se (page views per day per article from 2007 onwards)
>>> available again.
>>>
>>>
>>> I am the author of the first R package that provided easy access to
>>> pageview counts, by querying the stats.grok.se service and translating
>>> the results into neat little R data frames.
>>>
>>> Since stats.grok.se went away, somebody writes in once a month - mostly
>>> from academia - asking about the status of page view data for the time
>>> before late 2015: counts, per article, per day. To underline this further:
>>> the R pageviews package written by one of your former colleagues has over
>>> 7000 downloads within 2 years, while my package has 14000 within 4 years
>>> (and these are conservative numbers because they stem from one particular
>>> CRAN mirror only).
>>>
>>> I made some efforts to reconstruct the service that stats.grok.se was
>>> providing, but it's not a trivial endeavour as far as I can see (BIG
>>> data, demanding computing time, storage resources and bandwidth, and some
>>> thinking about how to re-arrange and aggregate the data so it can be
>>> queried and served efficiently - not to mention that the data is raw,
>>> meaning it needs some proper cleaning up before use, and hosting will
>>> need some resources, ...) - and so my efforts have gone nowhere.
>>>
>>>
>>> Would it not be nice if Wikipedia could jump in and support research by
>>> going the whole mile and making those page counts available?
>>>
>>> In regard to prioritizing - I am sure you have a long backlog - I would
>>> argue that this is really a multiplier: it enables a lot of people to
>>> start researching. Daily page counts are not that fancy, but without them
>>> people are simply blocked. They cannot start because they can't even get
>>> a basic idea of what the general popularity of an article was on a given
>>> day.
>>>
>>>
>>> Best Peter
>>>
>>>
>>>
>>> PS.: I would be willing to put in some time to help you folks in any way
>>> I can.
>>>
>>>
>>> 2018-02-22 21:56 GMT+01:00 Dan Andreescu :
>>>
 My view had been informed by the documentation at
> https://dumps.wikimedia.org/other/pagecounts-ez/:
>
> Hourly page views per article for around 30 million article titles
>> (Sept 2013) in around 800+ Wikimedia wikis. Repackaged (with extreme
>> shrinkage, without losing granularity), corrected, reformatted. Daily 
>> files
>> and two monthly files (see notes below).
>
>
> Regarding the claim that pagecounts-ez has data back to when wikimedia
> started tracking pageviews, I'll point out another error in the
> documentation that may have led to that view. The documentation claims 
> that
> data is available from 2007 onward:
>
>  From 2007 to May 2015: derived from Domas' pagecount/projectcount
>> files
>
>
> However, if you check out the actual files (
> https://dumps.wikimedia.org/other/pagecounts-ez/merged/), you'll see
> that the pagecounts only go back to late 2011.
>

 Ah, yes, but the projectcount files go back to 2007-12, that's where
 that confusion comes from, we should clarify or generate the old data.  I'm
 not sure whether this is easy, but I think it's fairly straightforward and
 I've opened a task for it: https://phabricator.wikimedia.org/T188041
 (we have a lot of work in our backlog, though, so we probably won't be able
 to get to this for a bit).
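For anyone digging into those dump files, the classic hourly pagecounts lines are whitespace-separated: project code, page title, view count, bytes transferred. A minimal parser sketch (the sample line is made up, in the documented format):

```python
def parse_pagecounts_line(line):
    """Parse one line of the classic hourly pagecounts format:
    '<project> <page_title> <count> <bytes>' (whitespace-separated)."""
    project, title, count, nbytes = line.split()
    return {"project": project, "title": title,
            "count": int(count), "bytes": int(nbytes)}

row = parse_pagecounts_line("en Main_Page 42 1234567")
print(row["project"], row["count"])  # en 42
```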

 ___
 Analytics mailing list
 Analytics@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/analytics


>>>
>>> ___
>>> Analytics mailing list
>>> 

Re: [Analytics] Wikistats 2.0 - Now with Maps!

2018-02-22 Thread Nuria Ruiz
>Can it be that search bots and other obscure automated processes are
distorting this data, and are there ways to filter that out in order to
know where are the actual humans interested in a >Wikimedia project?
Short answer: yes.
That said, we estimate the overall bot-related distortion at less than 5%. We
did some research to quantify this when we rolled out our unique devices
metric. Take a look:
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageviews/Bots_Research
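For a flavor of how self-identifying bots get separated from users, here is a rough, illustrative user-agent heuristic. The actual pageview definition uses a much more elaborate pattern (and undercover bots with fake browser user agents evade it entirely, per the thread below), so treat this as a sketch rather than the production logic:

```python
import re

# Illustrative only: real classification is far more elaborate, and bots
# spoofing browser user agents will still be labeled "user" here.
BOT_RE = re.compile(r"bot|crawler|spider|https?://", re.IGNORECASE)

def agent_type(user_agent):
    """Classify a user-agent string as 'spider' (self-identified bot) or 'user'."""
    return "spider" if BOT_RE.search(user_agent) else "user"

print(agent_type("Googlebot/2.1 (+http://www.google.com/bot.html)"))  # spider
print(agent_type("Mozilla/5.0 (Windows NT 10.0; Win64; x64) Firefox/58.0"))  # user
```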



On Tue, Feb 20, 2018 at 7:25 AM, Francisco Dans <fd...@wikimedia.org> wrote:

> Hey Quim!
>
> These are pageviews, so our aim is to count only user-originated traffic
> as such, but there are lots of bots and automated traffic pretending to be
> users, most commonly by utilizing fake browser user agents. We're close to
> starting a push to better identify these undercover bots, which you can
> take a look at in T138207 <https://phabricator.wikimedia.org/T138207>.
>
> Thank you for taking a look and for your comments :)
> Fran
>
> On Tue, Feb 20, 2018 at 9:31 AM, Quim Gil <q...@wikimedia.org> wrote:
>
>> Thank you for this feature. Maybe the data was available before, but it's
>> the maps who made me click.  :)
>>
>> https://stats.wikimedia.org/v2/#/ca.wikipedia.org/reading/pageviews-by-country
>>
>> Turns out that China is a huge follower of Catalan Wikipedia. ;)  Can it
>> be that search bots and other obscure automated processes are distorting
>> this data, and are there ways to filter that out in order to know where are
>> the actual humans interested in a Wikimedia project?
>>
>>
>> On Wed, Feb 14, 2018 at 11:15 PM, Nuria Ruiz <nu...@wikimedia.org> wrote:
>>
>>> Hello from Analytics team:
>>>
>>> Just a brief note to announce that Wikistats 2.0 includes data about
>>> pageviews per project per country for the current month.
>>>
>>> Take a look, pageviews for Spanish Wikipedia this current month:
>>> https://stats.wikimedia.org/v2/#/es.wikipedia.org/reading/pageviews-by-country
>>>
>>> Data is also available programmatically via APIs:
>>>
>>> https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews#Pageviews_split_by_country
>>>
>>> We will be deploying small UI tweaks during this week but please explore
>>> and let us know what you think.
>>>
>>> Thanks,
>>>
>>> Nuria
>>>
>>>
>>> ___
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>>
>>
>>
>> --
>> Quim Gil
>> Engineering Community Manager @ Wikimedia Foundation
>> http://www.mediawiki.org/wiki/User:Qgil
>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>
>
> --
> *Francisco Dans*
> Software Engineer, Analytics Team
> Wikimedia Foundation
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Page hourly views

2018-02-11 Thread Nuria Ruiz
Sorry, not sure we understand this question. Can you elaborate?

On Sun, Feb 11, 2018 at 12:10 PM, Bo Han  wrote:

> Hello,
>
> Is the process for generating pageview hourly backed up?
>
> Thank you
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-02-07 Thread Nuria Ruiz
>Regarding the last few posts about the geolocation information, from the
data analysis perspective, there is indeed another, more serious concern
about using the GeoIP cookie: It will create significant discrepancies
with the existing geolocation data we record for pageviews, where we have
chosen to derive this information from the IP instead

How did you come to the conclusion that the data will differ?

The GeoIP cookie is inferred from your IP just the same, right?
https://github.com/wikimedia/puppet/blob/production/modules/varnish/templates/geoip.inc.vcl.erb#L10




On Wed, Feb 7, 2018 at 9:09 AM, Tilman Bayer  wrote:

> Thanks everyone! Separate from Sam's mapping out the frontend
> instrumentation work at https://phabricator.wikimedia.org/T184793 , I
> have created a task for the backend work at https://phabricator.wikimedia.
> org/T186728 based on this thread.
>
> Regarding the last few posts about the geolocation information, from the
> data analysis perspective, there is indeed another, more serious concern
> about using the GeoIP cookie: It will create significant discrepancies with
> the existing geolocation data we record for pageviews, where we have chosen
> to derive this information from the IP instead. (Remember the overarching
> goal here of measuring page previews the same way we measure page views
> currently; the basic principle is that if a reader visits a page and then
> uses the page preview feature on that page to read preview cards, all the
> metadata that is recorded for both should have identical values for both
> the preview and the pageview.) Therefore, we should go with the kind of
> solution Andrew outlined above (adapting/reusing GetGeoDataUDF or such).
>
> On Thu, Feb 1, 2018 at 7:36 AM, Andrew Otto  wrote:
>
>> Wow Sam, yeah, if this cookie works for you, it will make many things
>> much easier for us.  Check it out and let us know.  If it doesn’t work for
>> some reason, we can figure out the backend geocoding part.
>>
>>
>>
>> On Thu, Feb 1, 2018 at 2:43 AM, Sam Smith  wrote:
>>
>>> On Tue, Jan 30, 2018 at 8:02 AM, Andrew Otto  wrote:
>>>
 > Using the GeoIP cookie will require reconfiguring the EventLogging
 varnishkafka instance [0]

 I’m not familiar with this cookie, but, if we used it, I thought it
 would be sent back to by the client in the event. E.g. event.country =
 response.headers.country; EventLogging.emit(event);

 That way, there’s no additional special logic needed on the server side
 to geocode or populate the country in the event.

>>>
>>> Hah! I didn't think about accessing the GeoIP cookie on the client. As
>>> you say, the implementation is quite easy.
>>>
>>> My only concern with this approach is the duplication of the value
>>> between the cookie, which is sent in every HTTP request to the
>>> /beacon/event endpoint, and the event itself. This duplication seems
>>> reasonable when balanced against capturing either: the client IP and then
>>> doing similar geocoding further along in the pipeline; or the cookie for
>>> all requests to that endpoint and then discarding them further along in the
>>> pipeline. It also reflects a seemingly core principle of the EventLogging
>>> system: that it doesn't capture potentiallly PII by default.
>>>
>>> -Sam
>>>
>>>
>>>
>>> ___
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>>
>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>
>
> --
> Tilman Bayer
> Senior Analyst
> Wikimedia Foundation
> IRC (Freenode): HaeB
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-02-01 Thread Nuria Ruiz
>Wow Sam, yeah, if this cookie works for you, it will make many things much
easier for us
This is how it is done in the performance schemas for Navigation Timing data
per country, so there is precedent:
https://github.com/wikimedia/mediawiki-extensions-NavigationTiming/blob/master/modules/ext.navigationTiming.js#L218

In this case, because a preview request must happen after a full page
download, the cookie will always be available. Note that the cookie values
are of the form US:WA:Seattle, so they would need further processing to
match the current pageviews split.
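As a sketch of that further processing, here is a minimal parser for cookie values of that shape; the field order beyond country:region:city is an assumption for illustration, and missing trailing fields come back as None:

```python
def parse_geoip_cookie(value):
    """Split a GeoIP cookie value like 'US:WA:Seattle' into named parts.

    Assumes colon-separated country:region:city, matching the example in
    the thread; extra fields (if any) are ignored."""
    parts = value.split(":")
    return {
        "country": parts[0] if len(parts) > 0 and parts[0] else None,
        "region": parts[1] if len(parts) > 1 and parts[1] else None,
        "city": parts[2] if len(parts) > 2 and parts[2] else None,
    }

print(parse_geoip_cookie("US:WA:Seattle"))
# {'country': 'US', 'region': 'WA', 'city': 'Seattle'}
```

For a pageviews-style country split, only the first field would be kept.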

On Thu, Feb 1, 2018 at 7:36 AM, Andrew Otto  wrote:

> Wow Sam, yeah, if this cookie works for you, it will make many things much
> easier for us.  Check it out and let us know.  If it doesn’t work for some
> reason, we can figure out the backend geocoding part.
>
>
>
> On Thu, Feb 1, 2018 at 2:43 AM, Sam Smith  wrote:
>
>> On Tue, Jan 30, 2018 at 8:02 AM, Andrew Otto  wrote:
>>
>>> > Using the GeoIP cookie will require reconfiguring the EventLogging
>>> varnishkafka instance [0]
>>>
>>> I’m not familiar with this cookie, but, if we used it, I thought it
>>> would be sent back to by the client in the event. E.g. event.country =
>>> response.headers.country; EventLogging.emit(event);
>>>
>>> That way, there’s no additional special logic needed on the server side
>>> to geocode or populate the country in the event.
>>>
>>
>> Hah! I didn't think about accessing the GeoIP cookie on the client. As
>> you say, the implementation is quite easy.
>>
>> My only concern with this approach is the duplication of the value
>> between the cookie, which is sent in every HTTP request to the
>> /beacon/event endpoint, and the event itself. This duplication seems
>> reasonable when balanced against capturing either: the client IP and then
>> doing similar geocoding further along in the pipeline; or the cookie for
>> all requests to that endpoint and then discarding them further along in the
>> pipeline. It also reflects a seemingly core principle of the EventLogging
>> system: that it doesn't capture potentiallly PII by default.
>>
>> -Sam
>>
>>
>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Product] Fwd: Session #6 and into all hands

2018-01-31 Thread Nuria Ruiz
Sorry, my last message was meant for analytics-internal@

On Wed, Jan 31, 2018 at 8:29 AM, Nuria Ruiz <nu...@wikimedia.org> wrote:

> If you have time, do skim through these docs. I will do the same between
> today and tomorrow,   they are pretty informative as to how annual plan is
> and what audiences is doing.
>
>
> -- Forwarded message --
> From: Jon Katz <jk...@wikimedia.org>
> Date: Tue, Jan 30, 2018 at 8:16 PM
> Subject: [Product] Fwd: Session #6 and into all hands
> To: WMF Product Team <wmfprod...@lists.wikimedia.org>
>
>
> Hey Folks,
>
> *Annual Planning Context*
> Last week we came together and, among other things, talked about strategy
> and annual planning.  As I promised when presenting, I wanted to share more
> context and details about the annual planning process with you. Many of you
> probably aren't interested in how we honed in on the product principles or
> annual plan goal, but for those of you who *are* the process was pretty
> heavily documented and you should feel free to dig in.  If you have
> specific questions, pinging Danny, Josh or me is probably the fastest way
> to get an answer.  We're also really interested in learning where the holes
> are, so hearing your questions/feedback is really useful.
>
>   Below is an email I shared with the planning group before all hands, but
> here are some other artifacts you might find helpful:
>
>- Preso
>
> <https://docs.google.com/presentation/d/1AoZj4BqeAhHHQKelodVUHuqGXCzK8Ozcp2_aqc5n5uc/edit?usp=sharing>
>from all-audiences meeting at all-hands.
>- Annual planning outcome doc
>
> <https://docs.google.com/document/d/1Mjy7lMg6RBD8dqpRW8uCYtBgeP4kPTyJ2nYSsKuxrUU/edit?usp=sharing>
>  - This
>document elaborates on what we think the theme means as well as what each
>of the 'output' groupings mean.  It is definitely a living document, and
>subject to modification.
>- working doc
>
> <https://docs.google.com/presentation/d/1qQh-0HkvVerqM2or0b0QIIoos2hFX8bjRmh8z2ypemg/edit?usp=sharing>
>  for
>the session before all-hands - This is the session where we identified the
>product groupings
>- Session notes
>
> <https://docs.google.com/document/d/1m4OaCBtnez2PEFzetFEW5KAN7jCcZ9utn_n7LTLeLU8/edit?usp=sharing>
>  -
>Notes from every session of the coordinating group
>- The emails
>
> <https://docs.google.com/document/d/1Wisbp0zdic2NOrtbPEietL0XUcxiKSPLLb0VpCj6Lsk/edit#>
>  sent
>(like the one below) before and after each session
>- Shared folder
><https://drive.google.com/open?id=16b3h0kv0qpZ14sjnHBmVzxAT37GjDN3r>
>with all documentation and output
>
>
> *Next Steps*
> As far as next steps go, product owners will be working with their teams
> to identify projects that fit into the core output groupings and that will
> lead to the impact we've identified.
>
>
>
>
> Once we have a rough sense of the year, if we haven't already, we will
> need to run it by:
>    - engineering, ASAP, to assess feasibility and timelines
>    - adjacent teams and dependencies like CE, data analytics, etc.
>
> In parallel, the data analysts, Danny and I are starting to define the
> primary metrics we will use to evaluate success--we'll run those by the
> PO's for approval. The annual plan draft is due Feb 23.
>
> Again, please reach out with any questions or feedback. If you're
> scratching your head, I've probably done something wrong and should try to
> fix it :)
>
> Best,
>
> J
>
>
> -- Forwarded message --
> From: Jon Katz <jk...@wikimedia.org>
> Date: Sat, Jan 20, 2018 at 2:08 PM
> Subject: Session #6 and into all hands
> To: Dan Garry <dga...@wikimedia.org>, "ggeller...@wikimedia.org" <
> ggeller...@wikimedia.org>, Joshua Minor <jmi...@wikimedia.org>, James
> Forrester <jforres...@wikimedia.org>, Ramsey Isler <ris...@wikimedia.org>,
> Toby Negrin <tneg...@wikimedia.org>, Runa Bhattacharjee <
> rbhattachar...@wikimedia.org>, Lydia Pintscher <
> lydia.pintsc...@wikimedia.de>, Charlotte Gauthier <cgauth...@wikimedia.org>,
> Neil Quinn <nqu...@wikimedia.org>, Corey Floyd <cfl...@wikimedia.org>,
> Trevor Bolliger <tbolli...@wikimedia.org>, Danny Horn <dh...@wikimedia.org>,
> Roan Kattouw <rkatt...@wikimedia.org>, Amir Aharoni <
> aahar...@wikimedia.org>, Nirzar Pangarkar <npangar...@wikimedia.org>,
> Olga Vasileva <ovasil...@wikimedia.org>, Joe Matazzoni <
> jmatazz...@wikimedia.org>, Anne Gomez <ago...@wikimedia.org>, Amanda
> Bittaker <abitta...@wikimedia.org>, Adam Baso 

[Analytics] Fwd: [Product] Fwd: Session #6 and into all hands

2018-01-31 Thread Nuria Ruiz
If you have time, do skim through these docs. I will do the same between
today and tomorrow,   they are pretty informative as to how annual plan is
and what audiences is doing.


-- Forwarded message --
From: Jon Katz 
Date: Tue, Jan 30, 2018 at 8:16 PM
Subject: [Product] Fwd: Session #6 and into all hands
To: WMF Product Team 


Hey Folks,

*Annual Planning Context*
Last week we came together and, among other things, talked about strategy
and annual planning.  As I promised when presenting, I wanted to share more
context and details about the annual planning process with you. Many of you
probably aren't interested in how we honed in on the product principles or
annual plan goal, but for those of you who *are* the process was pretty
heavily documented and you should feel free to dig in.  If you have
specific questions, pinging Danny, Josh or me is probably the fastest way
to get an answer.  We're also really interested in learning where the holes
are, so hearing your questions/feedback is really useful.

  Below is an email I shared with the planning group before all hands, but
here are some other artifacts you might find helpful:

   - Preso from the all-audiences meeting at all-hands.
   - Annual planning outcome doc - This document elaborates on what we think
   the theme means as well as what each of the 'output' groupings mean. It is
   definitely a living document, and subject to modification.
   - Working doc for the session before all-hands - This is the session where
   we identified the product groupings.
   - Session notes - Notes from every session of the coordinating group.
   - The emails sent (like the one below) before and after each session.
   - Shared folder with all documentation and output.


*Next Steps*
As far as next steps go, product owners will be working with their teams to
identify projects that fit into the core output groupings and that will
lead to the impact we've identified.




Once we have a rough sense of the year, if we haven't already, we will need
to run it by:
   - engineering, ASAP, to assess feasibility and timelines
   - adjacent teams and dependencies like CE, data analytics, etc.

In parallel, the data analysts, Danny and I are starting to define the
primary metrics we will use to evaluate success--we'll run those by the
PO's for approval. The annual plan draft is due Feb 23.

Again, please reach out with any questions or feedback. If you're
scratching your head, I've probably done something wrong and should try to
fix it :)

Best,

J


-- Forwarded message --
From: Jon Katz 
Date: Sat, Jan 20, 2018 at 2:08 PM
Subject: Session #6 and into all hands
To: Dan Garry , "ggeller...@wikimedia.org" <
ggeller...@wikimedia.org>, Joshua Minor , James
Forrester , Ramsey Isler ,
Toby Negrin , Runa Bhattacharjee <
rbhattachar...@wikimedia.org>, Lydia Pintscher , Charlotte Gauthier , Neil Quinn <
nqu...@wikimedia.org>, Corey Floyd , Trevor Bolliger <
tbolli...@wikimedia.org>, Danny Horn , Roan Kattouw <
rkatt...@wikimedia.org>, Amir Aharoni , Nirzar
Pangarkar , Olga Vasileva ,
Joe Matazzoni , Anne Gomez ,
Amanda Bittaker , Adam Baso 


Hey folks,

*TLDR:* More clarity for 2018-2019!  Over the last 2 weeks, the
coordinating team better-defined the audience department's impact theme and
identified 4 specific project areas (outputs) we think the teams should
focus their efforts on to generate that impact. The next steps are to
identify the specific projects under those areas and which team is working
on them.  We should start on this next week with our teams.  Reasonable
docs to look at: outcome doc, working doc for session #6, notes.

Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-30 Thread Nuria Ruiz
>I’m not totally sure if this works for you all, but I had pictured
generating aggregates from the page preview events, and then joining the
page preview aggregates with the >pageview aggregates into a new table with
an extra dimension specifying which type of content view was made.

In my opinion, the aggregated data should stay in two different tables. I can
see a future where the preview data comes in different types (it might
include rich media that was or was not played, simple popups versus "richer"
ones, and so on), and the dimensions used to represent that consumption are
not going to match pageview_hourly, which again only represents full page
loads well.
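For concreteness, the single-table alternative under discussion (tagging rows with an explicit view_type dimension) could be sketched like this, with hypothetical per-(project, country) aggregates:

```python
def union_with_view_type(pageviews, previews):
    """Combine two per-(project, country) aggregate dicts into one list of
    rows carrying an explicit view_type dimension."""
    rows = []
    for (project, country), n in sorted(pageviews.items()):
        rows.append({"project": project, "country": country,
                     "view_type": "pageview", "views": n})
    for (project, country), n in sorted(previews.items()):
        rows.append({"project": project, "country": country,
                     "view_type": "preview", "views": n})
    return rows

# Hypothetical aggregates keyed by (project, country)
pageviews = {("en.wikipedia", "US"): 120}
previews = {("en.wikipedia", "US"): 45}
for row in union_with_view_type(pageviews, previews):
    print(row)
```

The two-table argument above is precisely that the two row types may not share dimensions for long, at which point a forced union like this stops fitting.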

On Tue, Jan 30, 2018 at 12:02 AM, Andrew Otto  wrote:

> CoOOOl :)
>
> > Using the GeoIP cookie will require reconfiguring the EventLogging
> varnishkafka instance [0]
>
> I’m not familiar with this cookie, but, if we used it, I thought it would
> be sent back to by the client in the event. E.g. event.country =
> response.headers.country; EventLogging.emit(event);
>
> That way, there’s no additional special logic needed on the server side to
> geocode or populate the country in the event.
>
> However, if y’all can’t or don’t want to use the country cookie, then
> yaaa, we gotta figure out what to do about IPs and geocoding in
> EventLogging. There are a few options here, but none of them are great. The
> options basically are variations on ‘treat this event schema as special and
> make special conditionals in EventLogging processor code’, or, 'include IP
> and/or geocode all events in all schemas'. We’re not sure which we want to
> do yet, but we did mention this at our offsite today. I think we’ll figure
> this out and make it happen in the next week or two. Whatever the
> implementation ends up being, we’ll get geocoded data into this dataset.
>
> > Is the geocoding code that we use on webrequest_raw available as an
> Hive UDF or in PySpark?
> The IP is geocoded from wmf_raw.webrequest to wmf.webrequest using a Hive
> UDF
> 
> which ultimately just calls this getGeocodedData
> 
> function, which itself is just a wrapper around the Maxmind API. We may end
> up doing geocoding in the EventLogging server codebase (again, really not
> sure about this yet…), but if we do it will use the same Maxmind databases.
>
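
For readers without cluster access, the geocoding step can be sketched in
miniature. In production, getGeocodedData wraps the Maxmind API as
described above; in this sketch a plain dict stands in for the Maxmind
database, and the function name is made up for illustration:

```python
# Toy stand-in for the Maxmind lookup used by the Hive UDF. A dict
# emulates the database so the shape of the step is visible.
FAKE_MAXMIND_DB = {
    "203.0.113.7": "AU",   # TEST-NET-3 example address
    "198.51.100.9": "US",  # TEST-NET-2 example address
}

def geocode_country(ip: str) -> str:
    """Return an ISO country code for an IP, or 'Unknown' on a miss."""
    return FAKE_MAXMIND_DB.get(ip, "Unknown")

event = {"schema": "Popups", "ip": "203.0.113.7"}
# Geocode server-side, then drop the raw IP before storing the event.
event["country"] = geocode_country(event.pop("ip"))
print(event)  # {'schema': 'Popups', 'country': 'AU'}
```

The usage pattern shown — geocode, then discard the IP — matches the
privacy constraint discussed in this thread, where EventLogging does not
retain client IP addresses.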
>
> > Aggregating the EventLogging data in the same way that we aggregate
> webrequest data into pageviews data will require either: replicating the
> process that does this and keeping the two processes in sync; or
> abstracting away the source table from the aggregation process so that it
> can work on both tables
>
> I’m not totally sure if this works for you all, but I had pictured
> generating aggregates from the page preview events, and then joining the
> page preview aggregates with the pageview aggregates into a new table with
> an extra dimension specifying which type of content view was made.
>
>
> >  I’d appreciate it if someone could estimate how much work it will be
> to implement GeoIP information and the other fields from Pageview hourly
> for EventLogging events
>
> Ya we gotta figure this out still, but actual implementation shouldn’t be
> difficult, however we decide to do it.
>
> On Mon, Jan 29, 2018 at 10:30 PM, Sam Smith 
> wrote:
>
>> Hullo all,
>>
>> It seems like we've arrived at an implementation for the client-side (JS)
>> part of this problem: use EventLogging to track a page interaction from
>> within the Page Previews code. This'll give us the flexibility to take
>> advantage of a stream processing solution if/when it becomes available,
>> to push the definition of a "Page Previews page interaction" to the
>> client, and to rely on any events that we log in the immediate future
>> ending up in tables that we're already familiar with.
>>
>> In principle, I agree with Andrew's argument that adding additional
>> filtering logic to the webrequest refinement process will make it harder
>> to change existing definitions of views or add others in future. In
>> practice though, we'll need to:
>> - Ensure that the server-side EventLogging component records metadata
>> consistent with our existing content consumption measurement, concretely:
>> the fields available in the
>> https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageview_hourly
>> table. In particular, that it either doesn't discard the client IP or
>> utilizes the GeoIP cookie sent by the client for this schema.
>> - Aggregate the resulting table so that it can be combined with the pageviews

Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-19 Thread Nuria Ruiz
>Thanks, good to know - is there a report around that? I'm wondering how
"missing requests" ought to be expressed with some margin of error.
I think the people who can quantify this best are your team. If anything,
from what I remember from the popups experiments, the inflow of events was
higher than the expected calculations. Overall usage of DNT for FF users
was about ~10% last time we looked at it; overall usage in our userbase is
quite a bit smaller, I bet.

https://blog.mozilla.org/netpolicy/2013/05/03/mozillas-new-do-not-track-dashboard-firefox-users-continue-to-seek-out-and-enable-dnt/

On Fri, Jan 19, 2018 at 10:09 AM, Adam Baso  wrote:

>
> >Thanks, Sam. Nuria, that's what I was getting at - if using the EL JS
>> library would some sort of new method be needed so that these impressions
>> aren't undercounted?
>> If we had a lot of users with DNT, maybe, from our tests when we enabled
>> that on EL this is not the case.
>>
>
> Thanks, good to know - is there a report around that? I'm wondering how
> "missing requests" ought to be expressed with some margin of error.
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-19 Thread Nuria Ruiz
>So maybe it's worth considering which approach takes us closer to that?
AIUI the beacon puts the record into the webrequest table and from there it
would only take some trivial preprocessing to replace the beacon URL with
the virtual URL and add the beacon type as a "virtual_type" field or
something, making it very easy to expose it everywhere where views are
tracked, while EventLogging data gets stored in a different, unrelated way.
Anything that involves combing *1 terabyte of data a day and 150,000
requests per second at peak* cannot be considered "simple" or "trivial".
Rather than looking for a needle in the haystack, let's please rely on the
client to send you preselected data (events). That data can be aggregated
later in different ways, and the fact that the data comes from event
logging does not dictate how aggregation needs to happen.




On Wed, Jan 17, 2018 at 6:09 PM, Gergo Tisza <gti...@wikimedia.org> wrote:

> On Wed, Jan 17, 2018 at 10:54 AM, Nuria Ruiz <nu...@wikimedia.org> wrote:
>
>> Recording "preview_events" is really no different than recording any
>> other kind of UI event, difference is going to come from scale if anything,
>> as they are probably tens of thousands of those per second (I think your
>> team already estimated volume, if so please send those estimates along)
>>
>
> Conceptually I think a virtual pageview is a different thing from a UI
> event (which is how e.g. Google Analytics handles it, there is a method to
> send an event for the current page and a different method to send a virtual
> pageview for a different page), and the ideal way it is exposed in an
> analytics system should be very different. (I would want to see virtual
> pageviews together with normal pageviews, with some filtering option. If I
> deploy code that shows previews and converts users from making real
> pageviews to making virtual pageviews, I want to see how the total
> pageviews changed in the normal pageview stats; I don't want to have to
> create that chart and export one dataset from pageviews and one dataset
> from eventlogging to do that. As a user, I want to see in the fileview API
> how many people looked at the photo I uploaded, I don't particularly care
> if they used MediaViewer or not. etc.)
>
> So maybe it's worth considering which approach takes us closer to that?
> AIUI the beacon puts the record into the webrequest table and from there it
> would only take some trivial preprocessing to replace the beacon URL with
> the virtual URL and add the beacon type as a "virtual_type" field or
> something, making it very easy to expose it everywhere where views are
> tracked, while EventLogging data gets stored in a different, unrelated way.
>


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-19 Thread Nuria Ruiz
>Thanks, Sam. Nuria, that's what I was getting at - if using the EL JS
library would some sort of new method be needed so that these impressions
aren't undercounted?
If we had a lot of users with DNT, maybe; from our tests when we enabled
that on EL, this is not the case. Your team has already run experiments on
this functionality and they can speak as to the projection of numbers.

On Fri, Jan 19, 2018 at 3:05 AM, Adam Baso  wrote:

> Thanks, Sam. Nuria, that's what I was getting at - if using the EL JS
> library would some sort of new method be needed so that these impressions
> aren't undercounted?
>
> On Fri, Jan 19, 2018 at 4:49 AM, Sam Smith  wrote:
>
>> On Thu, Jan 18, 2018 at 9:57 PM, Adam Baso  wrote:
>>
>>> Adding to this, one thing to consider is DNT - is there a way to invoke
>>> EL so that such traffic is appropriately imputed or something?
>>>
>>
>> The EventLogging client respects DNT [0]. When the user enables DNT,
>> mw.eventLog.logEvent is a NOP.
>>
>> I don't see any mention of DNT in the Varnish VCLs around the /beacon
>> endpoint or otherwise but it may be handled elsewhere. While it's unlikely,
>> there's nothing stopping a client sending a well-formatted request to the
>> /beacon/event endpoint directly [1], ignoring the user's choice.
>>
>> -Sam
>>
>> [0] https://phabricator.wikimedia.org/diffusion/EEVL/browse/master/modules/ext.eventLogging.core.js;4480f7e27140fcb8ae915c1755223fd7a5bab9b9$251
>> [1] https://phabricator.wikimedia.org/diffusion/EEVL/browse/master/modules/ext.eventLogging.core.js;4480f7e27140fcb8ae915c1755223fd7a5bab9b9$215
>>
>>
>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-18 Thread Nuria Ruiz
> I don't see how this addresses Gergo's larger point about the difference
between consistently tallying content consumption (pageviews, previews,
mediaviewer image views) and analyzing UI interactions (which is the main
use case that EventLogging has been developed and used for).

Event logging use cases are events; as we move to a thicker client - more
javascript heavy - you will need to measure events for nearly everything,
and whether those are to be considered "content consumption" or "ui
interaction" is not that relevant. Example: video plays are content
consumption and are also "ui interactions".

We are the only major website that does not have a thick client, so this
notion of joining UI interactions and consumption is new to us, but really
it is not that new at all.


On Thu, Jan 18, 2018 at 3:17 PM, Tilman Bayer <tba...@wikimedia.org> wrote:

>
> On Thu, Jan 18, 2018 at 8:16 AM, Nuria Ruiz <nu...@wikimedia.org> wrote:
>
>> Gergo,
>>
>> >while EventLogging data gets stored in a different, unrelated way
>> Not really, This has changed quite a bit as of the last two quarters.
>> Eventlogging data as of recent gets preprocessed and refined similar to how
>> webrequest data is preprocessed and refined. You can have a dashboard on
>> top of some eventlogging schemas on superset in the same way you have a
>> dashboard that displays pageview data on superset.
>>
>
> I don't see how this addresses Gergo's larger point about the difference
> between consistently tallying content consumption (pageviews, previews,
> mediaviewer image views) and analyzing UI interactions (which is the main
> use case that EventLogging has been developed and used for). There are
> really quite a few differences between these two. For example, UI
> instrumentations on the web are almost always sampled, because that yields
> enough data to answer UI questions - but on the other hand tend to record
> much more detail about the individual interaction. In contrast, we register
> all pageviews unsampled, but don't keep a permanent record of every single
> one of them with precise timestamps - rather, we have aggregated tables
> (pageview_hourly in particular). Our EventLogging backend is not tailored
> to that.
>
>
>
>>
>> See dashboards on superset (user required).
>>
>> https://superset.wikimedia.org/superset/dashboard/7/?presele
>> ct_filters=%7B%7D
>>
>> And (again, user required) EL data on druid, this very same data we are
>> talking about, page previews:
>>
>> https://pivot.wikimedia.org/#tbayer_popups
>>
>
> That's actually not the "very same data we are talking about". You can
> rest assured that the web team (and Sam in particular) has already been
> aware of the existence of the Popups instrumentation for page previews. The
> team spent considerable effort building it in order to understand how users
> interact with the feature's UI. Now comes the separate effort of
> systematically tallying content consumption from this new channel. Superset
> and Pivot are great, but are nowhere near providing all the ways that WMF
> analysts and community members currently have to study pageview data.
> Storing data about seen previews in the same way as we do for pageviews,
> for example in the pageview_hourly (suitably tagged, perhaps giving that
> table a more general name) would facilitate that a lot, by allowing us to
> largely reuse the work that during the past few years went into getting
> pageview aggregation right.
>
>
>>
>> >I was going to make the point that #2 already has a processing pipeline
>> established whereas #1 doesn't.
>> This is incorrect, we mark as "preview" data that we want to exclude
>> from processing, see:
>> https://github.com/wikimedia/analytics-refinery-source/blob/
>> master/refinery-core/src/main/java/org/wikimedia/analytics/r
>> efinery/core/PageviewDefinition.java#L144
>> Naming is unfortunate but previews are really "preloads", as in requests
>> we make (and cache locally) that may or may not be shown to users.
>>
>>
>> But again, tracking of events is better done on an event based system and
>> EL is such a system.
>>
>>
>> Again, tracking of individual events is not the ultimate goal here.
>
>
> --
> Tilman Bayer
> Senior Analyst
> Wikimedia Foundation
> IRC (Freenode): HaeB
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-18 Thread Nuria Ruiz
>Adding to this, one thing to consider is DNT - is there a way to invoke EL
so that such traffic is appropriately imputed or something?

I am not sure what you are asking ...

On Thu, Jan 18, 2018 at 1:57 PM, Adam Baso <ab...@wikimedia.org> wrote:

> (I'd defer to the Readers Web team with Tilman on whether country
> extracted from the cookie would be sufficient.)
>
> Adding to this, one thing to consider is DNT - is there a way to invoke EL
> so that such traffic is appropriately imputed or something?
>
> -Adam
>
> On Thu, Jan 18, 2018 at 2:13 PM, Andrew Otto <o...@wikimedia.org> wrote:
>
>> >  In particular, will we be able to sort by country, OS, Browser, etc?
>> OS, Browser, yes.  User Agent parsing is done by the EventLogging
>> processors.
>>
>> Country not quite as easily, as EventLogging does not include client
>> IP addresses.  We could consider putting this back in somehow, or, I’ve
>> also heard that there is a geocoded country cookie that varnish will set
>> that the browser could send back as part of the event.  Is country enough
>> geo detail?
>>
>>
>>
>> On Thu, Jan 18, 2018 at 2:30 PM, Olga Vasileva <ovasil...@wikimedia.org>
>> wrote:
>>
>>> Hi all,
>>>
>>> I just want to confirm that the proposed method using Eventlogging will
>>> allow us to gather data in a similar fashion to the web request table.  In
>>> particular, will we be able to sort by country, OS, Browser, etc?  Our goal
>>> here is to be able to consider the new page interactions metric on the same
>>> level and with the same depth as pageviews.
>>>
>>> Thanks!
>>>
>>> - Olga
>>>
>>> On Thu, Jan 18, 2018 at 12:46 PM Andrew Otto <o...@wikimedia.org> wrote:
>>>
>>>> > the beacon puts the record into the webrequest table and from there
>>>> it would only take some trivial preprocessing
>>>> ‘Trivial’ preprocessing that has to look through 150K requests per
>>>> second! This is a lot of work!
>>>>
>>>> > tracking of events is better done on an event based system and EL is
>>>> such a system.
>>>> I agree with this too.  We really want to discourage people from trying
>>>> to measure things by searching through the huge haystack of all
>>>> webrequests.  To measure something, you should emit an event if you can.
>>>> If it were practical, I’d prefer that we did this for pageviews as well.
>>>> Currently, we need a complicated definition of what a pageview is, which
>>>> really only exists in the Java implementation in the Hadoop cluster.  It’d
>>>> be much clearer if app developers had a way to define themselves what
>>>> counts as a pageview, and emit that as an event.
>>>>
>>>> This should be the approach that people take when they want to measure
>>>> something new.  Emit an event!  This event will get its own Kafka topic
>>>> (you can consume this to do whatever you like with it), and be refined into
>>>> its own Hive table.
>>>>
>>>> >  I don’t want to have to create that chart and export one dataset
>>>> from pageviews and one dataset from eventlogging to do that.
>>>>  If you also design your schema nicely
>>>> <https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging/Schema_Guidelines>,
>>>> it will be easily importable into Druid and usable in Pivot and Superset,
>>>> alongside of pageviews.  We’re working on getting nice schemas 
>>>> automatically
>>>> imported into druid <https://gerrit.wikimedia.org/r/#/c/386882/>.
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Jan 18, 2018 at 11:16 AM, Nuria Ruiz <nu...@wikimedia.org>
>>>> wrote:
>>>>
>>>>> Gergo,
>>>>>
>>>>> >while EventLogging data gets stored in a different, unrelated way
>>>>> Not really, This has changed quite a bit as of the last two quarters.
>>>>> Eventlogging data as of recent gets preprocessed and refined similar to 
>>>>> how
>>>>> webrequest data is preprocessed and refined. You can have a dashboard on
>>>>> top of some eventlogging schemas on superset in the same way you have a
>>>>> dashboard that displays pageview data on superset.
>>>>>
>>>>> See dashboards on superset (user required).
>>>>>
>>>>> htt

Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-18 Thread Nuria Ruiz
Gergo,

>while EventLogging data gets stored in a different, unrelated way
Not really. This has changed quite a bit as of the last two quarters.
Eventlogging data now gets preprocessed and refined similarly to how
webrequest data is preprocessed and refined. You can have a dashboard on
top of some eventlogging schemas on superset in the same way you have a
dashboard that displays pageview data on superset.

See dashboards on superset (user required).

https://superset.wikimedia.org/superset/dashboard/7/?preselect_filters=%7B%7D

And (again, user required) EL data on druid, this very same data we are
talking about, page previews:

https://pivot.wikimedia.org/#tbayer_popups


>I was going to make the point that #2 already has a processing pipeline
established whereas #1 doesn't.
This is incorrect; we mark as "preview" data that we want to exclude from
processing, see:
https://github.com/wikimedia/analytics-refinery-source/blob/master/refinery-core/src/main/java/org/wikimedia/analytics/refinery/core/PageviewDefinition.java#L144
Naming is unfortunate but previews are really "preloads", as in requests we
make (and cache locally) that may or may not be shown to users.
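
The exclusion performed by that Java definition can be sketched like this
(a simplified re-expression for illustration, not the real
PageviewDefinition code; the field names are made up):

```python
def is_preview(x_analytics: str) -> bool:
    """True if the X-Analytics header tags this request as a preview,
    e.g. 'preview=1;https=1'. Simplified from the Java definition."""
    tags = dict(
        kv.split("=", 1) for kv in x_analytics.split(";") if "=" in kv
    )
    return tags.get("preview") == "1"

requests = [
    {"uri": "/wiki/Cat", "x_analytics": "https=1"},
    {"uri": "/wiki/Dog", "x_analytics": "preview=1;https=1"},
]
# Preview-tagged requests are excluded from the pageview count.
pageviews = [r for r in requests if not is_preview(r["x_analytics"])]
print(len(pageviews))  # 1
```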


But again, tracking of events is better done on an event based system and
EL is such a system.


Re: [Analytics] How best to accurately record page interactions in Page Previews

2018-01-17 Thread Nuria Ruiz
(Moving ops list to bcc)

>Are there other ways of recording this information? We're fairly confident
that #1 seems like the best choice here but it's referred to as the
"virtual file view hack". Is this really the case?
Yes, there are, please use eventlogging.

Recording "preview_events" is really no different than recording any other
kind of UI event; the difference is going to come from scale if anything,
as there are probably tens of thousands of those per second (I think your
team already estimated volume; if so, please send those estimates along).

We discourage you from sending events directly to beacon. Rather, use the
EL client to send a page-preview event defined in a given schema. This is a
similar approach as to how we will be measuring banner impressions for
fundraising banners in the future.

Thanks,

Nuria



On Wed, Jan 17, 2018 at 1:51 AM, Sam Smith  wrote:

> Hullo,
>
> Page Previews is now fully deployed to all but 2 of the Wikipedias. In
> deploying it, we've created a new way to interact with pages without
> navigating to them. This impacts the overall and per-page pageviews metrics
> that are used in myriad reports, e.g. to editors about the readership of
> their articles and in monthly reports to the board. Consequently, we need
> to be able to report a user reading the preview of a page just like we do
> them navigating to it.
>
> Readers Web are planning to instrument Page Previews such that when a
> preview is available and open for longer than X ms, a "page interaction" is
> recorded. We're aware of a couple of mechanisms for recording something
> like this from the client:
>
>1. All files viewed with the media viewer are recorded by the client
>requesting the /beacon/media?duration=X=Y URL at some point [0] – as
>Nuria points out in that thread, requests to /beacon/... are already
>filtered and a canned response is sent immediately by Varnish [1].
>2. Requesting a URL with the X-Analytics header [2] set to "preview".
>In this context, we'd make a HEAD request to the URL of the page with the
>header set.
>
> IMO #1 is preferable from the operations and performance perspectives as
> the response is always served from the edge and includes very few headers,
> whereas the request in #2 may be served by the application servers if the
> user is logged in (or in the mobile site's beta cohort). However, the
> requests in #2 are already
>
> We're currently considering recording page interactions when previews are
> open for longer than 1000 ms. We estimate that this would increase overall
> web requests by 0.3% [3].
>
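
For a rough sense of scale, combining the 0.3% estimate above with the
~150,000 requests/second peak webrequest rate cited elsewhere in this
thread (back-of-envelope arithmetic only):

```python
peak_webrequests_per_sec = 150_000   # peak rate cited in this thread
increase_fraction = 0.003            # the 0.3% estimate above

extra_requests_per_sec = peak_webrequests_per_sec * increase_fraction
print(extra_requests_per_sec)  # ≈ 450 additional requests/sec at peak
```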
> Are there other ways of recording this information? We're fairly confident
> that #1 seems like the best choice here but it's referred to as the
> "virtual file view hack". Is this really the case? Moreover, should we
> request a distinct URL, e.g. /beacon/preview?duration=X=Y, or should
> we consolidate the URLs as both represent the same thing essentially?
>
> Thanks,
>
> -Sam
>
> Timezone: GMT
> IRC (Freenode): phuedx
>
> [0] https://lists.wikimedia.org/pipermail/analytics/2015-March/003633.html
> [1] https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modules/varnish/templates/vcl/wikimedia-frontend.vcl.erb;1bce79d58e03bd02888beef986c41989e8345037$269
> [2] https://wikitech.wikimedia.org/wiki/X-Analytics
> [3] https://phabricator.wikimedia.org/T184793#3901365
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


Re: [Analytics] Reboot of eventlog1001 for kernel upgrades

2018-01-15 Thread Nuria Ruiz
>If you see a dip in Eventlogging schema metrics
(https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?orgId=1)
it will be my fault :)

To be super clear: the host will stop consuming for as long as it is being
rebooted; it will pick up past data once it comes back online.

On Mon, Jan 15, 2018 at 5:38 AM, Luca Toscano 
wrote:

> Hi everybody,
>
> I am about to reboot eventlog1001 for kernel upgrades. This host runs all
> the Eventlogging daemons that pull data from Kafka, elaborate it and then
> push to Mysql. The maintenance is needed to deploy the new Linux Kernel
> that fixes the Meltdown vulnerability.
>
> If you see a dip in Eventlogging schema metrics (
> https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?orgId=1)
> it will be my fault :)
>
> Thanks!
>
> Luca (on behalf of the Analytics team)
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


Re: [Analytics] [Engineering] Important news about Analytics databases

2017-11-22 Thread Nuria Ruiz
>The log database is scheduled to be dropped from dbstore1002 on Tuesday
28th. After that, the log database will be available only on db1108
(analytics-slave.eqiad.wmnet).
To make sure everyone is on the same page: this means that you need to
connect to analytics-slave.eqiad.wmnet if you wish to look at eventlogging
data on MySQL.

On Tue, Nov 21, 2017 at 6:28 AM, Luca Toscano 
wrote:

> Hi everybody,
>
> some updates about the status of the Analytics databases refactoring:
>
> 1) analytics-slave.eqiad.wmnet's CNAME now points to db1108, the new host.
> The staging database that was on db1047 (the old CNAME) has been copied to
> db1108 so all the previous data is there. We are working with the owners of
> the remaining databases to figure out what to back up and what not; after
> that we'll be able to finally decommission db1047.
>
> 2) s[12]-analytics.eqiad.wmnet now point to dbstore1002.eqiad.wmnet
> (analytics-store). Previously they were pointing to db1047 (old CNAME for
> analytics-slave).
>
> 3) the analytics-store.eqiad.wmnet log database (dbstore1002.eqiad.wmnet)
> is going to be dropped soon (it was scheduled for the 20th but we thought
> to wait a bit more). Some people already followed up in
> https://phabricator.wikimedia.org/T156844 and as far as I can see there
> is no opposition to proceed, please ping us otherwise.
> The log database is scheduled to be dropped from dbstore1002 on Tuesday
> 28th. After that, the log database will be available only on db1108
> (analytics-slave.eqiad.wmnet).
>
> Thanks a lot!
>
> Luca
>
> 2017-11-08 12:02 GMT+01:00 Luca Toscano :
>
>> Hi everybody,
>>
>> the Analytics team needs to make some changes to the current
>> configuration and deployment of the Analytics databases. Before starting a
>> little refresh to be on the same page:
>>
>> - db1046 - eventlogging master database
>> - db1047 - also known as analytics-slave.eqiad.wmnet - replicates via
>> mysql s1/s2 and the log database (on db1046) using a custom replication
>> script.
>> - dbstore1002 - also known as analytics-store.eqiad.wmnet and
>> x1-analytics-slave.eqiad.wmnet - replicates most of the S shards and X1 via
>> mysql, and the log database using a custom replication script.
>> - db1108 (brand new host) - replicates the log database using a custom
>> replication script.
>>
>> We have been suffering during the past months some space and performance
>> issues on dbstore1002 (https://phabricator.wikimedia.org/T168303), so we
>> came up with the following plan:
>>
>> - db1108, a brand new host with SSD disks, replaces db1047 and becomes
>> the CNAME of analytics-slave.eqiad.wmnet. This new host will be a replica
>> of the log database only, no other database will be replicated.
>> - dbstore1002 will lose the support of the log database, which will be
>> dropped from the host.
>> - db1047 will eventually be decommissioned (after backing up data and
>> alert people beforehand - T156844).
>>
>> This will allow us to:
>> 1) Reduce the load on dbstore1002 and free a lot of space on the host.
>> 2) Offer a more performant way to query eventlogging analytics data.
>> 3) Reduce the current performance issues that we have been experiencing
>> while trying to sanitize/purge old event-logging data (
>> https://phabricator.wikimedia.org/T156933)
>>
>> The plan is the following:
>>
>> - November 13th: the analytics-slave CNAME moves from db1047 to db1108
>> - November 20th: the log database will be dropped from
>> dbstore1002/analytics-store together with the event-logging replication
>> script
>> - December 4th: shutdown of db1047 (prior backup of non-log database
>> tables)
>>
>> More info in https://phabricator.wikimedia.org/T156844
>>
>> To summarize what will change from the users perspective:
>>
>> - dbstore1002 (analytics-store) will offer all the S/X shards replication
>> (wikis) and all the databases like staging that everybody is used to work
>> with. It will only lose the support of the log database.
>> - db1108 will offer the log database replication and a staging database.
>> - the db1047's (analytics-slave) staging database will be moved or copied
>> with a different name (like staging_db1047) to dbstore1002.
>>
>> Please let us know in the task your opinion in T156844, we'd love to hear
>> some feedback before proceeding, especially about extra requirements that
>> we haven't thought of.
>>
>> Thanks!
>>
>> Luca (on behalf of the Analytics team)
>>
>>
>
> ___
> Engineering mailing list
> engineer...@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/engineering
>
>


Re: [Analytics] Undocumented project code in pagecounts-ez

2017-11-22 Thread Nuria Ruiz
Maybe this doc will help?

https://wikitech.wikimedia.org/wiki/Analytics/Archive/Data/Pagecounts-all-sites#Disambiguating_abbreviations_ending_in_.E2.80.9C.m.E2.80.9D

On Tue, Nov 14, 2017 at 1:29 PM, Michael Baldwin 
wrote:

> Thanks, Federico.
>
> In the docs you referenced, I can't find any reference to "en.m" that
> contrasts with the "en.z". This page
>  describes
> codes, but they're different from the ones in pagecounts-ez
> .
>
> I've noticed the "en.m" lines only started appearing in Dec 2015.
>
> I'm just trying to understand, if I want the most accurate pagecounts over
> time, should I be including the "en.m" lines on top of "en.z", or are they
> something different?
>
>
> On Tue, Nov 14, 2017 at 7:24 AM, Federico Leva (Nemo) 
> wrote:
>
>> Michael Baldwin, 14/11/2017 04:43:
>>
>>> However, I've been coming across a large number of wiki codes "en.m".
>>> The "m" code is undocumented. It appears to be the mobile version of
>>> Wikipedia, but can anyone confirm that? Should the page be updated with
>>> this information?
>>>
>>
>> Historically we collect most docs here:
>> 
>> 
>>
>> Federico
>>
>
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>


Re: [Analytics] research process (was Re: Google Code-in: Get your tasks for young contributors prepared!)

2017-11-17 Thread Nuria Ruiz
Lars:

This is not so much a request for a research project but rather an ad-hoc
request for data. This request is similar to the one you mentioned on a
prior thread:
https://www.mail-archive.com/analytics@lists.wikimedia.org/msg03760.html
for which you filed this phabricator ticket:
https://phabricator.wikimedia.org/T144714

As Leila mentioned, our resources to attend to such requests are very
limited, but more so, in this case most of the data you are interested in
we do not have, for the reasons explained prior. I am actually not sure we
would have any data at all that could help you, to be honest.

In case anyone wonders our FAQ for ad-hoc data requests is here:
https://meta.wikimedia.org/wiki/Research:FAQ#Where_do_I_find_data_or_statistics_about_a_specific_Product_Audience.3F

Thanks,

Nuria

On Fri, Nov 17, 2017 at 8:07 PM, Leila Zia <le...@wikimedia.org> wrote:

> An update on this request:
>
> Lars and I went off-list for a bit (Nuria and Mikhail are cc-ed in
> those conversations). Research doesn't have capacity to pick this task
> up at the moment, but if other people with appropriate access have
> bandwidth to pick it up and respond to it, they should feel free to. A
> few things for those who may be able to help:
> * Lars confirmed that even the skewed data from very specific browsers
> may help them gain some insight and it can be better than not knowing
> anything extra at all (the current case).
> * If you decide to work on a query and release the data, please ping
> Research and Legal before releasing it unless the data is highly
> aggregated. This can especially be important in this case where only a
> few not-very-widely used browsers are sending this information to our
> servers.
>
> Lars: I'm sorry that Research was not able to be of help. With the
> best of our intentions, we have to say no to so many requests. We need
> to be aware of our already long backlogs, but also aware of other
> teams' backlogs that will be affected by our decision. In this case,
> depending on which path we go with, Research commitment can mean
> Security, Legal, Analytics, and Tech Ops commitment and work.
>
> Thank you for your understanding, and I'm here to help if someone else
> picks up this task and they need Research input.
>
> Best,
> Leila
>
> --
> Leila Zia
> Senior Research Scientist
> Wikimedia Foundation
>
>
> On Tue, Nov 7, 2017 at 12:22 PM, Nuria Ruiz <nu...@wikimedia.org> wrote:
> > I would say that referrer "origin-when-cross-origin" (Send a full URL
> when
> > performing a same-origin request, but only send the origin of the
> document
> > for other cases) is probably the most widely deployed default on the
> > internets, we use it as well as google, facebook...
> >
> >
> > For wikipedia, see: https://phabricator.wikimedia.org/T87276
> >
> > On Tue, Nov 7, 2017 at 12:07 PM, Lars Noodén <lars.noo...@gmail.com>
> wrote:
> >>
> >> On 11/07/2017 09:49 PM, Mikhail Popov wrote:
> >> > The referrer policy is already in use at Google, which is why we
> >> > don't see users' search queries in referer field in our request logs;
> >> > just that they came from Google.
> >>
> >> Thanks.  I'm looking at the current version:
> >>  https://www.w3.org/TR/referrer-policy/
> >>
> >> Are there any published articles, statistics, or reports about how
> >> widely referrer policy has already been deployed?
> >>
> >> /Lars
> >>
> >> ___
> >> Analytics mailing list
> >> Analytics@lists.wikimedia.org
> >> https://lists.wikimedia.org/mailman/listinfo/analytics
> >
> >
> >
> > ___
> > Analytics mailing list
> > Analytics@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/analytics
> >
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] research process (was Re: Google Code-in: Get your tasks for young contributors prepared!)

2017-11-07 Thread Nuria Ruiz
I would say that the referrer policy "origin-when-cross-origin" (send the
full URL when performing a same-origin request, but only send the origin of
the document for other cases) is probably the most widely deployed default
on the internets; we use it, as do Google, Facebook...


For wikipedia, see: https://phabricator.wikimedia.org/T87276
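A rough sketch of what that policy sends — `referrerFor` is an invented helper for illustration, not a browser API, and real browsers also strip fragments and credentials:

```javascript
// Sketch of "origin-when-cross-origin": full URL for same-origin requests,
// origin only for cross-origin ones. Illustrative helper, not browser internals.
function referrerFor(fromUrl, toUrl) {
  const from = new URL(fromUrl);
  const to = new URL(toUrl);
  return from.origin === to.origin
    ? from.origin + from.pathname + from.search // full URL (sans fragment)
    : from.origin + '/';                        // origin only
}

// Navigating within the same wiki: the full page URL is sent.
console.log(referrerFor(
  'https://en.wikipedia.org/wiki/Privacy?uselang=en',
  'https://en.wikipedia.org/wiki/Beacon'
)); // https://en.wikipedia.org/wiki/Privacy?uselang=en

// Leaving the site: only the origin is sent.
console.log(referrerFor(
  'https://en.wikipedia.org/wiki/Privacy?uselang=en',
  'https://example.org/'
)); // https://en.wikipedia.org/
```

This is why external sites only see which wiki a click came from, not which page.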

On Tue, Nov 7, 2017 at 12:07 PM, Lars Noodén  wrote:

> On 11/07/2017 09:49 PM, Mikhail Popov wrote:
> > The referrer policy is already in use at Google, which is why we
> > don't see users' search queries in referer field in our request logs;
> > just that they came from Google.
>
> Thanks.  I'm looking at the current version:
>  https://www.w3.org/TR/referrer-policy/
>
> Are there any published articles, statistics, or reports about how
> widely referrer policy has already been deployed?
>
> /Lars
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Heads up: mw.track client-side EventLogging mechanism "ignored" certain events

2017-10-13 Thread Nuria Ruiz
>Regarding "minority of cases", is this based on a quantitative
>estimate? I would be interested to learn more about this.
Yes, this is a race condition; subscribing to onload is an anti-pattern, as
it is not a promise, but more often than not it would have worked fine.
You can see that in the data too (which might make more sense to you): the
fix went live on 2017-09-28 and there is no noticeable increase in
NavigationTiming events:
https://grafana.wikimedia.org/dashboard-solo/db/eventlogging-schema?orgId=1;
var-schema=NavigationTiming=1505250864328=1507842864328=9


In the Popups case you might have hit this race condition perhaps more often
(that is possible); it will be easy enough to verify when you set up your
next experiment.

Thanks,

Nuria



On Thu, Oct 12, 2017 at 9:33 PM, Tilman Bayer <tba...@wikimedia.org> wrote:

> Regarding "minority of cases", is this based on a quantitative
> estimate? I would be interested to learn more about this.
>
> In any case though, Nuria is correct in pointing out that it's less
> than 100% - it's certainly possible that events are being sent again
> later in the session (this happened e.g. when I reproduced the bug
> here: https://phabricator.wikimedia.org/T175918#3612580 , and is also
> evident in data e.g. from the previous Popups experiments).
>
> On Thu, Oct 12, 2017 at 2:14 PM, Nuria Ruiz <nu...@wikimedia.org> wrote:
> > Please have in mind that hiting this bug is a race condition and it is
> hit
> > in a minority of cases, not all times. The essence of the bug has to do
> with
> > the subscription to the "load" event. In some instances the event had
> > already happened by the time the EL code was loaded.
> >
> > Thanks,
> >
> > Nuria
> >
> >
> >
> >
> >
> > On Thu, Oct 12, 2017 at 7:35 AM, Dan Andreescu <dandree...@wikimedia.org
> >
> > wrote:
> >>
> >> Thanks for the post, this bug will definitely bias any data people got
> >> with mw.track.  If the data is found to be so broken as to be useless,
> >> should we delete it up through the date the fix goes live?  Asking
> people
> >> who use mw.track, not Sam
> >>
> >> On Thu, Oct 12, 2017 at 6:41 AM, Sam Smith <samsm...@wikimedia.org>
> wrote:
> >>>
> >>> o/
> >>>
> >>> Prior to Thursday, 28th September, if your client-side EventLogging
> >>> instrumentation logged event via mw.track, then only events tracked
> >>> during the first pageview of a user's session were logged.
> >>>
> >>> Now, technically, the events weren't ignored or dropped. Instead, the
> >>> subscriber for the "event" topic was never subscribed when the module
> >>> was loaded from the ResourceLoader's cache and so events published on
> >>> that topic simply weren't received and logged.
> >>>
> >>> This bug was discovered while testing some instrumentation maintained
> >>> by Readers Web [0] and independently by Timo Tijhof, who submitted the
> >>> ideal fix [1] promptly.
> >>>
> >>> -Sam
> >>>
> >>> [0] https://phabricator.wikimedia.org/T175918
> >>> [1] https://gerrit.wikimedia.org/r/#/c/378804/
> >>>
> >>> ---
> >>> Engineering Manager
> >>> Readers
> >>>
> >>> Timezone: BST (UTC+1)
> >>> IRC (Freenode): phuedx
> >>>
> >>> ___
> >>> Analytics mailing list
> >>> Analytics@lists.wikimedia.org
> >>> https://lists.wikimedia.org/mailman/listinfo/analytics
> >>
> >>
> >>
> >> ___
> >> Analytics mailing list
> >> Analytics@lists.wikimedia.org
> >> https://lists.wikimedia.org/mailman/listinfo/analytics
> >>
> >
> >
> > ___
> > Analytics mailing list
> > Analytics@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/analytics
> >
>
>
>
> --
> Tilman Bayer
> Senior Analyst
> Wikimedia Foundation
> IRC (Freenode): HaeB
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Heads up: mw.track client-side EventLogging mechanism "ignored" certain events

2017-10-12 Thread Nuria Ruiz
Please bear in mind that hitting this bug is a race condition, and it is hit
in a minority of cases, not every time. The essence of the bug has to do
with the subscription to the "load" event: in some instances the event had
already happened by the time the EL code was loaded.
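To make the race concrete, here is a minimal sketch using a toy stand-in for the browser page — this is illustrative only, not the actual MediaWiki/EventLogging code; `makePage`, `subscribeNaive`, and `subscribeSafe` are invented names:

```javascript
// Toy stand-in for a page's "load" event (not MediaWiki code), showing why
// subscribing to onload loses the race when the event has already fired.
function makePage() {
  const listeners = [];
  let loaded = false;
  return {
    addEventListener(name, fn) {
      if (name === 'load') listeners.push(fn); // naive: no "already loaded" check
    },
    fireLoad() {
      loaded = true;
      listeners.forEach(fn => fn());
    },
    get loaded() { return loaded; }
  };
}

// Anti-pattern: subscribe unconditionally and hope load hasn't happened yet.
function subscribeNaive(page, onLoad) {
  page.addEventListener('load', onLoad);
}

// Promise-like fix: if load already happened, run the handler immediately.
function subscribeSafe(page, onLoad) {
  if (page.loaded) {
    onLoad();
  } else {
    page.addEventListener('load', onLoad);
  }
}

// The cached-module case: the page finished loading before the code ran.
const page = makePage();
page.fireLoad();

let naiveRan = false;
let safeRan = false;
subscribeNaive(page, () => { naiveRan = true; });
subscribeSafe(page, () => { safeRan = true; });

console.log(naiveRan, safeRan); // false true -- the naive listener never fires
```

Whether the real code hits the bug depends on module timing, which is why it only shows up in some sessions.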

Thanks,

Nuria





On Thu, Oct 12, 2017 at 7:35 AM, Dan Andreescu 
wrote:

> Thanks for the post, this bug will definitely bias any data people got
> with mw.track.  If the data is found to be so broken as to be useless,
> should we delete it up through the date the fix goes live?  Asking people
> who use mw.track, not Sam
>
> On Thu, Oct 12, 2017 at 6:41 AM, Sam Smith  wrote:
>
>> o/
>>
>> Prior to Thursday, 28th September, if your client-side EventLogging
>> instrumentation logged event via mw.track, then only events tracked
>> during the first pageview of a user's session were logged.
>>
>> Now, technically, the events weren't ignored or dropped. Instead, the
>> subscriber for the "event" topic was never subscribed when the module
>> was loaded from the ResourceLoader's cache and so events published on
>> that topic simply weren't received and logged.
>>
>> This bug was discovered while testing some instrumentation maintained
>> by Readers Web [0] and independently by Timo Tijhof, who submitted the
>> ideal fix [1] promptly.
>>
>> -Sam
>>
>> [0] https://phabricator.wikimedia.org/T175918
>> [1] https://gerrit.wikimedia.org/r/#/c/378804/
>>
>> ---
>> Engineering Manager
>> Readers
>>
>> Timezone: BST (UTC+1)
>> IRC (Freenode): phuedx
>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] Archiving some eventlogging tables to hadoop

2017-10-03 Thread Nuria Ruiz
Team:

Our mysql backend for eventlogging is having issues due to disk space. We
need to free space on near term so we will be archiving some tables to
hadoop.

Please see: https://phabricator.wikimedia.org/T168303 and:
https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging#Hadoop._Archived_Data

Thanks,

Nuria
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Resources stat1005

2017-08-14 Thread Nuria Ruiz
Adrian,

You already have access to use the cluster, which is where you should move
your processing; the link to YARN was just to show resource consumption.

Thanks,

Nuria

On Sat, Aug 12, 2017 at 3:52 PM, Adrian Bielefeldt <
adrian.bielefe...@mailbox.tu-dresden.de> wrote:

> Hi Andrew,
>
> thanks for the advice. Quick follow-up question: Which kind of
> access/account do I need for yarn.wikimedia.org? Neither my
> MediaWiki-Account nor my WikiTech-Account work.
>
> Greetings,
>
> Adrian
>
> On 08/12/2017 09:58 PM, Andrew Otto wrote:
>
> I have only 2 comments:
>
> 1.  Please nice  any heavy
> long running local processes, so that others can continue to use the
> machine.
>
> 2. For large data, consider using the Hadoop cluster!  I think you are
> getting your data from the webrequest logs in Hadoop anyway, so you might
> as well continue to do processing there, no?  If you do, you shouldn’t have
> to worry (too much) about resource contention: https://yarn.
> wikimedia.org/cluster/scheduler
>
> :)
>
> - Andrew Otto
>   Systems Engineer, WMF
>
>
>
>
> On Sat, Aug 12, 2017 at 2:20 PM, Erik Zachte 
> wrote:
>
>> I will soon start the two Wikistats jobs which run for about several
>> weeks each month,
>>
>> They might use two cores each, one for unzip, one for perl.
>>
>> How many cores are there anyway?
>>
>>
>>
>> Cheers,
>>
>> Erik
>>
>>
>>
>> *From:* Analytics [mailto:analytics-boun...@lists.wikimedia.org] *On
>> Behalf Of *Adrian Bielefeldt
>> *Sent:* Saturday, August 12, 2017 19:44
>> *To:* analytics@lists.wikimedia.org
>> *Subject:* [Analytics] Resources stat1005
>>
>>
>>
>> Hello everyone,
>>
>> I wanted to ask about resource allocation on stat1005. We
>> 
>> need quite a bit since we process every entry in wdqs_extract and I was
>> wondering how many cores and how much memory we can use without conflicting
>> with anyone else.
>>
>> Greetings,
>>
>> Adrian
>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>
>
> ___
> Analytics mailing 
> listAnalytics@lists.wikimedia.orghttps://lists.wikimedia.org/mailman/listinfo/analytics
>
>
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Article creation stats

2017-08-14 Thread Nuria Ruiz
>Would there happen to be a dataset of that available somewhere?

Data is available on the public Labs replicas, but the SQL is complicated to
write and likely to time out due to the volume of data being combed. Data is
also available in the Hadoop Data Lake, which is not public yet (it is our
plan to make it so). This data has already been used to gather such stats.
See: https://phabricator.wikimedia.org/T149021

On Sun, Aug 13, 2017 at 10:10 AM, Morten Wang  wrote:

> Hello everyone,
>
> I'm currently working gathering data for the Autoconfirmed article
> creation trial project[1]. One of the measures we're interested in is the
> number of new articles, both surviving and deleted, that is created per
> day. I know that recent data is logged through EventBus, but if possible
> I'd would also like to have historic stats on this (e.g. going back a
> handful of years). Would there happen to be a dataset of that available
> somewhere?
>
>
> References:
> 1: https://meta.wikimedia.org/wiki/Research:Autoconfirmed_
> article_creation_trial
>
> Cheers,
> Morten
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] fishy browser stats

2017-08-03 Thread Nuria Ruiz
> "At about 0.5% of our total human views currently, they start to matter
> for overall traffic trend analysis etc."
Our bot traffic not reported as such is a lot higher than 0.5%; probably
more like an order of magnitude higher, at least 2%. See:
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Pageviews/Bots_Research

This matters a lot for computations like top pageviews, which are so
distorted by bot traffic that they become almost useless. Now, while these
are overall numbers, the effect is felt mostly on English Wikipedia, as
smaller Wikipedias have a much smaller percentage of unreported bots.
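To illustrate the point with invented numbers: bot traffic that is a small share of total volume can still dominate a top-pages ranking when it concentrates on a few pages. Everything below (page names, view counts) is hypothetical:

```javascript
// Toy illustration with invented numbers: unlabeled bot traffic that is a
// small share of total volume can still dominate a "top pages" ranking.
const rows = [];
for (let i = 0; i < 50; i++) {
  // 50 "human" pages with 80-100 views each.
  rows.push({ page: 'Article_' + i, views: 80 + (i % 21), bot: false });
}
// One page hammered by a single unlabeled scraper.
rows.push({ page: 'Scraped_page', views: 150, bot: true });

const total = rows.reduce((s, r) => s + r.views, 0);
const botViews = rows.filter(r => r.bot).reduce((s, r) => s + r.views, 0);
const botShare = botViews / total; // only ~3% of all traffic

const topWithBots = [...rows].sort((a, b) => b.views - a.views)[0].page;
const topHumansOnly = rows.filter(r => !r.bot)
  .sort((a, b) => b.views - a.views)[0].page;

console.log(botShare.toFixed(2)); // a small overall share...
console.log(topWithBots);         // ...yet the scraped page tops the list
console.log(topHumansOnly);
```

The same asymmetry is why a ~2% unlabeled-bot share can visibly reshape the top of a ranking without moving aggregate totals much.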

On Thu, Aug 3, 2017 at 8:59 AM, Tilman Bayer <tba...@wikimedia.org> wrote:

> For those with NDA access, see also the more detailed investigation at
> https://phabricator.wikimedia.org/T157404 (nothing super secret about the
> topic per se, it's just that some partial IP data was examined in the
> process, so the task was set to non-public to avoid privacy concerns)
>
> When filing that task half a year ago, I wrote that "At about 0.5% of our
> total human views currently, they start to matter for overall traffic trend
> analysis etc." They have since increased and, as can be gleaned from
> Kaldari's remarks, do indeed affect our global stats markedly now. I have
> started to remove them in the pageviews stats and trends I'm preparing,
> will follow up with more detail on Phabricator.
>
>
> On Fri, Jul 21, 2017 at 9:24 PM, Nuria Ruiz <nu...@wikimedia.org> wrote:
>
>> >Surely this can't be accurate though as most other sites on the
>> internet report virtually non-existent usage of IE7 (less than 1%
>> everywhere I've checked). Can someone >double-check this?
>> This is likely bot traffic with IE7 user-agent. See:
>> https://phabricator.wikimedia.org/T148461
>>
>> We will hopefully be able to tackle distortion of stats by non-labelled
>> bot traffic in the next year: https://phabricator.wikimedia.org/T138207
>>
>> Issue for dataset noted here: https://wikitech.wikimed
>> ia.org/wiki/Analytics/Data_Lake/Traffic/Browser_general#Chan
>> ges_and_known_problems_since_2016-03-21
>>
>>
>>
>>
>> On Fri, Jul 21, 2017 at 4:36 PM, Ryan Kaldari <rkald...@wikimedia.org>
>> wrote:
>>
>>> According to...
>>> https://analytics.wikimedia.org/dashboards/browsers/#all-sit
>>> es-by-browser/browser-family-and-major-hierarchical-view
>>> ... IE7 accounts for 2.5% of all pageviews in the last month.
>>>
>>> According to...
>>> https://analytics.wikimedia.org/dashboards/browsers/#desktop
>>> -site-by-browser/browser-family-and-major-hierarchical-view
>>> ... IE7 accounts for 5.1% of all desktop pageviews in the last month.
>>>
>>> If that's true, IE7 (which came out 10 years ago) is more popular than
>>> all versions of Safari combined. It also means that we need to roll back a
>>> whole slew of features in MediaWiki that aren't supported in IE7.
>>>
>>> Surely this can't be accurate though as most other sites on the internet
>>> report virtually non-existent usage of IE7 (less than 1% everywhere I've
>>> checked). Can someone double-check this?
>>>
>>> ___
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>>
>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>
>
>
> --
> Tilman Bayer
> Senior Analyst
> Wikimedia Foundation
> IRC (Freenode): HaeB
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Daily merged pageviews stopped ?

2017-08-01 Thread Nuria Ruiz
Ticket here:

https://phabricator.wikimedia.org/T172032

On Tue, Aug 1, 2017 at 1:22 PM, Akeron  wrote:

> Hello,
>
> Last file is one week ago : pagecounts-2017-07-23.bz2
> https://dumps.wikimedia.org/other/pagecounts-ez/merged/2017/2017-07/
>
> Thanks,
>
> Akeron.
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Analytics project request

2017-07-24 Thread Nuria Ruiz
Daniel,

Signing an NDA is not enough to get access to the data; you also need to
be part of a formal research collaboration with our research team. They
have a number of those and are not likely to accept any more soon, but
you can contact them in that regard:
https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations

Thanks,

Nuria



On Mon, Jul 24, 2017 at 6:37 AM, Daniel Oberski 
wrote:

> Dear list,
>
> I'm posting a recent conversation with Dan below, as well as a few
> follow-up questions.
>
> Dan was kind enough to point out this list. I apologize that the post is
> "backward" (in
> email-thread format) due to my ignorance, will use this list from now on.
>
> Thanks, Daniel
>
>
> 
>
> Hi Dan
>
>
> Thanks for getting back to me so quickly!
>
> >Thanks for writing.  In general these questions are best asked on our
> public list, so other
> >people can see and benefit from any answers: https://lists.wikimedia.org/
> mailman/listinfo/
> >analytics
>
> Thanks, I've joined this list and will ask subsequent questions there.
>
> >* pairs of pages: we have two datasets that are mentioned in this task
> https://
> >phabricator.wikimedia.org/T158972 which should be very interesting for
> this purpose.  They
> >aren't being updated right now, and the task is to do just that.  We'll
> probably get to
> >that within the next 3 months, but a bunch of us are on paternity leave
> this summer, so
> >things are a little slower than normal
>
> This seems close to what I need. From the descriptions I gather the
> linkage is by session.
> Is there also a linkage by ip (with IP's removed of course)?
>
> >* country data for pageviews: for privacy reasons we only allow access to
> this with an
> >NDA.  We have good data on it, but you need to sign this NDA and use our
> cluster to access
> >it, being careful about what you report about it to the world at large.
> Here's information
> >on that: https://wikitech.wikimedia.org/wiki/Volunteer_NDA
>
> I've read this and am happy to sign an NDA. I understand it is best to be
> as specific as
> possible about the reasoning, intentions with the data, and permissions
> required. For me to
> figure this out it would be useful to know the relevant parts of the
> database schema, and
> perhaps a hint as to which data might be most interesting there. Would you
> be able to point
> me towards that?
>
> >Hope that helps, and feel free to write back to the public list in the
> future.
>
> Definitely, very helpful and thank you!
>
> Best, Daniel
>
>
> On Wed, Jul 19, 2017 at 9:51 AM, Oberski, D.L. (Daniel) 
> wrote:
> Dear Dan,
>
>
> My name is Daniel Oberski, I'm an associate professor of data science
> methodology in the
> department of statistics at Utrecht University in the Netherlands.
>
> I've been using your incredibly useful pageviews API to study correlations
> between the
> amount of interest people show in a topic (pageviews) with other data such
> as political
> party preference over time. That has yielded some interesting results
> (which I have yet to
> write up).
>
> However, to do a better study it would be very helpful to have slightly
> more information
> than is in the API. Specifically, it would be very useful to be able to
> query, for each
> _pair_ of pages, how many people (or IP's) viewed _both_ of those pages.
> That way I can find
> out which pages are really indicative of interest in a specific common
> topic, rather than
> just correlated by accident. In addition, I've found it hard to figure out
> pageviews for
> specific pages by country rather than language.
>
> My question is, would you happen to know if is there any way to obtain
> this information?
> (does not necessarily have to be through the API.) Or do you know if there
> are people to
> whom I might talk about this?
>
> Thanks for reading (to) the end and best regards,
>
> Daniel
>
>
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] fishy browser stats

2017-07-21 Thread Nuria Ruiz
>Surely this can't be accurate though as most other sites on the internet
>report virtually non-existent usage of IE7 (less than 1% everywhere I've
>checked). Can someone double-check this?
This is likely bot traffic with IE7 user-agent. See:
https://phabricator.wikimedia.org/T148461

We will hopefully be able to tackle distortion of stats by non-labelled bot
traffic in the next year: https://phabricator.wikimedia.org/T138207

Issue for dataset noted here:
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Browser_general#Changes_and_known_problems_since_2016-03-21




On Fri, Jul 21, 2017 at 4:36 PM, Ryan Kaldari 
wrote:

> According to...
> https://analytics.wikimedia.org/dashboards/browsers/#all-
> sites-by-browser/browser-family-and-major-hierarchical-view
> ... IE7 accounts for 2.5% of all pageviews in the last month.
>
> According to...
> https://analytics.wikimedia.org/dashboards/browsers/#
> desktop-site-by-browser/browser-family-and-major-hierarchical-view
> ... IE7 accounts for 5.1% of all desktop pageviews in the last month.
>
> If that's true, IE7 (which came out 10 years ago) is more popular than all
> versions of Safari combined. It also means that we need to roll back a
> whole slew of features in MediaWiki that aren't supported in IE7.
>
> Surely this can't be accurate though as most other sites on the internet
> report virtually non-existent usage of IE7 (less than 1% everywhere I've
> checked). Can someone double-check this?
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] Eventlogging incident report

2017-07-18 Thread Nuria Ruiz
Team:

Please see the recent incident report for eventlogging [1], [2]

TL;DR After addition of some EventBus events to MySQL we had an issue with
insertion of events in which some events were dropped. This affected all
schemas. Events for all schemas have been backfilled as of now.


[1]
https://wikitech.wikimedia.org/wiki/Incident_documentation/20170711-EventLogging

[2]
https://wikitech.wikimedia.org/wiki/Analytics/Systems/EventLogging#Changes_and_Known_Problems_with_Dataset
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Wikitech-l] Drop in mainpage pageviews?

2017-07-17 Thread Nuria Ruiz
> do not remember the exact date, but a couple of months ago the way for
> pageviews counting was changed, using cookies, affecting the mobile web
> views. This can be the cause.
> Igal (User IKhitron)

... I'm not sure what you are referring to; can you be more specific? We
have not changed the way we measure pageviews for two years.

Improvements and fixes are continuously made to the system, and those are
documented here:

https://meta.wikimedia.org/wiki/Research:Page_view#Change_log


FYI, we have opened a ticket for this issue:
https://phabricator.wikimedia.org/T170845

Without looking at it in detail, there are many possible causes: removal of
"bad" pageviews (like requests for banners) or a drop in traffic due to
having fewer bots.











On Sat, Jul 15, 2017 at 11:38 PM, Strainu  wrote:

> 2017-07-15 14:34 GMT+03:00 יגאל חיטרון :
> > Hello, Strainu.
> > 1. Try in place of "all-access" use different platforms. You'll see, as I
> > expected reading your letter, that the effect you recognized appears in
> > "mobile-web" mostly.
> > 2. I do not remember the exact date, but a couple of months ago the way
> for
> > pageviews counting was changed, using cookies, affecting the mobile web
> > views. This can be the cause.
> > Igal (User IKhitron)
>
> Thank you Igal, that must be the reason, I'll look for the change
> announcement just to understand what it means exactly.
>
> Strainu
>
> >
> > On Jul 15, 2017 13:47, "Strainu"  wrote:
> >
> >> Hi,
> >>
> >> Starting from an unrelated discussion on meta, I noticed a significant
> >> drop in main page views for several wikis starting from April this
> >> year. Is there anything we (or Google) did at that time to justify
> >> this drop?
> >>
> >> ro.wiki: http://tools.wmflabs.org/pageviews/?project=ro.
> >> wikipedia.org=all-access=user=2016-
> >> 01=2017-06=Pagina_principal%C4%83
> >>
> >> hu.wiki: http://tools.wmflabs.org/pageviews/?project=hu.
> >> wikipedia.org=all-access=user=2016-
> >> 01=2017-06=Kezd%C5%91lap
> >>
> >> fr.wiki: http://tools.wmflabs.org/pageviews/?project=fr.
> >> wikipedia.org=all-access=user=2016-
> >> 01=2017-06=Wikip%C3%A9dia:Accueil_principal
> >>
> >> de.wiki: http://tools.wmflabs.org/pageviews/?project=de.
> >> wikipedia.org=all-access=user=2016-
> >> 01=2017-06=Wikipedia:Hauptseite
> >>
> >> en.wiki: http://tools.wmflabs.org/pageviews/?project=en.
> >> wikipedia.org=all-access=user=2016-
> >> 01=2017-06=Main_Page
> >> (slightly different pattern)
> >>
> >> The same cannot be said for other projects, for instance uk.wiki:
> >> http://tools.wmflabs.org/pageviews/?project=uk.
> wikipedia.org=all-
> >> access=user=2016-01=2017-06=%D0%93%
> >> D0%BE%D0%BB%D0%BE%D0%B2%D0%BD%D0%B0_%D1%81%D1%82%D0%BE%D1%
> >> 80%D1%96%D0%BD%D0%BA%D0%B0
> >>
> >> Thanks,
> >>Strainu
> >>
> >> ___
> >> Wikitech-l mailing list
> >> wikitec...@lists.wikimedia.org
> >> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
> > ___
> > Wikitech-l mailing list
> > wikitec...@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/wikitech-l
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] new mediawiki_history snapshot available

2017-07-12 Thread Nuria Ruiz
>Can you specify what you mean by "next year"? I can think fiscal,
>calendar, etc. :)

We are aiming for this data to be public in its current analytics-friendly
form by the end of 2017 / beginning of 2018.

On Wed, Jul 12, 2017 at 12:22 PM, Leila Zia <le...@wikimedia.org> wrote:

> On Wed, Jul 12, 2017 at 12:16 PM, Nuria Ruiz <nu...@wikimedia.org> wrote:
> > Further clarification that this snapshot of data is not yet public
> (meaning
> > available to the outside world, not just WMF/NAD holders) .
>
> Thanks for clarifying this and the work you and your team has put into
> this.
>
> > Our team is working towards making this data available next year in labs
> in the same
> > fashion that data is now available on the labs replicas.
>
> Can you specify what you mean by "next year"? I can think fiscal,
> calendar, etc. :)
>
> A big thumbs up for making data public. wiki-research-l list and
> audience will be happy.
>
> Best,
> Leila
>
> >
> >
> > Thanks,
> >
> > Nuria
> >
> > On Wed, Jul 12, 2017 at 9:34 AM, Dan Andreescu <dandree...@wikimedia.org
> >
> > wrote:
> >>
> >> Today we announce a new snapshot (named 2017-06) of the mediawiki
> history
> >> data [1].  It includes these awesome new fields:
> >>
> >> event_user_revision_count: 'Cumulative revision count per user for the
> >> current event_user_id (only available in revision-create events so far)'
> >>
> >> page_revision_count: 'In revision/page events: Cumulative revision count
> >> per page for the current page_id (only available in revision-create
> events
> >> so far)'
> >>
> >> The event_user_revision_count field is useful as a close estimate to
> >> user_editcount, but it does not include Flow talk page edits.
> >> We've also added event_user_seconds_to_previous_revision and
> >> page_seconds_to_previous_revision, but those are not being computed
> right
> >> now.
> >>
> >> The mediawiki_history dataset is updated every month, but we thought
> we'd
> >> let you know about this one since it has new goodies.  It's all thanks
> to
> >> Joseph who did everything but announce this wonderful work and then had
> to
> >> rush away to welcome his daughter into the world.  Hi Joseph!  Stop
> reading
> >> work email! :D
> >>
> >>
> >> [1]
> >> https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/
> Edits/Mediawiki_history
> >>
> >> ___
> >> Analytics mailing list
> >> Analytics@lists.wikimedia.org
> >> https://lists.wikimedia.org/mailman/listinfo/analytics
> >>
> >
> >
> > ___
> > Analytics mailing list
> > Analytics@lists.wikimedia.org
> > https://lists.wikimedia.org/mailman/listinfo/analytics
> >
>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] new mediawiki_history snapshot available

2017-07-12 Thread Nuria Ruiz
Further clarification: this snapshot of data is not yet public (meaning
available to the outside world, not just WMF/NDA holders). Our team is
working towards making this data available next year in Labs, in the same
fashion that data is now available on the Labs replicas.
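For anyone who wants to poke at the new fields described below, a rough
HiveQL sketch (wrapped in Python so it can be parameterized) might look
like this. The table and field names come from the announcement and the
wikitech page [1]; the `snapshot` and `wiki_db` partition columns are
assumptions to verify against the docs before running:

```python
def top_editors_query(snapshot, wiki="enwiki", limit=10):
    """Build a HiveQL query over the mediawiki_history dataset.

    Field names (event_user_revision_count, event_entity, event_type)
    are from the announcement; the snapshot/wiki_db partition layout is
    an assumption -- check the wikitech docs.
    """
    return (
        "SELECT event_user_id, MAX(event_user_revision_count) AS revisions\n"
        "FROM wmf.mediawiki_history\n"
        f"WHERE snapshot = '{snapshot}' AND wiki_db = '{wiki}'\n"
        "  AND event_entity = 'revision' AND event_type = 'create'\n"
        f"GROUP BY event_user_id ORDER BY revisions DESC LIMIT {limit}"
    )

print(top_editors_query("2017-06"))
```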


Thanks,

Nuria

On Wed, Jul 12, 2017 at 9:34 AM, Dan Andreescu 
wrote:

> Today we announce a new snapshot (named *2017-06*) of the mediawiki
> history data [1].  It includes these awesome new fields:
>
> *event_user_revision_count*: 'Cumulative revision count per user for the
> current event_user_id (only available in revision-create events so far)'
>
> *page_revision_count*: 'In revision/page events: Cumulative revision
> count per page for the current page_id (only available in revision-create
> events so far)'
>
> The *event_user_revision_count* field is useful as a close estimate to
> user_editcount, but it does not include Flow talk page edits.
> We've also added event_user_seconds_to_previous_revision and
> page_seconds_to_previous_revision, but those are not being computed right
> now.
>
> The mediawiki_history dataset is updated every month, but we thought we'd
> let you know about this one since it has new goodies.  It's all thanks to
> Joseph who did everything but announce this wonderful work and then had to
> rush away to welcome his daughter into the world.  Hi Joseph!  Stop reading
> work email! :D
>
>
> [1] https://wikitech.wikimedia.org/wiki/Analytics/
> Data_Lake/Edits/Mediawiki_history
>


[Analytics] Dropping MoodBar extension tables from all wikis

2017-07-07 Thread Nuria Ruiz
Hello!

This is an FYI that the MoodBar extension has been undeployed and, as such, its
tables will be removed from all wikis.  See
https://phabricator.wikimedia.org/T153033

It looks like this extension sparked some interest in the past [1] and there
were some research projects about it. Please let us know (before August
7th) whether we should keep the tables for any reason.

Thanks,


Nuria



[1] https://meta.wikimedia.org/wiki/Research:MoodBar/First_month_of_activity


Re: [Analytics] [Research-Internal] [Ops] EventStreams launch and RCStream deprecation

2017-06-27 Thread Nuria Ruiz
I think Jon got his question answered, but to keep the archives happy:
here is the length field for the new event:

https://github.com/wikimedia/mediawiki-event-schemas/blob/master/jsonschema/mediawiki/recentchange/1.yaml#L113
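Computing the change in bytes from that field is a one-liner once you have
the event. A minimal sketch (the streaming part assumes the third-party
`sseclient` package and the public EventStreams recentchange URL; verify
both before relying on them):

```python
import json

def byte_delta(event):
    """Change in page size (bytes) for a recentchange event, using the
    length.new / length.old fields from the schema linked above.
    Old length is absent for page creations, so default it to 0."""
    length = event.get("length") or {}
    return length.get("new", 0) - length.get("old", 0)

def stream_deltas(url="https://stream.wikimedia.org/v2/stream/recentchange"):
    # Streaming sketch; assumes `pip install sseclient`.
    from sseclient import SSEClient
    for msg in SSEClient(url):
        if msg.data:
            event = json.loads(msg.data)
            print(event.get("title"), byte_delta(event))

print(byte_delta({"length": {"new": 2, "old": 1}}))  # → 1
```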

On Wed, Mar 8, 2017 at 5:02 PM, Jon Robson  wrote:

> Hey there!
> I was looking at migrating over to the new stream but it seems
> EventStreams does not surface the change in bytes made by a revision.
> This is quite crucial to my tool and blocks me from moving over to the new
> service.
>
> Previously this was surfaced like so:
> length: { new: 2, old: 1 },
> where new is the new length of the page in bytes and old is the old length
> in bytes.
>
> Am I simply missing where to look for this or is this indeed not there? If
> so, is there a bug open for this that I can track?
>
> Jon
>
>
> On Wed, Feb 8, 2017 at 9:00 AM Thomas Steiner  wrote:
>
>> Hi all,
>>
>> Thanks for launching this to the public! I have created an unofficial SSE
>> stream based on the IRC stream for open consumption for a while now at
>> http://wikipedia-edits.herokuapp.com/ (see link on top).
>>
>> Looking forward to migrating over to the new official SSE stream, but
>> wanted to check first if anyone relied on mine? I would probably turn it
>> off soonish (time permitting), but could also keep it running indefinitely
>> if people rely heavily on it. Just let me know.
>>
>> Cheers,
>> Tom
>> --
>> Dr. Thomas Steiner, Employee (https://blog.tomayac.com,
>> https://twitter.com/tomayac)
>>
>> Google Germany GmbH, ABC-Str. 19, 20354 Hamburg, Germany
>> Managing Directors: Matthew Scott Sucherman, Paul Terence Manicle
>> Registration office and registration number: Hamburg, HRB 86891
>>
>> -BEGIN PGP SIGNATURE-
>> Version: GnuPG v2.1.17 (GNU/Linux)
>>
>> iFy0uwAntT0bE3xtRa5AfeCheCkthAtTh3reSabiGbl0ck0fjumBl3DChara
>> CTersAttH3b0ttom.hTtP5://xKcd.c0m/1181/
>> -END PGP SIGNATURE-
>> ___
>> Ops mailing list
>> o...@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/ops
>>
>
> ___
> Research-Internal mailing list
> research-inter...@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/research-internal
>
>


Re: [Analytics] Connect to wikidata.org from stat1002.eqiad.wmnet

2017-05-14 Thread Nuria Ruiz
>(i.e. implying that we need to collect the data somewhere else, and move
to production for number crunching only)?
I think we should probably set up a sync-up so you get an overview of how
this works, since this is a brief response. Data is harvested on some
production machines, processed (on different production machines), and
moved to the stats machines (also production, but a sheltered environment).
We do not use the stats machines to harvest data; they just provide access
to it and are sized so you can process and crunch data. This talk explains
a bit how this all works: https://www.youtube.com/watch?v=tx1pagZOsiM

We might be talking past each other here; if so, a meeting might help.


>Nuria, what exactly do you have in mind when you say "a development
instance of Wikidata"?
If you need to look at a Wikidata query and see what shows up in the logs
when you query x or y, that step should be done on a (Wikidata) *test
environment* that logs the HTTP requests for your queries as received by
the server. That way you can "test" your queries against a server and see
how they are received.
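To make that concrete: what the server-side logs (e.g. the uri_query field
in wmf.wdqs_extract that Adrian mentions) see for a GET query is
essentially the URL-encoded form of the SPARQL text. A minimal sketch,
assuming the standard `query` parameter name (a simplification; real
requests may carry extra parameters):

```python
from urllib.parse import urlencode, parse_qs

def to_uri_query(sparql):
    """Encode a SPARQL query the way it appears in a GET request to the
    query service, i.e. roughly what a uri_query log field contains."""
    return "?" + urlencode({"query": sparql})

q = "SELECT ?item WHERE { ?item wdt:P31 wd:Q5 } LIMIT 1"
encoded = to_uri_query(q)
# Round-trip: a server-side log line can be decoded back to the query.
assert parse_qs(encoded.lstrip("?"))["query"][0] == q
print(encoded)
```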


Thanks,

Nuria





On Sun, May 14, 2017 at 1:10 PM, Adrian Bielefeldt <
adrian.bielefe...@mailbox.tu-dresden.de> wrote:

> Hi Addshore,
> thanks for the advice, I can now connect.
>
> Greetings,
>
> Adrian
>
>
> On 05/13/2017 05:47 PM, Addshore wrote:
>
> You should be able to connect to query.wikidata.org via the webproxy.
>
> https://wikitech.wikimedia.org/wiki/HTTP_proxy
>
> On Sat, 13 May 2017 at 15:23 Adrian Bielefeldt <
> adrian.bielefe...@mailbox.tu-dresden.de> wrote:
>
>> Hello Nuria,
>>
>> I'm working on a project
>> 
>> analyzing the wikidata SPARQL-queries. We extract specific fields (e.g.
>> uri_query, hour) from wmf.wdqs_extract, parse the queries with a java
>> program using open_rdf as the parser and then analyze it for different
>> metrics like variable count, which entities are being used and so on.
>>
>> At the moment I'm working on checking which entries equal one of the
>> example queries at https://www.wikidata.org/wiki/Wikidata:SPARQL_query_
>> service/queries/examples using this
>> 
>> code. Unfortunately the program cannot connect to the website, so I'm
>> assuming I have to create an exception for this request or ask for it to be
>> created.
>>
>> Greetings,
>>
>> Adrian


Re: [Analytics] Connect to wikidata.org from stat1002.eqiad.wmnet

2017-05-13 Thread Nuria Ruiz
Adrian,

>At the moment I'm working on checking which entries equal one of the
example queries at
https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples
using this code.

The stats machines are useful for analyzing data, but we do not use them to
do development. It seems like you would benefit from querying a development
instance of Wikidata and looking at development logs to know what to
expect. We strongly advise against doing development in production;
looking at logs in a development environment would be synchronous, so you
can get your answers fast.

Thanks,

Nuria
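For completeness, the webproxy route Addshore suggested (quoted below)
boils down to a proxy mapping like the following; the host and port are
taken from the wikitech HTTP_proxy page and should be verified there
before use:

```python
def proxy_config(proxy="http://webproxy.eqiad.wmnet:8080"):
    """Proxy mapping usable with e.g. `requests` or urllib, so external
    hosts like query.wikidata.org are reachable from stat machines.
    The default host/port is an assumption from the wikitech page."""
    return {"http": proxy, "https": proxy}

print(proxy_config())
```

With `requests`, this would be passed as `requests.get(url, proxies=proxy_config())`.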

On Sat, May 13, 2017 at 5:47 PM, Addshore  wrote:

> You should be able to connect to query.wikidata.org via the webproxy.
>
> https://wikitech.wikimedia.org/wiki/HTTP_proxy
>
> On Sat, 13 May 2017 at 15:23 Adrian Bielefeldt <
> adrian.bielefe...@mailbox.tu-dresden.de> wrote:
>
>> Hello Nuria,
>>
>> I'm working on a project
>> 
>> analyzing the wikidata SPARQL-queries. We extract specific fields (e.g.
>> uri_query, hour) from wmf.wdqs_extract, parse the queries with a java
>> program using open_rdf as the parser and then analyze it for different
>> metrics like variable count, which entities are being used and so on.
>>
>> At the moment I'm working on checking which entries equal one of the
>> example queries at https://www.wikidata.org/wiki/Wikidata:SPARQL_query_
>> service/queries/examples using this
>> 
>> code. Unfortunately the program cannot connect to the website, so I'm
>> assuming I have to create an exception for this request or ask for it to be
>> created.
>>
>> Greetings,
>>
>> Adrian


Re: [Analytics] Connect to wikidata.org from stat1002.eqiad.wmnet

2017-05-12 Thread Nuria Ruiz
Adrian,

Can you give us some context as to what is the project you are working
on/what are you trying to do?

Thanks,

Nuria

On Sat, May 13, 2017 at 1:12 AM, Adrian Bielefeldt <
adrian.bielefe...@mailbox.tu-dresden.de> wrote:

> Hello everyone,
>
> I wanted to ask how I have to proceed to be able to connect to
> wikidata.org from a java program running on stat1002.eqiad.wmnet. I am
> specifically accessing the example queries for query.wikidata.org to
> check the logs for occurrences.
>
> Greetings,
>
> Adrian
>
>

