Re: [Analytics] When is the new pages API updated?

2018-10-10 Thread Nuria Ruiz
> Wikistats 1 generates data on content pages with a delay of 10-15 days
> after the end of the month

This is true for the full snapshots (for the reasons we have discussed
before and that Dan has described in this thread). You can expect the data
to be available from the API soon after the 10th, but it is unlikely to be
there before the 10th, as we do not start the process until the 5th.

Now, the data - as you know - is streamed in real time, every second. It is
only the full reconstruction of events, the full snapshot, that takes
several days to build. Have you looked into using the real-time events when
the next month's snapshot is not yet available?
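
For example (one concrete option, not necessarily the one you would end up
with): the public EventStreams service exposes a 'recentchange' stream over
server-sent events, and page creations appear there with type 'new'. A
rough Python sketch, assuming the sseclient package is installed; treat it
as an illustration rather than a recommended pipeline:

    # Count page creations for one wiki from the EventStreams
    # 'recentchange' stream (server-sent events).
    import json
    from sseclient import SSEClient as EventSource

    STREAM_URL = 'https://stream.wikimedia.org/v2/stream/recentchange'
    WIKI = 'commonswiki'  # example wiki; adjust as needed

    new_pages = 0
    for event in EventSource(STREAM_URL):
        if event.event != 'message' or not event.data:
            continue
        try:
            change = json.loads(event.data)
        except ValueError:
            continue  # skip keep-alives and malformed payloads
        # type == 'new' marks a page creation in this stream
        if change.get('wiki') == WIKI and change.get('type') == 'new':
            new_pages += 1
            print(new_pages, change.get('title'))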


On Wed, Oct 10, 2018 at 7:48 PM Dan Andreescu 
wrote:

> It should be updated soon; the jobs have all finished successfully.  But
> currently we do expect this kind of lag, and I'll explain why.
>
> When we started, we were sqooping at the beginning of the month, and the
> processing takes something like 4 days total, most of it sqooping.  But
> this put too much load on the database servers too close to the beginning
> of the month, when a bunch of other stuff is running.  So we had to push
> it back to the 5th of the month [1].  Add 4 days onto that and we end up
> finishing around the 9th of the month.  We don't like this at all, and
> we're trying to figure out a better way to import the data incrementally
> so we can just start processing once we have all of it.  It's unfortunate,
> but we couldn't foresee the infrastructure limitation; too much was up in
> the air, even about where we would sqoop from, when we started this work.
> Joseph and I have a weekly meeting to discuss moving towards a more
> incremental approach, and this is the parent task to watch for now:
> https://phabricator.wikimedia.org/T193650 (priority is low because we
> have too many other commitments, but it's something I'd love to see
> before we call Wikistats 2 "production" quality).
>
> [1]
> https://github.com/wikimedia/puppet/blob/28b78985d3612a6e19720be1fe8eef5f0dfc2ed7/modules/profile/manifests/analytics/refinery/job/sqoop_mediawiki.pp#L43
>
> On Wed, Oct 10, 2018 at 10:00 PM Neil Patel Quinn 
> wrote:
>
>> Hey there!
>>
>> I just wrote a script that fetches data from the AQS new pages endpoint
>> 
>> in order to prepare our monthly health metrics (T199459
>> ).
>>
>> However, it seems like that endpoint doesn't yet have monthly data for
>> September. For example, a query for Commons with a start of July 1
>> and an end of October 1
>> 
>> returns only data for July and August. What's the schedule for updating
>> this data?
>>
>> To be honest, I feel pretty frustrated by this. Wikistats 1 generates
>> data on content pages with a delay of 10-15 days after the end of the
>> month, which has made it difficult for us to provide timely metrics to
>> executives and the board. I had assumed (to a degree that I didn't even
>> check) that by switching to this API, we would instead only have to deal
>> with the delay in generating the mediawiki_history snapshot (5-7 days after
>> the end of the month). But that doesn't seem to be the case.
>> --
>> Neil Patel Quinn 
>> (he/him/his)
>> product analyst, Wikimedia Foundation
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] When is the new pages API updated?

2018-10-10 Thread Dan Andreescu
It should be updated soon; the jobs have all finished successfully.  But
currently we do expect this kind of lag, and I'll explain why.

When we started, we were sqooping at the beginning of the month, and the
processing takes something like 4 days total, most of it sqooping.  But
this put too much load on the database servers too close to the beginning
of the month, when a bunch of other stuff is running.  So we had to push it
back to the 5th of the month [1].  Add 4 days onto that and we end up
finishing around the 9th of the month.  We don't like this at all, and
we're trying to figure out a better way to import the data incrementally so
we can just start processing once we have all of it.  It's unfortunate, but
we couldn't foresee the infrastructure limitation; too much was up in the
air, even about where we would sqoop from, when we started this work.
Joseph and I have a weekly meeting to discuss moving towards a more
incremental approach, and this is the parent task to watch for now:
https://phabricator.wikimedia.org/T193650 (priority is low because we have
too many other commitments, but it's something I'd love to see before we
call Wikistats 2 "production" quality).

[1]
https://github.com/wikimedia/puppet/blob/28b78985d3612a6e19720be1fe8eef5f0dfc2ed7/modules/profile/manifests/analytics/refinery/job/sqoop_mediawiki.pp#L43
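
In the meantime, if a script just needs to know when the new month has
landed, one workaround is to poll the endpoint until results for that month
show up. A rough sketch against the public edited-pages/new AQS route (the
response fields follow the documented format; the polling interval here is
arbitrary):

    # Poll AQS until the September 2018 monthly data point is available.
    import time
    import requests

    URL = ('https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/'
           'commons.wikimedia.org/all-editor-types/content/monthly/'
           '20180901/20181001')

    def september_available():
        headers = {'User-Agent': 'snapshot-poller (example contact)'}
        r = requests.get(URL, headers=headers)
        if r.status_code != 200:
            return False
        items = r.json().get('items', [])
        # each item carries a 'results' list with one entry per month
        return any(item.get('results') for item in items)

    while not september_available():
        time.sleep(6 * 60 * 60)  # check a few times a day around the 9th/10th
    print('September snapshot data is now in AQS')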

On Wed, Oct 10, 2018 at 10:00 PM Neil Patel Quinn 
wrote:

> Hey there!
>
> I just wrote a script that fetches data from the AQS new pages endpoint
> 
> in order to prepare our monthly health metrics (T199459
> ).
>
> However, it seems like that endpoint doesn't yet have monthly data for
> September. For example, a query for Commons with a start of July 1
> and an end of October 1
> 
> returns only data for July and August. What's the schedule for updating
> this data?
>
> To be honest, I feel pretty frustrated by this. Wikistats 1 generates data
> on content pages with a delay of 10-15 days after the end of the month,
> which has made it difficult for us to provide timely metrics to executives
> and the board. I had assumed (to a degree that I didn't even check) that by
> switching to this API, we would instead only have to deal with the delay in
> generating the mediawiki_history snapshot (5-7 days after the end of the
> month). But that doesn't seem to be the case.
> --
> Neil Patel Quinn 
> (he/him/his)
> product analyst, Wikimedia Foundation
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] When is the new pages API updated?

2018-10-10 Thread Neil Patel Quinn
Hey there!

I just wrote a script that fetches data from the AQS new pages endpoint

in order to prepare our monthly health metrics (T199459
).

However, it seems like that endpoint doesn't yet have monthly data for
September. For example, a query for Commons with a start of July 1 and
an end of October 1

returns only data for July and August. What's the schedule for updating
this data?
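
(Roughly, the request in question looks like the following; this is a
minimal sketch of the call rather than the actual script, assuming the
public edited-pages/new route with all editor types and content pages, and
the response fields as documented:)

    # Monthly new content pages on Commons, July 2018 through September 2018.
    import requests

    url = ('https://wikimedia.org/api/rest_v1/metrics/edited-pages/new/'
           'commons.wikimedia.org/all-editor-types/content/monthly/'
           '20180701/20181001')
    headers = {'User-Agent': 'health-metrics-example'}
    resp = requests.get(url, headers=headers)
    resp.raise_for_status()
    for item in resp.json()['items']:
        for row in item['results']:
            print(row['timestamp'], row['new_pages'])
    # As of this writing, this prints rows only for 2018-07 and 2018-08.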

To be honest, I feel pretty frustrated by this. Wikistats 1 generates data
on content pages with a delay of 10-15 days after the end of the month,
which has made it difficult for us to provide timely metrics to executives
and the board. I had assumed (to a degree that I didn't even check) that by
switching to this API, we would instead only have to deal with the delay in
generating the mediawiki_history snapshot (5-7 days after the end of the
month). But that doesn't seem to be the case.
-- 
Neil Patel Quinn 
(he/him/his)
product analyst, Wikimedia Foundation
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Wiki-research-l] Hive and Oozie unavailable for maintenance on Tue Oct 9th 10 AM CEST

2018-10-10 Thread Luca Toscano
Thanks for the note, Neil; I should have been clearer! I am also going to
fix all the references to analytics1003 on Wikitech later today :)

Luca

On Wed, Oct 10, 2018 at 01:08 Neil Patel Quinn <
nqu...@wikimedia.org> wrote:

> Quick note: since the Hive coordinator has moved, you'll have to update
> its URL from *analytics1003.eqiad.wmnet* to *an-coord1001.eqiad.wmnet* in
> any scripts you have.
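>
> For example, if a script connects through pyhive, the change is just the
> host parameter (a sketch only; port 10000 is the usual HiveServer2 port,
> and the username/database values below are placeholders):
>
>     from pyhive import hive
>
>     conn = hive.Connection(
>         host='an-coord1001.eqiad.wmnet',  # was analytics1003.eqiad.wmnet
>         port=10000,                       # standard HiveServer2 port
>         username='your-shell-username',   # placeholder
>         database='wmf',                   # placeholder
>     )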
>
> On Fri, 5 Oct 2018 at 09:54, Luca Toscano  wrote:
>
>> Hi everybody,
>>
>> the Analytics team is going to move the Oozie and Hive daemons from the
>> analytics1003 host to an-coord1001 (a new host, hardware refresh) on
>> Tuesday, Oct 9th at 10 AM CEST. This will require downtime for Oozie and
>> Hive, so some jobs might fail or not work at all during the maintenance.
>> We have allocated two hours for this procedure, but it should take less
>> time.
>>
>> Tracking task: T205509
>>
>> As always, please follow up with me or anybody in the analytics team for
>> clarifications and/or comments (via Phabricator or IRC Freenode
>> #wikimedia-analytics).
>>
>> Thanks for your patience!
>>
>> Luca (on behalf of the Analytics team)
>> ___
>> Wiki-research-l mailing list
>> wiki-researc...@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>>
>
>
> --
> Neil Patel Quinn 
> (he/him/his)
> product analyst, Wikimedia Foundation
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics