Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases

2019-07-08 Thread Nuria Ruiz
>Will there be a release for these two tables?
No, sorry, there will not be. The dataset release is about pages and users.
To be extra clear, though: it is not a set of tables but a denormalized
reconstruction of the edit history.

> Could I connect to Hadoop to see whether the queries on pagelinks and
> categorylinks run faster?
It is a bit more complicated than just "connecting", but I do not think we
need to dwell on that because, as far as I know, there is no categorylinks
info in Hadoop.

Hadoop holds the set of MediaWiki data that we use to create the dataset I
pointed you to:
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history
and a bit more.

Is it possible to extract some of this information from the XML dumps?
Perhaps somebody on the list has other ideas?
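
For what it's worth, pagelinks and categorylinks are also published as SQL
dumps on dumps.wikimedia.org, so something along these lines might work for
the link counts. This is a rough, untested sketch; the file name and column
order are assumptions, so check them against the CREATE TABLE header at the
top of the dump file:

# Rough sketch: count outgoing main-namespace links per page from the
# public pagelinks SQL dump instead of the replicas. Assumes each row
# is (pl_from, pl_namespace, pl_title, pl_from_namespace).
import gzip
import re
from collections import Counter

DUMP = "cawiki-latest-pagelinks.sql.gz"  # e.g. dumps.wikimedia.org/cawiki/latest/
ROW = re.compile(r"\((\d+),(\d+),'((?:[^'\\]|\\.)*)',(\d+)\)")

outgoing = Counter()
with gzip.open(DUMP, "rt", encoding="utf-8", errors="replace") as fh:
    for line in fh:
        if not line.startswith("INSERT INTO"):
            continue
        for pl_from, pl_ns, _title, _from_ns in ROW.findall(line):
            if pl_ns == "0":  # link target is in the main (article) namespace
                outgoing[int(pl_from)] += 1

print(outgoing.most_common(10))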

Thanks,

Nuria

P.S. Just so you know: in order to facilitate access to our computing
resources and private data (there is no way for us to give access to only
"part" of the data we hold in Hadoop), we require an active collaboration
with our research team. We cannot support ad-hoc access to Hadoop for
community members. Here is some info:
https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations







Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases

2019-07-08 Thread Marc Miquel
Hello Nuria,

This seems like an interesting alternative for some of the data (page, user,
revision). It could really help and speed up some processes (for the moment
we gave up re-running the revision queries, as the new user_agent change made
them even slower). So we will take a look at it as soon as it is ready.

However, our scripts are struggling with other data: the pagelinks table and
the category graph.

For instance, we need to count the percentage of links that an article
directs to other pages, and the percentage of links it receives from a group
of pages. Likewise, we need to walk down the category graph starting from a
specific group of categories. At the moment the query that uses pagelinks is
not really working for these counts, whether we pass parameters for the
entire table or for specific parts of it (using batches).
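
To illustrate, the category walk is roughly the following; this is a
simplified sketch where the replica host and the seed category are
placeholders:

# Simplified sketch of the category walk against a Wiki Replicas
# connection (host and seed category are placeholders). Starting from
# a seed category, it collects subcategories breadth-first with one
# small query per category.
from collections import deque

import pymysql

conn = pymysql.connect(
    host="cawiki.analytics.db.svc.eqiad.wmflabs",  # placeholder replica host
    db="cawiki_p",
    read_default_file="~/replica.my.cnf",
)

def subcategories(cat_title):
    with conn.cursor() as cur:
        cur.execute(
            """SELECT page_title FROM page
               JOIN categorylinks ON cl_from = page_id
               WHERE cl_to = %s AND cl_type = 'subcat'""",
            (cat_title,),
        )
        return [row[0].decode("utf-8") for row in cur.fetchall()]

seen = set()
frontier = deque(["Catalan_culture"])  # example seed category
while frontier:
    cat = frontier.popleft()
    if cat in seen:
        continue
    seen.add(cat)
    frontier.extend(subcategories(cat))

print(len(seen), "categories reached")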

Will there be a release for these two tables? Could I connect to Hadoop to
see whether the queries on pagelinks and categorylinks run faster?

If there is any other alternative we would be happy to try it, as we have not
been able to make progress for several weeks.
Thanks again,

Marc


Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases

2019-07-08 Thread Nuria Ruiz
Hello,

From your description it seems that your problem is not one of computation
(well, not your main problem) but rather one of data extraction. The Labs
replicas are not meant for big data extraction jobs, as you have just found
out. Neither is Hadoop. However, our team will soon be releasing a dataset of
denormalized edit data that you can probably use. It is still up for
discussion whether the data will be released as a JSON dump or in another
format, but it is basically a denormalized version of all the data held in
the replicas, and it will be regenerated monthly.

Please take a look at the documentation of the dataset:
https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history

This is the phab ticket:
https://phabricator.wikimedia.org/T208612

So, to sum up: once this dataset is out (we hope late this quarter or early
next), you can probably build your own datasets from it, rendering your use
of the replicas unnecessary. Hopefully this makes sense.
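
Purely as an illustration, once it is out, a monthly job could stream the
dump and aggregate locally instead of querying the replicas. The file name
and field names below are assumptions; the actual schema is documented on
the wikitech page above:

# Illustrative only: stream a hypothetical newline-delimited JSON dump
# of the denormalized history and count revisions per page locally.
import gzip
import json
from collections import Counter

edits_per_page = Counter()
with gzip.open("mediawiki_history.cawiki.json.gz", "rt") as fh:  # hypothetical file
    for line in fh:
        event = json.loads(line)
        if event.get("event_entity") == "revision":  # assumed field names
            edits_per_page[event.get("page_id")] += 1

print(edits_per_page.most_common(10))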

Thanks,

Nuria






[Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases

2019-07-08 Thread Marc Miquel
To whom it may concern,

I am writing regarding the *Cultural Diversity Observatory* project and the
data we are collecting. In short, this project aims at bridging the content
gaps between language editions that relate to cultural and geographical
aspects. For this we need to retrieve data from all language editions and
from Wikidata, and run some scripts that crawl the category and link graphs
in order to create datasets and statistics.

The reason I am writing is that we are stuck: we cannot automate the scripts
that retrieve data from the Replicas. We could create the datasets a few
months ago, but over the past few months it has become impossible.

We are concerned because creating the dataset once for research purposes is
one thing, and creating it on a monthly basis is another. The monthly basis
is what we promised in the project grant details, and now we cannot do it
because of the infrastructure. It is important to do it monthly because the
data visualizations and statistics that Wikipedia communities will receive
need to stay up to date.

Lately there have been some changes in the Replica databases, and queries
that used to take several hours now get stuck completely. We have tried to
code them in multiple ways: a) using complex queries, b) doing the joins in
code, in memory, and c) downloading the parts of the tables that we require
and storing them in a local database. *None of these is an option now*
considering the current performance of the replicas.
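
As a concrete example of approach c), the batched extraction looks roughly
like this (a simplified sketch; the replica host, window size and target
table are placeholders):

# Simplified sketch of approach c): copy pagelinks from the replica in
# page-id windows and cache the rows in a local SQLite file.
import sqlite3

import pymysql

replica = pymysql.connect(
    host="cawiki.analytics.db.svc.eqiad.wmflabs",  # placeholder replica host
    db="cawiki_p",
    read_default_file="~/replica.my.cnf",
)
local = sqlite3.connect("pagelinks_cache.db")
local.execute("CREATE TABLE IF NOT EXISTS pagelinks (pl_from INTEGER, pl_title TEXT)")

with replica.cursor() as cur:
    cur.execute("SELECT MAX(page_id) FROM page")
    max_id = cur.fetchone()[0]

step = 100000  # page-id window per batch (placeholder)
for low in range(0, max_id, step):
    with replica.cursor() as cur:
        cur.execute(
            "SELECT pl_from, pl_title FROM pagelinks "
            "WHERE pl_from > %s AND pl_from <= %s AND pl_namespace = 0",
            (low, low + step),
        )
        rows = cur.fetchall()
    local.executemany(
        "INSERT INTO pagelinks VALUES (?, ?)",
        [(r[0], r[1].decode("utf-8")) for r in rows],
    )
    local.commit()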

Bryan Davis suggested that this might be the moment to consult the Analytics
team, considering that the Hadoop environment is designed to run long,
complex queries and has massively more compute power than the Wiki Replicas
cluster. We would certainly be relieved if you considered letting us connect
to these Analytics databases (Hadoop).

Let us know if you need more information on the specific queries or the
processes we are running. The server we are using is wcdo.eqiad.wmflabs. We
will be happy to explain in detail anything you require.

Thanks.
Best regards,

Marc Miquel

PS: You can read about the method we follow to retrieve data and create the
dataset here:

*Miquel-Ribé, M., & Laniado, D. (2019). Wikipedia Cultural Diversity
Dataset: A Complete Cartography for 300 Language Editions. Proceedings of
the 13th International AAAI Conference on Web and Social Media (ICWSM).
2334-0770.*
www.aaai.org/ojs/index.php/ICWSM/article/download/3260/3128/


[Analytics] Pageviews and unique devices to a specific set of pages

2019-07-08 Thread Dan Andreescu
Forwarding a quick question from Peter so we can answer it publicly or take
advantage of work others have done:

[Can we] estimate how many visitors visit pages with equations (i.e.,
wikitext math tags)?

When we're talking about "how many visitors", we're talking about our Unique
Devices data. This is an estimate, and the way it's computed restricts us to
high-level numbers at the project or project-family level (like de.wikipedia
or "all wikipedias"). So if an estimate of the number of visitors to a
specific subset of pages is required, we don't collect that data.
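
What we do have at that level can be pulled from the public REST API; here
is a minimal example (the project and date range are just placeholders):

# Example: monthly unique devices for a whole project from the
# Analytics REST API at wikimedia.org (project/dates are illustrative).
import requests

url = ("https://wikimedia.org/api/rest_v1/metrics/unique-devices/"
       "de.wikipedia.org/all-sites/monthly/20190101/20190701")
resp = requests.get(url, headers={"User-Agent": "example-script"})
for item in resp.json()["items"]:
    print(item["timestamp"], item["devices"])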

If what's needed is "how many visits", then we have Pageview data. This is
broken down per page, and is available as a bulk download, through an API,
etc. So if you can compile a list of pages that have some feature (equations,
for example), then it's possible to cross-reference that with pageview data
and get the answer. To compile this list in the specific case of equations,
you may be able to use the templatelinks table of the project you're
interested in. These tables are mirrored to the cloud DB replicas. So if
equations generally use templates, you can search that table for pages with
links to those templates, and that would be the list of pages you're
interested in. With those page titles / page IDs you can then query the
pageview data.
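
Here is a minimal sketch of that cross-reference. The titles are hard-coded
placeholders (in practice they would come from a templatelinks query on the
replicas), and the project and date range are examples:

# Sum daily pageviews for a list of page titles via the public
# pageviews API. Titles, project and dates are placeholders.
import urllib.parse

import requests

titles = ["Pythagorean_theorem", "Quadratic_equation"]  # placeholder list

API = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
       "en.wikipedia.org/all-access/user/{title}/daily/20190601/20190630")

total = 0
for title in titles:
    r = requests.get(API.format(title=urllib.parse.quote(title, safe="")),
                     headers={"User-Agent": "example-pageview-script"})
    if r.ok:
        total += sum(item["views"] for item in r.json()["items"])

print(f"{total} views across {len(titles)} pages in June 2019")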

Hope this helps explain our data a bit more; feel free to follow up with
questions.