Mike's suggestion is good. You would likely get better responses by asking the Wikimedia developers, so I am forwarding this to that list. Below the quoted message I have appended rough sketches of the three access routes Cristina mentions (the API, the wiki replicas, and the dumps).
Risker

On Thu, 16 Sept 2021 at 14:04, Gava, Cristina via Wikimedia-l <wikimedi...@lists.wikimedia.org> wrote:

> Hello everyone,
>
> It is my first time interacting on this mailing list, so I will be happy to receive feedback on how to better interact with the community :)
>
> I am trying to access Wikipedia metadata in a streaming and time/resource-sustainable manner. By metadata I mean many of the items that can be found in the statistics of a wiki article, such as edits, the list of editors, page views, etc. I would like to use this for an online-classifier type of structure: retrieve the data from a large number of wiki pages at regular intervals and use it as input for predictions.
>
> I tried the wiki API, but it is expensive in time and resources, both for me and for Wikipedia.
>
> My preferred choice now would be to query the specific tables in the Wikipedia database, the same way this is done through the Quarry tool. The problem with Quarry is that I would like to build a standalone script, without depending on a user interface like Quarry. Do you think this is possible? I am still fairly new to all of this and I don't know exactly which direction is best. I saw [1] that I could access the wiki replicas through both Toolforge and PAWS, but I didn't understand which one would serve me better; could I ask you for some feedback?
>
> Also, as far as I understood [2], directly accessing the DB through Hive is too technical for what I need, right? Especially since it seems I would need an account with production shell access, which I honestly don't think I would be granted. I am also not interested in accessing sensitive or private data.
>
> The last resort is parsing the analytics dumps, but this seems a less organic way of retrieving and polishing the data. It would also be strongly decentralised and tied to a single physical machine, unless I uploaded the polished data every time.
>
> Sorry for the long message, but I thought it was better to give you a clear picture. Even a hint would be highly appreciated.
>
> Best,
>
> Cristina
>
> [1] https://meta.wikimedia.org/wiki/Research:Data
> [2] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake
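For the page-views part of the question, the per-article endpoint of the Wikimedia REST API (the Pageviews API) is usually much cheaper than repeated action-API calls, since one request returns a whole date range for an article. A minimal Python sketch; the article title, date range, and User-Agent value are placeholders:

import requests

# Per-article page views from the Wikimedia REST API (Pageviews API).
BASE = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
        "{project}/all-access/user/{article}/daily/{start}/{end}")

# Wikimedia asks clients to send a descriptive User-Agent; this value is
# a placeholder.
HEADERS = {"User-Agent": "metadata-collector/0.1 (you@example.org)"}

def daily_views(article, project="en.wikipedia",
                start="20210901", end="20210915"):
    """Return [(timestamp, views), ...] for one article and date range."""
    url = BASE.format(project=project, article=article, start=start, end=end)
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return [(item["timestamp"], item["views"]) for item in resp.json()["items"]]

print(daily_views("Python_(programming_language)"))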
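On the Quarry question: the same replica databases that Quarry queries can be reached from a standalone script running on Toolforge (PAWS also works and is the quicker start, being a hosted Jupyter environment, while Toolforge suits an unattended script). A sketch, assuming a Toolforge account, which provides the ~/replica.my.cnf credentials file; the host name follows the Wiki Replicas naming scheme documented on wikitech, and pymysql is one client that works there:

import pymysql

# Standalone query against the enwiki replica, run from inside Toolforge.
# Host name per the Wiki Replicas naming scheme; ~/replica.my.cnf is the
# credentials file Toolforge creates for each account (both assumptions
# to verify on wikitech).
conn = pymysql.connect(
    host="enwiki.analytics.db.svc.wikimedia.cloud",
    database="enwiki_p",
    read_default_file="~/replica.my.cnf",
    charset="utf8mb4",
)

with conn.cursor() as cur:
    # Edits per day for one article over its 30 most recent active days --
    # the same kind of query one would paste into Quarry.
    cur.execute(
        """
        SELECT LEFT(rev_timestamp, 8) AS day, COUNT(*) AS edits
        FROM revision
        JOIN page ON rev_page = page_id
        WHERE page_namespace = 0 AND page_title = %s
        GROUP BY day
        ORDER BY day DESC
        LIMIT 30
        """,
        ("Python_(programming_language)",),
    )
    for day, edits in cur.fetchall():
        # rev_timestamp is a binary string, so the day comes back as bytes.
        print(day.decode(), edits)

conn.close()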
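And for the dumps route: the hourly pageview files can be streamed and filtered on the fly, without keeping anything on disk, which removes most of the physical-machine dependence. A sketch, assuming the dumps.wikimedia.org pageviews layout and the space-separated "project page_title count bytes" line format; verify both against the current dumps documentation before relying on them:

import gzip
import requests

# One hour of pageview counts, streamed and filtered on the fly.
# URL pattern and line format are assumptions taken from the
# dumps.wikimedia.org pageviews layout.
URL = ("https://dumps.wikimedia.org/other/pageviews/2021/2021-09/"
       "pageviews-20210916-140000.gz")

resp = requests.get(URL, stream=True, timeout=60,
                    headers={"User-Agent": "metadata-collector/0.1 (you@example.org)"})
resp.raise_for_status()

# gzip.open accepts the raw response as a file object, so the dump is
# decompressed line by line without ever touching disk.
with gzip.open(resp.raw, mode="rt", encoding="utf-8", errors="replace") as lines:
    for line in lines:
        parts = line.split(" ")
        if len(parts) != 4:
            continue
        project, title, count, _ = parts
        if project == "en" and title == "Python_(programming_language)":
            print(title, count)
            break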