[Wikimedia-l] Re: Accessing wikipedia metadata

Mike Peel Thu, 16 Sep 2021 11:16:58 -0700

Hi Cristina,

I'd recommend Toolforge, which I used to run regular queries that power some of my bot tools. For an example of a Python script I run there to query info and ftp it to somewhere I can easily access, see:

https://bitbucket.org/mikepeel/wikicode/src/master/query_enwp_articles_no_wikidata.py
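
For a rough idea of what a standalone script against the wiki replicas could look like, here is a minimal sketch. It is not the linked script. It assumes it runs on Toolforge, that the third-party pymysql package is installed, and that credentials are in ~/replica.my.cnf; the function names and the choice of statistics (edit count and distinct editors for one page) are only illustrative.

```python
"""Sketch: count edits and distinct editors for one page via the wiki replicas."""
import os


def replica_host(wiki: str) -> str:
    # Assumed naming convention for the Toolforge analytics replicas.
    return f"{wiki}.analytics.db.svc.wikimedia.cloud"


def replica_database(wiki: str) -> str:
    # Public replica databases carry a "_p" suffix, e.g. enwiki_p.
    return f"{wiki}_p"


def normalise_title(title: str) -> str:
    # Page titles are stored with underscores instead of spaces.
    return title.strip().replace(" ", "_")


# Parameterised query: placeholders are filled in by the database driver.
EDIT_STATS_SQL = """
SELECT COUNT(*) AS edits, COUNT(DISTINCT rev_actor) AS editors
FROM revision
JOIN page ON page_id = rev_page
WHERE page_namespace = %s AND page_title = %s
"""


def fetch_edit_stats(wiki: str, title: str, namespace: int = 0) -> dict:
    """Return edit and editor counts for one page (hypothetical helper)."""
    import pymysql  # imported here so the helpers above work without it

    conn = pymysql.connect(
        host=replica_host(wiki),
        database=replica_database(wiki),
        read_default_file=os.path.expanduser("~/replica.my.cnf"),
        charset="utf8mb4",
    )
    try:
        with conn.cursor() as cur:
            cur.execute(EDIT_STATS_SQL, (namespace, normalise_title(title)))
            edits, editors = cur.fetchone()
        return {"edits": edits, "editors": editors}
    finally:
        conn.close()
```

On Toolforge this would be called as, for example, fetch_edit_stats("enwiki", "Douglas Adams"), and could be scheduled to run at regular intervals.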


Thanks,
Mike

On 16/9/21 16:42:31, Gava, Cristina via Wikimedia-l wrote:

Hello everyone,
It is my first time interacting on this mailing list, so I will be happy to receive further feedback on how to better interact with the community :)
I am trying to access Wikipedia metadata in a streaming and time/resource sustainable manner. By metadata I mean many of the items that can be found in the statistics of a wiki article, such as edits, editors list, page views etc.
I would like to do this for an online-classifier type of structure: retrieve the data from a large number of wiki pages at regular intervals and use it as input for predictions.
I tried to use the Wiki API; however, it is time and resource expensive, both for me and for Wikipedia.
My preferred choice now would be to query the specific tables in the Wikipedia database, in the same way this is done through the Quarry tool. The problem with Quarry is that I would like to build a standalone script, without having to depend on a user interface like Quarry. Do you think that this is possible? I am still fairly new to all of this and I don’t know exactly which is the best direction.
I saw [1] <https://meta.wikimedia.org/wiki/Research:Data> that I could access wiki replicas through both Toolforge and PAWS; however, I didn’t understand which one would serve me better. Could I ask you for some feedback?
Also, as far as I understood [2] <https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake>, directly accessing the DB through Hive is too technical for what I need, right? Especially because it seems that I would need an account with production shell access, and I honestly don’t think that I would be granted access to it. Also, I am not interested in accessing sensitive or private data.
The last resort is parsing analytics dumps; however, this seems a less organic way of retrieving and polishing the data. It would also be strongly decentralised and physical-machine dependent, unless I upload the polished data online every time.
Sorry for this long message, but I thought it was better to give you a clearer picture (hoping this is clear enough). If you could give me even a hint, it would be highly appreciated.
Best,

Cristina
[1] https://meta.wikimedia.org/wiki/Research:Data
[2] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake
_______________________________________________
Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: 
https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and 
https://meta.wikimedia.org/wiki/Wikimedia-l
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/message/6OZE7WIRDCMRA7TESD6XVCVB6ZQV4OFP/
To unsubscribe send an email to wikimedia-l-le...@lists.wikimedia.org
