Mike's suggestion is good. You would likely get better responses by asking the Wikimedia developers, so I am forwarding this to that list. Below the quoted message I have appended rough sketches of the three access routes Cristina mentions (the API, the wiki replicas, and the dumps).
Risker

On Thu, 16 Sept 2021 at 14:04, Gava, Cristina via Wikimedia-l <wikimedi...@lists.wikimedia.org> wrote:

> Hello everyone,
>
> It is my first time interacting on this mailing list, so I will be happy to receive feedback on how to better interact with the community :)
>
> I am trying to access Wikipedia metadata in a streaming and time/resource-sustainable manner. By metadata I mean many of the items that can be found in the statistics of a wiki article, such as edits, the list of editors, page views, etc. I would like to use this for an online-classifier type of structure: retrieve the data from a large number of wiki pages at regular intervals and use it as input for predictions.
>
> I tried the wiki API, but it is expensive in time and resources, both for me and for Wikipedia.
>
> My preferred choice now would be to query the specific tables in the Wikipedia database, the same way this is done through the Quarry tool. The problem with Quarry is that I would like to build a standalone script, without depending on a user interface like Quarry. Do you think this is possible? I am still fairly new to all of this and I don't know exactly which direction is best. I saw [1] that I could access the wiki replicas through both Toolforge and PAWS, but I didn't understand which one would serve me better; could I ask you for some feedback?
>
> Also, as far as I understood [2], directly accessing the DB through Hive is too technical for what I need, right? Especially since it seems I would need an account with production shell access, which I honestly don't think I would be granted. I am also not interested in accessing sensitive or private data.
>
> The last resort is parsing the analytics dumps, but this seems a less organic way of retrieving and polishing the data. It would also be strongly decentralised and tied to a single physical machine, unless I uploaded the polished data every time.
>
> Sorry for the long message, but I thought it was better to give you a clear picture. Even a hint would be highly appreciated.
>
> Best,
>
> Cristina
>
> [1] https://meta.wikimedia.org/wiki/Research:Data
> [2] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake
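For the page-views part of the question, the per-article endpoint of the Wikimedia REST API (the Pageviews API) is usually much cheaper than repeated action-API calls, since one request returns a whole date range for an article. A minimal Python sketch; the article title, date range, and User-Agent value are placeholders:

import requests

# Per-article page views from the Wikimedia REST API (Pageviews API).
BASE = ("https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
        "{project}/all-access/user/{article}/daily/{start}/{end}")

# Wikimedia asks clients to send a descriptive User-Agent; this value is
# a placeholder.
HEADERS = {"User-Agent": "metadata-collector/0.1 (you@example.org)"}

def daily_views(article, project="en.wikipedia",
                start="20210901", end="20210915"):
    """Return [(timestamp, views), ...] for one article and date range."""
    url = BASE.format(project=project, article=article, start=start, end=end)
    resp = requests.get(url, headers=HEADERS, timeout=30)
    resp.raise_for_status()
    return [(item["timestamp"], item["views"]) for item in resp.json()["items"]]

print(daily_views("Python_(programming_language)"))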
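On the Quarry question: the same replica databases that Quarry queries can be reached from a standalone script running on Toolforge (PAWS also works and is the quicker start, being a hosted Jupyter environment, while Toolforge suits an unattended script). A sketch, assuming a Toolforge account, which provides the ~/replica.my.cnf credentials file; the host name follows the Wiki Replicas naming scheme documented on wikitech, and pymysql is one client that works there:

import pymysql

# Standalone query against the enwiki replica, run from inside Toolforge.
# Host name per the Wiki Replicas naming scheme; ~/replica.my.cnf is the
# credentials file Toolforge creates for each account (both assumptions
# to verify on wikitech).
conn = pymysql.connect(
    host="enwiki.analytics.db.svc.wikimedia.cloud",
    database="enwiki_p",
    read_default_file="~/replica.my.cnf",
    charset="utf8mb4",
)

with conn.cursor() as cur:
    # Edits per day for one article over its 30 most recent active days --
    # the same kind of query one would paste into Quarry.
    cur.execute(
        """
        SELECT LEFT(rev_timestamp, 8) AS day, COUNT(*) AS edits
        FROM revision
        JOIN page ON rev_page = page_id
        WHERE page_namespace = 0 AND page_title = %s
        GROUP BY day
        ORDER BY day DESC
        LIMIT 30
        """,
        ("Python_(programming_language)",),
    )
    for day, edits in cur.fetchall():
        # rev_timestamp is a binary string, so the day comes back as bytes.
        print(day.decode(), edits)

conn.close()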
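And for the dumps route: the hourly pageview files can be streamed and filtered on the fly, without keeping anything on disk, which removes most of the physical-machine dependence. A sketch, assuming the dumps.wikimedia.org pageviews layout and the space-separated "project page_title count bytes" line format; verify both against the current dumps documentation before relying on them:

import gzip
import requests

# One hour of pageview counts, streamed and filtered on the fly.
# URL pattern and line format are assumptions taken from the
# dumps.wikimedia.org pageviews layout.
URL = ("https://dumps.wikimedia.org/other/pageviews/2021/2021-09/"
       "pageviews-20210916-140000.gz")

resp = requests.get(URL, stream=True, timeout=60,
                    headers={"User-Agent": "metadata-collector/0.1 (you@example.org)"})
resp.raise_for_status()

# gzip.open accepts the raw response as a file object, so the dump is
# decompressed line by line without ever touching disk.
with gzip.open(resp.raw, mode="rt", encoding="utf-8", errors="replace") as lines:
    for line in lines:
        parts = line.split(" ")
        if len(parts) != 4:
            continue
        project, title, count, _ = parts
        if project == "en" and title == "Python_(programming_language)":
            print(title, count)
            break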