[Wikimedia-l] Re: Accessing wikipedia metadata

2021-09-16 Thread Gava, Cristina via Wikimedia-l
Hi Risker,

Thank you kindly for redirecting me to a more appropriate forum :)

Cristina
___
Wikimedia-l mailing list -- wikimedia-l@lists.wikimedia.org, guidelines at: 
https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and 
https://meta.wikimedia.org/wiki/Wikimedia-l
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikimedia-l@lists.wikimedia.org/message/EMUA2JRDOGHT4HLLKI7H52IVPHK4XIFY/
To unsubscribe send an email to wikimedia-l-le...@lists.wikimedia.org

[Wikimedia-l] Re: Accessing wikipedia metadata

2021-09-16 Thread WereSpielChequers
Dear Cristina,

You are likely to find more researchers and people who regularly work with
our metadata on the research mailing list.


Send Wiki-research-l mailing list submissions to
wiki-researc...@lists.wikimedia.org

To subscribe or unsubscribe, please visit

https://lists.wikimedia.org/postorius/lists/wiki-research-l.lists.wikimedia.org/

Regards

WSC


> On Thu, 16 Sept 2021 at 14:04, Gava, Cristina via Wikimedia-l <
> wikimedia-l@lists.wikimedia.org> wrote:
>
> > Hello everyone,
> >
> >
> >
> > This is my first time posting to this mailing list, so I would be happy
> > to receive feedback on how best to interact with the community :)
> >
> >
> >
> > I am trying to access Wikipedia metadata in a streaming and
> > time/resource-sustainable manner. By metadata I mean many of the items
> > that appear in the statistics of a wiki article, such as edits, the
> > list of editors, page views, etc.
> >
> > I would like to do this for an online-classifier type of structure:
> > retrieve the data from a large number of wiki pages at regular
> > intervals and use it as input for predictions.
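For the page-view part of that metadata, the Wikimedia Pageviews REST API already serves per-article daily time series. A minimal sketch of building the documented per-article endpoint URL; the article name, dates, and User-Agent string are illustrative placeholders:

```python
# Sketch: per-article pageview counts via the Wikimedia Pageviews REST API.
from urllib.parse import quote

BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"

def pageviews_url(project, article, start, end,
                  access="all-access", agent="user", granularity="daily"):
    """Build the Pageviews API URL for one article (dates as YYYYMMDD)."""
    return "/".join([BASE, project, access, agent,
                     quote(article.replace(" ", "_"), safe=""),
                     granularity, start, end])

# Example (requires network; e.g. with the `requests` package):
# import requests
# url = pageviews_url("en.wikipedia", "Albert Einstein",
#                     "20210901", "20210915")
# resp = requests.get(url, headers={
#     "User-Agent": "metadata-classifier/0.1 (contact email)"})
# views = [item["views"] for item in resp.json()["items"]]
```

Building the URL separately keeps the fetch loop trivial to batch or retry.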
> >
> >
> >
> > I tried using the Wikipedia API; however, it is expensive in time and
> > resources, both for me and for Wikipedia.
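The Action API load can be reduced considerably by batching up to 50 titles per `action=query` request and setting `maxlag` so the script backs off when the servers are lagged. A minimal sketch; the endpoint and parameters are the standard MediaWiki ones, while the title list and User-Agent are placeholders:

```python
# Sketch: batching MediaWiki Action API requests (50 titles per query is
# the standard limit for non-bot accounts).
def chunks(titles, size=50):
    """Yield successive batches of titles."""
    for i in range(0, len(titles), size):
        yield titles[i:i + size]

def build_params(batch):
    """Query basic page info plus last-revision metadata for one batch."""
    return {
        "action": "query",
        "titles": "|".join(batch),
        "prop": "info|revisions",
        "rvprop": "timestamp|user",
        "maxlag": 5,          # back off politely when replication lags
        "format": "json",
        "formatversion": 2,
    }

# Example (requires network):
# import requests
# for batch in chunks(article_titles):
#     r = requests.get("https://en.wikipedia.org/w/api.php",
#                      params=build_params(batch),
#                      headers={"User-Agent": "metadata-classifier/0.1"})
#     pages = r.json()["query"]["pages"]
```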
> >
> >
> >
> > My preferred option now would be to query the relevant tables in the
> > Wikipedia database directly, in the same way this is done through the
> > Quarry tool. The problem with Quarry is that I would like to build a
> > standalone script, without depending on a user interface like Quarry.
> > Do you think this is possible? I am still fairly new to all of this and
> > don’t know exactly which direction is best.
> >
> > I saw [1] that I could access the wiki replicas through both Toolforge
> > and PAWS, but I didn’t understand which one would serve me better;
> > could I ask you for some feedback?
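A Quarry-style query can indeed run as a standalone script from a Toolforge account against the Wiki Replicas. The sketch below assumes `pymysql`, the standard `~/replica.my.cnf` credentials file, and the documented per-wiki replica hostname pattern; the page title is illustrative:

```python
# Sketch: querying the Wiki Replicas from Toolforge with pymysql.
import configparser
import os

REPLICA_HOST = "enwiki.analytics.db.svc.wikimedia.cloud"  # per-wiki pattern
REPLICA_DB = "enwiki_p"

# Example query: total edits to one mainspace (namespace 0) page.
EDIT_COUNT_SQL = (
    "SELECT COUNT(*) FROM revision "
    "JOIN page ON rev_page = page_id "
    "WHERE page_namespace = 0 AND page_title = %s"
)

def replica_credentials(path="~/replica.my.cnf"):
    """Read the Toolforge-provided MySQL credentials file."""
    cfg = configparser.ConfigParser()
    cfg.read(os.path.expanduser(path))
    return cfg["client"]["user"], cfg["client"]["password"]

def fetch_edit_count(title):
    """Run the query against the replica (works on Toolforge/PAWS only)."""
    import pymysql  # preinstalled on Toolforge
    user, password = replica_credentials()
    conn = pymysql.connect(host=REPLICA_HOST, user=user,
                           password=password, database=REPLICA_DB)
    try:
        with conn.cursor() as cur:
            cur.execute(EDIT_COUNT_SQL, (title,))
            return cur.fetchone()[0]
    finally:
        conn.close()

# Example (on Toolforge): fetch_edit_count("Albert_Einstein")
```

PAWS suits interactive exploration in notebooks; Toolforge suits exactly this kind of unattended, scheduled script.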
> >
> >
> >
> > Also, as far as I understood [2], directly accessing the DB through
> > Hive is too technical for what I need, right? Especially because it
> > seems I would need an account with production shell access, and I
> > honestly don’t think I would be granted it. Besides, I am not
> > interested in accessing sensitive or private data.
> >
> >
> >
> > A last resort would be parsing the analytics dumps; however, this
> > seems a less organic way of retrieving and cleaning the data. It would
> > also be strongly decentralised and dependent on my physical machine,
> > unless I uploaded the cleaned data somewhere online every time.
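For the dumps route, the hourly pageview files under dumps.wikimedia.org/other/pageviews/ can be streamed line by line rather than loaded whole, which keeps the machine-resource cost small. A sketch assuming the documented `domain_code page_title count_views total_response_size` line format; the file path and watchlist are placeholders:

```python
# Sketch: streaming one hourly pageviews dump without loading it into memory.
# Line format in these dumps: "domain_code page_title count_views
# total_response_size", e.g. "en Albert_Einstein 42 0".
import gzip

def parse_line(line):
    """Return (domain_code, title, views), or None for malformed lines."""
    parts = line.rstrip("\n").split(" ")
    if len(parts) != 4:
        return None
    domain, title, views, _size = parts
    try:
        return domain, title, int(views)
    except ValueError:
        return None

def stream_views(path, domain="en", watchlist=None):
    """Yield (title, views) for one project, optionally filtered to a set."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        for line in fh:
            parsed = parse_line(line)
            if parsed is None:
                continue
            d, title, views = parsed
            if d == domain and (watchlist is None or title in watchlist):
                yield title, views
```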
> >
> >
> >
> > Sorry for the long message, but I thought it better to give you a
> > clearer picture (hoping this is clear enough). Even a hint would be
> > highly appreciated.
> >
> >
> >
> > Best,
> >
> > Cristina
> >
> >
> >
> > [1] https://meta.wikimedia.org/wiki/Research:Data
> >
> > [2] https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake

[Wikimedia-l] Re: Accessing wikipedia metadata

2021-09-16 Thread Gava, Cristina via Wikimedia-l
Hi Mike,

Thank you very much for the reply and for giving me sample material, I'll look 
into that now.

Cristina

[Wikimedia-l] Re: Accessing wikipedia metadata

2021-09-16 Thread Risker
Mike's suggestion is good. You would likely get better responses by asking
this question on the Wikimedia developers' list, so I am forwarding it there.

Risker

On Thu, 16 Sept 2021 at 14:04, Gava, Cristina via Wikimedia-l <
wikimedia-l@lists.wikimedia.org> wrote:

> [snip: original message quoted in full above]

[Wikimedia-l] Re: Accessing wikipedia metadata

2021-09-16 Thread Mike Peel

Hi Cristina,

I'd recommend Toolforge, which I use to run the regular queries that power
some of my bot tools. For an example of a Python script I run there to
query info and FTP it somewhere I can access easily, see:

https://bitbucket.org/mikepeel/wikicode/src/master/query_enwp_articles_no_wikidata.py

Thanks,
Mike

On 16/9/21 16:42:31, Gava, Cristina via Wikimedia-l wrote:

[snip: original message quoted in full above]

