Re: [Wikitech-l] Incoming and outgoing links enquiry

2018-03-19 Thread Erik Bernhardson
This information is mostly pre-calculated and available in the CirrusSearch
dumps at http://dumps.wikimedia.your.org/other/cirrussearch/current/

Each article is represented by a line of JSON in those dumps. There is a
field called `incoming_links`, which is the number of unique articles in the
content namespace(s) that link to that article. Each article additionally
contains an `outgoing_link` field, a list of strings naming the pages the
article links to (`incoming_links` is calculated by querying the
`outgoing_link` field). I've done graph work on Wikipedia with these dumps
before, and the `outgoing_link` field is typically enough to build a full
graph.
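
If it helps, here's a minimal Python sketch of how you might load the graph
from one of those files. The file name is only an example (take the current
*-cirrussearch-content.json.gz for your wiki from that directory), and any
line that isn't an article document, such as the index metadata lines, is
simply skipped:

    import gzip
    import json

    outgoing = {}  # title -> list of linked titles
    incoming = {}  # title -> pre-computed count of unique linking articles

    # Example file name; use the current content dump for your wiki.
    with gzip.open("enwiki-20180319-cirrussearch-content.json.gz", "rt") as f:
        for line in f:
            doc = json.loads(line)
            # Skip index metadata lines and anything without link data.
            if "outgoing_link" not in doc:
                continue
            outgoing[doc["title"]] = doc["outgoing_link"]
            incoming[doc["title"]] = doc.get("incoming_links", 0)

    print(len(outgoing), "articles loaded")

The outgoing lists give you the edge set directly, and inverting them is an
easy cross-check on the incoming counts.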


Re: [Wikitech-l] Incoming and outgoing links enquiry

2018-03-18 Thread John
I would second the recommendation of using the dumps for such a large
graphing project. If it's more than a couple hundred pages, the API and
database queries get unwieldy.


Re: [Wikitech-l] Incoming and outgoing links enquiry

2018-03-18 Thread Brian Wolff
Hi,

You can run longer queries by getting access to Toolforge
(https://wikitech.wikimedia.org/wiki/Portal:Toolforge) and running them from
the command line.

However, the query in question might still take an excessively long time if
you are running it over all of Wikipedia. I would expect that query to
produce about 150 MB of data and perhaps take days to complete.

You can also break it down into parts by adding a clause such as
WHERE page_title >= 'a' AND page_title < 'b'.
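
As a rough sketch of what that can look like from Toolforge (the replica
host name and the query are assumptions on my part; substitute the query
you forked on Quarry):

    import os
    import string
    import pymysql

    # Connect to the enwiki replica with the credentials file Toolforge
    # puts in your home directory.
    conn = pymysql.connect(
        host="enwiki.analytics.db.svc.eqiad.wmflabs",  # assumed replica host
        db="enwiki_p",
        read_default_file=os.path.expanduser("~/replica.my.cnf"),
        charset="utf8",
    )

    # Count outgoing links per article, one title slice at a time, so no
    # single query runs long enough to be killed.
    QUERY = """
        SELECT page_title, COUNT(*)
        FROM page JOIN pagelinks ON pl_from = page_id
        WHERE page_namespace = 0 AND page_title >= %s {upper}
        GROUP BY page_title
    """
    bounds = [""] + list(string.ascii_uppercase)  # crude title partition

    with conn.cursor() as cur, open("outgoing_counts.tsv", "w") as out:
        for i, lo in enumerate(bounds):
            if i + 1 < len(bounds):
                cur.execute(QUERY.format(upper="AND page_title < %s"),
                            (lo, bounds[i + 1]))
            else:
                # The last slice is open-ended so titles after 'Z' are kept.
                cur.execute(QUERY.format(upper=""), (lo,))
            for title, count in cur:
                out.write("%s\t%d\n" % (title.decode("utf-8"), count))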

Note, also of interest: a full dump of all the links is available at
https://dumps.wikimedia.org/enwiki/20180301/enwiki-20180301-pagelinks.sql.gz
(you would also need
https://dumps.wikimedia.org/enwiki/20180301/enwiki-20180301-page.sql.gz
to convert page IDs to page titles).
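
You don't need MySQL to read those files; the INSERT statements can be
parsed directly. A rough sketch, assuming the first three columns of each
row are (page_id, page_namespace, page_title) and (pl_from, pl_namespace,
pl_title) as in the current schemas (the regex is not a full SQL parser, so
treat the output with some care):

    import gzip
    import re

    # Matches the leading (int, int, 'string') of each inserted row.
    ROW = re.compile(r"\((\d+),(\d+),'((?:[^'\\]|\\.)*)'")

    def rows(path):
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                if line.startswith("INSERT INTO"):
                    for m in ROW.finditer(line):
                        yield int(m.group(1)), int(m.group(2)), m.group(3)

    # page_id -> title for articles (namespace 0); this table is large.
    id_to_title = {pid: title
                   for pid, ns, title in rows("enwiki-20180301-page.sql.gz")
                   if ns == 0}

    # One (source, target) edge per article-to-article link.
    with open("edges.tsv", "w") as out:
        for pl_from, ns, target in rows("enwiki-20180301-pagelinks.sql.gz"):
            if ns == 0 and pl_from in id_to_title:
                out.write("%s\t%s\n" % (id_to_title[pl_from], target))
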
--
Brian
On Sunday, March 18, 2018, Nick Bell  wrote:
> Hi there,
>
> I'm a final year Mathematics student at the University of Bristol, and I'm
> studying Wikipedia as a graph for my project.
>
> I'd like to get data regarding the number of outgoing links on each page,
> and the number of pages with links to each page. I have already
> inquired about this with the Analytics Team mailing list, who gave me a
> few suggestions.
>
> One of these was to run the code at this link
> https://quarry.wmflabs.org/query/25400
> with these instructions:
>
> "You will have to fork it and remove the "LIMIT 10" to get it to run on
> all the English Wikipedia articles. It may take too long or produce
> too much data, in which case please ask on this list for someone who
> can run it for you."
>
> I ran the code as instructed, but the query was killed as it took longer
> than 30 minutes to run. I asked if anyone on the mailing list could run it
> for me, but no one replied saying they could. The guy who wrote the code
> suggested I try this mailing list to see if anyone can help.
>
> I'm a beginner in programming and coding etc., so any and all help you can
> give me would be greatly appreciated.
>
> Many thanks,
> Nick Bell
> University of Bristol
___
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l