Re: [Wikitech-l] Incoming and outgoing links enquiry
This information is available, mostly pre-calculated, in the CirrusSearch dumps at http://dumps.wikimedia.your.org/other/cirrussearch/current/. Each article is represented by a line of JSON in those dumps. There is a field called `incoming_links`, which is the number of unique articles in the content namespace(s) that link to that article. Each article additionally contains an `outgoing_link` field, which holds a list of strings naming the pages the article links to (`incoming_links` is calculated by querying the `outgoing_link` field). I've done graph work on Wikipedia before using this, and the `outgoing_link` field is typically enough to build a full graph.

On Sun, Mar 18, 2018 at 2:18 PM, John wrote:
> I would second the recommendation of using the dumps for such a large
> graphing project. If it's more than a couple hundred pages, the
> API/database queries can get bulky.
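The dump format described above can be sketched in a few lines. This is a minimal sketch under some assumptions: the `title`, `incoming_links`, and `outgoing_link` field names come from the post, but the two sample lines are made up, and real CirrusSearch dumps interleave Elasticsearch bulk "index" action lines with the document lines, which is why the code keeps only lines that carry a `title`.

```python
import json

def load_graph(lines):
    """Build an adjacency map {title: [linked titles]} from CirrusSearch
    dump lines. Per the post above, each article's 'outgoing_link' field
    lists the pages it links to."""
    graph = {}
    for line in lines:
        doc = json.loads(line)
        # Real dumps interleave Elasticsearch bulk "index" action lines
        # with the document lines; only documents carry a "title" field.
        if "title" in doc:
            graph[doc["title"]] = doc.get("outgoing_link", [])
    return graph

# Two made-up lines standing in for the (gzipped, much larger) dump:
sample = [
    '{"title": "A", "incoming_links": 1, "outgoing_link": ["B"]}',
    '{"title": "B", "incoming_links": 1, "outgoing_link": ["A"]}',
]
print(load_graph(sample))  # {'A': ['B'], 'B': ['A']}
```

For a whole-Wikipedia run you would stream the dump file line by line rather than hold it in memory, but the per-line structure is the same.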
Re: [Wikitech-l] Incoming and outgoing links enquiry
I would second the recommendation of using the dumps for such a large graphing project. If it's more than a couple hundred pages, the API/database queries can get bulky.

On Sun, Mar 18, 2018 at 5:07 PM Brian Wolff wrote:
> Hi,
>
> You can run longer queries by getting access to Toolforge
> (https://wikitech.wikimedia.org/wiki/Portal:Toolforge) and running from
> the command line.
>
> However, the query in question might still take an excessively long time
> (if you are doing all of Wikipedia). I would expect that query to result
> in about 150 MB of data and maybe take days to complete.
Re: [Wikitech-l] Incoming and outgoing links enquiry
Hi,

You can run longer queries by getting access to Toolforge (https://wikitech.wikimedia.org/wiki/Portal:Toolforge) and running from the command line.

However, the query in question might still take an excessively long time (if you are doing all of Wikipedia). I would expect that query to result in about 150 MB of data and maybe take days to complete.

You can also break it down into parts by adding WHERE page_title >= 'a' AND page_title < 'b'.

Note, also of interest: full dumps of all the links are available at https://dumps.wikimedia.org/enwiki/20180301/enwiki-20180301-pagelinks.sql.gz (you would also need https://dumps.wikimedia.org/enwiki/20180301/enwiki-20180301-page.sql.gz to convert page ids to page names).

--
Brian

On Sunday, March 18, 2018, Nick Bell wrote:
> Hi there,
>
> I'm a final year Mathematics student at the University of Bristol, and
> I'm studying Wikipedia as a graph for my project.
>
> I'd like to get data regarding the number of outgoing links on each page,
> and the number of pages with links to each page. I have already inquired
> about this with the Analytics Team mailing list, who gave me a few
> suggestions.
>
> One of these was to run the code at this link
> https://quarry.wmflabs.org/query/25400
> with these instructions:
>
> "You will have to fork it and remove the "LIMIT 10" to get it to run on
> all the English Wikipedia articles. It may take too long or produce
> too much data, in which case please ask on this list for someone who
> can run it for you."
>
> I ran the code as instructed, but the query was killed as it took longer
> than 30 minutes to run. I asked if anyone on the mailing list could run
> it for me, but no one replied saying they could. The person who wrote
> the code suggested I try this mailing list to see if anyone can help.
>
> I'm a beginner in programming and coding, so any and all help you can
> give me would be greatly appreciated.
>
> Many thanks,
> Nick Bell
> University of Bristol

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
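Brian's two suggestions (splitting the query into title ranges, and using the page dump to turn id-based link rows into title-to-title edges) can be sketched as follows. The data here is entirely hypothetical stand-in rows: the real dumps are SQL INSERT statements that would need to be parsed or loaded into a database first, and the a-to-z ranges are only illustrative.

```python
import string

# 1. The range trick: generate one WHERE clause per letter so the Quarry
#    query can be run in smaller chunks instead of one 30-minute-plus run.
letters = string.ascii_lowercase
clauses = [
    f"WHERE page_title >= '{lo}' AND page_title < '{hi}'"
    for lo, hi in zip(letters, letters[1:])
]
print(clauses[0])  # WHERE page_title >= 'a' AND page_title < 'b'

# 2. Converting id-based link rows into title -> title edges, which is
#    what the page dump enables.  Stand-in rows (hypothetical data):
page = {1: "A", 2: "B", 3: "C"}             # page id -> page title
pagelinks = [(1, "B"), (1, "C"), (2, "A")]  # (source page id, target title)
edges = [(page[src], dst) for src, dst in pagelinks if src in page]
print(edges)  # [('A', 'B'), ('A', 'C'), ('B', 'A')]
```

Note that real page titles can start with uppercase letters, digits, or non-Latin characters, so actual range boundaries would need to cover the full title space, not just a-to-z.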