Hi Giuseppe, Ward,

On Tue, Feb 21, 2017 at 5:48 PM, Giuseppe Profiti <[email protected]> wrote:
> 2017-02-19 20:56 GMT+01:00 Mara Sorella <[email protected]>:
> > Hi everybody,
> >
> > I'm new to the list and have been referred here by a comment from an SO
> > user on my question [1], which I quote next:
> >
> > I have successfully used the Wikipedia pagelinks SQL dump to obtain
> > hyperlinks between Wikipedia pages for a specific revision time.
> >
> > However, there are cases where multiple instances of such links exist,
> > e.g. between the very same https://en.wikipedia.org/wiki/Wikipedia page
> > and https://en.wikipedia.org/wiki/Wikimedia_Foundation. I'm interested in
> > finding the number of links between pairs of pages for a specific
> > revision.
> >
> > Ideal solutions would involve dump files other than pagelinks (which I'm
> > not aware of), or using the MediaWiki API.
> >
> > To elaborate, I need this information to weight (almost) every hyperlink
> > between article pages (that is, in NS0) that was present in a specific
> > Wikipedia revision (end of 2015). Therefore, I would prefer not to follow
> > the solution suggested by the SO user, which would be rather impractical.
>
> Hi Mara,
> The MediaWiki API does not return the multiplicity of the links [1]. As
> far as I can see from the database layout, you can't get the
> multiplicity of links from it either [2]. The only solution that
> occurs to me is to parse the wikitext of the page, as the SO user
> suggested.
>
> In any case, some communities have established writing styles that
> discourage multiple links to the same article (e.g. in the
> Italian Wikipedia a link is attached only to the first occurrence of
> the word). So the numbers you get may vary depending on the
> style of the community and/or the last editor.

Yes, this is a good practice that I have noticed is very widespread. Indeed, it would cause the link-multiplicity-based weighting approach to fail. A (costly) option would be to inspect the actual article text (possibly only the abstract).
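To make the wikitext-parsing option concrete, here is a minimal sketch in Python of counting how many times a page links to each target. It assumes the raw wikitext of one revision is already available as a string (for example, extracted from a pages-articles dump). The names `WIKILINK`, `normalize_title` and `count_links` are hypothetical helpers, not part of MediaWiki or any library. The regular expression only sees literal `[[...]]` links, so links produced by templates are missed, and the namespace filter is deliberately crude.

```python
import re
from collections import Counter

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]].
# Group 1 captures the target title without section or label.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def normalize_title(raw):
    """Approximate title normalization: underscores become spaces,
    whitespace is collapsed, and the first letter is upper-cased."""
    title = " ".join(raw.replace("_", " ").split())
    return title[:1].upper() + title[1:]


def count_links(wikitext):
    """Return a Counter mapping each link target to its multiplicity."""
    counts = Counter()
    for match in WIKILINK.finditer(wikitext):
        title = normalize_title(match.group(1))
        # Crude NS0 filter (assumption): skip every title containing a
        # colon, e.g. File: or Category:. This also drops legitimate
        # article titles that contain a colon.
        if title and ":" not in title:
            counts[title] += 1
    return counts


sample = (
    "'''Wikipedia''' is hosted by the [[Wikimedia Foundation]]. "
    "The [[wikimedia_Foundation|WMF]] also runs [[Wiktionary]]. "
    "[[File:Logo.png|thumb]] [[Category:Wikis]]"
)
link_counts = count_links(sample)
print(link_counts)
```

On the sample string, both spellings of the Wikimedia Foundation link are folded into one target, while the file and category links are skipped. Redirects are not resolved here, so a link through a redirect page would be counted under the redirect's title.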
I guess this can be done starting from the dump files. @Ward: could your technology be of help for this task?

> > Indeed, my final aim is to use this weight in a thresholding fashion to
> > sparsify the Wikipedia graph (which, due to its short diameter, is more
> > or less a giant connected component), in a way that should reflect the
> > "relatedness" of the linked pages (where relatedness is not intended as
> > strictly semantic, but at a higher "concept" level, if I may say so).
> > For this reason, other suggestions on how to determine such weights
> > (possibly using other data sources -- ontologies?) are more than welcome.
>
> When you get the graph of connections, instead of using the
> multiplicity as weight, you could try to use community detection
> methods to isolate subclusters of strongly connected articles.
> Another approach may be to use centrality measures; however, the only
> one that can be applied to edges instead of just nodes is betweenness
> centrality, if I remember correctly.

Currently, I have resorted to keeping only reciprocal links, but I still get quite big connected components (despite the fact that I'm actually carrying out a temporal analysis, where I consider, for each time instant, only pages exhibiting unusually high traffic). Concerning community detection techniques and centrality: I discarded them because I don't want to "impose" connectedness (reachability) at the subgraph level, but only between single entities (since my algorithm aims to find some sort of temporally persistent subgraphs having certain properties).

> If a fast technical solution comes to mind, I'll write here
> again.
>
> Best,
> Giuseppe
>
> [1] https://en.wikipedia.org/w/api.php?action=query&prop=links&titles=Wikipedia&plnamespace=0&pllimit=500&pltitles=Wikimedia_Foundation
> [2] https://upload.wikimedia.org/wikipedia/commons/9/94/MediaWiki_1.28.0_database_schema.svg

Thank you both for your feedback!

Best,
Mara
_______________________________________________
Wiki-research-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
