Hi Giuseppe, Ward

On Tue, Feb 21, 2017 at 5:48 PM, Giuseppe Profiti <[email protected]>
wrote:

> 2017-02-19 20:56 GMT+01:00 Mara Sorella <[email protected]>:
> > Hi everybody, I'm new to the list and have been referred here by a comment
> > from a SO user as per my question [1], that I'm quoting next:
> >
> >
> > I have been successfully able to use the Wikipedia pagelinks SQL dump to
> > obtain hyperlinks between Wikipedia pages for a specific revision time.
> >
> > However, there are cases where multiple instances of such links exist, e.g.
> > the very same https://en.wikipedia.org/wiki/Wikipedia page and
> > https://en.wikipedia.org/wiki/Wikimedia_Foundation. I'm interested in finding
> > the number of links between pairs of pages for a specific revision.
> >
> > Ideal solutions would involve dump files other than pagelinks (which I'm not
> > aware of), or using the MediaWiki API.
> >
> >
> >
> > To elaborate, I need this information to weight (almost) every hyperlink
> > between article pages (that is, in NS0) that was present in a specific
> > Wikipedia revision (end of 2015). Therefore, I would prefer not to follow
> > the solution suggested by the SO user, which would be rather impractical.
>
> Hi Mara,
> The MediaWiki API does not return the multiplicity of the links [1]. As
> far as I can see from the database layout, you can't get the
> multiplicity of links from it either [2]. The only solution that
> occurs to me is to parse the wikitext of the page, like the SO user
> suggested.
>
> In any case, some communities established writing styles that
> discourage multiple links towards the same article (e.g. in the
> Italian Wikipedia a link is associated only with the first occurrence of
> the word). So the numbers you could get may vary depending on the
> style of the community and/or last editor.
>
Yes, this is a good practice that I have noticed is very widespread. Indeed,
it would make the link-multiplicity-based weighting approach fail.
A (costly) option would be to inspect the actual article text (possibly
only the abstract). I guess this can be done starting from the dump files.
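
A minimal sketch of what inspecting the text could look like, assuming the
wikitext of a page revision has already been extracted from a dump as a
Python string. The names count_links, normalize_title and SKIPPED_PREFIXES
are hypothetical, the prefix list is incomplete, links generated by templates
are not seen, and redirects are not resolved, so the counts are only
approximate.

```python
import re
from collections import Counter

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]].
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")

# Assumption: a small, incomplete list of non-article namespace prefixes.
SKIPPED_PREFIXES = {"file", "image", "category", "template", "wikipedia", "help"}


def normalize_title(title):
    """Approximate title normalization: underscores become spaces,
    whitespace is collapsed, the first letter is capitalized."""
    title = " ".join(title.replace("_", " ").split()).lstrip(":").strip()
    return title[:1].upper() + title[1:]


def count_links(wikitext):
    """Return a Counter mapping each linked article title to the number
    of explicit [[...]] links pointing to it in the given wikitext."""
    counts = Counter()
    for match in WIKILINK.finditer(wikitext):
        target = normalize_title(match.group(1))
        if not target:
            continue
        if ":" in target and target.split(":", 1)[0].lower() in SKIPPED_PREFIXES:
            continue  # skip links outside the article namespace
        counts[target] += 1
    return counts
```

Calling count_links on the text of the Wikipedia article and reading the
entry for "Wikimedia Foundation" would give the weight of that single edge;
repeating this for every NS0 page is what makes the option costly.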

@Ward: could your technology be of help for this task?


> >
> > Indeed, my final aim is to use this weight in a thresholding fashion to
> > sparsify the Wikipedia graph (that due to the short diameter is more or less
> > a giant connected component), in a way that should reflect the "relatedness"
> > of the linked pages (where relatedness is not intended as strictly semantic,
> > but at a higher "concept" level, if I may say so).
> > For this reason, other suggestions on how to determine such weights (possibly
> > using other data sources -- ontologies?) are more than welcome.
>
> When you get the graph of connections, instead of using the
> multiplicity as weight, you could try to use community detection
> methods to isolate subclusters of strongly connected articles.
> Another approach may be to use centrality measures, however the only
> one that can be applied to edges instead of just nodes is betweenness
> centrality, if I remember correctly.
>

Currently, I have resorted to keeping only reciprocal links, but I still get quite
big connected components (despite the fact that I'm actually carrying out a
temporal analysis, where I consider, for each time instant, only pages
exhibiting an unusually high traffic).
Concerning community detection techniques/centrality: I discarded them
because I don't want to "impose" connectedness (reachability) at the
subgraph level, but only between single entities (since my algorithm aims
to find some sort of temporally persistent subgraphs having some
properties).
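
For reference, a minimal sketch of the reciprocal-link filter, assuming the
links are available as (source, target) pairs of page identifiers, e.g. as
read from the pagelinks dump. The function names are hypothetical, and this
is an illustration rather than the exact code I am running.

```python
from collections import defaultdict


def reciprocal_edges(directed_edges):
    """Keep only pairs linked in both directions, returned as
    undirected edges (smaller endpoint first). Self-links are dropped."""
    edges = set(directed_edges)
    return {
        (min(a, b), max(a, b))
        for a, b in edges
        if a != b and (b, a) in edges
    }


def connected_components(undirected_edges):
    """Return the connected components of an undirected edge set
    as a list of node sets, using an iterative depth-first search."""
    adjacency = defaultdict(set)
    for a, b in undirected_edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set()
    components = []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        component = set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        components.append(component)
    return components
```

Comparing the component sizes before and after the filter gives a quick
check of how much sparsification the reciprocity constraint actually buys.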


> In case a fast technical solution comes to mind, I'll write here
> again.
>
> Best,
> Giuseppe
>
> [1] https://en.wikipedia.org/w/api.php?action=query&prop=links&titles=Wikipedia&plnamespace=0&pllimit=500&pltitles=Wikimedia_Foundation
> [2] https://upload.wikimedia.org/wikipedia/commons/9/94/MediaWiki_1.28.0_database_schema.svg
>
>
Thank you both for your feedback!

Best,

Mara
_______________________________________________
Wiki-research-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
