Indeed, the purpose does matter. Is the end goal the content similarity of the
articles themselves (say, to detect articles that might be merged), or is the
end goal the relatedness of the topics those articles represent? If the latter,
then the Wikipedia category system relates articles that share some commonality
of topic, and the distance between two articles through the category hierarchy
is an indicator of how closely related they are. Similarly, navboxes relate
articles that have something in common, as do list articles. All three are
manually curated, and may be a much cheaper way to determine relatedness of
topics than messing about with bags of words, etc. But it all really depends on
the end goal.
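
As a rough illustration of the category-distance idea, here is a minimal
Python sketch. The category graph (PARENTS) is made-up toy data, and the
function names are hypothetical; real data would have to be pulled from the
category links of the wiki in question. It counts the category links on the
shortest path between two pages, treating the hierarchy as an undirected graph.

```python
from collections import deque

# Toy, made-up category graph: each page or category maps to its parent
# categories. This is an assumption for illustration, not real Wikipedia data.
PARENTS = {
    "Article:Brisbane": ["Cat:Cities in Queensland"],
    "Article:Cairns": ["Cat:Cities in Queensland"],
    "Article:Sydney": ["Cat:Cities in New South Wales"],
    "Cat:Cities in Queensland": ["Cat:Cities in Australia"],
    "Cat:Cities in New South Wales": ["Cat:Cities in Australia"],
    "Cat:Cities in Australia": [],
}


def undirected_neighbours(parents):
    """Turn child -> parents links into an undirected adjacency map."""
    adj = {node: set() for node in parents}
    for child, cats in parents.items():
        for cat in cats:
            adj.setdefault(cat, set()).add(child)
            adj[child].add(cat)
    return adj


def category_distance(parents, a, b):
    """Number of category links on the shortest path from a to b, or None."""
    adj = undirected_neighbours(parents)
    if a not in adj or b not in adj:
        return None
    seen = {a}
    queue = deque([(a, 0)])
    while queue:  # plain breadth-first search
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


if __name__ == "__main__":
    print(category_distance(PARENTS, "Article:Brisbane", "Article:Cairns"))
    print(category_distance(PARENTS, "Article:Brisbane", "Article:Sydney"))
```

A smaller distance suggests more closely related topics. On the real category
graph, very broad categories would probably need to be pruned first, since
they connect almost everything.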

Kerry

-----Original Message-----
From: Wiki-research-l [mailto:[email protected]] On 
Behalf Of Isaac Johnson
Sent: Wednesday, 8 May 2019 1:35 AM
To: Research into Wikimedia content and communities 
<[email protected]>
Subject: Re: [Wiki-research-l] Content similarity between two Wikipedia articles

Hey Haifeng,
On top of all the excellent answers provided, I'd also add that the answer to 
your question depends on what you want to use the similarity scores for.
For some insight into what it might mean to choose one approach over 
another, see this recent publication:
https://dl.acm.org/citation.cfm?id=3213769

At a high level, I'd say that there are three ways you might approach article 
similarity on Wikipedia:
* Reader similarity: two articles are similar if the same people who read one 
also frequently read the other. Navigation embeddings that implement this 
definition based on page views were last generated in 2017, so newer articles 
will not be represented, but here is the dataset [
https://figshare.com/articles/Wikipedia_Vectors/3146878 ] and meta page [ 
https://meta.wikimedia.org/wiki/Research:Wikipedia_Navigation_Vectors ].
The clickstream dataset [
https://dumps.wikimedia.org/other/clickstream/readme.html ], which is more 
recent, might be used in a similar way.
* Content similarity: two articles are similar if they contain similar content 
-- i.e. in most cases, similar text. This covers most of the suggestions 
provided to you in this email chain. Some are simpler but are language-specific 
unless you make substantial modifications (e.g., ESA, the LDA model described 
here:
https://cs.stanford.edu/people/jure/pubs/wikipedia-www17.pdf) while others are 
more complicated but work across multiple languages (e.g., recent WSDM
paper: https://twitter.com/cervisiarius/status/1115510356976242688).
* Link similarity: two articles are similar if they link to similar articles. 
Generally, this approach involves creating a graph of Wikipedia's link 
structure and then using an approach such as node2vec to reduce the graph to 
article embeddings. I know less about the current approaches in this space, but 
some searching should turn up a variety of approaches -- e.g., Milne and 
Witten's 2008 approach [ 
http://www.aaai.org/Papers/Workshops/2008/WS-08-15/WS08-15-005.pdf ], which is 
implemented in WikiBrain as Morten mentioned.
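
To make the content and link definitions above a bit more concrete, here is a
minimal Python sketch. It is an illustration under stated assumptions, not the
method of any of the papers or tools linked above. The function names are
hypothetical, tokenisation is a plain whitespace split, and the idf weighting
is one common smoothed variant. The link measure follows the general shape of
the Milne and Witten relatedness measure (one minus a normalised distance over
the sets of articles that link to each page); check the paper for the exact
definition before relying on it.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each title to a {term: tf-idf weight} dict (whitespace tokens)."""
    tokens = {title: text.lower().split() for title, text in docs.items()}
    n_docs = len(tokens)
    doc_freq = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))
    vectors = {}
    for title, words in tokens.items():
        counts = Counter(words)
        vectors[title] = {
            # Smoothed idf, so a term found in every document keeps some weight.
            term: (n / len(words))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, n in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def link_relatedness(links_a, links_b, total_articles):
    """Relatedness in [0, 1] from the sets of articles linking to a and b."""
    a, b = set(links_a), set(links_b)
    common = a & b
    if not common:
        return 0.0
    big, small = max(len(a), len(b)), min(len(a), len(b))
    if total_articles <= small:
        raise ValueError("total_articles must exceed the size of each link set")
    distance = (math.log(big) - math.log(len(common))) / (
        math.log(total_articles) - math.log(small)
    )
    return max(0.0, 1.0 - distance)


if __name__ == "__main__":
    # Made-up snippets standing in for article text.
    docs = {
        "A": "the river flows through the city",
        "B": "the city lies on the river",
        "C": "bags of words ignore word order",
    }
    vecs = tfidf_vectors(docs)
    print(cosine(vecs["A"], vecs["B"]), cosine(vecs["A"], vecs["C"]))
    print(link_relatedness({1, 2, 3}, {2, 3, 4}, total_articles=1000))
```

Reader similarity would need page-view or clickstream data and is not sketched
here. For real articles, a proper tokeniser and stop-word handling would
matter far more than the exact weighting used above.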

There are also other, more structured approaches like ORES drafttopic, which 
predicts which topics (based on WikiProjects) are most likely to apply to a 
given English Wikipedia article:
https://www.mediawiki.org/wiki/Talk:ORES/Draft_topic

On Tue, May 7, 2019 at 9:54 AM <[email protected]> wrote:

> Dear Haifeng,
>
>
> Would you not be able to use ordinary information retrieval techniques 
> such as bag-of-words/phrases and tfidf? Explicit semantic analysis 
> (ESA) uses this approach (though its primary focus is word semantic 
> similarity).
>
> There are a few papers for ESA:
> https://tools.wmflabs.org/scholia/topic/Q5421270
>
> I have also used it in "Open semantic analysis: The case of word level 
> semantics in Danish"
>
> http://www2.compute.dtu.dk/pubdb/views/edoc_download.php/7029/pdf/imm7029.pdf
>
>
> Finn Årup Nielsen
> http://people.compute.dtu.dk/faan/
>
>
>
> On 04/05/2019 13:47, Haifeng Zhang wrote:
> > Dear folks,
> >
> > Is there a way to compute content similarity between two Wikipedia
> > articles?
> >
> > For example, I can think of representing each article as a vector of
> > likelihoods over possible topics.
> >
> > But I wonder whether there is other work that people have already
> > explored in the past.
> >
> >
> > Thanks,
> >
> > Haifeng
> > _______________________________________________
> > Wiki-research-l mailing list
> > [email protected]
> > https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
> >
>


--
Isaac Johnson -- Research Scientist -- Wikimedia Foundation 

