Re: [Wikitech-l] Topic and category analyser

2011-03-05 Thread Dávid Tóth
Although there might be links that do not have strong connections, if the
articles are written according to Wikipedia guidelines there should be only
a minimal number of such links (distractions and noise).
Every article consists of words and semantic structures. If we could
partition all the articles and statistically analyse the occurrence of these
structures, we would see a different distribution for every article. If we
proceeded further and analysed these distributions, we would notice that
there are different types of distribution and that they can be
categorised according to what they have in common. Let us pick some
articles that represent categories,
http://en.wikipedia.org/wiki/Mathematics being one of them. The
system would then assign to each article a
probability of, or closeness to, each category. So for example
http://en.wikipedia.org/wiki/Isaac_Newton could score 15% Physics, 13%
Mathematics, 11% Famous People...
Numerous methods could be applied for the analysis - Bayesian
probabilities, PageRank and so on - or combinations of them.
Percentages are merely illustrative; relevance can be expressed in more
than one dimension, as some combination of vectors or of functions dependent
on factors such as time, location, or depth of information span.
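
To make the closeness idea concrete, here is a minimal sketch in Python,
assuming plain-text article bodies are already available (the texts, seed
categories and printed scores are illustrative, not real results):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    seed_articles = {            # hand-picked category representatives
        "Physics": "...full text of the Physics article...",
        "Mathematics": "...full text of the Mathematics article...",
    }
    target = "...full text of the Isaac Newton article..."

    # Model each article as a term-weight distribution and score the
    # target against every seed category by cosine similarity.
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(list(seed_articles.values()) + [target])
    scores = cosine_similarity(matrix[-1], matrix[:-1])[0]

    for category, score in zip(seed_articles, scores):
        print(f"{category}: {score:.0%}")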

Possible applications:
See also suggestions
Search results
Experimental semantic navigation
Analysis of scientific papers


Re: [Wikitech-l] Topic and category analyser

2011-03-04 Thread Paul Houle
  On 3/3/2011 7:12 PM, Dávid Tóth wrote:
 Would it be useful to make a program that would create topic relations for
 each Wikipedia article based on the links and the distribution of semantic
 structures?
 This would be very useful for me.

I'm thinking about attacking this problem by discovering 'low-hanging
fruit'.

 To some extent you can assume that

:X :wikiLink :Y -> :X skos:related :Y

 but the nature and strength of links is hard to estimate.  I've 
developed a good metric for approximating the importance of topic :X,  
but I've yet to get a handle on relationship strength.  To take an 
example,  there's a link from :Metallica to :Yale_University because

:Metallica :Sued :Yale_University

That's not a very strong connection. Now, if Wikipedia mentioned
the 'Dead at Cornell' recording, which was made when Jerry Garcia had
just gotten hooked on opium and the band was playing at its best, we
might say

:Grateful_Dead :PlayedAt :Cornell_University

Maybe you think that's a stronger connection than the one above, maybe
you don't. Then again,

:Rod_Serling :TaughtAt :Ithaca_College

 is one of the stronger links involving :Ithaca_College in my opinion.
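
For what it's worth, that naive mapping is a one-liner in RDF terms; a
minimal Python sketch with rdflib, using the example resources above (the
DBpedia namespace is just an assumption):

    from rdflib import Graph, Namespace
    from rdflib.namespace import SKOS

    DBR = Namespace("http://dbpedia.org/resource/")  # assumed namespace
    wiki_links = [("Metallica", "Yale_University"),
                  ("Grateful_Dead", "Cornell_University"),
                  ("Rod_Serling", "Ithaca_College")]

    g = Graph()
    for source, target in wiki_links:
        # Every wiki link becomes skos:related; strength stays unknown.
        g.add((DBR[source], SKOS.related, DBR[target]))

    print(g.serialize(format="turtle"))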

There are two angles I see for extracting better relationships from
Wikipedia, and these are:

(i) databases such as Freebase and DBpedia; in particular, these have
certain relationships already semantized, and other information that can
be used to make inferences about possible relationships. For instance,

:Brown_Bear :Sued :Pelican

doesn't make any sense and should be rejected.
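
One hedged way to encode that sanity check is a domain/range filter over
entity types; the type sets below are placeholders for what a DBpedia or
Freebase lookup would actually return:

    # Candidate relations survive only if subject and object types fit.
    # The allowed type sets are illustrative placeholders.
    ALLOWED = {"Sued": ({"Organisation", "Person"}, {"Organisation", "Person"})}

    def plausible(relation, subj_types, obj_types):
        domain, rng = ALLOWED[relation]
        return bool(subj_types & domain) and bool(obj_types & rng)

    print(plausible("Sued", {"Organisation"}, {"Organisation"}))  # True
    print(plausible("Sued", {"Animal"}, {"Animal"}))              # False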

(ii) analysis of the text around a link.  You could certainly see 
certain language patterns that are frequently used,  for instance

A is a B,  C married D,  E was born at F

you could either find some of these by hand, or you could write something
that uses machine learning techniques to discover them. Information
from type (i) could be useful here: for instance, we could find a
bunch of relationships that exist in Freebase and use these as positive
training examples. The trouble I see here is the creation of a good set
of negative training examples, which has a few aspects. One is that
examples that should be positive will slip into a negative sample;
another is that attempts to automatically exclude positives will probably
also exclude the 'near miss' negatives that would be especially important
to include in the training set; and, generally, negatives would be 1000
or more times more prevalent than positives, which gives most ML methods
Bayesian priors that destroy recall.
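
To illustrate that last point, a small synthetic sketch (the data and
numbers are made up; class_weight="balanced" is one standard counterweight,
not a claim about what would work here):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(10010, 5))
    y = np.array([1] * 10 + [0] * 10000)   # ~1000 negatives per positive
    X[y == 1] += 1.5                       # positives get a weak signal

    # Without reweighting, the learned prior swamps the positives and
    # recall collapses; "balanced" rescales each class by its frequency.
    clf = LogisticRegression(class_weight="balanced").fit(X, y)
    print("recall on positives:", clf.predict(X[y == 1]).mean())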

Another issue is that you'll see the patterns

E was born at F
[[E]] was born at F
E was born at [[F]]
[[E]] was born at [[F]]

all occur (sometimes the text describing the subject is made a link,
sometimes it isn't). Getting good recall then means solving the named
entity extraction problem as well; however, making this part of a
'whole system' might create the kind of feedback control loop that's
necessary for high-performing A.I.
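
A hedged sketch of one matcher that tolerates the optional link markup
around either entity, so all four variants above are caught:

    import re

    # Optional [[...]] brackets around either entity; 'at' or 'in'.
    PATTERN = re.compile(
        r"(?:\[\[)?([^\[\]]+?)(?:\]\])? was born (?:at|in) "
        r"(?:\[\[)?([^\[\]]+?)(?:\]\])?\.")

    for line in ["E was born at F.", "[[E]] was born at F.",
                 "E was born at [[F]].", "[[E]] was born at [[F]]."]:
        print(PATTERN.search(line).groups())   # ('E', 'F') every time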

The best attack on this, I think, is to pick one particular
relationship that you want to extract, particularly one that has a bit
of a 'closed world' aspect, in that you can presume the property
ought to exist for all members of a type. For instance, we can say that

any person was born at some location

but even there you can get into trouble quickly. If you look at
:Joan_of_arc, you see that Wikipedia says that she was

A peasant (http://en.wikipedia.org/wiki/Peasant) girl born in eastern
France

You note that 'A peasant girl' == :Joan_of_arc and that a more specific
birthplace can be found in the infobox.




Re: [Wikitech-l] Topic and category analyser

2011-03-04 Thread Alex Brollo
2011/3/4 Paul Houle p...@ontology2.com

Briefly, at the border of OT: I see the magic word 'ontology' in your mail
address. :-) :-)

I discovered ontology... well, it's a long story. Ontological classification
is used to collect data on cancer by the National Cancer Institute; and,
strange to tell, I discovered it as an unexpected result of posting a
picture on Commons, a low-grade prostatic PIN... then I found that NCI uses
SemanticWiki. In other terms: from wiki, to wiki again. :-)

My aim with ontologies is much, much simpler: it's just to create
something I call 'catwords', i.e. a system of categorization (the wiki
system is perfect for this) that can also be used as a list of keywords. I
can't wait for the installation of DynamicPageList on it.source, since the
engine I need is simply a good method to get the intersection of
categories; but I found that this alone is not sufficient, and some
peculiar conventions in categorization are needed too, far from complex.
Well, I'll tell you the news as soon as I get my tool. :-)
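
As a rough illustration of the category-intersection engine (the endpoint
and category names below are just assumptions, and continuation of long
result lists is ignored):

    import requests

    API = "https://it.wikisource.org/w/api.php"  # assumed endpoint

    def members(category):
        # First page of members only; real use would follow 'cmcontinue'.
        params = {"action": "query", "list": "categorymembers",
                  "cmtitle": "Category:" + category, "cmlimit": "500",
                  "format": "json"}
        data = requests.get(API, params=params).json()
        return {m["title"] for m in data["query"]["categorymembers"]}

    print(members("Poesie") & members("Opere del XIX secolo"))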

Alex


Re: [Wikitech-l] Topic and category analyser

2011-03-04 Thread Platonides
Paul Houle wrote:
 A peasant (http://en.wikipedia.org/wiki/Peasant) girl born in eastern
 France
 
 You note that 'A peasant girl' == :Joan_of_arc and that a more specific
 birthplace can be found in the infobox.

You will find that the infoboxes are the best article pieces to mine.
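
For instance, a minimal sketch of pulling a birthplace out of an infobox
with mwparserfromhell (the wikitext and field name are illustrative):

    import mwparserfromhell

    wikitext = """{{Infobox person
    | name        = Joan of Arc
    | birth_place = Domrémy, Duchy of Bar, Kingdom of France
    }}"""

    # Parse the raw wikitext and read one field from any infobox template.
    code = mwparserfromhell.parse(wikitext)
    for template in code.filter_templates():
        name = str(template.name).strip().lower()
        if name.startswith("infobox") and template.has("birth_place"):
            print(str(template.get("birth_place").value).strip())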





[Wikitech-l] Topic and category analyser

2011-03-03 Thread Dávid Tóth
Would it be useful to make a program that would create topic relations for
each Wikipedia article based on the links and the distribution of semantic
structures?


Re: [Wikitech-l] Topic and category analyser

2011-03-03 Thread Diederik van Liere
Please elaborate.
Diederik

Sent from my iPhone

On 2011-03-03, at 16:12, Dávid Tóth 90010...@gmail.com wrote:

 Would it be useful to make a program that would create topic relations for
 each Wikipedia article based on the links and the distribution of semantic
 structures?
