Re: [Wikidata-l] Subclass of/instance of

2014-05-27 Thread David Cuenca
On Thu, May 15, 2014 at 5:03 PM, Markus Krötzsch 
mar...@semantic-mediawiki.org wrote:

 I applaud your comparison of inferencing with a form of decompression. I
 think this is a nice intuition (in fact, some people have researched
 semantic compression where one tries to reduce the size of a knowledge
 base by eliminating things that follow from the rest anyway).


Markus, sorry for the delay in answering this; I had to let the ideas grow
for a while.

I also like the idea of decompression; that is what makes your database of
inferred data even more useful. There is a lot of data that can be
inferred, not just by following relationships but by computing: for
instance, population density, which can be calculated from area and
population, or district populations aggregated from the populations of the
towns in each district.
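
As a toy illustration (a Python sketch with invented items and field names,
not Wikidata's real data model), such computed statements could be derived
like this:

    towns = [
        {"id": "Q1001", "district": "Q2000", "population": 12000, "area_km2": 30.0},
        {"id": "Q1002", "district": "Q2000", "population": 8000,  "area_km2": 20.0},
    ]

    # Derived statement 1: population density, computed from two stored values.
    for t in towns:
        t["density_per_km2"] = t["population"] / t["area_km2"]

    # Derived statement 2: an aggregate, the district population as the sum
    # of the populations of its towns.
    district_population = sum(
        t["population"] for t in towns if t["district"] == "Q2000"
    )

    print([(t["id"], t["density_per_km2"]) for t in towns])  # 400.0 and 400.0
    print(district_population)                               # 20000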

Another source of inferred statements is Wikipedia categories. Most of them
translate easily into statements, and the other way round too. If Wikidata
is not the right place, a place to store and process these inferences would
be most useful.

You also say: "Constraints are a great start. We should now ask how we
could improve the management of constraints in the future, and which
constraints we will have then."
The first step will be having them as statements, then having them as
queries, and finally automating their correction, either with
semi-automatic tools or with gamification. How to automatically transform a
constraint into a game that resolves its outliers might also be an
interesting topic. And of course, more far-fetched but nevertheless
relevant, is the question of how to connect a property to a perceptual
mechanism.
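
For instance (a minimal Python sketch over invented data; spouse (P26) is a
real Wikidata property with a symmetric constraint, everything else here is
made up), a constraint stated as an executable query directly yields the
outliers that a semi-automatic tool or a game could hand to users:

    items = {
        "Q1": {"P26": ["Q2"]},   # Q1 claims spouse Q2 ...
        "Q2": {},                # ... but Q2 has no spouse claim back
    }

    def symmetric_violations(items, prop="P26"):
        # Constraint as a query: a symmetric property should hold both ways.
        for item_id, claims in items.items():
            for target in claims.get(prop, []):
                if item_id not in items.get(target, {}).get(prop, []):
                    yield (item_id, target)

    # The result set is exactly the work list for a cleanup tool or game.
    print(list(symmetric_violations(items)))   # [('Q1', 'Q2')]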

About improving reliability: yes, as Wikidata grows bigger, some statements
become more important. There is something to be learned from how neural
nets work, especially the strengthening of most-used (or traveled, or
accepted, or viewed) connections. Another process, little understood now,
is the need to forget, or in Wikidata terms to auto-deprecate information
that is no longer current. Not very relevant yet, but something to keep in
mind for the coming years.
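
One possible toy model of that strengthening-and-forgetting idea (a Python
sketch; entirely hypothetical, nothing like this exists in Wikidata): a
per-statement score that grows with use and decays while idle, so stale,
never-confirmed statements could eventually be flagged for review:

    import math

    class StatementScore:
        # Toy model: confidence grows with use, decays while idle.
        def __init__(self, created, half_life_days=365.0):
            self.last_used = created
            self.uses = 0
            self.half_life = half_life_days * 86400.0  # seconds

        def touch(self, now):          # statement was used/viewed/accepted
            self.uses += 1
            self.last_used = now

        def score(self, now):
            idle = now - self.last_used
            decay = math.exp(-math.log(2.0) * idle / self.half_life)
            return (1 + self.uses) * decay

    s = StatementScore(created=0.0)
    s.touch(now=10 * 86400.0)                    # used once, on day 10
    print(round(s.score(now=400 * 86400.0), 2))  # ~0.95 after ~390 idle days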

Cheers,
Micru


Re: [Wikidata-l] Subclass of/instance of

2014-05-15 Thread Markus Krötzsch

On 14/05/14 19:33, Joe Filceolaire wrote:

Except that there are lots of people who have appeared in one movie who
don't consider themselves actors and should not have the occupation
'actor/actress'. There are good reasons for some constraints
to be gadgets that can be overridden rather than hard-coded semantic limits.


Sure, we completely agree here. It was just an example. But it shows why 
we need any such feature to be controlled by the community ;-)




I do think we should be able to have hard-coded reverse properties and
symmetric properties.


By "hard coded", do you mean "stored explicitly" (as opposed to: 
inferred in some way)? It will always be possible to store anything 
explicitly in this sense (but I guess you know this; maybe I 
misunderstood what you said; feel free to clarify).


In general, what I mentioned about inferencing is not supposed to alter 
the way in which the site works. It would be more like a layer on top 
that could be useful for asking queries. For example, imagine you want 
to query for the grandmother of a person: we don't have this property in 
Wikidata, but we have enough information to answer the query. So you 
would have to research how to get this information by combining existing 
properties. The idea is that one could have a place to keep this 
information (= the definition of "grandmother" in terms of Wikidata 
properties). We would then have a community-approved way of finding 
grandmothers in Wikidata, and you would be much faster with your query. 
At the same time, you could look up the definition to find out how 
Wikidata really stores this information. None of this would change 
how the underlying data works, but it could help with some data 
modelling problems, because it gives you an option to support a 
property without the added maintenance cost on the data management level.
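
A minimal sketch of such a definition layer (Python, over toy data; in
Wikidata the underlying properties would be mother (P25) and father (P22)):

    parents = {  # child -> (mother, father); toy items, not real Q-ids
        "Q_alice": ("Q_carol", "Q_dave"),
        "Q_carol": ("Q_erin", "Q_frank"),
        "Q_dave":  ("Q_grace", "Q_henry"),
    }

    def grandmothers(person):
        # grandmother(x) = mother(mother(x)) or mother(father(x))
        result = []
        for parent in parents.get(person, ()):
            mother = parents.get(parent, (None, None))[0]
            if mother is not None:
                result.append(mother)
        return result

    print(grandmothers("Q_alice"))  # ['Q_erin', 'Q_grace']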


Cheers,

Markus





Re: [Wikidata-l] Subclass of/instance of

2014-05-14 Thread Markus Krötzsch

Hi Eric,

Thanks for all the information. This was very helpful. I am only getting 
to answer now, since we have been quite busy building RDF exports for 
Wikidata (and writing a paper about it). I will announce this here soon 
(we still need to fix a few details).


You were asking about using these properties like rdfs:subClassOf and 
rdf:type. I think that's entirely possible, since the modelling is very 
reasonable and would probably yield good results. Our reasoner ELK could 
easily handle the class hierarchy in terms of size, but you don't really 
need such a highly optimized tool for this as long as you only have 
subClassOf. In fact, the page you linked to shows that it is perfectly 
possible to compute the class hierarchy with Wikidata Query and to 
display all of it on one page. ELK's main task is to compute class 
hierarchies for more complicated ontologies, which we do not have yet. 
OTOH, query answering and data access are different tasks that ELK is 
not really intended for (although it could do some of this as well).
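
To illustrate the simple case (a Python toy over invented P279 edges, not
how ELK actually works internally): with only subClassOf, computing the
class hierarchy reduces to a transitive closure over the stored edges:

    subclass_of = {  # direct P279 claims (toy data): class -> superclasses
        "piano": {"keyboard instrument"},
        "keyboard instrument": {"musical instrument"},
        "musical instrument": {"object"},
    }

    def all_superclasses(cls):
        # Transitive closure: everything cls is (indirectly) a subclass of.
        seen, todo = set(), list(subclass_of.get(cls, ()))
        while todo:
            c = todo.pop()
            if c not in seen:
                seen.add(c)
                todo.extend(subclass_of.get(c, ()))
        return seen

    print(sorted(all_superclasses("piano")))
    # ['keyboard instrument', 'musical instrument', 'object']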


Regarding future perspectives: one thing that we have also done is to 
extract OWL axioms from property constraint templates on Wikidata talk 
pages (we will publish the result soon, when announcing the rest). This 
gives you only some specific types of OWL axioms, but it is making 
things a bit more interesting already. In particular, there are some 
constraints that tell you that an item should have a certain class, so 
this is something you could reason with. However, the current property 
constraint system does not work too well for stating axioms that are not 
related to a particular property (such as: "Every [instance of] person 
who appears as an actor in some film should be [instance of] in the 
class 'actor'" -- which property or item page should this be stated 
on?). But the constraints show that it makes sense to express such 
information somehow.


In the end, however, the real use of OWL (and similar ontology 
languages) is to remove the need for making everything explicit. That 
is, instead of constraints (which say: "if your data looks like X, 
then your data should also include Y") you have axioms (which say: "if 
your data looks like X, then Y follows automatically"). So this allows 
you to remove redundancy rather than to detect omissions. This would 
make more sense with derived notions that one does not want to store 
in the database, but which make sense for queries (like "grandmother").
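
A small Python sketch of this contrast, using the actor example from
earlier in this thread (toy data; as discussed above, that particular rule
is contentious -- it is only used here to show the mechanics):

    data = {  # toy data only
        "Q_ann": {"instance of": {"person"}, "cast member": {"Q_film1"}},
    }

    def rule_applies(claims):
        return ("person" in claims.get("instance of", set())
                and bool(claims.get("cast member")))

    def constraint_violations(data):
        # Constraint reading: report items where Y ('actor') is missing.
        return [q for q, c in data.items()
                if rule_applies(c) and "actor" not in c["instance of"]]

    def inferred_view(data):
        # Axiom reading: Y follows automatically; stored data stays smaller.
        view = {q: {p: set(v) for p, v in c.items()} for q, c in data.items()}
        for c in view.values():
            if rule_applies(c):
                c["instance of"].add("actor")
        return view

    print(constraint_violations(data))                          # ['Q_ann']
    print(sorted(inferred_view(data)["Q_ann"]["instance of"]))  # ['actor', 'person']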


One would need a bit more infrastructure for this; in particular, one 
would need to define "grandmother" (with labels in many languages) even 
if one does not want to use it as a property but only in queries. Maybe 
one could have a separate Wikibase installation for defining such 
derived notions without needing to change Wikidata? There are no 
statements on properties yet, but one could also use item pages to 
define derived properties when using another site ...


Best regards,

Markus

P.S. Thanks for all the work on the semantic modelling aspects of 
Wikidata. I have seen that you have done a lot in the discussions to 
clarify things there.




Re: [Wikidata-l] Subclass of/instance of

2014-05-14 Thread Joe Filceolaire
Except that there are lots of people who have appeared in one movie who
don't consider themselves actors and should not have the occupation
'actor/actress'. There are good reasons for some constraints to
be gadgets that can be overridden rather than hard-coded semantic limits.

I do think we should be able to have hard-coded reverse properties and
symmetric properties.

Joe filceolaire



Re: [Wikidata-l] Subclass of/instance of

2014-05-05 Thread emw
Hi Markus,

You asked "who is creating all these [subclass of] statements and how is
this done?"

The class hierarchy in
http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q35120&rp=279&lang=en
shows a few relatively large subclass trees for specialist domains,
including molecular biology and mineralogy.  The several thousand 'gene'
and 'protein' subclass claims were created by members of
WikiProject Molecular biology (WD:MB), based on discussions in [1] and
[2].  The decision to use P279 instead of P31 there was based on the fact
that the is-a relation in Gene Ontology maps to rdfs:subClassOf, which
P279 is based on.  The claims were added by a bot [3], with input from
WD:MB members.  The data ultimately comes from external biological
databases.

A glance at the mineralogy class hierarchy indicates it has been
constructed by WikiProject Mineralogy [4] members through non-bot edits.  I
imagine most of the other subclass of claims are done manually or
semi-automatically, outside specific WikiProject efforts.  In other words, I
think most of the other P279 claims are added by Wikidata users going into
the UI and building usually-reasonable concept hierarchies on domains
they're interested in.  I've worked on constructing class hierarchies for
health problems (e.g. diseases and injuries) [5] and medical procedures [6]
based on classifications like ICD-10 [7] and assertions and templates on
Wikipedia (e.g. [8]).

It's not incredibly surprising to me that Wikidata has about 36,000
subclass of (P279) claims [9].  The property has been around for over a
year and is a regular topic of discussion [10] along with instance of
(P31), which has over 6,600,000 claims.

You noted a dubious subclass of claim for 'House of Staufen'
(Q130875).  I agree that instance of would probably be the better
membership property to use there.  Such questionable usage of P279 is
probably uncommon, but definitely not singular.  The dynasty class
hierarchy shows 13 dubious cases at the moment [11].  I would guess fewer
than 5% of subclass of claims have that kind of issue, where instance of
would make more sense.  I think there are probably vastly more cases of the
converse: instance of being used where subclass of would make more sense.

As you probably know, P31 and P279 are intended to have the semantics of
rdf:type and rdfs:subClassOf per community decision.  A while ago I read a
bit about the ELK reasoner you were involved with [12], which makes use of
the seemingly class-centric OWL EL profile.  Do you have any plans to
integrate features of ELK with the Wikidata Toolkit [13]?  How do you see
reasoning engines using P31 and P279 in the future, if at all?
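
For concreteness, the core inference such an engine would draw from P31 and
P279 together -- "x P31 C" and "C P279* D" imply that x has type D -- can
be sketched in a few lines of Python (toy data only):

    p31 = {"Q_my_piano": {"piano"}}                # instance of (toy)
    p279 = {"piano": {"keyboard instrument"},      # subclass of (toy)
            "keyboard instrument": {"musical instrument"}}

    def inferred_types(item):
        # x P31 C and C P279* D  =>  x has type D
        types, todo = set(), list(p31.get(item, ()))
        while todo:
            c = todo.pop()
            if c not in types:
                types.add(c)
                todo.extend(p279.get(c, ()))
        return types

    print(sorted(inferred_types("Q_my_piano")))
    # ['keyboard instrument', 'musical instrument', 'piano']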

Thanks,
Eric

https://www.wikidata.org/wiki/User:Emw

[1]
https://www.wikidata.org/wiki/WT:MB#Distinguishing_between_genes_and_proteins
[2] https://www.wikidata.org/wiki/WT:MB#Human.2Fmouse.2F..._ID
[3] https://www.wikidata.org/wiki/User:ProteinBoxBot.  Chinmay Nalk (
https://www.wikidata.org/wiki/User:Chinmay26) did all the work on this,
with input from WD:MB.
[4] https://www.wikidata.org/wiki/Wikidata:WikiProject_Mineralogy
[5]
http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q15281399&rp=279&lang=en
[6]
http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q796194&rp=279&lang=en
[7] http://apps.who.int/classifications/icd10/browse/2010/en
[8] https://en.wikipedia.org/wiki/Template:Surgeries
[9]
https://www.wikidata.org/w/index.php?title=Wikidata:Database_reports/Popular_properties&oldid=125595374
[10] Examples include
- https://www.wikidata.org/wiki/Wikidata:Project_chat#chemical_element
-
https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2013/12#Top_of_the_subclass_tree
-
https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2014/01#Question_about_classes.2C_and_.27instance_of.27_vs_.27subclass.27
[11]
http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q164950&rp=279&lang=en
[12] http://korrekt.org/page/The_Incredible_ELK
[13] https://www.mediawiki.org/wiki/Wikidata_Toolkit


On Mon, May 5, 2014 at 12:46 PM, Markus Kroetzsch 
markus.kroetz...@tu-dresden.de wrote:

 Hi,

 I got interested in subclass of (P279) and instance of (P31) statements
 recently. I was surprised by two things:

 (1) There are quite a lot of subclass of statements: tens of thousands.
 (2) Many of them make a lot of sense, and (in particular) are not
 (obvious) copies of Wikipedia categories.

 My big question is: who is creating all these statements and how is this
 done? It seems too much data to be created manually, but I don't see
 obvious automated approaches either (and there are usually no references
 given).

 I also found some rare issues. "A subclass of B" should be read as "Every
 A is also a B". For example, we have "Every piano (Q5994) is also a
 keyboard instrument (Q52954)". Overall, the great majority of cases I
 looked at had remarkably sane modelling (which reinforces my big question).

 But there are still cases where subclass of is mixed up with instance
 of. For example,