Re: [Wikidata-l] Subclass of/instance of

Thomas Douillard Thu, 15 May 2014 02:48:39 -0700

Hi Markus.

Concerning redundancy, I wonder: is redundancy, at least to some degree,
something we absolutely want to remove from Wikidata? I don't think so.
Wikidata is an open project where a lot of changes happen on a large
number of "pages" (in the wiki sense). In my opinion, this means that the
more control mechanisms there are, the better.


I think redundancy is a powerful mechanism for achieving robustness; it
occurs to some extent in a lot of complex systems. For example, think of
claim deletion. Assume a reasoner relies on that claim to make a lot of
inferences. In a sense, the claim is a kind of compression of information.
Then there is a risk, if the deletion goes unnoticed, that we lose a lot of
data due to that claim deletion.

Now suppose there is a redundant claim that is part of the inference chain
that comes from our deleted claim. Could an inference-based mechanism
highlight the fact that the graph might be incomplete, or that the deletion
introduced an inconsistency, or whatever, where an inference system working
from a minimal set of claims (to compress the stored data) would simply stop
making the inference? I guess it could also compare the graph (should we say
completed, or partially completed?) before and after the change, and hint
that there actually is a mass loss of data. Or it could compute an
"inference score" based on the number of inferences a claim is a part of, to
hint to the patrollers which deletions to verify (just random thoughts).
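
To make the "inference score" idea a bit more concrete, here is a minimal
sketch in Python. It assumes the only inference rule is transitivity of
"subclass of", and it uses a toy claim set with plain labels; the function
names (closure, inference_score) and the items other than piano and keyboard
instrument are hypothetical, not anything that exists in Wikidata or its
tools.

```python
from collections import defaultdict

def closure(claims):
    """All (sub, super) pairs derivable by transitivity from direct claims."""
    parents = defaultdict(set)
    for sub, sup in claims:
        parents[sub].add(sup)
    derived = set()
    for start in list(parents):
        stack = list(parents[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            derived.add((start, node))
            # .get avoids creating empty entries in the defaultdict
            stack.extend(parents.get(node, ()))
    return derived

def inference_score(claims, claim):
    """Number of derivable pairs lost if `claim` is deleted (toy measure)."""
    return len(closure(claims) - closure(claims - {claim}))

# Hypothetical toy data: a minimal chain of subclass claims.
minimal = {
    ("grand piano", "piano"),
    ("piano", "keyboard instrument"),
    ("keyboard instrument", "musical instrument"),
}
# Same chain plus one redundant claim that a reasoner could have inferred.
redundant = minimal | {("piano", "musical instrument")}

deleted = ("piano", "keyboard instrument")
print(inference_score(minimal, deleted))    # pairs lost without redundancy
print(inference_score(redundant, deleted))  # pairs lost with redundancy
```

On this toy data, the deletion costs fewer derived pairs when the redundant
claim is present, which is the robustness effect I have in mind. A real
version would need more inference rules than transitivity alone.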

Anyway, any thoughts on redundancy in Wikidata?

2014-05-14 15:33 GMT+02:00 Markus Krötzsch <[email protected]>:

> Hi Eric,
>
> Thanks for all the information. This was very helpful. I only get to
> answer now since we have been quite busy building RDF exports for Wikidata
> (and writing a paper about it). I will soon announce this here (we still
> need to fix a few details).
>
> You were asking about using these properties like rdfs:subClassOf and
> rdf:type. I think that's entirely possible, since the modelling is very
> reasonable and would probably yield good results. Our reasoner ELK could
> easily handle the class hierarchy in terms of size, but you don't really
> need such a highly optimized tool for this as long as you only have
> subClassOf. In fact, the page you linked to shows that it is perfectly
> possible to compute the class hierarchy with Wikidata Query and to display
> all of it on one page. ELK's main task is to compute class hierarchies for
> more complicated ontologies, which we do not have yet. OTOH, query
> answering and data access are different tasks that ELK is not really
> intended for (although it could do some of this as well).
>
> Regarding future perspectives: one thing that we have also done is to
> extract OWL axioms from property constraint templates on Wikidata talk
> pages (we will publish the result soon, when announcing the rest). This
> gives you only some specific types of OWL axioms, but it is making things a
> bit more interesting already. In particular, there are some constraints
> that tell you that an item should have a certain class, so this is
> something you could reason with. However, the current property constraint
> system does not work too well for stating axioms that are not related to a
> particular property (such as: "Every [instance of] person who appears as an
> actor in some film should be [instance of] in the class 'actor'" -- which
> property or item page should this be stated on?). But the constraints show
> that it makes sense to express such information somehow.
>
> In the end, however, the real use of OWL (and similar ontology languages)
> is to remove the need for making everything explicit. That is, instead of
> "constraints" (which say: "if your data looks like X, then your data should
> also include Y") you have "axioms" (which say: "if your data looks like X,
> then Y follows automatically"). So this allows you to remove redundancy
> rather than to detect omissions. This would make more sense with "derived"
> notions that one does not want to store in the database, but which make
> sense for queries (like "grandmother").
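
The difference between the two readings can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the data
layout (a dict of parents per person, a set of female persons), the function
names and the people are all hypothetical, and no OWL reasoner is involved.

```python
def derive_grandmothers(parents, female):
    """Axiom reading: grandmother pairs follow from the data; nothing is stored."""
    return {
        (child, grandparent)
        for child, direct_parents in parents.items()
        for parent in direct_parents
        for grandparent in parents.get(parent, ())
        if grandparent in female
    }

def missing_grandmother_claims(parents, female, stored):
    """Constraint reading: expected pairs that the stored data does not include."""
    return derive_grandmothers(parents, female) - stored

# Hypothetical toy data.
parents = {"Carl": {"Beth"}, "Dora": {"Beth"}, "Beth": {"Ann"}}
female = {"Ann", "Beth", "Dora"}
stored = {("Carl", "Ann")}  # explicit "grandmother" claims, incomplete

print(derive_grandmothers(parents, female))
print(missing_grandmother_claims(parents, female, stored))
```

With the axiom reading, the stored set is not needed at all; with the
constraint reading, it has to be kept complete by hand and the check only
reports the gaps.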
>
> One would need a bit more infrastructure for this; in particular, one
> would need to define "grandmother" (with labels in many languages) even if
> one does not want to use it as a property but only in queries. Maybe one
> could have a separate Wikibase installation for defining such derived
> notions without needing to change Wikidata? There are no statements on
> properties yet, but one could also use item pages to define derived
> properties when using another site ...
>
> Best regards,
>
> Markus
>
> P.S. Thanks for all the work on the "semantic" modelling aspects of
> Wikidata. I have seen that you have done a lot in the discussions to
> clarify things there.
>
>
>
> On 06/05/14 04:53, emw wrote:
>
>> Hi Markus,
>>
>> You asked "who is creating all these [subclass of] statements and how is
>> this done?"
>>
>> The class hierarchy in
>> http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q35120&rp=279&lang=en
>> shows a few relatively large subclass trees for specialist domains,
>> including molecular biology and mineralogy.  The several thousand
>> 'gene' and 'protein' subclass of claims were created by members
>> of WikiProject Molecular biology (WD:MB), based on discussions in [1]
>> and [2].  The decision to use P279 instead of P31 there was based on the
>> fact that the "is-a" relation in Gene Ontology maps to rdfs:subClassOf,
>> which P279 is based on.  The claims were added by a bot [3], with input
>> from WD:MB members.  The data ultimately comes from external biological
>> databases.
>>
>> A glance at the mineralogy class hierarchy indicates it has been
>> constructed by WikiProject Mineralogy [4] members through non-bot
>> edits.  I imagine most of the other subclass of claims are done manually
>> or semi-automatically outside specific Wikiproject efforts.  In other
>> words, I think most of the other P279 claims are added by Wikidata users
>> going into the UI and building usually-reasonable concept hierarchies on
>> domains they're interested in.  I've worked on constructing class
>> hierarchies for health problems (e.g. diseases and injuries) [5] and
>> medical procedures [6] based on classifications like ICD-10 [7] and
>> assertions and templates on Wikipedia (e.g. [8]).
>>
>> It's not incredibly surprising to me that Wikidata has about 36,000
>> subclass of (P279) claims [9].  The property has been around for over a
>> year and is a regular topic of discussion [10] along with instance of
>> (P31), which has over 6,600,000 claims.
>>
>> You noted a dubious subclass of claim for 'House of Staufen'
>> (Q130875).  I agree that instance of would probably be the better
>> membership property to use there.  Such questionable usage of P279 is
>> probably uncommon, but definitely not singular.  The dynasty class
>> hierarchy shows 13 dubious cases at the moment [11].  I would guess less
>> than 5% of subclass of claims have that kind of issue, where instance of
>> would make more sense.  I think there are probably vastly more cases of
>> the converse: instance of being used where subclass of would make more
>> sense.
>>
>> As you probably know, P31 and P279 are intended to have the semantics of
>> rdf:type and rdfs:subClassOf per community decision.  A while ago I read
>> a bit about the ELK reasoner you were involved with [12], which makes
>> use of the seemingly class-centric OWL EL profile.  Do you have any
>> plans to integrate features of ELK with the Wikidata Toolkit [13]?  How
>> do you see reasoning engines using P31 and P279 in the future, if at all?
>>
>> Thanks,
>> Eric
>>
>> https://www.wikidata.org/wiki/User:Emw
>>
>> [1] https://www.wikidata.org/wiki/WT:MB#Distinguishing_between_genes_and_proteins
>> [2] https://www.wikidata.org/wiki/WT:MB#Human.2Fmouse.2F..._ID
>> [3] https://www.wikidata.org/wiki/User:ProteinBoxBot.  Chinmay Nalk
>> (https://www.wikidata.org/wiki/User:Chinmay26) did all the work on this,
>> with input from WD:MB.
>> [4] https://www.wikidata.org/wiki/Wikidata:WikiProject_Mineralogy
>> [5] http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q15281399&rp=279&lang=en
>> [6] http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q796194&rp=279&lang=en
>> [7] http://apps.who.int/classifications/icd10/browse/2010/en
>> [8] https://en.wikipedia.org/wiki/Template:Surgeries
>> [9] https://www.wikidata.org/w/index.php?title=Wikidata:Database_reports/Popular_properties&oldid=125595374
>> [10] Examples include
>> - https://www.wikidata.org/wiki/Wikidata:Project_chat#chemical_element
>> - https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2013/12#Top_of_the_subclass_tree
>> - https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2014/01#Question_about_classes.2C_and_.27instance_of.27_vs_.27subclass.27
>> [11] http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q164950&rp=279&lang=en
>> [12] http://korrekt.org/page/The_Incredible_ELK
>> [13] https://www.mediawiki.org/wiki/Wikidata_Toolkit
>>
>>
>> On Mon, May 5, 2014 at 12:46 PM, Markus Kroetzsch <[email protected]> wrote:
>>
>>     Hi,
>>
>>     I got interested in subclass of (P279) and instance of (P31)
>>     statements recently. I was surprised by two things:
>>
>>     (1) There are quite a lot of subclass of statements: tens of
>>     thousands.
>>     (2) Many of them make a lot of sense, and (in particular) are not
>>     (obvious) copies of Wikipedia categories.
>>
>>     My big question is: who is creating all these statements and how is
>>     this done? It seems too much data to be created manually, but I
>>     don't see obvious automated approaches either (and there are usually
>>     no references given).
>>
>>     I also found some rare issues. "A subclass of B" should be read as
>>     "Every A is also a B". For example, we have "Every piano (Q5994) is
>>     also a keyboard instrument (Q52954)". Overall, the great majority of
>>     cases I looked at had remarkably sane modelling (which reinforces my
>>     big question).
>>
>>     But there are still cases where "subclass of" is mixed up with
>>     "instance of". For example, Wikidata also says "Every 'House of
>>     Staufen' (Q130875) is also a dynasty (Q164950)". This is dubious --
>>     how many instances of 'House of Staufen' are there? I guess we
>>     really want to say that "The House of Staufen is a(n instance of)
>>     dynasty." Is this a singular error or a systematic issue?
>>
>>     I guess there is already a group of people who deal with such issues
>>     -- or it would be a miracle that things are in such a good shape
>>     already :-) I have read the talk page for subclass of, but that does
>>     not seem to explain the origin of all the data we have already.
>>     Pointers?
>>
>>     Cheers,
>>
>>     Markus
>>
>>
>>     _______________________________________________
>>     Wikidata-l mailing list
>>     [email protected]
>>     https://lists.wikimedia.org/mailman/listinfo/wikidata-l
>>
