Hi Thomas,

On 15/05/14 11:47, Thomas Douillard wrote:
Hi Markus.

Concerning redundancy, I have been wondering: is redundancy, at least to
some degree, something we absolutely want to remove from Wikidata? I
don't think so. Wikidata is an open project where a lot of changes
happen on a large number of "pages" (in the wiki sense). In my opinion
this means that the more control mechanisms there are, the better.

I think redundancy is a powerful mechanism for achieving robustness; it
occurs to some extent in a lot of complex systems. For example, think of
claim deletion. Assume a reasoner relies on that claim to make a lot
of inferences. In a sense, that is a kind of compression of information.
Then there is a risk, if the deletion goes unnoticed, that we lose a lot
of data because of that one deletion.

Now suppose there is a redundant claim that is part of the inference
chain starting from our deleted claim. Could a mechanism based on
inferences highlight the fact that the graph might be incomplete, or
that the deletion introduced an inconsistency, or whatever, where an
inference system with a minimal set of claims (chosen to compress the
stored data) would simply stop making the inference? I guess it could
also compare the (should we say completed, or partially completed?)
graph before and after the change, and hint that there actually is a
massive loss of data. Or it could compute an "inference score" based on
the number of inferences a claim is part of, to tell patrollers which
deletions to verify (just random thoughts).

Anyway, any thoughts on redundancy in Wikidata?

I agree with what you say. It is impossible to build a redundancy-free Wikidata (think of property "spouse" ;-), and there are several reasons for allowing for some kinds of redundancy. At the same time, we could never store *every* fact that implicitly follows from other statements. The community is having discussions about what should be in and what should be out, based on concrete use-cases (for example, we don't have "grandparent" but we do have "sister"). In the end, we have to leave this to the experts in each topic area.

I applaud your comparison of inferencing with a form of decompression. I think this is a nice intuition (in fact, some people have researched "semantic compression" where one tries to reduce the size of a knowledge base by eliminating things that follow from the rest anyway).

You are right that the ramifications of removing one statement might be bigger if inferences are used. On the other hand, there are many other reasons why a single change can have a big impact: soon we will have simple queries, and it is quite possible that thousands of template instances issue queries that all depend on the same single statement -- deleting it would change a lot of pages then. Redundancy cannot protect against this, since most queries (or inference rules) would refer to one form of the data and not check all possible redundant formulations. Moreover, many things in Wikidata are not stored redundantly at all, yet we want robustness in all cases. A mechanism to indicate importance like you describe might be a solution, or we could use a form of protection to avoid accidental changes to important statements. But this is yet another discussion, which is probably more important for Wikipedia than for the prototype I was having in mind.
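
To make the "importance" idea a bit more concrete, here is a minimal sketch, assuming that the only inference rule is the transitivity of subclass of. It compares the set of derivable statements before and after a deletion. The function names and the items "grand piano" and "musical instrument" are made up for illustration; nothing like this exists in Wikidata.

```python
from itertools import product


def closure(statements):
    """Transitive closure of (subclass, superclass) pairs."""
    derived = set(statements)
    changed = True
    while changed:
        changed = False
        # Iterate over a snapshot so that new pairs can be added safely.
        for (a, b), (c, d) in product(list(derived), repeat=2):
            if b == c and (a, d) not in derived:
                derived.add((a, d))
                changed = True
    return derived


def lost_by_deletion(statements, victim):
    """Statements (stored or inferred) that are no longer derivable
    once `victim` has been deleted."""
    return closure(statements) - closure(statements - {victim})


# Hypothetical example data; the labels are for illustration only.
minimal = {
    ("grand piano", "piano"),
    ("piano", "keyboard instrument"),
    ("keyboard instrument", "musical instrument"),
}
victim = ("keyboard instrument", "musical instrument")

# Without redundancy, everything that was inferred through the victim is lost.
score_minimal = len(lost_by_deletion(minimal, victim))

# With one redundant stored statement, most of it survives the deletion.
redundant = minimal | {("piano", "musical instrument")}
score_redundant = len(lost_by_deletion(redundant, victim))
```

In this toy case the score drops from 3 to 1 when the redundant statement is present. A real system would need a much cheaper way to compute this than recomputing the full closure for every edit.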

Another interesting use of inferencing in a system that has redundancy could be to infer additional support for a statement (Example: if a reference says that X is a child of Y, then the same reference also supports the claim that Y is the parent of X). In other words: even if we don't want to infer statements, we might want to infer references :-)
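
A minimal sketch of this, assuming statements are plain (subject, property, value) triples and that we are given a table of inverse properties. The property names "child of" and "parent of", the table `INVERSE`, and the function `infer_references` are all hypothetical. References are only attached to statements that are already stored, so no new statements are inferred.

```python
# Assumed table of mutually inverse properties (illustrative names only).
INVERSE = {"child of": "parent of", "parent of": "child of"}


def infer_references(statements):
    """Given a dict mapping (subject, property, value) to a set of
    references, return a new dict in which every stored statement is also
    supported by the references of its stored inverse statement."""
    supported = {key: set(refs) for key, refs in statements.items()}
    for (subject, prop, value), refs in statements.items():
        inverse = INVERSE.get(prop)
        if inverse is None:
            continue
        mirrored = (value, inverse, subject)
        # Only add support to statements that already exist.
        if mirrored in supported:
            supported[mirrored] |= refs
    return supported


stored = {
    ("X", "child of", "Y"): {"reference A"},
    ("Y", "parent of", "X"): set(),
}
result = infer_references(stored)
```

After the call, `result[("Y", "parent of", "X")]` contains "reference A", while `stored` itself is left unchanged.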

Constraints are a great start. We should now ask how we could improve the management of constraints in the future, and which constraints we will have then.



2014-05-14 15:33 GMT+02:00 Markus Krötzsch <mar...@semantic-mediawiki.org>:

    Hi Eric,

    Thanks for all the information. This was very helpful. I only get to
    answer now since we have been quite busy building RDF exports for
    Wikidata (and writing a paper about it). I will soon announce this
    here (we still need to fix a few details).

    You were asking about using these properties like rdfs:subClassOf
    and rdf:type. I think that's entirely possible, since the modelling
    is very reasonable and would probably yield good results. Our
    reasoner ELK could easily handle the class hierarchy in terms of
    size, but you don't really need such a highly optimized tool for
    this as long as you only have subClassOf. In fact, the page you
    linked to shows that it is perfectly possible to compute the class
    hierarchy with Wikidata Query and to display all of it on one page.
    ELK's main task is to compute class hierarchies for more complicated
    ontologies, which we do not have yet. OTOH, query answering and data
    access are different tasks that ELK is not really intended for
    (although it could do some of this as well).

    Regarding future perspectives: one thing that we have also done is
    to extract OWL axioms from property constraint templates on Wikidata
    talk pages (we will publish the result soon, when announcing the
    rest). This gives you only some specific types of OWL axioms, but it
    is making things a bit more interesting already. In particular,
    there are some constraints that tell you that an item should have a
    certain class, so this is something you could reason with. However,
    the current property constraint system does not work too well for
    stating axioms that are not related to a particular property (such
    as: "Every [instance of] person who appears as an actor in some film
    should be [instance of] in the class 'actor'" -- which property or
    item page should this be stated on?). But the constraints show that
    it makes sense to express such information somehow.

    In the end, however, the real use of OWL (and similar ontology
    languages) is to remove the need for making everything explicit.
    That is, instead of "constraints" (which say: "if your data looks
    like X, then your data should also include Y") you have "axioms"
    (which say: "if your data looks like X, then Y follows
    automatically"). So this allows you to remove redundancy rather than
    to detect omissions. This would make more sense with "derived"
    notions that one does not want to store in the database, but which
    make sense for queries (like "grandmother").
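
The difference between the two readings can be sketched as follows, assuming relations are stored as sets of (subject, object) pairs. The relation names, the function names, and the example people are made up for illustration. The same rule is used once as an axiom (derive "grandmother" on demand) and once as a constraint (report what is missing from the stored data).

```python
def grandmothers(mother_of, parent_of):
    """Axiom reading: (grandmother, grandchild) pairs follow automatically
    from the stored 'mother of' and 'parent of' pairs."""
    return {
        (grandmother, grandchild)
        for (grandmother, parent) in mother_of
        for (other_parent, grandchild) in parent_of
        if parent == other_parent
    }


def missing_grandmothers(stored_grandmother_of, mother_of, parent_of):
    """Constraint reading: pairs that should have been stored explicitly
    but were not."""
    return grandmothers(mother_of, parent_of) - stored_grandmother_of


# Hypothetical example data.
mother_of = {("Ann", "Beth")}
parent_of = {("Ann", "Beth"), ("Beth", "Cara")}

derived = grandmothers(mother_of, parent_of)
violations = missing_grandmothers(set(), mother_of, parent_of)
```

With the axiom reading nothing has to be stored for "grandmother" at all; with the constraint reading the same pair shows up as an omission until somebody adds it.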

    One would need a bit more infrastructure for this; in particular,
    one would need to define "grandmother" (with labels in many
    languages) even if one does not want to use it as a property but
    only in queries. Maybe one could have a separate Wikibase
    installation for defining such derived notions without needing to
    change Wikidata? There are no statements on properties yet, but one
    could also use item pages to define derived properties when using
    another site ...

    Best regards,


    P.S. Thanks for all the work on the "semantic" modelling aspects of
    Wikidata. I have seen that you have done a lot in the discussions to
    clarify things there.

    On 06/05/14 04:53, emw wrote:

        Hi Markus,

        You asked "who is creating all these [subclass of] statements and
        how is this done?"

        The class hierarchy in
        shows a few relatively large subclass trees for specialist domains,
        including molecular biology and mineralogy.  The several thousand
        subclass of 'gene' and 'protein' subclass claims were created by
        of WikiProject Molecular biology (WD:MB), based on discussions
        in [1]
        and [2].  The decision to use P279 instead of P31 there was
        based on the
        fact that the "is-a" relation in Gene Ontology maps to
        which P279 is based on.  The claims were added by a bot [3],
        with input
        from WD:MB members.  The data ultimately comes from external

        A glance at the mineralogy class hierarchy indicates it has been
        constructed by WikiProject Mineralogy [4] members through non-bot
        edits.  I imagine most of the other subclass of claims are done
        or semi-automatically outside specific Wikiproject efforts.  In
        words, I think most of the other P279 claims are added by
        Wikidata users
        going into the UI and building usually-reasonable concept
        hierarchies on
        domains they're interested in.  I've worked on constructing class
        hierarchies for health problems (e.g. diseases and injuries) [5] and
        medical procedures [6] based on classifications like ICD-10 and
        assertions and templates on Wikipedia (e.g. [8]).

        It's not incredibly surprising to me that Wikidata has about 36,000
        subclass of (P279) claims [9].  The property has been around for
        over a year and is a regular topic of discussion [10] along with
        instance of (P31), which has over 6,600,000 claims.

        You noted a dubious subclass of claim for 'House of Staufen'
        (Q130875).  I agree that instance of would probably be the better
        membership property to use there.  Such questionable usage of P279
        is probably uncommon, but definitely not singular.  The dynasty
        class hierarchy shows 13 dubious cases at the moment [11].  I would
        guess fewer than 5% of subclass of claims have that kind of issue,
        where instance of would make more sense.  I think there are
        probably vastly more cases of the converse: instance of being used
        where subclass of would make more

        As you probably know, P31 and P279 are intended to have the
        semantics of rdf:type and rdfs:subClassOf per community decision.
        A while ago I read a bit about the ELK reasoner you were involved
        with [12], which
        use of the seemingly class-centric OWL EL profile.  Do you have any
        plans to integrate features of ELK with the Wikidata Toolkit [13]?
        How do you see reasoning engines using P31 and P279 in the future,
        if at all?



        [2] https://www.wikidata.org/wiki/WT:MB#Human.2Fmouse.2F..._ID
        [3] https://www.wikidata.org/wiki/User:ProteinBoxBot.  Chinmay Nalk
        (https://www.wikidata.org/wiki/User:Chinmay26) did all the work on
        this, with input from WD:MB.
        [7] http://apps.who.int/classifications/icd10/browse/2010/en
        [8] https://en.wikipedia.org/wiki/Template:Surgeries
        [10] Examples include

        [12] http://korrekt.org/page/The_Incredible_ELK
        [13] https://www.mediawiki.org/wiki/Wikidata_Toolkit

        On Mon, May 5, 2014 at 12:46 PM, Markus Kroetzsch



             I got interested in subclass of (P279) and instance of (P31)
             statements recently. I was surprised by two things:

             (1) There are quite a lot of subclass of statements: tens of
             thousands.
             (2) Many of them make a lot of sense, and (in particular) are
             not (obvious) copies of Wikipedia categories.

             My big question is: who is creating all these statements and
             how is this done? It seems too much data to be created
             manually, but I don't see obvious automated approaches either
             (and there are usually no references given).

             I also found some rare issues. "A subclass of B" should be read
             as "Every A is also a B". For example, we have "Every piano
             (Q5994) is also a keyboard instrument (Q52954)". Overall, the
             great majority of cases I looked at had remarkably sane
             modelling (which reinforces my big question).

             But there are still cases where "subclass of" is mixed up with
             "instance of". For example, Wikidata also says "Every 'House of
             Staufen' (Q130875) is also a dynasty (Q164950)". This is
             dubious -- how many instances of 'House of Staufen' are there?
             I guess we really want to say that "The House of Staufen is
             a(n instance of) dynasty." Is this a singular error or a
             systematic issue?

             I guess there is already a group of people who deal with such
             issues -- or it would be a miracle that things are in such a good
             already :-) I have read the talk page for subclass of, but that
             does not seem to explain the origin of all the data we have



             Wikidata-l mailing list

    Wikidata-l mailing list
    Wikidata-l@lists.wikimedia.org <mailto:Wikidata-l@lists.wikimedia.org>
