Re: [Wikidata-l] Question about wikipedia categories.

2013-05-05 Thread emw
There's a related essay on Wikimedia Commons:
http://commons.wikimedia.org/wiki/User:Multichill/Next_generation_categories
.

The Wikidata properties 'instance of'
(https://www.wikidata.org/wiki/Property_talk:P31, formerly 'is a') and
'subclass of' (https://www.wikidata.org/wiki/Property_talk:P279) are
likely relevant to folks interested in ontology building on Wikidata.
They're based on rdf:type (http://www.w3.org/TR/rdf-schema/#ch_type) and
rdfs:subClassOf (http://www.w3.org/TR/rdf-schema/#ch_subclassof) from
W3C recommendations, and allow for building a rooted DAG that places
concepts into a hierarchy of knowledge.  They also allow for a degree
of type-token distinction
(http://en.wikipedia.org/wiki/Type%E2%80%93token_distinction) when
classifying subjects, though how that applies to certain knowledge
domains hasn't been fully sussed out.


On Sun, May 5, 2013 at 2:17 PM, Chris Maloney voldr...@gmail.com wrote:

 Doug from WikiSource started a page over at meta:
 http://meta.wikimedia.org/wiki/Beyond_categories

 I'll be trying to fill in some of my understanding of the problem and
 the scope of a possible solution.  I recognize there's been a lot of
 prior art on this issue, and a lot of existing overlapping tools and
 infrastructure, and I'm pretty new around here, and apt to be
 inaccurate and naive.  So I do hope others with more experience will
 come and help sort it out.

 Chris

 On Sun, May 5, 2013 at 11:06 AM, Michael Hale hale.michael...@live.com
 wrote:
  As far as checking the import progress of Wikidata, the category American
  women writers has 1479 articles. 651 of them currently have a main type
  (GND), 328 have a sex, 162 have an occupation, 111 have a country of
  citizenship, 49 have a sexual orientation, 39 have a place of birth, etc.
 
  From: j...@sahnwaldt.de
  Date: Sun, 5 May 2013 16:28:14 +0200
 
  To: wikidata-l@lists.wikimedia.org
  Subject: Re: [Wikidata-l] Question about wikipedia categories.
 
  Hi Pat,
 
  I've been involved with DBpedia for several years, so these are
  interesting thoughts.
 
  On 5 May 2013 01:25, Patrick Cassidy p...@micra.com wrote:
   If one is interested in a functional “category” system, it would be
 very
   helpful to have a good logic-based ontology as the backbone.
  
   I haven’t looked recently, but when I inquired about the ontology used
   by
   DBpedia a year ago, I was referred to “dbpedia-ontology.owl”, an
   ontology in
   the format of the “semantic web” ontology format OWL. The OWL format
 is
   excellent for simple purposes, but the dbpedia-ontology.owl (at that
   time)
   was not well-structured (being very polite).
 
  Do you mean just the file dbpedia-ontology.owl or the DBpedia ontology
  in general? We still use OWL as our main format for publishing the
  ontology. The file is generated automatically. Maybe the generation
  process could be improved.
 
   I did inquire as to who was
   maintaining the ontology, and had a hard time figuring out how to help
   bring
   it up to professional standards. But it was like punching jello,
 nothing
   to
   grasp onto. I gave up, having other useful things to do with my time.
 
  The ontology is maintained by a community that everyone can join at
  http://mappings.dbpedia.org/ . An overview of the current class
  hierarchy is here:
  http://mappings.dbpedia.org/server/ontology/classes/ . You're more
  than welcome to help! I think talk pages are not used enough on the
  mappings wiki, so if you have ideas, misgivings or questions about the
  DBpedia ontology, the place to go is probably the mailing list:
  https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion
 
  Thanks!
 
  Christopher
 
  
  
  
   Perhaps it is time now, with more experience in hand, to rethink the
   category system starting with basics. This is not as hard as it
 sounds.
   It may require some changes where there is ambiguity or logical
   inconsistency, but mostly it is only necessary to link the Wikipedia
   categories
   to an ontology based on a well-structured and logically sound
 foundation
   ontology (also referred to as an “upper ontology”), that supplies the
   basic
   categories and relations. Such an ontology can provide the basic
   concepts,
   whose labels can be translated into any terminology that any local
 user
   wants to use. There are several well-structured foundation ontologies,
   based on over twenty years of research, but the one I suggest is the
 one
   I
   am most familiar with (which I created over the past seven years),
   called
   COSMO. The files at http://micra.com/COSMO will provide the ontology
   itself
   (“COSMO.owl”, in OWL) and papers describing the basic principles.
 COSMO
   is structured to be a “primitives-based foundation ontology”,
 containing
   all
   of the “semantic primitives” needed to describe anything one wants to
   talk
   about. All other categories are structured as logical combinations of
   the
   basic elements. Its inventory of primitives is probably 

Re: [Wikidata-l] Fwd: Re: [Wikitech-l] Why isn't hotcat an extension?

2013-07-18 Thread emw
The relationship between Wikipedia categories and Wikidata pops up here and
there in discussions -- a recent one was
https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2013/06#Proposal_for_phase_4:_unify_and_centralize_categories
.

I think Wikidata properties
(https://www.wikidata.org/wiki/Wikidata:Glossary#Property) and
queries (https://www.wikidata.org/wiki/Wikidata:Glossary#Query) will likely
go a long way toward obsolescing Wikipedia categories.  Categories are
essentially queries on a set of pre-defined properties. The manual
maintenance that has been required to curate Wikipedia's category system
seems like it could be largely eliminated (or, at least, centralized and
streamlined) once Wikidata queries are deployed.  Wikidata properties like
those covered in
https://www.wikidata.org/wiki/Help:Basic_membership_properties also allow
subjects to be arranged into a taxonomy of concepts, which is one of the
main features of categories.
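
As a concrete sketch of "categories as queries": a category like American
women writers could in principle be expressed as a query over four
property claims.  The SPARQL below is purely illustrative -- it assumes a
SPARQL endpoint over Wikidata's data and the wd:/wdt: prefix shorthands
declared in it, neither of which exists yet.

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?person
WHERE
{
   ?person wdt:P31 wd:Q5 ;        # instance of: human
           wdt:P21 wd:Q6581072 ;  # sex: female
           wdt:P106 wd:Q36180 ;   # occupation: writer
           wdt:P27 wd:Q30 .       # country of citizenship: United States
}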

I'm not aware of any concrete plans to replace the category system with a
solution from Wikidata, but I think it would make more sense to explore
that option than to work on importing Wikipedia categories en masse into
Wikidata.


On Thu, Jul 18, 2013 at 6:18 PM, rupert THURNER rupert.thur...@gmail.com wrote:

 Let's forward this to here; maybe somebody here has already thought about
 categories in Wikidata.
 -- Forwarded message --
 From: Tyler Romeo tylerro...@gmail.com
 Date: 18.07.2013 23:08
 Subject: Re: [Wikitech-l] Why isn't hotcat an extension?
 To: Wikimedia developers wikitec...@lists.wikimedia.org

 On Thu, Jul 18, 2013 at 4:04 PM, Antoine Musso hashar+...@free.fr wrote:

  Let's move the categories to Wikidata? =)


 That'd be nice, but how much time would that take to develop?

 *-- *
 *Tyler Romeo*
 Stevens Institute of Technology, Class of 2016
 Major in Computer Science
 www.whizkidztech.com | tylerro...@gmail.com
 ___
 Wikitech-l mailing list
 wikitec...@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikitech-l

 ___
 Wikidata-l mailing list
 Wikidata-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata-l


___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] Application: sexing people by name/research gender bias

2013-10-15 Thread emw
Max's comment is closely related to Wikidata.  The sex property [1] is a
model system to explore important questions for the project at large.

For example, how rigorous do we want to be with automatic classification?
Let's say a property can have one of three values: A, B or C.  Roughly 90%
of the valid subjects for that property are known to be either A or B, and
10% are known to be C.  Our automatic classifier can assign all valid
subjects to either A or B.  However, it can't segregate A or B from C.  So
our false positive rate is at least 10%.  Would it be acceptable for
Wikidata to have a known error rate of 10% in certain properties?  At what
error rate does automatic classification become unacceptable?

Another question this topic broaches: do we want to adopt formal domain and
range constraints on properties?  If we do, then how do we handle rare
values?  How about exceedingly rare values?  (It should be noted that the
Wikidata sex property includes intersex in its range constraints [2].)
There is ongoing discussion about whether we want to adopt range and domain
constraints (among other property metadata) in Wikidata's Project chat [3].

Eric
https://www.wikidata.org/wiki/User:Emw

1.  https://www.wikidata.org/wiki/Property:P21
2.  https://www.wikidata.org/wiki/Property_talk:P21
3.
https://www.wikidata.org/wiki/Wikidata:Project_chat#What_type_of_data_should_be_stored
(permalink:
https://www.wikidata.org/w/index.php?title=Wikidata:Project_chat&oldid=78406798#What_type_of_data_should_be_stored
)


On Tue, Oct 15, 2013 at 2:33 PM, Tom Morris tfmor...@gmail.com wrote:

 So you've got an agenda that's unrelated to Wikidata or analysis thereof.
  Got it.  Perhaps a non-Wikidata list would be a more appropriate forum.

 On Tue, Oct 15, 2013 at 2:08 PM, Klein,Max kle...@oclc.org wrote:

  Sorry to rant.


 Accepted.

 Tom

 ___
 Wikidata-l mailing list
 Wikidata-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata-l


___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] ontology Wikidata API, managing ontology structure and evolutions

2014-01-09 Thread emw

 What about monthly/dump-based aggregated property usage statistics?


Property usage statistics would be very valuable, Dimitris.  It would help
inform community decisions about how to steer changes in property usage
with less disruption.  It would have other significant benefits as well.

Getting daily counts like
https://www.wikidata.org/wiki/Wikidata:Database_reports/Popular_properties
back up and running would be a good place to start.  That report hasn't
been updated since October 2013.  We could go further by showing counts for
all properties, not just the top 100.

More detailed data would be great, too.  Wikidata editors recently posted a
list of the most popular objects for 'instance of' (P31) claims at
https://www.wikidata.org/w/index.php?title=Property_talk:P31&oldid=99405143#Value_statistics.
Having daily data like that for all properties would be quite useful.
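
As an aside, value statistics like that P31 report reduce to a simple
aggregation.  A sketch in SPARQL, assuming the data is loaded into a
SPARQL-capable store and using an illustrative wdt: prefix for simple
claim triples:

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?value (COUNT(?item) AS ?uses)
WHERE
{
   ?item wdt:P31 ?value .
}
GROUP BY ?value
ORDER BY DESC(?uses)
LIMIT 100

Varying the property in place of P31 would yield the per-property counts
discussed above.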

If anyone does end up doing something like this, I would recommend
archiving the data at http://dumps.wikimedia.org/other/ in addition to
posting it in a regularly updated report in Wikidata.

Cheers,
Eric

https://www.wikidata.org/wiki/User:Emw




On Thu, Jan 9, 2014 at 12:59 PM, Dimitris Kontokostas 
kontokos...@informatik.uni-leipzig.de wrote:

 What about monthly/dump-based aggregated property usage statistics?
 People would be able to check property trends or maybe subscribe to
 specific properties via rss.



 On Thu, Jan 9, 2014 at 3:55 PM, Daniel Kinzler 
 daniel.kinz...@wikimedia.de wrote:

 On 08.01.2014 16:20, Thomas Douillard wrote:
  Hi, a problem seems (not very surprisingly) to be emerging on Wikidata:
  managing the evolution of how we do things on Wikidata.
 
  Properties are deleted, which sometimes leaves consumers of the data a
  little frustrated that they were not informed and could not take part in
  the discussion.

 They are informed if they follow the relevant channels. There's no way to
 inform
 them if they don't. These channels can very likely be improved, yes.

 That being said: a property that is still widely used should very rarely
 be deleted, if at all. Usually, properties would be phased out by
 replacing them with another property, and only then do they get deleted.

 Of course, 3rd parties that rely on specific properties would still face
 the
 problem that the property they use is simply no longer used (that's the
 actual
 problem - whether it is deleted doesn't really matter, I think).

 So, the question is really: how should 3rd party users be notified of
 changes in policy and best practice regarding the usage and meaning of
 properties?

 That's an interesting question, one that doesn't have a technical
 solution I can
 see.

 -- daniel


 --
 Daniel Kinzler
 Senior Software Developer

 Wikimedia Deutschland
 Gesellschaft zur Förderung Freien Wissens e.V.

 ___
 Wikidata-l mailing list
 Wikidata-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata-l




 --
 Dimitris Kontokostas
 Department of Computer Science, University of Leipzig
 Research Group: http://aksw.org
 Homepage: http://aksw.org/DimitrisKontokostas

 ___
 Wikidata-l mailing list
 Wikidata-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata-l


___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] difference between Person and Human classes

2014-02-15 Thread emw

 what's the usage of the Person class ?


'Person' [1] should generally be avoided.  For background, see the
discussion about how to classify subjects like Coco Chanel in [2].  That's
the basis for the note "not for use with P31, instead use Q5 (human)".
Pretty much all the items that link to 'person' [3] shouldn't.

For fictional characters, the convention is to classify them as 'fictional
character', e.g. as done for Jack Bauer [4].  There are some tricky
knowledge representation issues with fictional entities.  For example, how
do we ensure that Harry Potter is not returned in a query for all people
born in London in 1984?  The fictional universes project [5] aims to
address problems like that.
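
To make the Harry Potter example concrete: a query for people born in
London in 1984 would hinge on instance of 'human' (Q5), which an item
classified as 'fictional character' would not match.  A hedged sketch,
assuming SPARQL access to the data and illustrative wd:/wdt: prefixes:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?person
WHERE
{
   ?person wdt:P31 wd:Q5 ;   # instance of: human -- excludes items classed
                             # as 'fictional character'
           wdt:P19 wd:Q84 ;  # place of birth: London
           wdt:P569 ?dob .   # date of birth
   FILTER (YEAR(?dob) = 1984)
}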

1) https://www.wikidata.org/wiki/Q215627
2)
https://www.wikidata.org/wiki/Wikidata:Requests_for_comment/Migrating_away_from_GND_main_type#P31_value_for_things_like_Coco_Chanel
3) https://www.wikidata.org/wiki/Special:WhatLinksHere/Q215627
4) https://www.wikidata.org/wiki/Q24
5) https://www.wikidata.org/wiki/Wikidata:Wikiproject_Fictional_universes



On Sat, Feb 15, 2014 at 4:31 AM, Jane Darnell jane...@gmail.com wrote:

 I would imagine it's for fictional characters like Little Red Riding
 Hood, but I see that when I click on "What links here" while on page
 Q215627 I see Sleeping Beauty, but also roles and dead people. I am
 just as lost as you are!

 2014-02-15 2:42 GMT+01:00, Hady elsahar hadyelsa...@gmail.com:
  Hi all,
 
  just got confused a little bit between Person Q215627 and Human Q5
 classes
  in the Person page https://www.wikidata.org/wiki/Q215627 it's written
  "not for use with P31, instead use Q5 human"; if so, what's the usage of
  the Person class?
 
  thanks
  Regards
  -
  Hady El-Sahar
  Research Assistant
  Center of Informatics Sciences | Nile University
  http://nileuniversity.edu.eg/
 

 ___
 Wikidata-l mailing list
 Wikidata-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata-l

___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] Subclass of/instance of

2014-05-05 Thread emw
Hi Markus,

You asked "who is creating all these [subclass of] statements and how is
this done?"

The class hierarchy in
http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q35120&rp=279&lang=en
shows a few relatively large subclass trees for specialist domains,
including molecular biology and mineralogy.  The several thousand subclass
of claims on 'gene' and 'protein' items were created by members of
WikiProject Molecular biology (WD:MB), based on discussions in [1] and
[2].  The decision to use P279 instead of P31 there was based on the fact
that the is-a relation in Gene Ontology maps to rdfs:subClassOf, which
P279 is based on.  The claims were added by a bot [3], with input from
WD:MB members.  The data ultimately comes from external biological
databases.

A glance at the mineralogy class hierarchy indicates it has been
constructed by WikiProject Mineralogy [4] members through non-bot edits.  I
imagine most of the other subclass of claims are done manually or
semi-automatically outside specific Wikiproject efforts.  In other words, I
think most of the other P279 claims are added by Wikidata users going into
the UI and building usually-reasonable concept hierarchies on domains
they're interested in.  I've worked on constructing class hierarchies for
health problems (e.g. diseases and injuries) [5] and medical procedures [6]
based on classifications like ICD-10 [7] and assertions and templates on
Wikipedia (e.g. [8]).

It's not incredibly surprising to me that Wikidata has about 36,000
subclass of (P279) claims [9].  The property has been around for over a
year and is a regular topic of discussion [10] along with instance of
(P31), which has over 6,600,000 claims.

You noted a dubious subclass of claim for 'House of Staufen'
(Q130875).  I agree that instance of would probably be the better
membership property to use there.  Such questionable usage of P279 is
probably uncommon, but definitely not singular.  The dynasty class
hierarchy shows 13 dubious cases at the moment [11].  I would guess less
than 5% of subclass of claims have that kind of issue, where instance of
would make more sense.  I think there are probably vastly more cases of the
converse: instance of being used where subclass of would make more sense.

As you probably know, P31 and P279 are intended to have the semantics of
rdf:type and rdfs:subClassOf per community decision.  A while ago I read a
bit about the ELK reasoner you were involved with [12], which makes use of
the seemingly class-centric OWL EL profile.  Do you have any plans to
integrate features of ELK with the Wikidata Toolkit [13]?  How do you see
reasoning engines using P31 and P279 in the future, if at all?

Thanks,
Eric

https://www.wikidata.org/wiki/User:Emw

[1]
https://www.wikidata.org/wiki/WT:MB#Distinguishing_between_genes_and_proteins
[2] https://www.wikidata.org/wiki/WT:MB#Human.2Fmouse.2F..._ID
[3] https://www.wikidata.org/wiki/User:ProteinBoxBot.  Chinmay Nalk (
https://www.wikidata.org/wiki/User:Chinmay26) did all the work on this,
with input from WD:MB.
[4] https://www.wikidata.org/wiki/Wikidata:WikiProject_Mineralogy
[5]
http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q15281399&rp=279&lang=en
[6]
http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q796194&rp=279&lang=en
[7] http://apps.who.int/classifications/icd10/browse/2010/en
[8] https://en.wikipedia.org/wiki/Template:Surgeries
[9]
https://www.wikidata.org/w/index.php?title=Wikidata:Database_reports/Popular_properties&oldid=125595374
[10] Examples include
- https://www.wikidata.org/wiki/Wikidata:Project_chat#chemical_element
-
https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2013/12#Top_of_the_subclass_tree
-
https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2014/01#Question_about_classes.2C_and_.27instance_of.27_vs_.27subclass.27
[11]
http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q164950&rp=279&lang=en
[12] http://korrekt.org/page/The_Incredible_ELK
[13] https://www.mediawiki.org/wiki/Wikidata_Toolkit


On Mon, May 5, 2014 at 12:46 PM, Markus Kroetzsch 
markus.kroetz...@tu-dresden.de wrote:

 Hi,

 I got interested in subclass of (P279) and instance of (P31) statements
 recently. I was surprised by two things:

 (1) There are quite a lot of subclass of statements: tens of thousands.
 (2) Many of them make a lot of sense, and (in particular) are not
 (obvious) copies of Wikipedia categories.

 My big question is: who is creating all these statements and how is this
 done? It seems too much data to be created manually, but I don't see
 obvious automated approaches either (and there are usually no references
 given).

 I also found some rare issues. "A subclass of B" should be read as "Every
 A is also a B". For example, we have "Every piano (Q5994) is also a
 keyboard instrument (Q52954)". Overall, the great majority of cases I
 looked at had remarkably sane modelling (which reinforces my big question).

 But there are still cases where subclass of is mixed up with instance
 of. For example, 

Re: [Wikidata-l] Wikidata query feature: status and plans

2014-06-11 Thread emw
In case anyone is a bit lost, Tom is proposing an approach to
classification we've been calling explicit metamodeling.  Simply put,
let's say you have a class hierarchy:

A subclass of B
B subclass of C
C subclass of D

The proposal, as I understand it, is to add instance of claims for almost
all classes in Wikidata, which would yield classifications like:

A subclass of B
A instance of 'type of B'
B subclass of C
B instance of 'type of C'
C subclass of D
C instance of 'type of D'

The rationale for this is to enable querying the direct (immediate)
subclasses of any given class.  This approach might be theoretically valid
in all classification, but I don't think it's a sensible solution for most
classification.  As you can see, explicit metamodeling introduces claims
that are rather redundant.  Tom's idea seems to be to use this approach for
almost all classification on Wikidata.

I am not enthusiastic about pervasively using that approach to
classification throughout Wikidata.  There are other ways to get direct
subclasses, several of which are described in
http://answers.semanticweb.com/questions/14699/get-immediate-subclasses-of-a-class.
For example, you could turn off entailments / inferencing in a query engine
you're using.  You could do a typical subclass query and filter out
non-direct subclasses.

Those querying approaches seem much simpler and more conventional than
saturating a concept hierarchy with redundant instance of 'type of Foo'
statements to enable querying direct subclasses.
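
For instance, under RDFS entailment (where rdfs:subClassOf is transitive
and reflexive), the direct subclasses of a hypothetical class <C> can
still be recovered by filtering out anything that reaches <C> through a
distinct intermediate class:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?sub
WHERE
{
   ?sub rdfs:subClassOf <C> .
   FILTER NOT EXISTS
   {
      ?sub rdfs:subClassOf ?mid .
      ?mid rdfs:subClassOf <C> .
      FILTER (?mid != ?sub && ?mid != <C>)
   }
}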

The extended discussion Tom refers to can be found at
https://www.wikidata.org/wiki/Wikidata_talk:Country_subdivision_task_force#layers.
Reading through that discussion, you might get the feeling, as I did, that
in addition to introducing substantial redundancy, widespread use of
explicit metamodeling would be quite confusing for users.

What are others' thoughts?

Thanks,
Eric
https://www.wikidata.org/wiki/User:Emw




On Wed, Jun 11, 2014 at 6:56 AM, Thomas Douillard 
thomas.douill...@gmail.com wrote:




 I'm still talking of the model I proposed in my first post in this thread.
 I did give an advantage: you can very simply query the types of units
 used by a country to classify its administrative units, like Region and
 Departement (as two items) for France, in one request of the future
 simple query module: just retrieve the instances of the class "French
 type of administrative units". This model also applies to any country's
 administrative territorial division.

 I think "French type of administrative units" is the auxiliary item
 Markus mentioned. We can define precisely what it is because this class
 regroups the types used by France to classify cities, departments ... so
 clearly if we talk of Paris, there are several possibilities, like the
 region of Paris, the City of Paris ... The City of Paris item is the only
 one clearly defined by French law, hence it is an instance of the class
 "French ville", which in turn is an instance of "French type of
 administrative units".

 This seems to me a useful model, which can generalise easily to classify
 things like urban units, which are used for statistical purposes and are
 defined by a national statistical body in each country, such as INSEE in
 France. Then we could also have an "Urban unit" class in Wikidata, but
 this is ambiguous.

 This class could have a subclass "Urban unit as defined by INSEE", with
 instances such as "Parisian urban unit", for which it gives statistical
 information. "Urban unit as defined by INSEE" in turn may be a subclass
 of "any geographical unit defined by INSEE". Then if you want INSEE
 geographical units, you query all instances of both "Urban unit" and "any
 geographical unit defined by INSEE".

 But now let's say you want to find INSEE's own definition of "urban
 unit", not the instances. One way to do that would be to look at the
 subclass tree of "any geographical unit defined by INSEE" or that of
 "Urban unit", or compute the intersection of both trees.

 One alternative, using metamodelling this time, would be to have a class
 regrouping all definitions of statistical units used by INSEE. Those
 definitions correspond to classes we already have; for example, "Urban
 unit as defined by INSEE" corresponds to INSEE's definition of urban
 unit. Then the class of all definitions used by INSEE would correspond to
 the class of all classes with a name "... unit defined by INSEE". I
 propose to create the item "any type of unit defined by INSEE", with
 "Urban unit as defined by INSEE" an instance of it (actually I might have
 already done it /o\)

 This is a (non mutually exclusive, more complementary) alternative to
 just classifying administrative unit instances. In a way, this is just
 identifying (reifying) a classification system and putting ''instance
 of'' this-item statements on its classes.

 I think this is interesting in Wikidata as we actually are using a lot of
 

Re: [Wikidata-l] Wikidata RDF exports

2014-06-13 Thread emw
Markus,

Thank you very much for this.  Translating Wikidata into the language of
the Semantic Web is important.  Being able to explore the Wikidata taxonomy
[1] by doing SPARQL queries in Protege [2] (even primitive queries) is
really neat, e.g.

SELECT ?subject
WHERE
{
   ?subject rdfs:subClassOf <http://www.wikidata.org/entity/Q82586> .
}

This is more of an issue of my ignorance of Protege, but I notice that the
above query returns only the direct subclasses of Q82586.  The full set of
subclasses for Q82586 (lepton) is visible at
http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q82586&rp=279&lang=en --
a few of the 2nd-level subclasses (muon neutrino, tau neutrino, electron
neutrino) are shown there but not returned by that SPARQL query.  It seems
rdfs:subClassOf isn't being treated as a transitive property in Protege.
Any ideas?

Do you know when the taxonomy data in OWL will have labels available?

Also, regarding the complete dumps, would it be possible to export a
smaller subset of the faithful data?  The files under Complete Data Dumps
in http://tools.wmflabs.org/wikidata-exports/rdf/exports/20140526/ look too
big to load into Protege on most personal computers, and would likely
require adjusting JVM settings on higher-end computers to load.  If it's
feasible to somehow prune those files -- and maybe even combine them into
one file that could be easily loaded into Protege -- that would be
especially nice.

Thanks,
Eric
https://www.wikidata.org/wiki/User:Emw

1.
http://tools.wmflabs.org/wikidata-exports/rdf/exports/20140526/wikidata-taxonomy.nt.gz
2. http://protege.stanford.edu/





On Tue, Jun 10, 2014 at 4:43 AM, Markus Kroetzsch 
markus.kroetz...@tu-dresden.de wrote:

 Hi all,

 We are now offering regular RDF dumps for the content of Wikidata:

 http://tools.wmflabs.org/wikidata-exports/rdf/

 RDF is the Resource Description Framework of the W3C that can be used to
 exchange data on the Web. The Wikidata RDF exports consist of several files
 that contain different parts and views of the data, and which can be used
 independently. Details on the available exports and the RDF encoding used
 in each can be found in the paper Introducing Wikidata to the Linked Data
 Web [1].

 The available RDF exports can be found in the directory
 http://tools.wmflabs.org/wikidata-exports/rdf/exports/. New exports are
 generated regularly from current data dumps of Wikidata and will appear in
 this directory shortly afterwards.

 All dump files have been generated using Wikidata Toolkit [2]. There are
 some important differences in comparison to earlier dumps:

 * Data is split into several dump files for convenience. Pick whatever you
 are most interested in.
 * All dumps are generated using the OpenRDF library for Java (better
 quality than ad hoc serialization; much slower too ;-)
 * All dumps are in N3 format, the simplest RDF serialization format that
 there is
 * In addition to the faithful dumps, some simplified dumps are also
 available (one statement = one triple; no qualifiers and references).
 * Links to external data sets are added to the data for Wikidata
 properties that point to datasets with RDF exports. That's the Linked in
 Linked Open Data.

 Suggestions for improvements and contributions on github are welcome.

 Cheers,

 Markus

 [1] http://korrekt.org/page/Introducing_Wikidata_to_the_Linked_Data_Web
 [2] https://www.mediawiki.org/wiki/Wikidata_Toolkit

 --
 Markus Kroetzsch
 Faculty of Computer Science
 Technische Universität Dresden
 +49 351 463 38486
 http://korrekt.org/

 ___
 Wikidata-l mailing list
 Wikidata-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata-l

___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] Wikidata RDF exports

2014-06-14 Thread emw
Markus,

Thanks for the thorough reply!

you can use SPARQL 1.1 transitive closure in queries (using * after
 properties), so you can find all subclasses there too. (You could also
 try this in Protege ...)


I had a feeling I was missing something basic.  (I'm also new to SPARQL.)
Using * after the property got me what I was looking for by default in
Protege.  That is,

SELECT ?subject
WHERE
{
   ?subject rdfs:subClassOf* <http://www.wikidata.org/entity/Q82586> .
}

-- with an asterisk after rdfs:subClassOf -- got me the transitive closure
and returned all subclasses of Q82586 / lepton.

Should we maybe create an English label file for the classes? Descriptions
 too or just labels?


A file with English labels and descriptions for classes would be great and,
I think, address this use case.  Per your note, I suppose one would simply
concatenate that English terms file and wikidata-taxonomy.nt into a new .nt
file, then import that into Protege to explore the class hierarchy.
(Having every line in the ontology be self-contained in N3 is very
convenient!)

Regarding the pruned subset, I think the command-line approach in your
examples is enough for me to get started making my own.

I won't have time to experiment with these things for a few weeks, but I
will return to this then and let you know any interesting findings.

Cheers,
Eric


On Sat, Jun 14, 2014 at 4:41 AM, Markus Krötzsch 
mar...@semantic-mediawiki.org wrote:

 Eric,

 Two general remarks first:

 (1) Protege is for small and medium ontologies, but not really for such
 large datasets. To get SPARQL support for the whole data, you could
 install Virtuoso. It also comes with a simple Web query UI. Virtuoso does
 not do much reasoning, but you can use SPARQL 1.1 transitive closure in
 queries (using * after properties), so you can find all subclasses
 there too. (You could also try this in Protege ...)

 (2) If you want to explore the class hierarchy, you can also try our new
 class browser:

 http://tools.wmflabs.org/wikidata-exports/miga/?classes

 It has the whole class hierarchy, but without the leaves (=instances of
 classes + subclasses that have no own subclasses/instances). For example,
 it tells you that lepton has 5 direct subclasses, but shows only one:

 http://tools.wmflabs.org/wikidata-exports/miga/?classes#_item=3338

 On the other hand, it includes relationships of classes and properties
 that are not part of the RDF (we extract this from the data by considering
 co-occurrence). Example:

 Classes that have no superclasses but at least 10 instances, and which
 are often used with the property 'sex or gender':

 http://tools.wmflabs.org/wikidata-exports/miga/?classes#_cat=Classes/Direct%20superclasses=__null/Number%20of%20direct%20instances=10%20-%202/Related%20properties=sex%20or%20gender

 I already added superclasses for some of those in Wikidata now -- data in
 the browser is updated with some delay based on dump files.


 More answers below:


 On 14/06/14 05:52, emw wrote:

 Markus,

 Thank you very much for this.  Translating Wikidata into the language of
 the Semantic Web is important.  Being able to explore the Wikidata
 taxonomy [1] by doing SPARQL queries in Protege [2] (even primitive
 queries) is really neat, e.g.

 SELECT ?subject
 WHERE
 {
  ?subject rdfs:subClassOf <http://www.wikidata.org/entity/Q82586> .
 }

 This is more of an issue of my ignorance of Protege, but I notice that
 the above query returns only the direct subclasses of Q82586.  The full
 set of subclasses for Q82586 (lepton) is visible at
 http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q82586&rp=279&lang=en
 -- a few of the 2nd-level subclasses (muon neutrino, tau neutrino,
 electron neutrino) are shown there but not returned by that SPARQL
 query.  It seems rdfs:subClassOf isn't being treated as a transitive
 property in Protege.  Any ideas?


 You need a reasoner to compute this properly. For a plain class hierarchy
 as in our case, ELK should be a good choice [1]. You can install the ELK
 Protege plugin and use it to classify the ontology [2]. Protege will then
 show the computed class hierarchy in the browser; I am not sure what
 happens to the SPARQL queries (it's quite possible that they don't use the
 reasoner).

 [1] https://code.google.com/p/elk-reasoner/
 [2] https://code.google.com/p/elk-reasoner/wiki/ElkProtege



 Do you know when the taxonomy data in OWL will have labels available?


 We had not thought of this as a use case. A challenge is that the label
 data is quite big because of the many languages. Should we maybe create an
 English label file for the classes? Descriptions too or just labels?



 Also, regarding the complete dumps, would it be possible to export a
 smaller subset of the faithful data?  The files under Complete Data
 Dumps in
 http://tools.wmflabs.org/wikidata-exports/rdf/exports/20140526/ look too
 big to load into Protege on most personal computers, and would likely
 require adjusting JVM settings

Re: [Wikidata-l] Making a Wikipedia article link to two wikidata items

2014-09-09 Thread Emw

 For articles that are really about multiple different things that cannot
 be reconciled in a single natural concept:

 * State instance of: "Wikipedia article with multiple topics" (we already
 have several other classes of Wikipedia articles).
 * Use some property, say has topic, to link to items about the
 individual topics.
 * Optionally: use a property like subject of (P805) to link back from
 the individual items to the multi-topic pages.


Can we make do without annotation statements like instance of: Wikipedia
page with multiple topics?  In my opinion, such statements would
unnecessarily clutter a significant portion of our items and would be
better inferred by the presence of *subject of* (P805) claims.  I think
it's better to reserve *instance of* for talk about the essence of the
subject itself.

The closest inverse property for *subject of* is probably *facet of*
(P1269).

Example: https://en.wikipedia.org/wiki/Samoan_Clipper


See https://www.wikidata.org/wiki/Q7409943 for an initial pass at modelling
that.

Note how that Wikipedia page says "The aircraft developed an engine problem
(caused by an oil leak), which ultimately caused the in-flight explosion."
We currently have no generic way to model causes.  Coincidentally enough, I
just posted a detailed/long-winded proposal to address that.  Please see
https://www.wikidata.org/wiki/Property_talk:P828#A_better_way_to_model_causation
and give any feedback there!

Cheers,
Eric




On Tue, Sep 9, 2014 at 7:36 AM, Markus Krötzsch 
mar...@semantic-mediawiki.org wrote:

 My proposal became more clear to me over lunch:

 For articles that are really about multiple different things that cannot
 be reconciled in a single natural concept:

  * State instance of: "Wikipedia article with multiple topics" (we already
 have several other classes of Wikipedia articles).
 * Use some property, say has topic, to link to items about the
 individual topics.
 * Optionally: use a property like subject of (P805) to link back from
 the individual items to the multi-topic pages.

 The main proposal here is to treat these things like Wikipedia
 disambiguation pages: we have items, but the items are mainly about the
 page, not about any real-world concept we care about.

 Example: https://en.wikipedia.org/wiki/Samoan_Clipper

 It says Samoan Clipper was one of ten Pan American Airways Sikorsky S-42
 flying boats but it includes an infobox that lists fatalities. So the
 article describes both a specific airplane (the flying boat) and an event
 (crash of that plane). We should not try to invent a new concept of
 machine-event system to capture this, but have two items for the two
 things we have here.

 We will have many cases where this is not necessary if we can find a
 natural composite concept that it makes sense to talk about. In these
 case, we will use different properties for the links (for example, a
 country article may sometimes be used to describe all the federal states of
 that country, yet we have a good way of linking individual state items to
 the country). As usual, there will be corner cases where it is not clear
 what to do; then we need specific discussions on these cases.

 Cheers,

 Markus



 On 09.09.2014 11:57, Markus Krötzsch wrote:

 On 09.09.2014 11:33, Daniel Kinzler wrote:

 Am 09.09.2014 01:40, schrieb Denny Vrandečić:

 Create a third item in Wikidata, and use that for the language links.
 Any
 Wikipedia that has two separate articles can link to the separate
 items, any
 Wikipedia that has only one article can link to the single item.


 That's a nice solution for the language link problem, but modelling the
 relationship of these three items on wikidata is kind of
 annoying/tricky. How
 would you do that?


 Before the "how?" should come the "why?". The modelling should be chosen
 so that it best suits a given purpose (the purpose is the benchmark for
 deciding if a particular modelling approach is good or not). I guess
 the main thing we want to achieve here is to link the combined item to
 and from the single items. If this is true, then the "how?" question is
 basically a "which property to use?" question.

 For this we should look more closely at the nature of the combined item.
 Let's distinguish combined items that are natural and meaningful
 concepts from those that are just different topics combined for
 editorial reasons in one article. The first kind of item involves things
 like bands (who have members, possibly with individual articles, but
 which are still meaningful concepts by themselves). The second kind of
 item involves the Wangerooge hybrid, but also many other things (e.g.,
 plane crashes and the planes themselves; or people and events the people
 where involved in).

 The problem with these second type of complex item is that it does not
 give you a good basis for adding data (you can't say properly which
 aspects of the thing you are talking about). It is also problematic
 since these things are not natural concepts that can be 

[Wikidata-l] Why? Modeling causes on Wikidata

2014-09-19 Thread Emw
Hi all,

Talk about causes is ubiquitous in everyday life and many other domains of
knowledge.  Until recently, we've had a few properties to make statements
about cause in certain narrow areas, but lacked a way to structure data
about causes across a broad range of subjects.  For example, you might want
to know:

   - What caused World War II?
   - What causes evolution?
   - What causes malaria?
   - What causes bread to rise?
   - What causes rust?
   - What causes gravity?
   - What causes rainbows?

Wikidata now has some new properties that provide structure for basic
answers to such questions.

   - *has cause* (alias: *has underlying cause*): thing that ultimately
   resulted in the effect [1]
   - *has immediate cause*: thing that proximately resulted in the effect
   [2]
   - *has contributing factor*: thing that significantly influenced the
   effect, but did not directly cause it [3]

This approach to modeling causation attempts to balance expressiveness with
simplicity.  It borrows from the idea of causation as a chain of events,
which also has background conditions or events that set the stage for some
outcome.  These properties are not perfect, but they do allow us to capture
much richness in how various sources talk about causes -- and to do so in a
way that humans can easily understand.
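
As a sketch of how basic answers could then be retrieved -- assuming
SPARQL access to the data and illustrative wd:/wdt: prefixes, with Q362
being the item for World War II:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?cause ?immediateCause ?factor
WHERE
{
   OPTIONAL { wd:Q362 wdt:P828 ?cause . }           # has cause
   OPTIONAL { wd:Q362 wdt:P1478 ?immediateCause . } # has immediate cause
   OPTIONAL { wd:Q362 wdt:P1479 ?factor . }         # has contributing factor
}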

https://www.wikidata.org/wiki/Help:Modeling_causes explains these
properties, their background, examples, things to avoid, issues and
context.  Please comment on the 'Help:Modeling causes' talk page, or here,
with any feedback.

Hopefully we'll be able to build some cool stuff with this.

Cheers,
Eric

https://www.wikidata.org/wiki/User:Emw

1. *has cause*.  https://www.wikidata.org/wiki/Property:P828
2. *has immediate cause*.  https://www.wikidata.org/wiki/Property:P1478
3. *has contributing factor.* https://www.wikidata.org/wiki/Property:P1479
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] all human genes are now wikidata items

2014-10-07 Thread Emw
Andra, Chinmay, Ben, Andrew,

Kudos!  This is a significant milestone, and showcases Wikidata's potential
for structuring large sets of biological data.  Thanks for your excellent
work!

Cheers,
Eric

https://www.wikidata.org/wiki/User:Emw

On Mon, Oct 6, 2014 at 4:21 PM, Benjamin Good ben.mcgee.g...@gmail.com
wrote:

 I thought folks might like to know that every human gene (according to the
 United States National Center for Biotechnology Information) now has a
 representative entity on wikidata.  I hope that these are the seeds for
 some amazing applications in biology and medicine.

 Well done Andra and ProteinBoxBot !

 For example:
 Here is one (of approximately 40,000) called spinocerebellar ataxia 37
 https://www.wikidata.org/wiki/Q18081265

 -Ben

 ___
 Wikidata-l mailing list
 Wikidata-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata-l


___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] Item both subclass and instance?

2014-10-07 Thread Emw
I have removed the statement *instance of* chemical compound from ethanol
(Q153) [1].

A few proposals have been made in this thread about how -- or whether -- to
use *instance of* (i.e. rdf:type, P31) to classify 'ethanol' and other
chemical compounds, but there seems to be consensus that *instance of*
chemical compound is not the way to do it.

Summary of proposals:

   1. *Do not use instance of for chemical compounds*.  Such statements
   make Wikidata incompatible with many major scientific ontologies, like
   ChEBI, Gene Ontology and Disease Ontology, which use *instance of* as
   defined in the Relation Ontology (RO) [2].  Note that RO defines instances
   as particular things that have a unique location in space and time, whereas
   classes are universal, general entities which have particular instances.
   Instances and classes are thus disjoint, so RO-based ontologies cannot have
   entities that have both *instance of* (rdf:type, P31) and *subclass of*
   (rdfs:subClassOf, P279) statements as is possible in OWL 2 DL via punning.

   2. *Use statements like instance of 'type of chemical compound' for
   chemical compounds*.  Doing so makes it easier to generate lists of
   chemical compounds, and is valid in OWL 2 DL -- it is metamodeling via
   punning.

Let's build consensus for how (or whether) we want to use *instance of* for
chemical compounds before making any mass edits to remove or replace the
14,969 other *instance of* chemical compound claims [3] or to add
statements like *instance of* 'type of chemical compound' to ethanol.

Micru has a different proposal for how to model items, which incidentally
does not represent ethanol as an instance [4].  However, that proposal is
clearly a more radical vision for Wikidata, and probably warrants a
separate thread for discussion.

Eric

https://www.wikidata.org/wiki/User:Emw
[1] Removal of *instance of* chemical compound from ethanol:
https://www.wikidata.org/w/index.php?title=Q153&diff=162563849&oldid=162327014
[2] Barry Smith et al. (2005).  *Relations in Biomedical Ontologies*.
http://genomebiology.com/2005/6/5/r46
[3] All *instance of* chemical compound claims on Wikidata.
http://tools.wmflabs.org/wikidata-todo/autolist.html?q=claim[31:11173]
[4] 'ethanol' is no longer an instance, but a class.
https://lists.wikimedia.org/pipermail/wikidata-l/2014-October/004691.html
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


[Wikidata-l] NYC Wikidata workshop and hackathon this Sunday!

2014-12-11 Thread Emw
Hi all,

Wikimedia New York City will be hosting a Wikidata hackathon and beginners
workshop this coming Sunday.  This will be a good event to meet Wikimedians
involved with cultural institutions, structure a bunch of data, and help
new users.

If you're in the area, come!

When:
Sunday, December 14, 1:00 - 5:00 PM

Where:
55 Washington Street, Brooklyn, NY 11201
Room 321 (BLIP Outpost)

Details and sign up:
https://en.wikipedia.org/wiki/Wikipedia:Meetup/NYC/December_Wikidata

Cheers,
Eric
https://www.wikidata.org/wiki/User:Emw
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


[Wikidata-l] How to declare a property is transitive, etc.

2014-12-18 Thread Emw
Hi all,

Could those knowledgeable about OWL or intending to use Wikidata's RDF /
OWL exports please weigh in at
https://www.wikidata.org/wiki/Wikidata:Property_proposal/Property_metadata#How_should_we_declare_that_a_property_is_transitive
? [1]

Being able to declare certain properties of properties is an essential
building block for querying and inference.  However, the way to declare
that a property is, say, transitive in OWL does not have a clear analog in
Wikidata syntax.  We could certainly shoehorn such a statement into our
existing model (and it looks like we'll need to), but it is important to do
so in a way that complicates things as little as possible for downstream
users, e.g. outside researchers or developers using the RDF exports and
assuming standard OWL semantics.
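
For reference, the declaration itself is a single triple on the OWL side.
A minimal sketch in SPARQL Update, using 'part of' (P361) purely as an
illustration and assuming the entity URI scheme of the RDF exports:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
INSERT DATA
{
   <http://www.wikidata.org/entity/P361> a owl:TransitiveProperty .
}

The open question is how to represent that same fact as a Wikidata
statement so that it survives the round trip into the RDF exports.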

Please make any comments on this on-wiki at the location linked above.
That way we can keep the discussion centralized.

Other discussions on that page could also benefit from input by people
knowledgeable about Semantic Web vocabulary.

Thanks,
Eric

https://www.wikidata.org/wiki/User:Emw

1.  Discussion permalink:
https://www.wikidata.org/w/index.php?title=Wikidata:Property_proposal/Property_metadata&oldid=182088235#How_should_we_declare_that_a_property_is_transitive
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] subclass-of vs. instance-of

2014-12-31 Thread Emw
Automobile (Q1420) had the claims [1]:

*subclass of* motor road vehicle
*instance of* motor road vehicle

That was incorrect.  An instance of motor road vehicle is something like
the Peekskill Meteorite Car (Q7756463) [2].

It is generally incorrect when an item has *instance of* and *subclass of*
claims with the same value.  I am not aware of a Wikidata constraint
template which can encode that rule.  (Off hand I'm not sure how it would
be encoded in OWL, either.  Ontology experts: how would we do that?)
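
Whatever the OWL encoding turns out to be, the violations themselves are
easy to find with a query.  A sketch, assuming SPARQL access to the data
and an illustrative wdt: prefix for simple claim triples:

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?item ?value
WHERE
{
   ?item wdt:P31 ?value ;   # instance of some class ...
         wdt:P279 ?value .  # ... and subclass of that same class
}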

If we wanted to use both *instance of* and *subclass of* in automobile, then
we would need to do something like:

*subclass of* motor road vehicle
*instance of* motor road vehicle class

In my opinion, *instance of* claims like that are not very useful, because
they simply restate what is directly implied in the *subclass of* claim.
Punning that is not a mere rephrasing can be useful, e.g. Chevrolet Malibu
(Q287723) [3] *subclass of* mid-size car, *instance of* car model.

See also Markus's comment from September about using *subclass of* and
*instance of* in the same item, which conveniently also discusses
automobiles [4].

Happy Q11269!
Eric
https://www.wikidata.org/wiki/User:Emw

1.  https://www.wikidata.org/w/index.php?title=Q1420&oldid=184512429#P279
2.  https://www.wikidata.org/wiki/Q7756463
3.  https://www.wikidata.org/wiki/Q287723
4.
https://lists.wikimedia.org/pipermail/wikidata-l/2014-September/004649.html
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] Disambiguating property [was: Freebase like API with an OUTPUT feature}

2015-01-10 Thread Emw
Since it appears that the creation of *subproperty of* went unnoticed by
many, I'd like to describe an important aspect of its proper use, and how
that relates to classification.

Please note that *instance of* (P31) and *subclass of* (P279) are not valid
values for *subproperty of* (P1647) claims, as described in the P1647
documentation [1].  For example, claims like occupation *subproperty of*
instance of are invalid.  The reasons for this are both technical and
architectural.

On the technical side, *instance of, subclass of* and *subproperty of* are
intended to be straightforwardly exportable as rdf:type, rdfs:subClassOf
and rdfs:subPropertyOf.  As described in *On the Properties of Metamodeling
in OWL* [2], claims that use OWL's built-in vocabulary (e.g. rdf:type) as
individuals make an ontology undecidable.  If an ontology is undecidable,
then queries are not guaranteed to terminate.  This is a big deal.
Decidability is a main goal of OWL 2 DL and a requirement in the more
specialized profiles OWL 2 EL, OWL 2 RL and OWL 2 QL.  Most Semantic Web
ontologies aim to valid be in at least OWL 2 DL.  So if Wikidata aims to be
easily interoperable with the rest of the Semantic Web, we should aim to be
valid in OWL 2 DL, and thus not make claims of the form P *subproperty of*
instance of (P31) or P *subproperty of* subclass of (P279).

Avoiding such claims is also good design.  There should be one -- and
preferably only one -- obvious way to specify the type of an instance.
Having a multitude of domain-specific type subproperties would promote an
anti-pattern: using *instance of* as a catch-all property to make any
statement under the sun that makes sense when connected with the phrase "is
a".
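
To see why in RDFS terms: declaring, say, occupation (P106) a subproperty
of rdf:type would license the inference sketched below (RDFS entailment
rule rdfs7), turning every occupation value into a class of its subject.
A hypothetical illustration of the pattern to avoid, written as a SPARQL
CONSTRUCT over the entity URIs of the RDF exports:

# If P106 were rdfs:subPropertyOf rdf:type, a reasoner would materialize a
# type triple for every occupation claim, e.g. "Coco Chanel rdf:type
# fashion designer" -- making each occupation value a class.
CONSTRUCT { ?person a ?occupation . }
WHERE     { ?person <http://www.wikidata.org/entity/P106> ?occupation . }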

Having a single type property for instances also fosters another best
practice in Wikidata: asserted monohierarchy [3].  In other words, there
should be only one explicit normal or preferred *instance of* or *subclass
of* claim per item.  Having an *instance of* claim and a *subclass of*
claim on an item isn't necessarily bad (it's called punning), but having
multiple *instance of* claims or multiple *subclass of* claims on an item
is a bad smell.  Items can typically satisfy a huge number of *instance of*
claims, but should generally have only one such claim made explicitly in
Wikidata.

For example, Coco Chanel (Q45661) can be said to be *instance of* French
person, *instance of* fashion designer, *instance of* female, etc.
Instead of such catch-all use of *instance of*, Wikidata moves that
knowledge into properties like *country of citizenship* (P27), *occupation*
(P106) and *sex or gender* (P21).  Coco Chanel has one explicit *instance
of* value: human (Q5) -- a class that encapsulates essential features of
the subject.

Most of Wikidata follows these general principles of classification.  But a
few domains of knowledge remain either somewhat of a mess, or organized but
idiosyncratic.  Items like the one for the German municipality of Aalen
[4], with 7 *instance of* values -- several of them redundant -- exemplify
the mess.  With the deletion of domain-specific type properties like *type
of administrative territorial entity* (P132) [5], we are on the right
track.  The solution is not to make such things subproperties of *instance
of*, but rather to delete them and use *instance of* for one preferred
class and put other values in other properties (note -- this may require
new properties!).

The same applies for *subclass of*.

I encourage anyone interested in stuff like *subproperty of* to join the
discussions ongoing at
https://www.wikidata.org/wiki/Wikidata:Property_proposal/Property_metadata.
The Wikidata community is currently discussing how we want to handle things
like *domain* and *range* properties (e.g. should we use rdfs:domain or
schema:DomainIncludes?)  and whether we want to have an *inverse of*
property (or delete all inverse properties).  The outcome of these
discussions will shape the interface between Wikidata and the rest of the
Semantic Web.

Thanks,
Eric

https://www.wikidata.org/wiki/User:Emw


1.  https://www.wikidata.org/wiki/Property:P1647
2.  Boris Motik (2007).  On the Properties of Metamodeling in OWL.
https://www.cs.ox.ac.uk/boris.motik/pubs/motik07metamodeling-journal.pdf
*3.  *Barry Smith, Werner Ceusters (2011).  Ontological realism: A
methodology for coordinated evolution of scientific ontologies.  Section
1.8: Asserted monohierarchies.
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3104413/#S9
4.  Aalen on Wikidata as of 2015-01-10.
https://www.wikidata.org/w/index.php?title=Q3951&oldid=184247296#P31
5.
https://www.wikidata.org/wiki/Wikidata:Requests_for_deletions/Archive/2014/Properties/1#type_of_administrative_territorial_entity_.28P132.29
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] Disambiguating property [was: Freebase like API with an OUTPUT feature}

2015-01-13 Thread Emw
Hi James,

My mistake, I should have linked to
https://www.wikidata.org/wiki/Property_talk:P1647, which includes the
following in the 'Examples' section of the documentation template:

Note: it is not valid to declare subproperties of instance of (P31)
(rdf:type), subclass of (P279) (rdfs:subClassOf) or any other property
mapped to a built-in property of RDF, RDFS or OWL. See creation discussion.

The creation discussion is available at
https://www.wikidata.org/wiki/Wikidata:Property_proposal/Archive/27#subproperty_of
.

Eric
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] Kian: The first neural network to serve Wikidata

2015-03-07 Thread Emw
Amir,

What is the false positive rate of your algorithm when dealing with
fictitious humans and (non-fictitious) non-human organisms?  That is, how
often does your program classify such non-humans as humans?

Regarding the latter, note that items about individual dogs, elephants,
chimpanzees and even trees can use properties that are otherwise extremely
skewed towards humans.  For example, Prometheus (Q590010) [1], an extremely
old tree, has claims for *date of birth* (P569), *date of death* (P570),
even *killed by* (P157).  Non-human animals can also have kinship claims
(e.g. *mother*, *brother*, *child*), among other properties typically used on
humans.
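
If it helps with evaluating Kian, such items are straightforward to
enumerate.  A sketch, assuming SPARQL access over the RDF exports and
illustrative wd:/wdt: prefixes for simple claim triples:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?item
WHERE
{
   ?item wdt:P569 ?dateOfBirth .                # has a date of birth ...
   FILTER NOT EXISTS { ?item wdt:P31 wd:Q5 . }  # ... but not classed as human
}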

Best,
Eric

https://www.wikidata.org/wiki/User:Emw

1.  Prometheus.  https://www.wikidata.org/wiki/Q590010

On Sat, Mar 7, 2015 at 1:44 PM, Amir Ladsgroup ladsgr...@gmail.com wrote:

 Hey Markus,
 Thanks for your insight :)

 On Sat, Mar 7, 2015 at 9:52 PM, Markus Krötzsch 
 mar...@semantic-mediawiki.org wrote:

 Hi Amir,

 In spite of all due enthusiasm, please evaluate your results (with
 humans!) before making automated edits. In fact, I would contradict Magnus
 here and say that such an approach would best be suited to provide
 meaningful (pre-filtered) *input* to people who play a Wikidata game,
 rather than bypassing the game (and humans) altogether. The expected error
 rates are quite high for such an approach, but it can still save a lot of
 works for humans.

 There is a certainty factor, and it can save a lot without making such
 errors by using the certainty factor.


 As for the next steps, I would suggest that you have a look at the works
 that others have done already. Try Google Scholar:

 https://scholar.google.com/scholar?q=machine+learning+wikipedia

 As you can see, there are countless works on using machine learning
 techniques on Wikipedia, both for information extraction (e.g.,
 understanding link semantics) and for things like vandalism detection. I am
 sure that one could get a lot of inspiration from there, both on potential
 applications and on technical hints on how to improve result quality.

 Yes, definitely I would use them, thanks.


 You will find that people are using many different approaches in these
 works. The good old ANN is still a relevant algorithm in practice, but
 there are many other techniques, such as SVMs, Markov models, or random
 forests, which have been found to work better than ANNs in many cases. Not
 saying that a three-layer feed-forward ANN cannot do some jobs as well, but
 I would not restrict to one ML approach if you have a whole arsenal of
 algorithms available, most of them pre-implemented in libraries (the first
 Google hit has a lot of relevant projects listed:
 http://daoudclarke.github.io/machine%20learning%20in%
 20practice/2013/10/08/machine-learning-libraries/). I would certainly
 recommend that you don't implement any of the standard ML algorithms from
 scratch.

 I use the backpropagation algorithm, and I use Octave for my personal ML
 work, but for Wikipedia I use Python (for two main reasons: integration
 with other wikipedia-related tools like pywikibot, and the poor
 performance of Octave and Matlab on big sets of data), and I had to write
 those parts from scratch since I couldn't find any related library in
 Python. Even algorithms like BFGS are not there (I could find one in
 scipy, but I wasn't sure it works correctly and there is no documentation)

 In practice, the most challenging task for successful ML is often feature
 engineering: the question which features you use as an input to your
 learning algorithm. This is far more important that the choice of
 algorithm. Wikipedia in particular offers you so many relevant pieces of
 information with each article that are not just mere keywords (links,
 categories, in-links, ...)  and it is not easy to decide which of these to
 feed into your learner. This will be different for each task you solve
 (subject classification is fundamentally different from vandalism
 detection, and even different types of vandalism would require very
 different techniques). You should pick hard or very large tasks to make
 sure that the tweaking you need in each case takes less time than you would
 need as a human to solve the task manually ;-)

 Yes, feature engineering is the most important thing and it can be
 tricky, but feature engineering in Wikidata is a lot easier (it's easier
 than Wikipedia, and Wikipedia itself is easier than other places).
 Anti-vandalism bots are a lot easier in Wikidata than in Wikipedia.
 Edits in Wikidata are limited to certain kinds (like removing a
 sitelink, etc.), but that's not the case in Wikipedia.


 Anyway, it's an interesting field, and we could certainly use some effort
 to exploit the countless works in this field for Wikidata. But you should
 be aware that this is no small challenge and that there is no universal
 solution that will work well even for all the tasks that you have mentioned
 in your email.

 Of course, I 

Re: [Wikidata-l] External identifiers vs. Wikidata-internal links data

2015-04-03 Thread Emw

 Yes. I could see a simple Statements vs. External identifiers
 distinction being useful that's also reflected in the data model so
 it's easier to treat these property groups in a distinct manner.


I support grouping statements about external identifiers together and
distinguishing them from other statements, but I would voice caution about
presenting that distinction as "Statements vs. External identifiers".

I agree with Denny that qualifiers and references should be retained for
external identifiers.  I would further suggest that external identifiers
remain structured as properties that can (along with their values in
claims) be created, updated and deleted by the community.

Given that, I think the distinction should be styled less as "Statements
vs. External identifiers" and more as "External identifiers as a kind of
statement".  UI editing controls and data modeling as statements would
remain, but External identifiers (e.g. *VIAF identifier* 113230702) would
be moved to the bottom or side of statements of subject knowledge (e.g. *cause
of death* heart attack).

Grouping together and separating external identifiers from other kinds of
statements in the UI, and reflecting that in the data model and API, sounds
like a great idea.  https://www.wikidata.org/wiki/Q42 is a rat's nest of
meaningless (but technically useful) statements about external identifiers
and meaningful statements about the subject.  It's important to fix that,
and I imagine we could do so while retaining all the current UI controls
and data model attributes of statements in statements about external
identifiers.

Best,
Eric

https://www.wikidata.org/wiki/User:Emw
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] OWL based ontologies as basis for Wikidata item interactions and property proposal

2015-04-04 Thread Emw
Sebastian, Benjamin, Elvira, Andra, Andrew,

Kudos on your progress with an OWL-centric approach to knowledge
representation.  The community has been incorporating OWL concepts into
property definitions and ontology development on-wiki for some time, but
yours is the first Wikidata group I'm aware of that has incorporated
Protege into the process.

We think that using ontologies brings several advantages


The examples you cite seem like good ideas and I support them.

I would also suggest considering how the Wikidata ontologies we develop fit
into established ontologies in the Semantic Web.  For example, the OBO
Foundry (http://www.obofoundry.org/) is by far the world's most widely used
group of biomedical ontologies [1, 2].  Those ontologies are rooted in the
Basic Formal Ontology (BFO).  OWL helps a great deal in being interoperable
with those works, but a further ontological commitment tends to be needed
for easy compatibility.

Is your gene-disease interaction ontology compatible with BFO, and the OBO
ontologies rooted in it?

Cheers,
Eric

https://www.wikidata.org/wiki/User:Emw

1.  http://www.nature.com/nbt/journal/v25/n11/full/nbt1346.html
2.  https://scholar.google.com/scholar?cites=13806088078865650870
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l