Re: [Wikidata-l] Question about wikipedia categories.
There's a related essay on Wikimedia Commons: http://commons.wikimedia.org/wiki/User:Multichill/Next_generation_categories . The Wikidata properties instance of (https://www.wikidata.org/wiki/Property_talk:P31, formerly "is a") and subclass of (https://www.wikidata.org/wiki/Property_talk:P279) are likely relevant to folks interested in ontology building on Wikidata. They're based on rdf:type (http://www.w3.org/TR/rdf-schema/#ch_type) and rdfs:subClassOf (http://www.w3.org/TR/rdf-schema/#ch_subclassof) from the W3C recommendations, and allow for building a rooted DAG that places concepts into a hierarchy of knowledge. They also allow for a degree of type-token distinction (http://en.wikipedia.org/wiki/Type%E2%80%93token_distinction) when classifying subjects, though how that applies to certain knowledge domains hasn't been fully sussed out.

On Sun, May 5, 2013 at 2:17 PM, Chris Maloney voldr...@gmail.com wrote: Doug from WikiSource started a page over at meta: http://meta.wikimedia.org/wiki/Beyond_categories I'll be trying to fill in some of my understanding of the problem and the scope of a possible solution. I recognize there's been a lot of prior art on this issue, and a lot of existing overlapping tools and infrastructure, and I'm pretty new around here, and apt to be inaccurate and naive. So I do hope others with more experience will come and help sort it out. Chris

On Sun, May 5, 2013 at 11:06 AM, Michael Hale hale.michael...@live.com wrote: As far as checking the import progress of Wikidata, the category American women writers has 1479 articles. 651 of them currently have a main type (GND), 328 have a sex, 162 have an occupation, 111 have a country of citizenship, 49 have a sexual orientation, 39 have a place of birth, etc.

From: j...@sahnwaldt.de Date: Sun, 5 May 2013 16:28:14 +0200 To: wikidata-l@lists.wikimedia.org Subject: Re: [Wikidata-l] Question about wikipedia categories.

Hi Pat, I've been involved with DBpedia for several years, so these are interesting thoughts.

On 5 May 2013 01:25, Patrick Cassidy p...@micra.com wrote: If one is interested in a functional “category” system, it would be very helpful to have a good logic-based ontology as the backbone. I haven’t looked recently, but when I inquired about the ontology used by DBpedia a year ago, I was referred to “dbpedia-ontology.owl”, an ontology in the format of the “semantic web” ontology format OWL. The OWL format is excellent for simple purposes, but the dbpedia-ontology.owl (at that time) was not well-structured (being very polite).

Do you mean just the file dbpedia-ontology.owl or the DBpedia ontology in general? We still use OWL as our main format for publishing the ontology. The file is generated automatically. Maybe the generation process could be improved.

I did inquire as to who was maintaining the ontology, and had a hard time figuring out how to help bring it up to professional standards. But it was like punching jello, nothing to grasp onto. I gave up, having other useful things to do with my time.

The ontology is maintained by a community that everyone can join at http://mappings.dbpedia.org/ . An overview of the current class hierarchy is here: http://mappings.dbpedia.org/server/ontology/classes/ . You're more than welcome to help! I think talk pages are not used enough on the mappings wiki, so if you have ideas, misgivings or questions about the DBpedia ontology, the place to go is probably the mailing list: https://lists.sourceforge.net/lists/listinfo/dbpedia-discussion Thanks!
Christopher

Perhaps it is time now, with more experience in hand, to rethink the category system starting with basics. This is not as hard as it sounds. It may require some changes where there is ambiguity or logical inconsistency, but mostly it is only necessary to link the Wikipedia categories to an ontology based on a well-structured and logically sound foundation ontology (also referred to as an “upper ontology”) that supplies the basic categories and relations. Such an ontology can provide the basic concepts, whose labels can be translated into any terminology that any local user wants to use. There are several well-structured foundation ontologies, based on over twenty years of research, but the one I suggest is the one I am most familiar with (which I created over the past seven years), called COSMO. The files at http://micra.com/COSMO provide the ontology itself (“COSMO.owl”, in OWL) and papers describing the basic principles. COSMO is structured to be a “primitives-based foundation ontology”, containing all of the “semantic primitives” needed to describe anything one wants to talk about. All other categories are structured as logical combinations of the basic elements. Its inventory of primitives is probably
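As a rough illustration of the hierarchy-of-knowledge point at the top of this thread: once 'instance of' and 'subclass of' claims are exported as rdf:type and rdfs:subClassOf triples, queries can walk the resulting rooted DAG. A minimal SPARQL sketch, assuming a triple store loaded with such an export; the wd:/wdt: prefix layout and the example class ('keyboard instrument') are illustrative assumptions, not the actual export format:

    # Everything classified, directly or via any chain of subclasses,
    # under an assumed root class ('keyboard instrument').
    PREFIX wd:  <http://www.wikidata.org/entity/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>

    SELECT ?item WHERE {
      # instance of some class that is (transitively) a subclass of the root
      ?item wdt:P31/wdt:P279* wd:Q52954 .
    }

Because the path operator follows subclass links to any depth, instances of a deeper class such as 'piano' would be returned when querying the broader class.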
Re: [Wikidata-l] Fwd: Re: [Wikitech-l] Why isn't hotcat an extension?
The relationship between Wikipedia categories and Wikidata pops up here and there in discussions -- a recent one was https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2013/06#Proposal_for_phase_4:_unify_and_centralize_categories . I think Wikidata properties (https://www.wikidata.org/wiki/Wikidata:Glossary#Property) and queries (https://www.wikidata.org/wiki/Wikidata:Glossary#Query) will likely go a long way toward making Wikipedia categories obsolete. Categories are essentially queries on a set of pre-defined properties (see the sketch at the end of this message). The manual maintenance that has been required to curate Wikipedia's category system seems like it could be largely eliminated (or, at least, centralized and streamlined) once Wikidata queries are deployed. Wikidata properties like those covered in https://www.wikidata.org/wiki/Help:Basic_membership_properties also allow subjects to be arranged into a taxonomy of concepts, which is one of the main features of categories. I'm not aware of any concrete plans to replace the category system with a solution from Wikidata, but I think it would make more sense to explore that option than to work on importing Wikipedia categories en masse into Wikidata.

On Thu, Jul 18, 2013 at 6:18 PM, rupert THURNER rupert.thur...@gmail.com wrote: Let's forward this to here, maybe somebody here already thought about categories in wikidata.

-- Forwarded message -- From: Tyler Romeo tylerro...@gmail.com Date: 18.07.2013 23:08 Subject: Re: [Wikitech-l] Why isn't hotcat an extension? To: Wikimedia developers wikitec...@lists.wikimedia.org

On Thu, Jul 18, 2013 at 4:04 PM, Antoine Musso hashar+...@free.fr wrote: Let's move the categories into wikidata? =)

That'd be nice, but how much time would that take to develop?

-- Tyler Romeo, Stevens Institute of Technology, Class of 2016, Major in Computer Science www.whizkidztech.com | tylerro...@gmail.com
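To make the "categories are essentially queries" point above concrete, here is a rough SPARQL sketch of the category "American women writers" expressed as a query over properties. The prefix layout and the specific property and item IDs are assumptions for illustration only:

    # A category like "American women writers" rephrased as a query over
    # pre-defined properties (IDs assumed for illustration).
    PREFIX wd:  <http://www.wikidata.org/entity/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>

    SELECT ?person WHERE {
      ?person wdt:P31  wd:Q5 ;        # instance of: human
              wdt:P21  wd:Q6581072 ;  # sex or gender: female
              wdt:P106 wd:Q36180 ;    # occupation: writer
              wdt:P27  wd:Q30 .       # country of citizenship: United States
    }

A query of this shape needs no manual curation: any item with the right property values is automatically a member of the "category".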
Re: [Wikidata-l] Application: sexing people by name/research gender bias
Max's comment is quite relevant to Wikidata. The sex property [1] is a model system for exploring important questions for the project at large. For example, how rigorous do we want to be with automatic classification? Let's say a property can have one of three values: A, B or C. Roughly 90% of the valid subjects for that property are known to be either A or B, and 10% are known to be C. Our automatic classifier can assign all valid subjects to either A or B. However, it can't distinguish A or B from C. So our false positive rate is at least 10%. Would it be acceptable for Wikidata to have a known error rate of 10% in certain properties? At what error rate does automatic classification become unacceptable?

Another question this topic broaches: do we want to adopt formal domain and range constraints on properties? If we do, then how do we handle rare values? How about exceedingly rare values? (It should be noted that the Wikidata sex property includes intersex in its range constraints [2].) There is ongoing discussion about whether we want to adopt range and domain constraints (among other property metadata) in Wikidata's Project chat [3].

Eric https://www.wikidata.org/wiki/User:Emw

1. https://www.wikidata.org/wiki/Property:P21
2. https://www.wikidata.org/wiki/Property_talk:P21
3. https://www.wikidata.org/wiki/Wikidata:Project_chat#What_type_of_data_should_be_stored (permalink: https://www.wikidata.org/w/index.php?title=Wikidata:Project_chat&oldid=78406798#What_type_of_data_should_be_stored )

On Tue, Oct 15, 2013 at 2:33 PM, Tom Morris tfmor...@gmail.com wrote: So you've got an agenda that's unrelated to Wikidata or analysis thereof. Got it. Perhaps a non-Wikidata list would be a more appropriate forum.

On Tue, Oct 15, 2013 at 2:08 PM, Klein, Max kle...@oclc.org wrote: Sorry to rant.

Accepted. Tom
Re: [Wikidata-l] ontology Wikidata API, managing ontology structure and evolutions
"What about monthly/dump-based aggregated property usage statistics?"

Property usage statistics would be very valuable, Dimitris. They would help inform community decisions about how to steer changes in property usage with less disruption, and would have other significant benefits as well. Getting daily counts like https://www.wikidata.org/wiki/Wikidata:Database_reports/Popular_properties back up and running would be a good place to start. That report hasn't been updated since October 2013. We could go further by showing counts for all properties, not just the top 100. More detailed data would be great, too. Wikidata editors recently posted a list of the most popular objects for 'instance of' (P31) claims at https://www.wikidata.org/w/index.php?title=Property_talk:P31&oldid=99405143#Value_statistics . Having daily data like that for all properties would be quite useful. If anyone does end up doing something like this, I would recommend archiving the data at http://dumps.wikimedia.org/other/ in addition to posting it in a regularly updated report on Wikidata.

Cheers, Eric https://www.wikidata.org/wiki/User:Emw

On Thu, Jan 9, 2014 at 12:59 PM, Dimitris Kontokostas kontokos...@informatik.uni-leipzig.de wrote: What about monthly/dump-based aggregated property usage statistics? People would be able to check property trends or maybe subscribe to specific properties via RSS.

On Thu, Jan 9, 2014 at 3:55 PM, Daniel Kinzler daniel.kinz...@wikimedia.de wrote: On 08.01.2014 16:20, Thomas Douillard wrote: Hi, a problem seems (not very surprisingly) to be emerging on Wikidata: managing the evolution of how we do things on Wikidata. Properties are deleted, which leaves some consumers of the data a little frustrated that they are not informed of that and could not take part in the discussion.

They are informed if they follow the relevant channels. There's no way to inform them if they don't. These channels can very likely be improved, yes. That being said: a property that is still widely used should very rarely be deleted, if at all. Usually, properties would be phased out by replacing them with another property, and only then do they get deleted. Of course, 3rd parties that rely on specific properties would still face the problem that the property they use is simply no longer used (that's the actual problem -- whether it is deleted doesn't really matter, I think). So, the question is really: how should 3rd party users be notified of changes in policy and best practice regarding the usage and meaning of properties? That's an interesting question, one that doesn't have a technical solution I can see. -- daniel

-- Daniel Kinzler Senior Software Developer Wikimedia Deutschland Gesellschaft zur Förderung Freien Wissens e.V.

-- Dimitris Kontokostas Department of Computer Science, University of Leipzig Research Group: http://aksw.org Homepage: http://aksw.org/DimitrisKontokostas
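A sketch of what the value-statistics style report above could look like as a query over an RDF export of the data (the wdt: prefix for direct claims is an assumption); the same shape works for any property, not just 'instance of':

    # Count how often each class is used as the object of an
    # 'instance of' (P31) claim, most popular values first.
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>

    SELECT ?class (COUNT(?item) AS ?instances) WHERE {
      ?item wdt:P31 ?class .
    }
    GROUP BY ?class
    ORDER BY DESC(?instances)
    LIMIT 100

Run against periodic dumps, a query like this would give exactly the kind of trend data Dimitris describes.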
Re: [Wikidata-l] difference between Person and Human classes
"what's the usage of the Person class?"

'Person' [1] should generally be avoided. For background, see the discussion about how to classify subjects like Coco Chanel in [2]. That's the basis for the note "not for use with P31, instead use Q5 human". Pretty much all the items that link to 'person' [3] shouldn't. For fictional characters, the convention is to classify them as 'fictional character', e.g. as done for Jack Bauer [4].

There are some tricky knowledge representation issues with fictional entities. For example, how do we ensure that Harry Potter is not returned in a query for all people born in London in 1984 (see the sketch at the end of this message)? The fictional universes project [5] aims to address problems like that.

1) https://www.wikidata.org/wiki/Q215627
2) https://www.wikidata.org/wiki/Wikidata:Requests_for_comment/Migrating_away_from_GND_main_type#P31_value_for_things_like_Coco_Chanel
3) https://www.wikidata.org/wiki/Special:WhatLinksHere/Q215627
4) https://www.wikidata.org/wiki/Q24
5) https://www.wikidata.org/wiki/Wikidata:Wikiproject_Fictional_universes

On Sat, Feb 15, 2014 at 4:31 AM, Jane Darnell jane...@gmail.com wrote: I would imagine it's for fictional characters like Little Red Riding Hood, but I see that when I click on What links here while on page Q215627 I see Sleeping Beauty, but also roles and dead people. I am just as lost as you are!

2014-02-15 2:42 GMT+01:00, Hady elsahar hadyelsa...@gmail.com: Hi all, just got confused a little bit between Person Q215627 and Human Q5 classes. On the Person page https://www.wikidata.org/wiki/Q215627 it's written "not for use with P31, instead use Q5 human"; if so, what's the usage of the Person class? thanks

Regards - Hady El-Sahar Research Assistant Center of Informatics Sciences | Nile University http://nileuniversity.edu.eg/
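For the Harry Potter example above, the query in question might look roughly like the following SPARQL sketch (the prefix layout and the place-of-birth/date-of-birth modeling are assumptions). If fictional characters were typed as 'instance of: human', they would show up in the results, which is why the convention is to type them as 'fictional character' instead:

    # All people born in London in 1984 (illustrative sketch).
    PREFIX wd:  <http://www.wikidata.org/entity/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>

    SELECT ?person WHERE {
      ?person wdt:P31  wd:Q5 ;    # instance of: human
              wdt:P19  wd:Q84 ;   # place of birth: London
              wdt:P569 ?born .    # date of birth
      FILTER (YEAR(?born) = 1984)
    }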
Re: [Wikidata-l] Subclass of/instance of
Hi Markus,

You asked "who is creating all these [subclass of] statements and how is this done?" The class hierarchy in http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q35120&rp=279&lang=en shows a few relatively large subclass trees for specialist domains, including molecular biology and mineralogy.

The several thousand subclass claims for 'gene' and 'protein' were created by members of WikiProject Molecular biology (WD:MB), based on discussions in [1] and [2]. The decision to use P279 instead of P31 there was based on the fact that the is-a relation in Gene Ontology maps to rdfs:subClassOf, which P279 is based on. The claims were added by a bot [3], with input from WD:MB members. The data ultimately comes from external biological databases. A glance at the mineralogy class hierarchy indicates it has been constructed by WikiProject Mineralogy [4] members through non-bot edits.

I imagine most of the other subclass of claims are made manually or semi-automatically outside specific Wikiproject efforts. In other words, I think most of the other P279 claims are added by Wikidata users going into the UI and building usually-reasonable concept hierarchies in domains they're interested in. I've worked on constructing class hierarchies for health problems (e.g. diseases and injuries) [5] and medical procedures [6] based on classifications like ICD-10 [7] and assertions and templates on Wikipedia (e.g. [8]).

It's not incredibly surprising to me that Wikidata has about 36,000 subclass of (P279) claims [9]. The property has been around for over a year and is a regular topic of discussion [10], along with instance of (P31), which has over 6,600,000 claims.

You noted a dubious subclass of claim for 'House of Staufen' (Q130875). I agree that instance of would probably be the better membership property to use there. Such questionable usage of P279 is probably uncommon, but definitely not singular. The dynasty class hierarchy shows 13 dubious cases at the moment [11]. I would guess less than 5% of subclass of claims have that kind of issue, where instance of would make more sense. I think there are probably vastly more cases of the converse: instance of being used where subclass of would make more sense.

As you probably know, P31 and P279 are intended to have the semantics of rdf:type and rdfs:subClassOf per community decision. A while ago I read a bit about the ELK reasoner you were involved with [12], which makes use of the seemingly class-centric OWL EL profile. Do you have any plans to integrate features of ELK with the Wikidata Toolkit [13]? How do you see reasoning engines using P31 and P279 in the future, if at all?

Thanks, Eric https://www.wikidata.org/wiki/User:Emw

[1] https://www.wikidata.org/wiki/WT:MB#Distinguishing_between_genes_and_proteins
[2] https://www.wikidata.org/wiki/WT:MB#Human.2Fmouse.2F..._ID
[3] https://www.wikidata.org/wiki/User:ProteinBoxBot. Chinmay Nalk (https://www.wikidata.org/wiki/User:Chinmay26) did all the work on this, with input from WD:MB.
[4] https://www.wikidata.org/wiki/Wikidata:WikiProject_Mineralogy
[5] http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q15281399&rp=279&lang=en
[6] http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q796194&rp=279&lang=en
[7] http://apps.who.int/classifications/icd10/browse/2010/en
[8] https://en.wikipedia.org/wiki/Template:Surgeries
[9] https://www.wikidata.org/w/index.php?title=Wikidata:Database_reports/Popular_properties&oldid=125595374
[10] Examples include
- https://www.wikidata.org/wiki/Wikidata:Project_chat#chemical_element
- https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2013/12#Top_of_the_subclass_tree
- https://www.wikidata.org/wiki/Wikidata:Project_chat/Archive/2014/01#Question_about_classes.2C_and_.27instance_of.27_vs_.27subclass.27
[11] http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q164950&rp=279&lang=en
[12] http://korrekt.org/page/The_Incredible_ELK
[13] https://www.mediawiki.org/wiki/Wikidata_Toolkit

On Mon, May 5, 2014 at 12:46 PM, Markus Kroetzsch markus.kroetz...@tu-dresden.de wrote: Hi, I got interested in subclass of (P279) and instance of (P31) statements recently. I was surprised by two things: (1) There are quite a lot of subclass of statements: tens of thousands. (2) Many of them make a lot of sense, and (in particular) are not (obvious) copies of Wikipedia categories. My big question is: who is creating all these statements and how is this done? It seems too much data to be created manually, but I don't see obvious automated approaches either (and there are usually no references given). I also found some rare issues. A subclass of B should be read as "Every A is also a B". For example, we have "Every piano (Q5994) is also a keyboard instrument (Q52954)". Overall, the great majority of cases I looked at had remarkably sane modelling (which reinforces my big question). But there are still cases where subclass of is mixed up with instance of. For example,
Re: [Wikidata-l] Wikidata query feature: status and plans
In case anyone is a bit lost, Tom is proposing an approach to classification we've been calling explicit metamodeling. Simply put, let's say you have a class hierarchy:

A subclass of B
B subclass of C
C subclass of D

The proposal, as I understand it, is to add instance of claims for almost all classes in Wikidata, which would yield classifications like:

A subclass of B
A instance of 'type of B'
B subclass of C
B instance of 'type of C'
C subclass of D
C instance of 'type of D'

The rationale for this is to enable querying the direct (immediate) subclasses of any given class. This approach might be theoretically valid in all classification, but I don't think it's a sensible solution for most classification. As you can see, explicit metamodeling introduces claims that are rather redundant. Tom's idea seems to be to use this approach for almost all classification on Wikidata. I am not enthusiastic about pervasively using that approach to classification throughout Wikidata.

There are other ways to get direct subclasses, several of which are described in http://answers.semanticweb.com/questions/14699/get-immediate-subclasses-of-a-class . For example, you could turn off entailments / inferencing in a query engine you're using. You could do a typical subclass query and filter out non-direct subclasses (see the sketch at the end of this message). Those querying approaches seem much simpler and more conventional than saturating a concept hierarchy with redundant instance of 'type of Foo' statements to enable querying direct subclasses.

The extended discussions Tom refers to can be found at https://www.wikidata.org/wiki/Wikidata_talk:Country_subdivision_task_force#layers . In addition to introducing substantial redundancy, you might get the feeling, as I did when reading through that discussion, that widespread use of explicit metamodeling would be quite confusing for users. What are others' thoughts?

Thanks, Eric https://www.wikidata.org/wiki/User:Emw

On Wed, Jun 11, 2014 at 6:56 AM, Thomas Douillard thomas.douill...@gmail.com wrote: I'm still talking of the model I proposed in my first post in this thread. I did give an advantage: you can really simply query the types of units used by a country to class its administrative units, like Region, Departement (as two items) for France, in one request of the future simple query module: just retrieve the instances of the class "French type of administrative units". This model also applies to any country's administrative territorial division. I think "French type of administrative units" is the auxiliary item Markus mentioned. We can define precisely what it is because this class "French type of administrative units" groups the types used by France to class cities, departments ... so clearly if we talk of Paris, there are several possibilities, like the region of Paris, the City of Paris ... The City of Paris item is clearly the only one that is clearly defined by French law, hence it is an instance of the class "French ville", which in turn is an instance of "French type of administrative units". This seems to me a useful model, which can generalise easily to class things like urban units, which are used for statistical purposes and are defined by a national statistical organism in each country, such as INSEE in France. Then we could also have an "Urban unit" class in Wikidata, but this is ambiguous. This class could have a subclass "Urban unit as defined by INSEE", with instances such as "Parisian urban unit", for which it gives statistical information.
"Urban unit as defined by INSEE" in turn may be a subclass of "any geographical unit defined by INSEE". Then if you want INSEE geographical units, you query all instances of both "Urban unit" and "any geographical unit defined by INSEE". But now let's say you want to find the definition of urban unit by INSEE itself, not the instances. One way to do that would be to look at the subclass tree of "any geographical unit defined by INSEE" or the one of "Urban unit", or compute the intersection of both trees. One alternative, using metamodelling this time, would be to have a class regrouping all definitions of statistical units used by INSEE. Those definitions correspond to some class we already have; for example, "Urban unit as defined by INSEE" corresponds to the definition of urban unit by INSEE. Then the class of all definitions used by INSEE would correspond to the class of all classes with a name "... unit defined by INSEE". I propose to create the item "any type of unit defined by INSEE", with "Urban unit as defined by INSEE" an instance of it (actually I might already have done it /o\). This is a (non mutually exclusive, more complementary) alternative to just classing administrative unit instances. In a way, this is just identifying (reifying) a classification system and putting "instance of this item" statements on its classes. I think this is interesting in Wikidata as we actually are using a lot of
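To sketch the "filter out non-direct subclasses" querying approach mentioned earlier in this message (the alternative to explicit metamodeling): take the asserted subclasses of a class and drop any that also reach it through an intermediate class. A rough SPARQL sketch, with an assumed prefix layout and 'country' (Q6256) as an arbitrary illustrative root:

    # Direct (immediate) subclasses of a root class, without any
    # 'instance of: type of X' metamodeling claims.
    PREFIX wd:  <http://www.wikidata.org/entity/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>

    SELECT ?sub WHERE {
      ?sub wdt:P279 wd:Q6256 .             # asserted subclass of the root
      FILTER NOT EXISTS {                  # ...but not also reachable via
        ?sub wdt:P279 ?mid .               #    some intermediate class
        ?mid wdt:P279+ wd:Q6256 .
        FILTER (?mid != ?sub)
      }
    }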
Re: [Wikidata-l] Wikidata RDF exports
Markus,

Thank you very much for this. Translating Wikidata into the language of the Semantic Web is important. Being able to explore the Wikidata taxonomy [1] by doing SPARQL queries in Protege [2] (even primitive queries) is really neat, e.g.

    SELECT ?subject WHERE {
      ?subject rdfs:subClassOf <http://www.wikidata.org/entity/Q82586> .
    }

This is more of an issue of my ignorance of Protege, but I notice that the above query returns only the direct subclasses of Q82586. The full set of subclasses for Q82586 (lepton) is visible at http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q82586&rp=279&lang=en -- a few of the 2nd-level subclasses (muon neutrino, tau neutrino, electron neutrino) are shown there but not returned by that SPARQL query. It seems rdfs:subClassOf isn't being treated as a transitive property in Protege. Any ideas?

Do you know when the taxonomy data in OWL will have labels available?

Also, regarding the complete dumps, would it be possible to export a smaller subset of the faithful data? The files under Complete Data Dumps in http://tools.wmflabs.org/wikidata-exports/rdf/exports/20140526/ look too big to load into Protege on most personal computers, and would likely require adjusting JVM settings on higher-end computers to load. If it's feasible to somehow prune those files -- and maybe even combine them into one file that could be easily loaded into Protege -- that would be especially nice.

Thanks, Eric https://www.wikidata.org/wiki/User:Emw

1. http://tools.wmflabs.org/wikidata-exports/rdf/exports/20140526/wikidata-taxonomy.nt.gz
2. http://protege.stanford.edu/

On Tue, Jun 10, 2014 at 4:43 AM, Markus Kroetzsch markus.kroetz...@tu-dresden.de wrote: Hi all, We are now offering regular RDF dumps for the content of Wikidata: http://tools.wmflabs.org/wikidata-exports/rdf/ RDF is the Resource Description Framework of the W3C that can be used to exchange data on the Web. The Wikidata RDF exports consist of several files that contain different parts and views of the data, and which can be used independently. Details on the available exports and the RDF encoding used in each can be found in the paper "Introducing Wikidata to the Linked Data Web" [1]. The available RDF exports can be found in the directory http://tools.wmflabs.org/wikidata-exports/rdf/exports/ . New exports are generated regularly from current data dumps of Wikidata and will appear in this directory shortly afterwards. All dump files have been generated using Wikidata Toolkit [2]. There are some important differences in comparison to earlier dumps:

* Data is split into several dump files for convenience. Pick whatever you are most interested in.
* All dumps are generated using the OpenRDF library for Java (better quality than ad hoc serialization; much slower too ;-)
* All dumps are in N3 format, the simplest RDF serialization format that there is
* In addition to the faithful dumps, some simplified dumps are also available (one statement = one triple; no qualifiers and references).
* Links to external data sets are added to the data for Wikidata properties that point to datasets with RDF exports. That's the "Linked" in Linked Open Data.

Suggestions for improvements and contributions on github are welcome.
Cheers, Markus

[1] http://korrekt.org/page/Introducing_Wikidata_to_the_Linked_Data_Web
[2] https://www.mediawiki.org/wiki/Wikidata_Toolkit

-- Markus Kroetzsch Faculty of Computer Science Technische Universität Dresden +49 351 463 38486 http://korrekt.org/
Re: [Wikidata-l] Wikidata RDF exports
Markus,

Thanks for the thorough reply!

"you can use SPARQL 1.1 transitive closure in queries (using * after properties), so you can find all subclasses there too. (You could also try this in Protege ...)"

I had a feeling I was missing something basic. (I'm also new to SPARQL.) Using * after the property got me what I was looking for by default in Protege. That is,

    SELECT ?subject WHERE {
      ?subject rdfs:subClassOf* <http://www.wikidata.org/entity/Q82586> .
    }

-- with an asterisk after rdfs:subClassOf -- got me the transitive closure and returned all subclasses of Q82586 / lepton.

"Should we maybe create an English label file for the classes? Descriptions too or just labels?"

A file with English labels and descriptions for classes would be great and, I think, address this use case. Per your note, I suppose one would simply concatenate that English terms file and wikidata-taxonomy.nt into a new .nt file, then import that into Protege to explore the class hierarchy. (Having every line in the ontology be self-contained in N3 is very convenient!)

Regarding the pruned subset, I think the command-line approach in your examples is enough for me to get started making my own. I won't have time to experiment with these things for a few weeks, but I will return to this then and let you know any interesting findings.

Cheers, Eric

On Sat, Jun 14, 2014 at 4:41 AM, Markus Krötzsch mar...@semantic-mediawiki.org wrote: Eric, Two general remarks first:

(1) Protege is for small and medium ontologies, but not really for such large datasets. To get SPARQL support for the whole data, you could install Virtuoso. It also comes with a simple Web query UI. Virtuoso does not do much reasoning, but you can use SPARQL 1.1 transitive closure in queries (using * after properties), so you can find all subclasses there too. (You could also try this in Protege ...)

(2) If you want to explore the class hierarchy, you can also try our new class browser: http://tools.wmflabs.org/wikidata-exports/miga/?classes It has the whole class hierarchy, but without the leaves (= instances of classes + subclasses that have no own subclasses/instances). For example, it tells you that lepton has 5 direct subclasses, but shows only one: http://tools.wmflabs.org/wikidata-exports/miga/?classes#_item=3338 On the other hand, it includes relationships of classes and properties that are not part of the RDF (we extract this from the data by considering co-occurrence). Example: Classes that have no superclasses but at least 10 instances, and which are often used with the property 'sex or gender': http://tools.wmflabs.org/wikidata-exports/miga/?classes#_cat=Classes/Direct%20superclasses=__null/Number%20of%20direct%20instances=10%20-%202/Related%20properties=sex%20or%20gender I already added superclasses for some of those in Wikidata now -- data in the browser is updated with some delay based on dump files.

More answers below:

On 14/06/14 05:52, emw wrote: Markus, Thank you very much for this. Translating Wikidata into the language of the Semantic Web is important. Being able to explore the Wikidata taxonomy [1] by doing SPARQL queries in Protege [2] (even primitive queries) is really neat, e.g. SELECT ?subject WHERE { ?subject rdfs:subClassOf <http://www.wikidata.org/entity/Q82586> . } This is more of an issue of my ignorance of Protege, but I notice that the above query returns only the direct subclasses of Q82586.
The full set of subclasses for Q82586 (lepton) is visible at http://tools.wmflabs.org/wikidata-todo/tree.html?q=Q82586&rp=279&lang=en -- a few of the 2nd-level subclasses (muon neutrino, tau neutrino, electron neutrino) are shown there but not returned by that SPARQL query. It seems rdfs:subClassOf isn't being treated as a transitive property in Protege. Any ideas?

You need a reasoner to compute this properly. For a plain class hierarchy as in our case, ELK should be a good choice [1]. You can install the ELK Protege plugin and use it to classify the ontology [2]. Protege will then show the computed class hierarchy in the browser; I am not sure what happens to the SPARQL queries (it's quite possible that they don't use the reasoner).

[1] https://code.google.com/p/elk-reasoner/
[2] https://code.google.com/p/elk-reasoner/wiki/ElkProtege

Do you know when the taxonomy data in OWL will have labels available?

We had not thought of this as a use case. A challenge is that the label data is quite big because of the many languages. Should we maybe create an English label file for the classes? Descriptions too or just labels?

Also, regarding the complete dumps, would it be possible to export a smaller subset of the faithful data? The files under Complete Data Dumps in http://tools.wmflabs.org/wikidata-exports/rdf/exports/20140526/ look too big to load into Protege on most personal computers, and would likely require adjusting JVM settings
Re: [Wikidata-l] Making a Wikipedia article link to two wikidata items
"For articles that are really about multiple different things that cannot be reconciled in a single natural concept:
* State instance of: Wikipedia article with multiple topics (we already have several other classes of Wikipedia articles).
* Use some property, say has topic, to link to items about the individual topics.
* Optionally: use a property like subject of (P805) to link back from the individual items to the multi-topic pages."

Can we make do without annotation statements like instance of: Wikipedia page with multiple topics? In my opinion, such statements would unnecessarily clutter a significant portion of our items and would be better inferred by the presence of *subject of* (P805) claims. I think it's better to reserve *instance of* for talk about the essence of the subject itself. The closest inverse property for *subject of* is probably *facet of* (P1269).

"Example: https://en.wikipedia.org/wiki/Samoan_Clipper"

See https://www.wikidata.org/wiki/Q7409943 for an initial pass at modelling that. Note how that Wikipedia page says "The aircraft developed an engine problem (caused by an oil leak), which ultimately caused the in-flight explosion." We currently have no generic way to model causes. Coincidentally enough, I just posted a detailed/long-winded proposal to address that. Please see https://www.wikidata.org/wiki/Property_talk:P828#A_better_way_to_model_causation and give any feedback there!

Cheers, Eric

On Tue, Sep 9, 2014 at 7:36 AM, Markus Krötzsch mar...@semantic-mediawiki.org wrote: My proposal became more clear to me over lunch:

For articles that are really about multiple different things that cannot be reconciled in a single natural concept:
* State instance of: Wikipedia article with multiple topics (we already have several other classes of Wikipedia articles).
* Use some property, say has topic, to link to items about the individual topics.
* Optionally: use a property like subject of (P805) to link back from the individual items to the multi-topic pages.

The main proposal here is to treat these things like Wikipedia disambiguation pages: we have items, but the items are mainly about the page, not about any real-world concept we care about.

Example: https://en.wikipedia.org/wiki/Samoan_Clipper It says "Samoan Clipper was one of ten Pan American Airways Sikorsky S-42 flying boats" but it includes an infobox that lists fatalities. So the article describes both a specific airplane (the flying boat) and an event (crash of that plane). We should not try to invent a new concept of machine-event system to capture this, but have two items for the two things we have here.

We will have many cases where this is not necessary if we can find a natural composite concept that it makes sense to talk about. In these cases, we will use different properties for the links (for example, a country article may sometimes be used to describe all the federal states of that country, yet we have a good way of linking individual state items to the country). As usual, there will be corner cases where it is not clear what to do; then we need specific discussions on these cases.

Cheers, Markus

On 09.09.2014 11:57, Markus Krötzsch wrote: On 09.09.2014 11:33, Daniel Kinzler wrote: On 09.09.2014 01:40, Denny Vrandečić wrote: Create a third item in Wikidata, and use that for the language links. Any Wikipedia that has two separate articles can link to the separate items, any Wikipedia that has only one article can link to the single item.
That's a nice solution for the language link problem, but modelling the relationship of these three items on wikidata is kind of annoying/tricky. How would you do that?

Before the "how?" should come the "why?". The modelling should be chosen so that it best suits a given purpose (the purpose is the benchmark for deciding if a particular modelling approach is good or not). I guess the main thing we want to achieve here is to link the combined item to and from the single items. If this is true, then the "how?" question is basically a "which property to use?" question.

For this we should look more closely at the nature of the combined item. Let's distinguish combined items that are natural and meaningful concepts from those that are just different topics combined for editorial reasons in one article. The first kind of item involves things like bands (which have members, possibly with individual articles, but which are still meaningful concepts by themselves). The second kind of item involves the Wangerooge hybrid, but also many other things (e.g., plane crashes and the planes themselves; or people and events the people were involved in). The problem with this second type of complex item is that it does not give you a good basis for adding data (you can't say properly which aspects of the thing you are talking about). It is also problematic since these things are not natural concepts that can be
[Wikidata-l] Why? Modeling causes on Wikidata
Hi all,

Talk about causes is ubiquitous in everyday life and many other domains of knowledge. Until recently, we've had a few properties to make statements about cause in certain narrow areas, but lacked a way to structure data about causes across a broad range of subjects. For example, you might want to know:

- What caused World War II?
- What causes evolution?
- What causes malaria?
- What causes bread to rise?
- What causes rust?
- What causes gravity?
- What causes rainbows?

Wikidata now has some new properties that provide structure for basic answers to such questions.

- *has cause* (alias: *has underlying cause*): thing that ultimately resulted in the effect [1]
- *has immediate cause*: thing that proximately resulted in the effect [2]
- *has contributing factor*: thing that significantly influenced the effect, but did not directly cause it [3]

This approach to modeling causation attempts to balance expressiveness with simplicity. It borrows from the idea of causation as a chain of events, which also has background conditions or events that set the stage for some outcome. These properties are not perfect, but they do allow us to capture much richness in how various sources talk about causes -- and to do so in a way that humans can easily understand.

https://www.wikidata.org/wiki/Help:Modeling_causes explains these properties, their background, examples, things to avoid, issues and context. Please comment on the 'Help:Modeling causes' talk page, or here, with any feedback. Hopefully we'll be able to build some cool stuff with this.

Cheers, Eric https://www.wikidata.org/wiki/User:Emw

1. *has cause*. https://www.wikidata.org/wiki/Property:P828
2. *has immediate cause*. https://www.wikidata.org/wiki/Property:P1478
3. *has contributing factor*. https://www.wikidata.org/wiki/Property:P1479
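For anyone wondering how these properties might be used downstream, here is a rough SPARQL sketch against an RDF export (the prefix layout is an assumption, and Q362 is assumed to be the item for World War II). It follows 'has cause' and 'has immediate cause' links through any number of steps:

    # Everything recorded, directly or through a chain of causes,
    # as a cause of a given effect.
    PREFIX wd:  <http://www.wikidata.org/entity/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>

    SELECT DISTINCT ?cause WHERE {
      wd:Q362 (wdt:P828|wdt:P1478)+ ?cause .   # has cause / has immediate cause
    }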
Re: [Wikidata-l] all human genes are now wikidata items
Andra, Chinmay, Ben, Andrew,

Kudos! This is a significant milestone, and showcases Wikidata's potential for structuring large sets of biological data. Thanks for your excellent work!

Cheers, Eric https://www.wikidata.org/wiki/User:Emw

On Mon, Oct 6, 2014 at 4:21 PM, Benjamin Good ben.mcgee.g...@gmail.com wrote: I thought folks might like to know that every human gene (according to the United States National Center for Biotechnology Information) now has a representative entity on wikidata. I hope that these are the seeds for some amazing applications in biology and medicine. Well done Andra and ProteinBoxBot! For example: Here is one (of approximately 40,000) called spinocerebellar ataxia 37 https://www.wikidata.org/wiki/Q18081265 -Ben
Re: [Wikidata-l] Item both subclass and instance?
I have removed the statement *instance of* chemical compound from ethanol (Q153) [1]. A few proposals have been made in this thread about how -- or whether -- to use *instance of* (i.e. rdf:type, P31) to classify 'ethanol' and other chemical compounds, but there seems to be consensus that *instance of* chemical compound is not the way to do it.

Summary of proposals:

1. *Do not use instance of for chemical compounds*. Such statements make Wikidata incompatible with many major scientific ontologies, like ChEBI, Gene Ontology and Disease Ontology, which use *instance of* as defined in the Relation Ontology (RO) [2]. Note that RO defines instances as particular things that have a unique location in space and time, whereas classes are universal, general entities which have particular instances. Instances and classes are thus disjoint, so RO-based ontologies cannot have entities that have both *instance of* (rdf:type, P31) and *subclass of* (rdfs:subClassOf, P279) statements, as is possible in OWL 2 DL via punning.

2. *Use statements like instance of type of chemical compound for chemical compounds*. Doing so makes it easier to generate lists of chemical compounds, and is valid in OWL 2 DL -- it is metamodeling via punning.

Let's build consensus for how (or whether) we want to use *instance of* for chemical compounds before any mass edits to remove or replace the 14969 other *instance of* chemical compound claims [3] or adding statements like *instance of* type of chemical compound to ethanol. Micru has a different proposal for how to model items, which incidentally does not represent ethanol as an instance [4]. However, that proposal is clearly a more radical vision for Wikidata, and probably warrants a separate thread for discussion.

Eric https://www.wikidata.org/wiki/User:Emw

[1] Removal of *instance of* chemical compound from ethanol: https://www.wikidata.org/w/index.php?title=Q153&diff=162563849&oldid=162327014
[2] Barry Smith et al. (2005). *Relations in Biomedical Ontologies*. http://genomebiology.com/2005/6/5/r46
[3] All *instance of* chemical compound claims on Wikidata. http://tools.wmflabs.org/wikidata-todo/autolist.html?q=claim[31:11173]
[4] 'ethanol' is no longer an instance, but a class. https://lists.wikimedia.org/pipermail/wikidata-l/2014-October/004691.html
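For reviewing the remaining claims before any mass edit, the autolist link in [3] corresponds roughly to this SPARQL sketch over an RDF export (prefix layout assumed):

    # All items currently claimed to be an instance of 'chemical compound'.
    PREFIX wd:  <http://www.wikidata.org/entity/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>

    SELECT ?item WHERE {
      ?item wdt:P31 wd:Q11173 .   # instance of: chemical compound
    }

Whatever convention we settle on, a query of this shape makes it easy to audit how consistently it is being applied.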
[Wikidata-l] NYC Wikidata workshop and hackathon this Sunday!
Hi all,

Wikimedia New York City will be hosting a Wikidata hackathon and beginners workshop this coming Sunday. This will be a good event to meet Wikimedians involved with cultural institutions, structure a bunch of data, and help new users. If you're in the area, come!

When: Sunday, December 14, 1:00 - 5:00 PM
Where: 55 Washington Street, Brooklyn, NY 11201, Room 321 (BLIP Outpost)
Details and sign up: https://en.wikipedia.org/wiki/Wikipedia:Meetup/NYC/December_Wikidata

Cheers, Eric https://www.wikidata.org/wiki/User:Emw
[Wikidata-l] How to declare a property is transitive, etc.
Hi all,

Could those knowledgeable about OWL or intending to use Wikidata's RDF / OWL exports please weigh in at https://www.wikidata.org/wiki/Wikidata:Property_proposal/Property_metadata#How_should_we_declare_that_a_property_is_transitive ? [1]

Being able to declare certain properties of properties is an essential building block for querying and inference. However, the way to declare that a property is, say, transitive in OWL does not have a clear analog in Wikidata syntax. We could certainly shoehorn such a statement into our existing model (and it looks like we'll need to), but it is important to do so in a way that complicates things as little as possible for downstream users, e.g. outside researchers or developers using the RDF exports and assuming standard OWL semantics.

Please make any comments on this on-wiki at the location linked above. That way we can keep the discussion centralized. Other discussions on that page could also benefit from input by people knowledgeable about Semantic Web vocabulary.

Thanks, Eric https://www.wikidata.org/wiki/User:Emw

1. Discussion permalink: https://www.wikidata.org/w/index.php?title=Wikidata:Property_proposal/Property_metadata&oldid=182088235#How_should_we_declare_that_a_property_is_transitive
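For reference, in standard OWL the declaration itself is a single triple. Here it is written as a SPARQL Update against an RDF export, using 'part of' (P361) as an example of a property one might want to declare transitive; the property IRI layout is an assumption and depends on how the export names properties:

    # Declare an (assumed) property IRI to be transitive in OWL terms.
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX wd:  <http://www.wikidata.org/entity/>

    INSERT DATA {
      wd:P361 a owl:TransitiveProperty .   # 'part of' is transitive
    }

The open question above is how to represent this same fact as a statement inside Wikidata itself, so that the exports can generate such a triple automatically.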
Re: [Wikidata-l] subclass-of vs. instance-of
Automobile (Q1420) had the claims [1]:

*subclass of* motor road vehicle
*instance of* motor road vehicle

That was incorrect. An instance of motor road vehicle is something like the Peekskill Meteorite Car (Q7756463) [2]. It is generally incorrect when an item has *instance of* and *subclass of* claims with the same value. I am not aware of a Wikidata constraint template which can encode that rule. (Off hand I'm not sure how it would be encoded in OWL, either. Ontology experts: how would we do that?)

If we wanted to use both *instance of* and *subclass of* in automobile, then we would need to do something like:

*subclass of* motor road vehicle
*instance of* motor road vehicle class

In my opinion, *instance of* claims like that are not very useful, because they simply restate what is directly implied in the *subclass of* claim. Punning that is not a mere rephrasing can be useful, e.g. Chevrolet Malibu (Q287723) [3] *subclass of* mid-size car, *instance of* car model. See also Markus's comment from September about using *subclass of* and *instance of* in the same item, which conveniently also discusses automobiles [4].

Happy Q11269!

Eric https://www.wikidata.org/wiki/User:Emw

1. https://www.wikidata.org/w/index.php?title=Q1420&oldid=184512429#P279
2. https://www.wikidata.org/wiki/Q7756463
3. https://www.wikidata.org/wiki/Q287723
4. https://lists.wikimedia.org/pipermail/wikidata-l/2014-September/004649.html
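Lacking a constraint template or an obvious OWL axiom for this, a query-based check is one option. A SPARQL sketch over an RDF export (prefix layout assumed) that lists items carrying 'instance of' and 'subclass of' claims with the same value:

    # Items that are claimed to be both an instance of and a subclass of
    # the same class -- usually a sign that one of the two claims is wrong.
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>

    SELECT ?item ?class WHERE {
      ?item wdt:P31  ?class ;
            wdt:P279 ?class .
    }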
Re: [Wikidata-l] Disambiguating property [was: Freebase like API with an OUTPUT feature]
Since it appears that the creation of *subproperty of* went unnoticed by many, I'd like to describe an important aspect of its proper use, and how that relates to classification.

Please note that *instance of* (P31) and *subclass of* (P279) are not valid values for *subproperty of* (P1647) claims, as described in the P1647 documentation [1]. For example, claims like occupation *subproperty of* instance of are invalid. The reasons for this are both technical and architectural.

On the technical side, *instance of*, *subclass of* and *subproperty of* are intended to be straightforwardly exportable as rdf:type, rdfs:subClassOf and rdfs:subPropertyOf. As described in *On the Properties of Metamodeling in OWL* [2], claims that use OWL's built-in vocabulary (e.g. rdf:type) as individuals make an ontology undecidable. If an ontology is undecidable, then queries are not guaranteed to terminate. This is a big deal. Decidability is a main goal of OWL 2 DL and a requirement in the more specialized profiles OWL 2 EL, OWL 2 RL and OWL 2 QL. Most Semantic Web ontologies aim to be valid in at least OWL 2 DL. So if Wikidata aims to be easily interoperable with the rest of the Semantic Web, we should aim to be valid in OWL 2 DL, and thus not make claims of the form P *subproperty of* instance of (P31) or P *subproperty of* subclass of (P279).

Avoiding such claims is also good design. There should be one -- and preferably only one -- obvious way to specify the type of an instance. Having a multitude of domain-specific type subproperties would promote an anti-pattern: using *instance of* as a catch-all property to make any statement under the sun that makes sense when connected with the phrase "is a". Having a single type property for instances also fosters another best practice in Wikidata: asserted monohierarchy [3]. In other words, there should be only one explicit normal or preferred *instance of* or *subclass of* claim per item. Having an *instance of* claim and a *subclass of* claim on an item isn't necessarily bad (it's called punning), but having multiple *instance of* claims or multiple *subclass of* claims on an item is a bad smell.

Items can typically satisfy a huge number of *instance of* claims, but should generally have only one such claim made explicitly in Wikidata. For example, Coco Chanel (Q45661) can be said to be *instance of* French person, *instance of* fashion designer, *instance of* female, etc. Instead of such catch-all use of *instance of*, Wikidata moves that knowledge into properties like *country of citizenship* (P27), *occupation* (P106) and *sex or gender* (P21). Coco Chanel has one explicit *instance of* value: human (Q5) -- a class that encapsulates essential features of the subject.

Most of Wikidata follows these general principles of classification. But a few domains of knowledge remain either somewhat of a mess, or organized but idiosyncratic. Items like the one for the German municipality of Aalen [4], with 7 *instance of* values -- several of them redundant -- exemplify the mess. With the deletion of domain-specific type properties like *type of administrative territorial entity* (P132) [5], we are on the right track. The solution is not to make such things subproperties of *instance of*, but rather to delete them and use *instance of* for one preferred class and put other values in other properties (note -- this may require new properties!). The same applies for *subclass of*.
I encourage anyone interested in stuff like *subproperty of* to join the discussions ongoing at https://www.wikidata.org/wiki/Wikidata:Property_proposal/Property_metadata . The Wikidata community is currently discussing how we want to handle things like *domain* and *range* properties (e.g. should we use rdfs:domain or schema:domainIncludes?) and whether we want to have an *inverse of* property (or delete all inverse properties). The outcome of these discussions will shape the interface between Wikidata and the rest of the Semantic Web.

Thanks, Eric https://www.wikidata.org/wiki/User:Emw

1. https://www.wikidata.org/wiki/Property:P1647
2. Boris Motik (2007). On the Properties of Metamodeling in OWL. https://www.cs.ox.ac.uk/boris.motik/pubs/motik07metamodeling-journal.pdf
3. Barry Smith, Werner Ceusters (2011). Ontological realism: A methodology for coordinated evolution of scientific ontologies. Section 1.8: Asserted monohierarchies. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3104413/#S9
4. Aalen on Wikidata as of 2015-01-10. https://www.wikidata.org/w/index.php?title=Q3951&oldid=184247296#P31
5. https://www.wikidata.org/wiki/Wikidata:Requests_for_deletions/Archive/2014/Properties/1#type_of_administrative_territorial_entity_.28P132.29
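As a way to spot candidates for the "bad smell" described above (multiple explicit 'instance of' values on one item, as with Aalen), here is a rough SPARQL sketch over an RDF export, prefix layout assumed; not every hit is an error, but the list makes a reasonable review queue:

    # Items with more than one 'instance of' value, largest offenders first.
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>

    SELECT ?item (COUNT(?class) AS ?types) WHERE {
      ?item wdt:P31 ?class .
    }
    GROUP BY ?item
    HAVING (COUNT(?class) > 1)
    ORDER BY DESC(?types)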
Re: [Wikidata-l] Disambiguating property [was: Freebase like API with an OUTPUT feature]
Hi James,

My mistake, I should have linked to https://www.wikidata.org/wiki/Property_talk:P1647, which includes the following in the 'Examples' section of the documentation template:

"Note: it is not valid to declare subproperties of instance of (P31) (rdf:type), subclass of (P279) (rdfs:subClassOf) or any other property mapped to a built-in property of RDF, RDFS or OWL. See creation discussion."

The creation discussion is available at https://www.wikidata.org/wiki/Wikidata:Property_proposal/Archive/27#subproperty_of .

Eric
Re: [Wikidata-l] Kian: The first neural network to serve Wikidata
Amir,

What is the false positive rate of your algorithm when dealing with fictitious humans and (non-fictitious) non-human organisms? That is, how often does your program classify such non-humans as humans?

Regarding the latter, note that items about individual dogs, elephants, chimpanzees and even trees can use properties that are otherwise extremely skewed towards humans. For example, Prometheus (Q590010) [1], an extremely old tree, has claims for *date of birth* (P569), *date of death* (P570), even *killed by* (P157). Non-human animals can also have kinship claims (e.g. *mother*, *brother*, *child*), among other properties typically used on humans.

Best, Eric https://www.wikidata.org/wiki/User:Emw

1. Prometheus. https://www.wikidata.org/wiki/Q590010

On Sat, Mar 7, 2015 at 1:44 PM, Amir Ladsgroup ladsgr...@gmail.com wrote: Hey Markus, Thanks for your insight :)

On Sat, Mar 7, 2015 at 9:52 PM, Markus Krötzsch mar...@semantic-mediawiki.org wrote: Hi Amir, In spite of all due enthusiasm, please evaluate your results (with humans!) before making automated edits. In fact, I would contradict Magnus here and say that such an approach would best be suited to provide meaningful (pre-filtered) *input* to people who play a Wikidata game, rather than bypassing the game (and humans) altogether. The expected error rates are quite high for such an approach, but it can still save a lot of work for humans.

There is a certainty factor, and it can save a lot of work without making such errors by using the certainty factor.

As for the next steps, I would suggest that you have a look at the work that others have done already. Try Google Scholar: https://scholar.google.com/scholar?q=machine+learning+wikipedia As you can see, there are countless works on using machine learning techniques on Wikipedia, both for information extraction (e.g., understanding link semantics) and for things like vandalism detection. I am sure that one could get a lot of inspiration from there, both on potential applications and on technical hints on how to improve result quality.

Yes, definitely I would use them, thanks.

You will find that people are using many different approaches in these works. The good old ANN is still a relevant algorithm in practice, but there are many other techniques, such as SVMs, Markov models, or random forests, which have been found to work better than ANNs in many cases. Not saying that a three-layer feed-forward ANN cannot do some jobs as well, but I would not restrict to one ML approach if you have a whole arsenal of algorithms available, most of them pre-implemented in libraries (the first Google hit has a lot of relevant projects listed: http://daoudclarke.github.io/machine%20learning%20in%20practice/2013/10/08/machine-learning-libraries/). I would certainly recommend that you don't implement any of the standard ML algorithms from scratch.

I use the backward propagation algorithm, and I use Octave in ML for my personal work, but in Wikipedia I use Python (for two main reasons: integration with other wikipedia-related tools like pywikibot, and the bad performance of Octave and Matlab on big sets of data), and I had to write those parts from scratch since I couldn't find any related library in Python. Even algorithms like BFGS are not there (I could find one in scipy, but I wasn't sure it works correctly and there is no documentation).

In practice, the most challenging task for successful ML is often feature engineering: the question of which features you use as an input to your learning algorithm.
This is far more important than the choice of algorithm. Wikipedia in particular offers you so many relevant pieces of information with each article that are not just mere keywords (links, categories, in-links, ...) and it is not easy to decide which of these to feed into your learner. This will be different for each task you solve (subject classification is fundamentally different from vandalism detection, and even different types of vandalism would require very different techniques). You should pick hard or very large tasks to make sure that the tweaking you need in each case takes less time than you would need as a human to solve the task manually ;-)

Yes, feature engineering is the most important thing and it can be tricky, but feature engineering in Wikidata is a lot easier (it's easier than Wikipedia, and Wikipedia itself is easier than other places). Anti-vandalism bots are a lot easier in Wikidata than Wikipedia. Editing in Wikidata is limited to certain kinds of actions (like removing a sitelink, etc.), but it's not so constrained in Wikipedia.

Anyway, it's an interesting field, and we could certainly use some effort to exploit the countless works in this field for Wikidata. But you should be aware that this is no small challenge and that there is no universal solution that will work well even for all the tasks that you have mentioned in your email.

Of course, I
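One way to pull review candidates for the false-positive question raised at the top of this message: items that carry a human-skewed property such as 'date of birth' but are not typed as human. A rough SPARQL sketch over an RDF export (prefix layout assumed):

    # Items with a date of birth that are not claimed to be human --
    # trees, individual animals, fictional characters, or misclassifications.
    PREFIX wd:  <http://www.wikidata.org/entity/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>

    SELECT ?item WHERE {
      ?item wdt:P569 ?dob .                           # date of birth
      FILTER NOT EXISTS { ?item wdt:P31 wd:Q5 . }     # not instance of: human
    }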
Re: [Wikidata-l] External identifiers vs. Wikidata-internal links data
"Yes. I could see a simple Statements vs. External identifiers distinction being useful that's also reflected in the data model so it's easier to treat these property groups in a distinct manner."

I support grouping statements about external identifiers together and distinguishing them from other statements, but I would voice caution about presenting that distinction as "Statements vs. External identifiers".

I agree with Denny that qualifiers and references should be retained for external identifiers. I would further suggest that external identifiers remain structured as properties that can (along with their values in claims) be created, updated and deleted by the community. Given that, I think the distinction should be styled less as "Statements vs. External identifiers" and more as "External identifiers as a kind of statement". UI editing controls and data modeling as statements would remain, but external identifiers (e.g. *VIAF identifier* 113230702) would be moved to the bottom or side of statements of subject knowledge (e.g. *cause of death* heart attack).

Grouping together and separating external identifiers from other kinds of statements in the UI, and reflecting that in the data model and API, sounds like a great idea. https://www.wikidata.org/wiki/Q42 is a rat's nest of meaningless (but technically useful) statements about external identifiers and meaningful statements about the subject. It's important to fix that, and I imagine we could do so while retaining all the current UI controls and data model attributes of statements in statements about external identifiers.

Best, Eric https://www.wikidata.org/wiki/User:Emw
Re: [Wikidata-l] OWL based ontologies as basis for Wikidata item interactions and property proposal
Sebastian, Benjamin, Elvira, Andra, Andrew,

Kudos on your progress with an OWL-centric approach to knowledge representation. The community has been incorporating OWL concepts into property definitions and ontology development on-wiki for some time, but yours is the first Wikidata group I'm aware of that has incorporated Protege into the process.

"We think that using ontologies brings several advantages"

The examples you cite seem like good ideas and I support them. I would also suggest considering how the Wikidata ontologies we develop fit into established ontologies in the Semantic Web. For example, the OBO Foundry (http://www.obofoundry.org/) is by far the world's most widely used group of biomedical ontologies [1, 2]. Those ontologies are rooted in the Basic Formal Ontology (BFO). OWL helps a great deal in being interoperable with those works, but a further ontological commitment tends to be needed for easy compatibility. Is your gene-disease interaction ontology compatible with BFO, and with the OBO ontologies rooted in it?

Cheers, Eric https://www.wikidata.org/wiki/User:Emw

1. http://www.nature.com/nbt/journal/v25/n11/full/nbt1346.html
2. https://scholar.google.com/scholar?cites=13806088078865650870