Re: [Wikidata-l] Broken JSON in XML dumps
Looks like someone hasn't learned the lesson: https://www.mail-archive.com/wikidata-l@lists.wikimedia.org/msg02588.html

On Thu, Feb 26, 2015 at 9:27 PM, Lukas Benedix lukas.bene...@fu-berlin.de wrote: I second this! btw: what is the status of the problem with the missing dumps with history? (latest available from November 2014) Lukas

On Thu, 26 Feb 2015 at 14:52, Markus Kroetzsch wrote: Hi, It's that time of the year again when I am sending a reminder that we still have broken JSON in the dump files ;-). As usual, the problem is that empty maps {} are serialized wrongly as empty lists []. I am not sure if there is any open bug that tracks this, so I am sending an email. There was one, but it was closed [1]. As you know (I had sent an email a while ago), there are some remaining problems of this kind in the JSON dump, and also in the live exported JSON, e.g., https://www.wikidata.org/wiki/Special:EntityData/Q4383128.json (uses [] as a value for snaks: this item has a reference with an empty list of snaks, which is an error in itself). However, the situation is considerably worse in the XML dumps, which have seen less usage since we have JSON, but as it turns out are still preferred by some users. Surprisingly (to me), the JSON content in the XML dumps is still not the same as in the JSON dumps. A large part of the records in the XML dump is broken because of the map-vs-list issue. For example, the latest dump of current revisions [2] has countless instances of the problem. The first is in the item Q3261 (empty list for claims), but you can easily find more by grepping for things like "claims":[] It seems that all empty maps are serialized wrongly in this dump (aliases, descriptions, claims, ...). In contrast, the site's export simply omits the key of empty maps entirely, see https://www.wikidata.org/wiki/Special:EntityData/Q3261.json The JSON in the JSON dumps is the same. Cheers, Markus

[1] https://github.com/wmde/WikibaseDataModelSerialization/issues/77
[2] http://dumps.wikimedia.org/wikidatawiki/20150207/wikidatawiki-20150207-pages-meta-current.xml.bz2

___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
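For anyone who has to consume the affected dumps before the serializer is fixed, a practical workaround is to coerce the wrongly serialized empty lists back into empty maps while parsing each entity record. A minimal sketch in Python; the set of map-valued keys is an assumption based on the fields named above (aliases, descriptions, claims) plus labels and sitelinks, not the complete Wikibase schema:

    import json

    # Keys the entity JSON treats as maps; the broken dumps emit [] instead of {}
    # for these when they are empty. The key list is an assumption, not exhaustive.
    MAP_KEYS = {"labels", "descriptions", "aliases", "claims", "sitelinks"}

    def normalize_entity(entity):
        """Turn wrongly serialized empty lists back into empty maps."""
        for key in MAP_KEYS:
            if entity.get(key) == []:
                entity[key] = {}
        return entity

    # Example: a record shaped like the broken Q3261 case (empty list for claims).
    record = json.loads('{"id": "Q3261", "type": "item", "claims": []}')
    print(normalize_entity(record))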
Re: [Wikidata-l] Wikidata RDF
Gerard, what about query functionality for example? This has been long promised but shows no real progress. And why do you think practical cases cannot be implemented using RDF? What is the justification for ignoring the whole standard and implementation stack? What makes you think Wikidata can do better than RDF? Martynas On Tue, Oct 28, 2014 at 6:48 AM, Gerard Meijssen gerard.meijs...@gmail.com wrote: Hoi, Hell no. Wikidata is first and foremost a product that is actually used. It has that way from the start. Prioritising RDF over actual practical use cases is imho wrong. If anything the continuous tinkering on the format of dumps has mostly brought us grieve. Dumps that can no longer be read like currently for the Wikidata statistics really hurt. So lets not spend time at this time on RDF, Lets ensure that what we have works, works well and plan carefully for a better RDF but lets only have it go in production AFTER we know that it works well. Thanks, GerardM On 28 October 2014 02:46, Martynas Jusevičius marty...@graphity.org wrote: Hey all, so I see there is some work being done on mapping Wikidata data model to RDF [1]. Just a thought: what if you actually used RDF and Wikidata's concepts modeled in it right from the start? And used standard RDF tools, APIs, query language (SPARQL) instead of building the whole thing from scratch? Is it just me or was this decision really a colossal waste of resources? [1] http://korrekt.org/papers/Wikidata-RDF-export-2014.pdf Martynas http://graphityhq.com ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] Wikidata RDF
John, please see inline: On Tue, Oct 28, 2014 at 8:39 AM, John Erling Blad jeb...@gmail.com wrote: The data model is close to RDF, but not quite. Statements in items are reified statements, etc. Technically it is semantic data, where RDF is one possible representaton. Well it has been shown (in the paper I referenced) that Wikidata can be modeled as RDF. And there is no reason why it couldn't be, because in RDF anyone can say anything about anything. There was a decision choice to keep Mediawiki to ease reuse within the Wikimedia sites, mostly so users could reuse their knowledge, but also for devs to reuse existing infrastructure. This is exactly the decision that I question. I think it was completely misguided. If the goal was to reuse knowledge and infrastructure, then Wikidata has failed completely, as there is more infrastructure and knowledge of RDF than there ever will be for Mediawiki, or any structured/semantic data model for that matter. Some of the problems with Wd comes from the fact that the similarities isn't clear enough for the users, and possibly the devs, which have resulted in a slightly introvert community and a technical structure that is slightly more Wikipedia-centric than necessary. Here I can only agree with you. That is not an RDF problem though. On Tue, Oct 28, 2014 at 6:48 AM, Gerard Meijssen gerard.meijs...@gmail.com wrote: Hoi, Hell no. Wikidata is first and foremost a product that is actually used. It has that way from the start. Prioritising RDF over actual practical use cases is imho wrong. If anything the continuous tinkering on the format of dumps has mostly brought us grieve. Dumps that can no longer be read like currently for the Wikidata statistics really hurt. So lets not spend time at this time on RDF, Lets ensure that what we have works, works well and plan carefully for a better RDF but lets only have it go in production AFTER we know that it works well. Thanks, GerardM On 28 October 2014 02:46, Martynas Jusevičius marty...@graphity.org wrote: Hey all, so I see there is some work being done on mapping Wikidata data model to RDF [1]. Just a thought: what if you actually used RDF and Wikidata's concepts modeled in it right from the start? And used standard RDF tools, APIs, query language (SPARQL) instead of building the whole thing from scratch? Is it just me or was this decision really a colossal waste of resources? [1] http://korrekt.org/papers/Wikidata-RDF-export-2014.pdf Martynas http://graphityhq.com ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] Wikidata RDF
Gerard, what is there practical about having a query language that 1) is not a standard and never will be 2) is not supported by any other tool or project and never will be? I would understand this kind of reasoning coming from a hobbyist project, but not from one claiming to be a global free linked database. Martynas On Tue, Oct 28, 2014 at 11:37 AM, Gerard Meijssen gerard.meijs...@gmail.com wrote: Hoi, Query has been promised and unofficially we have it for a VERY long time.. It is called WDQ. it is used in many tools. The official query will only provide a subset of functionality for quite some time as I understand it. Practical cases in RDF for what by whom ? Wikidata is first and foremost a vehicle to bring interwiki links to our projects. Then and only then it becomes relevant to store data about the items involved. This data may be used in info boxes and what not in our projects.. THAT is practical use to our community. RDF may of interest to others and it may be possible to do practical things by them but that does not prioritise it. I do not think Wikidata can do better. As far as I am concerned it is the least of our problems. The reuse of data is first to happen within our projects and THAT is not so much of a technical problem at all. Thanks, GerardM On 28 October 2014 11:26, Martynas Jusevičius marty...@graphity.org wrote: Gerard, what about query functionality for example? This has been long promised but shows no real progress. And why do you think practical cases cannot be implemented using RDF? What is the justification for ignoring the whole standard and implementation stack? What makes you think Wikidata can do better than RDF? Martynas On Tue, Oct 28, 2014 at 6:48 AM, Gerard Meijssen gerard.meijs...@gmail.com wrote: Hoi, Hell no. Wikidata is first and foremost a product that is actually used. It has that way from the start. Prioritising RDF over actual practical use cases is imho wrong. If anything the continuous tinkering on the format of dumps has mostly brought us grieve. Dumps that can no longer be read like currently for the Wikidata statistics really hurt. So lets not spend time at this time on RDF, Lets ensure that what we have works, works well and plan carefully for a better RDF but lets only have it go in production AFTER we know that it works well. Thanks, GerardM On 28 October 2014 02:46, Martynas Jusevičius marty...@graphity.org wrote: Hey all, so I see there is some work being done on mapping Wikidata data model to RDF [1]. Just a thought: what if you actually used RDF and Wikidata's concepts modeled in it right from the start? And used standard RDF tools, APIs, query language (SPARQL) instead of building the whole thing from scratch? Is it just me or was this decision really a colossal waste of resources? [1] http://korrekt.org/papers/Wikidata-RDF-export-2014.pdf Martynas http://graphityhq.com ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
[Wikidata-l] Wikidata RDF
Hey all, so I see there is some work being done on mapping the Wikidata data model to RDF [1]. Just a thought: what if you had actually used RDF, with Wikidata's concepts modeled in it, right from the start? And used standard RDF tools, APIs, and the standard query language (SPARQL) instead of building the whole thing from scratch? Is it just me, or was this decision really a colossal waste of resources? [1] http://korrekt.org/papers/Wikidata-RDF-export-2014.pdf Martynas http://graphityhq.com ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
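To make the "standard tools" point concrete: once data is available as RDF (as in the export described in [1]), any off-the-shelf library can load it and answer questions in SPARQL. A minimal sketch with rdflib; the prefixes, property names and numbers are placeholders for illustration, not the vocabulary of the actual export:

    from rdflib import Graph

    # Illustrative data only; the real export uses its own vocabulary and values.
    turtle = """
    @prefix ex: <http://example.org/> .
    ex:Q64 ex:label "Berlin" ; ex:population 3500000 .
    ex:Q90 ex:label "Paris"  ; ex:population 2200000 .
    """

    g = Graph()
    g.parse(data=turtle, format="turtle")

    # A standard SPARQL query instead of a home-grown query language.
    query = """
    PREFIX ex: <http://example.org/>
    SELECT ?label WHERE {
        ?item ex:label ?label ; ex:population ?pop .
        FILTER(?pop > 3000000)
    }
    """
    for row in g.query(query):
        print(row.label)   # prints: Berlin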
Re: [Wikidata-l] How are queries doing?
Jan, my suspicion is that my predictions from last year hold true: it is a far more complex task to design a scalable and performant data model, query language and/or query engine solely for Wikidata than the designers of this project anticipated - unless they did anticipate and now knowingly fail to deliver. You can check some threads from december last year, and they relate to even older ones: http://www.mail-archive.com/wikidata-l@lists.wikimedia.org/msg01415.html Martynas On Fri, Nov 29, 2013 at 1:47 PM, Jan Kučera kozuc...@gmail.com wrote: Ok. One is a bit disappointed seeing various projects to fail to deliver according to their original timelines... seems like there is not enough money in? Do you need more developers to perform better? 2013/11/26 Lydia Pintscher lydia.pintsc...@wikimedia.de On Mon, Nov 25, 2013 at 9:55 PM, Jan Kučera kozuc...@gmail.com wrote: Hi, so how things are going? Anything for testing already? Nothing to test yet. As soon as there is I will send an email to this list. The current status is that we still need to make some final adjustments on the database schema and finish the java script part of the user interface as well as ranks. Cheers Lydia -- Lydia Pintscher - http://about.me/lydia.pintscher Product Manager for Wikidata Wikimedia Deutschland e.V. Obentrautstr. 72 10963 Berlin www.wikimedia.de Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V. Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985. ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] new search backend ready for testing
Hey Lydia, how about query access? Martynas graphityhq.com

On Wed, Nov 6, 2013 at 6:17 PM, Lydia Pintscher lydia.pintsc...@wikimedia.de wrote: Hey everyone, Progress! We now have the long-awaited new search backend up and running for testing on Wikidata. It will still need some tweaking but please do try it and give feedback. It is running in parallel to the old one. You will need to visit a special page to use it: https://www.wikidata.org/w/index.php?search=athens&button=&title=Special%3ASearch&srbackend=CirrusSearch Please let me know about any issues you can still find with it so we can soon make it the default. Thanks to Chad and Katie for working on this. Cheers Lydia -- Lydia Pintscher - http://about.me/lydia.pintscher Product Manager for Wikidata Wikimedia Deutschland e.V. Obentrautstr. 72 10963 Berlin www.wikimedia.de Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V. Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985. ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] Wikidata RDF Issues
There was a long discussion not so long ago about using established RDF tools for Wikipedia dumps instead of home-brewed ones, but I guess someone hasn't learnt the lesson yet. On Thu, Sep 26, 2013 at 2:22 PM, Kingsley Idehen kide...@openlinksw.com wrote: All, See: https://www.wikidata.org/wiki/Q76 The resource to which the URI above resolves contains: schema:version 72358096^^xsd:integer . It should be: schema:version 72358096^^xsd:integer . Who is responsible for RDF resource publication and issue report handling? -- Regards, Kingsley Idehen Founder CEO OpenLink Software Company Web: http://www.openlinksw.com Personal Weblog: http://www.openlinksw.com/blog/~kidehen Twitter/Identi.ca handle: @kidehen Google+ Profile: https://plus.google.com/112399767740508618350/about LinkedIn Profile: http://www.linkedin.com/in/kidehen ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
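The two triples in the report above render identically because the quote characters appear to have been stripped in archiving. Assuming the point was the quoting of the typed literal (Turtle requires a quoted lexical form before ^^), the difference can be checked with any standard parser; a small sketch with rdflib, where the "broken" form is this assumption and not a verbatim copy of the original message:

    from rdflib import Graph

    prefixes = (
        '@prefix schema: <http://schema.org/> .\n'
        '@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .\n'
    )

    # Assumption: the reported triple had an unquoted lexical form (invalid Turtle)
    # and the correction adds the quotes around the number.
    broken = prefixes + '<http://www.wikidata.org/entity/Q76> schema:version 72358096^^xsd:integer .'
    fixed = prefixes + '<http://www.wikidata.org/entity/Q76> schema:version "72358096"^^xsd:integer .'

    for label, doc in (("unquoted", broken), ("quoted", fixed)):
        try:
            Graph().parse(data=doc, format="turtle")
            print(label, "-> parses")
        except Exception as exc:
            print(label, "-> rejected:", type(exc).__name__)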
Re: [Wikidata-l] Accelerating software innovation with Wikidata and improved Wikicode
Here's my approach to software code problems: we need less of it, not more. We need to remove domain logic from source code and move it into data, which can be managed and on which UI can be built. In that way we can build generic scalable software agents. That is the way to Semantic Web. Martynas graphityhq.com On Mon, Jul 8, 2013 at 10:13 PM, Michael Hale hale.michael...@live.com wrote: There are lots of code snippets scattered around the internet, but most of them can't be wired together in a simple flowchart manner. If you look at object libraries that are designed specifically for that purpose, like Modelica, you can do all sorts of neat engineering tasks like simulate the thermodynamics and power usage of a new refrigerator design. Then if your company is designing a new insulation material you would make a new block with the experimentally determined properties of your material to include in the programmatic flowchart to quickly calibrate other aspects of the refrigerator's design. To my understanding, Modelica is as big and good as it gets for code libraries that represent physically accurate objects. Often, the visual representation of those objects needs to be handled separately. As far as general purpose, standard programming libraries go, Mathematica is the best one I've found for quickly prototyping new functionality. A typical web mashup app or site will combine functionality and/or data from 3 to 6 APIs. Mobile apps will typically use the phone's functionality, an extra library for better graphics support, a proprietary library or two made by the company, and a couple of web APIs. A similar story for desktop media-editing programs, business software, and high-end games except the libraries are often larger. But there aren't many software libraries that I would describe as huge. And there are even fewer that manage to scale the usefulness of the library equally with the size it occupies on disk. Platform fragmentation (increase in number and popularity of smart phones and tablets) has proven to be a tremendous challenge for continuing to improve libraries. I now just have 15 different ways to draw a circle on different screens. The attempts to provide virtual machines with write-once run-anywhere functionality (Java and .NET) have failed, often due to customer lock-in reasons as much as platform fragmentation. Flash isn't designed to grow much beyond its current scope. The web standards can only progress as quickly as the least common denominator of functionality provided by other means, which is better than nothing I suppose. Mathematica has continued to improve their library (that's essentially what they sell), but they don't try to cover a lot of platforms. They also aren't open source and don't attempt to make the entire encyclopedia interactive and programmable. Open source attempts like the Boost C++ library don't seem to grow very quickly. But I think using Wikipedia articles as a scaffold for a massive open source, object-oriented library might be what is needed. I have a few approaches I use to decide what code to write next. They can be arranged from most useful as an exercise to stay sharp in the long term to most immediately useful for a specific project. Sometimes I just write code in a vacuum. Like, I will just choose a simple task like making a 2D ball bounce around some stairs interactively and I will just spend a few hours writing it and rewriting it to be more efficient and easier to expand. 
It always gives me a greater appreciation for the types of details that can be specified to a computer (and hence the scope of the computational universe, or space of all computer programs). Like with the ball bouncing example you can get lost defining interesting options for the ball and the ground or in the geometry logic for calculating the intersections (like if the ball doesn't deform or if the stairs have certain constraints on their shape there are optimizations you can make). At the end of the exercise I still just have a ball bouncing down some stairs, but my mind feels like it has been on a journey. Sometimes I try to write code that I think a group of people would find useful. I will browse the articles in the areas of computer science category by popularity and start writing the first things I see that aren't already in the libraries I use. So I'll expand Mathematica's FindClusters function to support density based methods or I'll expand the RandomSample function to support files that are too large to fit in memory with a reservoir sampling algorithm. Finally, I write code for specific projects. I'm trying to genetically engineer turf grass that doesn't need to be cut, so I need to automate some of the work I do for GenBank imports and sequence comparisons. For all of those, if there was an organized place to put my code afterwards so it would fit into a larger useful library I would totally be willing to do
Re: [Wikidata-l] Accelerating software innovation with Wikidata and improved Wikicode
Yes, that is one of the reasons functional languages are getting popular: https://www.fpcomplete.com/blog/2012/04/the-downfall-of-imperative-programming With PHP and JavaScript being the most widespread (and still misused) languages we will not get there soon, however. On Mon, Jul 8, 2013 at 10:57 PM, Michael Hale hale.michael...@live.com wrote: In the functional programming language family (think Lisp) there is no fundamental distinction between code and data. Date: Mon, 8 Jul 2013 22:47:46 +0300 From: marty...@graphity.org To: wikidata-l@lists.wikimedia.org Subject: Re: [Wikidata-l] Accelerating software innovation with Wikidata and improved Wikicode Here's my approach to software code problems: we need less of it, not more. We need to remove domain logic from source code and move it into data, which can be managed and on which UI can be built. In that way we can build generic scalable software agents. That is the way to Semantic Web. Martynas graphityhq.com On Mon, Jul 8, 2013 at 10:13 PM, Michael Hale hale.michael...@live.com wrote: There are lots of code snippets scattered around the internet, but most of them can't be wired together in a simple flowchart manner. If you look at object libraries that are designed specifically for that purpose, like Modelica, you can do all sorts of neat engineering tasks like simulate the thermodynamics and power usage of a new refrigerator design. Then if your company is designing a new insulation material you would make a new block with the experimentally determined properties of your material to include in the programmatic flowchart to quickly calibrate other aspects of the refrigerator's design. To my understanding, Modelica is as big and good as it gets for code libraries that represent physically accurate objects. Often, the visual representation of those objects needs to be handled separately. As far as general purpose, standard programming libraries go, Mathematica is the best one I've found for quickly prototyping new functionality. A typical web mashup app or site will combine functionality and/or data from 3 to 6 APIs. Mobile apps will typically use the phone's functionality, an extra library for better graphics support, a proprietary library or two made by the company, and a couple of web APIs. A similar story for desktop media-editing programs, business software, and high-end games except the libraries are often larger. But there aren't many software libraries that I would describe as huge. And there are even fewer that manage to scale the usefulness of the library equally with the size it occupies on disk. Platform fragmentation (increase in number and popularity of smart phones and tablets) has proven to be a tremendous challenge for continuing to improve libraries. I now just have 15 different ways to draw a circle on different screens. The attempts to provide virtual machines with write-once run-anywhere functionality (Java and .NET) have failed, often due to customer lock-in reasons as much as platform fragmentation. Flash isn't designed to grow much beyond its current scope. The web standards can only progress as quickly as the least common denominator of functionality provided by other means, which is better than nothing I suppose. Mathematica has continued to improve their library (that's essentially what they sell), but they don't try to cover a lot of platforms. They also aren't open source and don't attempt to make the entire encyclopedia interactive and programmable. 
Open source attempts like the Boost C++ library don't seem to grow very quickly. But I think using Wikipedia articles as a scaffold for a massive open source, object-oriented library might be what is needed. I have a few approaches I use to decide what code to write next. They can be arranged from most useful as an exercise to stay sharp in the long term to most immediately useful for a specific project. Sometimes I just write code in a vacuum. Like, I will just choose a simple task like making a 2D ball bounce around some stairs interactively and I will just spend a few hours writing it and rewriting it to be more efficient and easier to expand. It always gives me a greater appreciation for the types of details that can be specified to a computer (and hence the scope of the computational universe, or space of all computer programs). Like with the ball bouncing example you can get lost defining interesting options for the ball and the ground or in the geometry logic for calculating the intersections (like if the ball doesn't deform or if the stairs have certain constraints on their shape there are optimizations you can make). At the end of the exercise I still just have a ball bouncing down some stairs, but my mind feels like it has been on a journey. Sometimes I try to write code that I
Re: [Wikidata-l] Is an ecosystem of Wikidatas possible?
You probably mean Linked Data? On Tue, Jun 11, 2013 at 9:41 PM, David Cuenca dacu...@gmail.com wrote: While on the Hackathon I had the opportunity to talk with some people from sister projects about how they view Wikidata and the relationship it should have to sister projects. Probably you are already familiar with the views because they have been presented already several times. The hopes are high, in my opinion too high, about what can be accomplished when Wikidata is deployed to sister projects. There are conflicting needs about what belongs into Wikidata and what sister projects need, and that divide it is far greater to be overcome than just by installing the extension. In fact, I think there is a confusion between the need for Wikidata and the need for structured data. True that Wikidata embodies that technology, but I don't think all problems can be approached by the same centralized tool. At least not from the social side of it. Wikiquote could have one item for each quote, or Wikivoyage an item for each bar, hostel, restaurant, etc..., and the question will always be: are they relevant enough to be created in Wikidata? Considering that Wikidata was initially thought for Wikipedia, that scope wouldn't allow those uses. However, the structured data needs could be covered in other ways. It doesn't need to be a big wikidata addressing it all. It could well be a central Wikidata addressing common issues (like author data, population data, etc), plus other Wikidata installs on each sister project that requires it. For instance there could be a data.wikiquote.org, a data.wikivoyage.org, etc that would cater for the needs of each community, that I predict will increase as soon as the benefits become clear, and of course linked to the central Wikidata whenever needed. Even Commons could be wikidatized with each file becoming an item and having different labels representing the file name depending on the language version being accessed. Could be this the right direction to go? Cheers, Micru ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] Data values
Hey wikidatians, occasionally checking threads in this list like the current one, I get a mixed feeling: on one hand, it is sad to see the efforts and resources waisted as Wikidata tries to reinvent RDF, and now also triplestore design as well as XSD datatypes. What's next, WikiQL instead of SPARQL? On the other hand, it feels reassuring as I was right to predict this: http://www.mail-archive.com/wikidata-l@lists.wikimedia.org/msg00056.html http://www.mail-archive.com/wikidata-l@lists.wikimedia.org/msg00750.html Best, Martynas graphity.org On Wed, Dec 19, 2012 at 4:11 PM, Daniel Kinzler daniel.kinz...@wikimedia.de wrote: On 19.12.2012 14:34, Friedrich Röhrs wrote: Hi, Sorry for my ignorance, if this is common knowledge: What is the use case for sorting millions of different measures from different objects? Finding all cities with more than 10 inhabitants requires the database to look through all values for the property population (or even all properties with countable values, depending on implementation an query planning), compare each value with 10 and return those with a greater value. To speed this up, an index sorted by this value would be needed. For cars there could be entries by the manufacturer, by some car-testing magazine, etc. I don't see how this could be adequatly represented/sorted by a database only query. If this cannot be done adequatly on the database level, then it cannot be done efficiently, which means we will not allow it. So our task is to come up with an architecture that does allow this. (One way to allow scripted queries like this to run efficiently is to do this in a massively parallel way, using a map/reduce framework. But that's also not trivial, and would require a whole new server infrastructure). If however this is necessary, i still don't understand why it must affect the datavalue structure. If a index is necessary it could be done over a serialized representation of the value. Serialized can mean a lot of things, but an index on some data blob is only useful for exact matches, it can not be used for greater/lesser queries. We need to map our values to scalar data types the database can understand directly, and use for indexing. This needs to be done anyway, since the values are saved at a specific unit (which is just a wikidata item). To compare them on a database level they must all be saved at the same unit, or some sort of procedure must be used to compare them (or am i missing something again?). If they measure the same dimension, they should be saved using the same unit (probably the SI base unit for that dimension). Saving values using different units would make it impossible to run efficient queries against these values, thereby defying one of the major reasons for Wikidata's existance. I don't see a way around this. -- daniel -- Daniel Kinzler, Softwarearchitekt Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V. ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
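Daniel's argument -- map each value to a scalar the database can index, stored in a common (e.g. SI) unit, so that greater/lesser comparisons become index range scans -- can be sketched in a few lines. The table layout, item IDs and numbers below are illustrative assumptions, not Wikidata's actual schema:

    import sqlite3

    # Sketch of the indexing approach described above: one scalar column in a
    # common unit, indexed per property, so range queries avoid full scans.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE statement_value (entity TEXT, property TEXT, value_si REAL)")
    db.execute("CREATE INDEX idx_prop_value ON statement_value (property, value_si)")
    db.executemany(
        "INSERT INTO statement_value VALUES (?, ?, ?)",
        [("Q64", "population", 3500000.0),
         ("Q1055", "population", 1800000.0),
         ("Q64", "area", 891.7)],
    )

    # "All cities with a population above some threshold" becomes an indexed range scan.
    rows = db.execute(
        "SELECT entity, value_si FROM statement_value"
        " WHERE property = ? AND value_si > ? ORDER BY value_si DESC",
        ("population", 100000),
    ).fetchall()
    print(rows)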
Re: [Wikidata-l] Data values
Denny, you're sidestepping the main issue here -- every sensible architecture should build on as much previous standards as possible, and build own custom solution only if a *very* compelling reason is found to do so instead of finding a compromise between the requirements and the standard. Wikidata seems to be constantly doing the opposite -- building a custom solution with whatever reason, or even without it. This drives the compatibility and reuse towards zero. This thread originally discussed datatypes for values such as numbers, dates and their intervals -- semantics for all of those are defined in XML Schema Datatypes: http://www.w3.org/TR/xmlschema-2/ All the XML and RDF tools are compatible with XSD, however I don't think there is even a single mention of it in this thread? What makes Wikidata so special that its datatypes cannot build on XSD? And this is only one of the issues, I've pointed out others earlier. Martynas graphity.org On Wed, Dec 19, 2012 at 5:58 PM, Denny Vrandečić denny.vrande...@wikimedia.de wrote: Martynas, could you please let me know where RDF or any of the W3C standards covers topics like units, uncertainty, and their conversion. I would be very much interested in that. Cheers, Denny 2012/12/19 Martynas Jusevičius marty...@graphity.org Hey wikidatians, occasionally checking threads in this list like the current one, I get a mixed feeling: on one hand, it is sad to see the efforts and resources waisted as Wikidata tries to reinvent RDF, and now also triplestore design as well as XSD datatypes. What's next, WikiQL instead of SPARQL? On the other hand, it feels reassuring as I was right to predict this: http://www.mail-archive.com/wikidata-l@lists.wikimedia.org/msg00056.html http://www.mail-archive.com/wikidata-l@lists.wikimedia.org/msg00750.html Best, Martynas graphity.org On Wed, Dec 19, 2012 at 4:11 PM, Daniel Kinzler daniel.kinz...@wikimedia.de wrote: On 19.12.2012 14:34, Friedrich Röhrs wrote: Hi, Sorry for my ignorance, if this is common knowledge: What is the use case for sorting millions of different measures from different objects? Finding all cities with more than 10 inhabitants requires the database to look through all values for the property population (or even all properties with countable values, depending on implementation an query planning), compare each value with 10 and return those with a greater value. To speed this up, an index sorted by this value would be needed. For cars there could be entries by the manufacturer, by some car-testing magazine, etc. I don't see how this could be adequatly represented/sorted by a database only query. If this cannot be done adequatly on the database level, then it cannot be done efficiently, which means we will not allow it. So our task is to come up with an architecture that does allow this. (One way to allow scripted queries like this to run efficiently is to do this in a massively parallel way, using a map/reduce framework. But that's also not trivial, and would require a whole new server infrastructure). If however this is necessary, i still don't understand why it must affect the datavalue structure. If a index is necessary it could be done over a serialized representation of the value. Serialized can mean a lot of things, but an index on some data blob is only useful for exact matches, it can not be used for greater/lesser queries. We need to map our values to scalar data types the database can understand directly, and use for indexing. 
This needs to be done anyway, since the values are saved at a specific unit (which is just a wikidata item). To compare them on a database level they must all be saved at the same unit, or some sort of procedure must be used to compare them (or am i missing something again?). If they measure the same dimension, they should be saved using the same unit (probably the SI base unit for that dimension). Saving values using different units would make it impossible to run efficient queries against these values, thereby defying one of the major reasons for Wikidata's existance. I don't see a way around this. -- daniel -- Daniel Kinzler, Softwarearchitekt Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V. ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l -- Project director Wikidata Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin Tel. +49-30-219 158 26-0 | http://wikimedia.de Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V. Eingetragen im Vereinsregister des
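On the concrete point about datatypes: XSD already defines lexical forms, value spaces and ordering for numbers, dates and durations, and existing RDF libraries map them to native types. A small illustration with rdflib; the values are placeholders:

    from rdflib import Literal
    from rdflib.namespace import XSD

    # XSD typed literals carry both a lexical form and a comparable native value.
    population = Literal("3500000", datatype=XSD.integer)
    founded = Literal("1237-01-01", datatype=XSD.date)

    print(population.toPython() > 100000)   # True -- compared as a Python int
    print(founded.toPython().year)          # 1237 -- a datetime.date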
Re: [Wikidata-l] Provenance tracking on the Web with NIF-URIs
Denny, the statement-level of granularity you're describing is achieved by RDF reification. You describe it however as a deprecated mechanism of provenance, without backing it up. Why do you think there must be a better mechanism? Maybe you should take another look at reification, or lower your provenance requirements, at least initially? Martynas graphity.org On Jun 22, 2012 5:20 PM, Denny Vrandečić denny.vrande...@wikimedia.de wrote: Here's the use case: Every statement in Wikidata will have a URI. Every statement can have one more references. In many cases, the reference might be text on a website. Whereas it is always possible (and probably what we will do first) as well as correct to state: Statement1 accordingTo SlashDot . it would be preferable to be a bit more specific on that, and most preferably it would be to go all the way down to the sentence saying Statement1 accordingTo X . with X being a URI denoting the sentence that I mean in a specific Slashdot-Article. I would prefer a standard or widely adopted way to how to do that, and NIF-URIs seem to be a viable solution for that. We will come back to this once we start modeling references in more detail. The reference could be pointing to a book, to a video, to a mesopotamic stone table, etc. (OK, I admit that the different media types will be differently prioritized). I hope this helps, Cheers, Denny 2012/6/21 Sebastian Hellmann hellm...@informatik.uni-leipzig.de: Hello Denny, I was traveling for the past few weeks and can finally answer your email. See my comments inline. On 05/29/2012 05:25 PM, Denny VrandeÄ ić wrote: Hello Sebastian, Just a few questions - as you note, it is easier if we all use the same standards, and so I want to ask about the relation to other related standards: * I understand that you dismiss IETF RFC 5147 because it is not stable enough, right? The offset scheme of NIF is built on this RFC. So the following would hold: @prefix ld: http://www.w3.org/DesignIssues/LinkedData.html# . @prefix owl: http://www.w3.org/2002/07/owl# . ld:offset_717_729 owl:sameAs ld:char=717,12 . We might change the syntax and reuse the RFC syntax, but it has several issues: 1. The optional part is not easy to handle, because you would need to add owl:sameAs statements: ld:char=717,12;length=12,UTF-8 owl:sameAs ld:char=717,12;length=12 . ld:char=717,12;length=12,UTF-8 owl:sameAs ld:char=717,12 . ld:char=717,12;UTF-8 owl:sameAs ld:char=717,12;length=9876 . So theoretically ok, but annoying to implement and check. 2. When implementing web services, NIF allows the client to choose the prefix: http://nlp2rdf.lod2.eu/demo/NIFStemmer?input-type=textnif=trueprefix=http%3A%2F%2Fthis.is%2Fa%2Fslash%2Fprefix%2Furirecipe=offsetinput=President+Obama+is+president . returning URIs like http://this.is/a/slash/prefix/offset_10_15 So RFC 5147 would look like: http://this.is/a/slash/prefix/char=717,12 http://this.is/a/slash/prefix/char=717,12;UTF-8 or http://this.is/a/slash/prefix?char=717,12 http://this.is/a/slash/prefix?char=717,12;UTF-8 3. Character like = , prevent the use of prefixes: echo @prefix ld: http://www.w3.org/DesignIssues/LinkedData.html# . @prefix owl: http://www.w3.org/2002/07/owl# . ld:offset_717_729 owl:sameAs ld:char=717,12 . test.ttl ; rapper -i turtle test.ttl 4. 
implementation is a little bit more difficult, given that : $arr = split(_, offset_717_729) ; switch ($arr[0]){ case 'offset' : $begin = $arr[1]; $end = $arr[2]; break; case 'hash' : $clength = $arr[1]; $slength = $arr[2]; $hash = $arr[3]; $rest = /*merge remaining with '_' */ break; } 5. RFC assumes a certain mime type, i.e. plain text. NIF does have a broader assumption. * what is the relation to the W3C media fragment URIs? Did not find a pointer there. They are designed for media such as images, video, not strings. Potentially, the same principle can be applied, but it is not yet engineered/researched. * any plans of standardizing your approach? We will do NIF 2.0 as a community standard and finish it in a couple of months. It will be published under open licences, so anybody W3C or ISO might pick it up, easily. Other than that there are plans by several EU projects (see e.g. here http://lists.w3.org/Archives/Public/public-multilingualweb-lt/2012Jun/0101.html ) and a US project to use it and there are several third party implementations, already. We would rather have it adopted first on a large scale and then standardized, properly, i.e. W3C. This worked quite well for the FOAF project or for RDB2RDF Mappers. Chances for fast standardization are not so unlikely, I would assume. We would strongly prefer to just use a standard instead of advocating contenders for one -- if one
Re: [Wikidata-l] Provenance tracking on the Web with NIF-URIs
It says deprecated on the Data model wiki. So maybe Wikidata doesn't need statement-level granularity? Maybe the named graph approach is good enough? But it's not based on statements. If you build this kind of data model on the relational, not to mention provenance, you will not be able to provide a reasonable query mechanism. That's the reason why the development of Jena's SDB store is pretty much abandoned. Martynas On Jun 22, 2012 8:18 PM, Sebastian Hellmann hellm...@informatik.uni-leipzig.de wrote: Denny didn't even use the word deprecated. Reification for statement-level provenance works, but you won't be able to sell it as an elegant solution to the problem. So could - yes , should - ?? - probably not If Wikidata is using statement-level provenance, there might be better ways to serialize it in RDF than reification in the future e.g. NQuads: http://sw.deri.org/2008/07/n-quads/ or JSON ;) For internal use I would discourage reification. If using a relational scheme, a statement id, which can be joined with another SQL table for provenance is the best way to do it imho. Before you are driving us all mad with explaining why reifiction is bad, I would really like you to justify why WikiData should consider reification. I really do not know many use case (if any) where reification is the right choice of modelling. Before going into the discussion any further [1], I think you should name an example where reification is really better than other options. All the best, Sebastian [1]http://ceur-ws.org/Vol-699/Paper5.pdf On 06/22/2012 06:20 PM, Martynas Jusevičius wrote: Denny, the statement-level of granularity you're describing is achieved by RDF reification. You describe it however as a deprecated mechanism of provenance, without backing it up. Why do you think there must be a better mechanism? Maybe you should take another look at reification, or lower your provenance requirements, at least initially? Martynasgraphity.org On Jun 22, 2012 5:20 PM, Denny Vrandečić denny.vrande...@wikimedia.de denny.vrande...@wikimedia.de wrote: Here's the use case: Every statement in Wikidata will have a URI. Every statement can have one more references. In many cases, the reference might be text on a website. Whereas it is always possible (and probably what we will do first) as well as correct to state: Statement1 accordingTo SlashDot . it would be preferable to be a bit more specific on that, and most preferably it would be to go all the way down to the sentence saying Statement1 accordingTo X . with X being a URI denoting the sentence that I mean in a specific Slashdot-Article. I would prefer a standard or widely adopted way to how to do that, and NIF-URIs seem to be a viable solution for that. We will come back to this once we start modeling references in more detail. The reference could be pointing to a book, to a video, to a mesopotamic stone table, etc. (OK, I admit that the different media types will be differently prioritized). I hope this helps, Cheers, Denny 2012/6/21 Sebastian Hellmann hellm...@informatik.uni-leipzig.de hellm...@informatik.uni-leipzig.de: Hello Denny, I was traveling for the past few weeks and can finally answer your email. See my comments inline. On 05/29/2012 05:25 PM, Denny VrandeÄ ić wrote: Hello Sebastian, Just a few questions - as you note, it is easier if we all use the same standards, and so I want to ask about the relation to other related standards: * I understand that you dismiss IETF RFC 5147 because it is not stable enough, right? 
The offset scheme of NIF is built on this RFC. So the following would hold: @prefix ld: http://www.w3.org/DesignIssues/LinkedData.html# http://www.w3.org/DesignIssues/LinkedData.html# . @prefix owl: http://www.w3.org/2002/07/owl# http://www.w3.org/2002/07/owl# . ld:offset_717_729 owl:sameAs ld:char=717,12 . We might change the syntax and reuse the RFC syntax, but it has several issues: 1. The optional part is not easy to handle, because you would need to add owl:sameAs statements: ld:char=717,12;length=12,UTF-8 owl:sameAs ld:char=717,12;length=12 . ld:char=717,12;length=12,UTF-8 owl:sameAs ld:char=717,12 . ld:char=717,12;UTF-8 owl:sameAs ld:char=717,12;length=9876 . So theoretically ok, but annoying to implement and check. 2. When implementing web services, NIF allows the client to choose the prefix: http://nlp2rdf.lod2.eu/demo/NIFStemmer?input-type=textnif=trueprefix=http%3A%2F%2Fthis.is%2Fa%2Fslash%2Fprefix%2Furirecipe=offsetinput=President+Obama+is+president . returning URIs like http://this.is/a/slash/prefix/offset_10_15 http://this.is/a/slash/prefix/offset_10_15 So RFC 5147 would look like:http://this.is/a/slash/prefix/char=717,12 http://this.is/a/slash/prefix/char=717,12http://this.is/a/slash/prefix/char=717,12;UTF-8 http://this.is/a/slash/prefix/char=717,12;UTF-8 orhttp://this.is/a/slash/prefix?char
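The two modelling options being debated -- classic RDF reification, as Martynas suggests, versus one named graph per statement with provenance attached to the graph name, the N-Quads route Sebastian mentions -- can be put side by side. A minimal sketch with rdflib, reusing the "Statement1 accordingTo SlashDot" example from the thread; all IRIs and the triple content are placeholders:

    from rdflib import ConjunctiveGraph, Graph, Literal, Namespace, URIRef
    from rdflib.namespace import RDF

    EX = Namespace("http://example.org/")               # placeholder vocabulary
    WD = Namespace("http://www.wikidata.org/entity/")

    # Option 1: classic reification -- the statement gets an IRI of its own and
    # the provenance hangs off it ("Statement1 accordingTo SlashDot").
    g = Graph()
    stmt = EX.Statement1
    g.add((stmt, RDF.type, RDF.Statement))
    g.add((stmt, RDF.subject, WD.Q76))
    g.add((stmt, RDF.predicate, EX.someProperty))        # placeholder triple content
    g.add((stmt, RDF.object, Literal("some value")))
    g.add((stmt, EX.accordingTo, URIRef("http://slashdot.org/")))

    # Option 2: one named graph per statement (what N-Quads serializes naturally);
    # the provenance is attached to the graph name instead of a reified node.
    ds = ConjunctiveGraph()
    ds.get_context(EX.Statement1).add((WD.Q76, EX.someProperty, Literal("some value")))
    ds.get_context(EX.provenance).add((EX.Statement1, EX.accordingTo, URIRef("http://slashdot.org/")))

    print(g.serialize(format="turtle"))
    print(ds.serialize(format="nquads"))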
Re: [Wikidata-l] Provenance tracking on the Web with NIF-URIs
You do not need the full expressive power of SPARQL or graph querying -- what kind of query mechanism is Wikidata planning to support in later stages? I don't suppose the data model will be redesigned for that? So in that case you have to have queries in mind from the start of its design. Regarding scalability again: Long-term though it seems likely that native triplestores will have the advantage for performance. A difficulty with implementing triplestores over SQL is that although triples may thus be stored, implementing efficient querying of a graph-based RDF model (i.e. mapping from SPARQL) onto SQL queries is difficult. http://en.wikipedia.org/wiki/Triplestore#Implementation The above results indicate a superior performance of native stores like Sesame native, Mulgara and Virtuoso. This is in coherence with the current emphasis on development of native stores since their performance can be optimized for RDF. http://www.bioontology.org/wiki/images/6/6a/Triple_Stores.pdf On Jun 22, 2012 9:10 PM, Sebastian Hellmann hellm...@informatik.uni-leipzig.de wrote: Dear Martynas, as far as I understand it, Wikidata will not need to worry about named graphs or alike. IIRC Wikidata is building a fast software to edit facts and generate infoboxes. You do not need the full expressive power of SPARQL or graph querying. That is a different use case and can be done by exporting the data and loading it into a triple store/graph database. I would assume that the most efficient operation is to retrieve all data for one entity/entry/page? So the database needs to be optimized for lookup/update, not graph querying. In another mail you said that: Regarding scalability -- I can only see those possible cases: either Wikidata will not have any query language, or it's query language will be SQL with never ending JOINs too complicated to be useful, or it's gonna be another query language translated to SQL -- for example SPARQL, which is doable but attempts have shown it doesn't scale. A native RDF store is much more performant. Do you have a reference for this? I always thought it was exactly the opposite, i.e. SPARQL2SQL mappers performing better than native stores. Cheers, Sebastian On 06/22/2012 08:43 PM, Martynas Jusevičius wrote: It says deprecated on the Data model wiki. So maybe Wikidata doesn't need statement-level granularity? Maybe the named graph approach is good enough? But it's not based on statements. If you build this kind of data model on the relational, not to mention provenance, you will not be able to provide a reasonable query mechanism. That's the reason why the development of Jena's SDB store is pretty much abandoned. Martynas On Jun 22, 2012 8:18 PM, Sebastian Hellmann hellm...@informatik.uni-leipzig.de wrote: Denny didn't even use the word deprecated. Reification for statement-level provenance works, but you won't be able to sell it as an elegant solution to the problem. So could - yes , should - ?? - probably not If Wikidata is using statement-level provenance, there might be better ways to serialize it in RDF than reification in the future e.g. NQuads: http://sw.deri.org/2008/07/n-quads/ or JSON ;) For internal use I would discourage reification. If using a relational scheme, a statement id, which can be joined with another SQL table for provenance is the best way to do it imho. Before you are driving us all mad with explaining why reifiction is bad, I would really like you to justify why WikiData should consider reification. 
I really do not know many use case (if any) where reification is the right choice of modelling. Before going into the discussion any further [1], I think you should name an example where reification is really better than other options. All the best, Sebastian [1]http://ceur-ws.org/Vol-699/Paper5.pdf On 06/22/2012 06:20 PM, Martynas Jusevičius wrote: Denny, the statement-level of granularity you're describing is achieved by RDF reification. You describe it however as a deprecated mechanism of provenance, without backing it up. Why do you think there must be a better mechanism? Maybe you should take another look at reification, or lower your provenance requirements, at least initially? Martynasgraphity.org On Jun 22, 2012 5:20 PM, Denny Vrandečić denny.vrande...@wikimedia.de denny.vrande...@wikimedia.de wrote: Here's the use case: Every statement in Wikidata will have a URI. Every statement can have one more references. In many cases, the reference might be text on a website. Whereas it is always possible (and probably what we will do first) as well as correct to state: Statement1 accordingTo SlashDot . it would be preferable to be a bit more specific on that, and most preferably it would be to go all the way down to the sentence saying Statement1 accordingTo X . with X being a URI denoting the sentence that I mean in a specific Slashdot-Article. I would prefer