Re: [Wikidata-l] Data values
I don't understand why 1.6e-8 is absolutely necessary for sorting and comparison. PHP allows for the definition of custom sorting functions. If a custom datatype is defined, a custom sorting/comparison function can be defined too. Or am I missing some performance points?

On Wed, Dec 19, 2012 at 10:30 AM, Nikola Smolenski smole...@eunet.rs wrote:
> On 19/12/12 08:53, Gregor Hagedorn wrote:
>> I agree. What I propose is that the user interface supports entering and proofreading 10.6 nm as 10.6 plus n (= nano) plus meter. How the value is stored in the data property, whether as 10.6 floating point or as 1.6e-8 is a second issue -- the latter is probably preferable. I only intend to show that scientific values are not
>
> Perhaps both should be stored. 1.6e-8 is necessary for sorting and comparison. But 10.6 nm is how the user entered it, presumably how it was written in the source that the user used, how it is preferably used in the given field, and how other users would want to see it and edit it. As an example, human height is commonly given in centimetres, while building height is commonly given in metres. So, users will probably prefer to edit the tallest person as 282 cm and the lowest building as 2.1 m even though the absolute values are similar.
Re: [Wikidata-l] Data values
On 19.12.2012 11:56, Friedrich Röhrs wrote:
> I don't understand why 1.6e-8 is absolutely necessary for sorting and comparison. PHP allows for the definition of custom sorting functions. If a custom datatype is defined, a custom sorting/comparison function can be defined too. Or am I missing some performance points?

We are talking about searching and sorting millions of data entries - doing that in PHP would be extremely slow and would take far more memory than we have. It has to be done natively in the database. So we have to use a data representation that can be natively compared and sorted by the database (at the very least by MySQL, but ideally by many different database systems).

-- daniel
Daniel Kinzler, Softwarearchitekt
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
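[To illustrate why the comparison has to happen inside the database rather than in PHP, here is a minimal sketch. The table, column names and property id are hypothetical, not actual Wikibase schema; the point is that a range query over an indexed, normalized numeric column lets MySQL do the filtering and sorting, instead of PHP loading millions of rows and calling usort().]

<?php
// Hypothetical table: wb_quantity(entity_id, property_id, value_si),
// where value_si holds the quantity normalized to the SI base unit,
// and an index exists on (property_id, value_si).
$pdo = new PDO('mysql:host=localhost;dbname=wikidata', 'user', 'password');

$stmt = $pdo->prepare(
    'SELECT entity_id, value_si
       FROM wb_quantity
      WHERE property_id = :prop      -- e.g. the "population" property
        AND value_si > :threshold
   ORDER BY value_si DESC
      LIMIT 100'
);
$stmt->execute([':prop' => 1082, ':threshold' => 1000000]);

foreach ($stmt->fetchAll(PDO::FETCH_ASSOC) as $row) {
    printf("entity %d: %.0f\n", $row['entity_id'], $row['value_si']);
}

[If each value were stored in whatever unit the source used, no single index could answer such a query.]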
Re: [Wikidata-l] Data values
>> In addition to a storage option of the desired unit prefix (this may be considered an original prefix, since naturally re-users may wish to reformat this).
>
> I see no point in storing the unit used for input.

I think you plan to store the unit (which would be meter), so you don't want to store prefixes, correct? Please argue why you don't see a point. Do you want to give the size of the universe, the distance to New York and the size of the proton all in metres? If not, with which algorithm will you restore the SI prefix, or rather, recognize which SI prefix is usable? We do not use Mm in common language, so we give the circumference of the earth as roughly 40 000 km and not as 40 Mm. We don't write 4*10^7 m either.

>> it is probably necessary to store the number of significant decimals.
>
> That's how Denny proposed to calculate the default accuracy. If the accuracy is given by a complex model (e.g. a gamma distribution), then it might be handy to have a simple value that tells us the significant digits. Hm... perhaps it's best to always express accuracy as +/-n, and allow for more detailed information (standard deviation, whatever) as *additional* information about the accuracy (could be modelled as a qualifier internally).

I fear those are two separate ways of giving a measure of measurement _precision_ (I believe accuracy is the wrong term here; precision and accuracy are related but distinct concepts). So 4.10 means that the last digit is significant, i.e. the best estimate is at least between 4.095 and 4.105 (but it may be better). 4.10 +/- 0.005 means it is precisely between 4.095 and 4.105, as opposed to 4.10 +/- 0.004, 4.10 +/- 0.003, 4.10 +/- 0.002 etc. Furthermore, a quantity may be given as 4.10-4.20-4.35. The precision of measurement and the measure of variance and dispersion are separate concepts. I believe in the user interface this need not be any visible setting; simply the number of digits can be preserved.

>> Without these it is impossible to store and reproduce information like 10.20 nm; it would be returned as 1.02 10^-8 m.
>
> No, it would return using whatever system of measurement the user has selected in their preferences.

Then you have lost the information. There is no user selection in this in science. A complex heuristic may guess when to use the scientific SI prefixes instead. The trailing zero cannot be reproduced, however, when relying entirely on IEEE floating point.

> We'll need heuristics to pick the correct secondary unit (e.g. nm or km). The general rule could be to pick a unit so that the actual value is between 1 and 10, with some additional rules for dealing with cultural specialities (decimeter is rarely used, hectoliter however is pretty common; the decagram is commonly used in Austria only, etc).

(I believe there is no such thing as a secondary unit -- did you make that term up? Only m is a unit of measurement; the n or k are prefixes, see http://en.wikipedia.org/wiki/SI_prefix )

You would also need to know which prefix is applicable to which unit in which context. In a scientific context different prefixes are used than in a lay context. In a lay context astronomical temperatures may be given as degrees Celsius, in a scientific one as kelvin. This is not just a user preference. I agree that the system should allow explicit conversion in infoboxes. I disagree that you should create an artificial intelligence system for Wikidata that knows more about unit usage than the authors. To store the wisdom of authors, storing both unit and original unit prefix is necessary.
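[A tiny PHP illustration of the round-tripping problem described above, using only the 10.20 nm example from the text: once the entered string is reduced to a bare float in the base unit, neither the prefix nor the trailing zero survives.]

<?php
// The user enters "10.20 nm" - four significant digits, trailing zero included.
$entered = '10.20';
$stored  = (float)$entered * 1e-9;   // normalized to the base unit: 1.02E-8 (metres)

// The bare float carries no record of the nano prefix or of the digit count,
// so it can only be echoed back in base-unit scientific notation:
echo $stored, " m\n";                // prints "1.02E-8 m", not "10.20 nm"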
You write "The precision can be derived from the accuracy and vice versa, using appropriate heuristics." I _terribly strongly_ doubt that. Can you give any proof of that? For precision I can use statistics; for accuracy I need an indirect, separate and precise method to estimate it. If you have a laser distance-measurement device, the precision can be estimated by yourself by repeated measurements at various times, temperatures, etc. But unless you have an objective distance standard, you have no means to determine whether the accuracy of the device is always off by 10 cm because someone screwed up the software program inside the device. But they are not the same.

> IMHO, the accuracy should always be stored with the value, the precision never.

I fear that is a view of how data in a perfect world should be known, not a reflection of the kind of data that people need to store in Wikidata. Very often only the precision will be known or available to its authors, or worse, the source may not say which it is.

Gregor
Re: [Wikidata-l] Data values
Hey wikidatians, occasionally checking threads in this list like the current one, I get a mixed feeling: on one hand, it is sad to see the efforts and resources wasted as Wikidata tries to reinvent RDF, and now also triplestore design as well as XSD datatypes. What's next, WikiQL instead of SPARQL? On the other hand, it feels reassuring, as I was right to predict this:
http://www.mail-archive.com/wikidata-l@lists.wikimedia.org/msg00056.html
http://www.mail-archive.com/wikidata-l@lists.wikimedia.org/msg00750.html

Best, Martynas
graphity.org

On Wed, Dec 19, 2012 at 4:11 PM, Daniel Kinzler daniel.kinz...@wikimedia.de wrote:
> On 19.12.2012 14:34, Friedrich Röhrs wrote:
>> Hi, sorry for my ignorance, if this is common knowledge: What is the use case for sorting millions of different measures from different objects?
>
> Finding all cities with more than 10 inhabitants requires the database to look through all values for the property "population" (or even all properties with countable values, depending on implementation and query planning), compare each value with 10 and return those with a greater value. To speed this up, an index sorted by this value would be needed.
>
>> For cars there could be entries by the manufacturer, by some car-testing magazine, etc. I don't see how this could be adequately represented/sorted by a database-only query.
>
> If this cannot be done adequately on the database level, then it cannot be done efficiently, which means we will not allow it. So our task is to come up with an architecture that does allow this. (One way to allow scripted queries like this to run efficiently is to do this in a massively parallel way, using a map/reduce framework. But that's also not trivial, and would require a whole new server infrastructure.)
>
>> If however this is necessary, I still don't understand why it must affect the datavalue structure. If an index is necessary it could be done over a serialized representation of the value.
>
> Serialized can mean a lot of things, but an index on some data blob is only useful for exact matches; it can not be used for greater/lesser queries. We need to map our values to scalar data types the database can understand directly, and use for indexing.
>
>> This needs to be done anyway, since the values are saved at a specific unit (which is just a wikidata item). To compare them on a database level they must all be saved at the same unit, or some sort of procedure must be used to compare them (or am I missing something again?).
>
> If they measure the same dimension, they should be saved using the same unit (probably the SI base unit for that dimension). Saving values using different units would make it impossible to run efficient queries against these values, thereby defeating one of the major reasons for Wikidata's existence. I don't see a way around this.
>
> -- daniel
Re: [Wikidata-l] Data values
On 2012-12-19 15:11, Daniel Kinzler wrote:
> On 19.12.2012 14:34, Friedrich Röhrs wrote:
>> Hi, sorry for my ignorance, if this is common knowledge: What is the use case for sorting millions of different measures from different objects?
>
> Finding all cities with more than 10 inhabitants requires the database to look through all values for the property "population" (or even all properties with countable values, depending on implementation and query planning), compare each value with 10 and return those with a greater value. To speed this up, an index sorted by this value would be needed.

To be added by multiple simultaneous sorting operations.

>> For cars there could be entries by the manufacturer, by some car-testing magazine, etc. I don't see how this could be adequately represented/sorted by a database-only query.
>
> If this cannot be done adequately on the database level, then it cannot be done efficiently, which means we will not allow it. So our task is to come up with an architecture that does allow this. (One way to allow scripted queries like this to run efficiently is to do this in a massively parallel way, using a map/reduce framework. But that's also not trivial, and would require a whole new server infrastructure.)

Software developers are not allowed to just think of the status quo; they also have to think of the use cases the solution might be used for. There is e.g. the idea of pushing the monuments lists into Wikidata. In Austria alone there are 36,000-37,000 of those. Germany is much bigger but has a similar history, with probably an equal number per square kilometre. Sorting this by distance to a specific place needs to be done by the database. Everything else will be too inefficient.

>> If however this is necessary, I still don't understand why it must affect the datavalue structure. If an index is necessary it could be done over a serialized representation of the value.
>
> Serialized can mean a lot of things, but an index on some data blob is only useful for exact matches; it can not be used for greater/lesser queries. We need to map our values to scalar data types the database can understand directly, and use for indexing.

+1

>> This needs to be done anyway, since the values are saved at a specific unit (which is just a wikidata item). To compare them on a database level they must all be saved at the same unit, or some sort of procedure must be used to compare them (or am I missing something again?).
>
> If they measure the same dimension, they should be saved using the same unit (probably the SI base unit for that dimension). Saving values using different units would make it impossible to run efficient queries against these values, thereby defeating one of the major reasons for Wikidata's existence. I don't see a way around this.

IMHO this should be part of a model. E.g. altitudes are usually measured in metres or feet, never in km or yards. Distances have the same SI base unit but are also measured in km, depending on the use case. Maybe we should make a difference between internal usage and visualization. Comparing metres with kilometres and feet is quite difficult; rescaling everything for visualization is not.

Cheers, Marco
Re: [Wikidata-l] Data values
On 19/12/12 15:33, Nikola Smolenski wrote:
> On 19/12/12 12:23, Daniel Kinzler wrote:
>> I don't think we can sensibly support historical units with unknown conversions, because they cannot be compared directly to SI units. So, they couldn't be used to answer queries, can't be converted for display, etc - they aren't units in any sense the software can understand. This is a solvable problem, but would add a tremendous amount of complexity.
>
> Ah, but they could still be meaningfully compared to each other. And if an approximate conversion is known, this could still be used to make the conversion, so that the measure is converted and its uncertainty increased.

Just throwing more info here: there might also be cases where we could have multiple competing conversions. Somewhat similar to units, something that I would very much like to see is comparison of various monetary values, adjusted for inflation or exchange rate. But then you would have various estimates of inflation by various bodies and you might want to compare by either of them (or a combination of them?). Appropriate conversion might also depend on the item in question. For example, old censuses sometimes measure population not in people but in households. In some cases we might have an idea of how large a household is, to give an estimate of the population.
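[A rough sketch of the idea of converting with an approximate factor while increasing the uncertainty; all numbers are hypothetical, and relative uncertainties are simply added as a conservative first-order rule for a product.]

<?php
// A census figure recorded in households, converted to people with an uncertain
// average household size (hypothetical numbers, illustration only).
$households      = 1200.0;  $householdsErr      = 50.0;
$personsPerHouse = 5.0;     $personsPerHouseErr = 1.0;

$population    = $households * $personsPerHouse;
$relativeErr   = $householdsErr / $households + $personsPerHouseErr / $personsPerHouse;
$populationErr = $population * $relativeErr;

printf("about %.0f ± %.0f people\n", $population, $populationErr);  // about 6000 ± 1450 people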
Re: [Wikidata-l] Data values
On Wed, Dec 19, 2012 at 2:32 PM, Marco Fleckinger marco.fleckin...@wikipedia.at wrote:
> IMHO this should be part of a model. E.g. altitudes are usually measured in metres or feet, never in km or yards. Distances have the same SI base unit but are also measured in km, depending on the use case.

No, altitudes are sometimes measured in km, at least once you get beyond the Earth's surface.
From http://en.wikipedia.org/wiki/Hubble_Space_Telescope: Orbit height 559 km (347 mi).
From http://en.wikipedia.org/wiki/Olympus_Mons: Peak 21 km (69,000 ft) above datum.
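[The two figures above also illustrate the normalization argument made earlier in the thread: whatever unit the source uses, the stored value can be converted to the SI base unit so that a single indexed column stays comparable. A minimal sketch; the factor table and helper are illustrative, not actual Wikibase code.]

<?php
// Conversion factors from a few customary length units to the SI base unit (metre).
$toMetres = [
    'm'  => 1.0,
    'km' => 1000.0,
    'ft' => 0.3048,
    'mi' => 1609.344,
];

function normalizeLength(float $amount, string $unit, array $factors): float {
    if (!isset($factors[$unit])) {
        throw new InvalidArgumentException("Unknown unit: $unit");
    }
    return $amount * $factors[$unit];
}

// 559 km and 69,000 ft end up in the same column and are directly comparable:
var_dump(normalizeLength(559, 'km', $toMetres));    // float(559000)
var_dump(normalizeLength(69000, 'ft', $toMetres));  // float(21031.2)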
Re: [Wikidata-l] Data values
On 19.12.2012 15:32, Marco Fleckinger wrote:
> Maybe we should make a difference between internal usage and visualization. Comparing metres with kilometres and feet is quite difficult; rescaling everything for visualization is not.

Not maybe. Definitely. Visualization is based on user preference, interface language, and heuristics for picking a decent unit based on dimension and accuracy. The internal representation should use the same unit for all quantities of a given dimension.

-- daniel
Daniel Kinzler, Softwarearchitekt
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
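[A toy version of the kind of display heuristic mentioned here, assuming lengths stored in metres; this is an illustration, not actual Wikibase code: pick an SI prefix so that the displayed number lands in a readable range.]

<?php
function formatLength(float $metres): string {
    // Candidate prefixes, largest first; pick one that puts the number between 1 and 1000.
    $prefixes = [9 => 'G', 6 => 'M', 3 => 'k', 0 => '', -3 => 'm', -6 => 'µ', -9 => 'n'];
    foreach ($prefixes as $exp => $prefix) {
        $scaled = $metres / (10 ** $exp);
        if ($scaled >= 1 && $scaled < 1000) {
            return sprintf('%g %sm', $scaled, $prefix);
        }
    }
    return sprintf('%E m', $metres);   // fallback: scientific notation
}

echo formatLength(1.06e-8), "\n";     // "10.6 nm"
echo formatLength(40075000), "\n";    // "40.075 Mm" - exactly the case where a per-dimension
                                      // or per-property override would have to prefer "40 075 km"

[The second result shows why a purely magnitude-based heuristic is not enough: the mathematically neatest prefix is not necessarily the customary one.]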
Re: [Wikidata-l] Data values
On 19.12.2012 15:26, Avenue wrote:
> What about the North and South Poles? I'm sure standard coordinate systems have a convention for representing them. Won't we need lots of units that are not SI units (e.g. base pairs, IQ points, Scoville heat units, $ and €) and can't readily be translated into them? Why would historical units with unknown conversions pose any more problem than these?

These all pose the same problems, correct. At the moment, I'm very unsure about how to accommodate these at all. Maybe we can have them as custom units, which are fixed for a given property, and can not be converted.

-- daniel
Daniel Kinzler, Softwarearchitekt
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
Re: [Wikidata-l] Data values
On 19 December 2012 15:11, Daniel Kinzler daniel.kinz...@wikimedia.de wrote:
> If they measure the same dimension, they should be saved using the same unit (probably the SI base unit for that dimension). Saving values using different units would make it impossible to run efficient queries against these values, thereby defeating one of the major reasons for Wikidata's existence. I don't see a way around this.

Daniel confirms (in separate mail) that Wikidata indeed intends to convert any derived SI units to a common formula of base units. Example: a quantity like 1013 hectopascal, the common unit for meteorological barometric pressure (this used to be millibar), would be stored and re-displayed as 1.013 10^5 kg⋅m⁻¹⋅s⁻².

I see several problems with this approach:

1. Many base units are little known. kg⋅m²⋅s⁻³⋅A⁻² for ohm... It breaks communication with humans curating data on Wikidata. It will make it very difficult to compare data entered into Wikidata for correctness, because the data displayed after saving will have little relation with the data entered. This makes Wikidata inherently unsuitable for an effort like Wikipedia, with many authors and the reliance on fact checking.

2. Even for standard base units, there is often a 1:n relation. E.g. both gray and sievert have the same base unit. The base unit for lumen is candela (because the steradian is not a unit, but part of the derived unit applicability definition).

Gregor
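[Gregor's second point can be made concrete by writing each unit as a vector of base-unit exponents, which is presumably what "a common formula of base units" would amount to; the exponents below are standard physics, the array layout is only an illustration.]

<?php
// Exponents of the SI base units [kg, m, s, A] for a few derived units.
$baseUnitExponents = [
    'pascal'  => [1, -1, -2,  0],   // kg·m⁻¹·s⁻²
    'gray'    => [0,  2, -2,  0],   // J/kg = m²·s⁻²  (absorbed dose)
    'sievert' => [0,  2, -2,  0],   // J/kg = m²·s⁻²  (dose equivalent)
    'ohm'     => [1,  2, -3, -2],   // kg·m²·s⁻³·A⁻²
];

// Gray and sievert are physically different quantities, yet reduce to the same formula:
var_dump($baseUnitExponents['gray'] === $baseUnitExponents['sievert']);   // bool(true)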
Re: [Wikidata-l] Data values
> I don't think we can sensibly support historical units with unknown conversions, because they cannot be compared directly to SI units. So, they couldn't be used to answer queries, can't be converted for display, etc - they aren't units in any sense the software can understand. This is a solvable problem, but would add a tremendous amount of complexity.

I get the feeling that I might be the only person on this thread that doesn't have a maths/sciences/computers background. I'm going to be frank here: We need to snap out of the mindset that all of the data we're collecting is going to be easily expressible using modern scientific units and methodologies. If we try and cram everything into a small number of common units, without giving the users some method of expressing non-standard/uncommon/non-scientific values, we're going to have a massive database that is going to at best be cumbersome and at worst be useless for a great deal of information.

Traditional Chinese units of measurement [1] have changed their actual value over time. A li in one century is not as long as it is in another century, and while there is a li-to-SI conversion, it's artificial: when we try to use the modern li to measure something, we get a different value for that thing than the historically documented li value states it should be.

There is a balance. The more flexible the parameters, the easier it is to put data in, but the harder it is for computers to make useful connections with it. I'm not sure how to handle this, but I am sure that we can't just keep pretending that all of the data we're going to collect falls nicely into the metric system. Reality just doesn't work that way, and for Wikidata to be useful, we can't discount data that doesn't fit in the mold of modern units.

Sven

[1] http://en.wikipedia.org/wiki/Chinese_units_of_measurement
Re: [Wikidata-l] Data values
On 2012-12-19 16:56, Daniel Kinzler wrote:
> On 19.12.2012 16:47, Gregor Hagedorn wrote:
>> Daniel confirms (in separate mail) that Wikidata indeed intends to convert any derived SI units to a common formula of base units. Example: a quantity like 1013 hectopascal, the common unit for meteorological barometric pressure (this used to be millibar), would be stored and re-displayed as 1.013 10^5 kg⋅m⁻¹⋅s⁻².
>
> Converted and stored, yes, but not displayed. For display, it would be converted to a customary/convenient unit according to the user's (or client site's) locale, using a bit of heuristics to get the scale (order of magnitude) right. Of course, in wikitext, the desired output unit can be specified.

Actually we have 3 different use cases of values:
1. Internally in the database
2. On wikidata.org
3. On other projects like WP and also WM-external projects

SI shall be used internally (1). On 2 the user can decide what he wants. On 3 either some standard setting of the MW project says what is desired, or the article's author does. Via the API (also 3) you should be able to choose:
* precision
* displaying of accuracy
* unit

-- Marco
Re: [Wikidata-l] Data values
Martynas, could you please let me know where RDF or any of the W3C standards covers topics like units, uncertainty, and their conversion. I would be very much interested in that.

Cheers, Denny

2012/12/19 Martynas Jusevičius marty...@graphity.org:
> Hey wikidatians, occasionally checking threads in this list like the current one, I get a mixed feeling: on one hand, it is sad to see the efforts and resources wasted as Wikidata tries to reinvent RDF, and now also triplestore design as well as XSD datatypes. What's next, WikiQL instead of SPARQL? On the other hand, it feels reassuring, as I was right to predict this:
> http://www.mail-archive.com/wikidata-l@lists.wikimedia.org/msg00056.html
> http://www.mail-archive.com/wikidata-l@lists.wikimedia.org/msg00750.html
>
> Best, Martynas
>
> [...]

--
Project director Wikidata
Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
Tel. +49-30-219 158 26-0 | http://wikimedia.de
Re: [Wikidata-l] Data values
> These all pose the same problems, correct. At the moment, I'm very unsure about how to accommodate these at all. Maybe we can have them as custom units, which are fixed for a given property, and can not be converted.

I think the proposal to use Wikidata items for the units (that is, both base and derived SI as well as imperial units/US customary units) is most sensible. Let people use the units they need. Then write software that picks up the units that people use (after verifying common and correct use) by means of their Wikidata item ID. With successive versions of Wikidata, pick up more and more of these and make them available for conversion. This way Wikidata will become what is needed.

I fear the discussion presently is about anticipating the needs of the next years and not allowing any data into Wikidata that have not been foreseen. There may be a way that Wikidata can have enough artificial intelligence to predict which unit prefixes are usable in common topics versus scientific topics, and which units shall be used. Where megaton is used (TNT of atomic bombs) and where 10^x tons are preferred (shipping). And that the base unit for weight is the kilogram, but for gold in a certain value range the ounce may be preferred, and gemstones and pearls in carat (http://en.wikipedia.org/wiki/Carat_(unit) ). But I believe forcing Wikidata to solve that problem first and ignoring the wisdom of the users is the wrong path.

Modelling Wikidata on the feet-versus-metre and Fahrenheit-versus-Celsius problem, where US citizens have a different personal preference, is misleading. The issue is much more complex.

Gregor
Re: [Wikidata-l] Data values
On 19.12.2012 16:41, Marco Fleckinger wrote:
> I assume there's a table for usual units for different purposes. E.g. altitudes are displayed in m and ft. Out of that, one of those is chosen by the user's locale setting. My locale setting would be kind of metric system, therefore it will be displayed in m on my wikidata surface. On enwiki it will probably be displayed in ft.

I'd have thought that we'd have one such table per dimension (such as length or weight). It may make sense to override that on a per-property basis, so 2300m elevation isn't shown as 2.3km. Or that can be done in the template that renders the value.

> My suggestion would be:
> * Somebody types in 4.10, so 4.10 will be saved. There is no accuracy available, so n/a is saved for the accuracy, or even the javascript way could be used, which will be undefined (because not mentioned). Retrieving this will result in 4.10 or {value:4.10}.

What is saved would depend on unit conversion; the value actually stored in the database would be in a base unit. In addition, the input's precision would be used to derive the value's accuracy: entering 4.10m will make the accuracy default to 10cm (+/- 5cm).

>> Furthermore, a quantity may be given as 4.10-4.20-4.35. The precision of measurement and the measure of variance and dispersion are separate concepts.
>
> Hm, somewhere in the scope of mechanical engineering there are also ± values where the tolerances up and down differ from each other. E.g. it should be 11.2, but it may be 11.1 or 11.35.

I'd suggest to store such additional information in a Qualifier instead of the Data Value itself.

>> I fear that is a view of how data in a perfect world should be known, not a reflection of the kind of data that people need to store in Wikidata. Very often only the precision will be known or available to its authors, or worse, the source may not say which it is.
>
> I think this is a matter of Wikidata definitions. For years now, precision has been used for the number of digits behind the comma. Now we need another word for expressing how accurate a value is. Therefore: do we have a glossary?

Indeed we do: https://wikidata.org/wiki/Wikidata:Glossary

I use precision exactly like that: significant digits when rendering output or parsing input. It can be used to *guess* at the value's accuracy, but it is not the same.

-- daniel
Daniel Kinzler, Softwarearchitekt
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
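[One plausible reading of "the input's precision would be used to derive the value's accuracy" is half of the last entered digit, following the 4.10 example quoted above. A minimal sketch, illustrative only and not the actual parser:]

<?php
function defaultAccuracy(string $input): float {
    $dot      = strpos($input, '.');
    $decimals = ($dot === false) ? 0 : strlen($input) - $dot - 1;
    return 0.5 * (10 ** -$decimals);   // half of the last entered digit
}

echo defaultAccuracy('4.10'), "\n";    // 0.005 -> "4.10" read as 4.10 +/- 0.005
echo defaultAccuracy('282'), "\n";     // 0.5   -> "282 cm" read as 282 +/- 0.5 cm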
Re: [Wikidata-l] Data values
When we speak about dimensions, we talk about properties, right? So when I define the property "height of a person" as an entity, I would supply the SI unit (m) and the SI multiple (-2, cm) that it should be saved in (in the database). When someone then inputs the height in meters (e.g. 1.86m) it would be converted to the matching SI multiple before being saved (i.e. 186 (cm)). On the database side each SI multiple would get its own table so that indexes can easily be made. Depending on which multiple we choose in the property, the datavalue would be saved to a different table. Did I get the idea correctly?

On Wed, Dec 19, 2012 at 4:47 PM, Gregor Hagedorn g.m.haged...@gmail.com wrote:
> [...]
Re: [Wikidata-l] Data values
On Wed, 19 Dec 2012, Denny Vrandečić wrote:
> Martynas, could you please let me know where RDF or any of the W3C standards covers topics like units, uncertainty, and their conversion. I would be very much interested in that.

NIST has created a standard in OWL: QUDT - Quantities, Units, Dimensions and Data Types in OWL and XML: http://www.qudt.org/qudt/owl/1.0.0/index.html

I fully share Martynas' concerns: most of the problems that are being discussed in this thread (and that are very relevant and interesting) should not be solved with an object-oriented approach (that is, via properties of objects, and inheritance) but by semantic modelling (that is, composition of knowledge). For example, one single database representation of a unit can have multiple displays depending on who wants to see the unit, and in which context; the viewer and the context are rather simple to add via semantic primitives. For example, the Topic Map semantic standard would fit here very well, in my opinion: http://en.wikipedia.org/wiki/Topic_map

> Cheers, Denny

Herman

> 2012/12/19 Martynas Jusevičius marty...@graphity.org
> [...]
Re: [Wikidata-l] Data values
>> it is probably necessary to store the number of significant decimals.
>
> Yes, that *is* the accuracy value I mean.

Daniel, please use correct terms. Accuracy is a defined concept, and although by convention it may be roughly expressed by using the number of significant figures, that is not the same concept. Without additional information you cannot infer backwards whether usage of significant figures expresses accuracy or precision. See http://en.wikipedia.org/wiki/Accuracy_and_precision

> Ok, there's some terminology confusion here. I'm using "accuracy" to refer to the accuracy of measurement (e.g. standard deviation), and "precision" to refer to the precision of presentation (e.g. significant digits). We need these two things at least, and words for them. I don't care much which words we use.

I do. And I think it is important for Wikidata to precisely express what it wants to achieve. Accuracy has nothing to do with s.d., which is a measure of dispersion. You can have an accuracy of +/- 10 measured with a precision of +/- 0.1 (and a standard deviation for the population of objects that you have measured of 2).

>> So 4.10 means that the last digit is significant, i.e. the best estimate is at least between 4.095 and 4.105 (but it may be better). 4.10 +/- 0.005 means it is precisely between 4.095 and 4.105, as opposed to 4.10 +/- 0.004, 4.10 +/- 0.003, 4.10 +/- 0.002 etc.
>
> Yes, all this should be handled by the component responsible for parsing user input for quantity values.

But it cannot be, because you have lost the information. I don't know whether +/- 0.005 indicates significant figures/digits or whether it is an exact precision_or_accuracy interval. I think this may become clearer if you consider a value entered in inches: 1.20 inches. You convert: 1.20 +/- 0.05 in = 3.048 10^-2 m +/- 1.27 10^-3 m. If this is the only information stored, I have no information left whether I should display 3.048 or 3.0480, and whether the information +/- 1.27 10^-3 m is meaningful (no) or an artifact of conversion (yes).

> It can be stored as an auxiliary data point, that is, as a qualifier (measured in feet). It should not IMHO be part of the data value as such, because that would make it extremely hard to use the values in a database.

You are correct insofar as I propose you need to store two units: the normalized one (SI units only, and no prefix - and even though the SI base unit is kg, I would store gram) and the original one plus the original unit prefix. If you do that, you can store the value in a single normalized unit, provided you back-convert it prior to display in Wikidata. I don't think the original unit is a meaningless qualifier; it is vital information for context.

Gregor
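[Working through the inches figures quoted above in PHP; 0.0254 m per inch is exact by definition, everything else is just the example from the text.]

<?php
$metresPerInch = 0.0254;                       // exact by definition
$value         = 1.20 * $metresPerInch;        // 0.03048 m  = 3.048 × 10⁻² m
$uncertainty   = 0.05 * $metresPerInch;        // 0.00127 m  = 1.27 × 10⁻³ m

// Stored only as (0.03048 m ± 0.00127 m), nothing tells a re-user whether to render
// "3.048 cm", "3.0480 cm" or "1.20 in"; the ±1.27e-3 m is an artifact of conversion,
// not something the source ever stated.
printf("%.5f m ± %.5f m\n", $value, $uncertainty);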
Re: [Wikidata-l] Data values
On 19 December 2012 17:03, Daniel Kinzler daniel.kinz...@wikimedia.de wrote:
> I'd have thought that we'd have one such table per dimension (such as length or weight). It may make sense to override that on a per-property basis, so 2300m elevation isn't shown as 2.3km. Or that can be done in the template that renders the value.

Here and in the entire discussion I fear that the need to support data curation - checking Wikidata data for correctness - is not sufficiently in the focus. If someone enters the height of a mountain in feet and I see the converted value in metres in my Wikidata preferences-converted view, I will "correct" the seemingly senseless and unjustified precision of three digits after the metre. Only if we understand in which unit the data were originally valid will we be able to successfully communicate and collaborate.

Yes, Wikidata shall store a normalized version of the value, but it also needs to store an original one. Whether it needs to store the value twice I am not sure; I believe not. If it stores the original prefix, original unit and original significant digits, it can generally recreate the original form. I know that there are some pitfalls with IEEE numbers in this, and it may be safer to store the original number as well initially (and perhaps drop it later when enough data are available to test the effects).

Of course, Wikipedias can use the API to display the value in any other form, just as they like, but that does not solve the problem of data curation on Wikidata (which includes the data curation by Wikipedia authors).

Gregor
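[A minimal sketch of the storage Gregor argues for - field names are hypothetical, not Wikibase's actual data model: keep the normalized amount for querying, plus the original prefix, unit and significant-digit count so the entered form can be recreated.]

<?php
// Hypothetical stored record for a value entered as "10.20 nm":
$value = [
    'amountSi'          => 1.02e-8,   // normalized to the SI base unit (metre), used for queries/indexes
    'unit'              => 'metre',
    'originalPrefix'    => 'nano',
    'significantDigits' => 4,
];

// Recreate the original form from the stored fields (metre-only, for brevity):
$prefixes = ['nano' => [-9, 'n'], 'micro' => [-6, 'µ'], 'milli' => [-3, 'm'], '' => [0, ''], 'kilo' => [3, 'k']];
[$exp, $symbol] = $prefixes[$value['originalPrefix']];

$scaled   = $value['amountSi'] / (10 ** $exp);                // 10.2
$intLen   = strlen((string)(int)floor(abs($scaled)));         // 2 digits before the decimal point
$decimals = max(0, $value['significantDigits'] - $intLen);    // 4 - 2 = 2

echo number_format($scaled, $decimals), ' ', $symbol, 'm', "\n";   // "10.20 nm"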
Re: [Wikidata-l] Data values
Denny, you're sidestepping the main issue here -- every sensible architecture should build on as many previous standards as possible, and build its own custom solution only if a *very* compelling reason is found to do so instead of finding a compromise between the requirements and the standard. Wikidata seems to be constantly doing the opposite -- building a custom solution with whatever reason, or even without one. This drives compatibility and reuse towards zero.

This thread originally discussed datatypes for values such as numbers, dates and their intervals -- semantics for all of those are defined in XML Schema Datatypes: http://www.w3.org/TR/xmlschema-2/ All the XML and RDF tools are compatible with XSD, however I don't think there is even a single mention of it in this thread? What makes Wikidata so special that its datatypes cannot build on XSD? And this is only one of the issues; I've pointed out others earlier.

Martynas
graphity.org

On Wed, Dec 19, 2012 at 5:58 PM, Denny Vrandečić denny.vrande...@wikimedia.de wrote:
> Martynas, could you please let me know where RDF or any of the W3C standards covers topics like units, uncertainty, and their conversion. I would be very much interested in that.
>
> [...]
Re: [Wikidata-l] Data values
My philosophy is this: We should do whatever works best for Wikidata and Wikidata's needs. If people want to reuse our content, and the choices we've made make existing tools unworkable, they can build new tools themselves. We should not be clinging to what's been done already if it gets in the way of what will make Wikidata better. Everything that we make and do is open, including the software we're going to operate the database on. Every WMF project has done things differently from the standards of the time, and people have developed tools to use our content before. Wikidata will be no different in that regard.

Sven

On Wed, Dec 19, 2012 at 12:27 PM, Martynas Jusevičius marty...@graphity.org wrote:
> Denny, you're sidestepping the main issue here -- every sensible architecture should build on as many previous standards as possible, and build its own custom solution only if a *very* compelling reason is found to do so instead of finding a compromise between the requirements and the standard.
>
> [...]
Re: [Wikidata-l] Data values
Martynas, I think you misinterpret the thread. There is no discussion about not building on the datatypes defined in http://www.w3.org/TR/xmlschema-2/

What we are discussing is compositions of elements, all typed to XML datatypes, that shall be able to express scientific and engineering requirements as to statistics and significant digits (except perhaps for duration, none of the data types in http://www.w3.org/TR/xmlschema-2/ supports that), as well as means to express uncertainty and confidence intervals. Many existing XML schemata define such compositions, all squarely built on http://www.w3.org/TR/xmlschema-2/ - Wikidata is certainly not unique in this effort. If you can point the team to further well-reviewed solutions, this would be very useful.

Gregor
Re: [Wikidata-l] Data values
I suspect what Martynas is driving at is that XML Schema defines **FACETS** for its datatypes - accepting those as a baseline, and then extending them to your requirements, is a reasonable, community-oriented process. However, wrapping oneself in the flag of open development is to me unresponsive to a simple plea to stand on the shoulders of giants gone before, to act in a responsible manner cognizant of the interests of the broader community. And personally I have to say I don't like the word "clinging" -- clearly a red flag meant to inflame if not insult. This is no place for that!

On 19.12.2012 09:47, Sven Manguard wrote:
> My philosophy is this: We should do whatever works best for Wikidata and Wikidata's needs. If people want to reuse our content, and the choices we've made make existing tools unworkable, they can build new tools themselves.
>
> [...]
Re: [Wikidata-l] Data values
Wow, what a long thread. I was just about to chime in to agree with Sven's point about units when he interjected his comment about blithely ignoring history, so I feel compelled to comment on that first. It's fine to ignore standards *for good reasons*, but doing it out of ignorance or gratuitously is just silly. Thinking that WMF is so special it can create a better solution without even knowing what others have done before is the height of arrogance.

Modeling time and units can basically be made arbitrarily complex, so the trick is in achieving the right balance of complexity vs. utility.

Time is complex enough that I think it deserves its own thread. The first thing I'd do is establish some definitions to cover some basics like durations/intervals, uncertain dates, unknown dates, imprecise dates, etc., so that everyone is using the same terminology and concepts. Much of the time discussion is difficult for me to follow because I have to guess at what people mean. In addition to the ability to handle circa/about dates already mentioned, it's also useful to be able to represent before/after dates, e.g. "he died before 1 Dec 1792 when his will was probated". Long term I suspect you'll need support for additional calendars rather than converting everything to a common calendar, but only supporting Gregorian is a good way to limit complexity to start with. Geologic times may (probably?) need to be modeled differently.

Although I disagree strongly with Sven's sentiments about the appropriateness of reinventing things, I believe he's right about the need to support more units than just SI units and to know what units were used in the original measurement. It's not just a matter of aesthetics but of being able to preserve the provenance. Perhaps this gets saved for a future iteration, but you may find that you need both display and computable versions of things stored separately. Speaking of computable versions, don't underestimate the issues with using floating-point numbers. There are numbers that they just can't represent, and their range is not infinite.

Historians and genealogists have many interminable discussions about date/time representation which can be found in various list archives, but one recent spec worth reviewing is Extended Date/Time Format (EDTF): http://www.loc.gov/standards/datetime/pre-submission.html

Another thing worth looking at is the Freebase schema, since it not only represents a bunch of this stuff already, but it's got real-world data stored in the schema and user interface implementations for input and rendering (although many of the latter could be improved). In particular, some of the following might be of interest:
http://www.freebase.com/view/measurement_unit / http://www.freebase.com/schema/measurement_unit
http://www.freebase.com/schema/time
http://www.freebase.com/schema/astronomy/celestial_object_age
http://www.freebase.com/schema/time/geologic_time_period
http://www.freebase.com/schema/time/geologic_time_period_uncertainty

If you rummage around, you can probably find lots of interesting examples and decide for yourself whether or not that's a good way to model things. I'm reasonably familiar with the schema and happy to answer questions. There are probably lots of other example vocabularies that one could review, such as the Pleiades project's: http://pleiades.stoa.org/vocabularies

You're not going to get it right the first time, so I would just start with a small core that you're reasonably confident in and iterate from there.
Tom
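Tom's caveat about floating-point numbers is easy to demonstrate; the following is only an illustration of IEEE 754 double behaviour in Python, not anything from the Wikidata codebase:

    # Floating-point pitfalls relevant to storing measured values as doubles.
    from decimal import Decimal

    # Many decimal values have no exact binary representation:
    print(0.1 + 0.2 == 0.3)                  # False
    print(Decimal(0.1))                      # 0.1000000000000000055511151231257827...

    # Large integers silently lose precision once they exceed 2**53:
    print(float(2**53) == float(2**53 + 1))  # True -- the two values collapse

    # The range is finite too:
    print(float("1e400"))                    # inf

    # And the trailing zero that carries significance in "10.20" is not
    # representable at all; the float remembers only the magnitude:
    print(float("10.20"))                    # 10.2

This is one reason for Tom's suggestion above that display and computable versions of a value may need to be stored separately.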
Re: [Wikidata-l] Reusing Languages Translation (was: Data values)
It would be much easier if this could be done automatically, so everybody could set their preferred data system, SI or CGS or whatever. Sk!d ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
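What such an automatic, preference-driven rendering could look like in the simplest case is sketched below; the conversion table is a hypothetical stub, and real unit handling (prefixes, compound units, non-metric systems) is considerably more involved:

    # Sketch: store one normalized value, derive the reader's preferred
    # system (SI vs. CGS) from a preference setting at display time.
    CGS_FACTORS = {"m": ("cm", 100.0), "kg": ("g", 1000.0), "N": ("dyn", 1e5)}

    def render(value: float, si_unit: str, preferred_system: str = "SI") -> str:
        if preferred_system == "CGS" and si_unit in CGS_FACTORS:
            cgs_unit, factor = CGS_FACTORS[si_unit]
            return f"{value * factor:g} {cgs_unit}"
        return f"{value:g} {si_unit}"

    print(render(1.85, "m"))          # '1.85 m'
    print(render(1.85, "m", "CGS"))   # '185 cm'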
Re: [Wikidata-l] .name = text property
Using the dotted notation, XSD datatype facets such as below can be specified easily as properties using a simple colon:

Property: .anyType:equal - (sameAs equivalent) redirect to page/object with actual numeric value
Property: .anyType:ordered - a boolean property
Property: .anyType:bounded - a boolean property
Property: .anyType:cardinality - a boolean property
Property: .anyType:numeric - a boolean property
Property: .anyType:length - number of chars allowed for value
Property: .anyType:minLength - min nbr of chars for value
Property: .anyType:maxLength - max nbr of chars for value
Property: .anyType:pattern - regex string
Property: .anyType:enumeration - specified values comprising value space
Property: .anyType:whiteSpace - preserve or replace or collapse
Property: .anyType:maxExclusive - number for an upper bound
Property: .anyType:maxInclusive - number for an upper bound
Property: .anyType:minExclusive - number for a lower bound
Property: .anyType:minInclusive - number for a lower bound
Property: .anyType:totalDigits - number of total digits
Property: .anyType:fractionDigits - number of digits in the fractional part of a number

An anonymous object is used to represent namespace-qualified (text url) values, e.g. rdf:about:

Property: .:rdf:about - this is a .url value for an RDF about property for a page/object
Property: .:skos:prefLabel - this is a .name value for a page/object

I suggest that properties for precision can be found in the XSD facets above. - john

On 19.12.2012 12:41, jmccl...@hypergrove.com wrote: Here's a suggestion. Property names for numeric information seem to be on the table -- these should be viewed systematically, not haphazardly. If all text properties had a dotted lower-case name, life would be simpler in SMW land all around, and maybe Wikidata land too. All page names have an initial capital as a consequence of requiring all text properties to be named with an initial period followed by a lower-case letter. The SMW tool mandates the properties from which all derive: .text, .string and .number are basic (along with others like .page). Then, strings have language-based subproperties and number expression subproperties, and numbers have XSD datatype subproperties, which in turn have SI unit type subproperties, and so on. Here's a Consolidated Listing of ISO 639, ISO 4217, SI Measurement Symbols, and World Time Zones [2] [1] to illustrate that it is possible to create a unified string-numeric-type property name dictionary across a wide swath of the standards world. The document lists a few overlapping symbols that were then re-assigned to another symbol. Adopting a dotted-name text-property naming convention can segue to easier user interfaces too, for query forms at least, plus impacts exploited by an SMW query engine.
What is meant by these expressions seems pretty natural to most people:

Property: Height - the value is a wiki pagename or objectname for a height numeric object
Property: .text - (on Height) the value is text markup associated with the Height object
Property: .string - (on Height) the value is text non-markup data for the Height object
Property: .ft - (on Height) the value is the number of feet associated with the Height object
Property: Height.text - the value is text markup associated with an anonymous Height object
Property: Height.string - the value is a string property of an anonymous Height object
Property: Height.ft - the value is a feet property of an anonymous Height object

[1] http://www.hypergrove.com/Publications/Symbols.html

___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l

Links: -- [1] https://lists.wikimedia.org/mailman/listinfo/wikidata-l [2] http://www.hypergrove.com/Publications/Symbols.html

___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
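For readers wondering how such facet properties would actually constrain values, here is a rough sketch of a facet check; the facet names follow XML Schema Part 2, but the dotted-property syntax, the validator and the example property are hypothetical, not existing SMW or Wikidata code:

    # Sketch: validate a lexical value against a handful of XSD-style facets.
    import re
    from decimal import Decimal

    def check_facets(value: str, facets: dict) -> list:
        """Return the names of the facets that the value violates."""
        errors = []
        if "pattern" in facets and not re.fullmatch(facets["pattern"], value):
            errors.append("pattern")
        d = Decimal(value)
        if "minInclusive" in facets and d < Decimal(str(facets["minInclusive"])):
            errors.append("minInclusive")
        if "maxInclusive" in facets and d > Decimal(str(facets["maxInclusive"])):
            errors.append("maxInclusive")
        parts = d.as_tuple()
        if "totalDigits" in facets and len(parts.digits) > facets["totalDigits"]:
            errors.append("totalDigits")
        if "fractionDigits" in facets and -parts.exponent > facets["fractionDigits"]:
            errors.append("fractionDigits")
        return errors

    # A hypothetical "Height.ft" property limited to 0..1000 feet, 2 decimals:
    height_facets = {"minInclusive": 0, "maxInclusive": 1000,
                     "fractionDigits": 2, "pattern": r"\d+(\.\d+)?"}
    print(check_facets("282.5", height_facets))     # []
    print(check_facets("1250.123", height_facets))  # ['maxInclusive', 'fractionDigits']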
Re: [Wikidata-l] Data values
On 19 December 2012 20:01, jmccl...@hypergrove.com wrote: Hi Gregor - the root of the misconception I likely have about significant digits and the like, is that such is one example of a rendering parameter not a semantic property.

It is about semantics, not formatting. In science and engineering, the number of significant digits is not used to right-align numbers, but to semantically indicate the order of magnitude of the accuracy and/or precision of a measurement or quantity. Thus, the weight of a machine can be given as 1.2 t (exact to +/- 50 kg), 1200 kg (+/- 1 kg), or 1200.000 kg. This is not part of IEEE floating-point numbers, which always have the same type-dependent precision or number of significant digits, regardless of whether this is semantically justified or not. An IEEE 754 standard double always has about 16 decimal significant digits, i.e. it cannot distinguish the value 1.2 tons from 1.200 tons. This is good for calculations, but lacks the information needed for final rounding.

Gregor

___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
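A small sketch of what Gregor describes: carrying the number of significant digits alongside the unit-normalized double makes the intended rounding reproducible. The class and field names are illustrative, not a proposed Wikidata data model:

    # Sketch: a double plus an explicit count of significant digits.
    from dataclasses import dataclass

    @dataclass
    class Quantity:
        value: float        # normalized to the base unit (here: kilograms)
        sig_digits: int     # how many digits are semantically justified
        unit: str = "kg"

    def render(q: Quantity) -> str:
        # Format with the recorded significant digits, information a bare
        # IEEE 754 double cannot carry on its own.
        return f"{q.value:.{q.sig_digits}g} {q.unit}"

    print(render(Quantity(1200.0, 2)))   # '1.2e+03 kg' -- cf. "1.2 t (+/- 50 kg)"
    print(render(Quantity(1200.0, 4)))   # '1200 kg'
    print(render(Quantity(1200.0, 7)))   # '1200.000 kg'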
Re: [Wikidata-l] Data values
totally agree - hopefully XSD facets provide a solid start to meeting those concrete requirements - thanks.

___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] Data values
For me the question is how to name the precision information. Do not the XSD facets totalDigits and fractionDigits work well enough? I mean:

.number:totalDigits contains a positive power of ten for precision
.number:fractionDigits contains a negative power of ten for precision

The use of the word datatype is always interesting, as somehow it's meant to be organically different from the measurement to which it's related. Both are resources with named properties - what are those names? Certain property names derived from international standards should be considered built in to whatever foundation the implementing tool provides. I suggest that XSD names be used at least for concepts that appear to be the same, with or without the xsd: xml-namespace prefix.

But the word datatype fascinates me even more ever since SMW internalized the Datatype namespace. Because to me RDF made an error back when the rdf:type property got the range Class, when it should have been Datatype (though politics got in the way!) It gets more twisted, as now Category is the chosen implementation of rdfs:Class. The problem that presents is that categories are lists, while a class (that is, an rdf:type value) is for some a singular and for others a plural concept or label. Pure semantic mayhem. I'm happy SMW internalized the datatype namespace to the extent it maps to its software, chiefly because it clarifies that a standard Type namespace is needed -- which contains singular noun phrases -- which is the value range for rdf:type (if you will) properties. All Measurement types (e.g. Feet, Height, Lumens) would be represented there too, like any other class, with their associated properties that (in the case of numerics) would include .totalDigits and .fractionDigits. Going this route -- establishing a standard Type namespace -- would allow wikis to have a separate vocabulary of singular noun phrases not in the Category namespace. The ultimate goal is to associate a given Type to its implementation as a wiki namespace, subpage or subobject; the Category namespace itself is already overloaded to handle that task. -john

On 19.12.2012 14:50, Gregor Hagedorn wrote:
totally agree - hopefully XSD facets provide a solid start to meeting those concrete requirements
they don't. They allow defining derived datatypes and thus apply to the datatype, not the measurement. Different measurements of the same datatype may be of different precision. --gregor

___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] Data values
I think that Tom Morris tragically misunderstood my point, although that was likely helped by the fact that, as I've insinuated already, the many standards and acronyms being thrown about are largely lost on me. My point is not "We can just throw everything out because we're big and awesome and have name brand power." My point was "We're going to reach a point where some of the existing standards and tools just don't work, because when they were built things like Wikidata weren't envisioned. We need to have the mindset that developing new pieces that work for us is better than trying to force a square peg into a round hole just because something is already widely used." If what exists doesn't work, we're going to do more harm than good if we have to start cutting corners or cutting features to try and get it to work. We have an infrastructure that would allow third parties to come along later and build tools that allow there to be a bridge between whatever we create and whatever exists already.

Sven
[Wikidata-l] qudt ontology facets
The NIST ontology defines 4 basic classes that are great: qudt:QuantityKind [11], qudt:Quantity [12], qudt:QuantityValue [13], qudt:Unit [14], but the properties set leaves me a bit thirsty. Take Area as an example. I'd like to reference properties named .ft2 and .m2 so that, for instance, an annotation might be [[Leasable area.ft2::12345]]. To state the precision applicable to that measurement, it might be [[Leasable area.ft2:fractionDigits :: 0]] to indicate, say, rounding. However, in the NIST ontology there is no ft2 property at all -- this is an SI unit though, so it seems identifying first the system of measurement units, and then the specific measurement unit, is not a great idea, because these notations are then divorced from the property name itself, a scenario guaranteed to cause more user errors and omissions I think.

Someone's mentioned uncertainty facets, so I suggest these from the qudt ontology:

Property: .anyType:relativeStandardUncertainty
Property: .anyType:standardUncertainty

Other facets noted might include:

Property: .anyType:abbreviation
Property: .anyType:description
Property: .anyType:symbol

-john

On 19.12.2012 08:10, Herman Bruyninckx wrote: On Wed, 19 Dec 2012, Denny Vrandečić wrote: Martynas, could you please let me know where RDF or any of the W3C standards covers topics like units, uncertainty, and their conversion. I would be very much interested in that. Cheers, Denny

NIST has created a standard in OWL: QUDT - Quantities, Units, Dimensions and Data Types in OWL and XML: http://www.qudt.org/qudt/owl/1.0.0/index.html [5] I fully share Martynas' concerns: most of the problems that are being discussed in this thread (and that are very relevant and interesting) should not be solved with an object-oriented approach (that is, via properties of objects, and inheritance) but by semantic modelling (that is, composition of knowledge). For example, one single database representation of a unit can have multiple displays depending on who wants to see the unit, and in which context; the viewer and the context are rather simple to add via semantic primitives. For example, the Topic Map semantic standard would fit here very well, in my opinion: http://en.wikipedia.org/wiki/Topic_map [6].

Herman http://people.mech.kuleuven.be/~bruyninc Tel: +32 16 328056 Vice-President Research euRobotics http://www.eu-robotics.net [7] Open RObot COntrol Software http://www.orocos.org [8] Associate Editor JOSER http://www.joser.org [9], IJRR http://www.ijrr.org [10]

___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l

Links: -- [1] https://lists.wikimedia.org/mailman/listinfo/wikidata-l [2] http://www.mail-archive.com/wikidata-l@lists.wikimedia.org/msg00056.html [3] http://www.mail-archive.com/wikidata-l@lists.wikimedia.org/msg00750.html [4] http://wikimedia.de [5] http://www.qudt.org/qudt/owl/1.0.0/index.html [6] http://en.wikipedia.org/wiki/Topic_map [7] http://www.eu-robotics.net [8] http://www.orocos.org [9] http://www.joser.org [10] http://www.ijrr.org [11] http://www.qudt.org/qudt/owl/1.0.0/qudt/index.html#QuantityKind [12] http://www.qudt.org/qudt/owl/1.0.0/qudt/index.html#Quantity [13] http://www.qudt.org/qudt/owl/1.0.0/qudt/index.html#QuantityValue [14] http://www.qudt.org/qudt/owl/1.0.0/qudt/index.html#Unit

___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
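For readers unfamiliar with QUDT, a stripped-down sketch of how some of its classes fit together, and how an annotation like the .ft2 example above could be normalized, might look like the following; the field names and conversion handling are illustrative only, not the ontology's actual property set:

    # Sketch of QuantityKind / Unit / QuantityValue with unit conversion.
    from dataclasses import dataclass

    @dataclass
    class QuantityKind:          # e.g. Area, Length, Mass
        label: str

    @dataclass
    class Unit:
        symbol: str
        kind: QuantityKind
        to_si: float             # multiplier to the coherent SI unit of this kind

    @dataclass
    class QuantityValue:
        numeric_value: float
        unit: Unit
        standard_uncertainty: float = 0.0   # cf. the uncertainty facets above

        def converted(self, target: Unit) -> "QuantityValue":
            assert self.unit.kind == target.kind, "units must share a QuantityKind"
            factor = self.unit.to_si / target.to_si
            return QuantityValue(self.numeric_value * factor, target,
                                 self.standard_uncertainty * factor)

    area = QuantityKind("Area")
    m2 = Unit("m2", area, to_si=1.0)
    ft2 = Unit("ft2", area, to_si=0.09290304)   # 1 ft^2 in m^2 (exact by definition)

    leasable = QuantityValue(12345.0, ft2)       # cf. [[Leasable area.ft2::12345]]
    print(leasable.converted(m2))                # roughly 1146.9 m2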
[Wikidata-l] Wikimania videos of Wikidata sessions
Finally, we have the rest of the Wikimania videos available, including this one of the Wikidata panel in the sister projects session: http://www.youtube.com/watch?v=xi8Yf9c3wXg (starts at 22:45) The other Wikidata session is here: http://www.youtube.com/watch?v=05HxNwxiNZ0 Cheers, Katie -- Katie Filbert Wikidata Developer Wikimedia Germany e.V. | NEW: Obentrautstr. 72 | 10963 Berlin Phone (030) 219 158 26-0 http://wikimedia.de Wikimedia Germany - Society for the Promotion of Free Knowledge e.V. Entered in the register of Amtsgericht Berlin-Charlottenburg under the number 23 855, and recognized as charitable by the Inland Revenue for Corporations I Berlin, tax number 27/681/51985. ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l
Re: [Wikidata-l] Data values
If one has time to read prior art, I'd suggest giving the Health Level 7 v3.0 Data Types Specification http://amisha.pragmaticdata.com/v3dt/report.html a look. Of course HL7 has a lot of things to worry about which are off topic for us, starting with a prior completely different version of the standard. And much emphasis goes to coded values (enums) and coding systems, but it is a nice review of issues found and solved by many eyeballs and years. Peter ___ Wikidata-l mailing list Wikidata-l@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-l