Re: [Wikidata-l] Data values

Denny Vrandečić Fri, 21 Dec 2012 09:14:42 -0800

Hi all,

wow! Thanks for all the input. I read through all of it, and am currently
trying to digest it into a new draft of the data model for the discussed
data values. I will try to address some questions here. Please be kind if I
refer to the wrong person at one place or the other.

Whenever I refer to the "current model", I mean the version as it was
during this discussion <
http://meta.wikimedia.org/w/index.php?title=Wikidata/Development/Representing_values&oldid=4859586
>

The term "updated model" refers to the new one, which is not published yet.
I hope I can do that soon.

== General comments ==

I want to remind everyone of the Wikidata requirements: <
http://meta.wikimedia.org/wiki/Wikidata/Notes/Requirements>

Here especially:
* The expressiveness of Wikidata will be limited. There will always be
examples of knowledge that Wikidata will not be able to convey. We hope
that this expressiveness can increase over time.
* The first goal of Wikidata is to serve actual use cases in Wikipedia, not
to enable some form of hypothetical perfection in knowledge representation.
* Wikidata has to balance ease of use and expressiveness of statements. The
user interface should not get complicated to merely cover a few exceptional
edge cases.
* What is an exceptional case, and what is not, will be defined by how
often they appear in Wikipedia. Instead of anecdotal evidence or
hypothetical examples we will analyse Wikipedia and see how frequent
specific cases are.

In general this means that we cannot express everything that is
expressible. A statement should not be intended to reflect the source as
closely as possible, but rather to be *supported* by the source. I.e. if the
source says "He died during the early days of 1876" this would also support
a statement like "died in - 19th century". It does not have to be more
exact than that.

Martynas, there is no mention here of XSD etc. because it is not relevant
on this level of discussion. For exporting the data we will obviously use
XSD datatypes. This is so obvious that I didn't think it needed to be
explicitly stated.

Tom, thanks for the links to EDTF and the Freebase work, this was certainly
very enlightening.

Friedrich, the term "query answering" simply means the ability to answer
queries against the database in Phase 3, e.g. the list of cities located in
Ghana with a population over 25,000 ordered by population.

A query system that deals well with intervals -- I would need a pointer for
that. For now I have been assuming that we use a single value internally to
answer such queries. If the value is 90+-20, then the result of the query
>100 would not contain it. Sucks, but I don't know of any better system.
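
To make the trade-off concrete, here is a minimal sketch of the two possible
query semantics for the 90+-20 example. The class and function names are
invented for illustration and are not part of the data model; the
interval-based variant is only a hypothetical alternative, not something
proposed here.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Quantity:
    value: Decimal  # central value, the only part used for indexing
    lower: Decimal  # lower bound of the uncertainty range
    upper: Decimal  # upper bound of the uncertainty range


def matches_point(q: Quantity, threshold: Decimal) -> bool:
    """Single-value semantics: compare only the central value."""
    return q.value > threshold


def matches_interval(q: Quantity, threshold: Decimal) -> bool:
    """Hypothetical interval semantics: match if any part of the range exceeds the threshold."""
    return q.upper > threshold


q = Quantity(Decimal("90"), Decimal("70"), Decimal("110"))
print(matches_point(q, Decimal("100")))     # False: 90 is not above 100
print(matches_interval(q, Decimal("100")))  # True: the range reaches 110
```

With the single-value semantics the answer is cheap to index and sort, at
the cost of missing values whose range overlaps the threshold.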

We do not anywhere rely on floats (besides in internal representations),
but always use decimals. Floats have some inherent problems in representing
some numbers that could be interesting for us.
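
A small illustration of the kind of problem meant here, using Python's
standard decimal module (the choice of Python is mine, purely for
illustration): binary floats cannot represent 0.1 or 0.2 exactly, while
decimals can.

```python
from decimal import Decimal

# Binary floating point: 0.1 and 0.2 have no exact representation,
# so their sum is not exactly 0.3.
float_sum = 0.1 + 0.2
print(float_sum == 0.3)  # False

# Decimal arithmetic: the same values are represented exactly.
decimal_sum = Decimal("0.1") + Decimal("0.2")
print(decimal_sum == Decimal("0.3"))  # True
```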

== Time ==

Marco suggested marking some components of a date as N/A. This is partially
the idea of the "precision" attribute in the current model. Anything below
the precision would be N/A. It would not be possible, though, to N/A the
year when the month or day is known, as Friedrich suggested.

Friedrich also suggested to use a value like April-July 1567 for uncertain
time instead of the current precision model. I prefer his suggestion to the
current one and will include that in the updated model.

The accuracy though has to be in the unit given by the precision, we cannot
just take seconds, since there is no well-defined number of seconds in a
month or a year, or, almost anything, actually.
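
As a rough sketch of what "accuracy in the unit of the precision" could look
like for a value such as April-July 1567: the range is counted in months,
not seconds. The function names and the before/after parameters are
assumptions made up for this example, not part of the model.

```python
import calendar


def shift_month(year: int, month: int, delta: int) -> tuple[int, int]:
    """Move a (year, month) pair by a whole number of months."""
    index = year * 12 + (month - 1) + delta
    new_year, new_month = divmod(index, 12)
    return new_year, new_month + 1


def month_interval(year: int, month: int, before: int, after: int):
    """Range covered by a month-precision value, with the uncertainty given in months."""
    return shift_month(year, month, -before), shift_month(year, month, after)


# "April-July 1567": April 1567 with an uncertainty of 0 months before, 3 after.
print(month_interval(1567, 4, 0, 3))  # ((1567, 4), (1567, 7))

# A month has no fixed length in seconds: February 2012 vs. February 2013.
feb_2012_seconds = calendar.monthrange(2012, 2)[1] * 86400
feb_2013_seconds = calendar.monthrange(2013, 2)[1] * 86400
print(feb_2012_seconds != feb_2013_seconds)  # True
```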

Note though that the intervals that Sven mentioned -- useful for e.g.
reigns or office periods -- are different beasts and should have
uncertainty entries both for the start and end date. We have intervals in
the data model, and plan to implement them later -- it is just that they
are not such a high priority (dates appear 2.5 Million times in infoboxes,
intervals only 80,000 times).

I am completely unsure what to do with a value like "about 1850" if not to
interpret it as something like 1850 +- 50, but Sven seems to dislike
that.

== Location ==

After the discussion, I decided to drop altitude / elevation from the
Geolocation. It can still be expressed through a property, and have all the
flexibility of a normal property (including qualifiers etc.)

In a Geolocation, neither the lat nor the long is optional (sorry Nikola).
The Geolocation as a whole can be optional, though (i.e. unknown), but not
only one of them.

For the uncertainty of geolocations I would like to use the same uncertainty
model as for Quantity values and now for time. I know that "meters" have
been suggested instead of degrees, but that would be kind of ugh,
considering that the biggest reason why we need the uncertainty is for
converting units, in this case from decimal degrees to degree-minute-seconds.
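
One way the uncertainty could drive that conversion, as a sketch only: an
uncertainty given in degrees decides whether the output stops at degrees,
minutes, or seconds. The function name and the thresholds (1 degree and
1/60 degree) are assumptions for illustration, not a specified rule.

```python
from decimal import Decimal, ROUND_HALF_UP


def to_dms(value: Decimal, uncertainty: Decimal) -> str:
    """Format decimal degrees, showing only the components the uncertainty supports."""
    sign = "-" if value < 0 else ""
    value = abs(value)
    if uncertainty >= 1:
        # Uncertainty of a degree or more: minutes would be meaningless.
        degrees = value.quantize(Decimal(1), rounding=ROUND_HALF_UP)
        return f"{sign}{degrees}°"
    if uncertainty >= Decimal(1) / 60:
        # Uncertainty of a minute or more: stop at minutes.
        total_minutes = int((value * 60).quantize(Decimal(1), rounding=ROUND_HALF_UP))
        d, m = divmod(total_minutes, 60)
        return f"{sign}{d}° {m}'"
    # Otherwise show seconds as well.
    total_seconds = int((value * 3600).quantize(Decimal(1), rounding=ROUND_HALF_UP))
    d, rest = divmod(total_seconds, 3600)
    m, s = divmod(rest, 60)
    return f"{sign}{d}° {m}' {s}\""


print(to_dms(Decimal("52.5167"), Decimal("0.1")))   # 52° 31'
print(to_dms(Decimal("52.5167"), Decimal("0.01")))  # 52° 31' 0"
```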

== Quantity values ==

Sorry to disagree with Daniel here, but we will definitely store a
quantity value in the unit that the editor used for input. We will then
internally normalize it for indexing etc., but the editor won't be bothered
with that as long as they do not ask for a conversion. Storing it with the
original unit is important for a number of reasons, most of which Gregor
already alluded to.

I very much like Gregor's suggestion: rename the lower uncertainty and
upper uncertainty to something with less semantic baggage. What about upper
and lower bound? Or just upper and lower? And then leave the interpretation
to others.

Gregor, an infinitely precise number (the number of apostles, e.g.) would
be handled trivially by +- 0.

Also, I am taking the hint from Avenue and others and dropping confidence. I
don't think it is useful to have it so deeply embedded in the data model;
it should properly be handled through qualifiers.

Regarding the height of the Eiffel tower: 324 m +- 1 m is exactly what I
would like to see here if the source states 324 meters.
I know the source doesn't say +-1m, but this is certainly supported by the
source. Think about why we need this +-1m: it is simply so we can give a
useful transformation into feet. Otherwise we cannot convert units.
The +-1m would not be displayed usually.
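
To illustrate why the +-1m matters for the conversion, here is a minimal
sketch. The rounding rule (round the converted value to the decimal place
of the leading digit of the converted uncertainty) is my own assumption for
this example, not a specified algorithm, and it assumes a positive
uncertainty.

```python
from decimal import Decimal, ROUND_HALF_UP

METERS_PER_FOOT = Decimal("0.3048")  # exact by definition of the international foot


def meters_to_feet(value: Decimal, uncertainty: Decimal):
    """Convert value and uncertainty to feet, rounded to what the uncertainty supports."""
    feet = value / METERS_PER_FOOT
    feet_uncertainty = uncertainty / METERS_PER_FOOT
    # Decimal place of the leading digit of the converted uncertainty
    # (0 for 3.28..., meaning "whole feet").
    exponent = feet_uncertainty.adjusted()
    quantum = Decimal(1).scaleb(exponent)
    return (
        feet.quantize(quantum, rounding=ROUND_HALF_UP),
        feet_uncertainty.quantize(quantum, rounding=ROUND_HALF_UP),
    )


# 324 m +- 1 m: the raw quotient is 1062.992... ft, which claims far more
# precision than the source supports.
print(meters_to_feet(Decimal("324"), Decimal("1")))  # (Decimal('1063'), Decimal('3'))
```

Without the uncertainty there is no principled way to choose between 1063
ft, 1062.99 ft, or 1060 ft.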

== Units ==

I sense consensus that we should allow declaration of units in the wiki,
and not have them hardcoded in the software. Having discussed the various
options and in light of the discussion here, the current suggestion would
be to create a page for every quantity unit including the appropriate
factors (for linear conversions). This is similar to the way Freebase does
it, as sent around by Tom, and to what John McClure suggested.

Then on a given property, the property points to a quantity unit and
furthermore lists the "usual units" for the given property (pointing to the
given items), which is used for display.

Internally, for indexing, sorting, and query answering, we would always
transform the input to the quantity unit so they are comparable. But this
is usually neither exposed nor a useful number (e.g. it might have too many
significant digits etc.)
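
A sketch of how such unit declarations with linear factors could be used
for internal normalization and sorting. The table below is a made-up
stand-in for the unit pages in the wiki, with the meter chosen as the
normalized unit for length; the names are hypothetical.

```python
from decimal import Decimal

# Hypothetical unit declarations: factor that converts one unit into meters.
UNIT_FACTORS = {
    "meter": Decimal("1"),
    "kilometer": Decimal("1000"),
    "foot": Decimal("0.3048"),
    "mile": Decimal("1609.344"),
}


def normalize(amount: Decimal, unit: str) -> Decimal:
    """Internal value used only for indexing and sorting, never shown to editors."""
    return amount * UNIT_FACTORS[unit]


# Values stay stored in the unit the editor entered them in.
values = [
    (Decimal("1"), "mile"),
    (Decimal("324"), "meter"),
    (Decimal("2"), "kilometer"),
    (Decimal("1000"), "foot"),
]

ordered = sorted(values, key=lambda v: normalize(*v))
print([unit for _, unit in ordered])  # ['foot', 'meter', 'mile', 'kilometer']
```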

This would allow the use of historic units like Li or historic miles even
though we do not know how to translate them to other units (but not by the
same property).

This would also allow for other units, like Avenue has pointed out. Those
are important.

Nikola, we will not have special handling for money for now. This would
require a whole different spec, I am afraid. Currencies appear 200,000 times
in Wikipedia -- that is often, but not often enough to be high priority.

I hope that I managed to digest the whole discussion and bring it together.

Cheers,
Denny

_______________________________________________
Wikidata-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata-l
