I think that Tom Morris misunderstood my point, though that was likely
helped along by the fact that, as I've admitted already, the many
standards and acronyms being thrown about are largely lost on me.

My point is not "We can just throw everything out because we're big and
awesome and have name-brand power". My point is "We're going to reach a
point where some of the existing standards and tools just don't work,
because when they were built, things like Wikidata weren't envisioned. We
need to have the mindset that developing new pieces that work for us is
better than trying to force a square peg into a round hole just because
something is already widely used. If what exists doesn't work, we're going
to do more harm than good if we have to start cutting corners or cutting
features to try to get it to work. We have an infrastructure that would
allow third parties to come along later and build tools that bridge
whatever we create and whatever already exists".

Sven

On Wed, Dec 19, 2012 at 2:40 PM, Tom Morris <[email protected]> wrote:

> Wow, what a long thread.  I was just about to chime in to agree with
> Sven's point about units when he interjected his comment about blithely
> ignoring history, so I feel compelled to comment on that first.  It's fine
> to ignore standards *for good reasons*, but doing it out of ignorance or
> gratuitously is just silly.  Thinking that WMF is so special it can create
> a better solution without even knowing what others have done before is the
> height of arrogance.
>
> Modeling time and units can basically be made arbitrarily complex, so the
> trick is in achieving the right balance of complexity vs. utility.  Time is
> complex enough that I think it deserves its own thread.  The first thing
> I'd do is establish some definitions to cover some basics like
> durations/intervals, uncertain dates, unknown dates, imprecise dates, etc.,
> so that everyone is using the same terminology and concepts.  Much of the
> time discussion is difficult for me to follow because I have to guess at
> what people mean.  In addition to the ability to handle circa/about dates
> already mentioned, it's also useful to be able to represent before/after
> dates e.g. he died before 1 Dec 1792 when his will was probated.  Long term
> I suspect you'll need support for additional calendars rather than
> converting everything to a common calendar, but only supporting Gregorian
> is a good way to limit complexity to start with.  Geologic times may
> (probably?) need to be modeled differently.
>
> Although I disagree strongly with Sven's sentiments about the
> appropriateness of reinventing things, I believe he's right about the need
> to support more units than just SI units and to know what units were used
> in the original measurement.  It's not just a matter of aesthetics but of
> being able to preserve the provenance.  Perhaps this gets saved for a
> future iteration, but you may find that you need both display and
> computable versions of things stored separately.
>
> Speaking of computable versions, don't underestimate the issues with using
> floating-point numbers.  There are numbers that they simply cannot represent,
> and their range is not infinite.
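A minimal Python illustration of both floating-point pitfalls Tom mentions (the specific values are illustrative, not from the thread):

```python
# 0.1 and 0.2 have no exact binary floating-point representation,
# so their sum is not exactly 0.3.
print(0.1 + 0.2 == 0.3)        # False

# Above 2**53, consecutive integers can no longer be distinguished
# as 64-bit floats, so "exact" counts silently lose precision.
print(float(2**53) == float(2**53 + 1))  # True
```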
>
> Historians and genealogists have many interminable discussions about
> date/time representation which can be found in various list archives, but
> one recent spec worth reviewing is Extended Date/Time Format (EDTF)
> http://www.loc.gov/standards/datetime/pre-submission.html
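For a flavor of what EDTF covers, the draft defines compact notations for several of the cases discussed above. The examples below follow the draft's syntax as I recall it, so double-check them against the spec itself:

```
1984?              uncertain year ("possibly 1984")
1984~              approximate year ("circa 1984")
1984-06            imprecise date (known only to the month)
2004-06/2006-08    interval from June 2004 to August 2006
```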
>
> Another thing worth looking at is the Freebase schema since it not only
> represents a bunch of this stuff already, but it's got real world data
> stored in the schema and user interface implementations for input and
> rendering (although many of the latter could be improved).  In particular,
> some of the following might be of interest:
>
> http://www.freebase.com/view/measurement_unit /
> http://www.freebase.com/schema/measurement_unit
> http://www.freebase.com/schema/time
> http://www.freebase.com/schema/astronomy/celestial_object_age
> http://www.freebase.com/schema/time/geologic_time_period
> http://www.freebase.com/schema/time/geologic_time_period_uncertainty
>
> If you rummage around, you can probably find lots of interesting examples
> and decide for yourself whether or not that's a good way to model things.
>  I'm reasonably familiar with the schema and happy to answer questions.
>
> There are probably lots of other example vocabularies that one could
> review such as the Pleiades project's:
> http://pleiades.stoa.org/vocabularies
>
> You're not going to get it right the first time, so I would just start
> with a small core that you're reasonably confident in and iterate from
> there.
>
> Tom
>
> On Wed, Dec 19, 2012 at 12:47 PM, Sven Manguard <[email protected]> wrote:
>
>> My philosophy is this: We should do whatever works best for Wikidata and
>> Wikidata's needs. If people want to reuse our content, and the choices
>> we've made make existing tools unworkable, they can build new tools
>> themselves. We should not be clinging to "what's been done already" if it
>> gets in the way of "what will make Wikidata better". Everything that we
>> make and do is open, including the software we're going to operate the
>> database on. Every WMF project has done things differently from the
>> standards of the time, and people have developed tools to use our content
>> before. Wikidata will be no different in that regard.
>>
>> Sven
>>
>>
>> On Wed, Dec 19, 2012 at 12:27 PM, Martynas Jusevičius <
>> [email protected]> wrote:
>>
>>> Denny,
>>>
>>> you're sidestepping the main issue here -- every sensible architecture
>>> should build on as many existing standards as possible, and build its own
>>> custom solution only if a *very* compelling reason is found to do so
>>> instead of finding a compromise between the requirements and the
>>> standard. Wikidata seems to be constantly doing the opposite --
>>> building a custom solution for whatever reason, or even without one.
>>> This drives compatibility and reuse towards zero.
>>>
>>> This thread originally discussed datatypes for values such as numbers,
>>> dates and their intervals -- semantics for all of those are defined in
>>> XML Schema Datatypes: http://www.w3.org/TR/xmlschema-2/
>>> All the XML and RDF tools are compatible with XSD, yet I don't
>>> think there has been even a single mention of it in this thread. What makes
>>> Wikidata so special that its datatypes cannot build on XSD? And this
>>> is only one of the issues, I've pointed out others earlier.
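For reference, XSD does define lexical forms and value spaces for the datatypes under discussion; in RDF they appear as typed literals, e.g.:

```
"100000"^^xsd:integer
"1792-12-01"^^xsd:date
"P2Y6M"^^xsd:duration      (a duration of two years and six months)
```

Note that these cover numbers, dates, and durations, but not units of measure or uncertainty, which is what Denny asks about below.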
>>>
>>> Martynas
>>> graphity.org
>>>
>>>
>>> On Wed, Dec 19, 2012 at 5:58 PM, Denny Vrandečić
>>> <[email protected]> wrote:
>>> > Martynas,
>>> >
>>> > could you please let me know where RDF or any of the W3C standards covers
>>> > topics like units, uncertainty, and their conversion. I would be very much
>>> > interested in that.
>>> >
>>> > Cheers,
>>> > Denny
>>> >
>>> >
>>> >
>>> >
>>> > 2012/12/19 Martynas Jusevičius <[email protected]>
>>> >>
>>> >> Hey wikidatians,
>>> >>
>>> >> occasionally checking threads in this list like the current one, I get
>>> >> a mixed feeling: on one hand, it is sad to see the efforts and
>>> >> resources wasted as Wikidata tries to reinvent RDF, and now also
>>> >> triplestore design as well as XSD datatypes. What's next, WikiQL
>>> >> instead of SPARQL?
>>> >>
>>> >> On the other hand, it feels reassuring as I was right to predict this:
>>> >>
>>> >> http://www.mail-archive.com/[email protected]/msg00056.html
>>> >>
>>> >> http://www.mail-archive.com/[email protected]/msg00750.html
>>> >>
>>> >> Best,
>>> >>
>>> >> Martynas
>>> >> graphity.org
>>> >>
>>> >> On Wed, Dec 19, 2012 at 4:11 PM, Daniel Kinzler
>>> >> <[email protected]> wrote:
>>> >> > On 19.12.2012 14:34, Friedrich Röhrs wrote:
>>> >> >> Hi,
>>> >> >>
>>> >> >> Sorry for my ignorance, if this is common knowledge: What is the use
>>> >> >> case for sorting millions of different measures from different objects?
>>> >> >
>>> >> > Finding all cities with more than 100000 inhabitants requires the
>>> >> > database to look through all values for the property "population" (or
>>> >> > even all properties with countable values, depending on implementation
>>> >> > and query planning), compare each value with "100000" and return those
>>> >> > with a greater value. To speed this up, an index sorted by this value
>>> >> > would be needed.
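Daniel's indexing argument can be sketched with a few lines of SQLite. The table layout and names here are hypothetical, purely to show how a (property, value) index supports greater-than queries; this is not Wikidata's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical claims table: one row per (item, property, numeric value).
cur.execute("CREATE TABLE claims (item TEXT, property TEXT, value_num REAL)")

# A composite index lets the database answer range queries like
# "population > 100000" without scanning every row.
cur.execute("CREATE INDEX idx_prop_value ON claims (property, value_num)")

cur.executemany("INSERT INTO claims VALUES (?, ?, ?)", [
    ("Berlin", "population", 3500000),
    ("Tuebingen", "population", 90000),
])

rows = cur.execute(
    "SELECT item FROM claims WHERE property = ? AND value_num > ?",
    ("population", 100000),
).fetchall()
print(rows)  # [('Berlin',)]
```

An index over an opaque serialized blob, by contrast, only helps with exact-match lookups, which is the point Daniel makes further down.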
>>> >> >
>>> >> >> For cars there could be entries by the manufacturer, by some
>>> >> >> car-testing magazine, etc. I don't see how this could be adequately
>>> >> >> represented/sorted by a database-only query.
>>> >> >
>>> >> > If this cannot be done adequately on the database level, then it
>>> >> > cannot be done efficiently, which means we will not allow it. So our
>>> >> > task is to come up with an architecture that does allow this.
>>> >> >
>>> >> > (One way to allow "scripted" queries like this to run efficiently is
>>> >> > to do this in a massively parallel way, using a map/reduce framework.
>>> >> > But that's also not trivial, and would require a whole new server
>>> >> > infrastructure.)
>>> >> >
>>> >> >> If however this is necessary, I still don't understand why it must
>>> >> >> affect the datavalue structure. If an index is necessary it could be
>>> >> >> done over a serialized representation of the value.
>>> >> >
>>> >> > "Serialized" can mean a lot of things, but an index on some data blob
>>> >> > is only useful for exact matches; it cannot be used for greater/lesser
>>> >> > queries. We need to map our values to scalar data types the database
>>> >> > can understand directly, and use for indexing.
>>> >> >
>>> >> >> This needs to be done anyway, since the values are saved at a
>>> >> >> specific unit (which is just a Wikidata item). To compare them on a
>>> >> >> database level they must all be saved at the same unit, or some sort
>>> >> >> of procedure must be used to compare them (or am I missing something
>>> >> >> again?).
>>> >> >
>>> >> > If they measure the same dimension, they should be saved using the
>>> >> > same unit (probably the SI base unit for that dimension). Saving
>>> >> > values using different units would make it impossible to run efficient
>>> >> > queries against these values, thereby defying one of the major reasons
>>> >> > for Wikidata's existence. I don't see a way around this.
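The normalization Daniel describes could look something like this sketch. The conversion table and function name are hypothetical (in Wikidata, units would be items rather than strings), but it shows why values stored in a common base unit become directly comparable:

```python
# Hypothetical conversion factors from each unit to the SI base unit
# of its dimension (here: length -> metre).
TO_SI = {
    "metre": 1.0,
    "kilometre": 1000.0,
    "mile": 1609.344,
}

def to_base(value, unit):
    """Normalize a measurement to the SI base unit so that values
    recorded in different units become directly comparable."""
    return value * TO_SI[unit]

# 5 miles and 8 kilometres can now be compared on one scale.
print(to_base(5, "mile") > to_base(8, "kilometre"))  # True
```

Storing the original unit alongside the normalized value would also preserve the provenance Tom asks for earlier in the thread.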
>>> >> >
>>> >> > -- daniel
>>> >> >
>>> >> > --
>>> >> > Daniel Kinzler, Softwarearchitekt
>>> >> > Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
>>> >> >
>>> >> >
>>> >> > _______________________________________________
>>> >> > Wikidata-l mailing list
>>> >> > [email protected]
>>> >> > https://lists.wikimedia.org/mailman/listinfo/wikidata-l
>>> >>
>>> >
>>> >
>>> >
>>> >
>>> > --
>>> > Project director Wikidata
>>> > Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
>>> > Tel. +49-30-219 158 26-0 | http://wikimedia.de
>>> >
>>> > Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
>>> > Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter
>>> > der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für
>>> > Körperschaften I Berlin, Steuernummer 27/681/51985.
>>> >
>>> >
>>>
>>>
>>
>>
>>
>>
>
>
>
