I like that phrase

"Is the data going to be used? Data that is not used is exponentially
harder to maintain because less people see it"

To take a specific example, I've been building a name and identifier
recognition system based on data from Freebase that is focused on certain
kinds of spatial regions.  I want to stress that this is not an
academic project (where, in the worst case, I might proudly announce that
I got 81.2% accuracy and that this beats the last group that got 80.3%) but
a commercial system (1) that needs to be hyperaccurate (at least three
nines, if not four) and (2) in which I need to fix right away anything that
management or customers find wrong.

Another aspect of it is that I can get (barely) two-nines accuracy for
entities while only resolving about 40% of the place names that appear once,
because these entities are concentrated in certain places.  Many of the
most popular regions need data corrections to resolve correctly, either
because they are national capitals where multiple geographic entities
occupy the same land area or because they are ontologically troubled
islands.

Looking in Freebase, I don't find 100% of the identifiers that are used in
my data set, and another issue is that some containment relationships are
missing because sometimes @fbase couldn't figure out the relative hierarchy
of places.

I address both of those issues by applying "fact patches" to my knowledge
base.
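
As a rough illustration of the idea (not the actual system described here),
a fact patch can be modelled as an "add" or "remove" operation on a set of
(subject, predicate, object) triples.  The Triple and Patch types, the
apply_patches helper and the place identifiers are all assumptions made up
for this sketch; none of them come from Freebase.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


class Patch(NamedTuple):
    op: str  # "add" or "remove"
    triple: Triple


def apply_patches(kb, patches):
    """Return a new set of triples with the patches applied in order.

    The original knowledge base is left untouched, so the upstream
    data can be reloaded and re-patched at any time.
    """
    patched = set(kb)
    for patch in patches:
        if patch.op == "add":
            patched.add(patch.triple)
        elif patch.op == "remove":
            patched.discard(patch.triple)
        else:
            raise ValueError(f"unknown patch op: {patch.op!r}")
    return patched


# Hypothetical data: one place missing both an identifier and a
# containment relationship.
kb = {Triple("place:A", "name", "A")}
patches = [
    Patch("add", Triple("place:A", "identifier", "XX-A")),
    Patch("add", Triple("place:A", "contained_by", "place:B")),
]
patched_kb = apply_patches(kb, patches)
```

Keeping the patches as data, separate from the upstream dump, means they can
be replayed whenever the upstream data is refreshed.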

In principle I could push these changes back to @fbase,  but since mqlwrite
is broken and @fbase is heading towards EOL,  I won't.

There are other problems, though, that I end up addressing in my rule base,
or for which I have to add different vocabulary if I want to solve them.  For
instance, I get a lot of references to "Hong Kong Island", which is not to
be confused with

http://en.wikipedia.org/wiki/Islands_District

It turns out HKI has four administrative districts.  With a little more
logic I can probably figure out which district these things are in, but
maybe it doesn't make any real difference to end users, and I'm not sure
mail would be delivered to the "Central and Western District", so I could
make HKI an "honorary" administrative district (something I wouldn't push
back upstream).

So you notice two themes here.  Some of my patches are things that would
belong in Wikidata because they fill in fields that Wikidata already
has and follow established conventions.

There are other patches I need to make to reflect requirements of my
application that I'd never want to push upstream because they are "correct"
in the context of my application but "incorrect" or "arguable" in general.
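
One way to keep the two kinds of patch apart, sketched under the assumption
that each patch carries an explicit scope label.  The FactPatch class, the
"upstream"/"local" labels, the identifiers and the upstream_candidates helper
are hypothetical names chosen for this example only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FactPatch:
    subject: str
    predicate: str
    value: str
    scope: str  # "upstream": generally true; "local": true only for this application


def upstream_candidates(patches):
    """Return only the patches that would make sense to contribute upstream."""
    return [p for p in patches if p.scope == "upstream"]


patches = [
    # Fills in a missing containment fact (made-up identifiers).
    FactPatch("place:some_town", "contained_by", "place:some_county", "upstream"),
    # Application-specific convenience: treat Hong Kong Island as an
    # "honorary" administrative district.
    FactPatch("place:hong_kong_island", "type", "administrative_district", "local"),
]

to_contribute = upstream_candidates(patches)
```

The application applies every patch, while only to_contribute would ever be
proposed to the upstream database.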

-----

One of the troubles people have consistently had with DBpedia has been
trying to get a list of the top cities in the world (by one metric or
another).  It's hard to do for two reasons:

(1) some facts are absent in DBpedia,  and
(2) many of the biggest/most important "cities" in the world such as London
and Tokyo are not,  technically,  cities.

Success in this project,  therefore,  requires patching absent or incorrect
facts in DBpedia,  but also the creation of a vernacular concept of "city"
which reflects the "common sense" perception here.
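
A minimal sketch of how the two fixes could combine, assuming places are plain
records with a formal type and an optional population.  The record layout, the
type labels, the helper names and every number below are placeholders invented
for illustration; they are not DBpedia's schema or real figures.

```python
# Formal types that already count as a city (assumed label).
FORMAL_CITY_TYPES = {"city"}

# Hand-maintained list of places that common sense calls cities even
# though their formal type says otherwise.
VERNACULAR_CITIES = {"London", "Tokyo"}


def is_vernacular_city(place):
    """True if the place is formally a city or is on the vernacular list."""
    return place["type"] in FORMAL_CITY_TYPES or place["name"] in VERNACULAR_CITIES


def top_cities(places, population_patches, n):
    """Rank vernacular cities by population, preferring patched values."""
    ranked = []
    for place in places:
        if not is_vernacular_city(place):
            continue
        population = population_patches.get(place["name"], place["population"])
        if population is None:
            continue  # fact still absent, cannot rank this place
        ranked.append((population, place["name"]))
    ranked.sort(reverse=True)
    return [name for _, name in ranked[:n]]


# Placeholder records; the populations are NOT real data.
places = [
    {"name": "London", "type": "other", "population": None},  # absent fact
    {"name": "Tokyo", "type": "other", "population": 30},
    {"name": "Exampleville", "type": "city", "population": 20},
    {"name": "Sample County", "type": "other", "population": 40},
]
population_patches = {"London": 10}  # patch for the absent fact

result = top_cities(places, population_patches, 2)
```

Here "Sample County" is excluded despite having the largest placeholder
population, because it is neither formally a city nor on the vernacular list.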


On Sat, Jan 3, 2015 at 9:48 AM, Lydia Pintscher <
[email protected]> wrote:

> Hey folks :)
>
> Happy new year everyone. It is surely going to be an exciting one for
> Wikidata. Over the last weeks I've been thinking a lot about the year
> ahead of us. One thing is clear to me: It will be about successfully
> scaling Wikidata and keeping all the amazing things we have achieved
> in the process.
>
> I've written down my thoughts on the subject in a blog post to kick
> off some thinking and discussions:
>
> http://blog.wikimedia.de/2015/01/03/scaling-wikidata-success-means-making-the-pie-bigger/
>
>
> Cheers
> Lydia
>
> --
> Lydia Pintscher - http://about.me/lydia.pintscher
> Product Manager for Wikidata
>
> Wikimedia Deutschland e.V.
> Tempelhofer Ufer 23-24
> 10963 Berlin
> www.wikimedia.de
>
> Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
>
> Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg
> unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das
> Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985.
>
> _______________________________________________
> Wikidata-l mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/wikidata-l
>



-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254    paul.houle on Skype   [email protected]
http://legalentityidentifier.info/lei/lookup
