Re: [Wikidata] Data model explanation and protection

Tom Morris Wed, 28 Oct 2015 12:23:05 -0700

BTW, merges aren't the only problem.  For all languages except English,
it's the protein Wikidata item [1] that points to the corresponding
Wikipedia page, while for English it's the gene item [2] that points to the
corresponding English article [3].


[1] https://www.wikidata.org/wiki/Q13561329
[2] https://www.wikidata.org/wiki/Q414043
[3] https://en.wikipedia.org/wiki/Reelin
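
A minimal sketch of how one might spot this kind of split, assuming the two
items are available as dicts shaped like the "sitelinks" part of a Wikidata
entity document. The sample data is hand-written for illustration (the
languages and titles are assumptions, not fetched from the items above), and
the function names are hypothetical.

```python
# Hand-written sample data, shaped like the "sitelinks" section of a
# Wikidata entity document. Languages and titles are illustrative only.
protein_item = {
    "id": "Q13561329",
    "sitelinks": {
        "dewiki": {"site": "dewiki", "title": "Reelin"},
        "frwiki": {"site": "frwiki", "title": "Reelin"},
    },
}
gene_item = {
    "id": "Q414043",
    "sitelinks": {"enwiki": {"site": "enwiki", "title": "Reelin"}},
}


def sitelink_sites(entity):
    """Return the set of site keys (e.g. 'enwiki') linked from an entity."""
    return set(entity.get("sitelinks", {}))


def sitelink_split(gene, protein):
    """List which wikis are attached to the gene item and which to the protein item."""
    return {
        "gene": sorted(sitelink_sites(gene)),
        "protein": sorted(sitelink_sites(protein)),
    }


def is_split(gene, protein):
    """True when both items carry sitelinks, i.e. the articles are divided between them."""
    return bool(sitelink_sites(gene)) and bool(sitelink_sites(protein))


print(sitelink_split(gene_item, protein_item))
print(is_split(gene_item, protein_item))
```

Running a check like this over every gene/protein pair would give a list of
pairs whose Wikipedia links need to be moved to one side or the other.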


On Wed, Oct 28, 2015 at 3:08 PM, Tom Morris <[email protected]> wrote:

> This is a deep-seated semantic confusion going back to at least 2006 [1]
> when the Protein Infobox had Entrez and OMIM gene IDs.  Freebase naively
> adopted this in its initial protein schema [2] in 2007 when it was importing from
> those infoboxes.  Although it made some progress in improving the schema
> later, anything not aligned with how Wikipedians want to do things is
> shoveling against the tide.  It's also very difficult to manage
> equivalences when Wikipedia articles are about multiple things like the
> protein/gene articles.
>
> If you look at the recent merge of Reelin [3] you can see that it was done
> by the same user who contributed substantially to the article back in 2006
> [4], so, as the "owner" of that article, they clearly know what's
> best.  :-) It's going to be very difficult to get people to unlearn a
> decade of habits.
>
> Another issue is that, as soon as you start trying to split things out
> into semantically clean pieces, you immediately run afoul of the notability
> restrictions. Because human (and mouse) genes don't have their own
> Wikipedia pages, they're clearly not notable, so they can't be added to
> Wikidata.
>
> This problem of chunking by notability (or lack thereof), length of text
> article, relatedness, and other attributes rather than semantic
> individuality is much more widespread than just proteins/genes.  It also
> affects things like pairs (or small sets) of people who aren't notable
> enough to have an article on their own, articles which contain infoboxes
> about people who aren't notable, so they got tacked onto a related article
> to give them a home, etc.
>
> The inverse problem exists as well where a single semantic topic is broken
> up into multiple articles purely for reasons of length.  Other types of
> semantic mismatches include articles along precoordinated facets like
> Transportation in New York City (or even History of Transportation in New
> York City!), list articles (* Filmography, * Discography, * Videography,
> List of *).  Of course, some lists, like the Fortune 500, make sense to
> talk about as entities, but most Wikipedia lists are just mechanically
> generated things for human browsing which don't really need a semantic
> identifier.  Freebase deleted most of this Wikipedia cruft.
>
> Going back to Ben's original problem, one tool that Freebase used to help
> manage the problem of incompatible type merges was a set of curated sets of
> incompatible types [5] which was used by the merge tools to warn users that
> the merge they were proposing probably wasn't a good idea.  People could
> ignore the warning in the Freebase implementation, but Wikidata could make
> it a hard restriction or just a warning.
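
A rough sketch of that kind of check, assuming each item carries a set of
type names and the curated data is a collection of unordered type pairs.
The type names, the INCOMPATIBLE_TYPES contents, and the function names are
all made up for illustration; this is not the Freebase implementation.

```python
from itertools import product

# Hypothetical curated data: each entry is an unordered pair of types that
# should never end up on the same item.
INCOMPATIBLE_TYPES = {
    frozenset({"gene", "protein"}),
    frozenset({"person", "location"}),
}


class IncompatibleMergeError(ValueError):
    """Raised in strict mode when a merge would combine incompatible types."""


def find_conflicts(types_a, types_b):
    """Return every (type_a, type_b) pair that appears in the curated set."""
    conflicts = []
    for a, b in product(sorted(types_a), sorted(types_b)):
        if frozenset({a, b}) in INCOMPATIBLE_TYPES:
            conflicts.append((a, b))
    return conflicts


def check_merge(types_a, types_b, strict=False):
    """Warn-only by default; with strict=True the merge is refused outright."""
    conflicts = find_conflicts(types_a, types_b)
    if conflicts and strict:
        raise IncompatibleMergeError(f"incompatible types: {conflicts}")
    return conflicts


# Warning mode: the caller decides what to do with the reported conflicts.
print(check_merge({"gene"}, {"protein"}))
```

The strict flag corresponds to the choice between a warning that can be
ignored and a hard restriction.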
>
> Tom
>
> [1]
> https://en.wikipedia.org/w/index.php?title=Reelin&diff=56108806&oldid=56101233
> [2] http://www.freebase.com/biology/protein/entrez_gene_id
> [3]
> https://www.wikidata.org/w/index.php?title=Q414043&type=revision&diff=262778265&oldid=262243280
> [4]
> https://en.wikipedia.org/w/index.php?title=Reelin&dir=prev&action=history
> [5] http://www.freebase.com/dataworld/incompatible_types?instances=
>
>
> On Wed, Oct 28, 2015 at 1:07 PM, Benjamin Good <[email protected]>
> wrote:
>
>> The Gene Wiki team is experiencing a problem that may suggest some areas
>> for improvement in the general wikidata experience.
>>
>> When our project was getting started, we had some fairly long public
>> debates about how we should structure the data we wanted to load [1].
>> These resulted in a data model that, we think, remains pretty much true to
>> the semantics of the data, at the cost of distributing information about
>> closely related things (genes, proteins, orthologs) across multiple,
>> interlinked items.  Now, as long as these semantic links between the
>> different item classes are maintained, this is working out great.  However,
>> we are consistently seeing people merging items that our model needs to be
>> distinct.  Most commonly, we see people merging items about genes with
>> items about the protein product of the gene (e.g. [2]).  This happens
>> nearly every day - especially on items related to the more popular
>> Wikipedia articles. (More examples [3])
>>
>> Merges like this, as well as other semantics-breaking edits, make it very
>> challenging to build downstream apps (like the wikipedia infobox) that
>> depend on having certain structures in place.  My question to the list is
>> how to best protect the semantic models that span multiple entity types in
>> wikidata?  Related to this, is there an opportunity for some consistent way
>> of explaining these structures to the community when they exist?
>>
>> I guess the immediate solutions are to (1) write another bot that watches
>> for model-breaking edits and reverts them and (2) to create an article on
>> wikidata somewhere that succinctly explains the model and links back to the
>> discussions that went into its creation.
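
A minimal sketch of the detection half of option (1), assuming a simplified
in-memory view of the items in which a gene links to its protein through an
"encodes" list and the protein links back through "encoded_by". The field
names, item IDs, and reason codes are hypothetical, and the revert step is
left out entirely.

```python
def find_model_violations(items):
    """Return (item_id, target_id, reason) for every broken gene/protein link.

    `items` maps an item ID to a dict with a "type" ("gene" or "protein")
    and optional "encodes" / "encoded_by" lists of item IDs.
    """
    violations = []
    for item_id, item in sorted(items.items()):
        if item["type"] == "gene":
            forward, backward, expected = "encodes", "encoded_by", "protein"
        elif item["type"] == "protein":
            forward, backward, expected = "encoded_by", "encodes", "gene"
        else:
            continue  # other item classes are outside this model
        for target_id in item.get(forward, []):
            target = items.get(target_id)
            if target is None:
                # The target is gone, e.g. merged away or deleted.
                violations.append((item_id, target_id, "missing"))
            elif target["type"] != expected:
                violations.append((item_id, target_id, "wrong_type"))
            elif item_id not in target.get(backward, []):
                violations.append((item_id, target_id, "no_backlink"))
    return violations


# Hypothetical items: a healthy pair, then the same gene after its protein
# item has been merged away.
healthy = {
    "Q_GENE": {"type": "gene", "encodes": ["Q_PROTEIN"]},
    "Q_PROTEIN": {"type": "protein", "encoded_by": ["Q_GENE"]},
}
after_merge = {"Q_GENE": {"type": "gene", "encodes": ["Q_PROTEIN"]}}

print(find_model_violations(healthy))
print(find_model_violations(after_merge))
```

A watcher could run a check like this against recently edited items and
flag or revert the edits that introduce violations.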
>>
>> It seems that anyone that works beyond a single entity type is going to
>> face the same kind of problems, so I'm posting this here in hopes that
>> generalizable patterns (and perhaps even supporting code) can be realized
>> by this community.
>>
>> [1]
>> https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Molecular_biology#Distinguishing_between_genes_and_proteins
>> [2] https://www.wikidata.org/w/index.php?title=Q417782&oldid=262745370
>> [3]
>> https://s3.amazonaws.com/uploads.hipchat.com/25885/699742/rTrv5VgLm5yQg6z/mergelist.txt
>>
>>
>> _______________________________________________
>> Wikidata mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>
>>
>

