BTW, merges aren't the only problem. For all languages except English, it's the protein Wikidata item [1] that points to the corresponding Wikipedia page, while for English it's the gene item [2] that points to the corresponding English article [3].
[1] https://www.wikidata.org/wiki/Q13561329
[2] https://www.wikidata.org/wiki/Q414043
[3] https://en.wikipedia.org/wiki/Reelin

On Wed, Oct 28, 2015 at 3:08 PM, Tom Morris <[email protected]> wrote:

> This is a deep-seated semantic confusion going back to at least 2006 [1]
> when the Protein Infobox had Entrez and OMIM gene IDs. Freebase naively
> adopted it in its initial protein schema in 2007 when it was importing from
> those infoboxes. Although it made some progress in improving the schema
> later, anything not aligned with how Wikipedians want to do things is
> shoveling against the tide. It's also very difficult to manage
> equivalences when Wikipedia articles are about multiple things like the
> protein/gene articles.
>
> If you look at the recent merge of Reelin [3] you can see that it was done
> by the same user who contributed substantially to the article back in 2006
> [4], so clearly, as the "owner" of that article, they clearly know what's
> best. :-) It's going to be very difficult to get people to unlearn a
> decade of habits.
>
> Another issue is that, as soon as you start trying to split things out
> into semantically clean pieces, you immediately run afoul of the notability
> restrictions. Because human (and mouse) genes don't have their own
> Wikipedia pages, they're clearly not notable, so they can't be added to
> Wikidata.
>
> This problem of chunking by notability (or lack thereof), length of text
> article, relatedness, and other attributes rather than semantic
> individuality is much more widespread than just proteins/genes. It also
> affects things like pairs (or small sets) of people who aren't notable
> enough to have an article on their own, articles which contain infoboxes
> about people who aren't notable, so they got tacked onto a related article
> to give them a home, etc.
>
> The inverse problem exists as well where a single semantic topic is broken
> up into multiple articles purely for reasons of length. Other types of
> semantic mismatches include articles along precoordinated facets like
> Transportation in New York City (or even History of Transportation in New
> York City!), list articles (* Filmography, * Discography, * Videography,
> List of *). Of course, some lists, like the Fortune 500, make sense to
> talk about as entities, but most Wikipedia lists are just mechanically
> generated things for human browsing which don't really need a semantic
> identifier. Freebase deleted most of this Wikipedia cruft.
>
> Going back to Ben's original problem, one tool that Freebase used to help
> manage the problem of incompatible type merges was a set of curated sets of
> incompatible types [5] which was used by the merge tools to warn users that
> the merge they were proposing probably wasn't a good idea. People could
> ignore the warning in the Freebase implementation, but Wikidata could make
> it a hard restriction or just a warning.
>
> Tom
>
> [1]
> https://en.wikipedia.org/w/index.php?title=Reelin&diff=56108806&oldid=56101233
> [2] http://www.freebase.com/biology/protein/entrez_gene_id
> [3]
> https://www.wikidata.org/w/index.php?title=Q414043&type=revision&diff=262778265&oldid=262243280
> [4]
> https://en.wikipedia.org/w/index.php?title=Reelin&dir=prev&action=history
> [5] http://www.freebase.com/dataworld/incompatible_types?instances=
>
>
> On Wed, Oct 28, 2015 at 1:07 PM, Benjamin Good <[email protected]>
> wrote:
>
>> The Gene Wiki team is experiencing a problem that may suggest some areas
>> for improvement in the general wikidata experience.
>>
>> When our project was getting started, we had some fairly long public
>> debates about how we should structure the data we wanted to load [1].
>> These resulted in a data model that, we think, remains pretty much true to
>> the semantics of the data, at the cost of distributing information about
>> closely related things (genes, proteins, orthologs) across multiple,
>> interlinked items. Now, as long as these semantic links between the
>> different item classes are maintained, this is working out great. However,
>> we are consistently seeing people merging items that our model needs to be
>> distinct. Most commonly, we see people merging items about genes with
>> items about the protein product of the gene (e.g. [2]). This happens
>> nearly every day - especially on items related to the more popular
>> Wikipedia articles. (More examples [3])
>>
>> Merges like this, as well as other semantics-breaking edits, make it very
>> challenging to build downstream apps (like the wikipedia infobox) that
>> depend on having certain structures in place. My question to the list is
>> how to best protect the semantic models that span multiple entity types in
>> wikidata? Related to this, is there an opportunity for some consistent way
>> of explaining these structures to the community when they exist?
>>
>> I guess the immediate solutions are to (1) write another bot that watches
>> for model-breaking edits and reverts them and (2) to create an article on
>> wikidata somewhere that succinctly explains the model and links back to the
>> discussions that went into its creation.
>>
>> It seems that anyone that works beyond a single entity type is going to
>> face the same kind of problems, so I'm posting this here in hopes that
>> generalizable patterns (and perhaps even supporting code) can be realized
>> by this community.
>>
>> [1]
>> https://www.wikidata.org/wiki/Wikidata_talk:WikiProject_Molecular_biology#Distinguishing_between_genes_and_proteins
>> [2] https://www.wikidata.org/w/index.php?title=Q417782&oldid=262745370
>> [3]
>> https://s3.amazonaws.com/uploads.hipchat.com/25885/699742/rTrv5VgLm5yQg6z/mergelist.txt
>>
>>
>> _______________________________________________
>> Wikidata mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/wikidata
>>
>>
>
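
For what it's worth, the incompatible-types check Tom describes in the quoted message could look roughly like the sketch below. Everything in it is an assumption made for illustration: the type labels, the INCOMPATIBLE_TYPES set, and the function names merge_conflicts and check_merge are hypothetical, and none of this is Freebase's or Wikidata's actual code or data. The hard flag models the two options mentioned in the thread, a warning that can be ignored versus a hard restriction.

```python
from itertools import product

# Hypothetical curated set: each entry is a pair of types whose instances
# should not be merged. The labels are illustrative only, not real
# Wikidata or Freebase identifiers.
INCOMPATIBLE_TYPES = {
    frozenset({"gene", "protein"}),
    frozenset({"person", "list"}),
}


def merge_conflicts(types_a, types_b, incompatible=INCOMPATIBLE_TYPES):
    """Return the sorted incompatible type pairs between two items."""
    conflicts = set()
    for a, b in product(types_a, types_b):
        # A pair of identical types collapses to a one-element frozenset,
        # which never matches a curated two-element entry.
        if frozenset({a, b}) in incompatible:
            conflicts.add(tuple(sorted((a, b))))
    return sorted(conflicts)


def check_merge(types_a, types_b, hard=False):
    """Return (allowed, conflicts) for a proposed merge.

    With hard=False a conflict is only a warning and the merge stays allowed;
    with hard=True any conflict blocks the merge.
    """
    conflicts = merge_conflicts(types_a, types_b)
    allowed = not (hard and conflicts)
    return allowed, conflicts
```

With these made-up types, check_merge({"gene"}, {"protein"}) allows the merge but reports the conflict, while passing hard=True rejects it. Merging two protein items reports no conflicts either way.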
