Archer <[email protected]> wrote:
> Please don’t understand me wrong. I’m a big fan of Wikidata but I'm against
> an automated import. The mismatches list gives good examples that your
> matching algorithm doesn't work very well:
> http://edwardbetts.com/osm-wikidata/mismatches.html
>
> Some examples:
>
> 1. Isar Nuclear Power Plant <http://wikidata.org/wiki/Q569510>: your
> algorithm matches only one reactor of the power plant: Isar 2
> <http://www.openstreetmap.org/way/32918120> but the right matching
> would be Kernkraftwerke Isar <http://www.openstreetmap.org/way/23802422>

Q569510 matches Isar 2 (Way 32918120) because "Isar 2" is in the list of
German aliases on the Wikidata item:

    [ "KKW Isar", "AKW Isar", "Isar 2", "Kernkraftwerk Isar I", "Isar 1",
      "Atomkraftwerk Isar" ]

The German label on the Wikidata item is "Kernkraftwerke Isar"; notice the
extra 'e' on the end of the first word.

I could add Levenshtein distance calculations to my matching, so that two
names count as a match if they differ by a single character. With this
change both OSM objects would match and, with two candidates, my code would
skip the Wikidata item.

The problem with this change is that "hill" and "hall" would also match.

> 2. Heligoland <http://wikidata.org/wiki/Q3038>: you’ve matched the island
> Heligoland <http://www.openstreetmap.org/relation/3787052> but the right
> match would be the municipality Heligoland
> <http://www.openstreetmap.org/relation/1157962> (for the island there
> exists a different object in Wikidata)

I can't find the Wikidata item that represents the island.

> 3. Puerto Rico <http://wikidata.org/wiki/Q1183>: the Wikidata objects says
> „is a unincorporated area of the United states“ – the right match therefore
> would be the administrative relation: Puerto Rico
> <http://www.openstreetmap.org/relation/306157> but your algorithm matches
> the island: Island of Puerto Rico
> <http://www.openstreetmap.org/node/357271412>

The English Wikipedia article "Puerto Rico" is in the 'Islands of Puerto
Rico' category, so my code considers Q1183 to represent an island. Node
357271412 is tagged as place=island, so it is a perfect match.

We could argue that the node doesn't have much purpose in OSM and that its
tags could be merged into Relation 306157.

> I also don’t understand why you prefer nodes instead of ways or relations.
> Ways and relations provide more information (e.g. extent of an area) than
> nodes. The Matching algorithm should first look for relations, when there’s
> no relation it should search for ways. Nodes should come last.

The matching algorithm only considers objects within 400m of the location
given in Wikidata. The nodes happen to be that close, but the centre of the
relation is more than 400m away.

I've modified my matching algorithm to use much larger distances for some
types of object, and the new run is in progress. My hope is that when it has
finished, the code will detect the presence of both the node and the relation
and skip the Wikidata item. Most of these node vs relation mismatches should
disappear.

> What does your matching algorithm when a Wikidata object describes
> different objects and therefore should be split?
>
> A good example for this is the Wikidata object for Thasos
> <https://www.wikidata.org/wiki/Q204096> (currently it describes the island
> and the municipality “Thasos”) but the object has to be split into two
> Wikidata objects so that you can say “the island Thasos lies in the
> administrative division Thasos”. There are also other examples like mixed
> up nature reserves, lakes and administrative divisions in Wikidata which
> you have to solve before you can import the IDs into OSM.

My code doesn't do anything special with a Wikidata item that represents
multiple things, such as an island and a municipality. If Wikidata/Wikipedia
claim a thing is an island, and in OSM there is a thing tagged with
place=island and the same name, they will match. OSM objects can be tagged
as both an island and a municipality.

--
Edward.

_______________________________________________
talk mailing list
[email protected]
https://lists.openstreetmap.org/listinfo/talk
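
To make the single-character tolerance discussed in the reply concrete, here is a minimal Python sketch. It is not the code behind the mismatches page: the function names (levenshtein, names_match, pick_match) and the max_distance parameter are hypothetical, and the OSM name "Kernkraftwerk Isar" for Way 23802422 is an assumption inferred from the remark about the extra 'e'. The sketch only shows how a tolerance of one edit turns the Isar case from a single match into an ambiguous one that gets skipped, and why "hill" and "hall" become indistinguishable.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the row we store as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def names_match(wikidata_names, osm_name, max_distance=1):
    """True if any Wikidata label or alias is within max_distance edits."""
    target = osm_name.casefold()
    return any(levenshtein(name.casefold(), target) <= max_distance
               for name in wikidata_names)


def pick_match(wikidata_names, osm_candidates, max_distance=1):
    """Return the only matching OSM id, or None if zero or several match."""
    hits = [osm_id for osm_id, osm_name in osm_candidates
            if names_match(wikidata_names, osm_name, max_distance)]
    return hits[0] if len(hits) == 1 else None


# German label plus the aliases quoted in the message for Q569510.
isar_names = ["Kernkraftwerke Isar", "KKW Isar", "AKW Isar", "Isar 2",
              "Kernkraftwerk Isar I", "Isar 1", "Atomkraftwerk Isar"]

# The second OSM name is assumed (label minus the trailing 'e').
isar_candidates = [("way/32918120", "Isar 2"),
                   ("way/23802422", "Kernkraftwerk Isar")]

exact = pick_match(isar_names, isar_candidates, max_distance=0)
fuzzy = pick_match(isar_names, isar_candidates, max_distance=1)
print(exact, fuzzy, levenshtein("hill", "hall"))
```

With exact comparison only the reactor matches. With one edit allowed, both ways match, so the item is skipped rather than tagged wrongly. The cost of that tolerance is visible in the last value: "hill" and "hall" are also one edit apart.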

