Hi, I'm not sure whether this has been discussed before or not; if it has, a pointer to the old thread would be much appreciated!
People working on imports, QA tools, and other projects run into the problem of comparing data in OSM to external data sources. So far, the proposals I have seen either look like git and rely on change history, or use some kind of ID in OSM that corresponds to an ID in the external data set. Both of these approaches have critical issues: external IDs can be deleted or damaged during normal editing, and change history can be lost if editors delete an area and retrace it rather than move the existing data around. OSM needs to optimize first and foremost for normal hand editing, so adding restrictions to prevent either kind of activity obviously should not happen.

There is another approach: weighted matching algorithms.

http://en.wikipedia.org/wiki/Bipartite_graph
http://en.wikipedia.org/wiki/Ford%E2%80%93Fulkerson_algorithm

This is a well-studied problem. Because our data is basically aligned, it is not NP-hard. Doing 30,000 matches in a minute should be no problem; the algorithms and modern computers are fast enough for real-world data sets.

The idea is that we build feature vectors that get turned into a feature-to-feature match score. If you are matching buildings, use the Euclidean distance between centers of mass, area, the ratio of area to perimeter, etc. If you are matching roads, break them into segments and use length, Euclidean distance between centers of mass, PCA to get orientation, etc. If the roads have names, you could add the Hamming edit distance to score name matches. The feature vectors would depend on the specific kind of data being matched; we would probably need a handful of them to cover all of the different kinds of data OSM has.

http://en.wikipedia.org/wiki/Shape_analysis_(digital_geometry)
http://en.wikipedia.org/wiki/Principal_component_analysis
http://en.wikipedia.org/wiki/Hamming_distance

For example, if you are matching houses, you would find all of the houses in the external data set that are within a specified distance and calculate a match score between the OSM house and each of them. This forms a weighted bipartite graph between the OSM data and the external data, with the weight of each edge being the match score derived from the feature vectors. You then use something like an augmenting-path maximum-flow algorithm to calculate the globally optimal matching. You would need to fiddle with the scoring and the feature vectors, but this would automate the trivial matches, and by inspecting the matches with low scores, it would let the end user focus their attention on the areas that actually need a person looking at them. If you went crazy, you could even include the OSM change history to find items that were intentionally removed from OSM but are still present in the external data source.

Soft matching followed by an optimization step would allow us to compare OSM to existing data as it is now. It seems like having some code to do this would be useful to a bunch of different people in OSM. This kind of problem is tackled all the time by people doing machine learning and image processing. Has this been tried by anybody? Is there any code available? To make it more concrete, I have put rough Python sketches of the scoring and matching steps below.
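First the scoring step: a rough Python sketch of a building-to-building match score, using Shapely for the geometry. The particular features, the 20 m distance scale, and the weights are made-up numbers for illustration only, and I am assuming the coordinates are already in a projected (metric) system.

# Sketch of a feature-vector match score for two building footprints.
# Uses Shapely for geometry; the features, scales, and weights below are
# placeholders that would need tuning against real OSM and external data.
import math
from shapely.geometry import Polygon

def building_features(poly):
    """Feature vector: (centroid x, centroid y, area, area/perimeter ratio)."""
    c = poly.centroid
    return (c.x, c.y, poly.area, poly.area / poly.length)

def building_match_score(osm_poly, ext_poly):
    """Return a score in [0, 1]; values near 1.0 mean a near-perfect match."""
    fx, fy, f_area, f_ratio = building_features(osm_poly)
    gx, gy, g_area, g_ratio = building_features(ext_poly)

    # Euclidean distance between the centers of mass (assumed metres).
    centroid_dist = math.hypot(fx - gx, fy - gy)

    # Relative differences of the scalar features.
    area_diff = abs(f_area - g_area) / max(f_area, g_area)
    ratio_diff = abs(f_ratio - g_ratio) / max(f_ratio, g_ratio)

    # Turn each difference into a partial score that decays towards 0,
    # then combine with arbitrary weights.  The 20 m scale and the weights
    # are illustrative only.
    s_centroid = math.exp(-centroid_dist / 20.0)
    s_area = 1.0 - area_diff
    s_ratio = 1.0 - ratio_diff
    return 0.5 * s_centroid + 0.3 * s_area + 0.2 * s_ratio

if __name__ == "__main__":
    a = Polygon([(0, 0), (10, 0), (10, 8), (0, 8)])
    b = Polygon([(1, 0), (11, 0), (11, 8), (1, 8)])
    print(building_match_score(a, b))  # high score: same footprint shifted 1 m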

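Then the matching step. Rather than hand-rolling an augmenting-path max-flow, the sketch below leans on SciPy's linear_sum_assignment (a Hungarian-style assignment solver), which computes the same globally optimal one-to-one matching on the score matrix. It assumes the building_match_score function from the previous sketch; the 50 m search radius and the 0.7 review threshold are again arbitrary. For a whole city you would want a spatial index to limit candidate pairs and a sparse formulation rather than a dense matrix, but the idea is the same.

# Sketch of the global matching step: build a score matrix between OSM
# buildings and external buildings, then find the globally optimal
# one-to-one assignment.  SciPy's linear_sum_assignment (Hungarian-style)
# is used in place of an explicit augmenting-path max-flow solver; it
# optimizes the same objective.  Thresholds are illustrative only.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_buildings(osm_polys, ext_polys, max_dist=50.0, review_below=0.7):
    """Return (confident matches, pairs needing review, unmatched OSM indices)."""
    scores = np.zeros((len(osm_polys), len(ext_polys)))
    for i, a in enumerate(osm_polys):
        for j, b in enumerate(ext_polys):
            # Only score candidates within the search radius;
            # everything else keeps a score of zero.
            if a.centroid.distance(b.centroid) <= max_dist:
                scores[i, j] = building_match_score(a, b)

    # linear_sum_assignment minimizes cost, so negate the scores.
    rows, cols = linear_sum_assignment(-scores)

    matched, needs_review = [], []
    assigned = set()
    for i, j in zip(rows, cols):
        if scores[i, j] <= 0.0:
            continue  # no candidate in range; leave unmatched
        assigned.add(i)
        if scores[i, j] >= review_below:
            matched.append((i, j, scores[i, j]))
        else:
            needs_review.append((i, j, scores[i, j]))

    unmatched = [i for i in range(len(osm_polys)) if i not in assigned]
    return matched, needs_review, unmatched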
Thanks
Jason

_______________________________________________
Talk-us mailing list
[email protected]
http://lists.openstreetmap.org/listinfo/talk-us