> From: Toby Murray [mailto:toby.mur...@gmail.com]
> Subject: Re: [OSM-talk] Réf.: Re: All you've ever wanted to know about the french cadastre
>
> I think the biggest cost for long tags that are heavily used is really
> in the planet file size. A bigger planet takes longer to generate, longer
> to download, longer to parse. The sheer size of it can be a problem to
> some potential users. Especially when over 10% of it is just tags from
> imports that most data consumers couldn't care less about. I think I
> calculated once that the tiger:upload_uuid tag here in the US is
> responsible for about 1% of the data in the planet file. Since it is a
> random string with hundreds of thousands of possible values, it doesn't
> compress well either.

The French cadastre imports are about twice the size of TIGER, measured in number of tagged objects. What's more interesting is the difference between the current OSM data and what it would be if excess data and tags were removed.

Looking at a random TIGER way (5264081), it has 16 tags instead of the 3-4 I would use if tagging it by hand. The 16 tags total 460 characters for keys and values, while the 4 tags I would use total 79. To simplify, I'll assume that TIGER consumes about 4x the space in tags that it needs to. On the other hand, it has no excess nodes.

The technical downsides of this extra space are larger planets and more complicated tag columns and indexes. Most data consumers will drop the TIGER tags along with the source tags, so the extra tags mainly increase the time taken for the initial load of data and for applying diffs.

The cadastre imports are more complicated. I'm not aware of any comprehensive studies on the quality of the imports, but I did some analysis[1] previously. Based on this, about 75% of the buildings are building=yes wall=yes and 25% are building=yes with no wall tag. Applying the same analysis as for TIGER indicates that the cadastre data uses about 2x the space in tags that it needs to.

What is considerably more complicated is the geometries used in the French import. A typical example would be a detached house with a porch, represented as two ways when most mappers would recommend one. This results in extra ways, which in turn means extra rows, larger geometry indexes, slower queries and all the tag information being duplicated.

The important question is, how many buildings are like this? It is possible to get an answer to how many share ways, but in some cases sharing ways is normal (e.g. a block in a city where the buildings are joined). A complete analysis is beyond the scope of this email, but we can get an idea from [2] and the fact that the most common unsimplified building area in the import is 6 square meters[3].
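
As a rough sketch of the tag arithmetic for the TIGER way discussed earlier in this message, here it is in Python. The character counts are the ones quoted for way 5264081; the function and variable names are made up for illustration. For this single way the raw ratio comes out higher than the 4x figure, which remains the simplifying assumption used in the rest of this message.

```python
# Character counts quoted above for TIGER way 5264081 (keys plus values).
IMPORTED_TAG_CHARS = 460  # the 16 tags present on the way
HAND_TAG_CHARS = 79       # the 4 tags a mapper would plausibly use


def bloat_ratio(imported: int, minimal: int) -> float:
    """How many times larger the imported tags are than the minimal set."""
    return imported / minimal


def excess_fraction(imported: int, minimal: int) -> float:
    """Share of the imported tag characters that could be dropped."""
    return (imported - minimal) / imported


char_ratio = bloat_ratio(IMPORTED_TAG_CHARS, HAND_TAG_CHARS)
char_excess = excess_fraction(IMPORTED_TAG_CHARS, HAND_TAG_CHARS)

print(f"tag characters: {char_ratio:.1f}x the minimal set")
print(f"droppable share of tag characters: {char_excess:.0%}")
```

This only measures one way, so it says nothing about the distribution across all TIGER ways.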

The 6 square meter figure indicates that the case of a building way with attached ways representing porches and other attached areas is a common one. If you assume wall=no buildings attached to buildings without a wall tag can be combined, I would estimate that the number of ways is at least 1.5x what it needs to be.

An average cadastre building way uses 5.75 nodes. If you consider the case of a square building where one corner is mapped with wall=no, this is a change from 7 nodes (6 in one way and 4 in the other, with 3 nodes shared) to 4 nodes. Again, we get a result on the order of 1.5x as many nodes as required.

For most data consumers the bloat in objects from the cadastre imports will be far more significant than the bloat in tags on TIGER data. It's hard to convert these to raw times, but to give an idea, throwing out raw buildings reportedly reduced the Nominatim import time from 48 hours to 37 hours, and half the buildings are in France.

I would welcome a more complete analysis, and if anyone needs me to run some queries on my pgsnapshot DB I could do so.

> One schema where you could actually make a direct comparison is pgsnapshot.
> It can store linestring geometry and it stores all tags in an hstore field.
> I'm not really sure how the linestring geometry is stored on disk. When
> queried at a postgres prompt, it returns a string that is 187 characters
> long for some random 4 node way I picked out.

I believe the representation on-disk uses space proportional to the string returned. This doesn't tell you how much space is taken up by nodes, which is more significant.

[1]: http://lists.openstreetmap.org/pipermail/talk/2012-September/064559.html
[2]: http://lists.openstreetmap.org/pipermail/talk/2012-September/064576.html
[3]: http://merry.paulnorman.ca:7201/dist2.pdf