> From: Toby Murray [mailto:toby.mur...@gmail.com] 
> Subject: Re: [OSM-talk] Réf.: Re: All you've ever wanted to know about the french cadastre
> I think the biggest cost for long tags that are heavily used is really 
> in the planet file size. A bigger planet takes longer to generate, longer 
> to download, longer to parse. The sheer size of it can be a problem to 
> some potential users. Especially when over 10% of it is just tags from 
> imports that most data consumers couldn't care less about. I think I 
> calculated once that the tiger:upload_uuid tag here in the US is 
> responsible for about 1% of the data in the planet file. Since it is a 
> random string with hundreds of thousands of possible values, it doesn't 
> compress well either.

The French cadastre imports are about twice the size of TIGER, measured in
number of tagged objects.

What's more interesting is the difference between the current OSM data and
what it would be if excess data and tags were removed. Looking at a random
TIGER way (5264081) it has 16 tags instead of the 3-4 I would use if tagging
it. 

The 16 tags total 460 characters for keys and values and the 4 total 79. To
simplify, I'll assume that TIGER consumes about 4x the space in tags it
needs to. On the other hand, it has no excess nodes. The technical downsides
to this extra space consumed are larger planets and more complicated tag
columns and indexes. Most data consumers will drop the TIGER tags along with
the source tags, so the extra tags mainly increase the time taken for the
initial load of data and for applying diffs.
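
As a rough sketch of the arithmetic behind that estimate, here it is in Python.
The tag set below is made up for illustration and is not the actual tagging of
way 5264081; only the 460 and 79 character totals come from the figures above.

```python
def tag_chars(tags):
    """Total characters used by the keys and values of one object's tags."""
    return sum(len(key) + len(value) for key, value in tags.items())

# Hypothetical hand-tagged residential road (illustrative values only).
minimal = {"highway": "residential", "name": "Example Street"}
minimal_size = tag_chars(minimal)

# Ratio of the two totals quoted above for the TIGER way.
tiger_ratio = round(460 / 79, 1)

print(minimal_size, tiger_ratio)
```

For this one way the quoted totals work out to a ratio above 5x, so the 4x
figure is on the conservative side.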

The cadastre imports are more complicated. I'm not aware of any
comprehensive studies on the quality of the imports, but I did some
analysis[1] previously. Based on this, about 75% of the buildings are
building=yes wall=yes and 25% are building=yes with no wall tag. The same
analysis as done for TIGER indicates the cadastre buildings use about 2x the
space in tags that they need to.

What is considerably more complicated is the geometries used in the French
import. A typical example would be a detached house with a porch,
represented as two ways when most mappers would recommend one. This results
in extra ways which results in extra rows, larger geometry indexes, slower
queries and all the tag information being duplicated.

The important question is, how many buildings are like this? It is possible
to get an answer to how many share ways, but in some cases sharing ways is
normal (e.g. a block in a city where the buildings are joined). A complete
analysis is beyond the scope of this email, but we can get an idea from [2]
and the fact that the most common unsimplified building area in the import
is 6 square meters[3]. This indicates that the case of a building way with
attached ways representing porches and other attached areas is common.

If you assume wall=no buildings attached to buildings without a wall tag can
be combined, I would estimate that the number of ways is at least 1.5x what
it needs to be. An average cadastre building way uses 5.75 nodes. If you
consider the case of a square building where one corner is mapped with
wall=no this is a change from 7 nodes (6 in one and 4 in the other, but some
nodes are shared) to 4 nodes.

Again, we get a result on the order of 1.5x as many nodes as required.
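
The square-building case can be checked with a small Python sketch. The
coordinates are arbitrary units chosen for illustration: a 2x2 square whose
top-right 1x1 corner is mapped as a separate way.

```python
# L-shaped main way (6 nodes) and the separately mapped corner (4 nodes).
main_part = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]
corner_part = [(1, 1), (2, 1), (2, 2), (1, 2)]

# The same footprint mapped as a single square way.
merged = [(0, 0), (2, 0), (2, 2), (0, 2)]

# Nodes on the common boundary are shared, so count distinct positions.
shared_nodes = len(set(main_part) & set(corner_part))
split_nodes = len(set(main_part) | set(corner_part))
merged_nodes = len(set(merged))

print(shared_nodes, split_nodes, merged_nodes, split_nodes / merged_nodes)
```

In this idealised case the split mapping needs 1.75x the nodes of the merged
one; real buildings will vary, which is why I only claim something on the
order of 1.5x overall.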

For most data consumers the bloat in objects from the cadastre imports will
be far more significant than the bloat in tags on TIGER data.

It's hard to convert these to raw times, but to give an idea, throwing out
raw buildings reportedly reduced the Nominatim import time from 48 hours to
37 hours, and half the buildings are in France.

I would welcome a more complete analysis and if anyone needs me to run some
queries on my pgsnapshot DB I could do so.

> One schema where you could actually make a direct comparison is pgsnapshot.
> It can store linestring geometry and it stores all tags in an hstore field.

> I'm not really sure how the linestring geometry is stored on disk. When 
> queried at a postgres prompt, it returns a string that is 187 characters 
> long for some random 4 node way I picked out.

I believe the representation on-disk uses space proportional to the string
returned. This doesn't tell you how much space is taken up by nodes, which is
more significant.
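
To get a feel for how that string grows with the number of points, here is a
rough Python sketch. It assumes the string returned at the prompt is
hex-encoded EWKB with an SRID, which is how PostGIS prints geometry columns by
default. The function name and coordinates are made up for illustration, and
the on-disk format is not identical to EWKB.

```python
import struct

def ewkb_hex_linestring(points, srid=4326):
    """Hex EWKB for a 2D linestring (illustrative helper, little-endian)."""
    linestring_type = 2
    srid_flag = 0x20000000  # EWKB flag: an SRID follows the type word
    # Byte order marker, geometry type with SRID flag, SRID.
    data = struct.pack("<BII", 1, linestring_type | srid_flag, srid)
    # Point count, then one pair of 8-byte doubles per point.
    data += struct.pack("<I", len(points))
    for x, y in points:
        data += struct.pack("<dd", x, y)
    return data.hex().upper()

four_points = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(len(ewkb_hex_linestring(four_points)))
```

Under these assumptions a linestring takes 26 + 32n hex characters for n
points, so nearly all of the size is the 16 bytes per coordinate pair.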

[1]:
http://lists.openstreetmap.org/pipermail/talk/2012-September/064559.html
[2]:
http://lists.openstreetmap.org/pipermail/talk/2012-September/064576.html
[3]: http://merry.paulnorman.ca:7201/dist2.pdf


