On Mon, Jun 09, 2008 at 11:50:50AM +0200, Florian Lohoff wrote:
> [EMAIL PROTECTED]:~/projects/tah$ time ( ./createworkset -t dir; sync )
> Pages: 191390 Pagesize: 4096
> Need to create a working set of minimum 2242Mbyte
> 
> real    12m26.486s
> user    0m0.812s
> sys     0m17.013s
> [EMAIL PROTECTED]:~/projects/tah$ time ( ./createworkset -t flat; sync )
> Pages: 191390 Pagesize: 4096
> Need to create a working set of minimum 2242Mbyte
> 
> real    1m36.243s
> user    0m0.772s
> sys     0m14.749s
> 
> This is a working set creation with 0-32Kbyte files e.g. average of
> 16Kbyte. The dirset is a 2 level directory e.g. workset/z/x/y.png
> with z ranging from 0-7 x and y ranging from 0-512
> 
> The flat file model e.g. container is a single container with an
> additional index file.
> 
> The hardware is a single P4 2.4Ghz, 768MByte memory, a single IDE disk
> ~80% full, ext3 filesystem with atime set.
> 
> The difference in time is caused by the disk performance of unzipping or
> creation of files is bound by seeks not by streaming performance. One
> can see from the times that in the end the one file per tile model does
> not need significantly more CPU time but rather waits more on the disk.
> 
> I'll write a simple benchmark for access e.g. serve performance in a
> minute. I'll put it online for others to play afterwards ...

Without much thinking, I just selected 4000 of the roughly 140000 virtual
tiles and requested them, i.e. read them from storage:

[EMAIL PROTECTED]:~/projects/tah$ time ./servtiles -t flat -f flat.list.random.4000
Loading tile list to query
Done loading tile query list - found 4000 querys
Prepare done

real    0m40.952s
user    0m0.156s
sys     0m0.548s
[EMAIL PROTECTED]:~/projects/tah$ time ./servtiles -t dir -f dir.list.random.4000
Loading tile list to query
Done loading tile query list - found 4000 querys

real    1m3.776s
user    0m0.012s
sys     0m0.368s
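
Broken down per tile, the two wall-clock times above can be compared
with a few lines of Python. This is only arithmetic on the "real"
figures printed by time; the function name is made up for illustration.

```python
def per_tile_ms(wall_seconds, tiles):
    """Average wall-clock time per requested tile, in milliseconds."""
    return wall_seconds / tiles * 1000.0

# "real" times from the two servtiles runs, 4000 tiles each
flat_ms = per_tile_ms(40.952, 4000)
dir_ms = per_tile_ms(63.776, 4000)

print(f"flat: {flat_ms:.2f} ms/tile, dir: {dir_ms:.2f} ms/tile, "
      f"ratio dir/flat: {dir_ms / flat_ms:.2f}")
```

Both results are in the range of milliseconds per tile with almost no
CPU time used, which fits reads that mostly wait on the disk.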

Currently the index is a fixed-width table that I read in once. If the
index is too large for main memory, one would switch to a simple
ndbm/ldbm file which would contain only the tile's x, y, z as the key
and, as the value, the filename, offset and length.
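
A minimal sketch of such a fixed-width index in Python, to make the
lookup path concrete. The record layout, the integer container id
standing in for the filename, and the names RECORD, load_index and
read_tile are assumptions for illustration, not the format used in the
tarball.

```python
import struct

# Hypothetical fixed-width record (little-endian, no padding):
# z (1 byte), x, y (2 bytes each), container id (2 bytes),
# offset (8 bytes), length (4 bytes), timestamp (8 bytes).
RECORD = struct.Struct("<BHHHQIQ")

def load_index(raw):
    """Parse the whole index once into {(z, x, y): (container, offset, length, ts)}."""
    index = {}
    for z, x, y, container, offset, length, ts in RECORD.iter_unpack(raw):
        index[(z, x, y)] = (container, offset, length, ts)
    return index

def read_tile(index, containers, key):
    """Return the bytes of one tile, or None if the key is not in the index.

    containers maps a container id to a seekable binary file object.
    """
    entry = index.get(key)
    if entry is None:
        return None
    container, offset, length, _ts = entry
    f = containers[container]
    f.seek(offset)          # one seek into the container ...
    return f.read(length)   # ... and one sequential read
```

With an in-memory dict the lookup itself costs no disk access, so each
tile request needs a single seek and read in the container file.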

Currently I just create a single container, which for me holds 2.2GB of
data. I am not sure where the break-even point is performance-wise. It
does not make sense to have such enormous containers, as one needs to
reclaim the space after some time, i.e. search for holes in the data and
rewrite containers which are half empty (read: the same tile in another
container has a more recent creation timestamp). In the end there would
be a background job searching for the least full container and joining
it with the next least full container, obsoleting the former two and
creating a new one. This is why there is a timestamp in the index file.
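
The background job could look roughly like the following Python sketch.
It works on in-memory lists instead of real container files, and all
names (newest_timestamps, fill_ratio, compact_once) are hypothetical;
it only illustrates the "merge the two least full containers" idea.

```python
def newest_timestamps(containers):
    """Map each tile key to the most recent timestamp seen in any container."""
    newest = {}
    for entries in containers.values():
        for key, ts, _data in entries:
            if ts > newest.get(key, -1):
                newest[key] = ts
    return newest

def fill_ratio(entries, newest):
    """Fraction of a container's bytes that still belong to current tiles."""
    total = sum(len(data) for _key, _ts, data in entries)
    live = sum(len(data) for key, ts, data in entries if newest[key] == ts)
    return live / total if total else 0.0

def compact_once(containers, new_id):
    """Join the two least full containers into new_id; return the obsoleted ids.

    containers maps an id to a list of (key, timestamp, data) entries and
    must hold at least two containers. Stale entries are dropped.
    """
    newest = newest_timestamps(containers)
    a, b = sorted(containers, key=lambda c: fill_ratio(containers[c], newest))[:2]
    merged = [e for e in containers[a] + containers[b] if newest[e[0]] == e[1]]
    del containers[a], containers[b]
    containers[new_id] = merged
    return a, b
```

A real job would also have to rewrite the index entries of the moved
tiles and would probably stop once the least full container is above
some threshold.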

I put up the code at:

http://silicon-verl.de/home/flo/tmp/tahbench-20080609.tgz

to play with. It is by no means optimized, nor does it fix all problems,
but it should show that the millions-of-little-files approach is a
performance problem. BTW: With 4K blocks in the filesystem one wastes an
average of 2K per file, which with my working set (~140K files) sums up
to 280Mbyte of wasted space on disk and, even worse, a worse cache
footprint. With a 16Kbyte average file size, about 12% of disk space is
wasted because the last block is statistically half empty.
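
The slack estimate can be reproduced with a short Python calculation.
It assumes, as above, that the last block of each file is on average
half empty; the function name is made up for illustration.

```python
def slack_estimate(num_files, avg_file_size, block_size=4096):
    """Estimate the space lost to partially filled last blocks.

    Assumes the last block of every file is half empty on average.
    Returns (total wasted bytes, wasted bytes relative to file payload).
    """
    wasted_per_file = block_size / 2
    total_wasted = num_files * wasted_per_file
    fraction = wasted_per_file / avg_file_size
    return total_wasted, fraction

total, fraction = slack_estimate(140_000, 16 * 1024)
print(f"{total / 1e6:.0f} MB wasted, {fraction:.1%} relative to the tile data")
```

With binary kilobytes this lands slightly above the round 280Mbyte
figure, and the fraction is 2K / 16K, i.e. the roughly 12% quoted above.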

Flo
-- 
Florian Lohoff                  [EMAIL PROTECTED]             +49-171-2280134
        Those who would give up a little freedom to get a little 
          security shall soon have neither - Benjamin Franklin

_______________________________________________
Tilesathome mailing list
[email protected]
http://lists.openstreetmap.org/cgi-bin/mailman/listinfo/tilesathome
