On Mon, Jun 09, 2008 at 11:50:50AM +0200, Florian Lohoff wrote:
> [EMAIL PROTECTED]:~/projects/tah$ time ( ./createworkset -t dir; sync )
> Pages: 191390 Pagesize: 4096
> Need to create a working set of minimum 2242Mbyte
>
> real 12m26.486s
> user 0m0.812s
> sys 0m17.013s
> [EMAIL PROTECTED]:~/projects/tah$ time ( ./createworkset -t flat; sync )
> Pages: 191390 Pagesize: 4096
> Need to create a working set of minimum 2242Mbyte
>
> real 1m36.243s
> user 0m0.772s
> sys 0m14.749s
>
> This is a working set creation with 0-32Kbyte files e.g. average of
> 16Kbyte. The dirset is a 2 level directory e.g. workset/z/x/y.png
> with z ranging from 0-7 x and y ranging from 0-512
>
> The flat file model e.g. container is a single container with an
> additional index file.
>
> The hardware is a single P4 2.4Ghz, 768MByte memory, a single IDE disk
> ~80% full, ext3 filesystem with atime set.
>
> The difference in time is caused by the disk performance of unzipping or
> creation of files is bound by seeks not by streaming performance. One
> can see from the times that in the end the one file per tile model does
> not need significantly more CPU time but rather waits more on the disk.
>
> I'll write a simple benchmark for access e.g. serve performance in a
> minute. I'll put it online for others to play afterwards ...

Without much thinking - just selecting 4000 of the roughly 140000
virtual tiles and requesting them, i.e. getting them from storage:

[EMAIL PROTECTED]:~/projects/tah$ time ./servtiles -t flat -f flat.list.random.4000
Loading tile list to query
Done loading tile query list - found 4000 querys
Prepare done

real 0m40.952s
user 0m0.156s
sys 0m0.548s
[EMAIL PROTECTED]:~/projects/tah$ time ./servtiles -t dir -f dir.list.random.4000
Loading tile list to query
Done loading tile query list - found 4000 querys

real 1m3.776s
user 0m0.012s
sys 0m0.368s

Currently the index is a fixed width table I read in once. If the index
is too large for main memory one would switch to a simple ndbm/ldbm file
which would only contain the tile's x, y, z as the key and, as the value,
the filename, offset and length.

Currently I just create a single container, which for me holds 2.2GB of
data. I am not sure where the break even is performance wise. It does
not make sense to have such enormous containers, as one needs to reclaim
the space after some time, i.e. search for holes in the data and rewrite
containers which are half way empty (read: same tile in other container
has a more recent create timestamp).

In the end there would be a background job searching for the least full
container and joining it with the next least full container - obsoleting
the former two and creating a new one. This is why there is a timestamp
in the index file.

I put up the code at:

http://silicon-verl.de/home/flo/tmp/tahbench-20080609.tgz

to play with. It is by no means optimized, nor does it fix all problems,
but it should show that the millions of little files approach is a
performance problem.

BTW: With 4K blocks in the filesystem one wastes an average of 2K per
file, which with my working set (~140K files) sums up to 280Mbyte of
wasted space on disk and, even worse, a worse cache footprint. With a
16Kbyte average file size about 12% of disk space is wasted because the
last block is statistically half empty.
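
To make the container plus fixed width index idea above a bit more
concrete, here is a minimal sketch in Python. It is not the code from
the tarball: the record layout (z, x, y, offset, length, timestamp in
32 bytes), the file handling and all function names are assumptions
made up for illustration. It only shows the principle that serving a
tile costs one in-memory lookup, one seek and one read.

```python
import struct
import time

# Hypothetical record layout: z, x, y (uint32), offset (uint64),
# length (uint32), timestamp (uint64); 32 bytes, little-endian, no padding.
RECORD = struct.Struct("<IIIQIQ")


def write_container(tiles, container_path, index_path, now=None):
    """Write tiles back to back into one file plus one index record each."""
    stamp = int(time.time()) if now is None else now
    with open(container_path, "wb") as data, open(index_path, "wb") as index:
        for (z, x, y), payload in tiles:
            offset = data.tell()
            data.write(payload)
            index.write(RECORD.pack(z, x, y, offset, len(payload), stamp))


def load_index(index_path):
    """Read the whole fixed width table once into a dict keyed by (z, x, y)."""
    with open(index_path, "rb") as index:
        raw = index.read()
    table = {}
    for z, x, y, offset, length, stamp in RECORD.iter_unpack(raw):
        known = table.get((z, x, y))
        # Keep the most recent entry if a tile was written more than once.
        if known is None or stamp >= known[2]:
            table[(z, x, y)] = (offset, length, stamp)
    return table


def read_tile(container, table, key):
    """Serve one tile: dict lookup, one seek, one read."""
    offset, length, _stamp = table[key]
    container.seek(offset)
    return container.read(length)
```

With one open container file and the table from load_index(), a request
for tile (z, x, y) is just read_tile(container, table, (z, x, y)). The
timestamp in each record is what a later compaction job could compare
when deciding which copy of a tile is still current.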

Flo
-- 
Florian Lohoff [EMAIL PROTECTED] +49-171-2280134
Those who would give up a little freedom to get a little security
shall soon have neither - Benjamin Franklin
_______________________________________________
Tilesathome mailing list
[email protected]
http://lists.openstreetmap.org/cgi-bin/mailman/listinfo/tilesathome
