Hello!

The git pack format has two uses:

1) A space-optimized format for local repository storage.

2) A compact format for transferring repository data over the network.

However, these uses have conflicting requirements, and the current
pack format is not as well suited to network transfer as it could be.
In particular, because using pack files in a local repository requires
random access to the contained objects, every object in the pack is
compressed separately, which hurts the compression ratio.
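
To illustrate the effect, here is a minimal sketch in Python which
compresses a set of small, similar byte strings first one by one and
then as a single zlib stream.  The "objects" are made-up stand-ins, not
real pack contents, and the sketch ignores delta compression entirely,
so the numbers it prints say nothing about real repositories; it only
shows the direction of the effect.

```python
import zlib

# Hypothetical stand-ins for pack objects: many small, similar texts.
objects = [
    ("int handler_%d(struct device *dev)\n"
     "{\n\treturn dev->ops->run(dev, %d);\n}\n" % (i, i)).encode()
    for i in range(200)
]

# Pack-style: every object is its own zlib stream, so any single object
# can be inflated without reading its neighbours.
separate = sum(len(zlib.compress(obj)) for obj in objects)

# Transfer-style: one zlib stream over the concatenation, so redundancy
# shared between objects can be exploited.
whole = len(zlib.compress(b"".join(objects)))

print("compressed separately:", separate, "bytes")
print("compressed as a whole:", whole, "bytes")
```

The price of the second variant is exactly the random access mentioned
above: to read the last object, everything before it has to be
inflated first.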

I have made a patch which adds the "--compression-level=N" option to
git-pack-objects, and used it to see what kind of improvement in the
compression ratio we can get if we compress the whole pack file instead
of individual objects.  The patch is in a separate message; however,
I'm not sure it should be applied immediately: currently there is no
infrastructure for using this option, and we may choose to implement
the same idea in some different way.

Here are the results of my tests:

===========================================================================

1. Packing the whole linux-2.4 repository (13390 objects):

-rw-r--r--  1 vsu 159632105 Aug 13 17:12 pack-0-ef502d3d97088a8b1da4594fda438c268dc5c692.pack
-rw-r--r--  1 vsu  30878501 Aug 13 17:16 pack-0-ef502d3d97088a8b1da4594fda438c268dc5c692.pack.bz2
-rw-r--r--  1 vsu  38035157 Aug 13 17:14 pack-0-ef502d3d97088a8b1da4594fda438c268dc5c692.pack.gz
-rw-r--r--  1 vsu  37739041 Aug 13 17:17 pack-0-ef502d3d97088a8b1da4594fda438c268dc5c692.pack.gz-9
-rw-r--r--  1 vsu  43035924 Aug 13 17:12 pack-9-ef502d3d97088a8b1da4594fda438c268dc5c692.pack
-rw-r--r--  1 vsu  43288931 Aug 13 17:11 pack-default-ef502d3d97088a8b1da4594fda438c268dc5c692.pack

The "pack-default-*" file is made without the --compression-level
option; the "pack-0-*" and "pack-9-*" files are made with level 0 (no
compression) and 9 (max compression) respectively.  From this we can
see:

- Using maximum compression for objects in the pack provides little
  benefit - about 0.6%.
  
- Creating a pack with uncompressed objects and compressing it with
  gzip gives a 12% improvement over the pack with compressed objects.
  Using "gzip -9" at this stage gives about 0.7% more compression.

- For offline compression, bzip2 could be used instead of gzip:
  the pack compressed with bzip2 is about 29% smaller than the pack
  with zlib-compressed objects.
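
For reference, the percentages for this test can be recomputed from the
file sizes in the listing.  The small Python sketch below does that,
taking the "pack-default-*" file as the baseline; the helper name
saving() is just something made up for this illustration.

```python
def saving(size, baseline):
    """Return by how many percent `size` is smaller than `baseline`."""
    return 100.0 * (baseline - size) / baseline

# Sizes in bytes, copied from the linux-2.4 listing.
default = 43288931              # pack-default-*.pack
sizes = {
    "pack-9": 43035924,
    "pack-0 + gzip": 38035157,
    "pack-0 + gzip -9": 37739041,
    "pack-0 + bzip2": 30878501,
}

for name, size in sizes.items():
    print("%-18s %5.1f%% smaller than default" % (name, saving(size, default)))
```

The same function applies unchanged to the sizes from the other two
tests.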

2. Packing the whole linux-2.6 repository (67111 objects):

-rw-r--r--  1 vsu 232977930 Aug 13 17:47 pack-0-af8d554c2a184c1ebbaab13a8f844329bcbfe763.pack
-rw-r--r--  1 vsu  49245784 Aug 13 17:52 pack-0-af8d554c2a184c1ebbaab13a8f844329bcbfe763.pack.bz2
-rw-r--r--  1 vsu  59767656 Aug 13 17:49 pack-0-af8d554c2a184c1ebbaab13a8f844329bcbfe763.pack.gz
-rw-r--r--  1 vsu  59323808 Aug 13 17:50 pack-0-af8d554c2a184c1ebbaab13a8f844329bcbfe763.pack.gz-9
-rw-r--r--  1 vsu  70067732 Aug 13 17:45 pack-9-af8d554c2a184c1ebbaab13a8f844329bcbfe763.pack
-rw-r--r--  1 vsu  70415173 Aug 13 17:43 pack-default-af8d554c2a184c1ebbaab13a8f844329bcbfe763.pack

- Again, --compression-level=9 does not help much - only 0.5%
  reduction.

- Using gzip on the pack with uncompressed objects gives 15%
  improvement over the pack with compressed objects; "gzip -9" does
  not help much.

- The pack with uncompressed objects compressed with bzip2 is 30%
  smaller than the pack with zlib-compressed objects.

3. Creating an incremental pack for the linux-2.6 repository (743
objects):

-rw-r--r--  1 vsu 4270645 Aug 13 17:54 pack-0-8d2c7fe3c00288d4a46fe25a61db35ec965db8a1.pack
-rw-r--r--  1 vsu 1068277 Aug 13 17:56 pack-0-8d2c7fe3c00288d4a46fe25a61db35ec965db8a1.pack.bz2
-rw-r--r--  1 vsu 1221308 Aug 13 17:56 pack-0-8d2c7fe3c00288d4a46fe25a61db35ec965db8a1.pack.gz
-rw-r--r--  1 vsu 1214597 Aug 13 17:56 pack-0-8d2c7fe3c00288d4a46fe25a61db35ec965db8a1.pack.gz-9
-rw-r--r--  1 vsu 1314817 Aug 13 17:55 pack-9-8d2c7fe3c00288d4a46fe25a61db35ec965db8a1.pack
-rw-r--r--  1 vsu 1319322 Aug 13 17:54 pack-default-8d2c7fe3c00288d4a46fe25a61db35ec965db8a1.pack

- Once again, --compression-level=9 is next to useless - 0.3%
  improvement.

- The pack with uncompressed objects compressed with gzip is 7%
  smaller than the pack with zlib-compressed objects.

- The pack with uncompressed objects compressed with bzip2 is 19%
  smaller than the pack with zlib-compressed objects.

===========================================================================

As you can see, compressing the pack as a whole can give noticeable
improvements (less on smaller files, more on bigger files).  Now we
need to find a way to use this:

- For methods which use git tools on both ends (git-clone-pack,
  git-ssh-pull, git-daemon) we could just create pipes to gzip/gunzip
  in the appropriate places.

- For non-git-aware methods (rsync, http) we can still use these
  improvements, but there are additional complications because of the
  pack index file.  We could keep globally-compressed pack files in a
  separate directory together with their index files, and write a
  utility which takes such a pack file with its index, recompresses
  all objects individually, and produces a pack file with compressed
  objects and a new index.  In theory, we could reconstruct the index
  from the pack file alone, but this may be expensive (it would need
  to reconstruct all objects represented by deltas to find their hash
  values).
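
For the first case (git tools on both ends), the data flow through the
gzip/gunzip pipes could look roughly like the Python sketch below.  It
uses zlib streaming objects as a stand-in for the external gzip and
gunzip processes, and the chunks of "pack data" are invented for the
example; nothing here is taken from the actual git tools.

```python
import zlib

def compress_stream(chunks, level=6):
    """Sender side: compress pack data chunk by chunk as it is produced."""
    comp = zlib.compressobj(level)
    for chunk in chunks:
        data = comp.compress(chunk)
        if data:
            yield data
    yield comp.flush()

def decompress_stream(chunks):
    """Receiver side: turn the wire data back into the original bytes."""
    decomp = zlib.decompressobj()
    for chunk in chunks:
        data = decomp.decompress(chunk)
        if data:
            yield data
    tail = decomp.flush()
    if tail:
        yield tail

# Hypothetical uncompressed pack data arriving in pieces.
pack_chunks = [b"PACK" + bytes(1000), b"tree data " * 500, b"blob data " * 500]

wire = list(compress_stream(pack_chunks))
received = b"".join(decompress_stream(wire))

assert received == b"".join(pack_chunks)
print(sum(map(len, pack_chunks)), "bytes in,",
      sum(map(len, wire)), "bytes on the wire")
```

Neither side needs to hold the whole pack in memory, which is the same
property the gzip/gunzip pipes would have.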

BTW, it might be possible to improve the global compression even more
by optimizing the order of objects in the pack file (currently trees
and blobs seem to be intermixed).  I have not tried this yet, however.
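
One reason the order could matter is that zlib can only refer back
about 32 KB in its input, so similar data which ends up further apart
than that cannot be matched.  The toy Python sketch below shows this
with two pairs of near-identical, otherwise incompressible 20000-byte
buffers.  The buffers are random stand-ins and have nothing to do with
real trees or blobs; whether real pack contents behave this way is
untested.

```python
import random
import zlib

rng = random.Random(0)

def random_bytes(n):
    """Return n pseudo-random (practically incompressible) bytes."""
    return bytes(rng.getrandbits(8) for _ in range(n))

# Two unrelated "kinds" of data, each in two nearly identical versions.
a1 = random_bytes(20000)
a2 = a1[:10000] + b"changed" + a1[10000:]
b1 = random_bytes(20000)
b2 = b1[:5000] + b"changed" + b1[5000:]

# Intermixed: a2 starts 40000 bytes after a1, outside the zlib window.
intermixed = len(zlib.compress(a1 + b1 + a2 + b2))

# Grouped: a2 starts 20000 bytes after a1, inside the window.
grouped = len(zlib.compress(a1 + a2 + b1 + b2))

print("intermixed:", intermixed, "bytes")
print("grouped:   ", grouped, "bytes")
```

bzip2 works on much larger blocks, so it would likely be less
sensitive to this kind of reordering than gzip.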

-- 
Sergey Vlasov