Re: [PATCH] {fetch,receive}-pack: drop unpack-objects, delay loosing objects until the end

2013-09-03 Thread Jeff King
On Mon, Sep 02, 2013 at 10:05:07AM +0700, Nguyen Thai Ngoc Duy wrote:

 Current code peaks into the transfered pack's header, if the number of
 objects is under a limit, unpack-objects is called to handle the rest,
 otherwise index-pack is. This patch makes fetch-pack use index-pack
 unconditionally, then turn objects loose and remove the pack at the
 end. unpack-objects is deprecated and may be removed in future.

I do like consolidating the object-receiving code paths, but there is a
downside to this strategy: we increase the I/O in cases where we end up
unpacking, as we spool the tmpfile to disk, and then force objects loose
(whereas with the current code, unpack-objects reads straight from the
network into loose objects). I think that is what you're saying here:

  - by going through index-pack first, then unpack, we pay extra cost
for completing a thin pack into a full one. But compared to fetch's
total time, it should not be noticeable because unpack-objects is
only called when the pack contains a small number of objects.

...but the cost is paid by total pack size, not number of objects. So if
I am pushing up a commit with a large uncompressible blob, I've
effectively doubled my disk I/O. It would make more sense to me for
index-pack to learn command-line options specifying the limits, and then
to operate on the pack as it streams in. E.g., to decide after seeing
the header to unpack rather than index, or to drop large blobs from the
pack (and put them in their own pack directly) as we are streaming into
it (we do not know the blob size ahead of time, but we can make a good
guess if it has a large on-disk size in the pack).

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] {fetch,receive}-pack: drop unpack-objects, delay loosing objects until the end

2013-09-03 Thread Duy Nguyen
On Tue, Sep 3, 2013 at 1:49 PM, Jeff King <p...@peff.net> wrote:
  - by going through index-pack first, then unpack, we pay extra cost
for completing a thin pack into a full one. But compared to fetch's
total time, it should not be noticeable because unpack-objects is
only called when the pack contains a small number of objects.

 ...but the cost is paid by total pack size, not number of objects. So if
 I am pushing up a commit with a large uncompressible blob, I've
 effectively doubled my disk I/O. It would make more sense to me for
 index-pack to learn command-line options specifying the limits, and then
 to operate on the pack as it streams in. E.g., to decide after seeing
 the header to unpack rather than index, or to drop large blobs from the
 pack (and put them in their own pack directly) as we are streaming into
 it (we do not know the blob size ahead of time, but we can make a good
 guess if it has a large on-disk size in the pack).

Yeah letting index-pack do the work was my backup plan :) I think if
there is a big blob in the pack, then the pack should not be unpacked
at all. If you store big blobs in a separate pack, you already pay
the lookup cost of one more pack in find_pack_entry(), so why go through
the process of unpacking? index-pack still has the advantage of
streaming though. Will rework.
-- 
Duy


Re: [PATCH] {fetch,receive}-pack: drop unpack-objects, delay loosing objects until the end

2013-09-03 Thread Jeff King
On Tue, Sep 03, 2013 at 06:56:23PM +0700, Nguyen Thai Ngoc Duy wrote:

  ...but the cost is paid by total pack size, not number of objects. So if
  I am pushing up a commit with a large uncompressible blob, I've
  effectively doubled my disk I/O. It would make more sense to me for
  index-pack to learn command-line options specifying the limits, and then
  to operate on the pack as it streams in. E.g., to decide after seeing
  the header to unpack rather than index, or to drop large blobs from the
  pack (and put them in their own pack directly) as we are streaming into
  it (we do not know the blob size ahead of time, but we can make a good
  guess if it has a large on-disk size in the pack).
 
 Yeah letting index-pack do the work was my backup plan :) I think if
 there is a big blob in the pack, then the pack should not be unpacked
 at all. If you store big blobs in a separate pack, you already pay
 the lookup cost of one more pack in find_pack_entry(), so why go through
 the process of unpacking? index-pack still has the advantage of
 streaming though. Will rework.

In general, our large-blob strategy is to push them out to their own
pack so that we do not incur the I/O overhead of rewriting them whenever
we repack. But the flipside is that we have to pay the cost of an extra
.idx open and lookup for each such object. In the longer term, I think
it might make sense to be able to generate a multi-pack .idx for such a
case (or even to simply store the large blobs in a special area indexed
by the object sha1, as we do for loose objects).

But that is all orthogonal to your patch. I think as long as we are
moving towards "index-pack makes the decisions while it processes the
pack", we are going in a good direction. Even if we do not implement all
of the decisions immediately, it leaves room for doing so later without
loss of efficiency.

-Peff


[PATCH] {fetch,receive}-pack: drop unpack-objects, delay loosing objects until the end

2013-09-01 Thread Nguyễn Thái Ngọc Duy
Current code peaks into the transfered pack's header, if the number of
objects is under a limit, unpack-objects is called to handle the rest,
otherwise index-pack is. This patch makes fetch-pack use index-pack
unconditionally, then turn objects loose and remove the pack at the
end. unpack-objects is deprecated and may be removed in future.

There are a few reasons for this:

 - down to two code paths to maintain regarding pack reading
   (sha1_file.c and index-pack.c). When .pack v4 comes, we don't need
   to duplicate work in index-pack and unpack-objects.

 - the number of objects should not be the only indicator for
   unpacking. If there are a few large objects in the pack, the pack
   should be kept anyway. Keeping the pack lets us examine the whole
   pack later and make a better decision.

 - by going through index-pack first, then unpack, we pay extra cost
   for completing a thin pack into a full one. But compared to fetch's
   total time, it should not be noticeable because unpack-objects is
   only called when the pack contains a small number of objects.

 - unpack-objects does not support streaming large blobs. Granted
   force_object_loose(), which is used by this patch, does not either.
   But it'll be easier to do so because sha1_file.c interface supports
   streaming.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclo...@gmail.com>
---
 builtin/receive-pack.c | 120 +
 cache.h|   1 +
 fetch-pack.c   |  77 +++
 sha1_file.c|  70 -
 4 files changed, 128 insertions(+), 140 deletions(-)

diff --git a/builtin/receive-pack.c b/builtin/receive-pack.c
index e3eb5fc..6eb4ffb 100644
--- a/builtin/receive-pack.c
+++ b/builtin/receive-pack.c
@@ -792,105 +792,49 @@ static struct command *read_head_info(void)
return commands;
 }
 
-static const char *parse_pack_header(struct pack_header *hdr)
-{
-	switch (read_pack_header(0, hdr)) {
-	case PH_ERROR_EOF:
-		return "eof before pack header was fully read";
-
-	case PH_ERROR_PACK_SIGNATURE:
-		return "protocol error (pack signature mismatch detected)";
-
-	case PH_ERROR_PROTOCOL:
-		return "protocol error (pack version unsupported)";
-
-	default:
-		return "unknown error in parse_pack_header";
-
-	case 0:
-		return NULL;
-	}
-}
-
 static const char *pack_lockfile;
 
 static const char *unpack(int err_fd)
 {
-	struct pack_header hdr;
-	const char *hdr_err;
-	char hdr_arg[38];
 	int fsck_objects = (receive_fsck_objects >= 0
 			    ? receive_fsck_objects
 			    : transfer_fsck_objects >= 0
 			    ? transfer_fsck_objects
 			    : 0);
 
-	hdr_err = parse_pack_header(&hdr);
-	if (hdr_err) {
-		if (err_fd > 0)
-			close(err_fd);
-		return hdr_err;
-	}
-	snprintf(hdr_arg, sizeof(hdr_arg),
-			"--pack_header=%"PRIu32",%"PRIu32,
-			ntohl(hdr.hdr_version), ntohl(hdr.hdr_entries));
-
-	if (ntohl(hdr.hdr_entries) < unpack_limit) {
-		int code, i = 0;
-		struct child_process child;
-		const char *unpacker[5];
-		unpacker[i++] = "unpack-objects";
-		if (quiet)
-			unpacker[i++] = "-q";
-		if (fsck_objects)
-			unpacker[i++] = "--strict";
-		unpacker[i++] = hdr_arg;
-		unpacker[i++] = NULL;
-		memset(&child, 0, sizeof(child));
-		child.argv = unpacker;
-		child.no_stdout = 1;
-		child.err = err_fd;
-		child.git_cmd = 1;
-		code = run_command(&child);
-		if (!code)
-			return NULL;
-		return "unpack-objects abnormal exit";
-	} else {
-		const char *keeper[7];
-		int s, status, i = 0;
-		char keep_arg[256];
-		struct child_process ip;
-
-		s = sprintf(keep_arg, "--keep=receive-pack %"PRIuMAX" on ",
-			    (uintmax_t) getpid());
-		if (gethostname(keep_arg + s, sizeof(keep_arg) - s))
-			strcpy(keep_arg + s, "localhost");
-
-		keeper[i++] = "index-pack";
-		keeper[i++] = "--stdin";
-		if (fsck_objects)
-			keeper[i++] = "--strict";
-		keeper[i++] = "--fix-thin";
-		keeper[i++] = hdr_arg;
-		keeper[i++] = keep_arg;
-		keeper[i++] = NULL;
-		memset(&ip, 0, sizeof(ip));
-		ip.argv = keeper;
-		ip.out = -1;
-		ip.err = err_fd;
-		ip.git_cmd = 1;
-		status = start_command(&ip);
-		if (status) {
-

Re: [PATCH] {fetch,receive}-pack: drop unpack-objects, delay loosing objects until the end

2013-09-01 Thread Eric Sunshine
On Sun, Sep 1, 2013 at 11:05 PM, Nguyễn Thái Ngọc Duy <pclo...@gmail.com> wrote:
 Current code peaks into the transfered pack's header, if the number of

s/peaks/peeks/

 objects is under a limit, unpack-objects is called to handle the rest,
 otherwise index-pack is. This patch makes fetch-pack use index-pack
 unconditionally, then turn objects loose and remove the pack at the
 end. unpack-objects is deprecated and may be removed in future.

 There are a few reasons for this:

  - down to two code paths to maintain regarding pack reading
(sha1_file.c and index-pack.c). When .pack v4 comes, we don't need
to duplicate work in index-pack and unpack-objects.

  - the number of objects should not be the only indicator for
unpacking. If there are a few large objects in the pack, the pack
should be kept anyway. Keeping the pack lets us examine the whole
pack later and make a better decision.

  - by going through index-pack first, then unpack, we pay extra cost
for completing a thin pack into a full one. But compared to fetch's
total time, it should not be noticeable because unpack-objects is
only called when the pack contains a small number of objects.

  - unpack-objects does not support streaming large blobs. Granted
force_object_loose(), which is used by this patch, does not either.
But it'll be easier to do so because sha1_file.c interface supports
streaming.

 Signed-off-by: Nguyễn Thái Ngọc Duy <pclo...@gmail.com>