Re: [PATCH] {fetch,receive}-pack: drop unpack-objects, delay loosing objects until the end
On Mon, Sep 02, 2013 at 10:05:07AM +0700, Nguyen Thai Ngoc Duy wrote:

> Current code peaks into the transfered pack's header, if the number of
> objects is under a limit, unpack-objects is called to handle the rest,
> otherwise index-pack is. This patch makes fetch-pack use index-pack
> unconditionally, then turn objects loose and remove the pack at the
> end. unpack-objects is deprecated and may be removed in future.

I do like consolidating the object-receiving code paths, but there is a
downside to this strategy: we increase the I/O in cases where we end up
unpacking, as we spool the tmpfile to disk, and then force the objects
loose (whereas with the current code, unpack-objects reads straight from
the network into loose objects). I think that is what you're saying
here:

> - by going through index-pack first, then unpack, we pay extra cost
>   for completing a thin pack into a full one. But compared to fetch's
>   total time, it should not be noticeable because unpack-objects is
>   only called when the pack contains a small number of objects.

...but the cost is paid by total pack size, not number of objects. So
if I am pushing up a commit with a large uncompressible blob, I've
effectively doubled my disk I/O.

It would make more sense to me for index-pack to learn command-line
options specifying the limits, and then to operate on the pack as it
streams in. E.g., to decide after seeing the header to unpack rather
than index, or to drop large blobs from the pack (and put them in their
own pack directly) as we are streaming into it (we do not know the blob
size ahead of time, but we can make a good guess if it has a large
on-disk size in the pack).

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] {fetch,receive}-pack: drop unpack-objects, delay loosing objects until the end
On Tue, Sep 3, 2013 at 1:49 PM, Jeff King <p...@peff.net> wrote:
>> - by going through index-pack first, then unpack, we pay extra cost
>>   for completing a thin pack into a full one. But compared to fetch's
>>   total time, it should not be noticeable because unpack-objects is
>>   only called when the pack contains a small number of objects.
>
> ...but the cost is paid by total pack size, not number of objects. So
> if I am pushing up a commit with a large uncompressible blob, I've
> effectively doubled my disk I/O.
>
> It would make more sense to me for index-pack to learn command-line
> options specifying the limits, and then to operate on the pack as it
> streams in. E.g., to decide after seeing the header to unpack rather
> than index, or to drop large blobs from the pack (and put them in
> their own pack directly) as we are streaming into it (we do not know
> the blob size ahead of time, but we can make a good guess if it has a
> large on-disk size in the pack).

Yeah, letting index-pack do the work was my backup plan :)

I think if there is a big blob in the pack, then the pack should not be
unpacked at all. If you store big blobs in a separate pack, you already
pay the lookup cost of one more pack in find_pack_entry(), so why go
through the process of unpacking? index-pack still has the advantage of
streaming, though. Will rework.
--
Duy
Re: [PATCH] {fetch,receive}-pack: drop unpack-objects, delay loosing objects until the end
On Tue, Sep 03, 2013 at 06:56:23PM +0700, Nguyen Thai Ngoc Duy wrote:

>> ...but the cost is paid by total pack size, not number of objects. So
>> if I am pushing up a commit with a large uncompressible blob, I've
>> effectively doubled my disk I/O.
>>
>> It would make more sense to me for index-pack to learn command-line
>> options specifying the limits, and then to operate on the pack as it
>> streams in. E.g., to decide after seeing the header to unpack rather
>> than index, or to drop large blobs from the pack (and put them in
>> their own pack directly) as we are streaming into it (we do not know
>> the blob size ahead of time, but we can make a good guess if it has a
>> large on-disk size in the pack).
>
> Yeah, letting index-pack do the work was my backup plan :)
>
> I think if there is a big blob in the pack, then the pack should not
> be unpacked at all. If you store big blobs in a separate pack, you
> already pay the lookup cost of one more pack in find_pack_entry(), so
> why go through the process of unpacking? index-pack still has the
> advantage of streaming, though. Will rework.

In general, our large-blob strategy is to push them out to their own
pack so that we do not incur the I/O overhead of rewriting them whenever
we repack. But the flip side is that we have to pay the cost of an extra
.idx open and lookup for each such object.

In the longer term, I think it might make sense to be able to generate a
multi-pack .idx for such a case (or even to simply store the large blobs
in a special area indexed by the object sha1, as we do for loose
objects). But that is all orthogonal to your patch.

I think as long as we are moving towards "index-pack makes the decisions
while it processes the pack", we are going in a good direction. Even if
we do not implement all of the decisions immediately, it leaves room for
doing so later without loss of efficiency.
-Peff
[PATCH] {fetch,receive}-pack: drop unpack-objects, delay loosing objects until the end
Current code peaks into the transfered pack's header, if the number of
objects is under a limit, unpack-objects is called to handle the rest,
otherwise index-pack is. This patch makes fetch-pack use index-pack
unconditionally, then turn objects loose and remove the pack at the end.
unpack-objects is deprecated and may be removed in future.

There are a few reasons for this:

 - down to two code paths to maintain regarding pack reading
   (sha1_file.c and index-pack.c). When .pack v4 comes, we don't need
   to duplicate work in index-pack and unpack-objects.

 - the number of objects should not be the only indicator for
   unpacking. If there are a few large objects in the pack, the pack
   should be kept anyway. Keeping the pack lets us examine the whole
   pack later and make a better decision.

 - by going through index-pack first, then unpack, we pay extra cost
   for completing a thin pack into a full one. But compared to fetch's
   total time, it should not be noticeable because unpack-objects is
   only called when the pack contains a small number of objects.

 - unpack-objects does not support streaming large blobs. Granted,
   force_object_loose(), which is used by this patch, does not either.
   But it'll be easier to do so because the sha1_file.c interface
   supports streaming.
Signed-off-by: Nguyễn Thái Ngọc Duy <pclo...@gmail.com>
---
 builtin/receive-pack.c | 120 +
 cache.h                |   1 +
 fetch-pack.c           |  77 +++
 sha1_file.c            |  70 -
 4 files changed, 128 insertions(+), 140 deletions(-)

diff --git a/builtin/receive-pack.c b/builtin/receive-pack.c
index e3eb5fc..6eb4ffb 100644
--- a/builtin/receive-pack.c
+++ b/builtin/receive-pack.c
@@ -792,105 +792,49 @@ static struct command *read_head_info(void)
 	return commands;
 }
 
-static const char *parse_pack_header(struct pack_header *hdr)
-{
-	switch (read_pack_header(0, hdr)) {
-	case PH_ERROR_EOF:
-		return "eof before pack header was fully read";
-
-	case PH_ERROR_PACK_SIGNATURE:
-		return "protocol error (pack signature mismatch detected)";
-
-	case PH_ERROR_PROTOCOL:
-		return "protocol error (pack version unsupported)";
-
-	default:
-		return "unknown error in parse_pack_header";
-
-	case 0:
-		return NULL;
-	}
-}
-
 static const char *pack_lockfile;
 
 static const char *unpack(int err_fd)
 {
-	struct pack_header hdr;
-	const char *hdr_err;
-	char hdr_arg[38];
 	int fsck_objects = (receive_fsck_objects >= 0
 			    ? receive_fsck_objects
 			    : transfer_fsck_objects >= 0
 			    ? transfer_fsck_objects : 0);
 
-	hdr_err = parse_pack_header(&hdr);
-	if (hdr_err) {
-		if (err_fd > 0)
-			close(err_fd);
-		return hdr_err;
-	}
-	snprintf(hdr_arg, sizeof(hdr_arg),
-		 "--pack_header=%"PRIu32",%"PRIu32,
-		 ntohl(hdr.hdr_version), ntohl(hdr.hdr_entries));
-
-	if (ntohl(hdr.hdr_entries) < unpack_limit) {
-		int code, i = 0;
-		struct child_process child;
-		const char *unpacker[5];
-		unpacker[i++] = "unpack-objects";
-		if (quiet)
-			unpacker[i++] = "-q";
-		if (fsck_objects)
-			unpacker[i++] = "--strict";
-		unpacker[i++] = hdr_arg;
-		unpacker[i++] = NULL;
-		memset(&child, 0, sizeof(child));
-		child.argv = unpacker;
-		child.no_stdout = 1;
-		child.err = err_fd;
-		child.git_cmd = 1;
-		code = run_command(&child);
-		if (!code)
-			return NULL;
-		return "unpack-objects abnormal exit";
-	} else {
-		const char *keeper[7];
-		int s, status, i = 0;
-		char keep_arg[256];
-		struct child_process ip;
-
-		s = sprintf(keep_arg, "--keep=receive-pack %"PRIuMAX" on ", (uintmax_t) getpid());
-		if (gethostname(keep_arg + s, sizeof(keep_arg) - s))
-			strcpy(keep_arg + s, "localhost");
-
-		keeper[i++] = "index-pack";
-		keeper[i++] = "--stdin";
-		if (fsck_objects)
-			keeper[i++] = "--strict";
-		keeper[i++] = "--fix-thin";
-		keeper[i++] = hdr_arg;
-		keeper[i++] = keep_arg;
-		keeper[i++] = NULL;
-		memset(&ip, 0, sizeof(ip));
-		ip.argv = keeper;
-		ip.out = -1;
-		ip.err = err_fd;
-		ip.git_cmd = 1;
-		status = start_command(&ip);
-		if (status) {
Re: [PATCH] {fetch,receive}-pack: drop unpack-objects, delay loosing objects until the end
On Sun, Sep 1, 2013 at 11:05 PM, Nguyễn Thái Ngọc Duy <pclo...@gmail.com> wrote:
> Current code peaks into the transfered pack's header, if the number of

s/peaks/peeks/

> objects is under a limit, unpack-objects is called to handle the rest,
> otherwise index-pack is. This patch makes fetch-pack use index-pack
> unconditionally, then turn objects loose and remove the pack at the
> end. unpack-objects is deprecated and may be removed in future.
>
> There are a few reasons for this:
>
>  - down to two code paths to maintain regarding pack reading
>    (sha1_file.c and index-pack.c). When .pack v4 comes, we don't need
>    to duplicate work in index-pack and unpack-objects.
>
>  - the number of objects should not be the only indicator for
>    unpacking. If there are a few large objects in the pack, the pack
>    should be kept anyway. Keeping the pack lets us examine the whole
>    pack later and make a better decision.
>
>  - by going through index-pack first, then unpack, we pay extra cost
>    for completing a thin pack into a full one. But compared to fetch's
>    total time, it should not be noticeable because unpack-objects is
>    only called when the pack contains a small number of objects.
>
>  - unpack-objects does not support streaming large blobs. Granted,
>    force_object_loose(), which is used by this patch, does not either.
>    But it'll be easier to do so because the sha1_file.c interface
>    supports streaming.
>
> Signed-off-by: Nguyễn Thái Ngọc Duy <pclo...@gmail.com>