Re: [PATCH v2 4/4] diff: mark any file larger than core.bigfilethreshold binary
On Sat, Jun 28, 2014 at 1:56 AM, Thomas Braun thomas.br...@virtuell-zuhause.de wrote: The name is misleading and forced me to read it twice before I realized that this is populating the is-binary bit. It might make it a bit better if you renamed it to DIFF_POPULATE_IS_BINARY_BIT or perhaps DIFF_POPULATE_CHECK_BINARY or something. For consistency, the other bit may want to be also renamed from SIZE_ONLY to either (1) CHECK_SIZE_ONLY (2) One bit for CHECK_SIZE, another for NO_CONTENTS, and optionally make SIZE_ONLY the union of two I do not have strong preference either way; the latter may be more logical in that not loading contents and check size are sort of orthogonal in that you can later choose to check, without loading contents, only the binary-ness without checking size, but no calles that passes a non-zero flag to the populate-filespec function will want to slurp in the contents in practice, so in that sense we could declare that the NO_CONENTS bit is implied. Will do (and probably go with (1) as I still prefer zero as good defaults) But more importantly, would this patch actually help? Well yes as demonstrated by the new test ;-) Unfortunately the scope of help is limited to --stat.. I should have done more thorough testing. For one thing, this wouldn't (and shouldn't) help if the user wants --binary diff out of us anyway, I suspect, but I wonder what the following codepath in the builtin_diff() function would do: ... } else if (!DIFF_OPT_TST(o, TEXT) ( (!textconv_one diff_filespec_is_binary(one)) || (!textconv_two diff_filespec_is_binary(two)) )) { if (fill_mmfile(mf1, one) 0 || fill_mmfile(mf2, two) 0) die(unable to read files to diff); /* Quite common confusing case */ if (mf1.size == mf2.size !memcmp(mf1.ptr, mf2.ptr, mf1.size)) { if (must_show_header) fprintf(o-file, %s, header.buf); goto free_ab_and_return; } fprintf(o-file, %s, header.buf); strbuf_reset(header); if (DIFF_OPT_TST(o, BINARY)) emit_binary_diff(o-file, mf1, mf2, line_prefix); else fprintf(o-file, %sBinary files %s and %s differ\n, line_prefix, lbl[0], lbl[1]); o-found_changes = 1; } else { ... If we weren't told with --text/-a to force textual output, and at least one of the sides is marked as binary (and this patch marks a large blob as binary unless attributes says otherwise), we still call fill_mmfile() on them to slurp the contents to be compared, no? And before you get to the DIFF_OPT_TST(o, BINARY), you memcmp(3) to check if the sides are identical, so... Good point. So how about an additional change roughly sketched as @@ -2223,6 +2223,14 @@ struct userdiff_driver *get_textconv(struct diff_filespec *one) return userdiff_get_textconv(one-driver); } +/* read the files in small chunks into memory and compare them */ +static int filecmp_chunked(struct diff_filespec *one, + struct diff_filespec *two) +{ + // TODO add implementation + return 0; +} + We have object streaming interface to do similar like this. In fact index-pack already does large file memcmp() for hash collision test. I think I can move some code around and support large file in this code path.. -- Duy -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 4/4] diff: mark any file larger than core.bigfilethreshold binary
Am 26.06.2014 19:55, schrieb Junio C Hamano: Nguyễn Thái Ngọc Duy pclo...@gmail.com writes: Too large files may lead to failure to allocate memory. If it happens here, it could impact quite a few commands that involve diff. Moreover, too large files are inefficient to compare anyway (and most likely non-text), so mark them binary and skip looking at their content. ... diff --git a/diff.c b/diff.c index a489540..7a977aa 100644 --- a/diff.c +++ b/diff.c @@ -2185,8 +2185,8 @@ int diff_filespec_is_binary(struct diff_filespec *one) one-is_binary = one-driver-binary; else { if (!one-data DIFF_FILE_VALID(one)) -diff_populate_filespec(one, 0); -if (one-data) +diff_populate_filespec(one, DIFF_POPULATE_IS_BINARY); +if (one-is_binary == -1 one-data) one-is_binary = buffer_is_binary(one-data, one-size); if (one-is_binary == -1) The name is misleading and forced me to read it twice before I realized that this is populating the is-binary bit. It might make it a bit better if you renamed it to DIFF_POPULATE_IS_BINARY_BIT or perhaps DIFF_POPULATE_CHECK_BINARY or something. For consistency, the other bit may want to be also renamed from SIZE_ONLY to either (1) CHECK_SIZE_ONLY (2) One bit for CHECK_SIZE, another for NO_CONTENTS, and optionally make SIZE_ONLY the union of two I do not have strong preference either way; the latter may be more logical in that not loading contents and check size are sort of orthogonal in that you can later choose to check, without loading contents, only the binary-ness without checking size, but no calles that passes a non-zero flag to the populate-filespec function will want to slurp in the contents in practice, so in that sense we could declare that the NO_CONENTS bit is implied. But more importantly, would this patch actually help? For one thing, this wouldn't (and shouldn't) help if the user wants --binary diff out of us anyway, I suspect, but I wonder what the following codepath in the builtin_diff() function would do: ... } else if (!DIFF_OPT_TST(o, TEXT) ( (!textconv_one diff_filespec_is_binary(one)) || (!textconv_two diff_filespec_is_binary(two)) )) { if (fill_mmfile(mf1, one) 0 || fill_mmfile(mf2, two) 0) die(unable to read files to diff); /* Quite common confusing case */ if (mf1.size == mf2.size !memcmp(mf1.ptr, mf2.ptr, mf1.size)) { if (must_show_header) fprintf(o-file, %s, header.buf); goto free_ab_and_return; } fprintf(o-file, %s, header.buf); strbuf_reset(header); if (DIFF_OPT_TST(o, BINARY)) emit_binary_diff(o-file, mf1, mf2, line_prefix); else fprintf(o-file, %sBinary files %s and %s differ\n, line_prefix, lbl[0], lbl[1]); o-found_changes = 1; } else { ... If we weren't told with --text/-a to force textual output, and at least one of the sides is marked as binary (and this patch marks a large blob as binary unless attributes says otherwise), we still call fill_mmfile() on them to slurp the contents to be compared, no? And before you get to the DIFF_OPT_TST(o, BINARY), you memcmp(3) to check if the sides are identical, so... Good point. So how about an additional change roughly sketched as @@ -2223,6 +2223,14 @@ struct userdiff_driver *get_textconv(struct diff_filespec *one) return userdiff_get_textconv(one-driver); } +/* read the files in small chunks into memory and compare them */ +static int filecmp_chunked(struct diff_filespec *one, + struct diff_filespec *two) +{ + // TODO add implementation + return 0; +} + static void builtin_diff(const char *name_a, const char *name_b, struct diff_filespec *one, @@ -2325,19 +2333,26 @@ static void builtin_diff(const char *name_a, } else if (!DIFF_OPT_TST(o, TEXT) ( (!textconv_one diff_filespec_is_binary(one)) || (!textconv_two diff_filespec_is_binary(two)) )) { - if (fill_mmfile(mf1, one) 0 || fill_mmfile(mf2,two) 0) - die(unable to read files to diff); + + unsigned long size1 = diff_filespec_size(one); + unsigned long size2 = diff_filespec_size(two); + + if (size1 == 0 || size2 == 0) + die(unable to retrieve file sizes for diff); /* Quite common confusing case */ - if
Re: [PATCH v2 4/4] diff: mark any file larger than core.bigfilethreshold binary
Nguyễn Thái Ngọc Duy pclo...@gmail.com writes: Too large files may lead to failure to allocate memory. If it happens here, it could impact quite a few commands that involve diff. Moreover, too large files are inefficient to compare anyway (and most likely non-text), so mark them binary and skip looking at their content. ... diff --git a/diff.c b/diff.c index a489540..7a977aa 100644 --- a/diff.c +++ b/diff.c @@ -2185,8 +2185,8 @@ int diff_filespec_is_binary(struct diff_filespec *one) one-is_binary = one-driver-binary; else { if (!one-data DIFF_FILE_VALID(one)) - diff_populate_filespec(one, 0); - if (one-data) + diff_populate_filespec(one, DIFF_POPULATE_IS_BINARY); + if (one-is_binary == -1 one-data) one-is_binary = buffer_is_binary(one-data, one-size); if (one-is_binary == -1) The name is misleading and forced me to read it twice before I realized that this is populating the is-binary bit. It might make it a bit better if you renamed it to DIFF_POPULATE_IS_BINARY_BIT or perhaps DIFF_POPULATE_CHECK_BINARY or something. For consistency, the other bit may want to be also renamed from SIZE_ONLY to either (1) CHECK_SIZE_ONLY (2) One bit for CHECK_SIZE, another for NO_CONTENTS, and optionally make SIZE_ONLY the union of two I do not have strong preference either way; the latter may be more logical in that not loading contents and check size are sort of orthogonal in that you can later choose to check, without loading contents, only the binary-ness without checking size, but no calles that passes a non-zero flag to the populate-filespec function will want to slurp in the contents in practice, so in that sense we could declare that the NO_CONENTS bit is implied. But more importantly, would this patch actually help? For one thing, this wouldn't (and shouldn't) help if the user wants --binary diff out of us anyway, I suspect, but I wonder what the following codepath in the builtin_diff() function would do: ... } else if (!DIFF_OPT_TST(o, TEXT) ( (!textconv_one diff_filespec_is_binary(one)) || (!textconv_two diff_filespec_is_binary(two)) )) { if (fill_mmfile(mf1, one) 0 || fill_mmfile(mf2, two) 0) die(unable to read files to diff); /* Quite common confusing case */ if (mf1.size == mf2.size !memcmp(mf1.ptr, mf2.ptr, mf1.size)) { if (must_show_header) fprintf(o-file, %s, header.buf); goto free_ab_and_return; } fprintf(o-file, %s, header.buf); strbuf_reset(header); if (DIFF_OPT_TST(o, BINARY)) emit_binary_diff(o-file, mf1, mf2, line_prefix); else fprintf(o-file, %sBinary files %s and %s differ\n, line_prefix, lbl[0], lbl[1]); o-found_changes = 1; } else { ... If we weren't told with --text/-a to force textual output, and at least one of the sides is marked as binary (and this patch marks a large blob as binary unless attributes says otherwise), we still call fill_mmfile() on them to slurp the contents to be compared, no? And before you get to the DIFF_OPT_TST(o, BINARY), you memcmp(3) to check if the sides are identical, so... -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html