Re: [PATCH v2 4/4] diff: mark any file larger than core.bigfilethreshold binary

2014-06-28 Thread Duy Nguyen
On Sat, Jun 28, 2014 at 1:56 AM, Thomas Braun
thomas.br...@virtuell-zuhause.de wrote:
 The name is misleading and forced me to read it twice before I
 realized that this is populating the is-binary bit.  It might make
 it a bit better if you renamed it to DIFF_POPULATE_IS_BINARY_BIT or
 perhaps DIFF_POPULATE_CHECK_BINARY or something.  For consistency,
 the other bit may want to be also renamed from SIZE_ONLY to either

  (1) CHECK_SIZE_ONLY

  (2) One bit for CHECK_SIZE, another for NO_CONTENTS, and optionally
  make SIZE_ONLY the union of two

 I do not have strong preference either way; the latter may be more
 logical in that not loading contents and check size are sort of
 orthogonal in that you can later choose to check, without loading
 contents, only the binary-ness without checking size, but no calles
 that passes a non-zero flag to the populate-filespec function will
 want to slurp in the contents in practice, so in that sense we could
 declare that the NO_CONENTS bit is implied.

Will do (and probably go with (1) as I still prefer zero as good defaults)

 But more importantly, would this patch actually help?

Well yes as demonstrated by the new test ;-) Unfortunately the scope
of help is limited to --stat.. I should have done more thorough
testing.

 For one
 thing, this wouldn't (and shouldn't) help if the user wants --binary
 diff out of us anyway, I suspect, but I wonder what the following
 codepath in the builtin_diff() function would do:

   ...
   } else if (!DIFF_OPT_TST(o, TEXT) 
   ( (!textconv_one  diff_filespec_is_binary(one)) ||
 (!textconv_two  diff_filespec_is_binary(two)) )) {
   if (fill_mmfile(mf1, one)  0 || fill_mmfile(mf2, two)  0)
   die(unable to read files to diff);
   /* Quite common confusing case */
   if (mf1.size == mf2.size 
   !memcmp(mf1.ptr, mf2.ptr, mf1.size)) {
   if (must_show_header)
   fprintf(o-file, %s, header.buf);
   goto free_ab_and_return;
   }
   fprintf(o-file, %s, header.buf);
   strbuf_reset(header);
   if (DIFF_OPT_TST(o, BINARY))
   emit_binary_diff(o-file, mf1, mf2, line_prefix);
   else
   fprintf(o-file, %sBinary files %s and %s differ\n,
   line_prefix, lbl[0], lbl[1]);
   o-found_changes = 1;
   } else {
   ...

 If we weren't told with --text/-a to force textual output, and
 at least one of the sides is marked as binary (and this patch marks
 a large blob as binary unless attributes says otherwise), we still
 call fill_mmfile() on them to slurp the contents to be compared, no?

 And before you get to the DIFF_OPT_TST(o, BINARY), you memcmp(3) to
 check if the sides are identical, so...

 Good point. So how about an additional change roughly sketched as

 @@ -2223,6 +2223,14 @@ struct userdiff_driver *get_textconv(struct
 diff_filespec *one)
 return userdiff_get_textconv(one-driver);
  }

 +/* read the files in small chunks into memory and compare them */
 +static int filecmp_chunked(struct diff_filespec *one,
 +   struct diff_filespec *two)
 +{
 +   // TODO add implementation
 +   return 0;
 +}
 +


We have object streaming interface to do similar like this. In fact
index-pack already does large file memcmp() for hash collision test. I
think I can move some code around and support large file in this code
path..
-- 
Duy
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 4/4] diff: mark any file larger than core.bigfilethreshold binary

2014-06-27 Thread Thomas Braun
Am 26.06.2014 19:55, schrieb Junio C Hamano:
 Nguyễn Thái Ngọc Duy  pclo...@gmail.com writes:
 
 Too large files may lead to failure to allocate memory. If it happens
 here, it could impact quite a few commands that involve
 diff. Moreover, too large files are inefficient to compare anyway (and
 most likely non-text), so mark them binary and skip looking at their
 content.
 ...
 diff --git a/diff.c b/diff.c
 index a489540..7a977aa 100644
 --- a/diff.c
 +++ b/diff.c
 @@ -2185,8 +2185,8 @@ int diff_filespec_is_binary(struct diff_filespec *one)
  one-is_binary = one-driver-binary;
  else {
  if (!one-data  DIFF_FILE_VALID(one))
 -diff_populate_filespec(one, 0);
 -if (one-data)
 +diff_populate_filespec(one, 
 DIFF_POPULATE_IS_BINARY);
 +if (one-is_binary == -1  one-data)
  one-is_binary = buffer_is_binary(one-data,
  one-size);
  if (one-is_binary == -1)
 
 The name is misleading and forced me to read it twice before I
 realized that this is populating the is-binary bit.  It might make
 it a bit better if you renamed it to DIFF_POPULATE_IS_BINARY_BIT or
 perhaps DIFF_POPULATE_CHECK_BINARY or something.  For consistency,
 the other bit may want to be also renamed from SIZE_ONLY to either
 
  (1) CHECK_SIZE_ONLY
 
  (2) One bit for CHECK_SIZE, another for NO_CONTENTS, and optionally
  make SIZE_ONLY the union of two
 
 I do not have strong preference either way; the latter may be more
 logical in that not loading contents and check size are sort of
 orthogonal in that you can later choose to check, without loading
 contents, only the binary-ness without checking size, but no calles
 that passes a non-zero flag to the populate-filespec function will
 want to slurp in the contents in practice, so in that sense we could
 declare that the NO_CONENTS bit is implied.
 
 But more importantly, would this patch actually help?  For one
 thing, this wouldn't (and shouldn't) help if the user wants --binary
 diff out of us anyway, I suspect, but I wonder what the following
 codepath in the builtin_diff() function would do:
 
   ...
   } else if (!DIFF_OPT_TST(o, TEXT) 
   ( (!textconv_one  diff_filespec_is_binary(one)) ||
 (!textconv_two  diff_filespec_is_binary(two)) )) {
   if (fill_mmfile(mf1, one)  0 || fill_mmfile(mf2, two)  0)
   die(unable to read files to diff);
   /* Quite common confusing case */
   if (mf1.size == mf2.size 
   !memcmp(mf1.ptr, mf2.ptr, mf1.size)) {
   if (must_show_header)
   fprintf(o-file, %s, header.buf);
   goto free_ab_and_return;
   }
   fprintf(o-file, %s, header.buf);
   strbuf_reset(header);
   if (DIFF_OPT_TST(o, BINARY))
   emit_binary_diff(o-file, mf1, mf2, line_prefix);
   else
   fprintf(o-file, %sBinary files %s and %s differ\n,
   line_prefix, lbl[0], lbl[1]);
   o-found_changes = 1;
   } else {
   ...
 
 If we weren't told with --text/-a to force textual output, and
 at least one of the sides is marked as binary (and this patch marks
 a large blob as binary unless attributes says otherwise), we still
 call fill_mmfile() on them to slurp the contents to be compared, no?
 
 And before you get to the DIFF_OPT_TST(o, BINARY), you memcmp(3) to
 check if the sides are identical, so...

Good point. So how about an additional change roughly sketched as

@@ -2223,6 +2223,14 @@ struct userdiff_driver *get_textconv(struct
diff_filespec *one)
return userdiff_get_textconv(one-driver);
 }

+/* read the files in small chunks into memory and compare them */
+static int filecmp_chunked(struct diff_filespec *one,
+   struct diff_filespec *two)
+{
+   // TODO add implementation
+   return 0;
+}
+
 static void builtin_diff(const char *name_a,
 const char *name_b,
 struct diff_filespec *one,
@@ -2325,19 +2333,26 @@ static void builtin_diff(const char *name_a,
} else if (!DIFF_OPT_TST(o, TEXT) 
( (!textconv_one  diff_filespec_is_binary(one)) ||
  (!textconv_two  diff_filespec_is_binary(two)) )) {
-   if (fill_mmfile(mf1, one)  0 || fill_mmfile(mf2,two) 0)
-   die(unable to read files to diff);
+
+   unsigned long size1 = diff_filespec_size(one);
+   unsigned long size2 = diff_filespec_size(two);
+
+   if (size1 == 0 || size2 == 0)
+   die(unable to retrieve file sizes for diff);
/* Quite common confusing case */
-   if 

Re: [PATCH v2 4/4] diff: mark any file larger than core.bigfilethreshold binary

2014-06-26 Thread Junio C Hamano
Nguyễn Thái Ngọc Duy  pclo...@gmail.com writes:

 Too large files may lead to failure to allocate memory. If it happens
 here, it could impact quite a few commands that involve
 diff. Moreover, too large files are inefficient to compare anyway (and
 most likely non-text), so mark them binary and skip looking at their
 content.
 ...
 diff --git a/diff.c b/diff.c
 index a489540..7a977aa 100644
 --- a/diff.c
 +++ b/diff.c
 @@ -2185,8 +2185,8 @@ int diff_filespec_is_binary(struct diff_filespec *one)
   one-is_binary = one-driver-binary;
   else {
   if (!one-data  DIFF_FILE_VALID(one))
 - diff_populate_filespec(one, 0);
 - if (one-data)
 + diff_populate_filespec(one, 
 DIFF_POPULATE_IS_BINARY);
 + if (one-is_binary == -1  one-data)
   one-is_binary = buffer_is_binary(one-data,
   one-size);
   if (one-is_binary == -1)

The name is misleading and forced me to read it twice before I
realized that this is populating the is-binary bit.  It might make
it a bit better if you renamed it to DIFF_POPULATE_IS_BINARY_BIT or
perhaps DIFF_POPULATE_CHECK_BINARY or something.  For consistency,
the other bit may want to be also renamed from SIZE_ONLY to either

 (1) CHECK_SIZE_ONLY

 (2) One bit for CHECK_SIZE, another for NO_CONTENTS, and optionally
 make SIZE_ONLY the union of two

I do not have strong preference either way; the latter may be more
logical in that not loading contents and check size are sort of
orthogonal in that you can later choose to check, without loading
contents, only the binary-ness without checking size, but no calles
that passes a non-zero flag to the populate-filespec function will
want to slurp in the contents in practice, so in that sense we could
declare that the NO_CONENTS bit is implied.

But more importantly, would this patch actually help?  For one
thing, this wouldn't (and shouldn't) help if the user wants --binary
diff out of us anyway, I suspect, but I wonder what the following
codepath in the builtin_diff() function would do:

...
} else if (!DIFF_OPT_TST(o, TEXT) 
( (!textconv_one  diff_filespec_is_binary(one)) ||
  (!textconv_two  diff_filespec_is_binary(two)) )) {
if (fill_mmfile(mf1, one)  0 || fill_mmfile(mf2, two)  0)
die(unable to read files to diff);
/* Quite common confusing case */
if (mf1.size == mf2.size 
!memcmp(mf1.ptr, mf2.ptr, mf1.size)) {
if (must_show_header)
fprintf(o-file, %s, header.buf);
goto free_ab_and_return;
}
fprintf(o-file, %s, header.buf);
strbuf_reset(header);
if (DIFF_OPT_TST(o, BINARY))
emit_binary_diff(o-file, mf1, mf2, line_prefix);
else
fprintf(o-file, %sBinary files %s and %s differ\n,
line_prefix, lbl[0], lbl[1]);
o-found_changes = 1;
} else {
...

If we weren't told with --text/-a to force textual output, and
at least one of the sides is marked as binary (and this patch marks
a large blob as binary unless attributes says otherwise), we still
call fill_mmfile() on them to slurp the contents to be compared, no?

And before you get to the DIFF_OPT_TST(o, BINARY), you memcmp(3) to
check if the sides are identical, so...

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html