Re: [PATCH v5 4/4] convert: Stream from fd to required clean filter instead of mmap

2014-08-26 Thread Jeff King
On Mon, Aug 25, 2014 at 06:55:51PM +0200, Steffen Prohaska wrote:

 It could be handled that way, but we would be back to the original problem
 that 32-bit git fails for large files.  The convert code path currently
 assumes that all data is available in a single buffer at some point to apply
 crlf and ident filters.
 
 If the initial filter, which is assumed to reduce the file size, fails, we
 could seek to 0 and read the entire file.  But git would then fail for large
 files with out-of-memory.  We would not gain anything for the use case that
 I describe in the commit message's first paragraph.

Ah. So the real problem is that we cannot handle _other_ conversions for
large files, and we must try to intercept the data before it gets to
them. So this is really just helping reduction filters. Even if our
streaming filter succeeds, it does not help the situation if it did not
reduce the large file to a smaller one.

It would be nice in the long run to let the other filters stream, too,
but that is not a problem we need to solve immediately. Your patch is a
strict improvement.

Thanks for the explanation; your approach makes a lot more sense to me
now.

-Peff


Re: [PATCH v5 4/4] convert: Stream from fd to required clean filter instead of mmap

2014-08-26 Thread Jeff King
On Mon, Aug 25, 2014 at 11:35:45AM -0700, Junio C Hamano wrote:

 Steffen Prohaska proha...@zib.de writes:
 
  Couldn't we do that with an lseek (or even an mmap with offset 0)? That
  obviously would not work for non-file inputs, but I think we address
  that already in index_fd: we push non-seekable things off to index_pipe,
  where we spool them to memory.
 
  It could be handled that way, but we would be back to the original problem
  that 32-bit git fails for large files.
 
 Correct, and you are making an incremental improvement so that such
 a large blob can be handled _when_ the filters can successfully
 munge it back and forth.  If we fail due to out of memory when the
 filters cannot, that would be the same as without your improvement,
 so you are still making progress.

I do not think my proposal makes anything worse than Steffen's patch.
_If_ you have a non-required filter, and _if_ we can run it, then we
stream to the filter and hopefully end up with a small enough result to fit
into memory. If we cannot run the filter, we are screwed anyway (we
follow the regular code path and dump the whole thing into memory; i.e.,
the same as without this patch series).

I think the main argument against going further is just that it is not
worth the complexity. Tell people doing reduction filters they need to
use "required", and that accomplishes the same thing.

  So it seems like the ideal strategy would be:
  
   1. If it's seekable, try streaming. If not, fall back to lseek/mmap.
  
   2. If it's not seekable and the filter is required, try streaming. We
  die anyway if we fail.
 
 Puzzled...  Is it assumed that any content for which the filter tells
 us to "use the contents from the db as-is" by exiting with a non-zero
 status will always be too large to fit in-core?  For small contents,
 isn't this "ideal strategy" a regression?

I am not sure what you mean by regression here. We will try to stream
more often, but I do not see that as a bad thing.

-Peff


Re: [PATCH v5 4/4] convert: Stream from fd to required clean filter instead of mmap

2014-08-26 Thread Junio C Hamano
Jeff King p...@peff.net writes:

 On Mon, Aug 25, 2014 at 11:35:45AM -0700, Junio C Hamano wrote:

 Steffen Prohaska proha...@zib.de writes:
 
  Couldn't we do that with an lseek (or even an mmap with offset 0)? That
  obviously would not work for non-file inputs, but I think we address
  that already in index_fd: we push non-seekable things off to index_pipe,
  where we spool them to memory.
 
  It could be handled that way, but we would be back to the original problem
  that 32-bit git fails for large files.
 
 Correct, and you are making an incremental improvement so that such
 a large blob can be handled _when_ the filters can successfully
 munge it back and forth.  If we fail due to out of memory when the
 filters cannot, that would be the same as without your improvement,
 so you are still making progress.

 I do not think my proposal makes anything worse than Steffen's patch.

I think we are saying the same thing, but perhaps I didn't phrase it
well.

 I think the main argument against going further is just that it is not
 worth the complexity. Tell people doing reduction filters they need to
 use "required", and that accomplishes the same thing.

  So it seems like the ideal strategy would be:
  
   1. If it's seekable, try streaming. If not, fall back to lseek/mmap.
  
   2. If it's not seekable and the filter is required, try streaming. We
  die anyway if we fail.
 
 Puzzled...  Is it assumed that any content for which the filter tells
 us to "use the contents from the db as-is" by exiting with a non-zero
 status will always be too large to fit in-core?  For small contents,
 isn't this "ideal strategy" a regression?

 I am not sure what you mean by regression here. We will try to stream
 more often, but I do not see that as a bad thing.

I thought the proposed flow I was commenting on was

- try streaming and die if the filter fails

For an optional filter working on contents that would fit in core,
we currently do

- slurp in memory, filter it, use the original if the filter fails

If we switched to 2., then... ahh, ok, I misread the "is required" part.
The regression does not apply to that case at all.




Re: [PATCH v5 4/4] convert: Stream from fd to required clean filter instead of mmap

2014-08-25 Thread Jeff King
On Sun, Aug 24, 2014 at 06:07:46PM +0200, Steffen Prohaska wrote:

 The data is streamed to the filter process anyway.  Better avoid mapping
 the file if possible.  This is especially useful if a clean filter
 reduces the size, for example if it computes a sha1 for binary data,
 like git media.  The file size that the previous implementation could
 handle was limited by the available address space; large files for
 example could not be handled with (32-bit) msysgit.  The new
 implementation can filter files of any size as long as the filter output
 is small enough.
 
 The new code path is only taken if the filter is required.  The filter
 consumes data directly from the fd.  The original data is not available
 to git, so it must fail if the filter fails.

Can you clarify this second paragraph a bit more? If I understand
correctly, we handle a non-required filter failing by just reading the
data again (which we can do because we either read it into memory
ourselves, or mmap it). With the streaming approach, we will read the
whole file through our stream; if that fails we would then want to read
the stream from the start.

Couldn't we do that with an lseek (or even an mmap with offset 0)? That
obviously would not work for non-file inputs, but I think we address
that already in index_fd: we push non-seekable things off to index_pipe,
where we spool them to memory.

So it seems like the ideal strategy would be:

  1. If it's seekable, try streaming. If not, fall back to lseek/mmap.

  2. If it's not seekable and the filter is required, try streaming. We
 die anyway if we fail.

  3. If it's not seekable and the filter is not required, decide based
 on file size:

   a. If it's small, spool to memory and proceed as we do now.

   b. If it's big, spool to a seekable tempfile.

Your patch implements part 2. But I would think part 1 is the most common
case. And while part 3b seems unpleasant, it is better than the current
code (with or without your patch), which will do 3a on a large file.

Hmm. Though I guess in (3) we do not have the size up front, so it's
complicated (we could spool N bytes to memory, then start dumping to a
file after that). I do not think we necessarily need to implement that
part, though. It seems like (1) is the thing I would expect to hit the
most (i.e., people do not always mark their filters as required).
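
(Purely for illustration, not part of this series: a rough sketch of
that spool-then-tempfile idea. The 32 MB threshold and the names are
invented, and short writes are ignored for brevity.)

#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SPOOL_LIMIT (32 * 1024 * 1024)

/*
 * Spool a non-seekable input: small inputs end up in *buf/*len with
 * *fd left at -1; anything bigger is written to an unlinked tempfile
 * and *fd is rewound to offset 0.  Returns 0 on success, -1 on error.
 */
static int spool_input(int in, char **buf, size_t *len, int *fd)
{
	char block[8192];
	size_t cap = 0;
	ssize_t n;

	*buf = NULL;
	*len = 0;
	*fd = -1;

	while ((n = read(in, block, sizeof(block))) > 0) {
		if (*fd < 0 && *len + n > SPOOL_LIMIT) {
			/* Too big for memory: switch to a seekable tempfile. */
			char path[] = "/tmp/git-spool-XXXXXX";
			*fd = mkstemp(path);
			if (*fd < 0)
				return -1;
			unlink(path);
			if (write(*fd, *buf, *len) != (ssize_t)*len)
				return -1;
			free(*buf);
			*buf = NULL;
		}
		if (*fd >= 0) {
			if (write(*fd, block, n) != n)
				return -1;
		} else {
			if (*len + n > cap) {
				char *grown;
				cap = (*len + n) * 2;
				grown = realloc(*buf, cap);
				if (!grown)
					return -1;
				*buf = grown;
			}
			memcpy(*buf + *len, block, n);
			*len += n;
		}
	}
	if (n < 0)
		return -1;
	if (*fd >= 0 && lseek(*fd, 0, SEEK_SET) < 0)
		return -1;
	return 0;
}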

 -	write_err = (write_in_full(child_process.in, params->src, params->size) < 0);
 +	if (params->src) {
 +		write_err = (write_in_full(child_process.in, params->src, params->size) < 0);

Style: 4-space indentation (rather than a tab). There's more of it in
this function (and in would_convert...) that I didn't mark.

 +	} else {
 +		/* dup(), because copy_fd() closes the input fd. */
 +		fd = dup(params->fd);

Not a problem you are introducing, but this seems kind of like a
misfeature in copy_fd. Is it worth fixing? The function only has two
existing callers.
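
(Not a patch, just to illustrate: a variant that leaves the input fd
open, letting the caller keep ownership, would let filter_buffer_or_fd()
drop the dup(). Assuming git's usual helpers xread(), write_in_full()
and error(); untested.)

static int copy_fd_noclose(int ifd, int ofd)
{
	while (1) {
		char buffer[8192];
		ssize_t len = xread(ifd, buffer, sizeof(buffer));
		if (!len)
			break;
		if (len < 0)
			return error("copy-fd: read returned %s",
				     strerror(errno));
		if (write_in_full(ofd, buffer, len) < 0)
			return error("copy-fd: write returned %s",
				     strerror(errno));
	}
	/* note: ifd is left open for the caller to close */
	return 0;
}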

 + /* Apply a filter to an fd only if the filter is required to succeed.
 +  * We must die if the filter fails, because the original data before
 +  * filtering is not available.
 +  */

Style nit:

  /*
   * We have a blank line at the top of our
   * multi-line comments.
   */

-Peff


Re: [PATCH v5 4/4] convert: Stream from fd to required clean filter instead of mmap

2014-08-25 Thread Steffen Prohaska

On Aug 25, 2014, at 2:43 PM, Jeff King p...@peff.net wrote:

 On Sun, Aug 24, 2014 at 06:07:46PM +0200, Steffen Prohaska wrote:
 
 The data is streamed to the filter process anyway.  Better avoid mapping
 the file if possible.  This is especially useful if a clean filter
 reduces the size, for example if it computes a sha1 for binary data,
 like git media.  The file size that the previous implementation could
 handle was limited by the available address space; large files for
 example could not be handled with (32-bit) msysgit.  The new
 implementation can filter files of any size as long as the filter output
 is small enough.
 
 The new code path is only taken if the filter is required.  The filter
 consumes data directly from the fd.  The original data is not available
 to git, so it must fail if the filter fails.
 
 Can you clarify this second paragraph a bit more? If I understand
 correctly, we handle a non-required filter failing by just reading the
 data again (which we can do because we either read it into memory
 ourselves, or mmap it).

We don't read the data again.  convert_to_git() assumes that it is already
in memory and simply keeps the original buffer if the filter fails.


 With the streaming approach, we will read the
 whole file through our stream; if that fails we would then want to read
 the stream from the start.
 
 Couldn't we do that with an lseek (or even an mmap with offset 0)? That
 obviously would not work for non-file inputs, but I think we address
 that already in index_fd: we push non-seekable things off to index_pipe,
 where we spool them to memory.

It could be handled that way, but we would be back to the original problem
that 32-bit git fails for large files.  The convert code path currently
assumes that all data is available in a single buffer at some point to apply
crlf and ident filters.

If the initial filter, which is assumed to reduce the file size, fails, we
could seek to 0 and read the entire file.  But git would then fail for large
files with out-of-memory.  We would not gain anything for the use case that
I describe in the commit message's first paragraph.

To implement something like the ideal strategy below, the entire convert 
machinery for crlf and ident would have to be converted to a streaming
approach.  Another option would be to detect that only the clean filter
would be applied and not crlf and ident.  Maybe we could get away with
something simpler then.
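
(Just to sketch what that detection might look like; this assumes the
conv_attrs fields (drv, crlf_action, ident) as convert.c defines them,
so treat it as an illustration rather than a proposal:)

/*
 * Stream to the clean filter only when it is the sole conversion,
 * i.e. the path is binary and no ident substitution applies.
 */
static int only_clean_filter_applies(const struct conv_attrs *ca)
{
	if (!ca->drv || !ca->drv->clean)
		return 0;
	if (ca->ident)
		return 0;
	if (ca->crlf_action != CRLF_BINARY)
		return 0;
	return 1;
}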

But I think that if the clean filter's purpose is to reduce file size, it
does not make sense to try to handle the case of a failing filter with a 
fallback plan.  The filter should simply be marked required, because
any sane operation requires it.


 So it seems like the ideal strategy would be:
 
  1. If it's seekable, try streaming. If not, fall back to lseek/mmap.
 
  2. If it's not seekable and the filter is required, try streaming. We
 die anyway if we fail.
 
  3. If it's not seekable and the filter is not required, decide based
 on file size:
 
   a. If it's small, spool to memory and proceed as we do now.
 
   b. If it's big, spool to a seekable tempfile.
 
 Your patch implements part 2. But I would think part 1 is the most common
 case. And while part 3b seems unpleasant, it is better than the current
 code (with or without your patch), which will do 3a on a large file.
 
 Hmm. Though I guess in (3) we do not have the size up front, so it's
 complicated (we could spool N bytes to memory, then start dumping to a
 file after that). I do not think we necessarily need to implement that
 part, though. It seems like (1) is the thing I would expect to hit the
 most (i.e., people do not always mark their filters as required).

Well, I think they have to mark it if the filter's purpose is to reduce size.

I'll add a bit of the discussion to the commit message.  I'm not convinced
that we should do more at this point.


 +	} else {
 +		/* dup(), because copy_fd() closes the input fd. */
 +		fd = dup(params->fd);
 
 Not a problem you are introducing, but this seems kind of like a
 misfeature in copy_fd. Is it worth fixing? The function only has two
 existing callers.

I found it confusing.  I think it's worth fixing.

Steffen


Re: [PATCH v5 4/4] convert: Stream from fd to required clean filter instead of mmap

2014-08-25 Thread Junio C Hamano
Steffen Prohaska proha...@zib.de writes:

 Couldn't we do that with an lseek (or even an mmap with offset 0)? That
 obviously would not work for non-file inputs, but I think we address
 that already in index_fd: we push non-seekable things off to index_pipe,
 where we spool them to memory.

 It could be handled that way, but we would be back to the original problem
 that 32-bit git fails for large files.

Correct, and you are making an incremental improvement so that such
a large blob can be handled _when_ the filters can successfully
munge it back and forth.  If we fail due to out of memory when the
filters cannot, that would be the same as without your improvement,
so you are still making progress.

 To implement something like the ideal strategy below, the entire convert 
 machinery for crlf and ident would have to be converted to a streaming
 approach.

Yes, that has always been the longer term vision since the day the
streaming infrastructure was introduced.

 So it seems like the ideal strategy would be:
 
  1. If it's seekable, try streaming. If not, fall back to lseek/mmap.
 
  2. If it's not seekable and the filter is required, try streaming. We
 die anyway if we fail.

Puzzled...  Is it assumed that any content for which the filter tells
us to "use the contents from the db as-is" by exiting with a non-zero
status will always be too large to fit in-core?  For small contents,
isn't this "ideal strategy" a regression?

  3. If it's not seekable and the filter is not required, decide based
 on file size:
 
   a. If it's small, spool to memory and proceed as we do now.
 
   b. If it's big, spool to a seekable tempfile.


[PATCH v5 4/4] convert: Stream from fd to required clean filter instead of mmap

2014-08-24 Thread Steffen Prohaska
The data is streamed to the filter process anyway.  Better avoid mapping
the file if possible.  This is especially useful if a clean filter
reduces the size, for example if it computes a sha1 for binary data,
like git media.  The file size that the previous implementation could
handle was limited by the available address space; large files for
example could not be handled with (32-bit) msysgit.  The new
implementation can filter files of any size as long as the filter output
is small enough.

The new code path is only taken if the filter is required.  The filter
consumes data directly from the fd.  The original data is not available
to git, so it must fail if the filter fails.

The environment variable GIT_MMAP_LIMIT, which was introduced in the
previous commit, is used to test that the expected code path is
taken.  A related test that exercises required filters is modified to
verify that the data actually has been modified on its way from the file
system to the object store.

Signed-off-by: Steffen Prohaska proha...@zib.de
---
 convert.c             | 60 +--
 convert.h             |  5 +
 sha1_file.c           | 27 ++-
 t/t0021-conversion.sh | 24 -
 4 files changed, 104 insertions(+), 12 deletions(-)

diff --git a/convert.c b/convert.c
index cb5fbb4..463f6de 100644
--- a/convert.c
+++ b/convert.c
@@ -312,11 +312,12 @@ static int crlf_to_worktree(const char *path, const char *src, size_t len,
 struct filter_params {
const char *src;
unsigned long size;
+   int fd;
const char *cmd;
const char *path;
 };
 
-static int filter_buffer(int in, int out, void *data)
+static int filter_buffer_or_fd(int in, int out, void *data)
 {
/*
 * Spawn cmd and feed the buffer contents through its stdin.
@@ -325,6 +326,7 @@ static int filter_buffer(int in, int out, void *data)
struct filter_params *params = (struct filter_params *)data;
int write_err, status;
const char *argv[] = { NULL, NULL };
+   int fd;
 
/* apply % substitution to cmd */
struct strbuf cmd = STRBUF_INIT;
@@ -355,7 +357,17 @@ static int filter_buffer(int in, int out, void *data)
 
sigchain_push(SIGPIPE, SIG_IGN);
 
-	write_err = (write_in_full(child_process.in, params->src, params->size) < 0);
+	if (params->src) {
+		write_err = (write_in_full(child_process.in, params->src, params->size) < 0);
+	} else {
+		/* dup(), because copy_fd() closes the input fd. */
+		fd = dup(params->fd);
+		if (fd < 0)
+			write_err = error("failed to dup file descriptor.");
+		else
+			write_err = copy_fd(fd, child_process.in);
+	}
+
if (close(child_process.in))
write_err = 1;
if (write_err)
@@ -371,7 +383,7 @@ static int filter_buffer(int in, int out, void *data)
return (write_err || status);
 }
 
-static int apply_filter(const char *path, const char *src, size_t len,
+static int apply_filter(const char *path, const char *src, size_t len, int fd,
 struct strbuf *dst, const char *cmd)
 {
/*
@@ -392,11 +404,12 @@ static int apply_filter(const char *path, const char *src, size_t len,
return 1;
 
memset(async, 0, sizeof(async));
-   async.proc = filter_buffer;
+   async.proc = filter_buffer_or_fd;
async.data = params;
async.out = -1;
params.src = src;
params.size = len;
+   params.fd = fd;
params.cmd = cmd;
params.path = path;
 
@@ -747,6 +760,24 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
}
 }
 
+int would_convert_to_git_filter_fd(const char *path)
+{
+	struct conv_attrs ca;
+
+	convert_attrs(&ca, path);
+	if (!ca.drv)
+		return 0;
+
+	/* Apply a filter to an fd only if the filter is required to succeed.
+	 * We must die if the filter fails, because the original data before
+	 * filtering is not available.
+	 */
+	if (!ca.drv->required)
+		return 0;
+
+	return apply_filter(path, NULL, 0, -1, NULL, ca.drv->clean);
+}
+
 int convert_to_git(const char *path, const char *src, size_t len,
struct strbuf *dst, enum safe_crlf checksafe)
 {
@@ -761,7 +792,7 @@ int convert_to_git(const char *path, const char *src, size_t len,
required = ca.drv-required;
}
 
-	ret |= apply_filter(path, src, len, dst, filter);
+	ret |= apply_filter(path, src, len, -1, dst, filter);
 	if (!ret && required)
 		die("%s: clean filter '%s' failed", path, ca.drv->name);
 
@@ -778,6 +809,23 @@ int convert_to_git(const char *path, const char *src, size_t len,
return ret | ident_to_git(path, src, len, dst, ca.ident);
 }
 
+void convert_to_git_filter_fd(const char *path, int fd, struct strbuf *dst,
+