Re: [PATCH v5 4/4] convert: Stream from fd to required clean filter instead of mmap
On Mon, Aug 25, 2014 at 06:55:51PM +0200, Steffen Prohaska wrote:

> It could be handled that way, but we would be back to the original
> problem that 32-bit git fails for large files. The convert code path
> currently assumes that all data is available in a single buffer at
> some point to apply crlf and ident filters. If the initial filter,
> which is assumed to reduce the file size, fails, we could seek to 0
> and read the entire file. But git would then fail for large files
> with out-of-memory. We would not gain anything for the use case that
> I describe in the commit message's first paragraph.

Ah. So the real problem is that we cannot handle _other_ conversions for large files, and we must try to intercept the data before it gets to them. So this is really just helping reduction filters. Even if our streaming filter succeeds, it does not help the situation if it did not reduce the large file to a smaller one.

It would be nice in the long run to let the other filters stream, too, but that is not a problem we need to solve immediately. Your patch is a strict improvement.

Thanks for the explanation; your approach makes a lot more sense to me now.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v5 4/4] convert: Stream from fd to required clean filter instead of mmap
On Mon, Aug 25, 2014 at 11:35:45AM -0700, Junio C Hamano wrote:

> Steffen Prohaska <proha...@zib.de> writes:
>
> > > Couldn't we do that with an lseek (or even an mmap with offset 0)?
> > > That obviously would not work for non-file inputs, but I think we
> > > address that already in index_fd: we push non-seekable things off
> > > to index_pipe, where we spool them to memory.
> >
> > It could be handled that way, but we would be back to the original
> > problem that 32-bit git fails for large files.
>
> Correct, and you are making an incremental improvement so that such a
> large blob can be handled _when_ the filters can successfully munge it
> back and forth. If we fail due to out of memory when the filters
> cannot, that would be the same as without your improvement, so you are
> still making progress.

I do not think my proposal makes anything worse than Steffen's patch. _If_ you have a non-required filter, and _if_ we can run it, then we stream the filter and hopefully end up with a small enough result to fit into memory. If we cannot run the filter, we are screwed anyway (we follow the regular code path and dump the whole thing into memory; i.e., the same as without this patch series).

I think the main argument against going further is just that it is not worth the complexity. Tell people doing reduction filters they need to use "required", and that accomplishes the same thing.

> > So it seems like the ideal strategy would be:
> >
> >   1. If it's seekable, try streaming. If not, fall back to
> >      lseek/mmap.
> >
> >   2. If it's not seekable and the filter is required, try streaming.
> >      We die anyway if we fail.
>
> Puzzled... Is it assumed that any content for which the filters tell
> us to use the contents from the db as-is (by exiting with a non-zero
> status) will always be too large to fit in-core? For small contents,
> isn't this "ideal strategy" a regression?

I am not sure what you mean by "regression" here. We will try to stream more often, but I do not see that as a bad thing.
-Peff
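The "required" knob Peff tells reduction-filter authors to use is ordinary gitattributes/filter-driver configuration. A minimal sketch (the filter name "media" and its commands are illustrative, not taken from this thread):

```
# .gitattributes
*.bin filter=media

# .git/config (or ~/.gitconfig)
[filter "media"]
	clean = git-media-clean %f
	smudge = git-media-smudge %f
	required = true
```

With required = true, git refuses to fall back to the unfiltered content when the filter fails, which is exactly the precondition the new fd-streaming code path relies on.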
Re: [PATCH v5 4/4] convert: Stream from fd to required clean filter instead of mmap
Jeff King <p...@peff.net> writes:

> On Mon, Aug 25, 2014 at 11:35:45AM -0700, Junio C Hamano wrote:
>
> > Steffen Prohaska <proha...@zib.de> writes:
> >
> > > > Couldn't we do that with an lseek (or even an mmap with offset
> > > > 0)? That obviously would not work for non-file inputs, but I
> > > > think we address that already in index_fd: we push non-seekable
> > > > things off to index_pipe, where we spool them to memory.
> > >
> > > It could be handled that way, but we would be back to the original
> > > problem that 32-bit git fails for large files.
> >
> > Correct, and you are making an incremental improvement so that such
> > a large blob can be handled _when_ the filters can successfully
> > munge it back and forth. If we fail due to out of memory when the
> > filters cannot, that would be the same as without your improvement,
> > so you are still making progress.
>
> I do not think my proposal makes anything worse than Steffen's patch.

I think we are saying the same thing, but perhaps I didn't phrase it well.

> I think the main argument against going further is just that it is
> not worth the complexity. Tell people doing reduction filters they
> need to use "required", and that accomplishes the same thing.
>
> > > So it seems like the ideal strategy would be:
> > >
> > >   1. If it's seekable, try streaming. If not, fall back to
> > >      lseek/mmap.
> > >
> > >   2. If it's not seekable and the filter is required, try
> > >      streaming. We die anyway if we fail.
> >
> > Puzzled... Is it assumed that any content for which the filters
> > tell us to use the contents from the db as-is (by exiting with a
> > non-zero status) will always be too large to fit in-core? For small
> > contents, isn't this "ideal strategy" a regression?
>
> I am not sure what you mean by "regression" here. We will try to
> stream more often, but I do not see that as a bad thing.

I thought the proposed flow I was commenting on was

 - try streaming and die if the filter fails

For an optional filter working on contents that would fit in core, we currently do

 - slurp in memory, filter it, use the original if the filter fails

If we switched to 2., then... ahh, ok, I misread the "is required" part. The regression does not apply to that case at all.
Re: [PATCH v5 4/4] convert: Stream from fd to required clean filter instead of mmap
On Sun, Aug 24, 2014 at 06:07:46PM +0200, Steffen Prohaska wrote:

> The data is streamed to the filter process anyway. Better avoid
> mapping the file if possible. This is especially useful if a clean
> filter reduces the size, for example if it computes a sha1 for binary
> data, like git media. The file size that the previous implementation
> could handle was limited by the available address space; large files
> for example could not be handled with (32-bit) msysgit. The new
> implementation can filter files of any size as long as the filter
> output is small enough.
>
> The new code path is only taken if the filter is required. The filter
> consumes data directly from the fd. The original data is not
> available to git, so it must fail if the filter fails.

Can you clarify this second paragraph a bit more? If I understand correctly, we handle a non-required filter failing by just reading the data again (which we can do because we either read it into memory ourselves, or mmap it). With the streaming approach, we will read the whole file through our stream; if that fails we would then want to read the stream from the start.

Couldn't we do that with an lseek (or even an mmap with offset 0)? That obviously would not work for non-file inputs, but I think we address that already in index_fd: we push non-seekable things off to index_pipe, where we spool them to memory.

So it seems like the ideal strategy would be:

  1. If it's seekable, try streaming. If not, fall back to lseek/mmap.

  2. If it's not seekable and the filter is required, try streaming.
     We die anyway if we fail.

  3. If it's not seekable and the filter is not required, decide based
     on file size:

     a. If it's small, spool to memory and proceed as we do now.

     b. If it's big, spool to a seekable tempfile.

Your patch implements part 2. But I would think part 1 is the most common case. And while part 3b seems unpleasant, it is better than the current code (with or without your patch), which will do 3a on a large file. Hmm.
Though I guess in (3) we do not have the size up front, so it's complicated (we could spool N bytes to memory, then start dumping to a file after that). I do not think we necessarily need to implement that part, though. It seems like (1) is the thing I would expect to hit the most (i.e., people do not always mark their filters as required).

> -	write_err = (write_in_full(child_process.in, params->src, params->size) < 0);
> +	if (params->src) {
> +	    write_err = (write_in_full(child_process.in, params->src, params->size) < 0);

Style: 4-space indentation (rather than a tab). There's more of it in this function (and in would_convert...) that I didn't mark.

> +	} else {
> +	    /* dup(), because copy_fd() closes the input fd. */
> +	    fd = dup(params->fd);

Not a problem you are introducing, but this seems kind of like a misfeature in copy_fd. Is it worth fixing? The function only has two existing callers.

> +	/* Apply a filter to an fd only if the filter is required to succeed.
> +	 * We must die if the filter fails, because the original data before
> +	 * filtering is not available.
> +	 */

Style nit:

  /*
   * We have a blank line at the top of our
   * multi-line comments.
   */

-Peff
Re: [PATCH v5 4/4] convert: Stream from fd to required clean filter instead of mmap
On Aug 25, 2014, at 2:43 PM, Jeff King <p...@peff.net> wrote:

> On Sun, Aug 24, 2014 at 06:07:46PM +0200, Steffen Prohaska wrote:
>
> > The data is streamed to the filter process anyway. Better avoid
> > mapping the file if possible. This is especially useful if a clean
> > filter reduces the size, for example if it computes a sha1 for
> > binary data, like git media. The file size that the previous
> > implementation could handle was limited by the available address
> > space; large files for example could not be handled with (32-bit)
> > msysgit. The new implementation can filter files of any size as
> > long as the filter output is small enough.
> >
> > The new code path is only taken if the filter is required. The
> > filter consumes data directly from the fd. The original data is not
> > available to git, so it must fail if the filter fails.
>
> Can you clarify this second paragraph a bit more? If I understand
> correctly, we handle a non-required filter failing by just reading
> the data again (which we can do because we either read it into memory
> ourselves, or mmap it).

We don't read the data again. convert_to_git() assumes that it is already in memory and simply keeps the original buffer if the filter fails.

> With the streaming approach, we will read the whole file through our
> stream; if that fails we would then want to read the stream from the
> start. Couldn't we do that with an lseek (or even an mmap with offset
> 0)? That obviously would not work for non-file inputs, but I think we
> address that already in index_fd: we push non-seekable things off to
> index_pipe, where we spool them to memory.

It could be handled that way, but we would be back to the original problem that 32-bit git fails for large files. The convert code path currently assumes that all data is available in a single buffer at some point to apply crlf and ident filters. If the initial filter, which is assumed to reduce the file size, fails, we could seek to 0 and read the entire file. But git would then fail for large files with out-of-memory. We would not gain anything for the use case that I describe in the commit message's first paragraph.

To implement something like the ideal strategy below, the entire convert machinery for crlf and ident would have to be converted to a streaming approach.

Another option would be to detect that only the clean filter would be applied, and not crlf and ident. Maybe we could get away with something simpler then. But I think that if the clean filter's purpose is to reduce file size, it does not make sense to try to handle the case of a failing filter with a fallback plan. The filter should simply be marked "required", because any sane operation requires it.

> So it seems like the ideal strategy would be:
>
>   1. If it's seekable, try streaming. If not, fall back to
>      lseek/mmap.
>
>   2. If it's not seekable and the filter is required, try streaming.
>      We die anyway if we fail.
>
>   3. If it's not seekable and the filter is not required, decide
>      based on file size:
>
>      a. If it's small, spool to memory and proceed as we do now.
>
>      b. If it's big, spool to a seekable tempfile.
>
> Your patch implements part 2. But I would think part 1 is the most
> common case. And while part 3b seems unpleasant, it is better than
> the current code (with or without your patch), which will do 3a on a
> large file. Hmm.
>
> Though I guess in (3) we do not have the size up front, so it's
> complicated (we could spool N bytes to memory, then start dumping to
> a file after that). I do not think we necessarily need to implement
> that part, though. It seems like (1) is the thing I would expect to
> hit the most (i.e., people do not always mark their filters as
> required).

Well, I think they have to mark it if the filter's purpose is to reduce size. I'll add a bit of the discussion to the commit message. I'm not convinced that we should do more at this point.

> > +	} else {
> > +	    /* dup(), because copy_fd() closes the input fd. */
> > +	    fd = dup(params->fd);
>
> Not a problem you are introducing, but this seems kind of like a
> misfeature in copy_fd. Is it worth fixing? The function only has two
> existing callers.

I found it confusing. I think it's worth fixing.

Steffen
Re: [PATCH v5 4/4] convert: Stream from fd to required clean filter instead of mmap
Steffen Prohaska <proha...@zib.de> writes:

> > Couldn't we do that with an lseek (or even an mmap with offset 0)?
> > That obviously would not work for non-file inputs, but I think we
> > address that already in index_fd: we push non-seekable things off to
> > index_pipe, where we spool them to memory.
>
> It could be handled that way, but we would be back to the original
> problem that 32-bit git fails for large files.

Correct, and you are making an incremental improvement so that such a large blob can be handled _when_ the filters can successfully munge it back and forth. If we fail due to out of memory when the filters cannot, that would be the same as without your improvement, so you are still making progress.

> To implement something like the ideal strategy below, the entire
> convert machinery for crlf and ident would have to be converted to a
> streaming approach.

Yes, that has always been the longer term vision since the day the streaming infrastructure was introduced.

> > So it seems like the ideal strategy would be:
> >
> >   1. If it's seekable, try streaming. If not, fall back to
> >      lseek/mmap.
> >
> >   2. If it's not seekable and the filter is required, try streaming.
> >      We die anyway if we fail.

Puzzled... Is it assumed that any content for which the filters tell us to use the contents from the db as-is (by exiting with a non-zero status) will always be too large to fit in-core? For small contents, isn't this "ideal strategy" a regression?

> >   3. If it's not seekable and the filter is not required, decide
> >      based on file size:
> >
> >      a. If it's small, spool to memory and proceed as we do now.
> >
> >      b. If it's big, spool to a seekable tempfile.
[PATCH v5 4/4] convert: Stream from fd to required clean filter instead of mmap
The data is streamed to the filter process anyway. Better avoid mapping the file if possible. This is especially useful if a clean filter reduces the size, for example if it computes a sha1 for binary data, like git media. The file size that the previous implementation could handle was limited by the available address space; large files for example could not be handled with (32-bit) msysgit. The new implementation can filter files of any size as long as the filter output is small enough.

The new code path is only taken if the filter is required. The filter consumes data directly from the fd. The original data is not available to git, so it must fail if the filter fails.

The environment variable GIT_MMAP_LIMIT, which was introduced in the previous commit, is used to test that the expected code path is taken. A related test that exercises required filters is modified to verify that the data actually has been modified on its way from the file system to the object store.

Signed-off-by: Steffen Prohaska <proha...@zib.de>
---
 convert.c             | 60 +-
 convert.h             |  5 +
 sha1_file.c           | 27 ++-
 t/t0021-conversion.sh | 24 -
 4 files changed, 104 insertions(+), 12 deletions(-)

diff --git a/convert.c b/convert.c
index cb5fbb4..463f6de 100644
--- a/convert.c
+++ b/convert.c
@@ -312,11 +312,12 @@ static int crlf_to_worktree(const char *path, const char *src, size_t len,
 struct filter_params {
 	const char *src;
 	unsigned long size;
+	int fd;
 	const char *cmd;
 	const char *path;
 };
 
-static int filter_buffer(int in, int out, void *data)
+static int filter_buffer_or_fd(int in, int out, void *data)
 {
 	/*
 	 * Spawn cmd and feed the buffer contents through its stdin.
@@ -325,6 +326,7 @@ static int filter_buffer(int in, int out, void *data)
 	struct filter_params *params = (struct filter_params *)data;
 	int write_err, status;
 	const char *argv[] = { NULL, NULL };
+	int fd;
 
 	/* apply % substitution to cmd */
 	struct strbuf cmd = STRBUF_INIT;
@@ -355,7 +357,17 @@
 	sigchain_push(SIGPIPE, SIG_IGN);
 
-	write_err = (write_in_full(child_process.in, params->src, params->size) < 0);
+	if (params->src) {
+	    write_err = (write_in_full(child_process.in, params->src, params->size) < 0);
+	} else {
+	    /* dup(), because copy_fd() closes the input fd. */
+	    fd = dup(params->fd);
+	    if (fd < 0)
+		write_err = error("failed to dup file descriptor.");
+	    else
+		write_err = copy_fd(fd, child_process.in);
+	}
+
 	if (close(child_process.in))
 		write_err = 1;
 	if (write_err)
@@ -371,7 +383,7 @@
 	return (write_err || status);
 }
 
-static int apply_filter(const char *path, const char *src, size_t len,
+static int apply_filter(const char *path, const char *src, size_t len, int fd,
 			struct strbuf *dst, const char *cmd)
 {
 	/*
@@ -392,11 +404,12 @@
 		return 1;
 
 	memset(&async, 0, sizeof(async));
-	async.proc = filter_buffer;
+	async.proc = filter_buffer_or_fd;
 	async.data = &params;
 	async.out = -1;
 	params.src = src;
 	params.size = len;
+	params.fd = fd;
 	params.cmd = cmd;
 	params.path = path;
@@ -747,6 +760,24 @@ static void convert_attrs(struct conv_attrs *ca, const char *path)
 	}
 }
 
+int would_convert_to_git_filter_fd(const char *path)
+{
+	struct conv_attrs ca;
+
+	convert_attrs(&ca, path);
+	if (!ca.drv)
+		return 0;
+
+	/* Apply a filter to an fd only if the filter is required to succeed.
+	 * We must die if the filter fails, because the original data before
+	 * filtering is not available.
+	 */
+	if (!ca.drv->required)
+		return 0;
+
+	return apply_filter(path, NULL, 0, -1, NULL, ca.drv->clean);
+}
+
 int convert_to_git(const char *path, const char *src, size_t len,
 		   struct strbuf *dst, enum safe_crlf checksafe)
 {
@@ -761,7 +792,7 @@ int convert_to_git(const char *path, const char *src, size_t len,
 		required = ca.drv->required;
 	}
 
-	ret |= apply_filter(path, src, len, dst, filter);
+	ret |= apply_filter(path, src, len, -1, dst, filter);
 	if (!ret && required)
 		die("%s: clean filter '%s' failed", path, ca.drv->name);
 
@@ -778,6 +809,23 @@
 	return ret | ident_to_git(path, src, len, dst, ca.ident);
 }
 
+void convert_to_git_filter_fd(const char *path, int fd, struct strbuf *dst,
+