Re: [PATCH 1/9] Convert pack-objects to size_t
On Wed, Aug 16, 2017 at 10:22:39PM +0200, Martin Koegler wrote:
> On Mon, Aug 14, 2017 at 10:08:05AM -0700, Junio C Hamano wrote:
> > It may help reducing the maintenance if we introduced obj_size_t
> > that is defined to be size_t for now, so that we can later swap
> > it to ofs_t or some larger type when we know we do need to
> > support objects whose size cannot be expressed in size_t, but I
> > do not offhand know what the pros-and-cons with such an approach
> > would look like.
>
> Where should the use of obj_size_t end and the use of size_t start?
>
> We often determine an object size and then pass it to malloc.
> We would start with a larger datatype and then truncate for memory
> allocation, which uses size_t.

The truncation is done with xsize_t; the "old" sha1_file.c has
something like this:

	idx_size = xsize_t(st.st_size);

I personally don't think that obj_size_t gives us that much.
There are "objects" which are "on disk", and they have a length of
type off_t, and there are "objects" loaded into memory, and they
have a length of type size_t.  And everybody can check that we use
the right type.

Additionally, I don't like it very much when an object size in
memory is counted in a 64-bit value (and this will happen if
obj_size_t == off_t == 64 bit).

Anyhow, to answer your question: all calls to xmalloc() must be
prepared like this:

	p = xmalloc(xsize_t(size_of_object));

That should do the trick.

> Regards,
> Martin
Re: [PATCH 1/9] Convert pack-objects to size_t
On Mon, Aug 14, 2017 at 10:08:05AM -0700, Junio C Hamano wrote:
> It may help reducing the maintenance if we introduced obj_size_t
> that is defined to be size_t for now, so that we can later swap
> it to ofs_t or some larger type when we know we do need to
> support objects whose size cannot be expressed in size_t, but I
> do not offhand know what the pros-and-cons with such an approach
> would look like.

Where should the use of obj_size_t end and the use of size_t start?

We often determine an object size and then pass it to malloc.
We would start with a larger datatype and then truncate for memory
allocation, which uses size_t.

Regards,
Martin
Re: [PATCH 1/9] Convert pack-objects to size_t
On 14/08/17 21:32, Torsten Bögershausen wrote:
> In general, yes.
> I had a patch that allowed a 32-bit Linux to commit a file >4GB.
> There is however a restriction: the file must not yet be known to
> the index.  Otherwise the "diff" machinery kicks in, and tries to
> compare the "old" and the "new" versions, which means that -both-
> must fit into "memory" at the same time.  Memory here means the
> available address space rather than real memory.
> So there may be room for improvement for 32-bit systems, but that
> is another story, which can be developed independently of the
> ulong->size_t conversion.

Oh, I absolutely agree.

> Thanks to Martin for working on this.

Indeed!  Thanks Martin.

ATB,
Ramsay Jones
Re: [PATCH 1/9] Convert pack-objects to size_t
On Mon, Aug 14, 2017 at 08:31:50PM +0100, Ramsay Jones wrote:
> On 14/08/17 18:08, Junio C Hamano wrote:
> > Junio C Hamano writes:
> >
> >> One interesting question is which of these two types we should use
> >> for the size of objects Git uses.
> >>
> >> Most of the "interesting" operations done by Git require that the
> >> thing is in core as a whole before we can do anything (e.g. compare
> >> two such things to produce delta, have one in core and apply patch),
> >> so it is tempting that we deal with size_t, but at the lowest level
> >> to serve as a SCM, i.e. recording the state of a file at each
> >> version, we actually should be able to exceed the in-core
> >> limit---both "git add" of a huge file whose contents would not fit
> >> in-core and "git checkout" of a huge blob whose inflated contents
> >> would not fit in-core should (in theory, modulo bugs) be able to
> >> exercise the streaming interface to handle such a case without
> >> holding everything in-core at once.  So from that point of view,
> >> even size_t may not be the "correct" type to use.
> >
> > A few additions to the above observations.
> >
> >  - We have a varint that encodes how far the location of a delta
> >    representation of an object is from its base object in the
> >    packfile.  Both encoding and decoding sides in the current code
> >    use off_t to represent this offset, so we can already reference
> >    an object that is far in the same packfile as a base.
> >
> >  - I think it is OK in practice to limit the size of individual
> >    objects to size_t (i.e. on 32-bit arch, you cannot interact with
> >    a repository with an object whose size exceeds 4GB).  Using off_t
> >    would allow occasional ultra-huge objects that can only be added
> >    and checked in via the streaming API on such a platform, but I
> >    suspect that it may become too much of a hassle to maintain.
>
> In a previous comment, I said that (on 32-bit Linux) it was likely
> that an object of > 4GB could not be handled correctly anyway (more
> likely > 2GB).  This was based on the code from (quite some) years
> ago, in particular, before you added the "streaming API".  So, maybe
> a 32-bit arch _should_ be able to handle objects as large as the LFS
> API allows.  (Ignoring, for the moment, that I think anybody who
> puts files of that size into an SCM probably gets what they deserve.
> :-P )

In general, yes.
I had a patch that allowed a 32-bit Linux to commit a file >4GB.
There is however a restriction: the file must not yet be known to
the index.  Otherwise the "diff" machinery kicks in, and tries to
compare the "old" and the "new" versions, which means that -both-
must fit into "memory" at the same time.  Memory here means the
available address space rather than real memory.
So there may be room for improvement for 32-bit systems, but that is
another story, which can be developed independently of the
ulong->size_t conversion.

> The two patches I commented on, however, changed the type of some
> variables from off_t to size_t.  In general, the patches did not
> seem to make anything worse, but these type changes could
> potentially do harm.  Hence my comment.  (I still haven't tried the
> patches on my 32-bit Linux system.  I only boot it up about once a
> week, and I would rather wait until the patches are in the 'pu'
> branch before testing.)
>
> ATB,
> Ramsay Jones

And here (in agreement with the answer from Junio): we should not
change any off_t into size_t, since that means a possible downgrade.
Whenever an off_t is converted into size_t, a function from
git-compat-util.h is used:

static inline size_t xsize_t(off_t len)
{
	if (len > (size_t) len)
		die("Cannot handle files this big");
	return (size_t)len;
}

Having written this, it may be a good idea to define a similar
function xulong() which is used when we do calls into zlib.

Thanks to Martin for working on this.
If there is a branch on a public repo, I can e.g. run the test suite
on different systems.
Re: [PATCH 1/9] Convert pack-objects to size_t
Ramsay Jones writes:

> In a previous comment, I said that (on 32-bit Linux) it was likely
> that an object of > 4GB could not be handled correctly anyway (more
> likely > 2GB).  This was based on the code from (quite some) years
> ago, in particular, before you added the "streaming API".  So, maybe
> a 32-bit arch _should_ be able to handle objects as large as the LFS
> API allows.  (Ignoring, for the moment, that I think anybody who
> puts files of that size into an SCM probably gets what they deserve.
> :-P )
>
> The two patches I commented on, however, changed the type of some
> variables from off_t to size_t.  In general, the patches did not
> seem to make anything worse, but these type changes could
> potentially do harm.  Hence my comment.  (I still haven't tried the
> patches on my 32-bit Linux system.  I only boot it up about once a
> week, and I would rather wait until the patches are in the 'pu'
> branch before testing.)

We are in perfect agreement.  I didn't mean to say that it is OK to
replace off_t with size_t without a good reason, especially when the
current code (at least the part I looked at, like the OFS_DELTA part)
seems to use off_t correctly.  Your review comments are very much
appreciated, as is the effort started by Martin to take us in the
direction of using types more appropriate than "ulong".
Re: [PATCH 1/9] Convert pack-objects to size_t
On 14/08/17 18:08, Junio C Hamano wrote:
> Junio C Hamano writes:
>
>> One interesting question is which of these two types we should use
>> for the size of objects Git uses.
>>
>> Most of the "interesting" operations done by Git require that the
>> thing is in core as a whole before we can do anything (e.g. compare
>> two such things to produce delta, have one in core and apply patch),
>> so it is tempting that we deal with size_t, but at the lowest level
>> to serve as a SCM, i.e. recording the state of a file at each
>> version, we actually should be able to exceed the in-core
>> limit---both "git add" of a huge file whose contents would not fit
>> in-core and "git checkout" of a huge blob whose inflated contents
>> would not fit in-core should (in theory, modulo bugs) be able to
>> exercise the streaming interface to handle such a case without
>> holding everything in-core at once.  So from that point of view,
>> even size_t may not be the "correct" type to use.
>
> A few additions to the above observations.
>
>  - We have a varint that encodes how far the location of a delta
>    representation of an object is from its base object in the
>    packfile.  Both encoding and decoding sides in the current code
>    use off_t to represent this offset, so we can already reference
>    an object that is far in the same packfile as a base.
>
>  - I think it is OK in practice to limit the size of individual
>    objects to size_t (i.e. on 32-bit arch, you cannot interact with
>    a repository with an object whose size exceeds 4GB).  Using off_t
>    would allow occasional ultra-huge objects that can only be added
>    and checked in via the streaming API on such a platform, but I
>    suspect that it may become too much of a hassle to maintain.

In a previous comment, I said that (on 32-bit Linux) it was likely
that an object of > 4GB could not be handled correctly anyway (more
likely > 2GB).  This was based on the code from (quite some) years
ago, in particular, before you added the "streaming API".  So, maybe
a 32-bit arch _should_ be able to handle objects as large as the LFS
API allows.  (Ignoring, for the moment, that I think anybody who puts
files of that size into an SCM probably gets what they deserve. :-P )

The two patches I commented on, however, changed the type of some
variables from off_t to size_t.  In general, the patches did not
seem to make anything worse, but these type changes could potentially
do harm.  Hence my comment.  (I still haven't tried the patches on my
32-bit Linux system.  I only boot it up about once a week, and I
would rather wait until the patches are in the 'pu' branch before
testing.)

ATB,
Ramsay Jones
Re: [PATCH 1/9] Convert pack-objects to size_t
Junio C Hamano writes:

> One interesting question is which of these two types we should use
> for the size of objects Git uses.
>
> Most of the "interesting" operations done by Git require that the
> thing is in core as a whole before we can do anything (e.g. compare
> two such things to produce delta, have one in core and apply patch),
> so it is tempting that we deal with size_t, but at the lowest level
> to serve as a SCM, i.e. recording the state of a file at each
> version, we actually should be able to exceed the in-core
> limit---both "git add" of a huge file whose contents would not fit
> in-core and "git checkout" of a huge blob whose inflated contents
> would not fit in-core should (in theory, modulo bugs) be able to
> exercise the streaming interface to handle such a case without
> holding everything in-core at once.  So from that point of view,
> even size_t may not be the "correct" type to use.

A few additions to the above observations.

 - We have a varint that encodes how far the location of a delta
   representation of an object is from its base object in the
   packfile.  Both encoding and decoding sides in the current code
   use off_t to represent this offset, so we can already reference
   an object that is far in the same packfile as a base.

 - I think it is OK in practice to limit the size of individual
   objects to size_t (i.e. on 32-bit arch, you cannot interact with
   a repository with an object whose size exceeds 4GB).  Using off_t
   would allow occasional ultra-huge objects that can only be added
   and checked in via the streaming API on such a platform, but I
   suspect that it may become too much of a hassle to maintain.

It may help reducing the maintenance if we introduced obj_size_t
that is defined to be size_t for now, so that we can later swap it
to ofs_t or some larger type when we know we do need to support
objects whose size cannot be expressed in size_t, but I do not
offhand know what the pros-and-cons with such an approach would
look like.

Thanks.
Re: [PATCH 1/9] Convert pack-objects to size_t
Ramsay Jones writes:

> It should be, see commit b97e911643 ("Support for large files
> on 32bit systems.", 17-02-2007), where you can see that the
> _FILE_OFFSET_BITS macro is set to 64.  This asks glibc, et al., to
> use the "Large File System" API and a 64-bit off_t.

Correct.  Roughly speaking, we should view off_t as the size of
things you can store in a file, and size_t as the size of things you
can have in core.

When we allocate a region of memory, and then repeatedly fill it
with some data and hand that memory to a helper function, e.g.
git_deflate(), each of these calls should expect to get data whose
size can be representable by size_t, and if that is shorter than the
ulong we currently use, we are artificially limiting our potential
by using a type that is narrower than necessary.

The results from these helper functions that are repeatedly called
may be sent to a file as the output from the loop.  If the logic in
the outer loop wants to keep a tally of the total size of the data
processed, that number may not fit in size_t and may instead require
off_t.

One interesting question is which of these two types we should use
for the size of objects Git uses.

Most of the "interesting" operations done by Git require that the
thing is in core as a whole before we can do anything (e.g. compare
two such things to produce delta, have one in core and apply patch),
so it is tempting that we deal with size_t, but at the lowest level
to serve as a SCM, i.e. recording the state of a file at each
version, we actually should be able to exceed the in-core
limit---both "git add" of a huge file whose contents would not fit
in-core and "git checkout" of a huge blob whose inflated contents
would not fit in-core should (in theory, modulo bugs) be able to
exercise the streaming interface to handle such a case without
holding everything in-core at once.  So from that point of view,
even size_t may not be the "correct" type to use.
Re: [PATCH 1/9] Convert pack-objects to size_t
On 13/08/17 19:30, Martin Koegler wrote:
> On Sat, Aug 12, 2017 at 02:47:25PM +0100, Ramsay Jones wrote:
>> On 32-bit Linux, off_t is 64-bit and size_t is 32-bit.
>
> --- t.c ---
> #include <sys/types.h>
> #include <stdio.h>
>
> int main()
> {
> 	printf("%zu %zu\n", sizeof(size_t), sizeof(off_t));
> }
>
> $ gcc -m32 -o t t.c
> $ ./t
> 4 4
>
> So is that really true?

It should be, see commit b97e911643 ("Support for large files
on 32bit systems.", 17-02-2007), where you can see that the
_FILE_OFFSET_BITS macro is set to 64.  This asks glibc, et al., to
use the "Large File System" API and a 64-bit off_t.

I can't boot my 32-bit installation at the moment, and it seems that
my 64-bit multilib system is not playing ball:

  $ gcc -m32 -D_FILE_OFFSET_BITS=64 -o t t.c
  In file included from /usr/include/stdio.h:27:0,
                   from t.c:2:
  /usr/include/features.h:367:25: fatal error: sys/cdefs.h: No such file or directory
  compilation terminated.
  $

ATB,
Ramsay Jones
Re: [PATCH 1/9] Convert pack-objects to size_t
On Sat, Aug 12, 2017 at 02:47:25PM +0100, Ramsay Jones wrote:
> On 32-bit Linux, off_t is 64-bit and size_t is 32-bit.

--- t.c ---
#include <sys/types.h>
#include <stdio.h>

int main()
{
	printf("%zu %zu\n", sizeof(size_t), sizeof(off_t));
}

$ gcc -m32 -o t t.c
$ ./t
4 4

So is that really true?

Regards,
Martin
Re: [PATCH 1/9] Convert pack-objects to size_t
On Sat, Aug 12, 2017 at 11:59:15AM +0200, Torsten Bögershausen wrote:
> Thanks for working on this - unfortunately the patch does not apply
> on git.git/master.
>
> Which baseline did you use ?

next - 98096fd7a85b93626db8757f944f2d8ffdf7e96a

It accumulated to 19 patches.

Regards,
Martin
Re: [PATCH 1/9] Convert pack-objects to size_t
On 12/08/17 09:47, Martin Koegler wrote:
> From: Martin Koegler
>
> Signed-off-by: Martin Koegler
> ---
>  builtin/pack-objects.c | 48
>  1 file changed, 24 insertions(+), 24 deletions(-)
>
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index aa70f80..f8db283 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -56,7 +56,7 @@ static struct pack_idx_option pack_idx_opts;
>  static const char *base_name;
>  static int progress = 1;
>  static int window = 10;
> -static unsigned long pack_size_limit;
> +static size_t pack_size_limit;
>  static int depth = 50;
>  static int delta_search_threads;
>  static int pack_to_stdout;
> @@ -72,11 +72,11 @@ static int use_bitmap_index = -1;
>  static int write_bitmap_index;
>  static uint16_t write_bitmap_options;
>
> -static unsigned long delta_cache_size = 0;
> -static unsigned long max_delta_cache_size = 256 * 1024 * 1024;
> -static unsigned long cache_max_small_delta_size = 1000;
> +static size_t delta_cache_size = 0;
> +static size_t max_delta_cache_size = 256 * 1024 * 1024;
> +static size_t cache_max_small_delta_size = 1000;
>
> -static unsigned long window_memory_limit = 0;
> +static size_t window_memory_limit = 0;
>
>  /*
>   * stats
> @@ -124,11 +124,11 @@ static void *get_delta(struct object_entry *entry)
>  	return delta_buf;
>  }
>
> -static unsigned long do_compress(void **pptr, unsigned long size)
> +static size_t do_compress(void **pptr, size_t size)
>  {
>  	git_zstream stream;
>  	void *in, *out;
> -	unsigned long maxsize;
> +	size_t maxsize;
>
>  	git_deflate_init(&stream, pack_compression_level);
>  	maxsize = git_deflate_bound(&stream, size);
> @@ -149,13 +149,13 @@ static unsigned long do_compress(void **pptr, unsigned long size)
>  	return stream.total_out;
>  }
>
> -static unsigned long write_large_blob_data(struct git_istream *st, struct sha1file *f,
> +static size_t write_large_blob_data(struct git_istream *st, struct sha1file *f,
>  				    const unsigned char *sha1)
>  {
>  	git_zstream stream;
>  	unsigned char ibuf[1024 * 16];
>  	unsigned char obuf[1024 * 16];
> -	unsigned long olen = 0;
> +	size_t olen = 0;
>
>  	git_deflate_init(&stream, pack_compression_level);
>
> @@ -196,7 +196,7 @@ static int check_pack_inflate(struct packed_git *p,
>  		struct pack_window **w_curs,
>  		off_t offset,
>  		off_t len,
> -		unsigned long expect)
> +		size_t expect)
>  {
>  	git_zstream stream;
>  	unsigned char fakebuf[4096], *in;
> @@ -238,13 +238,13 @@ static void copy_pack_data(struct sha1file *f,
>  }
>
>  /* Return 0 if we will bust the pack-size limit */
> -static unsigned long write_no_reuse_object(struct sha1file *f, struct object_entry *entry,
> -					   unsigned long limit, int usable_delta)
> +static size_t write_no_reuse_object(struct sha1file *f, struct object_entry *entry,
> +				    size_t limit, int usable_delta)
>  {
>  	size_t size, datalen;
>  	unsigned char header[MAX_PACK_OBJECT_HEADER],
>  		      dheader[MAX_PACK_OBJECT_HEADER];
> -	unsigned hdrlen;
> +	size_t hdrlen;
>  	enum object_type type;
>  	void *buf;
>  	struct git_istream *st = NULL;
> @@ -350,17 +350,17 @@ static unsigned long write_no_reuse_object(struct sha1file *f, struct object_ent
>
>  /* Return 0 if we will bust the pack-size limit */
>  static off_t write_reuse_object(struct sha1file *f, struct object_entry *entry,
> -				unsigned long limit, int usable_delta)
> +				size_t limit, int usable_delta)
>  {
>  	struct packed_git *p = entry->in_pack;
>  	struct pack_window *w_curs = NULL;
>  	struct revindex_entry *revidx;
>  	off_t offset;
>  	enum object_type type = entry->type;
> -	off_t datalen;
> +	size_t datalen;

On 32-bit Linux, off_t is 64-bit and size_t is 32-bit.  Later in this
function, datalen is set to the difference of two (revidx) off_t
offsets.  I haven't applied and tried this on my 32-bit Linux install
yet (that installation only gets booted about once a week), but I
suspect a 'possible data-loss' warning.

For other reasons, I don't think it is possible for a 32-bit system
to create a packfile with an individual object of size > 4GB anyway,
so this may not be a problem in practice.  (A 32-bit system should be
able to read such a packfile, presumably created on a 64-bit system,
and refuse to load such an object, of course.)

ATB,
Ramsay Jones
Re: [PATCH 1/9] Convert pack-objects to size_t
Thanks for working on this - unfortunately the patch does not apply
on git.git/master.

Which baseline did you use ?