Re: [PATCH 1/9] Convert pack-objects to size_t

2017-08-17 Thread Torsten Bögershausen
On Wed, Aug 16, 2017 at 10:22:39PM +0200, Martin Koegler wrote:
> On Mon, Aug 14, 2017 at 10:08:05AM -0700, Junio C Hamano wrote:
> >It may help reduce the maintenance burden if we introduced obj_size_t,
> >defined to be size_t for now, so that we can later swap it to ofs_t
> >or some larger type when we know we do need to support objects whose
> >size cannot be expressed in size_t, but I do not offhand know what
> >the pros-and-cons of such an approach would look like.
> 
> Where should the use of obj_size_t end and the use of size_t start? 
> 
> We often determine an object size and then pass it to malloc.
> We would start with a larger datatype and then truncate for memory
> allocation, which uses size_t.

The truncation is done with xsize_t:
The "old" sha1_file.c has something like this:
idx_size = xsize_t(st.st_size);

I personally don't think that obj_size_t buys us much.
There are "objects" which are on disk, and they have a length of type off_t,
and there are "objects" loaded into memory, and they have a length of type
size_t. And everybody can check that we use the right type.
Additionally, I don't like it very much when the object size in memory is
counted in a 64-bit value (and this will happen if obj_size_t == off_t ==
64 bit).

Anyhow, to answer your question:
All calls to xmalloc() must be prepared like this:

p = xmalloc(xsize_t(size_of_object));

That should do the trick.

> 
> Regards,
> Martin


Re: [PATCH 1/9] Convert pack-objects to size_t

2017-08-16 Thread Martin Koegler
On Mon, Aug 14, 2017 at 10:08:05AM -0700, Junio C Hamano wrote:
>It may help reduce the maintenance burden if we introduced obj_size_t,
>defined to be size_t for now, so that we can later swap it to ofs_t
>or some larger type when we know we do need to support objects whose
>size cannot be expressed in size_t, but I do not offhand know what
>the pros-and-cons of such an approach would look like.

Where should the use of obj_size_t end and the use of size_t start? 

We often determine an object size and then pass it to malloc.
We would start with a larger datatype and then truncate for memory
allocation, which uses size_t.

Regards,
Martin


Re: [PATCH 1/9] Convert pack-objects to size_t

2017-08-14 Thread Ramsay Jones


On 14/08/17 21:32, Torsten Bögershausen wrote:
> In general, yes.
> I had a patch that allowed a 32-bit Linux to commit a file >4GB.
> There is however a restriction: the file must not yet be known to the
> index. Otherwise the "diff" machinery kicks in and tries to
> compare the "old" and the "new" versions, which means that -both-
> must fit into "memory" at the same time. Memory here means the available
> address space rather than real memory.
> So there may be room for improvements for 32-bit systems, but that is another
> story, which can be developed independently of the ulong->size_t conversion.

Oh, I absolutely agree.

> Thanks to Martin for working on this.

Indeed! Thanks Martin.

ATB,
Ramsay Jones




Re: [PATCH 1/9] Convert pack-objects to size_t

2017-08-14 Thread Torsten Bögershausen
On Mon, Aug 14, 2017 at 08:31:50PM +0100, Ramsay Jones wrote:
> 
> 
> On 14/08/17 18:08, Junio C Hamano wrote:
> > Junio C Hamano  writes:
> > 
> >> One interesting question is which of these two types we should use
> >> for the size of objects Git uses.  
> >>
> >> Most of the "interesting" operations done by Git require that the
> >> thing is in core as a whole before we can do anything (e.g. compare
> >> two such things to produce delta, have one in core and apply patch),
> >> so it is tempting to deal with size_t, but at the lowest level,
> >> to serve as an SCM, i.e. recording the state of a file at each
> >> version, we actually should be able to exceed the in-core
> >> limit---both "git add" of a huge file whose contents would not fit
> >> in-core and "git checkout" of a huge blob whose inflated contents
> >> would not fit in-core should (in theory, modulo bugs) be able to
> >> exercise the streaming interface to handle such a case without holding
> >> everything in-core at once.  So from that point of view, even size_t
> >> may not be the "correct" type to use.
> > 
> > A few additions to the above observations.
> > 
> >  - We have a varint that encodes how far a delta representation of an
> >    object is from its base object in the packfile.  Both encoding and
> >    decoding sides in the current code use off_t to represent this
> >    offset, so we can already reference an object that is far away in
> >    the same packfile as a base.
> > 
> >  - I think it is OK in practice to limit the size of individual
> >objects to size_t (i.e. on 32-bit arch, you cannot interact with
> >a repository with an object whose size exceeds 4GB).  Using off_t
> >would allow occasional ultra-huge objects that can only be added
> >and checked in via the streaming API on such a platform, but I
> >suspect that it may become too much of a hassle to maintain.
> 
> In a previous comment, I said that (on 32-bit Linux) it was likely
> that an object of > 4GB could not be handled correctly anyway. (more
> likely > 2GB). This was based on the code from (quite some) years ago.
> In particular, before you added the "streaming API". So, maybe a 32-bit
> arch _should_ be able to handle objects as large as the LFS API allows.
> (Ignoring, for the moment, that I think anybody who puts files of that
> size into an SCM probably gets what they deserve. :-P ).

In general, yes.
I had a patch that allowed a 32-bit Linux to commit a file >4GB.
There is however a restriction: the file must not yet be known to the
index. Otherwise the "diff" machinery kicks in and tries to
compare the "old" and the "new" versions, which means that -both-
must fit into "memory" at the same time. Memory here means the available
address space rather than real memory.
So there may be room for improvements for 32-bit systems, but that is another
story, which can be developed independently of the ulong->size_t conversion.

> 
> The two patches I commented on, however, changed the type of some
> variables from off_t to size_t. In general, the patches did not
> seem to make anything worse, but these type changes could potentially
> do harm. Hence my comment. (I still haven't tried the patches on my
> 32-bit Linux system. I only boot it up about once a week, and I would
> rather wait until the patches are in the 'pu' branch before testing).
> 
> ATB,
> Ramsay Jones
> 

And here (in agreement with the answer from Junio):
We should not change any off_t into size_t, since that means a possible
downgrade.
Whenever an off_t is converted into size_t, a function in git-compat-util.h is
used:

static inline size_t xsize_t(off_t len)
{
	if (len > (size_t) len)
		die("Cannot handle files this big");
	return (size_t)len;
}

Having written this, it may be a good idea to define a similar function
xulong() which is used when we do calls into zlib.
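
A minimal sketch of how that could look, mirroring xsize_t() (the name
xulong() is only a suggestion here; zlib's length parameters are of
type uLong, i.e. unsigned long):

static inline unsigned long xulong(size_t len)
{
	if (len > (unsigned long)len)
		die("Cannot handle buffers this big");
	return (unsigned long)len;
}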

Thanks to Martin for working on this.
If there is a branch on a public repo, I can e.g. run the test suite on
different systems.


Re: [PATCH 1/9] Convert pack-objects to size_t

2017-08-14 Thread Junio C Hamano
Ramsay Jones  writes:

> In a previous comment, I said that (on 32-bit Linux) it was likely
> that an object of > 4GB could not be handled correctly anyway. (more
> likely > 2GB). This was based on the code from (quite some) years ago.
> In particular, before you added the "streaming API". So, maybe a 32-bit
> arch _should_ be able to handle objects as large as the LFS API allows.
> (Ignoring, for the moment, that I think anybody who puts files of that
> size into an SCM probably gets what they deserve. :-P ).
>
> The two patches I commented on, however, changed the type of some
> variables from off_t to size_t. In general, the patches did not
> seem to make anything worse, but these type changes could potentially
> do harm. Hence my comment. (I still haven't tried the patches on my
> 32-bit Linux system. I only boot it up about once a week, and I would
> rather wait until the patches are in the 'pu' branch before testing).

We are in perfect agreement.

I didn't mean to say that it is OK to replace off_t with size_t
without a good reason, especially when the current code (at least
the part I looked at anyway, like the OFS_DELTA part) seems to use
off_t correctly. Your review comments are very much appreciated,
as is the effort started by Martin to take us in the direction of
using types more appropriate than "ulong".


Re: [PATCH 1/9] Convert pack-objects to size_t

2017-08-14 Thread Ramsay Jones


On 14/08/17 18:08, Junio C Hamano wrote:
> Junio C Hamano  writes:
> 
>> One interesting question is which of these two types we should use
>> for the size of objects Git uses.  
>>
>> Most of the "interesting" operations done by Git require that the
>> thing is in core as a whole before we can do anything (e.g. compare
>> two such things to produce delta, have one in core and apply patch),
>> so it is tempting to deal with size_t, but at the lowest level,
>> to serve as an SCM, i.e. recording the state of a file at each
>> version, we actually should be able to exceed the in-core
>> limit---both "git add" of a huge file whose contents would not fit
>> in-core and "git checkout" of a huge blob whose inflated contents
>> would not fit in-core should (in theory, modulo bugs) be able to
>> exercise the streaming interface to handle such a case without holding
>> everything in-core at once.  So from that point of view, even size_t
>> may not be the "correct" type to use.
> 
> A few additions to the above observations.
> 
>  - We have a varint that encodes how far a delta representation of an
>    object is from its base object in the packfile.  Both encoding and
>    decoding sides in the current code use off_t to represent this
>    offset, so we can already reference an object that is far away in
>    the same packfile as a base.
> 
>  - I think it is OK in practice to limit the size of individual
>objects to size_t (i.e. on 32-bit arch, you cannot interact with
>a repository with an object whose size exceeds 4GB).  Using off_t
>would allow occasional ultra-huge objects that can only be added
>and checked in via the streaming API on such a platform, but I
>suspect that it may become too much of a hassle to maintain.

In a previous comment, I said that (on 32-bit Linux) it was likely
that an object of > 4GB could not be handled correctly anyway. (more
likely > 2GB). This was based on the code from (quite some) years ago.
In particular, before you added the "streaming API". So, maybe a 32-bit
arch _should_ be able to handle objects as large as the LFS API allows.
(Ignoring, for the moment, that I think anybody who puts files of that
size into an SCM probably gets what they deserve. :-P ).

The two patches I commented on, however, changed the type of some
variables from off_t to size_t. In general, the patches did not
seem to make anything worse, but these type changes could potentially
do harm. Hence my comment. (I still haven't tried the patches on my
32-bit Linux system. I only boot it up about once a week, and I would
rather wait until the patches are in the 'pu' branch before testing).

ATB,
Ramsay Jones



Re: [PATCH 1/9] Convert pack-objects to size_t

2017-08-14 Thread Junio C Hamano
Junio C Hamano  writes:

> One interesting question is which of these two types we should use
> for the size of objects Git uses.  
>
> Most of the "interesting" operations done by Git require that the
> thing is in core as a whole before we can do anything (e.g. compare
> two such things to produce delta, have one in core and apply patch),
> so it is tempting to deal with size_t, but at the lowest level,
> to serve as an SCM, i.e. recording the state of a file at each
> version, we actually should be able to exceed the in-core
> limit---both "git add" of a huge file whose contents would not fit
> in-core and "git checkout" of a huge blob whose inflated contents
> would not fit in-core should (in theory, modulo bugs) be able to
> exercise the streaming interface to handle such a case without holding
> everything in-core at once.  So from that point of view, even size_t
> may not be the "correct" type to use.

A few additions to the above observations.

 - We have a varint that encodes how far a delta representation of an
   object is from its base object in the packfile.  Both encoding and
   decoding sides in the current code use off_t to represent this
   offset, so we can already reference an object that is far away in
   the same packfile as a base.

 - I think it is OK in practice to limit the size of individual
   objects to size_t (i.e. on 32-bit arch, you cannot interact with
   a repository with an object whose size exceeds 4GB).  Using off_t
   would allow occasional ultra-huge objects that can only be added
   and checked in via the streaming API on such a platform, but I
   suspect that it may become too much of a hassle to maintain.

   It may help reduce the maintenance burden if we introduced obj_size_t,
   defined to be size_t for now, so that we can later swap it to ofs_t
   or some larger type when we know we do need to support objects whose
   size cannot be expressed in size_t, but I do not offhand know what
   the pros-and-cons of such an approach would look like.
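
   A minimal illustration of that approach (hypothetical, not taken
   from any posted patch) would be a single typedef that all
   object-size variables then use:

	typedef size_t obj_size_t;  /* may later become off_t or wider */

   so that widening the type later is a one-line change, plus whatever
   format-specifier and overflow-check fallout it causes.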

Thanks.


Re: [PATCH 1/9] Convert pack-objects to size_t

2017-08-13 Thread Junio C Hamano
Ramsay Jones  writes:

> It should be, see commit b97e911643 ("Support for large files
> on 32bit systems.", 17-02-2007), where you can see that the
> _FILE_OFFSET_BITS macro is set to 64. This asks  et al.
> to use the "Large File System" API and a 64-bit off_t.

Correct.  Roughly speaking, we should view off_t as size of things
you can store in a file, and size_t as size of things you can have
in core.  When we allocate a region of memory, and then repeatedly
fill it with some data and hand that memory to a helper function
e.g. git_deflate(), each of these calls should expect to get data
whose size can be representable by size_t, and if that is shorter
than ulong which we currently use, we are artificially limiting our
potential by using a type that is narrower than necessary.  

The result from these helper functions that are repeatedly called
may be sent to a file as the output from the loop.  If that logic in
the outer loop wants to keep a tally of the total size of data it
processed, that number may not fit in size_t and instead may
require off_t.
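
To illustrate that split with a sketch (process_chunk() is made up
for the example, not a git function): each call works on an in-core
buffer measured in size_t, while the running total wants off_t:

	unsigned char buf[16384];
	size_t chunklen;   /* one in-core buffer: size_t */
	off_t total = 0;   /* tally over the whole stream: off_t */

	while ((chunklen = process_chunk(buf, sizeof(buf))) > 0)
		total += chunklen;  /* may exceed what size_t can hold */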

One interesting question is which of these two types we should use
for the size of objects Git uses.  

Most of the "interesting" operations done by Git require that the
thing is in core as a whole before we can do anything (e.g. compare
two such things to produce delta, have one in core and apply patch),
so it is tempting to deal with size_t, but at the lowest level,
to serve as an SCM, i.e. recording the state of a file at each
version, we actually should be able to exceed the in-core
limit---both "git add" of a huge file whose contents would not fit
in-core and "git checkout" of a huge blob whose inflated contents
would not fit in-core should (in theory, modulo bugs) be able to
exercise the streaming interface to handle such a case without holding
everything in-core at once.  So from that point of view, even size_t
may not be the "correct" type to use.
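
A rough sketch of that streaming pattern, using the istream API as it
exists today (error handling omitted; write_out() is a hypothetical
consumer):

	struct git_istream *st;
	enum object_type type;
	unsigned long sz;
	char buf[16384];
	ssize_t readlen;

	st = open_istream(sha1, &type, &sz, NULL);
	while ((readlen = read_istream(st, buf, sizeof(buf))) > 0)
		write_out(buf, readlen);  /* one size_t-sized chunk at a time */
	close_istream(st);

Note that even here the full object size "sz" comes back as an
unsigned long, which is part of what this series is trying to clean up.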


Re: [PATCH 1/9] Convert pack-objects to size_t

2017-08-13 Thread Ramsay Jones


On 13/08/17 19:30, Martin Koegler wrote:
> On Sat, Aug 12, 2017 at 02:47:25PM +0100, Ramsay Jones wrote:
>> On 32-bit Linux, off_t is 64-bit and size_t is 32-bit.
> 
> --- t.c ---
> #include <stdio.h>
> #include <sys/types.h>
> 
> int main()
> {
> 	printf("%zu %zu\n", sizeof(size_t), sizeof(off_t));
> }
> 
> $ gcc -m32 -o t t.c
> $ ./t
> 4 4
> 
> So is that really true?

It should be, see commit b97e911643 ("Support for large files
on 32bit systems.", 17-02-2007), where you can see that the
_FILE_OFFSET_BITS macro is set to 64. This asks  et al.
to use the "Large File System" API and a 64-bit off_t.

I can't boot my 32-bit installation at the moment, and it seems
that my 64-bit multilib system is not playing ball:

  $ gcc -m32 -D_FILE_OFFSET_BITS=64 -o t t.c
  In file included from /usr/include/stdio.h:27:0,
   from t.c:2:
  /usr/include/features.h:367:25: fatal error: sys/cdefs.h: No such file or directory
  compilation terminated.
  $ 
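
(For reference, on a 32-bit glibc setup where the LFS build does work,
I would expect something like:

  $ gcc -m32 -D_FILE_OFFSET_BITS=64 -o t t.c
  $ ./t
  4 8

i.e. size_t stays 32-bit while off_t becomes 64-bit.)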

ATB,
Ramsay Jones



Re: [PATCH 1/9] Convert pack-objects to size_t

2017-08-13 Thread Martin Koegler
On Sat, Aug 12, 2017 at 02:47:25PM +0100, Ramsay Jones wrote:
> On 32-bit Linux, off_t is 64-bit and size_t is 32-bit.

--- t.c ---
#include <stdio.h>
#include <sys/types.h>

int main()
{
	printf("%zu %zu\n", sizeof(size_t), sizeof(off_t));
}

$ gcc -m32 -o t t.c
$ ./t
4 4

So is that really true?

Regards,
Martin 


Re: [PATCH 1/9] Convert pack-objects to size_t

2017-08-13 Thread Martin Koegler
On Sat, Aug 12, 2017 at 11:59:15AM +0200, Torsten Bögershausen wrote:
> Thanks for working on this - unfortunately the patch does not apply on
> git.git/master.
> 
> Which baseline did you use?

next - 98096fd7a85b93626db8757f944f2d8ffdf7e96a
The series has grown to 19 patches.

Regards,
Martin 


Re: [PATCH 1/9] Convert pack-objects to size_t

2017-08-12 Thread Ramsay Jones


On 12/08/17 09:47, Martin Koegler wrote:
> From: Martin Koegler 
> 
> Signed-off-by: Martin Koegler 
> ---
>  builtin/pack-objects.c | 48 
>  1 file changed, 24 insertions(+), 24 deletions(-)
> 
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index aa70f80..f8db283 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -56,7 +56,7 @@ static struct pack_idx_option pack_idx_opts;
>  static const char *base_name;
>  static int progress = 1;
>  static int window = 10;
> -static unsigned long pack_size_limit;
> +static size_t pack_size_limit;
>  static int depth = 50;
>  static int delta_search_threads;
>  static int pack_to_stdout;
> @@ -72,11 +72,11 @@ static int use_bitmap_index = -1;
>  static int write_bitmap_index;
>  static uint16_t write_bitmap_options;
>  
> -static unsigned long delta_cache_size = 0;
> -static unsigned long max_delta_cache_size = 256 * 1024 * 1024;
> -static unsigned long cache_max_small_delta_size = 1000;
> +static size_t delta_cache_size = 0;
> +static size_t max_delta_cache_size = 256 * 1024 * 1024;
> +static size_t cache_max_small_delta_size = 1000;
>  
> -static unsigned long window_memory_limit = 0;
> +static size_t window_memory_limit = 0;
>  
>  /*
>   * stats
> @@ -124,11 +124,11 @@ static void *get_delta(struct object_entry *entry)
>   return delta_buf;
>  }
>  
> -static unsigned long do_compress(void **pptr, unsigned long size)
> +static size_t do_compress(void **pptr, size_t size)
>  {
>   git_zstream stream;
>   void *in, *out;
> - unsigned long maxsize;
> + size_t maxsize;
>  
>   git_deflate_init(&stream, pack_compression_level);
>   maxsize = git_deflate_bound(&stream, size);
> @@ -149,13 +149,13 @@ static unsigned long do_compress(void **pptr, unsigned long size)
>   return stream.total_out;
>  }
>  
> -static unsigned long write_large_blob_data(struct git_istream *st, struct sha1file *f,
> +static size_t write_large_blob_data(struct git_istream *st, struct sha1file *f,
>  const unsigned char *sha1)
>  {
>   git_zstream stream;
>   unsigned char ibuf[1024 * 16];
>   unsigned char obuf[1024 * 16];
> - unsigned long olen = 0;
> + size_t olen = 0;
>  
>   git_deflate_init(&stream, pack_compression_level);
>  
> @@ -196,7 +196,7 @@ static int check_pack_inflate(struct packed_git *p,
>   struct pack_window **w_curs,
>   off_t offset,
>   off_t len,
> - unsigned long expect)
> + size_t expect)
>  {
>   git_zstream stream;
>   unsigned char fakebuf[4096], *in;
> @@ -238,13 +238,13 @@ static void copy_pack_data(struct sha1file *f,
>  }
>  
>  /* Return 0 if we will bust the pack-size limit */
> -static unsigned long write_no_reuse_object(struct sha1file *f, struct object_entry *entry,
> -unsigned long limit, int usable_delta)
> +static size_t write_no_reuse_object(struct sha1file *f, struct object_entry *entry,
> + size_t limit, int usable_delta)
>  {
>   size_t size, datalen;
>   unsigned char header[MAX_PACK_OBJECT_HEADER],
> dheader[MAX_PACK_OBJECT_HEADER];
> - unsigned hdrlen;
> + size_t hdrlen;
>   enum object_type type;
>   void *buf;
>   struct git_istream *st = NULL;
> @@ -350,17 +350,17 @@ static unsigned long write_no_reuse_object(struct sha1file *f, struct object_ent
>  
>  /* Return 0 if we will bust the pack-size limit */
>  static off_t write_reuse_object(struct sha1file *f, struct object_entry *entry,
> - unsigned long limit, int usable_delta)
> + size_t limit, int usable_delta)
>  {
>   struct packed_git *p = entry->in_pack;
>   struct pack_window *w_curs = NULL;
>   struct revindex_entry *revidx;
>   off_t offset;
>   enum object_type type = entry->type;
> - off_t datalen;
> + size_t datalen;

On 32-bit Linux, off_t is 64-bit and size_t is 32-bit.

Later in this function, datalen is set to the difference of
two (revidx) off_t offsets. I haven't applied and tried this
on my 32-bit Linux install yet (that installation only gets
booted about once a week), but I suspect a 'possible data-loss'
warning.

For other reasons, I don't think it is possible for
a 32-bit system to create a packfile with an individual object
of size > 4GB anyway, so this may not be a problem in practice.
(A 32-bit system should be able to read such a packfile, presumably
created on a 64-bit system, and refuse to load such an object, of
course.)
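
To spell out the pattern I am worried about (illustrative only, but
the variable names match the patch):

	off_t offset;    /* 64-bit on 32-bit Linux with LFS */
	size_t datalen;  /* 32-bit on 32-bit Linux */

	datalen = revidx[1].offset - offset;  /* off_t arithmetic, then
	                                         a silent narrowing */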

ATB,
Ramsay Jones


Re: [PATCH 1/9] Convert pack-objects to size_t

2017-08-12 Thread Torsten Bögershausen
Thanks for working on this - unfortunately the patch does not apply on
git.git/master.

Which baseline did you use?