Re: [RFC] Design for http-pull on repo with packs

2005-07-12 Thread Dan Holmsand

Junio C Hamano wrote:

Dan Holmsand <[EMAIL PROTECTED]> writes:

Repacking all of that to a single pack file gives, somewhat
surprisingly, a pack size of 62M (+ 1.3M index). In other words, the
cost of getting all those branches, and all of the new stuff from
Linus, turns out to be *negative* (probably due to some strange
deltification coincidence).



We do _not_ want to optimize for initial slurps into empty
repositories.  Quite the opposite.  We want to optimize for
allowing quick updates of reasonably up-to-date developer repos.
If initial slurps are _also_ efficient then that is an added
bonus; that is something the baseline big pack (60M Linus pack)
would give us already.  So repacking everything into a single
pack nightly is _not_ what we want to do, even though that would
give the maximum compression ;-).  I know you understand this,
but just stating the second of the above paragraphs would give
casual readers a wrong impression.


I agree, to a point: I think the bonus is quite nice to have... As it 
is, it's actually faster on my machine to clone a fresh tree of Linus' 
than it is to "git clone" a local tree (without doing the hardlinking 
"cheating", that is). And it's kind of nice to have the option to start 
completely fresh.


Anyway, my point is this: to make pulling efficient, we should ideally 
(1) have as few object files to pull as possible, especially when using 
http, and (2) have as few packs as possible, to gain some compression 
for those who pull less often. Point 1 is obviously the more important one.


To make this happen, relatively frequent repacking and re-repacking 
(even if only on parts of the repository) would be necessary. Or at 
least nice to have...


Which was why I wanted the "dumb fetch" thingies to at least do some 
"relatively smart un/repacking" to avoid duplication. And, ideally, to 
avoid downloading entire packs when we only want the beginning of them. 
That would lessen the cost of repacking, which I happen to think is a 
good thing.


Also, it's kind of strange that ssh/local fetching *always* unpacks 
everything, while rsync/http *never* does...



You are correct.  For somebody like Jeff, having the Linus
baseline pack with one pack of all of his heads (an incremental that
excludes what is already in the Linus baseline pack) would help
pullers.


That would work, of course. However, it means that Linus becomes the 
"official repository maintainer" in a way that doesn't feel very 
distributed. Perhaps then Linus' packs should be marked "official" in 
some way?



The big problem, however, comes when Jeff (or anyone else) decides to
repack. Then, if you fetch both his repo and Linus', you might end up
with several really big pack files, that mostly overlap. That could
easily mean storing most objects many times, if you don't do some
smart selective un/repacking when fetching.



Indeed.  Overlapping packs are a possibility, but my gut feeling
is that it would not be too bad, if things are arranged so that
packs are expanded-and-then-repacked _very_ rarely, if ever.
Instead, at least for your public repository, if you only repack
incrementally I think you would be OK.


To be exact, you're OK (in the sense of avoiding duplicates) as long 
as you always rsync in the "official packs", and coordinate with the 
people you're merging with, before you do any repacking of your own. 
Sure, this works. It just feels a bit "un-distributed" for my personal 
taste...


/dan


Re: [RFC] Design for http-pull on repo with packs

2005-07-11 Thread Junio C Hamano
Dan Holmsand <[EMAIL PROTECTED]> writes:

> I did a little experiment. I cloned Linus' current tree, and git
> repacked everything (that's 63M + 3.3M worth of pack files). Then I
> got something like 25 or so of Jeff's branches. That's 6.9M of object
> files, and 1.4M packed. Total size: 70M for the entire
> .git/objects/pack directory.
>
> Repacking all of that to a single pack file gives, somewhat
> surprisingly, a pack size of 62M (+ 1.3M index). In other words, the
> cost of getting all those branches, and all of the new stuff from
> Linus, turns out to be *negative* (probably due to some strange
> deltification coincidence).

We do _not_ want to optimize for initial slurps into empty
repositories.  Quite the opposite.  We want to optimize for
allowing quick updates of reasonably up-to-date developer repos.
If initial slurps are _also_ efficient then that is an added
bonus; that is something the baseline big pack (60M Linus pack)
would give us already.  So repacking everything into a single
pack nightly is _not_ what we want to do, even though that would
give the maximum compression ;-).  I know you understand this,
but just stating the second of the above paragraphs would give
casual readers a wrong impression.

> I think that this shows that (at least in this case), having many
> branches isn't particularly wasteful (1.4M in this case with one
> incremental pack).

> And that fewer packs beats many packs quite handily.

You are correct.  For somebody like Jeff, having the Linus
baseline pack with one pack of all of his heads (an incremental that
excludes what is already in the Linus baseline pack) would help
pullers.
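
Something along these lines would be all it takes with the tools we
already have (an untested sketch; JEFF_HEADS and LINUS_HEAD stand for
the relevant commit ids, and I am assuming the baseline pack holds
everything reachable from LINUS_HEAD):

    git-rev-list --objects $JEFF_HEADS ^$LINUS_HEAD |
        git-pack-objects .tmp-incr

git-pack-objects reads the object list from stdin and writes the pack
and its index next to the given base name; git-repack-script shows how
to move the result into .git/objects/pack/.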

> The big problem, however, comes when Jeff (or anyone else) decides to
> repack. Then, if you fetch both his repo and Linus', you might end up
> with several really big pack files, that mostly overlap. That could
> easily mean storing most objects many times, if you don't do some
> smart selective un/repacking when fetching.

Indeed.  Overlapping packs are a possibility, but my gut feeling
is that it would not be too bad, if things are arranged so that
packs are expanded-and-then-repacked _very_ rarely, if ever.
Instead, at least for your public repository, if you only repack
incrementally I think you would be OK.




Re: [RFC] Design for http-pull on repo with packs

2005-07-11 Thread Tony Luck
> The big problem, however, comes when Jeff (or anyone else) decides to
> repack. Then, if you fetch both his repo and Linus', you might end up
> with several really big pack files, that mostly overlap. That could
> easily mean storing most objects many times, if you don't do some smart
> selective un/repacking when fetching.

So although it is possible to pack and re-pack at any time, perhaps we
need some guidelines?  Maybe Linus should just do a re-pack as each
2.6.x release is made (or perhaps just every 2.6.even release if that is
too often).  It has already been noted offlist that repositories hosted on
kernel.org can just copy pack files from Linus (or, even better, hardlink them).
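
On the same filesystem that is literally a couple of ln invocations
(the paths below are only illustrative):

    ln /pub/scm/.../torvalds/linux-2.6.git/objects/pack/pack-*.pack \
        my-tree.git/objects/pack/
    ln /pub/scm/.../torvalds/linux-2.6.git/objects/pack/pack-*.idx \
        my-tree.git/objects/pack/

Since pack files are never modified in place, sharing them this way is
safe.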

-Tony


Re: [RFC] Design for http-pull on repo with packs

2005-07-11 Thread Dan Holmsand

Junio C Hamano wrote:

One very minor problem I have with the Holmsand approach [*1*] is
that the original Barkalow puller allowed a really dumb http
server by not requiring a directory index at all.  For somebody
like me with a cheap ISP account [*2*], it was great that I did
not have to update 256 index.html files for objects/??/
directories.  Admittedly, it would be just one directory,
objects/pack/, but still...


I totally agree that you shouldn't have to do any special kind of 
prepping to serve a repository through http. Which was why I thought it was 
a good thing to use the default directory listing of the web server, 
assuming that this feature would be available on most servers... 
Apparently not on yours, though :-(


And Cogito already relies on directory listings (to find tags to download).

But if git-repack-script generates a "pack index file" automagically, 
then of course everything is fine.



On the other hand, picking an optimum set of packs from an
overlapping set of packs is indeed a very interesting (and hard
combinatorial) problem to solve.  I am hoping that in practice
people would not force clients to do it with "interesting" sets
of packs.  I would hope they have just a full pack and
incrementals, never having overlaps, like Linus plans to do on
his kernel repo.

On the other hand, for somebody like Jeff Garzik with 50 heads,
it might make some sense to have a handful of different overlapping
packs, optimized for different sets of people wanting to pull
some but not all of his heads.


Well, it is an interesting problem... But I don't think that the 
solution is to create more pack files. In fact, you'd want as few pack 
files as possible, for maximum overall efficiency.


I did a little experiment. I cloned Linus' current tree, and git 
repacked everything (that's 63M + 3.3M worth of pack files). Then I got 
something like 25 or so of Jeff's branches. That's 6.9M of object files, 
and 1.4M packed. Total size: 70M for the entire .git/objects/pack directory.


Repacking all of that to a single pack file gives, somewhat 
surprisingly, a pack size of 62M (+ 1.3M index). In other words, the 
cost of getting all those branches, and all of the new stuff from Linus, 
turns out to be *negative* (probably due to some strange deltification 
coincidence).


I think that this shows that (at least in this case), having many 
branches isn't particularly wasteful (1.4M in this case with one 
incremental pack).


And that fewer packs beats many packs quite handily.

The big problem, however, comes when Jeff (or anyone else) decides to 
repack. Then, if you fetch both his repo and Linus', you might end up 
with several really big pack files, that mostly overlap. That could 
easily mean storing most objects many times, if you don't do some smart 
selective un/repacking when fetching.


/dan


Re: [RFC] Design for http-pull on repo with packs

2005-07-10 Thread Junio C Hamano
One very minor problem I have with the Holmsand approach [*1*] is
that the original Barkalow puller allowed a really dumb http
server by not requiring a directory index at all.  For somebody
like me with a cheap ISP account [*2*], it was great that I did
not have to update 256 index.html files for objects/??/
directories.  Admittedly, it would be just one directory,
objects/pack/, but still...

On the other hand, picking an optimum set of packs from an
overlapping set of packs is indeed a very interesting (and hard
combinatorial) problem to solve.  I am hoping that in practice
people would not force clients to do it with "interesting" sets
of packs.  I would hope they have just a full pack and
incrementals, never having overlaps, like Linus plans to do on
his kernel repo.

On the other hand, for somebody like Jeff Garzik with 50 heads,
it might make some sense to have a handful of different overlapping
packs, optimized for different sets of people wanting to pull
some but not all of his heads.

Having said that, even if we want to support such a repository,
we should remember that the server side optimization needs to be
done only once per push to support many pulls by different
downstream clients.  Maybe preparing more than a "list of pack
file names" to help clients decide which packs to pull is
desirable anyway.  Say, "here is the list of packs.  If you want
to sync with this and that head, I would suggest starting by
getting this pack."
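
For example, something like this (a purely hypothetical format with
made-up pack names):

    pack-AAAA    baseline, everything up to 2.6.12
    pack-BBBB    heads: master libata-dev
    pack-CCCC    heads: net-drivers

A client that only wants net-drivers would then know it can skip
pack-BBBB without even fetching its index.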


[Footnotes]

*1* I was about to type Dan's, but both of you are ;-).

*2* Not having a public, rsync-reachable repository gave me a
lot of incentive to think about how to support small/cheap
projects well ;-).



Re: [RFC] Design for http-pull on repo with packs

2005-07-10 Thread Dan Holmsand

Daniel Barkalow wrote:

On Sun, 10 Jul 2005, Dan Holmsand wrote:

Daniel Barkalow wrote:

If an individual file is not available, figure out what packs are
 available:

  Get the list of pack files the repository has
   (currently, I just use "e3117bbaf6a59cb53c3f6f0d9b17b9433f0e4135")
  For any packs we don't have, get the index files.


This part might be slightly expensive, for large repositories. If one 
assumes that packs are named as by git-repack-script, however, one might 
cache indexes we've already seen (again, see below). Or, if you go for 
the mandatory "pack-index-file", require that it has a reliable order, 
so that you can get the last added index first.



Nothing bad happens if you have index files for pack files you don't have,
as it turns out; the library ignores them. So we can keep the index files
around so we can quickly check if they have the objects we want. That way,
we don't have to worry about skipping something now (because it's not
needed) and then ignoring it when the branch gets merged in.

So what I actually do is make a list of the pack files that aren't already
downloaded that are available from the server, and download the index
files for any where the index file isn't downloaded, either.


Aah. In other words, you do the caching thing as well. It seems a little 
ugly, though, to store the index-only index files with the rest of the 
pack. It might be preferable to introduce something like 
$GIT_DIR/index-cache or something, so that it can be easily cleaned (and 
doesn't follow us around forever when 
cloning-by-hardlinking-the-entire-object-directory).
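
Something as simple as this would do (a sketch only; "index-cache" is a 
made-up name and $REMOTE a placeholder for the repository url):

    mkdir -p "$GIT_DIR/index-cache"
    wget -q -O "$GIT_DIR/index-cache/$pack.idx" \
        "$REMOTE/objects/pack/$pack.idx"

and cleaning up afterwards is then just rm -rf "$GIT_DIR/index-cache".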


You might end up with quite a large number of index files after a while, 
though, if you pull from several repositories that are regularly repacked.



  Keep a list of the struct packed_gits for the packs the server has
   (these are not used as places to look for objects)

Each time we need an object, check the list for it. If it is in there,
 download the corresponding pack and report success.


Here you will need some strategy to deal with packs that overlap with 
what we've already got. Basically, small and overlapping packs should be 
unpacked, big and non-overlapping ones saved as is (since 
git-unpack-objects is painfully slow and memory-hungry...).



I don't think there's an issue with having overlapping packs, either with
each other or with separate objects. If the user wants, stuff can be
repacked outside of the pull operation (note, though, that the index files
should be truncated rather than removed, so that the program doesn't fetch
them again next time some object can't be found easily).


Well, the only issue is obviously waste of space. If you fetch a lot of 
branches from independently packed repos, it might mean a lot of waste, 
though.


About truncating index files: this seems a bit ugly. You get a file that 
doesn't contain what it says it contains, which may cause trouble if, for 
example, the git prune thing is used.


You might be better off with a simple list of index files we know we 
have all the objects of (and make sure that git-prune-script deletes 
this file, since pruning may break that guarantee).


One could also optimize the pack-download bit, by figuring out the last 
object in the pack that we need (easy enough to do from the index file), 
 and just get the part of the pack file leading up to that object. That 
could be a huge win for independently packed repositories (I don't do 
that in my code below, though).



That's only possible if you can figure out what you want to have before
you get it. My code is walking the reachability graph on the client; it
can only figure out what other objects it needs after it's mapped the pack
file.


No, but we can find out which objects we *don't* want (i.e. the ones we 
have). And that may be a lot, e.g. if a repository is fully repacked, or 
if we track branches on several similar but independently packed 
repositories. And as far as I understand git-pack-objects, it tries to 
put recent objects in the front.


I don't have any numbers to back this up with, though. Some testing may 
be needed, but since the population of packed public repositories is 1, 
this is tricky...



I might use that method for listing the available packs, although I'd sort
of like to encourage a clean solution first.


Encouraging cleanliness is obviously a good thing :-)

/dan


Re: [RFC] Design for http-pull on repo with packs

2005-07-10 Thread Daniel Barkalow
On Sun, 10 Jul 2005, Dan Holmsand wrote:

> Daniel Barkalow wrote:
> > I have a design for using http-pull on a packed repository, and it only
> > requires one extra file in the repository: an append-only list of the pack
> > files (because getting the directory listing is very painful and
> > failure-prone).
> 
> A few comments (as I've been tinkering with a way to solve the problem 
> myself).
> 
> As long as the pack files are named sensibly (i.e. if they are created 
> by git-repack-script), it's not very error-prone to just get the 
> directory listing, and look for matches for pack-*.idx. It seems to 
> work quite well (see below). It isn't beautiful in any way, but it works...

I may grab your code for that; the version I just sent seems to be working
except for that.

> >  If an individual file is not available, figure out what packs are
> >   available:
> > 
> >Get the list of pack files the repository has
> > (currently, I just use "e3117bbaf6a59cb53c3f6f0d9b17b9433f0e4135")
> >For any packs we don't have, get the index files.
> 
> This part might be slightly expensive, for large repositories. If one 
> assumes that packs are named as by git-repack-script, however, one might 
> cache indexes we've already seen (again, see below). Or, if you go for 
> the mandatory "pack-index-file", require that it has a reliable order, 
> so that you can get the last added index first.

Nothing bad happens if you have index files for pack files you don't have,
as it turns out; the library ignores them. So we can keep the index files
around so we can quickly check if they have the objects we want. That way,
we don't have to worry about skipping something now (because it's not
needed) and then ignoring it when the branch gets merged in.

So what I actually do is make a list of the pack files that aren't already
downloaded that are available from the server, and download the index
files for any where the index file isn't downloaded, either.
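
In shell terms that step amounts to no more than this (just an
illustration; the real code is C in http-pull.c, $REMOTE is a
placeholder url, GIT_DIR is assumed to be set, and pack-list holds the
server's pack names, one per line, without extensions):

    while read p; do
        test -f "$GIT_DIR/objects/pack/$p.pack" && continue
        test -f "$GIT_DIR/objects/pack/$p.idx" ||
            wget -q -O "$GIT_DIR/objects/pack/$p.idx" \
                "$REMOTE/objects/pack/$p.idx"
    done < pack-list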

> >Keep a list of the struct packed_gits for the packs the server has
> > (these are not used as places to look for objects)
> > 
> >  Each time we need an object, check the list for it. If it is in there,
> >   download the corresponding pack and report success.
> 
> Here you will need some strategy to deal with packs that overlap with 
> what we've already got. Basically, small and overlapping packs should be 
> unpacked, big and non-overlapping ones saved as is (since 
> git-unpack-objects is painfully slow and memory-hungry...).

I don't think there's an issue with having overlapping packs, either with
each other or with separate objects. If the user wants, stuff can be
repacked outside of the pull operation (note, though, that the index files
should be truncated rather than removed, so that the program doesn't fetch
them again next time some object can't be found easily).

> One could also optimize the pack-download bit, by figuring out the last 
> object in the pack that we need (easy enough to do from the index file), 
>   and just get the part of the pack file leading up to that object. That 
> could be a huge win for independently packed repositories (I don't do 
> that in my code below, though).

That's only possible if you can figure out what you want to have before
you get it. My code is walking the reachability graph on the client; it
can only figure out what other objects it needs after it's mapped the pack
file.

> Anyway, here's my attempt at the same thing. It introduces 
> "git-dumb-fetch", with usage like git-fetch-pack (except that it works 
> with http and rsync). And it adds some ugliness to git-cat-file, for 
> figuring out which objects we already have.

I might use that method for listing the available packs, although I'd sort
of like to encourage a clean solution first.

-Daniel
*This .sig left intentionally blank*



Re: [RFC] Design for http-pull on repo with packs

2005-07-10 Thread Dan Holmsand

Daniel Barkalow wrote:

I have a design for using http-pull on a packed repository, and it only
requires one extra file in the repository: an append-only list of the pack
files (because getting the directory listing is very painful and
failure-prone).


A few comments (as I've been tinkering with a way to solve the problem 
myself).


As long as the pack files are named sensibly (i.e. if they are created 
by git-repack-script), it's not very error-prone to just get the 
directory listing, and look for matches for pack-*.idx. It seems to 
work quite well (see below). It isn't beautiful in any way, but it works...


[snip]


 If an individual file is not available, figure out what packs are
  available:

    Get the list of pack files the repository has
     (currently, I just use "e3117bbaf6a59cb53c3f6f0d9b17b9433f0e4135")
    For any packs we don't have, get the index files.


This part might be slightly expensive, for large repositories. If one 
assumes that packs are named as by git-repack-script, however, one might 
cache indexes we've already seen (again, see below). Or, if you go for 
the mandatory "pack-index-file", require that it has a reliable order, 
so that you can get the last added index first.



    Keep a list of the struct packed_gits for the packs the server has
     (these are not used as places to look for objects)

 Each time we need an object, check the list for it. If it is in there,
  download the corresponding pack and report success.


Here you will need some strategy to deal with packs that overlap with 
what we've already got. Basically, small and overlapping packs should be 
unpacked, big and non-overlapping ones saved as is (since 
git-unpack-objects is painfully slow and memory-hungry...).
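
In other words, something along these lines (a rough sketch; the 1 MB
threshold, the file names and $name are all arbitrary placeholders):

    size=$(wc -c < new.pack)
    if [ "$size" -lt 1048576 ]; then
        # small pack: explode it into individual object files
        git-unpack-objects < new.pack && rm -f new.pack new.idx
    else
        # big pack: keep it packed
        mv new.pack "$GIT_DIR/objects/pack/pack-$name.pack"
        mv new.idx  "$GIT_DIR/objects/pack/pack-$name.idx"
    fi

(Whether a pack "overlaps" would of course have to be judged from its
index against what we already have.)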


One could also optimize the pack-download bit, by figuring out the last 
object in the pack that we need (easy enough to do from the index file), 
 and just get the part of the pack file leading up to that object. That 
could be a huge win for independently packed repositories (I don't do 
that in my code below, though).
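
The download itself would then just be a ranged http request, something
like (a sketch; computing $last_needed_byte from the index -- the
largest offset among the objects we still need, plus room for that last
object -- is the part I'm handwaving over here):

    curl -s -r 0-$last_needed_byte -o partial.pack \
        "$REMOTE/objects/pack/pack-$p.pack"

Whether such a truncated pack could be used as-is or would have to be
unpacked object by object is a separate question, of course.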


Anyway, here's my attempt at the same thing. It introduces 
"git-dumb-fetch", with usage like git-fetch-pack (except that it works 
with http and rsync). And it adds some ugliness to git-cat-file, for 
figuring out which objects we already have.


I'm sort of using the same basic strategy as you, except that I check 
the pack files first (I didn't want to mess with http-pull.c, and I 
wanted something that would work with rsync as well).


The strategy is this:

   o Check if the repository has some pack files we haven't seen
     already

   o If there are new pack files, download indexes, and see if
     they contain anything new. If so, download pack file and
     store or unpack. In either case, note that we have seen the
     pack file in question (I've used $GIT_DIR/checked_packs).

   o Then

       o if http: do the git-http-pull stuff, and we're done

       o if rsync: get a list of all object files in the
         repository, and download the ones we're still missing.
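
The per-pack bookkeeping in the first two steps is nothing fancier than
this (a sketch; $remote_packs and $REMOTE are placeholders):

    for p in $remote_packs; do
        grep -q "^$p\$" "$GIT_DIR/checked_packs" 2>/dev/null && continue
        # fetch $p.idx here, compare it against what we already have,
        # and fetch (and store or unpack) $p.pack if it adds anything
        echo "$p" >> "$GIT_DIR/checked_packs"
    done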

Feel free to take a look, and use anything that might be useful (if 
anything...)


I'm not claiming that this method is better than your way; the main 
differences are the caching of seen index files, and that I download 
packs first.


My way is faster if the repository contains overlapping object files and 
packs, and it doesn't require any new infrastructure.


On the other hand, my method risks fetching too many objects, if a pack 
file solely contains stuff from a branch we don't want. And it requires 
the git-repack-script naming convention to be used on the remote side.


/dan
diff --git a/cat-file.c b/cat-file.c
--- a/cat-file.c
+++ b/cat-file.c
@@ -11,6 +11,42 @@ int main(int argc, char **argv)
char type[20];
void *buf;
unsigned long size;
+   int obj_count = 0;
+   int missing_count = 0;
+   char line[1000];
+
+   if (argc == 2 && !strcmp("--count", argv[1])) {
+   while (fgets(line, sizeof(line), stdin)) {
+   if (get_sha1(line, sha1))
+   die("invalid id %s", line);
+   if (has_sha1_file(sha1))
+   ++obj_count;
+   else
+   ++missing_count;
+   }
+   printf("%i %i\n", obj_count, missing_count);
+   return 0;
+   }
+
+   if (argc == 2 && !strcmp("--existing", argv[1])) {
+   while (fgets(line, sizeof(line), stdin)) {
+   if (get_sha1(line, sha1))
+   die("invalid id %s", line);
+   if (has_sha1_file(sha1))
+   printf ("%s", line);
+   }
+   return 0;
+   }
+
+   if (argc == 2 && !strcmp("--missing", argv[1])) {
+   while (fgets(line, sizeof(line), stdin)) {
+   if (get_sha1(line, sha1))
+   die("invalid id %s", line);
+   if (!has_sha1_file(sha1))
+   printf ("%s", line);
+   }
+   return 0;
+   }
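
The intended use of the new options is simply to pipe a list of object
names (one per line) through git-cat-file; for example, with the remote
side's object list in a file called (hypothetically) remote-objects,

    git-cat-file --missing < remote-objects

prints exactly the objects we still need to download.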

[RFC] Design for http-pull on repo with packs

2005-07-10 Thread Daniel Barkalow
I have a design for using http-pull on a packed repository, and it only
requires one extra file in the repository: an append-only list of the pack
files (because getting the directory listing is very painful and
failure-prone).

The first thing to note is that fetch() is allowed to get more than just
the requested object. This means that we can get the pack file with the
requested object, and this will fulfill the contract of fetch(), and,
hopefully, be extra-helpful (since we expect the repository owner to have
packed stuff together usefully). So I do this:

 Try to get individual files. So long as this works, everything is as
  before.

 If an individual file is not available, figure out what packs are
  available:

   Get the list of pack files the repository has
    (currently, I just use "e3117bbaf6a59cb53c3f6f0d9b17b9433f0e4135")
   For any packs we don't have, get the index files.
   Keep a list of the struct packed_gits for the packs the server has
    (these are not used as places to look for objects)

 Each time we need an object, check the list for it. If it is in there,
  download the corresponding pack and report success.
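
In shell-ish pseudocode the fallback looks like this (the real code is C
in http-pull.c; idx_contains is a hypothetical helper standing in for
looking the SHA1 up in a downloaded index, which the C code does through
the struct packed_git list, and $REMOTE is a placeholder url):

    for p in $server_packs; do
        # idx_contains: hypothetical helper, not a real git command
        if idx_contains "$GIT_DIR/objects/pack/$p.idx" "$sha1"; then
            wget -q -O "$GIT_DIR/objects/pack/$p.pack" \
                "$REMOTE/objects/pack/$p.pack"
            exit 0
        fi
    done
    exit 1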

I've nearly got an implementation ready, except for not having a way of
getting a list of available packs. It seems to work for getting
e3117bbaf6a59cb53c3f6f0d9b17b9433f0e4135 when necessary, although I'm
still debugging the last few things.

-Daniel
*This .sig left intentionally blank*
