Re: [PATCH] add: add --bulk to index all objects into a pack file

2013-10-03 Thread Junio C Hamano
Nguyễn Thái Ngọc Duy  pclo...@gmail.com writes:

 The use case is

 tar -xzf bigproject.tar.gz
 cd bigproject
 git init
 git add .
 # git grep or something

Two obvious thoughts, and a half.

 (1) This particular invocation of git add can easily detect that
 it is run in a repository with no $GIT_INDEX_FILE yet, which is
 the most typical case for a big initial import.  It could even
 ask if the current branch is unborn if you wanted to make the
 heuristic more specific to this use case.  Perhaps it would
 make sense to automatically plug the bulk import machinery in
 such a case without an option?

 (2) Imagine performing a dry-run of update_files_in_cache() using a
 different diff-files callback that is similar to the
 update_callback() but that uses the lstat(2) data to see how
 big an import this really is, instead of calling
 add_file_to_index(), before actually registering the data to
 the object database.  If you benchmark to see how expensive it
 is, you may find that such a scheme might be a workable
 auto-tuning mechanism to trigger this.  Even if it were
 moderately expensive, when combined with the heuristics above
 for (1), it might be a worthwhile thing to do only when it is
 likely to be an initial import.

 (3) Is it always a good idea to send everything to a packfile on a
 large addition, or are you often better off importing the
 initial fileset as loose objects?  If the latter, then the
 option name --bulk may give users the wrong hint that if you are
 doing a bulk-import, you are better off using this option.

This is a very logical extension to what was started at 568508e7
(bulk-checkin: replace fast-import based implementation,
2011-10-28), and I like it.  I suspect --bulk=<threshold> might
be a better alternative than setting the threshold unconditionally
to zero, though.
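
Something like this is what I have in mind for builtin/add.c (only a
sketch, not even compile-tested; git_parse_ulong() is the existing
size parser that groks k/m/g suffixes, and I am assuming your patch
boils down to lowering big_file_threshold):

    static const char *bulk_arg;

    static struct option builtin_add_options[] = {
            /* ... existing options ... */
            OPT_STRING(0, "bulk", &bulk_arg, "size",
                       "stream files at or above <size> straight into a pack"),
            OPT_END(),
    };

    /* in cmd_add(), after parse_options() */
    if (bulk_arg) {
            unsigned long threshold;

            if (!git_parse_ulong(bulk_arg, &threshold))
                    die("invalid --bulk threshold '%s'", bulk_arg);
            big_file_threshold = threshold;
    }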


Re: [PATCH] add: add --bulk to index all objects into a pack file

2013-10-03 Thread Duy Nguyen
On Thu, Oct 3, 2013 at 1:43 PM, Junio C Hamano gits...@pobox.com wrote:
 Nguyễn Thái Ngọc Duy  pclo...@gmail.com writes:

 The use case is

 tar -xzf bigproject.tar.gz
 cd bigproject
 git init
 git add .
 # git grep or something

 Two obvious thoughts, and a half.

  (1) This particular invocation of git add can easily detect that
  it is run in a repository with no $GIT_INDEX_FILE yet, which is
  the most typical case for a big initial import.  It could even
  ask if the current branch is unborn if you wanted to make the
  heuristic more specific to this use case.  Perhaps it would
  make sense to automatically plug the bulk import machinery in
  such a case without an option?

Yeah! I did not even think of that.
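
Something along these lines in builtin/add.c would probably do for the
detection (completely untested; looks_like_initial_import() is just a
name I made up, the helpers it calls already exist in cache.h):

    #include "cache.h"

    /* made-up helper: no index file on disk and HEAD does not resolve */
    static int looks_like_initial_import(void)
    {
            unsigned char sha1[20];

            return !file_exists(get_index_file()) &&
                   get_sha1("HEAD", sha1); /* non-zero means unborn branch */
    }

    /* in cmd_add(), before the paths are added */
    if (looks_like_initial_import())
            big_file_threshold = 0; /* stream everything into the bulk-checkin pack */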

  (2) Imagine performing a dry-run of update_files_in_cache() using a
  different diff-files callback that is similar to the
  update_callback() but that uses the lstat(2) data to see how
  big an import this really is, instead of calling
  add_file_to_index(), before actually registering the data to
  the object database.  If you benchmark to see how expensive it
  is, you may find that such a scheme might be a workable
  auto-tuning mechanism to trigger this.  Even if it were
  moderately expensive, when combined with the heuristics above
  for (1), it might be a worthwhile thing to do only when it is
  likely to be an initial import.

We do a lot of lstat() calls nowadays for refreshing the index, so
yeah, it's likely reasonably cheap, but I doubt people do mass updates
on existing files often. Adding a large number of new files (even when
.git/index exists) may be a better indication of an import, and we
already have that information from fill_directory().
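
Roughly like this, reusing the dir_struct that fill_directory() already
fills in cmd_add() (untested; the 32MB cutoff is pulled out of thin air):

    static off_t untracked_size(const struct dir_struct *dir)
    {
            off_t total = 0;
            int i;

            for (i = 0; i < dir->nr; i++) {
                    struct stat st;

                    if (!lstat(dir->entries[i]->name, &st) && S_ISREG(st.st_mode))
                            total += st.st_size;
            }
            return total;
    }

    /* in cmd_add(), after fill_directory() and before adding anything */
    if (untracked_size(&dir) > 32 * 1024 * 1024) /* arbitrary cutoff */
            big_file_threshold = 0; /* treat it as a bulk import */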

For the no-.git/index case, packing with bulk-checkin probably
produces a reasonably good pack because the new files don't delta well
anyway: there are no previous versions to delta against. They could
delta against other files, but I don't think we'd get good compression
out of that. For the case where .git/index exists, we may interfere
with git gc --auto: we create a less optimal pack, but it's a pack,
and it may delay the next gc for longer.

  (3) Is it always a good idea to send everything to a packfile on a
  large addition, or are you often better off importing the
  initial fileset as loose objects?  If the latter, then the
  option name --bulk may give users the wrong hint that if you are
  doing a bulk-import, you are better off using this option.

Hard question. Fewer files are definitely a good thing, for example
when you rm -rf the whole thing :-) Disk usage is another. On
gdb-7.3.1, du -sh reports that the .git with loose objects takes 57M,
while the packed one takes 29M. Disk access is also slightly faster on
the packed .git, at least for git grep --cached: 0.71s vs 0.83s.

 This is a very logical extension to what was started at 568508e7
 (bulk-checkin: replace fast-import based implementation,
  2011-10-28), and I like it.  I suspect --bulk=<threshold> might
 be a better alternative than setting the threshold unconditionally
 to zero, though.

The threshold may be better in the form of a config setting, because I
wouldn't want to specify it every time. But does one really want to
keep some small files around in loose format? I don't see it, because
my goal is to keep a clean .git, but maybe the loose format has some
advantages.
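
If we do go the config route, maybe an add.bulkThreshold knob (a name
off the top of my head) hooked into the existing add_config() callback
in builtin/add.c; git_config_ulong() already understands the usual
k/m/g suffixes. Untested sketch:

    static int add_config(const char *var, const char *value, void *cb)
    {
            if (!strcmp(var, "add.bulkthreshold")) {
                    /* files at or above this size get streamed into a pack */
                    big_file_threshold = git_config_ulong(var, value);
                    return 0;
            }
            return git_default_config(var, value, cb);
    }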
-- 
Duy