Re: Re: [RFC/WIP PATCH] implement reading of submodule .gitmodules configuration into cache

2013-12-12 Thread Heiko Voigt
On Mon, Dec 09, 2013 at 03:37:50PM -0800, Junio C Hamano wrote:
  +void submodule_config_cache_free(struct submodule_config_cache *cache)
  +{
  +   /* NOTE: its important to iterate over the name hash here
  +* since paths might have multiple entries */
 
 Style (multi-line comments).

Will fix.

 This is interesting.  I wonder what the practical consequence is to
 have a single submodule bound to the top-level tree more than once.
 Updating from one of the working tree will make the other working
 tree out of sync because the ultimate location of the submodule
 directory pointed at by the two .git gitdirs can only have a single
 HEAD, be it detached or on a branch, and a single index.

To clarify, when writing this comment I was not thinking about the same
submodule with multiple paths in the same tree but rather with the same
name under different paths in different commits.

 Not that the decision to enforce that names are unique in the
 top-level .gitmodules, and follow that decision in this part of the
 code to be defensive (not rely on the one submodule can be bound
 only once to a top-level tree), but shouldn't such a configuration
 to have a single submodule bound to more than one place in the
 top-level tree be forbidden?

Yes IMO, that should be forbidden currently. I do not think we actually
prevent the user from doing so but it can not happen by accident since
we derive the initial name from the local path. Maybe we should be more
strict about that and put more guards in place to avoid such
configurations from entering the database.

  +   for_each_hash(cache-for_name, free_one_submodule_config, NULL);
  +   free_hash(cache-for_path);
  +   free_hash(cache-for_name);
  +}
  +
  +static unsigned int hash_sha1_string(const unsigned char *sha1, const char 
  *string)
  +{
  +   int c;
  +   unsigned int hash, string_hash = 5381;
  +   memcpy(hash, sha1, sizeof(hash));
  +
  +   /* djb2 hash */
  +   while ((c = *string++))
  +   string_hash = ((string_hash  5) + hash) + c; /* hash * 33 + c 
  */
 
 Hmm, the comment and the code does not seem to match in math here...

Yeah sorry that was a leftover from the code I started with. In the
beginning it was a pure string hash. Will remove both comments (since
its also not a pure djb2 hash anymore).

Cheers Heiko
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[RFC/WIP PATCH] implement reading of submodule .gitmodules configuration into cache

2013-12-09 Thread Heiko Voigt
This submodule configuration cache allows us to lazily read .gitmodules
configurations by commit into a runtime cache which can then be used to
easily lookup values from it. Currently only the values for path or name
are stored but it can be extended for any value needed.

This cache can be used for all purposes which need knowledge about
submodule configurations.

It is expected that .gitmodules files do not change often between
commits. Thats why we lookup the .gitmodules sha1 from a commit and then
either lookup an already parsed configuration or parse and cache an
unknown one for each sha1. The cache is lazily build on demand for each
requested commit.

Signed-off-by: Heiko Voigt hvo...@hvoigt.net
---
On Tue, Dec 03, 2013 at 07:33:01PM +0100, Heiko Voigt wrote:
 On Mon, Dec 02, 2013 at 03:55:36PM -0800, Nick Townsend wrote:
  It seems to me that the question that you are trying to solve is
  more complex than the problem I faced in git-archive, where we have a
  single commit of the top-level repository that we are chasing. 
  Perhaps we should split the work into two pieces:
  
  a. Identifying the complete submodule configuration for a single commit, and
  b. the complexity of behaviour when fetching and cloning recursively (which 
  of course requires a.)
 
 You are right the latter (b) is a separate topic. So how about I extract the
 submodule config parsing part from the mentioned patch and you can then
 use that patch as a basis for your work? As far as I understand you only
 need to parse the .gitmodules file for one commit and then lookup the
 submodule names from paths right? That would simplify matters and we can
 postpone the caching of multiple commits for the time when I continue on b.

Ok and here is a patch that you can use as basis for your work. I looked
into it and found that its actually easier to use the cache. Storing
everything in a hashmap seems like overkill for your application but it
is prepared for my usecase which actually needs to parse configurations
for multiple commits.

Since this has not been discussed yet some details of my implementation
might change.

The test I implemented is only for demonstration of usage and my quick
manual testing. If we agree on going ahead with my patch I will extend
that to a proper test. Or if anyone has an idea where we can plug that
in to make a useful user interface we can also implement a test based on
that.

Cheers Heiko

 .gitignore|   1 +
 Makefile  |   2 +
 submodule-config-cache.c  |  96 
 submodule-config-cache.h  |  33 +++
 submodule.c   | 125 ++
 submodule.h   |   4 ++
 test-submodule-config-cache.c |  45 +++
 7 files changed, 306 insertions(+)
 create mode 100644 submodule-config-cache.c
 create mode 100644 submodule-config-cache.h
 create mode 100644 test-submodule-config-cache.c

diff --git a/.gitignore b/.gitignore
index 66199ed..6c91e98 100644
--- a/.gitignore
+++ b/.gitignore
@@ -200,6 +200,7 @@
 /test-sha1
 /test-sigchain
 /test-string-list
+/test-submodule-config-cache
 /test-subprocess
 /test-svn-fe
 /test-urlmatch-normalization
diff --git a/Makefile b/Makefile
index af847f8..e3d869b 100644
--- a/Makefile
+++ b/Makefile
@@ -572,6 +572,7 @@ TEST_PROGRAMS_NEED_X += test-scrap-cache-tree
 TEST_PROGRAMS_NEED_X += test-sha1
 TEST_PROGRAMS_NEED_X += test-sigchain
 TEST_PROGRAMS_NEED_X += test-string-list
+TEST_PROGRAMS_NEED_X += test-submodule-config-cache
 TEST_PROGRAMS_NEED_X += test-subprocess
 TEST_PROGRAMS_NEED_X += test-svn-fe
 TEST_PROGRAMS_NEED_X += test-urlmatch-normalization
@@ -872,6 +873,7 @@ LIB_OBJS += strbuf.o
 LIB_OBJS += streaming.o
 LIB_OBJS += string-list.o
 LIB_OBJS += submodule.o
+LIB_OBJS += submodule-config-cache.o
 LIB_OBJS += symlinks.o
 LIB_OBJS += tag.o
 LIB_OBJS += trace.o
diff --git a/submodule-config-cache.c b/submodule-config-cache.c
new file mode 100644
index 000..7253fad
--- /dev/null
+++ b/submodule-config-cache.c
@@ -0,0 +1,96 @@
+#include cache.h
+#include submodule-config-cache.h
+#include strbuf.h
+#include hash.h
+
+void submodule_config_cache_init(struct submodule_config_cache *cache)
+{
+   init_hash(cache-for_name);
+   init_hash(cache-for_path);
+}
+
+static int free_one_submodule_config(void *ptr, void *data)
+{
+   struct submodule_config *entry = ptr;
+
+   strbuf_release(entry-path);
+   strbuf_release(entry-name);
+   free(entry);
+
+   return 0;
+}
+
+void submodule_config_cache_free(struct submodule_config_cache *cache)
+{
+   /* NOTE: its important to iterate over the name hash here
+* since paths might have multiple entries */
+   for_each_hash(cache-for_name, free_one_submodule_config, NULL);
+   free_hash(cache-for_path);
+   free_hash(cache-for_name);
+}
+
+static unsigned int hash_sha1_string(const unsigned char *sha1, const char 
*string)
+{

Re: [RFC/WIP PATCH] implement reading of submodule .gitmodules configuration into cache

2013-12-09 Thread Junio C Hamano
Heiko Voigt hvo...@hvoigt.net writes:

 This submodule configuration cache allows us to lazily read .gitmodules
 configurations by commit into a runtime cache which can then be used to
 easily lookup values from it. Currently only the values for path or name
 are stored but it can be extended for any value needed.
 ...

Thanks.

 diff --git a/submodule-config-cache.c b/submodule-config-cache.c
 new file mode 100644
 index 000..7253fad
 --- /dev/null
 +++ b/submodule-config-cache.c
 @@ -0,0 +1,96 @@
 +#include cache.h
 +#include submodule-config-cache.h
 +#include strbuf.h
 +#include hash.h
 +
 +void submodule_config_cache_init(struct submodule_config_cache *cache)
 +{
 + init_hash(cache-for_name);
 + init_hash(cache-for_path);
 +}
 +
 +static int free_one_submodule_config(void *ptr, void *data)
 +{
 + struct submodule_config *entry = ptr;
 +
 + strbuf_release(entry-path);
 + strbuf_release(entry-name);
 + free(entry);
 +
 + return 0;
 +}
 +
 +void submodule_config_cache_free(struct submodule_config_cache *cache)
 +{
 + /* NOTE: its important to iterate over the name hash here
 +  * since paths might have multiple entries */

Style (multi-line comments).

This is interesting.  I wonder what the practical consequence is to
have a single submodule bound to the top-level tree more than once.
Updating from one of the working tree will make the other working
tree out of sync because the ultimate location of the submodule
directory pointed at by the two .git gitdirs can only have a single
HEAD, be it detached or on a branch, and a single index.

Not that the decision to enforce that names are unique in the
top-level .gitmodules, and follow that decision in this part of the
code to be defensive (not rely on the one submodule can be bound
only once to a top-level tree), but shouldn't such a configuration
to have a single submodule bound to more than one place in the
top-level tree be forbidden?

 + for_each_hash(cache-for_name, free_one_submodule_config, NULL);
 + free_hash(cache-for_path);
 + free_hash(cache-for_name);
 +}
 +
 +static unsigned int hash_sha1_string(const unsigned char *sha1, const char 
 *string)
 +{
 + int c;
 + unsigned int hash, string_hash = 5381;
 + memcpy(hash, sha1, sizeof(hash));
 +
 + /* djb2 hash */
 + while ((c = *string++))
 + string_hash = ((string_hash  5) + hash) + c; /* hash * 33 + c 
 */

Hmm, the comment and the code does not seem to match in math here...
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html