Re: Re: [PATCH] implement submodule config cache for lookup of submodule names

2014-03-12 Thread Jonathan Nieder
Heiko Voigt wrote:
> On Tue, Mar 11, 2014 at 07:28:52PM -0700, Jonathan Nieder wrote:
>> Heiko Voigt wrote:

>>> +static unsigned int hash_sha1_string(const unsigned char *sha1, const char 
>>> *string)
>>> +{
>>> +   return memhash(sha1, 20) + strhash(string);
>>> +}
>>
>> Feels a bit unconventional.  I can't find a strong reason to mind.
>
> Well I did not think much about this. I simply thought: hash -> kind of
> random value. Adding the two together is as good as anything (even
> overflow does not matter).
[...]
> I am fine with a switch to something different. We could use classic XOR
> in case that feels better.

Either + or ^ is fine with me (yeah, '^' is what I expected so '+'
forced me to think for a few seconds).  I don't think we have to worry
much about hostile people making repos that force git to spend a long
time dealing with hash collisions, so anything more complicated is
probably overkill. :)

[...]
>> [...]
>>> +static void warn_multiple_config(struct submodule_config *config, const 
>>> char *option)
>>> +{
>>> +   warning("%s:.gitmodules, multiple configurations found for 
>>> submodule.%s.%s. "
>>> +   "Skipping second one!", 
>>> sha1_to_hex(config->gitmodule_sha1),
>>> +   option, config->name.buf);
>>
>> Ah, so gitmodule_sha1 is a commit id?
>
> No, this output is a bug. gitmodule_sha1 is actually the sha1 of the
> .gitmodule blob we read. Thanks for noticing will fix. Should I also add
> a comment to the gitmodule_sha1 field to explain what it is?
[...]
>   with
> the clarification does the name make sense now?

Yep.  Suggested fixes:

 - call it gitmodules_sha1 instead of gitmodule_sha1 (it's the blob
   name for .gitmodules, not the name of a module)

 - add a comment where the field is declared (this would make it clear
   that it's a blob name instead of e.g. just the SHA-1 of the text)

Thanks for your thoughtfulness.
Jonathan
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Re: [PATCH] implement submodule config cache for lookup of submodule names

2014-03-12 Thread Heiko Voigt
Hi,

On Tue, Mar 11, 2014 at 07:28:52PM -0700, Jonathan Nieder wrote:
> Heiko Voigt wrote:
> 
> > This submodule configuration cache allows us to lazily read .gitmodules
> > configurations by commit into a runtime cache which can then be used to
> > easily lookup values from it. Currently only the values for path or name
> > are stored but it can be extended for any value needed.
> 
> Makes sense.
> 
> [...]
> > Next I am planning to extend this so configuration cache so it will
> > return the local configuration (with the usual .gitmodules,
> > /etc/gitconfig, .git/config, ... overlay of variables) when you pass it
> > null_sha1 for gitmodule or commit sha1. That way we can give this
> > configuration cache some use and implement its usage for the current
> > configuration variables in submodule.c.
> 
> Yay!
> 
> I wonder whether local configuration should also override information
> from commits at times.

Yeah but for that I plan a different code path that solves conflicts
and/or extend missing values. I think its best to keep the submodule
configurations by commit as simple as possible. But we will see once I
get to the point where we need this.

> [...]
> > --- /dev/null
> > +++ b/Documentation/technical/api-submodule-config-cache.txt
> > @@ -0,0 +1,39 @@
> > +submodule config cache API
> > +==
> 
> Most API documents in Documentation/technical have a section explaining
> general usage --- e.g. (from api-revision-walking),
> 
>   Calling sequence
>   
> 
>   The walking API has a given calling sequence: first you need to
>   initialize a rev_info structure, then add revisions to control what kind
>   of revision list do you want to get, finally you can iterate over the
>   revision list.
> 
> Or (from api-string-list):
> 
>   The caller:
> 
>   . Allocates and clears a `struct string_list` variable.
> 
>   . Initializes the members. You might want to set the flag 
> `strdup_strings`
> if the strings should be strdup()ed. For example, this is necessary
> [...]
>   . Can remove items not matching a criterion from a sorted or unsorted
> list using `filter_string_list`, or remove empty strings using
> `string_list_remove_empty_items`.
> 
>   . Finally it should free the list using `string_list_clear`.
> 
> This makes it easier to get started by looking at the documentation alone
> without having to mimic an example.

True, will extend that in the next iteration.

> Another thought: it's tempting to call this the 'submodule config API' and
> treat the (optional?) memoization as an implementation detail instead of
> reminding the caller of it too much.  But I haven't thought that through
> completely.

Thats the same point Jeff mentioned and I think that will simplify many
things. As I mentioned in the answer to Jeff I will keep passing around
the cache pointer internally but expose only functions without it to the
outside to be future proof but also easy to use for the current use
case.

> [...]
> > +`struct submodule_config *get_submodule_config_from_path(
> > +   struct submodule_config_cache *cache,
> > +   const unsigned char *commit_sha1,
> > +   const char *path)::
> > +
> > +   Lookup a submodules configuration by its commit_sha1 and path.
> 
> The cache just always grows until it's freed, right?  Would it make
> sense to allow cached configurations to be explicitly evicted when the
> caller is done with them some day?  (Just curious, not a complaint.
> Might be worth mentioning in the overview how the cache is expected to
> be used, though.)

Yes it only grows at the moment. Not sure whether we need to remove
configurations. Currently the only use case that comes to my mind would
be if it grows to big to be kept in memory, but at the moment that seems
far away for me, so I would leave that for a future extension. It will
be hard to know whether someone is done with a certain .gitmodule file
since we cache it by its sha1 expecting it to be the same over many
revisions.

> [...]
> > --- /dev/null
> > +++ b/submodule-config-cache.h
> > @@ -0,0 +1,26 @@
> > +#ifndef SUBMODULE_CONFIG_CACHE_H
> > +#define SUBMODULE_CONFIG_CACHE_H
> > +
> > +#include "hashmap.h"
> > +#include "strbuf.h"
> > +
> > +struct submodule_config_cache {
> > +   struct hashmap for_path;
> > +   struct hashmap for_name;
> > +};
> 
> Could use comments (e.g., saying what keys each maps to what values).
> Would callers ever look at the hashmaps directly or are they only
> supposed to be accessed using helper functions that take a whole
> struct?

Users should only use the helper functions, but I will hide this in the
submodule-config module. Will add some comment as well.

> [...]
> > +
> > +/* one submodule_config_cache entry */
> > +struct submodule_config {
> > +   struct strbuf path;
> > +   struct strbuf name;
> > +   unsigned char gitmodule_sha1[20];
> 
> Could also use comm

Re: Re: [PATCH] implement submodule config cache for lookup of submodule names

2014-03-12 Thread Heiko Voigt
On Tue, Mar 11, 2014 at 05:58:08PM -0400, Jeff King wrote:
> On Mon, Mar 10, 2014 at 10:24:12PM +0100, Heiko Voigt wrote:
> 
> > I have also moved all functions into the new submodule-config-cache
> > module. I am not completely satisfied with the naming since it is quite
> > long. If someone comes up with some different naming I am open for it.
> > Maybe simply submodule-config (submodule_config prefix for functions)?
> 
> Since the cache is totally internal to the submodule-config code, I do
> not know that you even need to have a nice submodule-config-cache API.
> Can it just be a singleton?
> 
> That is bad design in a sense (it becomes harder later if you ever want
> to pull submodule config from two sources simultaneously), but it
> matches many other subsystems in git which cache behind the scenes.
> 
> I also suspect you could call submodule_config simply "submodule", and
> let it be a stand-in for the submodule (for now, only data from the
> config, but potentially you could keep other data on it, too).
> 
> So with all that, the entry point into your code is just:
> 
>   const struct submodule *submodule_from_path(const char *path);
> 
> and the caching magically happens behind-the-scenes.

Actually since we also need to define the revision from which we request
these submodule values that would become:

   const struct submodule *submodule_from_path(const unsigned char *commit_sha1,
const char *path);

Since the configuration for submodules for a submodule are identified by
a unique commit sha1 and its unique path (or unique name) I think it is
feasible to make it a singleton. That would also simplify using it from
the existing config parsing code.

To be future proof I can internally keep the object oriented approach
always passing on the submodule_config_cache pointer. That way it would
be easy to expose to the outside in case we later need multiple caches.

So I currently I do not see any downside of making it a singleton to the
outside and would go with that.

> > +/* one submodule_config_cache entry */
> > +struct submodule_config {
> > +   struct strbuf path;
> > +   struct strbuf name;
> > +   unsigned char gitmodule_sha1[20];
> > +};
> 
> Do these strings need changed after they are written once? If not, you
> may want to just make these bare pointers (you can still use strbufs to
> create them, and then strbuf_detach at the end). That may just be a matter of
> taste, though.

No they do not need to be changed after parsing, since every path,
name mapping should be unique in one .gitmodule file. And I think it
actually would make the code more clear in one instance where I directly
set the buf pointer which Jonathan mentioned. There it is needed only
for the hashmap lookup.

Cheers Heiko
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] implement submodule config cache for lookup of submodule names

2014-03-11 Thread Jonathan Nieder
Hi,

Some quick thoughts.

Heiko Voigt wrote:

> This submodule configuration cache allows us to lazily read .gitmodules
> configurations by commit into a runtime cache which can then be used to
> easily lookup values from it. Currently only the values for path or name
> are stored but it can be extended for any value needed.

Makes sense.

[...]
> Next I am planning to extend this so configuration cache so it will
> return the local configuration (with the usual .gitmodules,
> /etc/gitconfig, .git/config, ... overlay of variables) when you pass it
> null_sha1 for gitmodule or commit sha1. That way we can give this
> configuration cache some use and implement its usage for the current
> configuration variables in submodule.c.

Yay!

I wonder whether local configuration should also override information
from commits at times.

[...]
> --- /dev/null
> +++ b/Documentation/technical/api-submodule-config-cache.txt
> @@ -0,0 +1,39 @@
> +submodule config cache API
> +==

Most API documents in Documentation/technical have a section explaining
general usage --- e.g. (from api-revision-walking),

Calling sequence


The walking API has a given calling sequence: first you need to
initialize a rev_info structure, then add revisions to control what kind
of revision list do you want to get, finally you can iterate over the
revision list.

Or (from api-string-list):

The caller:

. Allocates and clears a `struct string_list` variable.

. Initializes the members. You might want to set the flag 
`strdup_strings`
  if the strings should be strdup()ed. For example, this is necessary
[...]
. Can remove items not matching a criterion from a sorted or unsorted
  list using `filter_string_list`, or remove empty strings using
  `string_list_remove_empty_items`.

. Finally it should free the list using `string_list_clear`.

This makes it easier to get started by looking at the documentation alone
without having to mimic an example.

Another thought: it's tempting to call this the 'submodule config API' and
treat the (optional?) memoization as an implementation detail instead of
reminding the caller of it too much.  But I haven't thought that through
completely.

[...]
> +`struct submodule_config *get_submodule_config_from_path(
> + struct submodule_config_cache *cache,
> + const unsigned char *commit_sha1,
> + const char *path)::
> +
> + Lookup a submodules configuration by its commit_sha1 and path.

The cache just always grows until it's freed, right?  Would it make
sense to allow cached configurations to be explicitly evicted when the
caller is done with them some day?  (Just curious, not a complaint.
Might be worth mentioning in the overview how the cache is expected to
be used, though.)

[...]
> --- /dev/null
> +++ b/submodule-config-cache.h
> @@ -0,0 +1,26 @@
> +#ifndef SUBMODULE_CONFIG_CACHE_H
> +#define SUBMODULE_CONFIG_CACHE_H
> +
> +#include "hashmap.h"
> +#include "strbuf.h"
> +
> +struct submodule_config_cache {
> + struct hashmap for_path;
> + struct hashmap for_name;
> +};

Could use comments (e.g., saying what keys each maps to what values).
Would callers ever look at the hashmaps directly or are they only
supposed to be accessed using helper functions that take a whole
struct?

[...]
> +
> +/* one submodule_config_cache entry */
> +struct submodule_config {
> + struct strbuf path;
> + struct strbuf name;
> + unsigned char gitmodule_sha1[20];

Could also use comments.  What does the gitmodule_sha1 field represent?

[...]
> --- /dev/null
> +++ b/submodule-config-cache.c
> @@ -0,0 +1,256 @@
> +#include "cache.h"
> +#include "submodule-config-cache.h"
> +#include "strbuf.h"
> +
> +struct submodule_config_entry {
> + struct hashmap_entry ent;
> + struct submodule_config *config;
> +};
> +
> +static int submodule_config_path_cmp(const struct submodule_config_entry *a,
> +  const struct submodule_config_entry *b,
> +  const void *unused)
> +{
> + int ret;
> + if ((ret = strcmp(a->config->path.buf, b->config->path.buf)))
> + return ret;

Style: avoid assignments in conditionals:

ret = strcmp(...);
if (ret)
return ret;

return hashcmp(...);

The actual return value from strcmp isn't important since
hashmap_cmp_fn is only used to check for equality.  Perhaps as a
separate change it would make sense to make it a hashmap_equality_fn
some day:

return !strcmp(...) && !hashcmp(...);

This checks both the path and gitmodule_sha1, so I guess it's for
looking up submodule names.

[...]
> +static int submodule_config_name_cmp(const struct submodule_config_entry *a,
> +  const struct submodule_config_entry *b,
> +  

Re: [PATCH] implement submodule config cache for lookup of submodule names

2014-03-11 Thread Jeff King
On Mon, Mar 10, 2014 at 10:24:12PM +0100, Heiko Voigt wrote:

> I have also moved all functions into the new submodule-config-cache
> module. I am not completely satisfied with the naming since it is quite
> long. If someone comes up with some different naming I am open for it.
> Maybe simply submodule-config (submodule_config prefix for functions)?

Since the cache is totally internal to the submodule-config code, I do
not know that you even need to have a nice submodule-config-cache API.
Can it just be a singleton?

That is bad design in a sense (it becomes harder later if you ever want
to pull submodule config from two sources simultaneously), but it
matches many other subsystems in git which cache behind the scenes.

I also suspect you could call submodule_config simply "submodule", and
let it be a stand-in for the submodule (for now, only data from the
config, but potentially you could keep other data on it, too).

So with all that, the entry point into your code is just:

  const struct submodule *submodule_from_path(const char *path);

and the caching magically happens behind-the-scenes.

But take all of this with a giant grain of salt, as I am not too
familiar with the needs of the callers.

> +/* one submodule_config_cache entry */
> +struct submodule_config {
> + struct strbuf path;
> + struct strbuf name;
> + unsigned char gitmodule_sha1[20];
> +};

Do these strings need changed after they are written once? If not, you
may want to just make these bare pointers (you can still use strbufs to
create them, and then strbuf_detach at the end). That may just be a matter of
taste, though.

-Peff
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] implement submodule config cache for lookup of submodule names

2014-03-10 Thread Heiko Voigt
This submodule configuration cache allows us to lazily read .gitmodules
configurations by commit into a runtime cache which can then be used to
easily lookup values from it. Currently only the values for path or name
are stored but it can be extended for any value needed.

It is expected that .gitmodules files do not change often between
commits. Thats why we lookup the .gitmodules sha1 from a commit and then
either lookup an already parsed configuration or parse and cache an
unknown one for each sha1. The cache is lazily build on demand for each
requested commit.

This cache can be used for all purposes which need knowledge about
submodule configurations. Example use cases are:

 * Recursive submodule checkout needs lookup a submodule name from its
   path when a submodule first appears. This needs be done before this
   configuration exists in the worktree.

 * The implementation of submodule support for 'git archive' needs to
   lookup the submodule name to generate the archive when given a
   revision that is not checked out.

 * 'git fetch' when given the --recurse-submodules=on-demand option (or
   configuration) needs to lookup submodule names by path from the
   database rather than reading from the worktree. For new submodule it
   needs to lookup the name from its path to allow cloning new
   submodules into the .git folder so they can be checked out without
   any network interaction when the user does a checkout of that
   revision.

Signed-off-by: Heiko Voigt 
---

This was originally part of my submodule fetch series[1] of moved or new
submodules. There has been an RFC patch separating this[2] to use it
independently for git-archive. This updated patch now uses the new
hashmap implementation from Karsten Blees.

I have also moved all functions into the new submodule-config-cache
module. I am not completely satisfied with the naming since it is quite
long. If someone comes up with some different naming I am open for it.
Maybe simply submodule-config (submodule_config prefix for functions)?

Next I am planning to extend this so configuration cache so it will
return the local configuration (with the usual .gitmodules,
/etc/gitconfig, .git/config, ... overlay of variables) when you pass it
null_sha1 for gitmodule or commit sha1. That way we can give this
configuration cache some use and implement its usage for the current
configuration variables in submodule.c. There we could get rid of the
global string_lists and only have one global submodule_config_cache
structure from which values can be read.

Let me know what you think.

Cheers Heiko

[1] http://article.gmane.org/gmane.comp.version-control.git/217018
[2] http://article.gmane.org/gmane.comp.version-control.git/239094

 .gitignore |   1 +
 .../technical/api-submodule-config-cache.txt   |  39 
 Makefile   |   2 +
 submodule-config-cache.c   | 256 +
 submodule-config-cache.h   |  26 +++
 t/t7410-submodule-config-cache.sh  |  74 ++
 test-submodule-config-cache.c  |  50 
 7 files changed, 448 insertions(+)
 create mode 100644 Documentation/technical/api-submodule-config-cache.txt
 create mode 100644 submodule-config-cache.c
 create mode 100644 submodule-config-cache.h
 create mode 100755 t/t7410-submodule-config-cache.sh
 create mode 100644 test-submodule-config-cache.c

diff --git a/.gitignore b/.gitignore
index dc600f9..ed286e4 100644
--- a/.gitignore
+++ b/.gitignore
@@ -198,6 +198,7 @@
 /test-sha1
 /test-sigchain
 /test-string-list
+/test-submodule-config-cache
 /test-subprocess
 /test-svn-fe
 /test-urlmatch-normalization
diff --git a/Documentation/technical/api-submodule-config-cache.txt 
b/Documentation/technical/api-submodule-config-cache.txt
new file mode 100644
index 000..e3c80e8
--- /dev/null
+++ b/Documentation/technical/api-submodule-config-cache.txt
@@ -0,0 +1,39 @@
+submodule config cache API
+==
+
+The submodule config cache API allows to lazily read submodule
+configurations from multiple revisions into a cache that can then be
+used for subsequent lookups of submodule configurations either by its
+path or name.
+
+Data Structures
+---
+
+`struct submodule_config_cache`::
+
+   The configuration cache which stores everything needed for
+   lookup. All functions for lookup or insertion operate
+   on this structure.
+
+`struct submodule_config`::
+
+   This structure stores the configuration for one submodule for a
+   certain revision. It is returned by the lookup functions.
+
+Functions
+-
+
+`void submodule_config_cache_init(struct submodule_config_cache *cache)`::
+`void submodule_config_cache_free(struct submodule_config_cache *cache)`::
+
+   Use these to initialize before and free a cache after your
+   lookups.
+
+`struct submodule_config *get_submodule