Re: [PATCH v3 4/9] index-helper: new daemon for caching index and related stuff

2014-07-30 Thread David Turner

On Mon, 2014-07-28 at 19:03 +0700, Nguyễn Thái Ngọc Duy wrote:
> +# Define HAVE_SHM if you platform support shm_* functions in librt.

s/you/your/

> +static void free_index_shm(struct index_shm *is)

Does not actually free its argument; should be release_index_shm.  


--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 4/9] index-helper: new daemon for caching index and related stuff

2014-07-30 Thread Duy Nguyen
On Wed, Jul 30, 2014 at 3:08 PM, Eric Sunshine  wrote:
>> +static void share_index(struct index_state *istate, struct index_shm *is)
>> +{
>> +   void *new_mmap;
>> +   if (istate->mmap_size <= 20 ||
>> +   hashcmp(istate->sha1,
>> +   (unsigned char *)istate->mmap + istate->mmap_size - 20) 
>> ||
>> +   !hashcmp(istate->sha1, is->sha1) ||
>> +   git_shm_map(O_CREAT | O_EXCL | O_RDWR, 0700, istate->mmap_size,
>> +   &new_mmap, PROT_READ | PROT_WRITE, MAP_SHARED,
>> +   "git-index-%s", sha1_to_hex(istate->sha1)) < 0)
>
> Do any of these failure conditions deserve a diagnostic message to let
> the user know that there was a problem?

Hmm.. diagnostic messages are a problem already. This most likely runs
as a daemon, nowhere to print to. But if running on foreground, some
messages would help.

>> +static int try_shm(struct index_state *istate)
>> +{
>> +   void *new_mmap = NULL;
>> +   size_t old_size = istate->mmap_size;
>
> Is old_size needed? Can you not simply reference istate->mmap_size
> directly each place old_size is mentioned?
>
>> +   ssize_t new_length;
>
> Nit: 'size' vs. 'length' inconsistency in variable name choices.

Leftovers after many iterations. Will fix.

>> +#define SHM_PATH_LEN 72/* we don't create very long paths.. 
>> */
>> +
>> +ssize_t git_shm_map(int oflag, int perm, ssize_t length, void **mmap,
>> +   int prot, int flags, const char *fmt, ...)
>> +{
>> +   va_list ap;
>> +   char path[SHM_PATH_LEN];
>
> Is there a reason to avoid strbuf here?

Laziness (which is ironic as I'm working on patches to clean up fixed
size buffers)
-- 
Duy
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v3 4/9] index-helper: new daemon for caching index and related stuff

2014-07-30 Thread Eric Sunshine
On Monday, July 28, 2014, Nguyễn Thái Ngọc Duy  wrote:
> Instead of reading the index from disk and worrying about disk
> corruption, the index is cached in memory (memory bit-flips happen
> too, but hopefully less often). The result is faster read.
>
> The biggest gain is not having to verify the trailing SHA-1, which
> takes lots of time especially on large index files. But this also
> opens doors for further optimiztions:

s/optimiztions/optimizations/

>  - we could create an in-memory format that's essentially the memory
>dump of the index to eliminate most of parsing/allocation
>overhead. The mmap'd memory can be used straight away.
>
>  - we could cache non-index info such as name hash
>
> The shared memory's name folows the template "git--"

s/folows/follows/

> where  is the trailing SHA-1 of the index file.  is
> "index" for cached index files (and may be "name-hash" for name-hash
> cache). If such shared memory exists, it contains the same index
> content as on disk. The content is already validated by the daemon and
> git won't validate it again (except comparing the trailing SHA-1s).
>
> Git can poke the daemon to tell it to refresh the index cache, or to
> keep it alive some more minutes via UNIX signals. It can't give any
> real index data directly to the daemon. Real data goes to disk first,
> then the daemon reads and verifies it from there.
>
> $GIT_DIR/index-helper.pid contains a reference to daemon process (and
> it's pid on *nix). The file's mtime is updated every time it's accessed

s/it's pid/its pid/

> (or should be updated often enough). Old index-helper.pid is considered
> stale and ignored.
>
> Signed-off-by: Nguyễn Thái Ngọc Duy 
> ---
> diff --git a/index-helper.c b/index-helper.c
> new file mode 100644
> index 000..8fa0af9
> --- /dev/null
> +++ b/index-helper.c
> @@ -0,0 +1,157 @@
> +static void share_index(struct index_state *istate, struct index_shm *is)
> +{
> +   void *new_mmap;
> +   if (istate->mmap_size <= 20 ||
> +   hashcmp(istate->sha1,
> +   (unsigned char *)istate->mmap + istate->mmap_size - 20) ||
> +   !hashcmp(istate->sha1, is->sha1) ||
> +   git_shm_map(O_CREAT | O_EXCL | O_RDWR, 0700, istate->mmap_size,
> +   &new_mmap, PROT_READ | PROT_WRITE, MAP_SHARED,
> +   "git-index-%s", sha1_to_hex(istate->sha1)) < 0)

Do any of these failure conditions deserve a diagnostic message to let
the user know that there was a problem?

> +   return;
> +
> +   free_index_shm(is);
> +   is->size = istate->mmap_size;
> +   is->shm = new_mmap;
> +   hashcpy(is->sha1, istate->sha1);
> +   memcpy(new_mmap, istate->mmap, istate->mmap_size - 20);
> +
> +   /*
> +* The trailing hash must be written last after everything is
> +* written. It's the indication that the shared memory is now
> +* ready.
> +*/
> +   hashcpy((unsigned char *)new_mmap + istate->mmap_size - 20, is->sha1);
> +}
> diff --git a/read-cache.c b/read-cache.c
> index b679781..ff28142 100644
> --- a/read-cache.c
> +++ b/read-cache.c
> @@ -1465,6 +1467,65 @@ static struct cache_entry *create_from_disk(struct 
> ondisk_cache_entry *ondisk,
> return ce;
>  }
>
> +/*
> + * Try to open and verify a cached shm index if available. Return 0 if
> + * succeeds (istate->mmap and istate->mmap_size are updated). Return
> + * negative otherwise.
> + */
> +static int try_shm(struct index_state *istate)
> +{
> +   void *new_mmap = NULL;
> +   size_t old_size = istate->mmap_size;

Is old_size needed? Can you not simply reference istate->mmap_size
directly each place old_size is mentioned?

> +   ssize_t new_length;

Nit: 'size' vs. 'length' inconsistency in variable name choices.

> +   const unsigned char *sha1;
> +   struct stat st;
> +
> +   if (old_size <= 20 ||
> +   stat(git_path("index-helper.pid"), &st))
> +   return -1;
> +   sha1 = (unsigned char *)istate->mmap + old_size - 20;
> +   new_length = git_shm_map(O_RDONLY, 0700, -1, &new_mmap,
> +PROT_READ, MAP_SHARED,
> +"git-index-%s", sha1_to_hex(sha1));
> +   if (new_length <= 20 ||
> +   hashcmp((unsigned char *)istate->mmap + old_size - 20,
> +   (unsigned char *)new_mmap + new_length - 20)) {
> +   if (new_mmap)
> +   munmap(new_mmap, new_length);
> +   poke_daemon(&st, 1);
> +   return -1;
> +   }
> +   munmap(istate->mmap, istate->mmap_size);
> +   istate->mmap = new_mmap;
> +   istate->mmap_size = new_length;
> +   poke_daemon(&st, 0);
> +   return 0;
> +}
> +
>  /* remember to discard_cache() before reading a different cache! */
>  int do_read_index(struct index_state *istate, const char *path, int 
> must_exist)
>  {
> diff --git a/shm.c b/shm.c
> new file mode 100644
> in

[PATCH v3 4/9] index-helper: new daemon for caching index and related stuff

2014-07-28 Thread Nguyễn Thái Ngọc Duy
Instead of reading the index from disk and worrying about disk
corruption, the index is cached in memory (memory bit-flips happen
too, but hopefully less often). The result is faster read.

The biggest gain is not having to verify the trailing SHA-1, which
takes lots of time especially on large index files. But this also
opens doors for further optimiztions:

 - we could create an in-memory format that's essentially the memory
   dump of the index to eliminate most of parsing/allocation
   overhead. The mmap'd memory can be used straight away.

 - we could cache non-index info such as name hash

The shared memory's name folows the template "git--"
where  is the trailing SHA-1 of the index file.  is
"index" for cached index files (and may be "name-hash" for name-hash
cache). If such shared memory exists, it contains the same index
content as on disk. The content is already validated by the daemon and
git won't validate it again (except comparing the trailing SHA-1s).

Git can poke the daemon to tell it to refresh the index cache, or to
keep it alive some more minutes via UNIX signals. It can't give any
real index data directly to the daemon. Real data goes to disk first,
then the daemon reads and verifies it from there.

$GIT_DIR/index-helper.pid contains a reference to daemon process (and
it's pid on *nix). The file's mtime is updated every time it's accessed
(or should be updated often enough). Old index-helper.pid is considered
stale and ignored.

index-helper requires POSIX realtime extension. POSIX shm interface
however is abstracted away so that Windows support can be added later.

On webkit.git with index format v2, duplicating 8 times to 1.4m
entries and 200MB in size:

  (vanilla)  0.986986364 s: read_index_from .git/index
  (index-helper) 0.267850279 s: read_index_from .git/index

Interestingly with index v4, we get less out of index-helper. It makes
sense as v4 requires more processing after loading the index:

  (vanilla)  0.72249 s: read_index_from .git/index
  (index-helper) 0.302741500 s: read_index_from .git/index

Signed-off-by: Nguyễn Thái Ngọc Duy 
---
 .gitignore   |   1 +
 Documentation/git-index-helper.txt (new) |  44 +
 Makefile |   9 ++
 cache.h  |   2 +
 config.mak.uname |   1 +
 git-compat-util.h|   1 +
 index-helper.c (new) | 157 +++
 read-cache.c |  85 +++--
 shm.c (new)  |  67 +
 shm.h (new)  |  23 +
 10 files changed, 381 insertions(+), 9 deletions(-)
 create mode 100644 Documentation/git-index-helper.txt
 create mode 100644 index-helper.c
 create mode 100644 shm.c
 create mode 100644 shm.h

diff --git a/.gitignore b/.gitignore
index 81e12c0..66148ef 100644
--- a/.gitignore
+++ b/.gitignore
@@ -71,6 +71,7 @@
 /git-http-fetch
 /git-http-push
 /git-imap-send
+/git-index-helper
 /git-index-pack
 /git-init
 /git-init-db
diff --git a/Documentation/git-index-helper.txt 
b/Documentation/git-index-helper.txt
new file mode 100644
index 000..da41c4d
--- /dev/null
+++ b/Documentation/git-index-helper.txt
@@ -0,0 +1,44 @@
+git-index-helper(1)
+===
+
+NAME
+
+git-index-helper - A simple cache daemon for speeding up index file access
+
+SYNOPSIS
+
+[verse]
+'git index-helper' [options]
+
+DESCRIPTION
+---
+Keep the index file in memory for faster access. This daemon is per
+repository.
+
+OPTIONS
+---
+
+--exit-after=::
+   Exit if the cached index is not accessed for ``
+   minutes. Specify 0 to wait forever. Default is 10.
+
+NOTES
+-
+On UNIX-like systems, $GIT_DIR/index-helper.pid contains the process
+id of the daemon. At least on Linux, shared memory objects are
+availble via /dev/shm with the name pattern "git--".
+Normally the daemon will clean up shared memory objects when it exits.
+But if it crashes, some objects could remain there and they can be
+safely deleted with "rm" command. The following signals are used to
+control the daemon:
+
+SIGHUP::
+   Reread the index.
+
+SIGUSR1::
+   This signal is counted as "an access", which keeps the daemon
+   alive longer, unless `--exit-after=0` is used.
+
+GIT
+---
+Part of the linkgit:git[1] suite
diff --git a/Makefile b/Makefile
index 2320de5..24734e6 100644
--- a/Makefile
+++ b/Makefile
@@ -342,6 +342,8 @@ all::
 # return NULL when it receives a bogus time_t.
 #
 # Define HAVE_CLOCK_GETTIME if your platform has clock_gettime in librt.
+#
+# Define HAVE_SHM if you platform support shm_* functions in librt.
 
 GIT-VERSION-FILE: FORCE
@$(SHELL_PATH) ./GIT-VERSION-GEN
@@ -876,6 +878,7 @@ LIB_OBJS += sha1-lookup.o
 LIB_OBJS += sha1_file.o
 LIB_OBJS += sha1_name.o
 LIB_OBJS += shallow.o
+LIB_OBJS += shm.o
 LIB_OBJS += sideband.o
 LIB_OBJS += sigchain.o