[PATCH 08/19] ewah: compressed bitmap implementation

2013-10-24 Thread Jeff King
From: Vicent Marti tan...@gmail.com

EWAH is a word-aligned compressed variant of a bitset (i.e. a data
structure that acts as a 0-indexed boolean array for many entries).

It uses a 64-bit run-length encoding (RLE) compression scheme,
trading some compression for better processing speed.

The goal of this word-aligned implementation is not to achieve
the best compression, but rather to improve query processing time.
As it stands right now, this EWAH implementation will always be more
efficient storage-wise than its uncompressed alternative.

EWAH arrays will be used as the on-disk format to store reachability
bitmaps for all objects in a repository while keeping reasonable sizes,
in the same way that JGit does.

This EWAH implementation is a mostly straightforward port of the
original `javaewah` library that JGit currently uses. The library is
self-contained and has been embedded whole (4 files) inside the `ewah`
folder to ease redistribution.

The library is re-licensed under the GPLv2 with the permission of Daniel
Lemire, the original author. The source code for the C version can
be found on GitHub:

https://github.com/vmg/libewok

The original Java implementation can also be found on GitHub:

https://github.com/lemire/javaewah

Signed-off-by: Vicent Marti tan...@gmail.com
Signed-off-by: Jeff King p...@peff.net
---
 Makefile   |   6 +
 ewah/bitmap.c  | 226 +
 ewah/ewah_bitmap.c | 701 +
 ewah/ewah_io.c | 192 +++
 ewah/ewah_rlw.c| 120 +
 ewah/ewok.h| 223 +
 ewah/ewok_rlw.h| 114 +
 7 files changed, 1582 insertions(+)
 create mode 100644 ewah/bitmap.c
 create mode 100644 ewah/ewah_bitmap.c
 create mode 100644 ewah/ewah_io.c
 create mode 100644 ewah/ewah_rlw.c
 create mode 100644 ewah/ewok.h
 create mode 100644 ewah/ewok_rlw.h

diff --git a/Makefile b/Makefile
index 48ff0bd..eacb940 100644
--- a/Makefile
+++ b/Makefile
@@ -667,6 +667,8 @@ LIB_H += diff.h
 LIB_H += diffcore.h
 LIB_H += dir.h
 LIB_H += exec_cmd.h
+LIB_H += ewah/ewok.h
+LIB_H += ewah/ewok_rlw.h
 LIB_H += fetch-pack.h
 LIB_H += fmt-merge-msg.h
 LIB_H += fsck.h
@@ -800,6 +802,10 @@ LIB_OBJS += dir.o
 LIB_OBJS += editor.o
 LIB_OBJS += entry.o
 LIB_OBJS += environment.o
+LIB_OBJS += ewah/bitmap.o
+LIB_OBJS += ewah/ewah_bitmap.o
+LIB_OBJS += ewah/ewah_io.o
+LIB_OBJS += ewah/ewah_rlw.o
 LIB_OBJS += exec_cmd.o
 LIB_OBJS += fetch-pack.o
 LIB_OBJS += fsck.o
diff --git a/ewah/bitmap.c b/ewah/bitmap.c
new file mode 100644
index 000..d248577
--- /dev/null
+++ b/ewah/bitmap.c
@@ -0,0 +1,226 @@
+/**
+ * Copyright 2013, GitHub, Inc
+ * Copyright 2009-2013, Daniel Lemire, Cliff Moon,
+ * David McIntosh, Robert Becho, Google Inc. and Veronika Zenz
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, 
USA.
+ */
+#include assert.h
+#include stdlib.h
+#include string.h
+
+#include ewok.h
+
+#define MASK(x) ((eword_t)1  (x % BITS_IN_WORD))
+#define BLOCK(x) (x / BITS_IN_WORD)
+
+struct bitmap *bitmap_new(void)
+{
+   struct bitmap *bitmap = ewah_malloc(sizeof(struct bitmap));
+   bitmap-words = ewah_calloc(32, sizeof(eword_t));
+   bitmap-word_alloc = 32;
+   return bitmap;
+}
+
+void bitmap_set(struct bitmap *self, size_t pos)
+{
+   size_t block = BLOCK(pos);
+
+   if (block = self-word_alloc) {
+   size_t old_size = self-word_alloc;
+   self-word_alloc = block * 2;
+   self-words = ewah_realloc(self-words, self-word_alloc * 
sizeof(eword_t));
+
+   memset(self-words + old_size, 0x0,
+   (self-word_alloc - old_size) * sizeof(eword_t));
+   }
+
+   self-words[block] |= MASK(pos);
+}
+
+void bitmap_clear(struct bitmap *self, size_t pos)
+{
+   size_t block = BLOCK(pos);
+
+   if (block  self-word_alloc)
+   self-words[block] = ~MASK(pos);
+}
+
+int bitmap_get(struct bitmap *self, size_t pos)
+{
+   size_t block = BLOCK(pos);
+   return block  self-word_alloc  (self-words[block]  MASK(pos)) != 
0;
+}
+
+struct ewah_bitmap *bitmap_to_ewah(struct bitmap *bitmap)
+{
+   struct ewah_bitmap *ewah = ewah_new();
+   size_t i, running_empty_words = 0;
+   eword_t last_word = 0;
+
+   for (i = 0; i  

Re: [PATCH 08/19] ewah: compressed bitmap implementation

2013-10-24 Thread Junio C Hamano
Jeff King p...@peff.net writes:

 From: Vicent Marti tan...@gmail.com

 EWAH is a word-aligned compressed variant of a bitset (i.e. a data
 structure that acts as a 0-indexed boolean array for many entries).

I think I've already pointed out some when the original series was
posted, but this does not seem to compile with -Wdecl-after-stmt
(possibly other violations may exist; I haven't looked very closely
yet).


 It uses a 64-bit run-length encoding (RLE) compression scheme,
 trading some compression for better processing speed.

 The goal of this word-aligned implementation is not to achieve
 the best compression, but rather to improve query processing time.
 As it stands right now, this EWAH implementation will always be more
 efficient storage-wise than its uncompressed alternative.

 EWAH arrays will be used as the on-disk format to store reachability
 bitmaps for all objects in a repository while keeping reasonable sizes,
 in the same way that JGit does.

 This EWAH implementation is a mostly straightforward port of the
 original `javaewah` library that JGit currently uses. The library is
 self-contained and has been embedded whole (4 files) inside the `ewah`
 folder to ease redistribution.

 The library is re-licensed under the GPLv2 with the permission of Daniel
 Lemire, the original author. The source code for the C version can
 be found on GitHub:

   https://github.com/vmg/libewok

 The original Java implementation can also be found on GitHub:

   https://github.com/lemire/javaewah

 Signed-off-by: Vicent Marti tan...@gmail.com
 Signed-off-by: Jeff King p...@peff.net
 ---
  Makefile   |   6 +
  ewah/bitmap.c  | 226 +
  ewah/ewah_bitmap.c | 701 
 +
  ewah/ewah_io.c | 192 +++
  ewah/ewah_rlw.c| 120 +
  ewah/ewok.h| 223 +
  ewah/ewok_rlw.h| 114 +
  7 files changed, 1582 insertions(+)
  create mode 100644 ewah/bitmap.c
  create mode 100644 ewah/ewah_bitmap.c
  create mode 100644 ewah/ewah_io.c
  create mode 100644 ewah/ewah_rlw.c
  create mode 100644 ewah/ewok.h
  create mode 100644 ewah/ewok_rlw.h

 diff --git a/Makefile b/Makefile
 index 48ff0bd..eacb940 100644
 --- a/Makefile
 +++ b/Makefile
 @@ -667,6 +667,8 @@ LIB_H += diff.h
  LIB_H += diffcore.h
  LIB_H += dir.h
  LIB_H += exec_cmd.h
 +LIB_H += ewah/ewok.h
 +LIB_H += ewah/ewok_rlw.h
  LIB_H += fetch-pack.h
  LIB_H += fmt-merge-msg.h
  LIB_H += fsck.h
 @@ -800,6 +802,10 @@ LIB_OBJS += dir.o
  LIB_OBJS += editor.o
  LIB_OBJS += entry.o
  LIB_OBJS += environment.o
 +LIB_OBJS += ewah/bitmap.o
 +LIB_OBJS += ewah/ewah_bitmap.o
 +LIB_OBJS += ewah/ewah_io.o
 +LIB_OBJS += ewah/ewah_rlw.o
  LIB_OBJS += exec_cmd.o
  LIB_OBJS += fetch-pack.o
  LIB_OBJS += fsck.o
 diff --git a/ewah/bitmap.c b/ewah/bitmap.c
 new file mode 100644
 index 000..d248577
 --- /dev/null
 +++ b/ewah/bitmap.c
 @@ -0,0 +1,226 @@
 +/**
 + * Copyright 2013, GitHub, Inc
 + * Copyright 2009-2013, Daniel Lemire, Cliff Moon,
 + *   David McIntosh, Robert Becho, Google Inc. and Veronika Zenz
 + *
 + * This program is free software; you can redistribute it and/or
 + * modify it under the terms of the GNU General Public License
 + * as published by the Free Software Foundation; either version 2
 + * of the License, or (at your option) any later version.
 + *
 + * This program is distributed in the hope that it will be useful,
 + * but WITHOUT ANY WARRANTY; without even the implied warranty of
 + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 + * GNU General Public License for more details.
 + *
 + * You should have received a copy of the GNU General Public License
 + * along with this program; if not, write to the Free Software
 + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  
 02110-1301, USA.
 + */
 +#include assert.h
 +#include stdlib.h
 +#include string.h
 +
 +#include ewok.h
 +
 +#define MASK(x) ((eword_t)1  (x % BITS_IN_WORD))
 +#define BLOCK(x) (x / BITS_IN_WORD)
 +
 +struct bitmap *bitmap_new(void)
 +{
 + struct bitmap *bitmap = ewah_malloc(sizeof(struct bitmap));
 + bitmap-words = ewah_calloc(32, sizeof(eword_t));
 + bitmap-word_alloc = 32;
 + return bitmap;
 +}
 +
 +void bitmap_set(struct bitmap *self, size_t pos)
 +{
 + size_t block = BLOCK(pos);
 +
 + if (block = self-word_alloc) {
 + size_t old_size = self-word_alloc;
 + self-word_alloc = block * 2;
 + self-words = ewah_realloc(self-words, self-word_alloc * 
 sizeof(eword_t));
 +
 + memset(self-words + old_size, 0x0,
 + (self-word_alloc - old_size) * sizeof(eword_t));
 + }
 +
 + self-words[block] |= MASK(pos);
 +}
 +
 +void bitmap_clear(struct bitmap *self, size_t pos)
 +{
 + size_t block = BLOCK(pos);
 +
 + if (block  self-word_alloc)
 + self-words[block] = ~MASK(pos);
 +}
 +
 +int bitmap_get(struct bitmap *self, size_t 

Re: [PATCH 08/19] ewah: compressed bitmap implementation

2013-10-24 Thread Jeff King
On Thu, Oct 24, 2013 at 04:34:48PM -0700, Junio C Hamano wrote:

 Jeff King p...@peff.net writes:
 
  From: Vicent Marti tan...@gmail.com
 
  EWAH is a word-aligned compressed variant of a bitset (i.e. a data
  structure that acts as a 0-indexed boolean array for many entries).
 
 I think I've already pointed out some when the original series was
 posted, but this does not seem to compile with -Wdecl-after-stmt
 (possibly other violations may exist; I haven't looked very closely
 yet).

Sorry about that. I typically compile with -Wall -Werror which does
not catch this, and I didn't carefully go over the ewah code for style,
as I considered it mostly a black-box import (though certainly it is
worth fixing this case, as it's not just a style issue).

It looks like there is also some void pointer arithmetic...I _thought_
that was handled by -Wall, but I suppose not. Maybe I need to beef up my
warning settings.

Note that there is also some use of cast-mmap-file-to-struct, only one
instance of which uses __attribute__(packed). However, I notice that the
regular pack code is also guilty of this (e.g., see check_packed_git_idx),
so I don't know how careful we want to be (and whether we should be
using the packed attribute when needed, or if it is sufficiently
non-portable that we want to do it by hand in such cases).

-Peff
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html