Re: [PATCH 08/16] ewah: compressed bitmap implementation

2013-06-25 Thread Thomas Rast
Vicent Marti tan...@gmail.com writes:

 The library is re-licensed under the GPLv2 with the permission of Daniel
 Lemire, the original author.

This says GPLv2, but the license blurbs all say or (at your option)
any later version.  IANAL, does this cause any problems?  If so, can
they be GPLv2-only instead?

  Makefile   |6 +
  ewah/bitmap.c  |  229 +
  ewah/ewah_bitmap.c |  703 
 
  ewah/ewah_io.c |  199 +++
  ewah/ewah_rlw.c|  124 +
  ewah/ewok.h|  194 +++
  ewah/ewok_rlw.h|  114 +

Can we have a Documentation/technical/api-ewah.txt?

(Maybe if you insert all the comments I ask for in the below, it's not
necessary, but it would still be nice to have some central place where
the formats are documented.)

[...]
 +struct ewah_bitmap *bitmap_to_ewah(struct bitmap *bitmap)
 +{
 + struct ewah_bitmap *ewah = ewah_new();
 + size_t i, running_empty_words = 0;
 + eword_t last_word = 0;
 +
 + for (i = 0; i  bitmap-word_alloc; ++i) {
 + if (bitmap-words[i] == 0) {
 + running_empty_words++;
 + continue;
 + }
 +
 + if (last_word != 0) {
 + ewah_add(ewah, last_word);
 + }

There are a lot of noisy braces -- like in this instance -- if you
apply the git style to the files in ewah/.  I assume we'll give the
directory its own style, so that it should always use braces even on
one-line blocks.

[...]
 + ewah_add(ewah, last_word);
 + return ewah;
 +}
 +
 +struct bitmap *ewah_to_bitmap(struct ewah_bitmap *ewah)
 +{
 + struct bitmap *bitmap = bitmap_new();
 + struct ewah_iterator it;
 + eword_t blowup;
 + size_t i = 0;
 +
 + ewah_iterator_init(it, ewah);
 +
 + while (ewah_iterator_next(blowup, it)) {
 + if (i = bitmap-word_alloc) {
 + bitmap-word_alloc *= 1.5;

Any reason that this uses a scale factor of 1.5, while the bitmap_set
operation above uses 2?

 + bitmap-words = ewah_realloc(
 + bitmap-words, bitmap-word_alloc * 
 sizeof(eword_t));
 + }
[...]
 +
 +void bitmap_each_bit(struct bitmap *self, ewah_callback callback, void *data)
 +{
[...]
 + for (offset = 0; offset  BITS_IN_WORD; ++offset) {
 + if ((word  offset) == 0)
 + break;
 +
 + offset += __builtin_ctzll(word  offset);

Here and in the rest, you use __builtin_* within the code.  This needs
to be either in a separate helper that reimplements the function in
terms of C if it is not available (i.e. you don't use GCC).
(Alternatively, the whole series could be conditional on some
HAVE_GCC_BUILTINS macro.  I'd think that would be a bad tradeoff
though.)

 + callback(pos + offset, data);
 + }
 + pos += BITS_IN_WORD;
 + }
 + }
 +}
[...]

 diff --git a/ewah/ewah_bitmap.c b/ewah/ewah_bitmap.c
[...]
 +void ewah_free(struct ewah_bitmap *bitmap)
 +{
 + if (bitmap-alloc_size)
 + free(bitmap-buffer);
 +
 + free(bitmap);
 +}

Maybe first if (!bitmap) return, so that it behaves like other free()s?

 diff --git a/ewah/ewah_io.c b/ewah/ewah_io.c
[...]
 +int ewah_serialize_native(struct ewah_bitmap *self, int fd)
 +{
 + uint32_t write32;
 + size_t to_write = self-buffer_size * 8;
 +
 + /* 32 bit -- bit size fr the map */

You cutpasted the typo (for) throughout the file :-)

[...]
 + /** 32 bit -- number of compressed 64-bit words */
 + write32 = (uint32_t)self-buffer_size;
 + if (write(fd, write32, 4) != 4)
 + return -1;
 +
 + if (write(fd, self-buffer, to_write) != to_write)
 + return -1;

Shouldn't you use our neat write_in_full() and read_in_full() helpers,
throughout the file?

[...]
 diff --git a/ewah/ewok.h b/ewah/ewok.h
[...]
 +#ifndef __EWOK_BITMAP_C__
 +#define __EWOK_BITMAP_C__

_H_?

 +#ifndef ewah_malloc
 +#define ewah_malloc malloc
 +#endif
 +#ifndef ewah_realloc
 +#define ewah_realloc realloc
 +#endif
 +#ifndef ewah_calloc
 +#define ewah_calloc calloc
 +#endif

I see you later #define them to the corresponding x*alloc version in
pack-bitmap.h.  Good.

 +
 +typedef uint64_t eword_t;

I assume this isn't ifdef'd to help 32bit platforms because the on-disk
format depends on it?

 +#define BITS_IN_WORD (sizeof(eword_t) * 8)
 +
 +struct ewah_bitmap {
 + eword_t *buffer;
 + size_t buffer_size;
 + size_t alloc_size;
 + size_t bit_size;
 + eword_t *rlw;
 +};
 +
 +typedef void (*ewah_callback)(size_t pos, void *);
 +
 +struct ewah_bitmap *ewah_pool_new(void);
 +void ewah_pool_free(struct ewah_bitmap *bitmap);

How do the pool versions differ from the non-pool versions below?  I
would have expected 

Re: [PATCH 08/16] ewah: compressed bitmap implementation

2013-06-25 Thread Junio C Hamano
Junio C Hamano gits...@pobox.com writes:

 Vicent Marti tan...@gmail.com writes:

 The library is re-licensed under the GPLv2 with the permission of Daniel
 Lemire, the original author. The source code for the C version can
 be found on GitHub:

  https://github.com/vmg/libewok

 The original Java implementation can also be found on GitHub:

  https://github.com/lemire/javaewah
 ---

 Please make sure that all patches are properly signed off.

  Makefile   |6 +
  ewah/bitmap.c  |  229 +
  ewah/ewah_bitmap.c |  703 
 
  ewah/ewah_io.c |  199 +++
  ewah/ewah_rlw.c|  124 +
  ewah/ewok.h|  194 +++
  ewah/ewok_rlw.h|  114 +

 This is lovely.  A few comments after an initial quick scan-through.

  - The code and the headers are well commented, which is good.

  - What's __builtin_popcountll() doing there in a presumably generic
codepath?

  - Two variants of bitmap are given different and easy to
understand type names (vanilla one is bitmap, the clever one is
ewah_bitmap), but at many places, a pointer to ewah_bitmap is
simply called bitmap or bitmap_i without ewah anywhere,
which waas confusing to read.  Especially, the NAND operation
for bitmap takes two bitmaps, while OR takes one bitmap and
ewah_bitmap.  That is fine as long as the combination is
convenient for callers, but I wished the ewah variables be called
with ewah somewhere in their names.

  - I compile with -Werror -Wdeclaration-after-statement; some
places seem to trigger it.

  - Some extern declarations in *.c sources were irritating;
shouldn't they be declared in *.h file and included?

  - There are some instances of if (condition) stmt; on a single
line; looked irritating.   

  - bool is not a C type we use (and not a particularly good type
in C++, either).

One more.

  - Use of unnecessary float (e.g. oldval *= 1.5) were moderately
annoying.


 That is it for now. I am looking forward to read through the users
 of the library ;-)

 Thanks for working on this.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 08/16] ewah: compressed bitmap implementation

2013-06-24 Thread Vicent Marti
EWAH is a word-aligned compressed variant of a bitset (i.e. a data
structure that acts as a 0-indexed boolean array for many entries).

It uses a 64-bit run-length encoding (RLE) compression scheme,
trading some compression for better processing speed.

The goal of this word-aligned implementation is not to achieve
the best compression, but rather to improve query processing time.
As it stands right now, this EWAH implementation will always be more
efficient storage-wise than its uncompressed alternative.

EWAH arrays will be used as the on-disk format to store reachability
bitmaps for all objects in a repository while keeping reasonable sizes,
in the same way that JGit does.

This EWAH implementation is a mostly straightforward port of the
original `javaewah` library that JGit currently uses. The library is
self-contained and has been embedded whole (4 files) inside the `ewah`
folder to ease redistribution.

The library is re-licensed under the GPLv2 with the permission of Daniel
Lemire, the original author. The source code for the C version can
be found on GitHub:

https://github.com/vmg/libewok

The original Java implementation can also be found on GitHub:

https://github.com/lemire/javaewah
---
 Makefile   |6 +
 ewah/bitmap.c  |  229 +
 ewah/ewah_bitmap.c |  703 
 ewah/ewah_io.c |  199 +++
 ewah/ewah_rlw.c|  124 +
 ewah/ewok.h|  194 +++
 ewah/ewok_rlw.h|  114 +
 7 files changed, 1569 insertions(+)
 create mode 100644 ewah/bitmap.c
 create mode 100644 ewah/ewah_bitmap.c
 create mode 100644 ewah/ewah_io.c
 create mode 100644 ewah/ewah_rlw.c
 create mode 100644 ewah/ewok.h
 create mode 100644 ewah/ewok_rlw.h

diff --git a/Makefile b/Makefile
index e01506d..e03c773 100644
--- a/Makefile
+++ b/Makefile
@@ -672,6 +672,8 @@ LIB_H += diff.h
 LIB_H += diffcore.h
 LIB_H += dir.h
 LIB_H += exec_cmd.h
+LIB_H += ewah/ewok.h
+LIB_H += ewah/ewok_rlw.h
 LIB_H += fetch-pack.h
 LIB_H += fmt-merge-msg.h
 LIB_H += fsck.h
@@ -802,6 +804,10 @@ LIB_OBJS += dir.o
 LIB_OBJS += editor.o
 LIB_OBJS += entry.o
 LIB_OBJS += environment.o
+LIB_OBJS += ewah/bitmap.o
+LIB_OBJS += ewah/ewah_bitmap.o
+LIB_OBJS += ewah/ewah_io.o
+LIB_OBJS += ewah/ewah_rlw.o
 LIB_OBJS += exec_cmd.o
 LIB_OBJS += fetch-pack.o
 LIB_OBJS += fsck.o
diff --git a/ewah/bitmap.c b/ewah/bitmap.c
new file mode 100644
index 000..75ca8fd
--- /dev/null
+++ b/ewah/bitmap.c
@@ -0,0 +1,229 @@
+/**
+ * Copyright 2013, GitHub, Inc
+ * Copyright 2009-2013, Daniel Lemire, Cliff Moon,
+ * David McIntosh, Robert Becho, Google Inc. and Veronika Zenz
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; either version 2
+ * of the License, or (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA  02110-1301, 
USA.
+ */
+#include assert.h
+#include stdlib.h
+#include string.h
+
+#include ewok.h
+
+#define MASK(x) ((eword_t)1  (x % BITS_IN_WORD))
+#define BLOCK(x) (x / BITS_IN_WORD)
+
+struct bitmap *bitmap_new(void)
+{
+   struct bitmap *bitmap = ewah_malloc(sizeof(struct bitmap));
+   bitmap-words = ewah_calloc(32, sizeof(eword_t));
+   bitmap-word_alloc = 32;
+   return bitmap;
+}
+
+void bitmap_set(struct bitmap *self, size_t pos)
+{
+   size_t block = BLOCK(pos);
+
+   if (block = self-word_alloc) {
+   size_t old_size = self-word_alloc;
+   self-word_alloc = block * 2;
+   self-words = ewah_realloc(self-words, self-word_alloc * 
sizeof(eword_t));
+
+   memset(self-words + old_size, 0x0,
+   (self-word_alloc - old_size) * sizeof(eword_t));
+   }
+
+   self-words[block] |= MASK(pos);
+}
+
+void bitmap_clear(struct bitmap *self, size_t pos)
+{
+   size_t block = BLOCK(pos);
+
+   if (block  self-word_alloc)
+   self-words[block] = ~MASK(pos);
+}
+
+bool bitmap_get(struct bitmap *self, size_t pos)
+{
+   size_t block = BLOCK(pos);
+   return block  self-word_alloc  (self-words[block]  MASK(pos)) != 
0;
+}
+
+extern size_t ewah_add_empty_words(struct ewah_bitmap *self, bool v, size_t 
number);
+extern size_t ewah_add(struct ewah_bitmap *self, eword_t word);
+
+struct ewah_bitmap *bitmap_to_ewah(struct bitmap *bitmap)
+{
+   struct ewah_bitmap *ewah = ewah_new();
+   size_t i, running_empty_words = 0;
+   eword_t 

Re: [PATCH 08/16] ewah: compressed bitmap implementation

2013-06-24 Thread Junio C Hamano
Vicent Marti tan...@gmail.com writes:

 The library is re-licensed under the GPLv2 with the permission of Daniel
 Lemire, the original author. The source code for the C version can
 be found on GitHub:

   https://github.com/vmg/libewok

 The original Java implementation can also be found on GitHub:

   https://github.com/lemire/javaewah
 ---

Please make sure that all patches are properly signed off.

  Makefile   |6 +
  ewah/bitmap.c  |  229 +
  ewah/ewah_bitmap.c |  703 
 
  ewah/ewah_io.c |  199 +++
  ewah/ewah_rlw.c|  124 +
  ewah/ewok.h|  194 +++
  ewah/ewok_rlw.h|  114 +

This is lovely.  A few comments after an initial quick scan-through.

 - The code and the headers are well commented, which is good.

 - What's __builtin_popcountll() doing there in a presumably generic
   codepath?

 - Two variants of bitmap are given different and easy to
   understand type names (vanilla one is bitmap, the clever one is
   ewah_bitmap), but at many places, a pointer to ewah_bitmap is
   simply called bitmap or bitmap_i without ewah anywhere,
   which waas confusing to read.  Especially, the NAND operation
   for bitmap takes two bitmaps, while OR takes one bitmap and
   ewah_bitmap.  That is fine as long as the combination is
   convenient for callers, but I wished the ewah variables be called
   with ewah somewhere in their names.

 - I compile with -Werror -Wdeclaration-after-statement; some
   places seem to trigger it.

 - Some extern declarations in *.c sources were irritating;
   shouldn't they be declared in *.h file and included?

 - There are some instances of if (condition) stmt; on a single
   line; looked irritating.   

 - bool is not a C type we use (and not a particularly good type
   in C++, either).

That is it for now. I am looking forward to read through the users
of the library ;-)

Thanks for working on this.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html