[PATCH 08/19] ewah: compressed bitmap implementation
From: Vicent Marti tan...@gmail.com EWAH is a word-aligned compressed variant of a bitset (i.e. a data structure that acts as a 0-indexed boolean array for many entries). It uses a 64-bit run-length encoding (RLE) compression scheme, trading some compression for better processing speed. The goal of this word-aligned implementation is not to achieve the best compression, but rather to improve query processing time. As it stands right now, this EWAH implementation will always be more efficient storage-wise than its uncompressed alternative. EWAH arrays will be used as the on-disk format to store reachability bitmaps for all objects in a repository while keeping reasonable sizes, in the same way that JGit does. This EWAH implementation is a mostly straightforward port of the original `javaewah` library that JGit currently uses. The library is self-contained and has been embedded whole (4 files) inside the `ewah` folder to ease redistribution. The library is re-licensed under the GPLv2 with the permission of Daniel Lemire, the original author. The source code for the C version can be found on GitHub: https://github.com/vmg/libewok The original Java implementation can also be found on GitHub: https://github.com/lemire/javaewah Signed-off-by: Vicent Marti tan...@gmail.com Signed-off-by: Jeff King p...@peff.net --- Makefile | 6 + ewah/bitmap.c | 226 + ewah/ewah_bitmap.c | 701 + ewah/ewah_io.c | 192 +++ ewah/ewah_rlw.c| 120 + ewah/ewok.h| 223 + ewah/ewok_rlw.h| 114 + 7 files changed, 1582 insertions(+) create mode 100644 ewah/bitmap.c create mode 100644 ewah/ewah_bitmap.c create mode 100644 ewah/ewah_io.c create mode 100644 ewah/ewah_rlw.c create mode 100644 ewah/ewok.h create mode 100644 ewah/ewok_rlw.h diff --git a/Makefile b/Makefile index 48ff0bd..eacb940 100644 --- a/Makefile +++ b/Makefile @@ -667,6 +667,8 @@ LIB_H += diff.h LIB_H += diffcore.h LIB_H += dir.h LIB_H += exec_cmd.h +LIB_H += ewah/ewok.h +LIB_H += ewah/ewok_rlw.h LIB_H += fetch-pack.h LIB_H += fmt-merge-msg.h LIB_H += fsck.h @@ -800,6 +802,10 @@ LIB_OBJS += dir.o LIB_OBJS += editor.o LIB_OBJS += entry.o LIB_OBJS += environment.o +LIB_OBJS += ewah/bitmap.o +LIB_OBJS += ewah/ewah_bitmap.o +LIB_OBJS += ewah/ewah_io.o +LIB_OBJS += ewah/ewah_rlw.o LIB_OBJS += exec_cmd.o LIB_OBJS += fetch-pack.o LIB_OBJS += fsck.o diff --git a/ewah/bitmap.c b/ewah/bitmap.c new file mode 100644 index 000..d248577 --- /dev/null +++ b/ewah/bitmap.c @@ -0,0 +1,226 @@ +/** + * Copyright 2013, GitHub, Inc + * Copyright 2009-2013, Daniel Lemire, Cliff Moon, + * David McIntosh, Robert Becho, Google Inc. and Veronika Zenz + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version 2 + * of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. + */ +#include assert.h +#include stdlib.h +#include string.h + +#include ewok.h + +#define MASK(x) ((eword_t)1 (x % BITS_IN_WORD)) +#define BLOCK(x) (x / BITS_IN_WORD) + +struct bitmap *bitmap_new(void) +{ + struct bitmap *bitmap = ewah_malloc(sizeof(struct bitmap)); + bitmap-words = ewah_calloc(32, sizeof(eword_t)); + bitmap-word_alloc = 32; + return bitmap; +} + +void bitmap_set(struct bitmap *self, size_t pos) +{ + size_t block = BLOCK(pos); + + if (block = self-word_alloc) { + size_t old_size = self-word_alloc; + self-word_alloc = block * 2; + self-words = ewah_realloc(self-words, self-word_alloc * sizeof(eword_t)); + + memset(self-words + old_size, 0x0, + (self-word_alloc - old_size) * sizeof(eword_t)); + } + + self-words[block] |= MASK(pos); +} + +void bitmap_clear(struct bitmap *self, size_t pos) +{ + size_t block = BLOCK(pos); + + if (block self-word_alloc) + self-words[block] = ~MASK(pos); +} + +int bitmap_get(struct bitmap *self, size_t pos) +{ + size_t block = BLOCK(pos); + return block self-word_alloc (self-words[block] MASK(pos)) != 0; +} + +struct ewah_bitmap *bitmap_to_ewah(struct bitmap *bitmap) +{ + struct ewah_bitmap *ewah = ewah_new(); + size_t i, running_empty_words = 0; + eword_t last_word = 0; + + for (i = 0; i
Re: [PATCH 08/19] ewah: compressed bitmap implementation
Jeff King p...@peff.net writes: From: Vicent Marti tan...@gmail.com EWAH is a word-aligned compressed variant of a bitset (i.e. a data structure that acts as a 0-indexed boolean array for many entries). I think I've already pointed out some when the original series was posted, but this does not seem to compile with -Wdecl-after-stmt (possibly other violations may exist; I haven't looked very closely yet). It uses a 64-bit run-length encoding (RLE) compression scheme, trading some compression for better processing speed. The goal of this word-aligned implementation is not to achieve the best compression, but rather to improve query processing time. As it stands right now, this EWAH implementation will always be more efficient storage-wise than its uncompressed alternative. EWAH arrays will be used as the on-disk format to store reachability bitmaps for all objects in a repository while keeping reasonable sizes, in the same way that JGit does. This EWAH implementation is a mostly straightforward port of the original `javaewah` library that JGit currently uses. The library is self-contained and has been embedded whole (4 files) inside the `ewah` folder to ease redistribution. The library is re-licensed under the GPLv2 with the permission of Daniel Lemire, the original author. The source code for the C version can be found on GitHub: https://github.com/vmg/libewok The original Java implementation can also be found on GitHub: https://github.com/lemire/javaewah Signed-off-by: Vicent Marti tan...@gmail.com Signed-off-by: Jeff King p...@peff.net --- Makefile | 6 + ewah/bitmap.c | 226 + ewah/ewah_bitmap.c | 701 + ewah/ewah_io.c | 192 +++ ewah/ewah_rlw.c| 120 + ewah/ewok.h| 223 + ewah/ewok_rlw.h| 114 + 7 files changed, 1582 insertions(+) create mode 100644 ewah/bitmap.c create mode 100644 ewah/ewah_bitmap.c create mode 100644 ewah/ewah_io.c create mode 100644 ewah/ewah_rlw.c create mode 100644 ewah/ewok.h create mode 100644 ewah/ewok_rlw.h diff --git a/Makefile b/Makefile index 48ff0bd..eacb940 100644 --- a/Makefile +++ b/Makefile @@ -667,6 +667,8 @@ LIB_H += diff.h LIB_H += diffcore.h LIB_H += dir.h LIB_H += exec_cmd.h +LIB_H += ewah/ewok.h +LIB_H += ewah/ewok_rlw.h LIB_H += fetch-pack.h LIB_H += fmt-merge-msg.h LIB_H += fsck.h @@ -800,6 +802,10 @@ LIB_OBJS += dir.o LIB_OBJS += editor.o LIB_OBJS += entry.o LIB_OBJS += environment.o +LIB_OBJS += ewah/bitmap.o +LIB_OBJS += ewah/ewah_bitmap.o +LIB_OBJS += ewah/ewah_io.o +LIB_OBJS += ewah/ewah_rlw.o LIB_OBJS += exec_cmd.o LIB_OBJS += fetch-pack.o LIB_OBJS += fsck.o diff --git a/ewah/bitmap.c b/ewah/bitmap.c new file mode 100644 index 000..d248577 --- /dev/null +++ b/ewah/bitmap.c @@ -0,0 +1,226 @@ +/** + * Copyright 2013, GitHub, Inc + * Copyright 2009-2013, Daniel Lemire, Cliff Moon, + * David McIntosh, Robert Becho, Google Inc. and Veronika Zenz + * + * This program is free software; you can redistribute it and/or + * modify it under the terms of the GNU General Public License + * as published by the Free Software Foundation; either version 2 + * of the License, or (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. + */ +#include assert.h +#include stdlib.h +#include string.h + +#include ewok.h + +#define MASK(x) ((eword_t)1 (x % BITS_IN_WORD)) +#define BLOCK(x) (x / BITS_IN_WORD) + +struct bitmap *bitmap_new(void) +{ + struct bitmap *bitmap = ewah_malloc(sizeof(struct bitmap)); + bitmap-words = ewah_calloc(32, sizeof(eword_t)); + bitmap-word_alloc = 32; + return bitmap; +} + +void bitmap_set(struct bitmap *self, size_t pos) +{ + size_t block = BLOCK(pos); + + if (block = self-word_alloc) { + size_t old_size = self-word_alloc; + self-word_alloc = block * 2; + self-words = ewah_realloc(self-words, self-word_alloc * sizeof(eword_t)); + + memset(self-words + old_size, 0x0, + (self-word_alloc - old_size) * sizeof(eword_t)); + } + + self-words[block] |= MASK(pos); +} + +void bitmap_clear(struct bitmap *self, size_t pos) +{ + size_t block = BLOCK(pos); + + if (block self-word_alloc) + self-words[block] = ~MASK(pos); +} + +int bitmap_get(struct bitmap *self, size_t
Re: [PATCH 08/19] ewah: compressed bitmap implementation
On Thu, Oct 24, 2013 at 04:34:48PM -0700, Junio C Hamano wrote: Jeff King p...@peff.net writes: From: Vicent Marti tan...@gmail.com EWAH is a word-aligned compressed variant of a bitset (i.e. a data structure that acts as a 0-indexed boolean array for many entries). I think I've already pointed out some when the original series was posted, but this does not seem to compile with -Wdecl-after-stmt (possibly other violations may exist; I haven't looked very closely yet). Sorry about that. I typically compile with -Wall -Werror which does not catch this, and I didn't carefully go over the ewah code for style, as I considered it mostly a black-box import (though certainly it is worth fixing this case, as it's not just a style issue). It looks like there is also some void pointer arithmetic...I _thought_ that was handled by -Wall, but I suppose not. Maybe I need to beef up my warning settings. Note that there is also some use of cast-mmap-file-to-struct, only one instance of which uses __attribute__(packed). However, I notice that the regular pack code is also guilty of this (e.g., see check_packed_git_idx), so I don't know how careful we want to be (and whether we should be using the packed attribute when needed, or if it is sufficiently non-portable that we want to do it by hand in such cases). -Peff -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html