Re: [PATCH v4 02/14] add a hashtable implementation that supports O(1) removal

2013-11-13 Thread Karsten Blees
Am 08.11.2013 18:08, schrieb Junio C Hamano:
> Karsten Blees  writes:
> 
>> What about this:
>>
>> #define HASHMAP_GROW_AT 80
>> #define HASHMAP_SHRINK_AT 16
> 
> I am not too enthused for three reasons. The fact that these are
> 100-based numbers is not written down anywhere other than the place
> they are used, 

Please forgive me that I didn't properly comment those lines when I tried to 
express my ideas in a mail :-)

> the places that use these need to consistently divide
> by 100, which invites unnecessary bugs, and compared to the
> original, you now require 16/100 but you didn't even want the exact
> 16% in the first place (i.e. a simple 1/6 was good enough, and it
> still is).
> 

Actually, we're looking for a value slightly smaller than load-factor / 
resize-factor (i.e. 0.8 / 4 = 0.2), and 1/6 just happens to be close. However, 
if we modify the load-factor or resize-factor, we also need to adjust the 
shrink threshold in some non-obvious way. E.g. with load-factor 0.6, shrink-at 
must be 1/7. If we grow / shrink by 3 bits, shrink-at must be 1/11...

My current work-in-progress version eliminates the _SHRINK_AT constant 
completely by basing the calculation on load-factor and resize-factor alone 
(i.e. it works for all values of load-factor and resize-factor without danger 
of introducing bugs):

/* grow / shrink by 2^2 */
#define HASHMAP_RESIZE_BITS 2
/* load factor in percent */
#define HASHMAP_LOAD_FACTOR 80

static void alloc_table(struct hashmap *map, unsigned int size)
{
	map->tablesize = size;
	map->table = xcalloc(size, sizeof(struct hashmap_entry *));

	/* calculate resize thresholds for new size */
	map->grow_at = (unsigned int) ((uint64_t) size * HASHMAP_LOAD_FACTOR / 100);
	if (size <= HASHMAP_INITIAL_SIZE)
		map->shrink_at = 0;
	else
		/*
		 * The shrink-threshold must be slightly smaller than
		 * (grow-threshold / resize-factor) to prevent erratic resizing,
		 * thus we divide by (resize-factor + 1).
		 */
		map->shrink_at = map->grow_at / ((1 << HASHMAP_RESIZE_BITS) + 1);
}


>>> Perhaps
>>>
>>> #define HASHMAP_GROW_AT(current) ((current) + ((current) >> 2))
>>> #define HASHMAP_SHRINK_AT(current) ((current) * 6)
>>> #define HASHMAP_GROW(current) ((current) << 2)
>>> #define HASHMAP_SHRINK(current) ((current) >> 2)
>>>
>>> may alleviate my worries; I dunno.
> 
 +
 +void hashmap_free(struct hashmap *map, hashmap_free_fn free_function)
 +{
>>>
>>> Why is free_function not part of the constants defined at
>>> hashmap_init() time?  Your API allows the same hashmap, depending on
>>> the way it has been used, to be cleaned up with different
>>> free_function, but I am not sure if that "flexibility" is intended
>>> (and in what application it would be useful).
>>>
>>
>> The free_function is a convenience so you don't have to loop over
>> the entries yourself. ...
>> ...a simple 'to free or not to free' boolean would suffice.
> 
> That is not the "flexibility" I was talking about. Your API allows
> one to write a single program that does this:
> 
>   struct hashmap map;
> 
>   hashmap_init(&map, compare_fn);
> add/put/remove on map;
> 
>   if (phase_of_moon())
>   hashmap_free(&map, free_them_in_one_way);
>   else
>   hashmap_free(&map, free_them_in_another_way);
> 
> Just like your _init takes a comparison function to make it clear
> that all entries will be compared using the same function throughout
> the life of the map, if it takes a free function (and you can use
> NULL to mean "do not free, I am managing elements myself"), I would
> think that it will make it clear that the elements in that map will
> be freed the same way.
> 
> And it will allow your _put to call that free function when you
> replace an existing entry with a new one, if that becomes necessary.
> The API in the posted version seems to make it the responsibility of the
> caller of _put to do whatever necessary clean-up to the returned
> value (which is the entry that was replaced and no longer in the
> hashmap), but within the context of a patch series whose later patch
> changes the API to replace or remove an entry from the index in such
> a way to shift the responsibility of freeing it from the caller to
> the callee, such a change to this API to make _put and _remove
> responsible for calling per-element free is a possibility you may
> want to consider, no?
> 

Using an entry after it has been removed is a pretty common use case. E.g. a 
multi-threaded application might want to remove the entry within a mutex and 
postpone any more expensive cleanup (e.g. updating a file, or even freeing the 
entry) until after leaving the mutex; a merge algorithm might want to move 
entries from source maps to a target map. If _remove/_put were also responsible 
for freeing the entry, you'd have to _get + copy + _remove/_put instead, wh

Re: [PATCH v4 02/14] add a hashtable implementation that supports O(1) removal

2013-11-08 Thread Junio C Hamano
Karsten Blees  writes:

> What about this:
>
> #define HASHMAP_GROW_AT 80
> #define HASHMAP_SHRINK_AT 16

I am not too enthused for three reasons. The fact that these are
100-based numbers is not written down anywhere other than the place
they are used, the places that use these need to consistently divide
by 100, which invites unnecessary bugs, and compared to the
original, you now require 16/100 but you didn't even want the exact
16% in the first place (i.e. a simple 1/6 was good enough, and it
still is).

>> Perhaps
>> 
>> #define HASHMAP_GROW_AT(current) ((current) + ((current) >> 2))
>> #define HASHMAP_SHRINK_AT(current) ((current) * 6)
>> #define HASHMAP_GROW(current) ((current) << 2)
>> #define HASHMAP_SHRINK(current) ((current) >> 2)
>> 
>> may alleviate my worries; I dunno.

>>> +
>>> +void hashmap_free(struct hashmap *map, hashmap_free_fn free_function)
>>> +{
>> 
>> Why is free_function not part of the constants defined at
>> hashmap_init() time?  Your API allows the same hashmap, depending on
>> the way it has been used, to be cleaned up with different
>> free_function, but I am not sure if that "flexibility" is intended
>> (and in what application it would be useful).
>> 
>
> The free_function is a convenience so you don't have to loop over
> the entries yourself. ...
> ...a simple 'to free or not to free' boolean would suffice.

That is not the "flexibility" I was talking about. Your API allows
one to write a single program that does this:

struct hashmap map;

hashmap_init(&map, compare_fn);
add/put/remove on map;

if (phase_of_moon())
hashmap_free(&map, free_them_in_one_way);
else
hashmap_free(&map, free_them_in_another_way);

Just like your _init takes a comparison function to make it clear
that all entries will be compared using the same function throughout
the life of the map, if it takes a free function (and you can use
NULL to mean "do not free, I am managing elements myself"), I would
think that it will make it clear that the elements in that map will
be freed the same way.

And it will allow your _put to call that free function when you
replace an existing entry with a new one, if that becomes necessary.
The API in the posted version seems to make it the responsibility of the
caller of _put to do whatever necessary clean-up to the returned
value (which is the entry that was replaced and no longer in the
hashmap), but within the context of a patch series whose later patch
changes the API to replace or remove an entry from the index in such
a way to shift the responsibility of freeing it from the caller to
the callee, such a change to this API to make _put and _remove
responsible for calling per-element free is a possibility you may
want to consider, no?

>>> +   if (map->tablesize > HASHMAP_INITIAL_SIZE &&
>>> +   map->size * HASHMAP_SHRINK_AT < map->tablesize)
>>> +   rehash(map, map->tablesize >> HASHMAP_GROW);
>> 
>> This "we shrink by the same amount" looks inconsistent with the use
>> of separate grow-at and shrink-at constants (see above for four
>> suggested #define's).
>> 
>
> These values account for a small hysteresis so that there is no size at which 
> a sequence of add, remove, add, remove (or put, put, put, put) results in 
> permanent resizes.

I was commenting on the two bottom lines of the above three line
quote from the patch.  You use SHRINK_AT to decide if you want to
shrink, and you use >>GROW to do the actual shrinking.  Why isn't it
like this instead?

	if (map->tablesize > HASHMAP_INITIAL_SIZE &&
	    HASHMAP_SHRINK_AT(map->size) < map->tablesize)
		rehash(map, map->tablesize >> HASHMAP_SHRINK);

The fact that the constant used for shrinking was not called SHRINK but
GROW was what caught my attention.
--
To unsubscribe from this list: send the line "unsubscribe git" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v4 02/14] add a hashtable implementation that supports O(1) removal

2013-11-08 Thread Philip Oakley

From: "Karsten Blees" 

However, defining the constants inversely is a bit unintuitive (i.e. 
1.25 instead of 0.8, 6 instead of 0.166). Perhaps the thresholds 
should also be calculated once on resize, not on every add / remove.


What about this:

#define HASHMAP_GROW_AT 80
#define HASHMAP_SHRINK_AT 16


micro-bikeshed: adding a code comment...

#define HASHMAP_GROW_AT 80	/* percent */
#define HASHMAP_SHRINK_AT 16	/* percent */



...in rehash:

map->grow_at = (unsigned int)((uint64_t) map->tablesize * HASHMAP_GROW_AT / 100);
map->shrink_at = (unsigned int)((uint64_t) map->tablesize * HASHMAP_SHRINK_AT / 100);




[No need to add to cc...]

regards

Philip 




Re: [PATCH v4 02/14] add a hashtable implementation that supports O(1) removal

2013-11-08 Thread Karsten Blees
Am 07.11.2013 22:40, schrieb Junio C Hamano:
> Karsten Blees  writes:
> 
>> +`void hashmap_add(struct hashmap *map, void *entry)`::
>> +
>> +Adds a hashmap entry. This allows adding duplicate entries (i.e.
>> +separate values with the same key according to hashmap_cmp_fn).
>> ++
>> +`map` is the hashmap structure.
>> ++
>> +`entry` is the entry to add.
>> +
>> +`void *hashmap_put(struct hashmap *map, void *entry)`::
>> +
>> +Adds or replaces a hashmap entry.
>> ++
>> +`map` is the hashmap structure.
>> ++
>> +`entry` is the entry to add or update.
>> ++
>> +Returns the previous entry, or NULL if not found (i.e. the entry was added).
> 
> What happens when there were duplicate entries previously added?
> All are replaced?  One of them is randomly chosen and gets replaced?
> 
> With s/replaced/removed/, the same question applies to hashmap_remove().
> 
> I am guessing that "we pick one at random and pretend as if other
> duplicates do not exist" is the answer, 

Exactly, however, you won't have duplicates if you use put exclusively.

> and I do not immediately
> have an objection to that semantics, but whatever the rule is, it
> needs to be spelled out.
> 

I'll clarify this.

>> +`void *hashmap_remove(struct hashmap *map, const void *key, const void 
>> *keydata)`::
>> +
>> +Removes a hashmap entry matching the specified key.
>> ...
>> +Usage example
>> +-------------
>> +
>> +Here's a simple usage example that maps long keys to double values.
>> +[source,c]
>> +
>> +struct hashmap map;
>> +
>> +struct long2double {
>> +struct hashmap_entry ent; /* must be the first member! */
>> +long key;
>> +double value;
>> +};
>> +
>> +static int long2double_cmp(const struct long2double *e1, const struct 
>> long2double *e2, const void *unused)
>> +{
>> +return !(e1->key == e2->key);
>> +}
>> +
>> +void long2double_init()
> 
> void long2double_init(void)
> 
>> +{
>> +hashmap_init(&map, (hashmap_cmp_fn) long2double_cmp, 0);
>> +}
>> +
>> +void long2double_free()
> 
> Likewise.
> 
>> diff --git a/hashmap.c b/hashmap.c
>> new file mode 100644
>> index 000..3a9f8c1
>> --- /dev/null
>> +++ b/hashmap.c
>> ...
>> +unsigned int memhash(const void *buf, size_t len)
>> +{
>> +unsigned int hash = FNV32_BASE;
>> +unsigned char *ucbuf = (unsigned char*) buf;
> 
> "(unsigned char *)buf"; pointer-asterisk does not stick to type.
> 

Ok, I'll recheck all casts.

>> +while (len--) {
>> +unsigned int c = *ucbuf++;
>> +hash = (hash * FNV32_PRIME) ^ c;
>> +}
>> +return hash;
>> +}
>> +
>> +unsigned int memihash(const void *buf, size_t len)
>> +{
>> +unsigned int hash = FNV32_BASE;
>> +unsigned char *ucbuf = (unsigned char*) buf;
>> +while (len--) {
>> +unsigned int c = *ucbuf++;
>> +if (c >= 'a' && c <= 'z')
>> +c -= 'a' - 'A';
>> +hash = (hash * FNV32_PRIME) ^ c;
>> +}
>> +return hash;
>> +}
>> +
>> +#define HASHMAP_INITIAL_SIZE 64
>> +/* grow / shrink by 2^2 */
>> +#define HASHMAP_GROW 2
>> +/* grow if > 80% full (to 20%) */
>> +#define HASHMAP_GROW_AT 1.25
>> +/* shrink if < 16.6% full (to 66.6%) */
>> +#define HASHMAP_SHRINK_AT 6
> 
> Maybe I am too old-fashioned, but seeing a floating-point constant
> in our otherwise pretty-much integer-only code makes me feel uneasy.
> "gcc -S -O2" does seem to generate floating point instruction for
> "i" but not for "j":
> 
> extern void inspect(unsigned int i, unsigned int j);
> 
> void grow(unsigned int i, unsigned int j)
> {
> i *= 1.25;
> j += j >> 2;
> inspect(i, j);
> }
> 

I guess there are no more i486SXs around, so using floating point should not be 
a problem (at least performance-wise...).

However, defining the constants inversely is a bit unintuitive (i.e. 1.25 
instead of 0.8, 6 instead of 0.166). Perhaps the thresholds should also be 
calculated once on resize, not on every add / remove.

What about this:

#define HASHMAP_GROW_AT 80
#define HASHMAP_SHRINK_AT 16

...in rehash:

map->grow_at = (unsigned int)((uint64_t) map->tablesize * HASHMAP_GROW_AT / 
100);
map->shrink_at = (unsigned int)((uint64_t) map->tablesize * HASHMAP_SHRINK_AT / 
100);

...in add:

size++;
if (map->size > map->grow_at)

...in remove:

size--;
if (map->size < map->shrink_at)

> Perhaps
> 
> #define HASHMAP_GROW_AT(current) ((current) + ((current) >> 2))
> #define HASHMAP_SHRINK_AT(current) ((current) * 6)
> #define HASHMAP_GROW(current) ((current) << 2)
> #define HASHMAP_SHRINK(current) ((current) >> 2)
> 
> may alleviate my worries; I dunno.
> 
>> +
>> +static inline int entry_equals(const struct hashmap *map,
>> +const struct hashmap_entry *e1, const struct hashmap_entry *e2,
>> +const void *keydata)
>> +{
>> +return (e1 == e2) || (e1->hash == e2->hash && !(*map->cmpfn)(e1, e2, 
>> keydata));
> 
> Our code tends not to say "this is a pointer to a function"
> explicitly, i.e. !map->cmpfn(e1, e2, keydata) is more in-line with our
> code and should also be sufficient.

Re: [PATCH v4 02/14] add a hashtable implementation that supports O(1) removal

2013-11-07 Thread Junio C Hamano
Karsten Blees  writes:

> +`void hashmap_add(struct hashmap *map, void *entry)`::
> +
> + Adds a hashmap entry. This allows adding duplicate entries (i.e.
> + separate values with the same key according to hashmap_cmp_fn).
> ++
> +`map` is the hashmap structure.
> ++
> +`entry` is the entry to add.
> +
> +`void *hashmap_put(struct hashmap *map, void *entry)`::
> +
> + Adds or replaces a hashmap entry.
> ++
> +`map` is the hashmap structure.
> ++
> +`entry` is the entry to add or update.
> ++
> +Returns the previous entry, or NULL if not found (i.e. the entry was added).

What happens when there were duplicate entries previously added?
All are replaced?  One of them is randomly chosen and gets replaced?

With s/replaced/removed/, the same question applies to hashmap_remove().

I am guessing that "we pick one at random and pretend as if other
duplicates do not exist" is the answer, and I do not immediately
have an objection to that semantics, but whatever the rule is, it
needs to be spelled out.

> +`void *hashmap_remove(struct hashmap *map, const void *key, const void 
> *keydata)`::
> +
> + Removes a hashmap entry matching the specified key.
> ...
> +Usage example
> +-------------
> +
> +Here's a simple usage example that maps long keys to double values.
> +[source,c]
> +
> +struct hashmap map;
> +
> +struct long2double {
> + struct hashmap_entry ent; /* must be the first member! */
> + long key;
> + double value;
> +};
> +
> +static int long2double_cmp(const struct long2double *e1, const struct 
> long2double *e2, const void *unused)
> +{
> + return !(e1->key == e2->key);
> +}
> +
> +void long2double_init()

void long2double_init(void)

> +{
> + hashmap_init(&map, (hashmap_cmp_fn) long2double_cmp, 0);
> +}
> +
> +void long2double_free()

Likewise.

> diff --git a/hashmap.c b/hashmap.c
> new file mode 100644
> index 000..3a9f8c1
> --- /dev/null
> +++ b/hashmap.c
> ...
> +unsigned int memhash(const void *buf, size_t len)
> +{
> + unsigned int hash = FNV32_BASE;
> + unsigned char *ucbuf = (unsigned char*) buf;

"(unsigned char *)buf"; pointer-asterisk does not stick to type.

> + while (len--) {
> + unsigned int c = *ucbuf++;
> + hash = (hash * FNV32_PRIME) ^ c;
> + }
> + return hash;
> +}
> +
> +unsigned int memihash(const void *buf, size_t len)
> +{
> + unsigned int hash = FNV32_BASE;
> + unsigned char *ucbuf = (unsigned char*) buf;
> + while (len--) {
> + unsigned int c = *ucbuf++;
> + if (c >= 'a' && c <= 'z')
> + c -= 'a' - 'A';
> + hash = (hash * FNV32_PRIME) ^ c;
> + }
> + return hash;
> +}
> +
> +#define HASHMAP_INITIAL_SIZE 64
> +/* grow / shrink by 2^2 */
> +#define HASHMAP_GROW 2
> +/* grow if > 80% full (to 20%) */
> +#define HASHMAP_GROW_AT 1.25
> +/* shrink if < 16.6% full (to 66.6%) */
> +#define HASHMAP_SHRINK_AT 6

Maybe I am too old-fashioned, but seeing a floating-point constant
in our otherwise pretty-much integer-only code makes me feel uneasy.
"gcc -S -O2" does seem to generate floating point instruction for
"i" but not for "j":

extern void inspect(unsigned int i, unsigned int j);

void grow(unsigned int i, unsigned int j)
{
i *= 1.25;
j += j >> 2;
inspect(i, j);
}

Perhaps

#define HASHMAP_GROW_AT(current) ((current) + ((current) >> 2))
#define HASHMAP_SHRINK_AT(current) ((current) * 6)
#define HASHMAP_GROW(current) ((current) << 2)
#define HASHMAP_SHRINK(current) ((current) >> 2)

may alleviate my worries; I dunno.

> +
> +static inline int entry_equals(const struct hashmap *map,
> + const struct hashmap_entry *e1, const struct hashmap_entry *e2,
> + const void *keydata)
> +{
> + return (e1 == e2) || (e1->hash == e2->hash && !(*map->cmpfn)(e1, e2, 
> keydata));

Our code tends not to say "this is a pointer to a function"
explicitly, i.e.

!map->cmpfn(e1, e2, keydata)

is more in-line with our code and should also be sufficient.

> +}
> +
> +static inline unsigned int bucket(const struct hashmap *map,
> + const struct hashmap_entry *key)
> +{
> + return key->hash & (map->tablesize - 1);
> +}
> +
> +static void rehash(struct hashmap *map, unsigned int newsize)
> +{
> + unsigned int i, oldsize = map->tablesize;
> + struct hashmap_entry **oldtable = map->table;
> +
> + map->tablesize = newsize;
> + map->table = xcalloc(sizeof(struct hashmap_entry*), map->tablesize);

Even though multiplication is commutative, please match the official
function signature, i.e.

calloc(size_t nmemb, size_t size)

where the number of elements comes first and size of an element
comes next.  I.e.

xcalloc(map->tablesize, sizeof(struct hashmap_entry *));

Also don't forget the SP between type and asterisk.

> + for (i = 0; i < oldsize; i++) {
> + struct hashmap_entry *e = oldtable[i];
> +