Re: Hash indexing of HFiles

Stack Mon, 18 Jul 2011 09:06:10 -0700

On Mon, Jul 18, 2011 at 4:04 AM, Claudio Martella
<[email protected]> wrote:
> No, you can have collisions, so the index is not perfect (which means
> you can have buckets for colliding keys and empty unused entries in the
> hashtable directory).


Well, if a perfect index is what you are after, you can generate
hashing functions that avoid collisions (You know this quality work by
your fellow countrymen:
http://sux.dsi.unimi.it/docs/it/unimi/dsi/sux4j/mph/MinimalPerfectHashFunction.html?)

> Also, the index stores only offsets, not the keys. So they overhead of
> the index is sizeof(long) * number of key-value pairs in the file (in
> the optimal case, the overhead is 16 bytes for each colliding key-value
> pair as it's managed through a linked-list). For this reason using a
> load-factor > 1 can actually save memory and increase performance.
>

OK

> Thank you very much for all this information. I'll try to implement a
> prototype in September, I'm pretty busy with deadlines right now.
>

Good stuff,
St.Ack

Re: Hash indexing of HFiles

Reply via email to