This makes it sound as if you might need to search two groups for a record, 
which is not correct.  If the initial hash is based on the larger modulo, and 
the group exists, then the key will be in the higher number group.  If the 
result of the first hash is larger than the modulus of the of the table, then 
you rehash with the smaller modulus.

And the modulo used for hashing is always a power of two. 

So if the initial hash function on a key is f(x), then the key will either be 
in f(x) mod 2**n or, if that group has not been created, then in f(x) mod 
2**(n-1).  n increases each time that the modulus grows to equal 2**n+1. So

For a modulus of 3 or 4, n = 2; for 5,6,7,8, n =3.

For instance:

If your groups are numbered 0-5 (6 groups), and your hash value is 5, then you 
are in the last (6th) group because the 5 mod 8 is 5.  Likewise 6 mod 8 is 6, 
but this would be beyond the highest group we have (5).  6 mod 4 is 2, and that 
is the group where 6 should fall. Likewise 7 should fall into group 3.  After 
two more splits (of groups 2 & 3) the modulus will be 8, and no rehashing is 
necessary until we next split group 0 and add a 9th group, at which point we 
start with a mod 16, and use 8 if the first result is over 8 (8 would go into 
the 9th group, 9 would hash into the second group, #1 -- 9 mod 4 -> 1.

Admittedly, this is probably at least as confusing as every other explanation 
of the process ;-)


On Jul 4, 2012, at 3:26 AM, Brian Leach wrote:

>> All the other groups effectively get 1 added to their number
> Not exactly.
> Sorry to those who already know this, but maybe it's time to go over linear
> hashing in theory ..
> Linear hashing was a system devised by Litwin and originally only for
> in-memory lists. In fact there's some good implementations in C# that
> provide better handling of Dictionary types. Applying it to a file system
> adds some complexity but it's basically the same theory.
> Let's start with a file that has 100 groups initially defined (that's 0
> through 99). That is your minimum starting point and should ensure that it
> never shrinks below that, so it doesn't begin it's life with loads of splits
> right from the start as you populate the file. You would size this similarly
> to the way you size a regular hashed file for your initial content: no point
> making work for yourself (or the database).
> As data gets added, because the content is allocated unevenly, some of that
> load will be in primary and some in overflow: that's just the way of the
> world. No hashing is perfect. Unlike a static file, the overflow can't be
> added to the end of the file as a linked list (* why nobody has done managed
> overflow is beyond me), it has to sit in a separate file.
> At some point the amount of data held in respect of the available space
> reaches a critical level and the file needs to reorganize. Rather than split
> the most heavily populated group - which would be the obvious thing - linear
> hashing works on the basis of a split pointer that moves incrementally
> through the file. So the first split breaks group 0 and adds group 100 to
> the end of the file, hopefully moving around half the content of group 0 to
> this new group. Of course, there is no guarantee that it will (depending on
> key structure) and also no guarantee that this will help anything, if group
> 0 isn't overflowed or populated anyway. So the next write may also cause a
> split, except now to split group 1 into a new group 101, and so forth.
> Eventually the pointer will reach the end and all the initial 100 groups
> will have been split, and the whole process restarts with the split pointer
> moving back to zero. You now have 200 groups and by this time everything
> should in theory have levelled out, but in the meantime there is still
> overloading and stuff will still be in overflow. The next split will create
> group 200 and split half of group 0 into it, and the whole process repeats
> for ever.
> Oversized records (> buffer size) also get moved out because they stuff up
> the block allocation.
> So why this crazy system, rather than hitting the filled groups as they get
> overstuffed? Because it makes finding a record easy. Because linear hashing
> is based on a power of 2, the maths is simple - if the group is after the
> split point, the record MUST be in that group (or its overflow). If it is
> before the split point, it could be in the original group or the split
> group: so you can just rehash with double the modulus to check which one
> without even having to scan the groups.
> What makes the implementation difficult is that Litwin et al were all
> assuming a single threaded access to an in-memory list. Concurrent access
> whilst maintaining the load factor, split pointer and splitting all add a
> lot more complexity, unless you lock the whole file for the duration of an
> IO operation and kill the performance.
> And coming back to the manual, storing large numbers of data items - even
> large ones - in a type 19 file is a bad idea. Traversing directories is
> slow, especially in Windows, and locking is done against the whole
> directory..
> Brian
> _______________________________________________
> U2-Users mailing list

U2-Users mailing list

Reply via email to