| From: Andrew Cagney <[email protected]>

| I pushed the changes:
| 
| - first, the linked list and hash table were made generic 

Good.

| (I needed a hash
| table to manage PIDs)

Really?  How many PIDs do we juggle?

[ASIDE] when I learned my chops, core memory systems did not have
caches.  So Knuth didn't take them into account.  And neither do I.

At first caches were thought of as opportunistic improvements for
program performance.

The reality now is that cache misses are so expensive that any serious
design for performance needs to take them into account.  Too bad that
cache parameters are so variable.  Too bad some cache behaviour is beyond 
the control of the source code.
[/ASIDE]

If we expect only a handful of PIDs, a hash table is way
over-engineered.

If it actually is a performance issue, an array is faster than hashing
for sizes that are surprisingly large.

Perhaps what you want is an associative array abstraction.  Hashing is
only one possible implementation and not always the best.

On the other hand, from a maintenance standpoint, one mechanism is often 
better than two.

| The code was forked before 8e047dd53da900445dda569041ac602740d66de0 so
| some of that didn't get copied over (I noticed it while merging this week
| and testing was already painful enough); looking through it I think the
| always-circular change could be re-merged but as part of code explicitly
| allocating, initializing, and freeing the heaps and lists - to do dynamic
| sizing of the heap we'll need this anyway.

I made those changes because I was reading the code.  Like a kid
"looking" at a toy, my hands are often involved.

You can certainly adopt, or not, any part of those changes.  I thought
they were improvements but I leave it to you.

| - the state table contains 499 slots
| 
| There's a rule of thumb that says hash tables should be ~50% full (true?)

It all depends.  The rule you are talking about is way too
conservative in most cases.

1. A good hash performs decently until way closer to 100%.  I'd
   suspect 95% is safe.

2. "Good hashing" is not exposed to sinister opponents.  Sinister
   opponents can usually drive hashes into pathological behaviour.
   Like arranging for all traffic to hash to the same bucket.

   In previous mail I mentioned how we might be able to prevent such
   an attack -- adding salt to our hash.

3. open and closed hashing have different performance abstractions.

   In open hashing, a full table inevitably causes large clusters to
   form, causing performance to degrade towards linear search.  Cluster
   formation is a fascinating topic and is related to the physical
   phenomenon known as phase change, for which several
   Nobel Prizes have been awarded (i.e. it is interesting and deep
   and non-linear).

   We used closed hashing (my term for the opposite of open).  The
   performance abstraction is: search is linear with a fractional
   constant factor (1/(size of table)).  So a full table isn't a
   catastrophe.  Still, the larger the table, the better the
   performance.

4. We (too often?) scan our hash tables exhaustively.  This makes a
   large hash table expensive.

   There was one reason I did things this way.  I didn't want a lot of
   different pointers to a state object to persist across
   transactions.  The one pointer that persisted is the one in the
   state hash table.

   Why did I want only one pointer?  Because it is really easy to get
   an inconsistent state (I guess that's a pun) by having inconsistent
   pointers.  Pointer maintenance is really hard.  Look at how often
   we get that kind of thing wrong and how hard it is to debug the
   resulting mess.

   The state table was scanned only when a loop needed to look at
   every state.  This could be considered a hack.  But it was quite
   practical back in the day because it was needed so rarely.  Oh, and
   the hash table was pretty small.

   This may no longer be the right tradeoff.  And you've changed it
   anyway.

   so_serial_t has several purposes.  One was to be a "weak reference"
   to a state.  The so_serial_t value would designate a state object
   but would fail nicely if that state had disappeared.  Unfortunately
   dereferencing it used to take a linear search of the whole hash
   table.  It did not get dereferenced much.  But as the code evolved
   (possibly sloppily) the number of dereferences ballooned.

   Andrew added another hash table to make dereferencing so_serial_t
   faster.  Of course that means that there are now two pointers to a
   state.  Slightly scary.  The dangers should be manageable if the
   code to maintain all the pointers is embedded in one set of
   abstractions.

   Andrew also split the cookie-indexed hash table into two.  This
   means that scanning the one table no longer gets you to all states.
   I didn't see that such code was changed to scan both tables.
   Perhaps some bugs were created by this split.  I have not looked.

   I'm not sure that splitting these tables is a win.

   + fewer collisions

   - two tables to maintain

| but I've seen pluto with 1000 entries so we might need that dynamic resize
| code; hopefully I've structured things to be friendly to that change; later
| 
| - for icookies it uses:
| 
|          * 251 is a prime close to 256 (so like <<8).
|         for (unsigned j = 0; j < COOKIE_SIZE; j++) {
|                 hash = hash * 251 + icookie[j];
|         }

I would guess that the best way to do this is to combine as many bytes
of the cookie as possible in a lossless way (i.e. grab 4 or 8 byte
chunks, depending on the size of unsigned long), then combine the
chunks in an interesting hashy way (e.g. a * 11 + b * 13), and then
map this hash into a table index.  BTW, that last mapping need not be
a pure modulus; it is imaginable that discarding some of the low-order
information first might be a good idea.

In my experience, guessing about performance is usually inaccurate.
Measuring is required.  But only if it matters.

| - there's also an ordered list available

What's this used for?  Is the ordering useful?  Or is it free and
likely to make the (human) log reader less surprised?

Yet another pointer :-(

================

Much of the original code was written before the last few turns of 
Moore's law.  The cost of the hash table, per state, was one pointer.  Now 
we're up to twelve, each of which needs careful maintenance.
_______________________________________________
Swan-dev mailing list
[email protected]
https://lists.libreswan.org/mailman/listinfo/swan-dev
