| From: Andrew Cagney <[email protected]>
| I pushed the changes:
|
| - first, the linked list and hash table were made generic
Good.
| (I needed a hash
| table to manage PIDs)
Really? How many PIDs do we juggle?
[ASIDE] when I learned my chops, core memory systems did not have
caches. So Knuth didn't take them into account. And neither do I.
At first caches were thought of as opportunistic improvements for
program performance.
The reality now is that cache misses are so expensive that any serious
design for performance needs to take them into account. Too bad that
cache parameters are so variable. Too bad some cache behaviour is beyond
control of the source code.
[/ASIDE]
If we expect only a handful of PIDs, a hash table is way
over-engineered.
If it actually is a performance issue, an array is faster than hashing
for sizes that are surprisingly large.
Perhaps what you want is an associative array abstraction. Hashing is
only one possible implementation and not always the best.
On the other hand, from a maintenance standpoint, one mechanism is often
better than two.
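For a handful of PIDs, the associative-array abstraction can be backed by a plain array with linear search. A minimal sketch, assuming the PID count stays small (all names here are illustrative, not pluto code):

```c
/* Hypothetical small-map sketch: a contiguous array scanned linearly.
 * For small counts this is cache-friendly and hard to get wrong. */
#include <stddef.h>
#include <sys/types.h>

#define MAX_PIDS 64

struct pid_entry {
	pid_t pid;
	void *payload;	/* whatever state the PID maps to */
};

static struct pid_entry pid_map[MAX_PIDS];
static size_t pid_count;

static int pid_map_add(pid_t pid, void *payload)
{
	if (pid_count >= MAX_PIDS)
		return -1;
	pid_map[pid_count].pid = pid;
	pid_map[pid_count].payload = payload;
	pid_count++;
	return 0;
}

static void *pid_map_find(pid_t pid)
{
	/* One scan over contiguous memory: every probe hits the
	 * same few cache lines. */
	for (size_t i = 0; i < pid_count; i++) {
		if (pid_map[i].pid == pid)
			return pid_map[i].payload;
	}
	return NULL;
}
```

The point of the abstraction is that this implementation could later be swapped for a hash table without touching callers, if measurement ever justified it.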
| The code was forked before 8e047dd53da900445dda569041ac602740d66de0 so
| some of that didn't get copied over (I noticed it while merging this week
| and testing was already painful enough); looking through it I think the
| always-circular change could be re-merged, but as part of code explicitly
| allocating, initializing, and freeing the heaps and lists - to do dynamic
| sizing of the heap we'll need this anyway.
I made those changes because I was reading the code. Like a kid
"looking" at a toy, my hands are often involved.
You certainly can adopt or not any part of those changes. I thought
they were improvements but I leave it to you.
| - the state table contains 499 slots
|
| There's a rule of thumb that says hash tables should be ~50% full (true?)
It all depends. The rule you are talking about is way too
conservative in most cases.
1. A good hash performs decently until way closer to 100%. I'd
suspect 95% is safe.
2. "Good hashing" is not exposed to sinister opponents. Sinister
opponents can usually drive hashes into pathological behaviour.
Like: arranging all traffic to hash to the same bucket.
In previous mail I mentioned how we might be able to prevent such
an attack -- adding salt to our hash.
3. Open and closed hashing have different performance abstractions.
In open hashing, a full table inevitably causes large clusters to
form, degrading performance towards linear search. Cluster
formation is a fascinating topic; it is related to the physical
phenomenon known as a phase transition, for which several Nobel
Prizes have been awarded (i.e. it is interesting, deep, and
non-linear).
We used closed hashing (my term for the opposite of open). The
performance abstraction is: search is linear with a fractional
constant factor (1/(size of table)). So a full table isn't a
catastrophe. Still, the larger the table, the better the
performance.
4. We (too often?) scan our hash tables exhaustively. This makes a
large hash table expensive.
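The "closed" scheme from point 3 -- each bucket heading a linked list -- can be sketched as follows. With n entries in m buckets the expected search walks about n/m links, which is the fractional constant factor mentioned above. Types and names here are illustrative, not pluto's:

```c
/* Hypothetical chained-bucket sketch ("closed" hashing in the
 * terminology above): a full table degrades gently, not cliff-like. */
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 499	/* the prime the state table uses */

struct entry {
	uint32_t key;
	struct entry *next;	/* chain within one bucket */
};

static struct entry *buckets[NBUCKETS];

static void chain_insert(struct entry *e)
{
	struct entry **head = &buckets[e->key % NBUCKETS];
	e->next = *head;
	*head = e;
}

static struct entry *chain_find(uint32_t key)
{
	/* Walk only one chain: even at n == m the expected cost is
	 * about one comparison per lookup. */
	for (struct entry *e = buckets[key % NBUCKETS]; e != NULL; e = e->next) {
		if (e->key == key)
			return e;
	}
	return NULL;
}
```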
There was one reason I did things this way. I didn't want a lot of
different pointers to a state object to persist across
transactions. The one pointer that persisted is the one in the
state hash table.
Why did I want only one pointer? Because it is really easy to get
an inconsistent state (I guess that's a pun) by having inconsistent
pointers. Pointer maintenance is really hard. Look at how often
we get that kind of thing wrong and how hard it is to debug the
resulting mess.
The state table was scanned only when a loop needed to look at
every state. This could be considered a hack. But it was quite
practical back in the day because it was needed so rarely. Oh, and
the hash table was pretty small.
This may no longer be the right tradeoff. And you've changed it
anyway.
so_serial_t has several purposes. One was to be a "weak reference"
to a state. The so_serial_t value would designate a state object
but would fail nicely if that state had disappeared. Unfortunately
dereferencing it used to take a linear search of the whole hash
table. It did not get dereferenced much. But as the code evolved
(possibly sloppily) the number of dereferences ballooned.
Andrew added another hash table to make dereferencing so_serial_t
faster. Of course that means that there are now two pointers to a
state. Slightly scary. The dangers should be manageable if the
code to maintain all the pointers is embedded in one set of
abstractions.
Andrew also split the cookie-indexed hash table into two. This
means that scanning the one table no longer gets you to all states.
I didn't see that such code was changed to scan both tables.
Perhaps some bugs were created by this split. I have not looked.
I'm not sure that splitting these tables is a win.
+ fewer collisions
- two tables to maintain
| but I've seen pluto with 1000 entries so we might need that dynamic resize
| code; hopefully I've structured things to be friendly to that change; later
|
| - for icookies it uses:
|
| /* 251 is a prime close to 256 (so like <<8). */
| for (unsigned j = 0; j < COOKIE_SIZE; j++) {
| hash = hash * 251 + icookie[j];
| }
I would guess that the best way to do this is to combine as many bytes
of the cookie as possible in a lossless way (i.e. grab 4 or 8 byte
chunks, depending on the size of unsigned long), then combine the
chunks in an interesting hashy way (eg. a * 11 + b * 13), and then
map this hash into a table index. BTW, that last mapping need not be
a pure modulus; it is imaginable that discarding some of the
low-order information first might be a good idea.
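A sketch of that suggestion, folding in the salt idea from earlier mail as well. The constants (11, 13, the 3-bit shift) and the function name are illustrative guesses, not measured choices:

```c
/* Hypothetical sketch: combine the 8-byte cookie losslessly into
 * 4-byte chunks, mix the chunks multiplicatively, fold in a salt,
 * then map to a table index, discarding a few low-order bits
 * before the modulus. */
#include <stdint.h>
#include <string.h>

#define COOKIE_SIZE 8
#define STATE_TABLE_SIZE 499

static unsigned cookie_hash(const uint8_t icookie[COOKIE_SIZE], uint32_t salt)
{
	uint32_t a, b;

	/* Lossless: every cookie byte lands in a or b. */
	memcpy(&a, icookie, sizeof(a));
	memcpy(&b, icookie + sizeof(a), sizeof(b));

	/* Combine the chunks in an interesting hashy way; the salt
	 * frustrates attackers trying to precompute collisions. */
	uint64_t hash = (uint64_t)salt ^ ((uint64_t)a * 11 + (uint64_t)b * 13);

	/* Not a pure modulus: drop some low-order bits first. */
	return (unsigned)((hash >> 3) % STATE_TABLE_SIZE);
}
```

A real salt should come from a cryptographically strong source at daemon startup; that part is deliberately left out here.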
In my experience, guessing about performance is usually inaccurate.
Measuring is required. But only if it matters.
| - there's also an ordered list available
What's this used for? Is the ordering useful? Or is it free and
likely to make the (human) log reader less surprised?
Yet another pointer :-(
================
Much of the original code was written before the last few turns of
Moore's law. The cost of the hash table, per state, was one pointer. Now
we're up to twelve, each of which needs careful maintenance.
_______________________________________________
Swan-dev mailing list
[email protected]
https://lists.libreswan.org/mailman/listinfo/swan-dev