On Sun, Nov 5, 2023 at 9:50 PM Rob Landley <[email protected]> wrote:
>
> On 11/5/23 21:22, Ray Gardner wrote:
> > On 11/3/23 12:36:30, Rob Landley wrote:
> >> > As to the hash code, the only part I've taken from Python's
> >> > implementation is the probe sequence
> >>
> >> So... NOT the "linked lists descending from each hash bucket" approach
> >> then.
> >>
> >> > probe = (probe * 5 + 1 + (perturb >>= 5)) & hashmask
> >> > where hashmask is 2**n-1 (for table size 2**n) and perturb is
> >> > initially the entire original key hash.
> >>
> >> I had a class on this in college 20 years ago, and literally have not
> >> used it since, because "the hash table can entirely fill up" and "once
> >> the hash table is more than half full everything starts colliding just
> >> from displacements", and "picking the right hash algorithm is now
> >> _extra_ magic"...
> >
> > The chained hash table (linked lists from buckets) can grow
> > indefinitely but will slow down as the chains get longer, unless
> > there's code to rebuild it with more buckets.
>
> I.E. both approaches benefit from garbage collection style repacking,
> but one degrades gracefully without it and the other locks up. (The
> non-list version mostly seems to be about saving the extra pointer and
> memory fragmentation.)
>
> I didn't mention "I poked at this way back when" to claim nontrivial
> expertise, I was trying to say that my aversion to the area was not
> entirely based on ignorance, and you're not convincing me I was wrong.
>
> > The other approach, "open addressing" (named in 1957) does the thing
> > with an initial probe and then subsequent probes as needed until an
> > open slot is found. The table size is either prime or a power of 2.
> > Power of 2 is nice because you can do the initial probe by masking the
> > original hash by 'and'ing with (2**n-1). An advantage is when growing
> > it, you can double it, while growing a prime-size table needs finding
> > a suitable larger prime (though now that I think of it, maybe a
> > smallish table of primes would do fine, and allow growing by smaller
> > increments. I'll have to try that...)
> >
> > The probe sequence can be j, j+1 mod m, j+2 mod m, ... (for table size
> > m) but that's terrible. In 1968 Guy de Balbine came up with double
> > hashing, where the sequence is j+k mod m, j+2k mod m, ... where k is
> > some different hash of the same key, but must also be non-zero and
> > relatively prime to m. For prime table size, that's anything from 1 to
> > m-1. For a power-of-two table, that's any odd number 1 to m-1. (The
> > main reason, I think, that table size is always prime or 2**n.)
>
> See, it's needing to know this many different things about the space
> that drove me away from it.
>
> I never did a survey of compression algorithms to see which was best
> for which use, either. I'm aware of arj and arc and zoo and so on, but
> "which is better" is NOT MY PROBLEM. Once a selection was externally
> made (because tar.gz and tar.bz2 and tar.xz are out there on websites),
> I'd dig in as much as necessary. But "is this an approach worth taking"
> is a SEPARATE QUESTION.
>
> Nor did I dig deeply into the dozens of different kinds of sort
> algorithms. I implemented a heap sort in college because that seemed
> the most general purpose one, and then got yelled at when I tried to
> use it by People With Opinions.
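To make the double hashing above concrete, here's a minimal sketch of a
lookup probe over a power-of-two table. It's not code from toybox or from
Python, just the scheme as described; slot, m, h1, and h2 are made-up
names, and it assumes the table always has at least one empty slot so the
loop terminates:

  #include <string.h>

  // Sketch of de Balbine double hashing over a table of size m = 2**n:
  // probe j, j+k, j+2k, ... mod m, with the stride k forced odd so it's
  // relatively prime to m and the probe sequence visits every slot.
  // Returns the index of the matching key or of the first empty slot.
  unsigned probe(char **slot, unsigned m, char *key,
                 unsigned h1, unsigned h2)
  {
    unsigned j = h1 & (m-1);        // initial probe: mask with 2**n - 1
    unsigned k = (h2 & (m-1)) | 1;  // second hash as stride, forced odd

    while (slot[j] && strcmp(slot[j], key)) j = (j+k) & (m-1);

    return j;
  }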
>
> I wrote the kernel's "Red Black Trees" documentation because somebody
> asked me to, and despite that I never DID dig up an explanation of HOW
> you rebalance the darn things. (Even the documentation I wrote DOES NOT
> SAY because I DID NOT KNOW, and yet people kept complimenting me on
> it...) I eventually dug through the kernel code and worked out at least
> how THEY were doing it (unwrapping nested macros if I recall; I don't
> remember the answer but I _do_ remember the slog), and have an "arbys.c"
> that could become a lib/rbtree.c if I ever need a "dictionary" data
> structure (thus avoiding the hash table question entirely)... but I have
> yet to wind up needing it. Even the recent tar thing, I told people to
> poke me if the performance/scalability was insufficient and I'm still
> waiting to hear back:
>
> https://github.com/landley/toybox/issues/456#issuecomment-1742021534
>
> In theory shell sort is N log(N), taking _advantage_ of partial sorting
> instead of being penalized by it like quicksort, so that SOUNDS like a
> good general purpose sort algorithm, but between not being a stable sort
> and the whole gap nonsense where wikipedia[citation needed] even says
> "The question of deciding which gap sequence to use is difficult." at
> the start of https://en.wikipedia.org/wiki/Shellsort#Gap_sequences, I
> just noped right out of there.
>
> It's not that I _can't_ get up to speed on these sorts of things, it's
> that I never saw the appeal. If a domain-specific problem needs
> domain-specific tuning then I wait to have that specific domain ready
> with test data. If there's a consensus on generic infrastructure, I'll
> leverage it. If there ISN'T a clear consensus, and it's been 80 years
> since the Harvard Mark I went online, I do not expect to be the one to
> pull the sword from the stone and clear up this area of confusion once
> and for all.
>
> Posix has quicksort in libc so I can just call that, and the fact posix
> has quicksort means "there is a consensus"; it may not be GREAT but I
> can rely on them to have already done all the arguing, and just get on
> with it. I can also do a bubble sort in like 3 lines of C, which is fine
> if I know I'm never having more than a couple hundred things in the
> array, and between the two of them that's WELL beyond 80/20.
>
> My lack of familiarity with numerous sorting algorithms was actually the
> problem I had with the bzip2 compression side, because it does a
> fallback through several types of sorts I'd never heard of, and I never
> understood WHY it was doing the ones it was doing. I put it on the todo
> list and never got back around to it...
>
> http://lists.busybox.net/pipermail/busybox/2003-November/043969.html
>
> And if tsort really does need a hash table on top of knuth's algorithm
> to provide "acceptable" performance, toybox has no business implementing
> it and should leave that to other packages.
>
> >> > This is brilliant and probably due to Tim Peters
> >>
> >> "probably"? (I'm happy to cite a primary source...)
> >
> > I tried for an hour or two to find a definitive statement that Tim
> > came up with the perturb idea. The long comment in the Python source
> > says he tested various shifts for perturb. A commit by Tim
> > https://github.com/python/cpython/commit/81673fb5ca25e17a1c1b806c06b046669a566a88
> > is the best evidence I have that
> > j = (j * 5 + 1 + (perturb >>= 5)) mod m
> > is all Tim's. Lots of good info there too.
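For anyone following along, that recurrence works out to a probe loop
shaped like the sketch below. This is the idea, not CPython's actual
code: perturb feeds five more bits of the original hash into each step,
and once it decays to zero the j*5+1 cycle on its own visits every slot
of a power-of-two table. Again it assumes at least one empty slot:

  #include <string.h>

  // Sketch of a CPython-style probe using the j = j*5 + 1 + perturb
  // recurrence quoted above. hashmask = table size - 1, size = 2**n.
  // Returns the index of the matching key or of the first empty slot.
  unsigned lookup(char **slot, unsigned hashmask, char *key, unsigned hash)
  {
    unsigned j = hash & hashmask, perturb = hash;

    while (slot[j] && strcmp(slot[j], key)) {
      perturb >>= 5;
      j = (j*5 + 1 + perturb) & hashmask;
    }

    return j;
  }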
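And on the "posix has quicksort in libc" point a bit further up: for an
array of strings the whole exercise really is one comparison shim plus
one call, e.g.:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  // qsort() comparison callback for an array of char *: each argument
  // points at an element of the array, i.e. at a char *.
  static int cmp(const void *a, const void *b)
  {
    return strcmp(*(char * const *)a, *(char * const *)b);
  }

  int main(void)
  {
    char *word[] = {"tsort", "hash", "qsort"};
    int i;

    qsort(word, 3, sizeof(*word), cmp);
    for (i = 0; i < 3; i++) puts(word[i]);  // hash, qsort, tsort

    return 0;
  }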
>
> I was assuming this Tim was somebody in the 1960s like knuth and
> friends, wrestling with these problems back before I learned to type.
>
> I believe the grad student in 1993's example way to avoid everything
> chaining together immediately was to start with some largeish initial
> offset (quarter of the table size I think?) and then divide it by 2 and
> add 1 each iteration, which meant it was usually an odd number that
> collapsed to 1 at the end. But that was just a blackboard exercise.
> (With actual chalk, old school...)
>
> > Using the original hash shifted 1 byte right works pretty effectively
> > as a second hash (de Balbine, Knuth's algo D); hash >> 23 was not
> > good. Similarly, I tried hash*strlen(s) as a secondary hash and got
> > pretty good results as well. The constant stride of 37 did badly and
> > the linear probe (stride 1) was just terrible. (Yet linear probing is
> > supposed to be decent up to 50% full, per Knuth? Maybe need to
> > investigate further.)
>
> You are convincing me tsort does not belong in toybox.
>
> >> With >>5 you need a table size over a million to get more than 4
> >> nonlinear displacements in a given chain. It's not THAT much
> >> protection. Still seems input data dependent. A hash table without a
> >> good hash function goes septic quick, and "good" is contextual. (...)
> >
> > The point of the Python approach was to work well with a "bad" hash
> > function, as Python's is bad from the usual viewpoint. E.g. integers
> > are their own hash: h(1) is 1, h(2) is 2, etc. So the best approach
> > for Python may not be best with a different hash function. I'm using
> > D. Bernstein's djb2a (djb2 modified). (BTW looks like pending/diff.c
> > tried to use that but there's a transposition in the initial
> > constant: they have 5831, should be 5381.)
>
> My diff rewrite was in a separate tree:
>
> https://github.com/landley/toybox/commit/42ac5b2334bd
>
> And it ground to a halt because people were arguing with me about it
> until it stopped being any fun at all to work on.
>
> I have no clue what the one in pending is doing.
>
> >> I didn't make it all the way through the 200+ lines of description
> >> at the top, not because it's hard to read but because they thought
> >> it NEEDED that much explanation.
> >>
> >> Remember my gripe at the end of
> >> https://landley.net/notes-2023.html#12-09-2023 where I was
> >> complaining that I'd felt the need to write a long explanation of
> >> the code as a comment in the source file? That meant (to me) that
> >> the code was still wrong.
> >
> > See
> > https://people.csail.mit.edu/gregs/ll1-discuss-archive-html/msg01701.html
> > for a different opinion on that, and on that comment in particular.
>
> No.
>
> > Yes, that's pretty much it. Mark the deleted entries, rebuild a
> > bigger table when needed. Most systems expand when the table is about
> > 60%-80% full, including deleted slots.
>
> So in order to keep tsort.c in toybox, I need to implement garbage
> collection or it's not good enough.
>
> > It's just kind of hard to get a decent generalized hash table system
> > that handles expansion and deletions without having a lot of code.
>
> 80 years. Grace Hopper is dead. Dennis Ritchie is dead. Multiple
> lifetimes of people smarter than me. And it's still "kind of hard to
> get a decent", but isn't in libc for me to just use.
>
> I am not wading into the Bog of Eternal Stench.
>
> > Some other interesting links: This guy
>
> Tsort is not worth this.
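Going back to the transposed constant above: djb2a with 5381 the right
way around is short enough to quote in full, sketch-wise (the cast keeps
high-bit characters from sign extending):

  // D. Bernstein's djb2a: multiply by 33, xor in each byte. The classic
  // djb2 adds the byte instead of xoring it. Note the initial constant
  // is 5381, not 5831.
  unsigned long djb2a(char *s)
  {
    unsigned long hash = 5381;

    while (*s) hash = (hash * 33) ^ (unsigned char)*s++;

    return hash;
  }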
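And to show what "mark the deleted entries" costs in code, here's a rough
tombstone sketch layered on the same double-hash probe as earlier.
DELETED is a hypothetical sentinel of my own, not from any real table,
and a real implementation would also track how full the table is (live
plus dead slots) and rebuild bigger somewhere past that 60%-80% mark:

  #include <string.h>

  // Hypothetical tombstone scheme: a deleted slot keeps a sentinel so
  // later probe chains still walk through it, and a lookup that misses
  // can recycle the first tombstone it passed as the insertion point.
  // Assumes at least one truly empty slot so the loop terminates; the
  // caller checks slot[result] to tell "found" from "insert here".
  static char deleted;              // unique address used as sentinel
  #define DELETED (&deleted)

  unsigned find(char **slot, unsigned m, char *key,
                unsigned h1, unsigned h2)
  {
    unsigned j = h1 & (m-1), k = (h2 & (m-1)) | 1, reuse = m;

    while (slot[j]) {
      if (slot[j] == DELETED) {
        if (reuse == m) reuse = j;  // remember first tombstone
      } else if (!strcmp(slot[j], key)) return j;
      j = (j+k) & (m-1);
    }

    return reuse < m ? reuse : j;   // recycle tombstone or empty slot
  }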
>
> > There's a good comparison of hashing algos (not the table handling,
> > but the string hashing) at
>
> I am not interested in building domain expertise in algorithms I need
> to form and defend an _opinion_ about.
>
> Rob
>
> P.S. Yes I checked if posix/libc just had one I could use, like with
> qsort(), but man 3 hash says "This page documents interfaces provided
> in glibc up until version 2.1. Since version 2.2, glibc no longer
> provides these interfaces.
musl has the POSIX <search.h>, so I assume glibc still does too?

> Probably, you are looking for the APIs provided by the libdb library
> instead." and I GUARANTEE you I am not looking for any separate gnu
> libdb, and as for the berkeley one here's Oratroll buying it
> https://lwn.net/Articles/171888/ and here's the inevitable license
> fsckery https://lwn.net/Articles/557820/ and I haven't checked back in
> since, not that I'd use an external dependency for the same reason the
> project's not linking against curses and implements its own hash
> functions and gzip. But it would have been NICE if libc provided one...
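For what it's worth, the POSIX <search.h> interface looks like the sketch
below, but it's one fixed-capacity table per process with no way to
delete an entry, which is presumably why it doesn't help here either:

  #include <search.h>
  #include <stdio.h>

  int main(void)
  {
    ENTRY e, *ep;

    if (!hcreate(64)) return 1;  // capacity hint: the table never grows
    e.key = "tsort";
    e.data = "knuth";
    hsearch(e, ENTER);           // insert; there is no way to delete
    e.key = "tsort";
    ep = hsearch(e, FIND);
    if (ep) printf("%s -> %s\n", ep->key, (char *)ep->data);
    hdestroy();

    return 0;
  }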
