> Thank you for the pseudocode, I see what you're saying now, but it still
> doesn't make sense to me. ISTM the algorithm implies m < k < n, because if
> m > k, this will never terminate, and if k > n then shuffling is
> deterministic and more efficient in both space and time.
>
> However, if k < n, then there exists at least one pair of unsigned
> integers a < n and b < n such that a^x (mod k) = b^x (mod k), so this
> algorithm will never produce a set with both a and b in it - not truly
> random.
This is what I was talking about with respect to hash "buckets" for collisions. Basically, if you replace the array in the set implementation with an array of binary trees, then you can have exactly the case where both a and b are in the set, even though hash(a) == hash(b). This would use space O(k + k*log m)[1] and a running time that is a pain in the ass to bound reasonably (I was trying to use a negative binomial distribution, with p = m/n, but that ends up being a very loose estimate).

> Plus, what about the overhead of picking a k and x? IIUC finding large
> primes is non-trivial.

Picking large primes is easy, plus this is a one-time cost that is paid at compile time (any x and k pair will work for all m's and n's).

> There's one thing about this algorithm I really like though - O(m) space.
> The best thing I've thought of so far that works in O(m) space looks like:

- Rob

[1] Using a '+' in big O notation means "whichever is bigger"... it's probably obvious, but I forget at what level of algorithms they start describing that.
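For what it's worth, the bucketed-set idea above can be sketched roughly like this (Python). The prime K, the exponent X, and the use of a plain list where the post proposes a binary tree are all illustrative assumptions on my part, not the actual implementation being discussed:

```python
import random

# Assumed parameters: any prime k and exponent x work for all m and n
# (a one-time, "compile time" choice per the post). These values are
# placeholders for illustration only.
K = 1009   # prime number of buckets
X = 3      # hash exponent

def h(a):
    # The discussed hash: a^x mod k
    return pow(a, X, K)

def sample_distinct(m, n, rng=None):
    """Draw m distinct integers uniformly from [0, n) in O(m) space.

    Collisions go into per-bucket containers, so two values a != b
    with h(a) == h(b) can both end up in the sampled set. A real
    binary tree per bucket would give O(log m) membership tests;
    a list is used here purely to keep the sketch short.
    """
    assert m <= n
    buckets = {}   # bucket index -> list (stand-in for a binary tree)
    result = []
    while len(result) < m:
        a = rng.randrange(n) if rng else random.randrange(n)
        bucket = buckets.setdefault(h(a), [])
        if a not in bucket:          # reject duplicates, retry
            bucket.append(a)
            result.append(a)
    return result
```

Because membership is checked inside the bucket rather than by hash value alone, the k < n collision objection from the quoted message no longer loses elements; the cost is the extra per-bucket structure and the retry loop whose expected length depends on m/n.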
