Turns out I’m rather bad at mailing lists. Hit “reply” (vs. "reply all") again.
On Tue, Mar 14, 2017 at 3:34 AM, Károly Lőrentey <[email protected]> wrote: > > On 2017. Mar 13., at 23:56, Vincent Esche <regexident.mailinglists@ > gmail.com> wrote: > > On Mon, Mar 13, 2017 at 9:56 PM, Károly Lőrentey <[email protected]> > wrote: > >> Yes please! I’ve a package on GitHub to implement roughly the same thing. >> I’ve been happily using it for months now, and I wouldn’t ever write a >> hashValue implementation by hand again. >> > >> > https://github.com/lorentey/SipHash >> >> I think the fact that we’ve both come up with essentially the same API is >> an interesting data point; it definitely underlines the need for a better >> Hashable protocol. >> > > It's not just us, actually. Rust does the very same thing > <https://doc.rust-lang.org/std/hash/>. > A very similar concept has been proposed > <https://isocpp.org/files/papers/n3980.html> and implemented > <https://github.com/HowardHinnant/hash_append> for C++ by Howard Hinnant. > > > Hm. It is possible I may have been influenced by one of these. > > My comments: >> >> * In an ideal world, this would be a replacement for Hashable, not a >> protocol with a different name. Once visitor-based hashing becomes >> available, nobody should implement a hashValue property. >> > > Same here. I fear however that turning a protocol upside down would do > more harm than good. > > >> * All standard Hashable types in standard library should implement the >> new hash function directly, rather than relying on the default >> implementation. >> > > Yes. > > * Why is the HashVisitable.hash a generic function? Hasher could just as >> well be a concrete type defined in stdlib. Making it generic may have some >> performance implications. >> > > Because we specifically don't want to > <https://blog.definiteloops.com/ha-r-sh-visitors-8c0c3686a46f#.6emnc0d5x> > be limited to a single hash implementation. > > > Allowing custom hashers indeed sounds nice in theory, but are there any > more practical examples where you'd want to actually do that? Are Bloom > Filters an important enough use case on their own to justify the extra > complexity? A concrete Hasher type would still allow stdlib to switch to a > different hash algorithm; this is already a long way beyond what people are > typically doing with Hashable.) > Bloom Filters are – while very interesting – an edge case, admittedly. Being able to choose a different hasher for certain Dictionaries, Sets, … however is a key goal for me. You want secure defaults, yet be able to exchange them with a weaker impl for safe bottlenecks. Especially in the scope of Swift going more and more into systems programming territory. You don’t want to wait yourself into a corner now. I'm also worried people would find clever ways to abuse the visitor > parameter in ways not related to hashing. (E.g., you can implement one half > of an amusingly broken serialization library in just two minutes by making > OutputStream conform to Hasher. Implementing deserialization might take a > *little* more work, though.) 😈 > While technically true that’s kind of a straw man argument, isn’t it? 😉 I could easily turn it around and argue that nothing hinders me from implementing half of my app in `var hashValue`. 😈 * I find that I prefer to hash components by calling a mutating method on >> the hasher, rather than directly calling the components' hash >> implementations. Putting the hasher first is much more readable to me, >> primarily because it gets rid of all the distracting &s. It also makes it >> possible to find slightly better names, eliminating the repetitiveness of >> "foo.hash(&hasher)": >> >> extension GridPoint: SipHashable { >> func appendHashes(to hasher: inout >> SipHasher) { >> hasher.append(x) >> hasher.append(y) >> } >> } >> > > That would require SipHasher to overload `append` for every type > imaginable. And it would require SipHasher know about the type's memory > layout, which violates encapsulation. "What should be hashed?" should be > specified in the type. "How should it be hashed?" however in the given > hasher. > > > Hasher's mutating method would be implemented as a one-line generic > function: > > https://github.com/lorentey/SipHash/blob/master/sources/ > SipHashable.swift#L56 > > The actual hashing would of course still be done by implementations of the > HashVisitable protocol. This is just a convenience thing. > Oh, I misread then. (Admittedly this would introduce a generic function, while at the same time > I'm also arguing for making the protocol requirement non-generic partly > due to (not very convincing) worries about performance. But this would be a > single fragile generic in stdlib, not many generics sprinkled over separate > user-defined modules.) > > > * I suggest using SipHash instead of FNV-1a. The standard library already >> contains an implementation for SipHash, as undocumented internal API, >> complete with a note that it should be made public. AFAICR, it is currently >> only used for String hashing, but it would be very much worth making it >> universal. (Accomodating SipHash's random key is one reason why hashValue’s >> documentation explicitly notes that its value is "not guaranteed to be >> equal across different executions of your program.”) >> > > I do not propose making FNV-1a part of stdlib. I just included FNV-1a > because it is trivial to implement and easily fits in a proposal as a > sample impl of Hasher. > > I do however propose to include SipHash-2-4, SipHash-4-8 and maybe even > SipHash-1-3. > > > Oh, indeed, I misread the proposal. (Sorry!) But if the API allows for > custom hashers, and stdlib comes with more than one, then which one would > Set and Dictionary use? Would their choice of hasher be user configurable? > stdlib has been remarkably free of such elaborations so far. The > alternative is to bake a particular hash into the standard hashed > collections, but then the overall API would be strangely lopsided, with an > inconsistent mix of flexible and rigid parts. > I would propose to make them generic, but default to a Hasher provided by stdlib (like `Set<Element, Hasher = SipHash13>`). > All these questions go away if there is but a single standard Hasher. > Well, except the question about which algorithm should be the One. But I > think any keyed hash function would be much better than the status quo. > > * ContiguouslyHashable seems like an unsafe construct to me. It is quite >> dangerous to base hashing on the raw byte sequence underlying a value: for >> example, struct values may include uninitialized gaps between some of their >> stored properties due to alignment constraints. So two otherwise identical >> values may very well have different in-memory representations. >> > > These type are clearly not ContiguouslyHashable then. Which was also > explained in detail in the accompanying blog post: > > "ContiguouslyHashable marks a special family of types which allow for > simply passing their contiguous memory to the given hasher en bloc, which > Float, Double and String would decidedly not conform to, while for types > matching these criteria then it would be a simple matter of tagging them > with the protocol." Blog Post > <https://blog.definiteloops.com/ha-r-sh-visitors-8c0c3686a46f#.6emnc0d5x> > > >> Therefore, I suggest ContiguouslyHashable should be removed from the >> proposal. Swift programmers who know what they’re doing would still be able >> to call withUnsafeBytes(of:) when they want to, but regular schmoes like >> myself will not be tempted to shoot themselves in the foot by using it >> inappropriately. >> > > Fine with me. I don't care to much about how those stdlib primitive types > gain their hashing powers. ;) > > > To be clear, I think it would be fine to keep ContiguouslyHashable as an > internal implementation detail in stdlib; it just shouldn't be a public, > documented protocol in its current form. (Perhaps as > UnsafeContiguouslyHashable? 🙂) I don't think Swift programmers in general > are aware of such subtle aspects of how their structs are laid out in > memory. It's certainly a fuzzy area for me, at least; for example, I > remember getting surprised by the fact that Float80 values end with six > uninitialized bytes; no other floating point type in stdlib is padded like > that. > I’d be absolutely fine with just auto-generating HashVisitable for all the types I had tagged as ContiguouslyHashVisitable instead. Thinking about it I’d actually prefer that now. Keep it an implementation detail and for all those who really want to hash a type from its contiguous memory slice the code necessary is easy to whip up yourself. > To pick another nit: is it necessary to put a label on the parameter of > Hasher.write(bytes:)? I may not be up to date on the naming convention, but > it seems to me it would read just as well without the label. (I also think > there are problems with the name HashVisitable.hash(_:), but my best > attempt at naming it is writeHashes(to:) which isn't much of an > improvement.) > I haven’t put too much effort into the naming, to be honest. As such I don’t really like HashVisitable either. But just changing the semantics of Hashable by 180° for the sake of a better name isn’t feasible. > > Karoly >
_______________________________________________ swift-evolution mailing list [email protected] https://lists.swift.org/mailman/listinfo/swift-evolution
