Re: [swift-evolution] [Draft] Hasher & HashVisitable

Xiaodi Wu via swift-evolution Mon, 13 Mar 2017 18:44:52 -0700

On Mon, Mar 13, 2017 at 8:30 PM, Tony Allevato via swift-evolution <
swift-evolution@swift.org> wrote:


> Adding the list back to this reply because I don't believe you meant to
> reply only to me, and I think it's an important discussion to have :)
>
>
> On Mon, Mar 13, 2017 at 4:32 PM Vincent Esche <regexident.mailinglists@
> gmail.com> wrote:
>
> Automatic generation of Hashable implementation code only "solves" the
> problem of having to implement `var hashValue`.
> It however only works for some types. By no means for all.
>
>
> Certainly, and I never suggested that it would work for every case. I was
> responding to Sean's point that the compiler should be able to do "good
> enough" in the common cases and I offered a way that that can be achieved.
>
>
> Take this hypothetical implementation of a HSB color:
>
> struct Color {
> let hue: UInt8
> let saturation: UInt8
> let brightness: UInt8
> }
>
> Let the semantic of this type be this:
> Any two colors with a `brightness` of `0` are to be considered equal
> regardless of their respective `hue` or `saturation`. At night all cats are
> gray.
>
>
> An auto-generated conformance impl for `Color` would most likely be wrong
> afaict.
> And those kind of semantics are everywhere.
>
>
> Of course, and in those cases, you wouldn't want to use an auto-derived
> Equatable or Hashable implementation. Those kinds of semantics are
> "everywhere" for certain definitions of "everywhere", but similarly,
> semantics where the hash value of a thing can be determined simply via
> combination of the hash values of its components are also "everywhere".
>
> I wouldn't suggest that auto-deriving Hashable solves *all* problems, and
> similarly, your proposal doesn't help in the situation you described
> either. Your API provides a different way to mix the hue, saturation, and
> brightness components in an implementation of hashValue, but it doesn't
> force the user to mix them *correctly* (by excluding hue/saturation if
> brightness is zero).
>
> So in both cases, users still have to provide custom implementations of ==
> and hashValue. But without auto-deriving, users have to provide them even
> for the cases where the auto-derived implementation *would have been
> correct.*  Auto-deriving is about reducing the number of types that need to
> have custom implementations, thereby reducing the chance that a user will
> do it wrong.
>
> When considering a type with auto-derived Hashable and Equatable, there
> are two ways it can be wrong:
>
> 1) The auto-generated implementations of both == and hashValue don't honor
> the semantic contract of the type (for example, don't ignore brightness
> when it's zero).
> 2) The user overrides the auto-generated implementation of one of
> ==/hashValue but not both, and violates the contracts between them.
>
> For #1, yes, the compiler generated an incorrect implementation in that
> case. However, I would argue it's the developer's responsibility to fix it
> by overriding it if they need different semantics. As I mentioned above,
> without derivation, the developer could still implement it incorrectly,
> just as they could with your API.
>
> For #2, this could be solved by requiring users to override both if they
> override one of them. Now they're back in the same situation as #1—they
> either did it right, or they did it wrong.
>
>
>
>
> Auto-generation clearly is no panacea when it comes to hashing. Especially
> if it leads to invalid and unsafe default implementations.
>
>
> Auto-deriving cannot produce "unsafe" implementations because it's defined
> in terms of a function operating over the hash values of its components. It
> can produce an implementation that does not match the intended semantics of
> the class that are defined outside of its component values, but that's not
> the compiler's job to decide; it's up to the user to provide those.
>
> Regarding unsafety, it's worth noting that your ContiguouslyHashable
> protocol is inherently unsafe and fragile. Imagine that a user implements a
> struct that they make conform to ContiguouslyHashable because at the time,
> it's a simple fixed layout type with primitive fields. If they take that
> type and add a new field to it that isn't contiguously hashable (say, a
> class reference) and they forget to replace the ContiguouslyHashable
> conformance, they now have a very broken type, and that behavior isn't
> caught until *runtime* when things go wrong.
>
> At least with derived conformances, in that situation the *compiler* would
> see that the type was no longer hashable and emit an error when it's used
> anywhere that Hashable/hashValue was expected.
>
> So if safety is your motivation, ContiguouslyHashable kind of fights back
> against that because it gives the user a door they can open—in some cases,
> accidentally—that produces invalid results.
>
>
>
>
> It would however be nice to be able to annotate types for which you want
> HashVisitable implementations to be generated.
>
>
> That's one of the still-open questions from the last discussion thread on
> the topic; whether auto-deriving should be automatic for any type where it
> is possible (as it is now for enums without associated values) or whether
> the user should have to opt-in.
>
>
>
>
> I actually don't see auto-generation as an alternative to a proper hashing
> API. But rather as an addition.
> HashVisitable and auto-deriving would actually work great together!
>
>
> I completely agree that these ideas complement each other very well! And I
> also agree that the language would do well to make it easier for users to
> easily do the right thing. I just feel that *the API proposed here
> specifically* adds too much complexity for the problem it's trying to solve.
>
> I'd find it more compelling if it was simplified a bit:
>
> * The standard library doesn't need to provide a number of hash
> implementations; it should just provide one that works "well enough" in
> most cases
> * Hashable doesn't have tie itself to a visitor pattern. It's not
> necessarily safer because users can still mix the wrong values; in that
> case, I'd rather the stdlib implementation of the "good enough" hash just
> be a standalone mixer that the language can encourage users to use.
>

I agree with this assessment. If SipHash is deemed "good enough," then
Swift should offer these as standalone facilities and encourage users to
call them in `Hashable`. I think the API design proposed here is much too
complex, but offering tried-and-true hash functions is certainly reasonable
and would be an improvement over xor.

Certainly also, the documentation for `Hashable` can be improved. In
general, however, it's not very convincing to motivate an entire proposal
for a new feature based on a documentation bug. Recognize that there are
(or at least have been, in past shipped versions of Swift) code snippets in
the documentation that _don't even compile_! If we accepted the logic put
forward here, we should give up on Swift entirely; after all, if even the
official Swift documentation can't provide code that compiles, what hope do
the rest of us have?!



>
> In fact they already do in other languages <https://is.gd/iy5770>.
>
>
> On Mon, Mar 13, 2017 at 8:51 PM, Tony Allevato via swift-evolution <
> swift-evolution@swift.org> wrote:
>
> I'm not convinced this is the right approach: it couples the fact that a
> type is hashable with a specific strategy for implementing the hash value
> computation.
>
> Instead, the language should make efforts to avoid requiring the user to
> think about hashing algorithms *at all*; to answer Sean's question a couple
> messages up, I've proposed in the past
> <https://gist.github.com/allevato/2fd10290bfa84accfbe977d8ac07daad> (rough
> working draft) adding derivation of Equatable and Hashable for structs and
> enums where it's possible to automatically do so (for example, if all of
> the enum's associated values are Equatable/Hashable). I've been
> experimenting with an implementation in the compiler and I have something
> that works for enums, except for recursive ones. I haven't started structs
> yet because I think there are more open questions there, and I hope to
> propose enums and structs independently so that the enum one doesn't get
> bogged down by the struct one.
>
> The core team seems to be aware of the need for this; the logic that
> derives Equatable/Hashable for enums without associated values also has
> TODO comments to handle those with associated values, and to handle structs
> as well. Slava Pestov mentioned a while back that they encouraged PRs on
> it, which is why I started, but my free time has been limited lately.
>
> That being said, I *do* think there would be value in having some kind of
> hash "mixer" as part of the standard library and strongly encouraging
> through documentation that hashValue implementors use it. Getting the
> function right is the hard part, and if Swift both (1) took care of it for
> you by default in many cases and (2) made the harder cases easier by
> providing a high quality canned implementation, I think we'd have a much
> cleaner solution than what is being proposed here.
>
>
> On Mon, Mar 13, 2017 at 12:18 PM Sean Heber via swift-evolution <
> swift-evolution@swift.org> wrote:
>
> Is there any reason the API couldn’t be expressed as something along these
> lines?
>
> func hash() -> [Hashable] {
>   return [x, y]
> }
>
> l8r
> Sean
>
>
> > On Mar 13, 2017, at 2:15 PM, David Hart <da...@hartbit.com> wrote:
> >
> >>
> >> On 13 Mar 2017, at 18:54, Sean Heber via swift-evolution <
> swift-evolution@swift.org> wrote:
> >>
> >> I’m dumb when it comes to proper hashing, but it’s such a tediously
> common thing in my experience to need to add Hashable to some kind of a
> struct so I can stash it in a set or use it as a dictionary key. Is there
> really no way to make this all more automatic? I have to be honest - this
> proposal seems *harder* to understand than the way it works now.
> >
> > It's not really harder: just call hash on each of your type's
> significant values:
> >
> > x.hash(&hasher)
> > y.hash(&hasher)
> >
> > How would you implement hashValue in a simpler way, remembering that 'x
> ^ y' is an incorrect implementation?
> >
> >> Of course the easiest would be if the language could just do this “good
> enough" for me using reflection or whatever and if I really did run into a
> problem where I wanted to do this myself, I could override something.
> >>
> >> Perfect is the enemy of good.
> >>
> >> l8r
> >> Sean
> >>
> >>
> >>> On Mar 13, 2017, at 10:38 AM, Vincent Esche via swift-evolution <
> swift-evolution@swift.org> wrote:
> >>>
> >>> Hi there,
> >>>
> >>> I've written up a proposal for an overhaul/replacement of the Hashable
> protocol and would love to hear your feedback on it!
> >>>
> >>> Rendered | Blog Post
> >>>
> >>> Cheers,
> >>> Vincent
> >>>
> >>> Ps: I'd like to thank David Hart (@hartbit) for his great editorial
> feedback on this proposal. 👍
> >>>
> >>> HashVisitable
> >>>
> >>>   • Proposal: SE-NNNN
> >>>   • Authors: Vincent Esche
> >>>   • Review Manager: TBD
> >>>   • Status: Awaiting review
> >>> Introduction
> >>>
> >>> Replace the Hashable protocol by two new procotols (Hasher and
> HashVisitable) to improve safety, versatility and learnability.
> >>>
> >>> Motivation
> >>>
> >>> Implementing Hashable is difficult and the consequences if not done
> well have dire performance and safety repercussions.
> >>>
> >>> The documentation of Hashable lists a sample implementation of var
> hashValue:
> >>>
> >>> /// A point in an x-y coordinate system.
> >>> struct GridPoint {
> >>>   var x: Int
> >>>   var y: Int
> >>> }
> >>>
> >>> extension GridPoint: Hashable {
> >>>   var hashValue: Int {
> >>>       return x.hashValue ^ y.hashValue
> >>>   }
> >>>
> >>>   static func == (lhs: GridPoint, rhs: GridPoint) -> Bool {
> >>>       return lhs.x == rhs.x && lhs.y == rhs.y
> >>>   }
> >>> }
> >>>
> >>> Calculating the hashes of all GridPoints (given the above
> implementation) on a 1000 × 1000 grid …
> >>>
> >>> let (width, height) = (1000, 1000)
> >>> let total = width * height
> >>> var hashes = Set<Int>()
> >>> for x in 0..<width {
> >>>   for y in 0..<height {
> >>>       hashes.insert(GridPoint(x: x, y: y).hashValue)
> >>>   }
> >>> }
> >>> print("\(hashes.count) unique hashes out of a total of \(total).")
> >>>
> >>> … results in just 1024 unique hash values for 1_000_000 unique values.
> >>>
> >>> In other words: The recommended implementation causes 99.9% of values
> to trigger a hash collision.
> >>>
> >>> Out of those 1_000_000 values the median collision count was 976 with
> min and max being 976 and 1000respectively.
> >>>
> >>> The collision rate will have negative impact in algorithms which
> heavily use hashValue like the ones in Dictionaryand Set. Furthermore, it
> increases the vulnerability to DDOS attacks when exposed to the web.
> >>>
> >>> If even the official Swift documentation gets the implementation of
> hashValue wrong, then who is to expect the average Swift programmer to do
> any better?
> >>>
> >>> In contrast, running the same snippet using HashVisitable and the
> semi-secure Fnv1aHash (see below) results in zero collisions!
> >>>
> >>> Finally, the design of the Hashable protocol forces the use of one
> implementation without the possibility of switching between multiple
> hashing algorithms.
> >>>
> >>> Proposed solution
> >>>
> >>> Instead of coupling the hashing algorithm with each and every Swift
> type, we should provide a hashing API based on the visitor-pattern. By
> freeing application developers from the burden of having to implement
> hashing algorithms, the Standard Library can provide default ones which
> fulfill collision, performance and security goals. Furthermore, it would
> allow developers to swap to different algorithms based on the use case.
> >>>
> >>> Detailed design
> >>>
> >>> The proposal deprecates the Hashable protocol and introduces the
> following two:
> >>>
> >>> protocol Hasher
> >>> {
> >>>
> >>> mutating func finish() -> Int
> >>>
> >>>
> >>> mutating func write(bytes
> >>> : UnsafeRawBufferPointer)
> >>> }
> >>>
> >>>
> >>> protocol HashVisitable
> >>> {
> >>>
> >>> func hash<H: Hasher>(_ hasher: inout
> >>> H)
> >>> }
> >>>
> >>> Hasher is the protocol which represents a hashing algorithm, and
> HashVisitable replaces Hashable. For types entirely represented by their
> memory layout, the following protocol would provide a default
> implementation:
> >>>
> >>> protocol ContiguouslyHashable: HashVisitable {}
> >>>
> >>> extension ContiguouslyHashable {
> >>>   func hash<H: Hasher>(_ hasher: inout H) {
> >>>       var mutableSelf = self
> >>>       try! Swift.withUnsafeBytes(of: &mutableSelf) {
> >>>           hasher.write(bytes: $0)
> >>>       }
> >>>   }
> >>> }
> >>>
> >>> extension Bool : ContiguouslyHashable {}
> >>> extension UInt8 : ContiguouslyHashable {}
> >>> extension UInt16 : ContiguouslyHashable {}
> >>> extension UInt32 : ContiguouslyHashable {}
> >>> extension UInt64 : ContiguouslyHashable {}
> >>> extension UInt : ContiguouslyHashable {}
> >>> extension Int8 : ContiguouslyHashable {}
> >>> extension Int16 : ContiguouslyHashable {}
> >>> extension Int32 : ContiguouslyHashable {}
> >>> extension Int64 : ContiguouslyHashable {}
> >>> extension Int : ContiguouslyHashable {}
> >>>
> >>> The Standard-Library would then provide a set of hashing
> implementations specific to each purpose. A possible choice for hashing
> algorithms would be the reasonably fast SipHash-2-4, and the reasonably
> secure SipHash-4-8.
> >>>
> >>> FNV-1A is another popular semi-secure but blazingly fast hash
> algorithm, which – for the sake of demonstration – could be implemented as
> follows:
> >>>
> >>> struct Fnv1aHash
> >>> {
> >>>
> >>> fileprivate var state: UInt
> >>>
> >>>
> >>> init(seed: UInt
> >>> ) {
> >>>
> >>> self.state = seed &+ 14695981039346656037
> >>>
> >>>   }
> >>> }
> >>>
> >>>
> >>> extension Fnv1aHash: Hasher
> >>> {
> >>>
> >>> mutating func write(bytes
> >>> : UnsafeRawBufferPointer) {
> >>>
> >>> for byte in
> >>> bytes {
> >>>
> >>> self.state = (self.state ^ UInt(byte)) &* 1099511628211
> >>>
> >>>       }
> >>>   }
> >>>
> >>> mutating func finish() -> Int
> >>> {
> >>>
> >>> return unsafeBitCast(self.state, to: Int.self
> >>> )
> >>>   }
> >>> }
> >>>
> >>> Coming back to the sample code present in the Hashable documentation,
> the new implementation would look like:
> >>>
> >>> extension GridPoint: HashVisitable
> >>> {
> >>>
> >>> func hash<H: Hasher>(_ hasher: inout
> >>> H) {
> >>>
> >>> self.x.hash(&
> >>> hasher)
> >>>
> >>> self.y.hash(&
> >>> hasher)
> >>>   }
> >>> }
> >>>
> >>> Source compatibility
> >>>
> >>> Making use of "extending protocols to conform to protocols":
> >>>
> >>> extension Hashable: HashVisitable
> >>> {
> >>>
> >>> func hash<H: Hasher>(_ hasher: inout
> >>> H) {
> >>>
> >>> self.hashValue.hash(&
> >>> hasher)
> >>>   }
> >>> }
> >>>
> >>> Effect on ABI stability
> >>>
> >>> n/a
> >>>
> >>> Effect on API resilience
> >>>
> >>> This feature should be possible to add/remove without breaking ABI.
> >>>
> >>> Alternatives considered
> >>>
> >>> n/a
> >>> _______________________________________________
> >>> swift-evolution mailing list
> >>> swift-evolution@swift.org
> >>> https://lists.swift.org/mailman/listinfo/swift-evolution
> >>
> >> _______________________________________________
> >> swift-evolution mailing list
> >> swift-evolution@swift.org
> >> https://lists.swift.org/mailman/listinfo/swift-evolution
>
> _______________________________________________
> swift-evolution mailing list
> swift-evolution@swift.org
> https://lists.swift.org/mailman/listinfo/swift-evolution
>
>
> _______________________________________________
> swift-evolution mailing list
> swift-evolution@swift.org
> https://lists.swift.org/mailman/listinfo/swift-evolution
>
>
>
> _______________________________________________
> swift-evolution mailing list
> swift-evolution@swift.org
> https://lists.swift.org/mailman/listinfo/swift-evolution
>
>

_______________________________________________
swift-evolution mailing list
swift-evolution@swift.org
https://lists.swift.org/mailman/listinfo/swift-evolution

Re: [swift-evolution] [Draft] Hasher & HashVisitable

Reply via email to