Re: [Python-ideas] incremental hashing in __hash__

2016-12-31 Thread jab
On Sat, Dec 31, 2016 at 7:17 AM, Nick Coghlan  wrote:

> On 31 December 2016 at 15:13,  wrote:
>
>> On Fri, Dec 30, 2016 at 10:35 PM, Chris Angelico 
>> wrote:
>>
>>> How likely is it that you'll have this form of collision, rather than
>>> some other? Remember, collisions *will* happen, so don't try to eliminate
>>> them all; just try to minimize the chances of *too many* collisions. So if
>>> you're going to be frequently dealing with (1,2,3) and (1,3,2) and (2,1,3)
>>> and (3,1,2), then sure, you need to care about order; but otherwise, one
>>> possible cause of a collision is no worse than any other. Keep your
>>> algorithm simple, and don't sweat the details that you aren't sure matter.
>>
>>
>> In the context of writing a collections library, and not an application,
>> it should work well for a diversity of workloads that your users could
>> throw at it. In that context, it's hard to know what to do with advice like
>> this. "Heck, just hash the first three items and call it a day!"
>>
>
> Yes, this is essentially what we're suggesting you do - start with a "good
> enough" hash that may have scalability problems (e.g. due to memory
> copying) or mathematical distribution problems (e.g. due to a naive
> mathematical combination of values), and then improve it over time based on
> real world usage of the library.
>
> Alternatively, you could take the existing tuple hashing algorithm and
> reimplement that in pure Python:
> https://hg.python.org/cpython/file/tip/Objects/tupleobject.c#l336
>
> Based on microbenchmarks, you could then find the size breakpoint where it
> makes sense to switch between "hash(tuple(self))" (with memory copying, but
> a more optimised implementation of the algorithm) and a pure Python
> "tuple_hash(self)". In either case, caching the result on the instance
> would minimise the computational cost over the lifetime of the object.
>
> Cheers,
> Nick.
>
> P.S. Having realised that the tuple hash *algorithm* can be re-used
> without actually creating a tuple, I'm more amenable to the idea of
> exposing a "hash.from_iterable" callable that produces the same result as
> "hash(tuple(iterable))" without actually creating the intermediate tuple.
>

Nice!

I just realized that, like tupleobject.c's "tuplehash" routine, the
frozenset_hash algorithm (implemented in setobject.c) can't be reused
without actually creating a frozenset. As mentioned, a set hashing
algorithm is exposed as collections.Set._hash() in _collections_abc.py,
which can be passed any sized iterable, but that routine is implemented
in Python. So here again you have to choose between creating an
ephemeral copy of your data in order to use the fast C routine, or
streaming your data through a slower pure-Python implementation. At
least in this case the Python implementation is built in, though.
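That pure-Python routine can be exercised directly. A small check,
assuming CPython's _collections_abc implementation, in which Set._hash
is a plain function that only needs len() and iteration from its
argument:

```python
from collections.abc import Set

# Set._hash takes `self` but only calls len() on it and iterates it,
# so any sized iterable of distinct hashable items can be streamed
# through it without building a frozenset first.  (Distinctness
# matters: duplicates would be collapsed by a real set, not by a list.)
items = [1, 2, 3]
h = Set._hash(items)

# The per-element hashes are combined with XOR, so the result is
# order-independent, as a set hash must be:
assert Set._hash([3, 1, 2]) == h
assert Set._hash(frozenset(items)) == h
```

(Whether this exactly matches the C frozenset hash on a given CPython
version is a separate question; the point here is only that a streaming
pure-Python implementation exists.)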

Given my current shortage of information, for now I'm thinking of handling
this problem in my library by exposing a parameter that users can tune if
needed. See bidict/_frozen.py in
https://github.com/jab/bidict/commit/485bf98#diff-215302d205b9f3650d58ee0337f77297,
and check out the _HASH_NITEMS_MAX attribute.
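A tunable item cap of that sort might look like the following sketch
(illustrative only, not bidict's actual code; the class name and the
default value are invented):

```python
import itertools

class FrozenOrderedCollection:
    # Hypothetical default, in the spirit of the _HASH_NITEMS_MAX
    # attribute mentioned above; users can override it per subclass
    # or per instance to trade hashing cost against collision risk.
    _HASH_NITEMS_MAX = 1024

    def __init__(self, items):
        self._items = tuple(items)

    def __eq__(self, other):
        if not isinstance(other, FrozenOrderedCollection):
            return NotImplemented
        return self._items == other._items

    def __hash__(self):
        # Hash at most _HASH_NITEMS_MAX items, plus the length, so
        # that collections differing only in size still hash
        # differently.
        head = itertools.islice(self._items, self._HASH_NITEMS_MAX)
        return hash((len(self._items),) + tuple(head))
```

Two collections that share their length and their first
_HASH_NITEMS_MAX items will collide, which is legal: equal objects must
hash equal, but unequal objects are allowed to collide.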

I have to run for now, but thanks again everyone, and happy new year!
___
Python-ideas mailing list
Python-ideas@python.org
https://mail.python.org/mailman/listinfo/python-ideas
Code of Conduct: http://python.org/psf/codeofconduct/

Re: [Python-ideas] AttributeError inside __get__

2016-12-31 Thread Eric Snow
On Sat, Dec 31, 2016 at 12:33 AM, Nick Coghlan  wrote:
> On 31 December 2016 at 05:53, Ethan Furman  wrote:
>> On 12/30/2016 07:10 AM, Chris Angelico wrote:
>>> What module would be appropriate, though?
>>
>> Well, DynamicClassAttribute is kept in the types module, so that's
>> probably the place to put optionalproperty as well.
>
> I'd also be OK with just leaving it as a builtin.

FWIW, I've felt for a while that the "types" module is becoming a
catchall for stuff that would be more appropriate in a new "classtools"
module (à la functools).  I suppose that's what "types" has become, but
I personally prefer separate modules that make the distinction, and
would rather that "types" looked more like it does in 2.7.  Perhaps
this would be a good time to get that ball rolling, or maybe it's just
too late.  I'd like to think it's the former, especially since I
consider "classtools" a module with room to grow.

-eric


Re: [Python-ideas] incremental hashing in __hash__

2016-12-31 Thread Brendan Barnwell

On 2016-12-30 17:04, Ethan Furman wrote:

On 12/30/2016 04:31 PM, j...@math.brown.edu wrote:

On Fri, Dec 30, 2016 at 7:20 PM, Ethan Furman wrote:

If you are relying on an identity check for equality then no
two FrozenOrderedCollection instances can be equal.  Was that
your intention?  It it was, then just hash the instance's
id() and you're done.


No, I was talking about the identity check done by a set or dict
when doing a lookup to check if the object in a hash bucket is
identical to the object being looked up. In that case, there is
no need for the set or dict to even call __eq__.  Right?

No.  It is possible to have two keys be equal but different -- an
easy example is 1 and 1.0; they both hash the same, equal the same,
but are not identical.  dict has to check equality when two different
objects hash the same but have non-matching identities.


I think that is the same as what he said.  The point is that if they
*are* the same object, you *don't* need to check equality.
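Both points are easy to observe in CPython:

```python
# 1 and 1.0 hash equal and compare equal, yet are distinct objects,
# so a dict must fall back to __eq__ after the hashes match:
a, b = 1, 1.0
assert hash(a) == hash(b) and a == b and a is not b
d = {a: "stored under int 1"}
assert d[b] == "stored under int 1"

# The identity shortcut going the other way: a NaN compares unequal
# even to itself, but a dict still finds the same object, because
# the lookup checks `is` before calling __eq__:
nan = float("nan")
d2 = {nan: "found"}
assert nan != nan
assert d2[nan] == "found"
```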


--
Brendan Barnwell
"Do not follow where the path may lead.  Go, instead, where there is no
path, and leave a trail."
   --author unknown


Re: [Python-ideas] incremental hashing in __hash__

2016-12-31 Thread Nick Coghlan
On 31 December 2016 at 15:13,  wrote:

> On Fri, Dec 30, 2016 at 10:35 PM, Chris Angelico  wrote:
>
>> How likely is it that you'll have this form of collision, rather than
>> some other? Remember, collisions *will* happen, so don't try to eliminate
>> them all; just try to minimize the chances of *too many* collisions. So if
>> you're going to be frequently dealing with (1,2,3) and (1,3,2) and (2,1,3)
>> and (3,1,2), then sure, you need to care about order; but otherwise, one
>> possible cause of a collision is no worse than any other. Keep your
>> algorithm simple, and don't sweat the details that you aren't sure matter.
>
>
> In the context of writing a collections library, and not an application,
> it should work well for a diversity of workloads that your users could
> throw at it. In that context, it's hard to know what to do with advice like
> this. "Heck, just hash the first three items and call it a day!"
>

Yes, this is essentially what we're suggesting you do - start with a "good
enough" hash that may have scalability problems (e.g. due to memory
copying) or mathematical distribution problems (e.g. due to a naive
mathematical combination of values), and then improve it over time based on
real world usage of the library.

Alternatively, you could take the existing tuple hashing algorithm and
reimplement that in pure Python:
https://hg.python.org/cpython/file/tip/Objects/tupleobject.c#l336
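Such a reimplementation might look like the sketch below. It follows
the pre-3.8 algorithm at that link, assumes a 64-bit build, and takes
any sized iterable (the multiplier update needs the remaining length).
Note that CPython 3.8+ switched tuples to an xxHash-based scheme, so
this will not match hash(tuple(...)) on modern interpreters.

```python
def tuple_hash(items):
    """Pure-Python sketch of CPython's pre-3.8 tuplehash, streaming
    over a sized iterable without building a tuple (64-bit build)."""
    mask = (1 << 64) - 1        # emulate C unsigned 64-bit overflow
    mult = 1000003              # _PyHASH_MULTIPLIER
    x = 0x345678
    length = len(items)
    for item in items:
        length -= 1             # items remaining after this one
        y = hash(item) & mask
        x = ((x ^ y) * mult) & mask
        mult = (mult + 82520 + length + length) & mask
    x = (x + 97531) & mask
    if x > (1 << 63) - 1:       # reinterpret as a signed Py_hash_t
        x -= 1 << 64
    if x == -1:                 # hash() never returns -1
        x = -2
    return x
```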

Based on microbenchmarks, you could then find the size breakpoint where it
makes sense to switch between "hash(tuple(self))" (with memory copying, but
a more optimised implementation of the algorithm) and a pure Python
"tuple_hash(self)". In either case, caching the result on the instance
would minimise the computational cost over the lifetime of the object.
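Caching on the instance is safe here precisely because the contents
are frozen at construction time; a minimal sketch (the class name is
invented for illustration):

```python
class CachedHashCollection:
    """Immutable container that computes its hash at most once."""

    def __init__(self, items):
        self._items = tuple(items)   # frozen at construction
        self._hash = None            # hash() never returns None

    def __eq__(self, other):
        if not isinstance(other, CachedHashCollection):
            return NotImplemented
        return self._items == other._items

    def __hash__(self):
        # The first call pays the full cost; later calls are O(1).
        if self._hash is None:
            self._hash = hash(self._items)
        return self._hash
```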

Cheers,
Nick.

P.S. Having realised that the tuple hash *algorithm* can be re-used without
actually creating a tuple, I'm more amenable to the idea of exposing a
"hash.from_iterable" callable that produces the same result as
"hash(tuple(iterable))" without actually creating the intermediate tuple.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia