subject:"\[issue29410\] Moving to SipHash\-1\-3"

[issue29410] Moving to SipHash-1-3

2021-10-10 Thread Inada Naoki



Change by Inada Naoki :


--
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2021-10-10 Thread Inada Naoki



Inada Naoki  added the comment:


New changeset ad970e8623523a8656e8c1ff4e1dff3423498a5a by Inada Naoki in branch 
'main':
bpo-29410: Change the default hash algorithm to SipHash13. (GH-28752)
https://github.com/python/cpython/commit/ad970e8623523a8656e8c1ff4e1dff3423498a5a


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2021-10-09 Thread Christian Heimes



Christian Heimes  added the comment:

> But do they use them as dict keys? AFAIK strings aren't hashed until hash() 
> is called on them.

That's correct. The hash of str and bytes is calculated on demand and then 
cached.

Frozensets also cache their hash values while tuples don't have a cache. We ran 
experiments with hash caching in tuples many years ago. It turned out that the 
increased size had an overall negative effect on performance. This may have 
changed with modern hardware with more RAM and much larger CPU caches, though.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2021-10-07 Thread Guido van Rossum



Guido van Rossum  added the comment:

Victor:
> I expect even more interesting speedup with bytes string longer than 6k 
> bytes. And I'm quite sure that it's common that people manipulates long 
> strings in Python :-)

But do they use them as dict keys? AFAIK strings aren't hashed until hash() is 
called on them.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2021-10-07 Thread Serhiy Storchaka



Serhiy Storchaka  added the comment:

Do Rust and Ruby cache the hash of the string in the string object? If they do 
not then the hashing speed is more important to them.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2021-10-07 Thread Christian Heimes



Christian Heimes  added the comment:

I suggest that you write a new PEP that discusses possible improvement for str 
and bytes hashing. This BPO is just about SipHash-1-3. As the author and 
implementer of PEP 456, I only approve SipHash-1-3 as new default algorithm.

All other change proposals need at least separate BPOs, preferable a PEP. I 
don't want to have this simple feature proposal turned into a major design epic.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2021-10-07 Thread Marc-Andre Lemburg

Marc-Andre Lemburg  added the comment:

On 07.10.2021 15:29, Christian Heimes wrote:
> 
> Christian Heimes  added the comment:
> 
> JP got back to me
> 
> On 07/10/2021 14.34, Jean-Philippe Aumasson wrote:
>> xxHash is much faster indeed, but collisions seem trivial to find, which 
>> might allow hash-flood DoS again (see for example 
>> https://github.com/Cyan4973/xxHash/issues/180 
>> ). It's however unclear 
>> whether exploitable multicollisions can also be trivially found.
>>
>> If collisions don't matter and if the ~10x speed-up makes a difference, 
>> then probably a good option, but guess you'll need to keep SipHash (or 
>> some other safe hash) when DoS resistance is needed?
> 
> This information disqualifies xxHash for our use case.

The quoted issue was for an early version of XXH3.

Please see
https://github.com/Cyan4973/xxHash/wiki/Collision-ratio-comparison
as reference for collision analysis of the current xxHash versions.

The numbers are close to the expected case, meaning that
collisions are not more frequent than statistically to be
expected given the hash and the test sample size.

Looking at this older comparison, it would also make sense to
revisit the hybrid approach, e.g. use FNV for strings
up to 16 bytes, XXH128 for longer strings:

https://cglab.ca/~abeinges/blah/hash-rs/

Given that dictionaries often use relatively short string keys,
this should have a significant effect on applications where the
dictionaries don't just use interned strings.

It would also have an effect on Python startup time, since all
those interned strings need to have their hash calculated
during startup.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2021-10-07 Thread Christian Heimes



Christian Heimes  added the comment:

JP got back to me

On 07/10/2021 14.34, Jean-Philippe Aumasson wrote:
> xxHash is much faster indeed, but collisions seem trivial to find, which 
> might allow hash-flood DoS again (see for example 
> https://github.com/Cyan4973/xxHash/issues/180 
> ). It's however unclear 
> whether exploitable multicollisions can also be trivially found.
> 
> If collisions don't matter and if the ~10x speed-up makes a difference, 
> then probably a good option, but guess you'll need to keep SipHash (or 
> some other safe hash) when DoS resistance is needed?

This information disqualifies xxHash for our use case.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2021-10-07 Thread Christian Heimes



Christian Heimes  added the comment:

SipHash is a cryptographically secure PRF, not a cryptographic hashing 
algorithm, https://www.python.org/dev/peps/pep-0456/#siphash

I'm strongly against using a different algorithm when the rest of the world is 
using either SipHash-2-4 or SipHash-1-3. A minuscule performance gain is not 
worth the risk.

If users like to use an alternative algorithm, then they are free to do so. PEP 
456 gives them a hook point to override the algorithm on build time of CPython.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2021-10-07 Thread Marc-Andre Lemburg

Marc-Andre Lemburg  added the comment:

On 07.10.2021 12:48, Christian Heimes wrote:
> 
>> I don't quite follow. Why is it fine that you discuss DoS, but it's not
> fine when others discuss DoS ?
> 
> But this BPO is not about discussing mitigations against DoS attacks in 
> general. It's about adding SipHash1-3- and following the example of Rust and 
> Ruby.
> 
> If you like to discuss DoS attacks on hashing of numeric types or other 
> mitigations, then please do this in a dedicated ticket. I like to keep this 
> BPO focused on a single topic.

The point that both Victor and I wanted to make is that we have
different views on the relevance of DoS attack mitigations
on selecting the default hash algorithm to use with Python strings
(and other objects which use pyhash.c).

The motivation for moving to siphash 1-3 is performance and we can
potentially get even better performance by looking at today's hash
algorithms and revisiting the decision to go with siphash.

This broadens the discussion, yes, but that can easily be addressed
by changing the title to e.g. "Revisiting the default hash algorithm
for strings".

Since siphash is a crypto hash function, whereas xxhash (and other
faster hash algorithms) are non-crypto hash functions, the topic of
hash collisions which can be used for DoS becomes relevant, so I
don't see why such discussions are off-topic.

With non-crypto hash algorithms available which exhibit good
collision stats and taking into account that DoS can be mitigated
using other ways (which is essential anyway, since Python doesn't
protect again hash based DoS in all cases), we get to a better Python.

More details on xxhash collision stats:
https://github.com/Cyan4973/xxHash/wiki/Collision-ratio-comparison#collision-study

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2021-10-07 Thread Marc-Andre Lemburg



Marc-Andre Lemburg  added the comment:

BTW: We already use (a slight variant of) xxHash for tuples: 
https://bugs.python.org/issue34751

The issues is an interesting read, in particular on how xxHash was eventually 
chosen, with a whole set of other hash algorithms in between.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2021-10-07 Thread Christian Heimes



Christian Heimes  added the comment:

I have contacted JP Aumasson to get his feedback on this proposal.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2021-10-07 Thread Christian Heimes



Christian Heimes  added the comment:

> I don't quite follow. Why is it fine that you discuss DoS, but it's not
fine when others discuss DoS ?

But this BPO is not about discussing mitigations against DoS attacks in 
general. It's about adding SipHash1-3- and following the example of Rust and 
Ruby.

If you like to discuss DoS attacks on hashing of numeric types or other 
mitigations, then please do this in a dedicated ticket. I like to keep this BPO 
focused on a single topic.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2021-10-07 Thread Marc-Andre Lemburg



Marc-Andre Lemburg  added the comment:

On 07.10.2021 12:16, Christian Heimes wrote:
> 
>> That's certainly true, but at the same time, just focusing on string
> hashes only doesn't really help either, e.g. it is very easy to
> create a DoS with numeric keys or other objects which use trivial
> hashing algorithms.
> 
> Marc-Andre, Victor, your postings are off-topic. Please move your discussion 
> to a new BPO.

I don't quite follow. Why is it fine that you discuss DoS, but it's not
fine when others discuss DoS ?

You're statement comes across as an attempt to shut down a possibly
fruitful discussion. Not sure if you intended it that way.

We could open a new issue to discuss faster alternatives to siphash
and then close this one, or simply rename this issue, if it bothers you
that we're not strictly focusing on siphash 1-3 :-)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2021-10-07 Thread Christian Heimes



Christian Heimes  added the comment:

> I am not sure its worth enough. Adding algorithm increase some maintenance 
> cost... Is it really make someone happy?

siphash13 is a short function. I wrote and designed PEP 456 so we can easily 
add new hashing functions. IMHO the maintenance cost is minimal.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2021-10-07 Thread Christian Heimes



Christian Heimes  added the comment:

> That's certainly true, but at the same time, just focusing on string
hashes only doesn't really help either, e.g. it is very easy to
create a DoS with numeric keys or other objects which use trivial
hashing algorithms.

Marc-Andre, Victor, your postings are off-topic. Please move your discussion to 
a new BPO.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2021-10-07 Thread Marc-Andre Lemburg

Marc-Andre Lemburg  added the comment:

On 07.10.2021 11:49, Inada Naoki wrote:
> Hash DoS is not only for HTTP headers. Everywhere creating dict from 
> untrusted source can be attack vector.
> For example, many API servers receive JSON as HTTP request body. Limiting 
> HTTP header don't protect it.

That's certainly true, but at the same time, just focusing on string
hashes only doesn't really help either, e.g. it is very easy to
create a DoS with numeric keys or other objects which use trivial
hashing algorithms.

I wouldn't focus too much on this at the Python core level.
Server implementations have other ways to protect themselves against
DoS, e.g. by monitoring process memory, CPU load or runtime, applying
limits on incoming data.

IMO, it's much better to use application and use case specific methods
for this, than trying to fix basic data types in Python to address
the issue and making all Python application suffer as a result.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2021-10-07 Thread Inada Naoki



Inada Naoki  added the comment:

> I recommend that you add SipHash-1-3 as an additional algorithm and make it 
> the default. The removal of --with-hash-algorithm=siphash24 should go through 
> regular deprecation cycle of two Python versions.

I am not sure its worth enough. Adding algorithm increase some maintenance 
cost... Is it really make someone happy?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2021-10-07 Thread Marc-Andre Lemburg



Marc-Andre Lemburg  added the comment:

Since the days this was discussed, a lot of new and faster hash algorithms have 
been developed. It may be worthwhile looking at those instead.

E.g. xxHash is a lot more performant than siphash: 
https://github.com/Cyan4973/xxHash (the link also has a comparison of hash 
algorithms)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2021-10-07 Thread Inada Naoki



Inada Naoki  added the comment:

> I know that it's not a popular opinion, but I don't think that this denial of 
> service (DoS) is important. IMO there are enough other ways to crash a 
> server. Moreover, the initial attack vector was a HTTP request with tons of 
> header lines. In the meanwhile, the Python http module was modified to put 
> arbitrary limits on the number of HTTP headers and the maximum length of a 
> single HTTP header.


Hash DoS is not only for HTTP headers. Everywhere creating dict from untrusted 
source can be attack vector.
For example, many API servers receive JSON as HTTP request body. Limiting HTTP 
header don't protect it.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2021-10-07 Thread Christian Heimes



Christian Heimes  added the comment:

I recommend that you add SipHash-1-3 as an additional algorithm and make it the 
default. The removal of --with-hash-algorithm=siphash24 should go through 
regular deprecation cycle of two Python versions.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2021-10-07 Thread Christian Heimes



Christian Heimes  added the comment:

I support this change, too.

SipHash-1-3 is more than good enough for our use case. It provides sufficient 
diffusion of str and bytes hashes and has sufficient reliance against timing 
attacks on the PRF key.

Also 64bit platforms are less affected less by hash collision DoS attacks 
anyway. The issue was far greater on 32bit systems. Nowadays virtually all 
production systems are 64bit.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2021-10-07 Thread STINNER Victor



STINNER Victor  added the comment:

"1.67 us +- 0.03 us: 1.78x faster" with a bytes string of 6k bytes sounds worth 
it to me.

When we talk about "security" here, we are talking about a denial of service 
attack on the dict worst case performance:
https://python-security.readthedocs.io/vuln/hash-dos.html

I know that it's not a popular opinion, but I don't think that this denial of 
service (DoS) is important. IMO there are enough other ways to crash a server. 
Moreover, the initial attack vector was a HTTP request with tons of header 
lines. In the meanwhile, the Python http module was modified to put arbitrary 
limits on the number of HTTP headers and the maximum length of a single HTTP 
header.

It's nice to limit the risk of a DoS, but I don't think that we should go too 
far. If it worked for Rust and Ruby, SipHash-1-3 should be good as well for 
Python.

I expect even more interesting speedup with bytes string longer than 6k bytes. 
And I'm quite sure that it's common that people manipulates long strings in 
Python :-)

I retarget this change to Python 3.11. Please don't backport it since it 
changes the Python build system (configure options).

--
versions: +Python 3.11 -Python 3.7

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2021-10-07 Thread Mark Shannon



Mark Shannon  added the comment:

Yes, this is worth doing, IMO.

It adds no more code and probably reduces maintenance costs as any 
improvements/bug-fixes to the rust/ruby versions can be easily ported.

Even if the benefit is small, the cost is basically zero.

--
nosy: +Mark.Shannon

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2021-10-07 Thread Inada Naoki



Inada Naoki  added the comment:

I am not sure this is worth doing.

Microbenchmarks:

## import time

```
$ main/opt/bin/pyperf command main/opt/bin/python3 -c 'import typing,asyncio'
.
command: Mean +- std dev: 49.6 ms +- 0.1 ms

$ siphash13/opt/bin/pyperf command siphash13/opt/bin/python3 -c 'import 
typing,asyncio'
.
command: Mean +- std dev: 49.3 ms +- 0.1 ms
```

0.6% faster.

## hash(s+'a')

```
# 6+1
$ siphash13/opt/bin/pyperf timeit --compare-to main/opt/bin/python3 
--python-names siphash24:siphash13 -s 'b=b"x"*6' -- 'hash(b+b"a")'
siphash24: . 87.7 ns +- 2.8 ns
siphash13: . 84.8 ns +- 1.6 ns

Mean +- std dev: [siphash24] 87.7 ns +- 2.8 ns -> [siphash13] 84.8 ns +- 1.6 
ns: 1.03x faster

# 6000+1
$ siphash13/opt/bin/pyperf timeit --compare-to main/opt/bin/python3 
--python-names siphash24:siphash13 -s 'b=b"x"*6000' -- 'hash(b+b"a")'
siphash24: . 2.98 us +- 0.03 us
siphash13: . 1.67 us +- 0.03 us

Mean +- std dev: [siphash24] 2.98 us +- 0.03 us -> [siphash13] 1.67 us +- 0.03 
us: 1.78x faster
```

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2021-10-06 Thread Guido van Rossum



Guido van Rossum  added the comment:

Worth revisiting?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2021-10-06 Thread Guido van Rossum



Change by Guido van Rossum :


--
nosy: +gvanrossum

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2021-10-05 Thread Inada Naoki



Change by Inada Naoki :


--
pull_requests: +27097
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/28752

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2021-10-05 Thread Erlend E. Aasland



Change by Erlend E. Aasland :


--
nosy: +erlendaasland

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2017-03-06 Thread INADA Naoki


INADA Naoki added the comment:

microbench result: 
https://gist.github.com/methane/33c7b01c45ce23b67246f5ddaff9c9e7

Not significant ~ very little difference for small data.
For 4KB:
Median +- std dev: [python.default] 3.26 us +- 0.03 us -> [python.siphash13] 
2.03 us +- 0.02 us: 1.60x faster (-38%)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2017-03-01 Thread Serhiy Storchaka


Serhiy Storchaka added the comment:

And needed microbenchmarks for different sizes, because the result should be 
relied on fitting the data in caches of different levels.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2017-03-01 Thread STINNER Victor


STINNER Victor added the comment:

This issue is an optimization, but I still don't see any significant speedup 
:-) Would it be possible to write at least one microbenchmark showing a 
speedup? Maybe hash()?

--
nosy: +haypo

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2017-02-13 Thread Christian Heimes


Christian Heimes added the comment:

The old repos has a different name, https://github.com/tiran/cpython-mirror

I'm busy with work at the moment. I'll get to the patch later.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2017-02-13 Thread INADA Naoki


INADA Naoki added the comment:

Christian Heimes: Could you create pull request?

https://github.com/tiran/cpython/tree/siphash13 is 404 now.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2017-02-01 Thread INADA Naoki


INADA Naoki added the comment:

OK, let's forget Py_HASH_CUTOFF for now and focus on SipHash-13.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2017-02-01 Thread Christian Heimes


Christian Heimes added the comment:

Unless somebody is able to show a real-world example with a considerable 
performance boost (>5%), I'm strictly against any kind of special casing for 
short strings. I don't want to trade security for performance. You might be 
able to persuade me to go for a more complex system, when you can both proof 
security and performance boost.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2017-02-01 Thread INADA Naoki


INADA Naoki added the comment:

https://emboss.github.io/blog/2012/12/14/breaking-murmur-hash-flooding-dos-reloaded/

Murmur may be safe for <16 byte too.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2017-02-01 Thread INADA Naoki


INADA Naoki added the comment:

Py_HASH_CUTOFF uses different secret from siphash already.
The problem is the secret doesn't affects to collision at all.

Attacker can produce large number of collision, without knowing the secret.

BTW, we have FNV already.  Let's survey about FNV-1 short string collisions.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2017-02-01 Thread Christian Heimes


Christian Heimes added the comment:

We can and should still use hash randomization for short strings, but a 
different key than for hash randomization for longer strings.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2017-02-01 Thread Marc-Andre Lemburg


Marc-Andre Lemburg added the comment:

On 01.02.2017 13:03, INADA Naoki wrote:
> Maybe, we should remove Py_HASH_CUTOFF completely?

I think we ought to look for a better hash algorithm
for short strings, e.g. a CRC based one.

Some interesting investigations on this:
http://www.orthogonal.com.au/computers/hashstrings/
http://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed

PS: It may also be wise not to use the hash randomization
with these, since the secret would leak. Again, collision
counting comes to mind, since for short strings it is much
more important to have a good bucket distribution than
crypto security to protect hash secrets :-)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2017-02-01 Thread Christian Heimes


Christian Heimes added the comment:

Py_HASH_CUTOFF was an experimental feature for low-performance platforms. On 
modern hardware the performance increase is marginal to non-existing. I planned 
to deprecate the feature in 3.6 but forgot. We can deprecate it now and remove 
it in either 3.7 or 3.8.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2017-02-01 Thread INADA Naoki


INADA Naoki added the comment:

Current Py_HASH_CUTOFF implementation seems weak.
```
switch(len) {
/* ((hash << 5) + hash) + *p == hash * 33 + *p */
case 7: hash = ((hash << 5) + hash) + *p++; /* fallthrough */
...
case 1: hash = ((hash << 5) + hash) + *p++; break;
default:
assert(0);
}
hash ^= len;
hash ^= (Py_uhash_t) _Py_HashSecret.djbx33a.suffix;
x = (Py_hash_t)hash;
```

HashSecret used at last.  Conflicting pair without secret conflict with secrets 
too.

```
>>> def djbx33a(bs):
... h = 5381
... for i in bs:
... h = h*33+i
... h ^= len(bs)
... h ^~ secret
... return h & 0x___
...
>>> for i in [0, 3, 7]: # range(8)
...   for j in [0, 4, 7]: # range(8)
... for k in [0, 5, 7]: # range(8)
...   print(i,j,k, djbx33a([7-i, 7-j+33*i, 7-k+33*j, 33*k]))
...
0 0 0 6381700318
0 0 5 6381700318
0 0 7 6381700318
0 4 0 6381700318
...
7 4 7 6381700318
7 7 0 6381700318
7 7 5 6381700318
7 7 7 6381700318
>>>
```

So there are 8^(Py_HASH_CUTOFF-1) conflicting sequence can be built.

Maybe, we should remove Py_HASH_CUTOFF completely?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2017-02-01 Thread Marc-Andre Lemburg


Marc-Andre Lemburg added the comment:

On 01.02.2017 11:07, Serhiy Storchaka wrote:
> 
> The performance of hash algorithm shouldn't affect general benchmarks since 
> hash value is cached inside string object. Almost all dict lookups in 
> critical parts are lookups with interned strings. But in corner cases the 
> difference can be measurable. We should look not on results of 
> macrobenchmarks, but find worst cases.

Good point.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2017-02-01 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 01.02.2017 10:50, INADA Naoki wrote:
> 
>> it seems as if it would make sense to not use a fixed
>> hash algorithm for all strings lengths, but instead a
>> hybrid one to increase performance for short strings
>> (which are used a lot in Python).
>>
>> Is there a good hash algorithm with provides better
>> performance for short strings than siphash ?
> 
> There is undocumented option "Py_HASH_CUTOFF" to use DJBX33A for short string.
> 
> See https://github.com/python/cpython/blob/master/Include/pyhash.h#L98

Thanks. I found a reference in the PEP 456:

"""
...
However a fast function like DJBX33A is not as secure as SipHash24. A
cutoff at about 5 to 7 bytes should provide a decent safety margin and
speed up at the same time. The PEP's reference implementation provides
such a cutoff with Py_HASH_CUTOFF . The optimization is disabled by
default for several reasons. For one the security implications are
unclear yet and should be thoroughly studied before the optimization is
enabled by default. Secondly the performance benefits vary. On 64 bit
Linux system with Intel Core i7 multiple runs of Python's benchmark
suite [pybench] show an average speedups between 3% and 5% for
benchmarks such as django_v2, mako and etree with a cutoff of 7.
Benchmarks with X86 binaries and Windows X86_64 builds on the same
machine are a bit slower with small string optimization.

The state of small string optimization will be assessed during the beta
phase of Python 3.4. The feature will either be enabled with appropriate
values or the code will be removed before beta 2 is released.
"""

The mentioned speedup sounds like enabling this by default
would make a lot of sense (keeping the compile time option of
setting Py_HASH_CUTOFF to 0).

Aside: I haven't been following this since the days I proposed the
collision counting method as solution to the vulnerability, which was
rejected at the time. It's interesting how much effort went into
trying to fix the string hashing in dicts over the years. Yet the
much easier to use integer key hash collisions have still not
been addressed.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2017-02-01 Thread Serhiy Storchaka


Serhiy Storchaka added the comment:

The performance of hash algorithm shouldn't affect general benchmarks since 
hash value is cached inside string object. Almost all dict lookups in critical 
parts are lookups with interned strings. But in corner cases the difference can 
be measurable. We should look not on results of macrobenchmarks, but find worst 
cases.

There is also an alignment issue with the implementation of SipHash-2-4 (and I 
suppose with SipHash-1-3). See issue28055.

--
nosy: +serhiy.storchaka

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2017-02-01 Thread INADA Naoki


INADA Naoki added the comment:

> it seems as if it would make sense to not use a fixed
> hash algorithm for all strings lengths, but instead a
> hybrid one to increase performance for short strings
> (which are used a lot in Python).
>
> Is there a good hash algorithm with provides better
> performance for short strings than siphash ?

There is undocumented option "Py_HASH_CUTOFF" to use DJBX33A for short string.

See https://github.com/python/cpython/blob/master/Include/pyhash.h#L98

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

Re: [issue29410] Moving to SipHash-1-3

2017-02-01 Thread M.-A. Lemburg

On 01.02.2017 10:14, Christian Heimes wrote:
> 
> PEP 456 defines an API to add more hashing algorithms and make the selection 
> of hash algorithm a compile time option. We can easily add SipHash-1-3 and 
> make it the default algorithm. Vendors then can select between FNV2, 
> SipHash-1-3 and SipHash-2-4.

+1 on adding the 1-3 and making it the default; the faster
the better. Hash speed for strings needs to be excellent in Python
due to the many dict lookups we use in the interpreter.

Reading up a bit on the Rust thread and looking at this benchmark
which is mentioned in the thread:

https://imgur.com/5dKecOW

it seems as if it would make sense to not use a fixed
hash algorithm for all strings lengths, but instead a
hybrid one to increase performance for short strings
(which are used a lot in Python).

Is there a good hash algorithm with provides better
performance for short strings than siphash ?

> On another note should we add SipHash-2-4 and 1-3 PRF to the hashlib mode?

+1 as well.

-- 
Marc-Andre Lemburg
eGenix.com

___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2017-02-01 Thread Christian Heimes


Christian Heimes added the comment:

I added SipHash13 as additional hash algorithm in 
https://github.com/tiran/cpython/tree/siphash13 . Still need to verify the 
finalizer.

For hashlib I'd need to move to a different implementation of SipHash. The 
implementation in pyhash.c is optimized for speed and has a fixed key.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2017-02-01 Thread INADA Naoki


INADA Naoki added the comment:

> PEP 456 defines an API to add more hashing algorithms and make the selection 
> of hash algorithm a compile time option. We can easily add SipHash-1-3 and 
> make it the default algorithm. Vendors then can select between FNV2, 
> SipHash-1-3 and SipHash-2-4.

OK, I'll add siphash13, instead of replacing siphash24.

> On another note should we add SipHash-2-4 and 1-3 PRF to the hashlib mode?

siphash implementation uses 64bit unit.
I don't know how to implement partial update like: m.update(b'foo'); 
b.update(b'bar')

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2017-02-01 Thread INADA Naoki


INADA Naoki added the comment:

$ ../python.default -m perf compare_to default.json patched4.json -G
Slower (2):
- scimark_sor: 479 ms +- 8 ms -> 485 ms +- 9 ms: 1.01x slower (+1%)
- genshi_xml: 196 ms +- 2 ms -> 196 ms +- 2 ms: 1.00x slower (+0%)

Faster (19):
- json_loads: 62.7 us +- 0.5 us -> 61.5 us +- 1.1 us: 1.02x faster (-2%)
- django_template: 402 ms +- 3 ms -> 395 ms +- 4 ms: 1.02x faster (-2%)
- logging_format: 36.6 us +- 0.5 us -> 36.1 us +- 0.4 us: 1.01x faster (-1%)
- xml_etree_generate: 277 ms +- 5 ms -> 274 ms +- 5 ms: 1.01x faster (-1%)
- sympy_str: 461 ms +- 5 ms -> 456 ms +- 3 ms: 1.01x faster (-1%)
- call_method_unknown: 16.6 ms +- 1.2 ms -> 16.4 ms +- 0.2 ms: 1.01x faster 
(-1%)
- raytrace: 1.31 sec +- 0.02 sec -> 1.30 sec +- 0.01 sec: 1.01x faster (-1%)
- python_startup_no_site: 9.81 ms +- 0.02 ms -> 9.74 ms +- 0.02 ms: 1.01x 
faster (-1%)
- python_startup: 16.2 ms +- 0.0 ms -> 16.1 ms +- 0.0 ms: 1.01x faster (-1%)
- logging_simple: 31.2 us +- 0.3 us -> 31.0 us +- 0.4 us: 1.01x faster (-1%)
- 2to3: 742 ms +- 4 ms -> 739 ms +- 4 ms: 1.00x faster (-0%)
- hexiom: 22.7 ms +- 0.2 ms -> 22.6 ms +- 0.2 ms: 1.00x faster (-0%)
- call_simple: 14.2 ms +- 0.5 ms -> 14.2 ms +- 0.2 ms: 1.00x faster (-0%)
- sympy_integrate: 46.2 ms +- 0.3 ms -> 46.1 ms +- 0.4 ms: 1.00x faster (-0%)
- regex_compile: 434 ms +- 4 ms -> 433 ms +- 4 ms: 1.00x faster (-0%)
- deltablue: 17.7 ms +- 0.2 ms -> 17.7 ms +- 0.2 ms: 1.00x faster (-0%)
- chaos: 299 ms +- 3 ms -> 299 ms +- 3 ms: 1.00x faster (-0%)
- call_method_slots: 14.6 ms +- 0.2 ms -> 14.6 ms +- 0.1 ms: 1.00x faster (-0%)
- regex_dna: 283 ms +- 3 ms -> 283 ms +- 2 ms: 1.00x faster (-0%)

Benchmark hidden because not significant (43)


I'm creating siphash13 from reference implementation to double check the output.

--
keywords: +patch
Added file: http://bugs.python.org/file46476/siphash13.patch

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2017-02-01 Thread Christian Heimes


Christian Heimes added the comment:

PEP 456 defines an API to add more hashing algorithms and make the selection of 
hash algorithm a compile time option. We can easily add SipHash-1-3 and make it 
the default algorithm. Vendors then can select between FNV2, SipHash-1-3 and 
SipHash-2-4.

On another note should we add SipHash-2-4 and 1-3 PRF to the hashlib mode?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2017-02-01 Thread INADA Naoki


INADA Naoki added the comment:

I'm running pyperformance.

BTW, hashlib doesn't provide siphash24.  So can we simply replace siphash24 
with siphash13, like ruby?
https://github.com/ruby/ruby/commit/04c94f95d1a1c6a12f5412228a2bcdc00f5de3b2

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2017-01-31 Thread Raymond Hettinger


Raymond Hettinger added the comment:

+1 The discussions on the Rust and Ruby lists seem persuasive.

--
assignee:  -> christian.heimes
nosy: +christian.heimes, rhettinger

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2017-01-31 Thread INADA Naoki


Changes by INADA Naoki :


--
components: +Interpreter Core
type:  -> performance

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue29410] Moving to SipHash-1-3

2017-01-31 Thread INADA Naoki


New submission from INADA Naoki:

Rust and Ruby moved from SipHash-2-4 to SipHash-1-3.

https://github.com/rust-lang/rust/issues/29754
https://bugs.ruby-lang.org/issues/13017

SipHash-1-3 seems faster than 2-4 and secure enough.

--
messages: 286590
nosy: inada.naoki
priority: normal
severity: normal
status: open
title: Moving to SipHash-1-3
versions: Python 3.7

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

55 matches

Mail list logo