[issue23505] Urlparse insufficient validation leads to open redirect
Paul McMillan added the comment:

As Martin said, backporting a change for this wouldn't be appropriate for Python 2.7. The 2.7 docs could certainly be updated to make this clearer, but we can't introduce a breaking change like that to the stable release.
[issue23505] Urlparse insufficient validation leads to open redirect
Paul McMillan added the comment:

While some websites may use urlunparse(urlparse(url)) to validate a URL, this is far from standard practice. Django, for instance, does not use this method. While I agree we should clean this behavior up, this is not a vulnerability in core Python, and we need to invalidate the assigned CVE.

--
nosy: +PaulMcMillan
[issue14297] Custom string formatter doesn't work like builtin str.format
New submission from Paul McMillan:

In Python 2.7, I can use empty replacement fields ('{}') to indicate automatically numbered positional arguments when formatting with the builtin str.format().
http://docs.python.org/release/2.7/library/string.html#format-string-syntax

If I subclass string.Formatter, this expected functionality doesn't work.
http://docs.python.org/release/2.7/library/string.html#string-formatting

Python 2.7.2+ (default, Oct 4 2011, 20:06:09)
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from string import Formatter
>>> class MyFormatter(Formatter):
...     pass
...
>>> fs_a = 'a{0}b{1}c'
>>> fs_b = 'a{}b{}c'
>>> fs_a.format(1, 2)
'a1b2c'
>>> fs_b.format(1, 2)
'a1b2c'
>>> fmt = MyFormatter()
>>> fmt.format(fs_a, 1, 2)
'a1b2c'
>>> fmt.format(fs_b, 1, 2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/string.py", line 545, in format
    return self.vformat(format_string, args, kwargs)
  File "/usr/lib/python2.7/string.py", line 549, in vformat
    result = self._vformat(format_string, args, kwargs, used_args, 2)
  File "/usr/lib/python2.7/string.py", line 571, in _vformat
    obj, arg_used = self.get_field(field_name, args, kwargs)
  File "/usr/lib/python2.7/string.py", line 632, in get_field
    obj = self.get_value(first, args, kwargs)
  File "/usr/lib/python2.7/string.py", line 591, in get_value
    return kwargs[key]
KeyError: ''

--
messages: 155708
nosy: PaulMcMillan
priority: normal
severity: normal
status: open
title: Custom string formatter doesn't work like builtin str.format
versions: Python 2.7
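In the meantime, one possible workaround is sketched below (written against the 2.7 code path shown in the traceback; the class name and the _auto_index attribute are just for illustration, and it is not thread-safe):

from string import Formatter

class AutoNumberFormatter(Formatter):
    """Sketch: treat '{}' as '{0}', '{1}', ... by numbering empty
    field names as they are encountered."""

    def vformat(self, format_string, args, kwargs):
        self._auto_index = 0  # reset per call; not thread-safe as written
        return Formatter.vformat(self, format_string, args, kwargs)

    def get_value(self, key, args, kwargs):
        if key == '':
            # '' is what get_field() passes through for an empty field
            # name, which is why the base class raises KeyError('').
            key = self._auto_index
            self._auto_index += 1
        return Formatter.get_value(self, key, args, kwargs)

fmt = AutoNumberFormatter()
print(fmt.format('a{}b{}c', 1, 2))  # 'a1b2c' instead of KeyError: ''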
[issue13703] Hash collision security issue
Paul McMillan added the comment:

> I think you're asking a bit much here :-) A broken app is a broken
> app, no matter how nice Python tries to work around it. If an
> app puts too much trust into user data, it will be vulnerable
> one way or another and regardless of how the user data enters
> the app.

I notice your patch doesn't include fixes for the entire standard library to work around this problem. Were you planning on writing those, or leaving that for others?

As a developer, I honestly don't know how I can state with certainty that input data is clean or not, until I actually see the error you propose. I can't check validity before the fact, the way I can check for invalid unicode before storing it in my database. Once I see the error (probably only after my application is attacked, certainly not during development), it's too late. My application can't know which particular data triggered the error, so it can't delete it. I'm reduced to trial and error to remove the offending data, or to writing code that never stores more than 1000 things in a dictionary. And I have to accept that the standard library may not work on any particular data I want to process, and must write code that detects the error state and somehow magically removes the offending data.

The alternative, randomization, simply means that my dictionary ordering is not stable, something that is already the case. While I appreciate that the counting approach feels cleaner, randomization is the only solution that makes practical sense.
[issue13703] Hash collision security issue
Paul McMillan added the comment:

On Sat, Jan 21, 2012 at 3:47 PM, Alex Gaynor wrote:
> I'm able to put N pieces of data into the database on successive requests,
> but then *rendering* that data puts it in a dictionary, which renders that
> page unviewable by anyone.

This and the problems Frank mentions are my primary concerns about the counting approach. Without the original suggestion of modifying the hash and continuing without an exception (which has its own set of problems), the "valid data Python can't process" problem is a pretty big one. Allowing attackers to poison interactions for other users is unacceptable.

The other thing I haven't seen mentioned yet is that while it is true that most web applications do have robust error handling to produce proper 500s, an unexpected error will usually result in restarting the server process - something that can carry significant weight by itself. I would consider it a serious problem if every attack request required a complete application restart, a la original CGI.

I'm strongly in favor of randomization. While there are many broken applications in the wild that depend on dictionary ordering, if we ship with this feature disabled by default for security and bugfix branches, and enable it for 3.3, users can opt in to protection as they need it and as they fix their applications. Users who have broken applications can still safely apply the security fix (without even reading the release notes) because it won't change the default behavior. Distro managers can make an appropriate choice for their user base. Most importantly, it negates the entire "compute once, attack everywhere" class of collision problems, even if we haven't explicitly discovered them.
[issue13703] Hash collision security issue
Paul McMillan added the comment:

> Christian Heimes added the comment:
> Ouch, the startup impact is large! Have we reached a point where "one size
> fits all" doesn't work any longer? It's getting harder to have just one
> executable for 500ms scripts and server processes that last for weeks.

This concerns me too, and is one reason I think the collision counting code might be the winning solution. Randomness is hard to do correctly and is expensive. If we can avoid it, we should try very hard to do so...

> Christian Heimes said to Marc-Andre:
> Have you profiled your suggestion? I'm interested in the speed implications.
> My gut feeling is that your idea could be slower, since you have added more
> instructions to a tight loop that is executed on every lookup, insert, update
> and deletion of a dict key. The hash modification could have a smaller
> impact, since the hash is cached. I'm merely speculating here until we have
> some numbers to compare.

Interesting point, though I think we might be able to work it out so that we're only adding instructions when there's actually a detected collision. I'll be interested to see what the benchmarks (and real world) have to say about the impacts of randomization as compared to the existing black-magic optimization of the hash function.
[issue13703] Hash collision security issue
Paul McMillan added the comment:

> Alex, I agree the issue has to do with the origin of the data, but the
> modules listed are the ones that deal with the data supplied by this
> particular attack.

They deal directly with the data. Do any of them pass the data further, or does the data stop with them?

A short and very incomplete list of vulnerable standard lib modules includes: every single parsing library (json, xml, html, plus all the third party libraries that do that), all of numpy (because it processes data which probably came from a user [yes, integers can trigger the vulnerability]), difflib, the math module, most database adaptors, anything that parses metadata (including commonly used third party libs like PIL), the tarfile lib along with other compressed format handlers, the csv module, robotparser, plistlib, argparse, pretty much everything under the heading of "18. Internet Data Handling" (email, mailbox, mimetypes, etc.), "19. Structured Markup Processing Tools", "20. Internet Protocols and Support", "21. Multimedia Services", "22. Internationalization", Tkinter, and all the os calls that handle filenames.

The list is impossibly large, even if we completely ignore user code. This MUST be fixed at a language level. I challenge you to find me 15 standard lib components that are certain to never handle user-controlled input.

> Note that changing the hash algorithm for a persistent process, even though
> each process may have a different seed or randomized source, allows attacks
> for the life of that process, if an attack vector can be created during its
> lifetime. This is not a problem for systems where each request is handled by
> a different process, but is a problem for systems where processes are
> long-running and handle many requests.

This point has been made many times now. I urge you to read the entire thread on the mailing list. Your implementation is impractical because your "safe" implementation completely ignores all hash caching (each entry must be re-hashed for that dict). Your implementation is still vulnerable in exactly the way you mentioned if you ever have any kind of long-lived dict in your program thread.

> You have entered the class of people that claim lots of vulnerabilities,
> without enumeration.

I have enumerated. Stop making this argument.
[issue13703] Hash collision security issue
Paul McMillan added the comment:

> An attack can be based on trying to find many objects with the same
> hash value, or trying to find many objects that, as they get inserted
> into a dictionary, very often cause collisions due to the collision
> resolution algorithm not finding a free slot.

Yep. Allowing an attacker to produce very large dictionaries is also bad.

> if the application
> puts too much trust into large blobs of input data - which is
> the actual security issue we're trying to work around here...

To be very clear, the issue is ANY large blob of data anywhere in the application, not just on input. An attack could happen after whatever transform your application runs on the data before returning it.

> I'll upload a patch that demonstrates the collision counting
> strategy to show that detecting the problem is easy. Whether
> just raising an exception is a good idea, is another issue.

I'm in cautious agreement that collision counting is a better strategy. The dict implementation performance would suffer from randomization.

> The dict implementation could then alter the hash parameter
> and recreate the dict table in case the number of collisions
> exceeds a certain limit, thereby actively taking action
> instead of just relying on randomness solving the issue in
> most cases.

This is clever. You basically neuter the attack as you notice it, but everything else is business as usual. I'm concerned that this may end up being costly in some edge cases (e.g. look up how many collisions it takes to force the recreation, and then aim for just that many collisions, many times). Unfortunately, each dict object has to discover for itself that it's full of offending hashes. Another approach would be to neuter the offending object by changing its hash, but this would require either returning multiple values, or fixing up existing dictionaries, neither of which seems feasible.
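To make the trade-offs concrete, here is a rough pure-Python sketch of the counting strategy as I understand it (the real change would live in the C dict implementation; the limit value and the exception name are placeholders, not a proposal):

class TooManyCollisions(Exception):
    pass

COLLISION_LIMIT = 1000  # placeholder value, not a recommendation

def insert(table, key, value, mask):
    """Toy open-addressing insert that counts probe collisions.

    table is a list of (key, value) slots or None; mask is len(table) - 1,
    and the table is assumed to have at least one free slot. This mirrors
    the idea of the counting approach, not any actual patch.
    """
    i = hash(key) & mask
    collisions = 0
    while table[i] is not None and table[i][0] != key:
        collisions += 1
        if collisions > COLLISION_LIMIT:
            # The counting strategy bails out here; the alternative
            # discussed above is to re-seed the hash and rebuild instead.
            raise TooManyCollisions(key)
        i = (i + 1) & mask  # linear probing for simplicity; CPython perturbs
    table[i] = (key, value)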
[issue13703] Hash collision security issue
Paul McMillan added the comment:

> Those who use or advocate a simple randomized starting hash (Perl, Ruby,
> perhaps MS, and the CCC presenters) are presuming that the randomized hash
> values are kept private. Indeed, they should be (and the docs could note
> this) unless an attacker has direct access to the interpreter.

Except that this is patently untrue. Anytime any programmer iterates over a dictionary and returns the ordered result to the user in some form, they're leaking information about the hash value. I hope you're not suggesting that any programmer who is concerned about security will make sure to sort the results of every iteration before making it public in some fashion.

> I do not think we, as Python developers, should be concerned about esoteric
> timing attacks.

Timing attacks are less esoteric than you think they are. This issue gets worse, not better, as the internet moves (for better or worse) towards virtualized computing.

> And if hashing were randomized per process, and probes were randomly
> distributed among processes, and processes were periodically killed and
> restarted with new seeds, could such an attack get anywhere...

You're suggesting that in order for a Python application to be secure, it's a requirement that we randomly kill and restart processes from time to time? I thought we were going for a good solution here, not a hacky workaround.

> We could also consider, for 3.3, making the output of hash() be different
> from the internal values used for dicts, perhaps by switching random seeds in
> hash(). So even if someone does return hash(x) values to potential attackers,
> they are not the values used in dicts. (This would require a slight change in
> the doc.)

This isn't a bad idea, but I'd be fine documenting that the output of hash() shouldn't be made public.
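To make the ordering leak concrete, a toy example (this assumes the pre-randomization, pre-3.6 CPython dict layout and ignores probe collisions, so treat it as indicative only):

# On e.g. CPython 2.7, a small dict has 8 slots, each key initially lands
# in slot hash(key) & 7, and iteration walks the slots in order, so the
# order a client sees correlates with the low bits of the hash values.
d = {}
for name in ("alpha", "bravo", "charlie", "delta"):
    d[name] = True

print(list(d))                   # ordering determined by the hash values
print([hash(k) & 7 for k in d])  # ...which tracks these slot indexes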
[issue13703] Hash collision security issue
Paul McMillan added the comment:

As Alex said, Java has refused to fix the issue. I believe that Ruby 1.9 (at least the master branch code that I looked at) is using murmurhash2 with a random seed. In either case, yes, these functions are vulnerable to a number of attacks. We're solving the problem more completely than they did.
[issue13703] Hash collision security issue
Paul McMillan added the comment:

Marc-Andre: Victor already pasted the relevant part of my code:
http://bugs.python.org/issue13703#msg150568

The link to the fuller version, with revision history and a copy of the code before I modified it, is here:
https://gist.github.com/0a91e52efa74f61858b5

> Why? The attack doesn't work with short strings? What do you call a "short
> string"?

Well, the demonstrated collision is for 16 character ASCII strings. Worst case UTF-8, we're looking at 3 manipulable bytes per character, but they may be harder to collide since some of those bytes are fixed.

> only be making it harder for script kiddies, but as soon as someone
> crypt-analysis the used hash algorithm, you're lost again.

Not true. What I propose is to make the amount of information necessary to analyze and generate collisions impractically large. My proposed hash function is certainly broken if you brute force the lookup table. There are undoubtedly other problems with it too. The point is that it's hard enough. We aren't going for perfect security - we're going for enough to make this attack impractical.

What are the downsides to counting collisions? For one thing, it's something that needs to be kept track of on a per-dict basis, and can't be cached the way the hash results are. How do you choose a good value for the limit? If you set it to something conservative, you still pay the collision price every time a dict is created to discover that the keys collide. This means that it's possible to feed bad data up to exactly the limit, and suddenly the Python app is inexplicably slow. If you set the limit too aggressively, then sometimes valid data gets caught, and Python randomly dies in hard to debug ways with an error the programmer has never seen in testing and cannot reproduce. It adds a new way to kill most Python applications, and so programs are going to have to be re-written to cope with it. It also introduces a new place to cause errors - if the WSGI server dies, it's hard for my application to catch that and recover gracefully.

> ... not in Python itself, but if you consider all the types in Python
> extensions and classes implementing __hash__ in user code, the number
> of hash functions to fix quickly becomes unmanageable.

When we looked at the Django project, we wouldn't have anything to fix, since ours end up relying on the Python internal values eventually. I suspect a lot of other code is similar.

Mark said:
> What is the mechanism by which the attacker can determine the seeds?

The biggest information leak is probably the ordering in which dict entries are returned. This can be used to deduce the underlying hash values. This is much easier than trying to do it via timing.

> But that's not the issue we are supposed to be dealing with.
> A single (genuinely random) seed will deal with the attack described in
> the talk and it is (almost) as fast as using 0 as a seed.

This is not true. A single random seed shifts the hash table, but does not actually prevent an attacker from generating collisions. Please see my other posts on the topic here and on the mailing list.
[issue13703] Hash collision security issue
Paul McMillan added the comment:

> - for small strings we could use a different seed than for larger strings

Or just leave them unseeded with our existing algorithm. Shifting them into a different part of the hash space doesn't really gain us much.

> - for larger strings we could use Paul's algorithm but limit the XOR op to
> the first and last 16 elements instead of all elements.

Agreed. It does have to be both the first and the last though. We can't just do one or the other.
[issue13703] Hash collision security issue
Paul McMillan added the comment:

> My proposition only adds two XOR to hash(str) (outside the loop on Unicode
> characters), so I expect a ridiculous overhead. I don't know yet how hard it
> is to guess the secret from hash(str) output.

It doesn't work much better than a single random seed. Calculating the hash of a null byte gives you the XOR of your two seeds. An attacker can still cause collisions inside the vulnerable hash function; your change doesn't negate those internal collisions. Also, strings of all null bytes collide trivially.
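To spell the two problems out, here is a toy model (this is an assumption on my part about the shape of the proposal, hash(str) computed as prefix ^ H(s) ^ suffix with H the existing unseeded string hash; it is not a quote of the actual patch):

import os
import struct

def H(s):
    # Pure-Python model of the existing (unseeded) 2.x string hash,
    # included only to make the example self-contained.
    if not s:
        return 0
    x = ord(s[0]) << 7
    for c in s:
        x = ((1000003 * x) ^ ord(c)) & 0xFFFFFFFFFFFFFFFF
    x ^= len(s)
    return x

prefix, suffix = struct.unpack('<QQ', os.urandom(16))

def wrapped_hash(s):
    # The scheme under discussion, as I understand it: seeds applied
    # outside the existing hash rather than mixed into every step.
    return prefix ^ H(s) ^ suffix

# 1) Leak: H('\x00') is a fixed, public constant, so observing one hash
#    value hands the attacker prefix ^ suffix directly.
assert wrapped_hash('\x00') ^ H('\x00') == prefix ^ suffix

# 2) Collisions survive: if H(a) == H(b), then wrapped_hash(a) ==
#    wrapped_hash(b) no matter what the seeds are, so a precomputed
#    collision set keeps working.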
[issue13703] Hash collision security issue
Paul McMillan added the comment:

This is not something that can be fixed by limiting the size of POST/GET. Parsing documents (even offline) can generate these problems. I can create books that calibre (a Python-based ebook format-shifting tool) can't convert, but are otherwise perfectly valid for non-Python devices. If I'm allowed to insert usernames into a database and you ever retrieve those in a dict, you're vulnerable. If I can post things one at a time that eventually get parsed into a dict (like the tag example), you're vulnerable. I can generate web traffic that creates log files that are unparsable (even offline) in Python if dicts are used anywhere.

Any application that accepts data from users needs to be considered. Even if the web framework has a dictionary implementation that randomizes the hashes so it's not vulnerable, the entire Python standard library uses dicts all over the place. If this is a problem which must be fixed by the framework, they must reinvent every standard library function they hope to use.

Any non-trivial Python application which parses data needs the fix. The entire standard library needs the fix if it is to be relied upon by applications which accept data. It makes sense to fix Python. Of course we must fix all the basic hashing functions in Python, not just the string hash. There aren't that many.

Marc-Andre: If you look at my proposed code, you'll notice that we do more than simply shift the period of the hash. It's not trivial for an attacker to create collisions without knowing the key.

Since speed is a concern, I think that the proposal to avoid using the random hash for short strings is a good idea. Additionally, randomizing only some of the characters in longer strings will allow us to improve security without compromising speed significantly.

I suggest that we don't randomize strings shorter than 6 characters. For longer strings, we randomize the first and last 5 characters. This means we're only adding additional work to a max of 10 rounds of the hash, and only for longer strings. Collisions with the hash from short strings should be minimal.
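A rough pure-Python model of what I'm describing (illustrative only: the real change belongs in C, the table r stands in for the random data read from urandom, and the table size and the way r is indexed here are arbitrary choices for the sketch):

import os
import struct

# r: table of random per-position values, generated once at startup.
# 256 64-bit entries here just to keep the example small.
r = struct.unpack('<256Q', os.urandom(256 * 8))

MASK = 0xFFFFFFFFFFFFFFFF  # model a 64-bit build

def randomized_hash(s):
    """Sketch: existing string hash, but characters in the first and
    last 5 positions of strings >= 6 chars are XORed with random values."""
    if not s:
        return 0
    x = ord(s[0]) << 7
    n = len(s)
    for i, c in enumerate(s):
        v = ord(c)
        if n >= 6 and (i < 5 or i >= n - 5):
            # only these positions pay for the extra XOR
            v ^= r[(v + i) % len(r)]
        x = ((1000003 * x) ^ v) & MASK
    x ^= n
    return x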
[issue13703] Hash collision security issue
Paul McMillan added the comment:

A couple of things here:

First, my proposed change is not cryptographically secure. There simply aren't any cryptographic hashing algorithms available that are in the performance class we need. My proposal does make the collision attack quite difficult to carry out, even if the raw output values from the hash are available to the attacker (they should not usually be).

I favor using the same algorithm between 32 and 64 bit builds for consistency of behavior. Developers would be startled to find that ordering stays consistent on a 64 bit build but varies on 32 bit builds. Additionally, the impracticality of attacking 64 bit builds rests on the fact that these particular researchers didn't devise a way to do it. I'd hate to make this change and then have a clever mathematician publish some elegant point requiring us to go fix the problem all over again.

I could be convinced either way on small strings. I like that they can't be used to attack the secret. At the same time, I worry that combining 2 different hashing routines into the same output space may introduce unexpected collisions and other difficult to debug edge-case conditions. It also means that the order of the hashes of long strings will vary while the order of short strings will not - another inconsistency which will encourage bugs.

Thank you Victor for the improvements to the Python demonstration code. As Antoine said, it's only demo code, but better demo code is always good.

Antoine: That section of the manpage is referring to the overall security of a key generated using urandom. 256 bits is overkill for this application. We could take 256 bits and use them to generate a key using a cryptographically appropriate algorithm, but it's simpler to read more bits and use them directly as the key. Additionally, that verbiage has been in the man page for urandom for quite some time (probably since the earliest version in the mid 90's). The PRNG has been improved since then.

Minimum length of r is a hard question. The shorter it is, the higher the correlation of the output. In my tests, 16kb was the amount necessary to generally do reasonably well on my test suite for randomness, even with problematic input. Obviously our existing output isn't random, so it doesn't pass those tests at all. Using a fairly small value (4k) should not make the results much worse from a security perspective, but might be problematic from a collision/distribution standpoint. It's clear that we don't need cryptographically good randomness here, but passing the test suite is not a bad thing when considering the distribution.

When we settle on a C implementation, I'd like to run it through the smhasher set of tests to make sure we aren't making distribution worse, especially for very small values of r.
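For anyone who wants to poke at the distribution question before we get to smhasher, a crude stand-in check along these lines is what I mean by "doing reasonably well" (this is not my actual test suite, just a sketch; names and the bucket count are arbitrary):

import os
from collections import Counter

def bucket_counts(hash_func, keys, num_buckets=2048):
    """Count how many keys land in each of num_buckets dict-style buckets."""
    return Counter(hash_func(k) % num_buckets for k in keys)

def chi_squared(counts, num_keys, num_buckets=2048):
    """Plain chi-squared statistic against a uniform expectation.
    Values wildly above num_buckets suggest a badly skewed hash."""
    expected = float(num_keys) / num_buckets
    return sum((counts.get(b, 0) - expected) ** 2 / expected
               for b in range(num_buckets))

# Example: random 16-byte keys should look roughly uniform; strings built
# to collide will blow the statistic up.
keys = [os.urandom(16) for _ in range(100000)]
counts = bucket_counts(hash, keys)
print(chi_squared(counts, len(keys)))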
[issue13703] Hash collision security issue
Paul McMillan added the comment:

I agree that we should enable randomness by default, and provide an easy way for users to disable it if necessary (unit test suites that explicitly depend on order being an obvious candidate). I'll link my proposed algorithm change here, for the record:
https://gist.github.com/0a91e52efa74f61858b5

I've gotten confirmation from several other sources that the fix recommended by the presenters (just a random initialization seed) only prevents the most basic form of the attack.

--
nosy: +PaulMcMillan