[issue43882] [security] CVE-2022-0391: urllib.parse should sanitize urls containing ASCII newline and tabs.
Mike Lissner added the comment: Looks like that CVE isn't public yet. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2022-0391 Any chance I can get access? (I originally reported this vuln.) My email is m...@free.law, if it's possible and my email is needed. Thanks! -- ___ Python tracker <https://bugs.python.org/issue43882> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue43882] [security] urllib.parse should sanitize urls containing ASCII newline and tabs.
Mike Lissner added the comment:

> With the fix for this bug, urlsplit silently removes (some of) those
> characters before we can replace them, modifying the output of our
> sanitisation code

I don't have any good solutions for 3.9.5, but going forward, this feels like another example of why we should just do parsing right (the way browsers do). That'd maintain tabs and whatnot in your output, and it'd fix the security issue by putting `java\nscript` into the scheme attribute instead of the path.

> One solution that presents itself to me: add a `strip_insecure_characters:
> bool = True` parameter.

Doesn't this lose sight of what this tool is supposed to do? It's not supposed to have a good (new, correct) and a bad (old, obsolete) way of parsing. Woe unto whoever has to write the documentation for that parameter.

Also, I should reiterate that these aren't "insecure" characters, so if we did have a parameter for this, it'd be more like `do_rfc_3986_parsing` or maybe `do_naive_parsing`. The chars aren't insecure in themselves. They're fine. Python just gets tripped up on them.
[issue43882] [security] urllib.parse should sanitize urls containing ASCII newline and tabs.
Mike Lissner added the comment:

> I'd wonder how to pass through valid exceptions without urlparse raising
> something.

Oops, meant to say "valid URLs", not valid exceptions, sorry.
[issue43882] [security] urllib.parse should sanitize urls containing ASCII newline and tabs.
Mike Lissner added the comment:

> Instead of the patches as you see them, we could've raised an exception.

In my mind the definition of a valid URL is what browsers recognize. They're moving towards the WHATWG definition, and so too must we. If we make Python raise an exception when a URL has a newline in the scheme (e.g., "htt\np"), we'd be raising exceptions for *valid* URLs as browsers define them. That doesn't seem right at all to me. I'd be frustrated to have to catch such an exception, and I'd wonder how to pass through valid exceptions without urlparse raising something.

> Making the output 'sanitized' means that invalid input is converted into
> valid output. This goes against the principle of least surprise.

Well, not quite, right? The URLs this fixes *are* valid according to browsers. Browsers say these tabs and newlines are OK.

I agree though that there's an issue with the approach of stripping input in a way that affects output. That doesn't seem right.

I think the solution I'd favor (and I imagine what's coming in 43883) is to do this properly so that newlines are preserved in the output, but so that the scheme is also placed properly in the scheme attribute. So instead of this (from the initial report):

> In [9]: from urllib.parse import urlsplit
> In [10]: urlsplit("java\nscript:alert('bad')")
> Out[10]: SplitResult(scheme='', netloc='', path="java\nscript:alert('bad')",
> query='', fragment='')

We get something like this:

> In [10]: urlsplit("java\nscript:alert('bad')")
> Out[10]: SplitResult(scheme='java\nscript', netloc='', path="alert('bad')",
> query='', fragment='')

In other words, keep the funky characters and parse properly.
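A minimal sketch of the input under discussion, assuming only the standard `urllib.parse` module. The helper names are hypothetical, and the character tuple is the set the WHATWG URL spec calls "ASCII tab or newline". What `urlsplit` returns for the unmodified input depends on whether the interpreter includes the fix from this issue, so no output is shown for that call.

```python
from urllib.parse import urlsplit

# The characters the WHATWG URL spec calls "ASCII tab or newline".
ASCII_TAB_OR_NEWLINE = ("\t", "\n", "\r")


def contains_tab_or_newline(url):
    """Hypothetical helper: report whether the URL holds any such character."""
    return any(ch in url for ch in ASCII_TAB_OR_NEWLINE)


def strip_tab_or_newline(url):
    """Hypothetical helper: drop those characters before parsing."""
    for ch in ASCII_TAB_OR_NEWLINE:
        url = url.replace(ch, "")
    return url


url = "java\nscript:alert('bad')"

# Version dependent: with the fix the scheme is detected, without it the
# whole string lands in the path.
print(urlsplit(url))

# Stripping explicitly gives the same answer on either kind of interpreter.
cleaned = urlsplit(strip_tab_or_newline(url))
print(cleaned.scheme, cleaned.path)
```

Checking `contains_tab_or_newline(url)` before parsing lets calling code decide for itself whether to reject, log, or clean such input, rather than relying on what a given Python release does.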
[issue43882] [security] urllib.parse should sanitize urls containing ASCII newline and tabs.
Mike Lissner added the comment: I haven't watched that Blackhat presentation yet, but from the slides, it seems like the fix is to get all languages parsing URLs the same as the browsers. That's what @orsenthil has been doing here and plans to do in https://bugs.python.org/issue43883. Should we get a bug filed with requests/urllib3 too? Seems like a good idea if it suffers from the same problems.
[issue43882] urllib.parse should sanitize urls containing ASCII newline and tabs.
Change by Mike Lissner : -- nosy: +Mike.Lissner
[issue43883] Making urlparse WHATWG conformant
Change by Mike Lissner : -- nosy: +Mike.Lissner
[issue29315] \b requires raw strings or to be escaped. Update docs with that hint?
New submission from Mike Lissner:

I just ran into a funny corner case I imagine others are aware of. When you write "\b" in Python, it is a single character: "\x08". So if you try to write a regex like:

    words = '\b(.*)\b'

That won't work. But using a raw string will:

    words = r'\b(.*)\b'

As will escaping it in this horrible fashion:

    words = '\\b(.*)\\b'

I believe this doesn't affect any of the other regex flags, so I wonder if it's worth adding something to the docs to warn about this. I just spent a bunch of time trying to figure out why it seemed like \b wasn't working. A little tip in the docs would have gone a LONG way.

--
assignee: docs@python
components: Documentation
messages: 285751
nosy: Mike.Lissner, docs@python
priority: normal
severity: normal
status: open
title: \b requires raw strings or to be escaped. Update docs with that hint?
type: enhancement
versions: Python 2.7, Python 3.3, Python 3.4, Python 3.5, Python 3.6, Python 3.7
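A minimal sketch of the corner case described in the report, using only the standard `re` module. The sample text is an assumption chosen for illustration and does not come from the original submission.

```python
import re

text = "hello world"  # sample input, not from the original report

plain = "\b(.*)\b"      # backspace, group, backspace
raw = r"\b(.*)\b"       # word boundary, group, word boundary
escaped = "\\b(.*)\\b"  # the same pattern as the raw string

# In a normal string literal, "\b" is the single character "\x08".
print(len("\b"), "\b" == "\x08")   # 1 True

plain_match = re.search(plain, text)
raw_match = re.search(raw, text)

print(plain_match)           # None: the text contains no backspace
print(raw_match.group(1))    # hello world
print(escaped == raw)        # True
```

The plain pattern is still a valid regex, which is why nothing fails loudly: it simply looks for literal backspace characters around the group.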
[issue10682] With '*args' or even bare '*' in def/call argument list, trailing comma causes SyntaxError
Mike Lissner added the comment:

This is an old issue, but where I run into it frequently is when I use the format function and string interpolation. For example, this throws a SyntaxError:

    "The name of the person is {name_first} {name_last}".format(
        **my_obj.__dict__,
    )

Because strings tend to be fairly long, it's pretty common that the arguments to format end up on their own line. I was always taught to use trailing commas in Python, and I'm fanatical about ensuring they're there. It's a smart part of the language that saves you from many bugs and much typing when copy/pasting/tweaking.

This is the first time I've ever run into an implementation bug in CPython, and at least from the post on StackOverflow, this looks like the parser isn't obeying the grammar: https://stackoverflow.com/questions/16950394/python-why-is-this-invalid-syntax

--
nosy: +Mike.Lissner
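A minimal sketch of the call from the comment, made self-contained. The `Person` class and its values are hypothetical stand-ins for `my_obj`. The sketch assumes a current Python 3 release, where the trailing comma after `**` unpacking parses; on the interpreters discussed in this issue the first call raised SyntaxError. The `format_map` variant avoids unpacking syntax altogether.

```python
class Person:
    """Hypothetical stand-in for the `my_obj` in the comment above."""

    def __init__(self, name_first, name_last):
        self.name_first = name_first
        self.name_last = name_last


my_obj = Person("Ada", "Lovelace")

# Trailing comma after ** unpacking, as in the original example.
message = "The name of the person is {name_first} {name_last}".format(
    **my_obj.__dict__,
)

# format_map takes the mapping directly, so no unpacking is involved.
same_message = "The name of the person is {name_first} {name_last}".format_map(
    vars(my_obj),
)

print(message)
print(message == same_message)
```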
[issue22118] urljoin fails with messy relative URLs
Mike Lissner added the comment: Just hopping in here to say that the work going down here is beautiful. I've filed a lot of bugs. This one's not particularly difficult, but damn, I appreciate the speed and quality going into fixing it. Glad to see the Python language is a happy place with fast, quality fixes.
[issue22118] urljoin fails with messy relative URLs
Mike Lissner added the comment:

@pitrou, I haven't delved into URLs in a long while, but the general idea is:

    scheme://domain:port/path?query_string#fragment_id

When would it ever make sense to have something up a level from the root of the domain?
[issue22118] urljoin fails with messy relative URLs
Mike Lissner added the comment: @demian.brecht, that'd make me very pleased if you took this over. I won't have time to devote to it, unfortunately.
[issue22118] urljoin fails with messy relative URLs
New submission from Mike Lissner:

Not sure if this is desired behavior, but it's making my code break, so I figured I'd get it filed.

I'm trying to crawl this website: https://www.appeals2.az.gov/ODSPlus/recentDecisions2.cfm

Unfortunately, most of the URLs in the HTML are relative, taking the form:

    '../../some/path/to/some/pdf.pdf'

I'm using lxml's make_links_absolute() function, which calls urljoin, creating invalid URLs like:

    https://www.appeals2.az.gov/../Decisions/CR20130096OPN.pdf

If you put that into Firefox or wget or whatever, it works, despite being invalid and making no sense.

**It works because those clients fix the problem,** joining the invalid path and the URL into:

    https://www.appeals2.az.gov/Decisions/CR20130096OPN.pdf

I know this will mean making urljoin have a workaround to fix bad HTML, but this seems to be what wget, Chrome, Firefox, etc. all do.

I've never filed a Python bug before, but is this something we could consider?

--
components: Library (Lib)
messages: 224500
nosy: Mike.Lissner
priority: normal
severity: normal
status: open
title: urljoin fails with messy relative URLs
type: behavior
versions: Python 2.7
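A minimal sketch of the join described in the report, assuming Python 3.5 or later, where `urljoin` follows RFC 3986 and discards `..` segments that would climb above the root. The report itself concerns Python 2.7, which kept the extra `..`. The relative link below is an assumption inferred from the URLs quoted in the report.

```python
from urllib.parse import urljoin

base = "https://www.appeals2.az.gov/ODSPlus/recentDecisions2.cfm"

# Assumed relative link: one ".." more than the base path has directories.
relative = "../../Decisions/CR20130096OPN.pdf"

joined = urljoin(base, relative)
print(joined)
# On Python 3.5+:
# https://www.appeals2.az.gov/Decisions/CR20130096OPN.pdf
```

The base path has a single directory (`ODSPlus`), so the second `..` has nowhere to go; dropping it is what browsers and wget do as well.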
[issue22118] urljoin fails with messy relative URLs
Mike Lissner added the comment:

FWIW, the workaround that I've just created for this problem is this:

    import re
    from urllib.parse import urlsplit, urlunsplit  # Python 2.7: from urlparse import ...

    u = 'https://www.appeals2.az.gov/../Decisions/CR20130096OPN.pdf'
    # Split the url and rejoin it, nuking any '/..' patterns at the
    # beginning of the path.
    s = urlsplit(u)
    urlunsplit(s[:2] + (re.sub(r'^(/\.\.)+', '', s.path),) + s[3:])
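The workaround can be wrapped in a small reusable function. This is a sketch, not the fix that landed in the standard library: the function name is hypothetical, and the pattern adds a lookahead (a change from the comment's regex) so that a first segment that merely starts with two dots, such as `/..foo`, is left alone.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# One or more leading "/.." segments, each followed by "/" or the end of
# the path. The lookahead is an addition to the regex in the comment.
_LEADING_PARENT_REFS = re.compile(r"^(/\.\.)+(?=/|$)")


def strip_leading_parent_refs(url):
    """Hypothetical helper: drop '/..' segments at the start of the path."""
    parts = urlsplit(url)
    cleaned_path = _LEADING_PARENT_REFS.sub("", parts.path)
    return urlunsplit(parts._replace(path=cleaned_path))


print(strip_leading_parent_refs(
    "https://www.appeals2.az.gov/../Decisions/CR20130096OPN.pdf"
))
# https://www.appeals2.az.gov/Decisions/CR20130096OPN.pdf
```

URLs without a leading `/..` pass through unchanged, so the function can be applied to every link a crawler collects.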