[issue1583] Patch for signal.set_wakeup_fd
Adam Olsen added the comment:

signalmodule.c has a hack to limit it to the main thread. Otherwise there's all sorts of platform-specific behaviour.
[issue1583] Patch for signal.set_wakeup_fd
Adam Olsen added the comment:

signal-safe is different from thread-safe (despite conceptual similarities), but regardless it's been a long time since I last delved into this so I'm quite rusty. I could be doing it all wrong.
[issue1583] Patch for signal.set_wakeup_fd
Adam Olsen added the comment:

Converting to/from sig_atomic_t could have a compile-time check on currently supported platforms, and isn't buggy for them. For platforms with a different size you could do a runtime check, only allowing an fd in the range of 0-254 (with 255 reserved); that could sometimes fail, yes, but at least it's an explicit, easily understood failure. Just using int would fail in undefined ways down the road, likely writing to a random fd instead (corrupting whatever it was doing), with no way to trace it back.

Unpacking the int would mean having one sig_atomic_t for 'invalid', using that instead of INVALID_FD, plus an array of sig_atomic_t for the fd itself. Every time you want to change the fd you first set the 'invalid' flag, then the individual bytes, then clear 'invalid'.
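A rough sketch of that unpack-the-int protocol, modeled in Python for readability (the real thing would be C in signalmodule.c, with `invalid` and each element of `fd_bytes` declared volatile sig_atomic_t; all names here are illustrative, not from any patch):

    INVALID, VALID = 1, 0
    invalid = INVALID        # flag checked by the signal handler
    fd_bytes = [0, 0, 0, 0]  # one "sig_atomic_t" per byte of the int

    def set_wakeup_fd(fd):
        # Main thread: publish a new fd one byte at a time.
        global invalid
        invalid = INVALID                    # 1. mark the fd unusable
        for i in range(4):                   # 2. store each byte individually
            fd_bytes[i] = (fd >> (8 * i)) & 0xFF
        invalid = VALID                      # 3. mark it usable again

    def read_wakeup_fd():
        # Signal handler: reassemble the fd, or bail out mid-update.
        if invalid:
            return None
        return sum(fd_bytes[i] << (8 * i) for i in range(4))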
[issue1583] Patch for signal.set_wakeup_fd
Adam Olsen added the comment:

Disagree; if you're writing signal-handling code you should be very careful to do it properly, even if that's only proper for your current platform. If you can't do it properly you should find an alternative that doesn't involve signals. The fact that sig_atomic_t is only 1 byte on VxWorks strongly implies using int WILL fail in strange ways on that platform.

I can see three options:

1) use pycore_atomic.h, implementing it for VxWorks if you haven't already. This also implies sig_atomic_t could have been int but wasn't for some reason, such as performance.

2) disable wakeup_fd entirely. It's obscure, GNOME being the biggest user I can think of.

3) unpack the int into an array of sig_atomic_t. Only the main thread writes to it, so this method is ugly but viable.
[issue1583] Patch for signal.set_wakeup_fd
Adam Olsen added the comment:

The fd field may be written from the main thread simultaneously with the signal handler activating and reading it out. Back in 2007 the only POSIX-compliant type allowed for that was sig_atomic_t; anything else was undefined. Looks like pycore_atomic.h should have alternatives now, but I'm not at all familiar with it.
[issue10046] Correction to atexit documentation
Adam Olsen <rha...@gmail.com> added the comment:

Signals can directly kill a process. Try SIGTERM to see this. SIGINT is caught and handled by Python, which just happens to default to a graceful exit (unless stuck in a lib that prevents that.) Try pasting your script into an interactive interpreter session and you'll see that it doesn't exit at all.
[issue1441] Cycles through ob_type aren't freed
Adam Olsen <rha...@gmail.com> added the comment:

As far as I know.
[issue1736792] dict reentrant/threading request
Adam Olsen <rha...@gmail.com> added the comment:

I don't believe there's anything to debate on this, so all it really needs is a patch, followed by getting someone to review and commit it.
[issue6643] Throw away more radioactive locks that could be held across a fork in threading.py
Adam Olsen <rha...@gmail.com> added the comment:

I don't have any direct opinions on this, as it is just a bandaid. fork, as defined by POSIX, doesn't allow what we do with it, so we're reliant on a great deal of OS and library implementation details. The only portable and robust solution would be to replace it with a unified fork-and-exec API that's implemented directly in C.
[issue9200] str.isprintable() is always False for large code points
Adam Olsen <rha...@gmail.com> added the comment:

There should be a way to walk the unicode string in Python too. Afaik there isn't.

nosy: +Rhamphoryncus
[issue9198] Should repr() print unicode characters outside the BMP?
Changes by Adam Olsen <rha...@gmail.com>:

nosy: +Rhamphoryncus
[issue8188] Unified hash for numeric types.
Adam Olsen <rha...@gmail.com> added the comment:

Why aren't you using 64-bit hashes on 64-bit architectures?

nosy: +Rhamphoryncus
[issue8188] Unified hash for numeric types.
Adam Olsen <rha...@gmail.com> added the comment:

I assume you mean 63. ;)
[issue7784] patch for making list/insert at the top of the list avoid memmoves
Adam Olsen <rha...@gmail.com> added the comment:

$ ./python -m timeit -s 'from collections import deque; c = deque(range(1000000))' 'c.append(c.popleft())'
1000000 loops, best of 3: 0.29 usec per loop
$ ./python -m timeit -s 'c = range(1000000)' 'c.append(c.pop(0))'
1000000 loops, best of 3: 0.424 usec per loop

Using flox's issue7784_listobject_perf.diff. Significantly slower, but it does scale linearly.

$ ./python -m timeit -s 'c = range(1000000)' 'c.insert(0, c.pop())'
100 loops, best of 3: 3.39 msec per loop

Unfortunately inserting does not. Will future patches attempt to address this? Note that, if it ends up slower than list and slower than deque, there isn't really a use case for it.

nosy: +Rhamphoryncus
[issue1943] improved allocation of PyUnicode objects
Adam Olsen <rha...@gmail.com> added the comment:

On Sun, Jan 10, 2010 at 14:59, Marc-Andre Lemburg <rep...@bugs.python.org> wrote:
> BTW, I'm not aware of any changes to the PyUnicodeObject by some
> fastsearch implementation. Could you point me to this ?

/* We allocate one more byte to make sure the string is Ux0000
   terminated. The overallocation is also used by fastsearch, which
   assumes that it's safe to look at str[length] (without making any
   assumptions about what it contains). */
[issue1943] improved allocation of PyUnicode objects
Adam Olsen <rha...@gmail.com> added the comment:

Points against the subclassing argument:

* We have a null-termination invariant. For byte strings this was part of the public API, and I'm not sure that's changed for unicode strings; aren't you arguing that we should maximize how much of our implementation is a public API? This prevents lazy slicing.

* UTF-16 and UTF-32 are rarely used encodings, especially for longer strings (ie files). For shorter strings (APIs) the unicode object overhead is more significant, and we'd need a way to slave the buffer's lifetime to that of the unicode object (hard to do). For longer strings UTF-8 would be much more useful, but that's been shot down before.

* Subclassing unicode so you can change the meaning of the fields (ie allocating your own buffer) is a gross hack. It relies far too much on fine details of the implementation and is fragile (what if you miss the dummy byte needed by fastsearch?) Most of the possible options could be, if they function correctly, applied directly to the base type as a patch, so it's moot.

* If you dislike PyVarObject in general (I think the API is ugly too) you should argue for a general policy discouraging future use of it, not just get in the way of the one place where it's most appropriate.

Terry: PyVarObjects would be much easier to subclass if the type object stored an offset to the beginning of the variable section, so it could be automatically recalculated for subclasses based on the size of the struct. This'd mean the PyBytesObject struct would no longer end with a char ob_sval[1]. The down side is a tiny bit more math when accessing the variable section (as the offset is no longer constant).

nosy: +Rhamphoryncus
[issue1975] signals not always delivered to main thread, since other threads have the signal unmasked
Adam Olsen <rha...@gmail.com> added the comment:

The real, OS signal does not get propagated to the main thread. Only the python-level signal handler runs from the main thread.

Correctly written programs are supposed to let select block indefinitely. This allows them to have exactly 0 CPU usage, especially important on laptops and other limited power devices.
[issue1975] signals not always delivered to main thread, since other threads have the signal unmasked
Adam Olsen <rha...@gmail.com> added the comment:

You forget that the original report is about ctrl-C. Should we abandon support of it for threaded programs? Close as won't-fix?

We could also just block SIGINT, but why? That means we don't support python signal handlers in threaded programs (signals sent to the process, not ones sent direct to threads), and IMO threads expecting a specific signal should explicitly unblock it anyway.
[issue1975] signals not always delivered to main thread, since other threads have the signal unmasked
Adam Olsen <rha...@gmail.com> added the comment:

A better solution would be to block all signals by default, then unblock specific ones you expect. This avoids races (as undeliverable signals are simply deferred.)

Note that readline is not threadsafe anyway, so it doesn't necessarily need to allow calls from the non-main thread. Maybe somebody is using it that way, dunno.
[issue3999] Real segmentation fault handler
Adam Olsen <rha...@gmail.com> added the comment:

That's fine, but please provide a link to the new issue once you create it.
[issue1722344] Thread shutdown exception in Thread.notify()
Adam Olsen <rha...@gmail.com> added the comment:

Nope, no access.
[issue5127] UnicodeEncodeError - I can't even see license
Adam Olsen <rha...@gmail.com> added the comment:

On Mon, Oct 5, 2009 at 03:03, Marc-Andre Lemburg <rep...@bugs.python.org> wrote:
> We use UCS2 on narrow Python builds, not UTF-16.
>
>> We might keep the old public API for compatibility, but it should be
>> clearly marked as broken for non-BMP scalar values.
>
> That has always been the case. UCS2 doesn't support surrogates.
>
> However, we have been slowly moving into the direction of making the
> UCS2 storage appear like UTF-16 to the Python programmer. This process
> is not yet complete and will likely never complete, since it must
> still be possible to create things like lone surrogates for processing
> purposes, so care has to be taken when using non-BMP code points on
> narrow builds.

Balderdash. We expose UTF-16 code units, not UCS-2. Guido has made this quite clear.

UTF-16 was designed as an easy transition from UCS-2. Indeed, if your code only does searches or joins existing strings then it will Just Work; declare it UTF-16 and you are done. We have a lot more work to do than that (as in this bug report), and we can't reasonably prevent the user from splitting surrogate pairs via poor code, but a 95% solution doesn't mean we suddenly revert all the way back to UCS-2.

If the intent really was to use UCS-2 then a correctly functioning UTF-16 codec would join a surrogate pair into a single scalar value, then raise an error because it's outside the range representable in UCS-2. That's not very helpful though; obviously, it's much better to use UTF-16 internally.

    "The alternative (no matter what the configure flag is called) is
    UTF-16, not UCS-2 though: there is support for surrogate pairs in
    various places, including the \U escape and the UTF-8 codec."
    http://mail.python.org/pipermail/python-dev/2008-July/080892.html

    "If you find places where the Python core or standard library is
    doing Unicode processing that would break when surrogates are
    present you should file a bug. However this does not mean that every
    bit of code that slices a string at an arbitrary point (and hence
    risks slicing in the middle of a surrogate) is incorrect -- it all
    depends on what is done next with the slice."
    http://mail.python.org/pipermail/python-dev/2008-July/080900.html
[issue5127] UnicodeEncodeError - I can't even see license
Adam Olsen <rha...@gmail.com> added the comment:

On Mon, Oct 5, 2009 at 12:10, Marc-Andre Lemburg <rep...@bugs.python.org> wrote:
> All this is just nitpicking, really. UCS2 is a character set, UTF-16
> an encoding.

UCS is a character set, for most purposes synonymous with the Unicode character set. UCS-2 and UTF-16 are both encodings of that character set. However, UCS-2 can only represent the BMP, while UTF-16 can represent the full range.

> If we were to implement Unicode using UTF-16 as storage format, we
> would not be able to store single lone surrogates, since these are not
> allowed in UTF-16. Ditto for unassigned ordinals, invalid code points,
> etc.

No. Internal usage may become temporarily ill-formed, but this is a compromise, and acceptable so long as we never export them to other systems. Not that I wouldn't *prefer* a system that wouldn't store lone surrogates, but... pragmatics prevail.

> Note that I wrote the PEP and worked on the implementation at a time
> when Unicode 2.x was still in wide-spread use (mostly on Windows) and
> 3.0 was just being released:
> http://www.unicode.org/history/publicationdates.html

I think you hit the nail on the head there. 10 years ago, unicode meant something different than it does today. That's reflected in PEP 100 and in the code. Now it's time to move on, switch to the modern terminology, modern usage, and modern specs.

> But all that is off-topic for this ticket, so please let's just stop
> such discussions.

It needs to be discussed somewhere. It's a distraction from fixing the bug, but at least it's more private here. Would you prefer email?
[issue5127] UnicodeEncodeError - I can't even see license
Adam Olsen <rha...@gmail.com> added the comment:

Surrogates aren't optional features of UTF-16, so we really need to get this fixed. That includes .isalpha().

We might keep the old public API for compatibility, but it should be clearly marked as broken for non-BMP scalar values.

I don't see a problem with changing 2.x. The existing behaviour is broken for non-BMP scalar values, so surely nobody can claim dependence on it.

nosy: +Rhamphoryncus
type: -> behavior
[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created
Adam Olsen <rha...@gmail.com> added the comment:

Patch, which uses UTF-32-BE as indicated in my last comment. Test included.

keywords: +patch
Added file: http://bugs.python.org/file15043/py3k-nonBMP-literal.diff
[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created
Adam Olsen <rha...@gmail.com> added the comment:

With some further prodding I've noticed that although the test behaves as expected in the py3k branch (fails on UTF-32 builds before the patch), it doesn't fail using python 3.0. I'm guessing there's interactions with compile() vs import and the issue 3672 fix. Still good enough though, IMO.
[issue7045] utf-8 encoding error
Adam Olsen <rha...@gmail.com> added the comment:

I believe this is a duplicate of issue #3297. When given a high unicode scalar value directly in the source (rather than in escaped form) python will split it into surrogates, even on a UTF-32 build where those surrogates are nonsensical and ill-formed. Patches for issue #3672 probably made this more visible.

nosy: +Rhamphoryncus
[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created
Adam Olsen <rha...@gmail.com> added the comment:

Looks like the failure mode has changed here, presumably due to issue #3672 patches. It now always fails, even after loading from a .pyc. This is using py3k via bzr, which reports itself as 3.2a0.

$ rm unicodetest.pyc
$ ./python -c 'import unicodetest'
Result: False
Len: 2 1
Repr: '\ud800\udd23' '\U00010123'
[28877 refs]
$ ./python -c 'import unicodetest'
Result: False
Len: 2 1
Repr: '\ud800\udd23' '\U00010123'
[28708 refs]

versions: +Python 2.7, Python 3.1, Python 3.2
[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created
Adam Olsen <rha...@gmail.com> added the comment:

I've traced down the biggest problem to decode_unicode in ast.c. It needs to convert everything into a form of escapes so it becomes pure ascii, which is then evaluated back into a unicode object. Unfortunately, it uses UTF-16-BE to do so, which always splits surrogates. Switching it to UTF-32-BE is fairly straightforward, and works even on UTF-16 (or narrow) builds. Incidentally, there's no point using the surrogatepass error handler once we actually support surrogates.

Unfortunately there's a second problem in repr(). '\U0001010F'.isprintable() returns True on UTF-32 builds and False on UTF-16 builds. This causes repr() to escape it unnecessarily on UTF-16 builds. repr() at least joins surrogate pairs before its internal printable test (unlike .isprintable() or any other str method), but it turns out all of the APIs in unicodectype.c only accept a single 16-bit int in UTF-16 builds anyway. That'll be a bigger patch than the first part.
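To illustrate why the codec choice matters (my example, not from the patch): UTF-16-BE can only represent a non-BMP character as a surrogate pair, while UTF-32-BE stores the scalar value directly, so round-tripping through it never splits anything. In any Python 3:

    ch = '\U00010123'                    # a non-BMP character
    print(ch.encode('utf-16-be').hex())  # d800dd23 -- a surrogate pair
    print(ch.encode('utf-32-be').hex())  # 00010123 -- the scalar itself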
[issue992389] attribute error after non-from import
Adam Olsen <rha...@gmail.com> added the comment:

The key distinction between this and a bad circular import is that this is lazy. You may list the import at the top of your module, but you never touch it until after you've finished importing yourself (and they feel the same about you.)

An ugly fix could be done today for module imports by creating a proxy that triggers the import upon the first attribute access (see the sketch below). A more general solution could be done with a lazyimport statement, triggered when the target module finishes importing; only problem there is the confusing error messages and other oddities if you reassign that name.

nosy: +Rhamphoryncus
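A minimal sketch of such a proxy, using today's importlib (illustrative only; error handling, reassignment, and other corner cases omitted):

    import importlib

    class LazyModule:
        def __init__(self, name):
            self.__dict__['_name'] = name
            self.__dict__['_module'] = None

        def __getattr__(self, attr):
            # Called only when normal lookup fails, i.e. for everything
            # except _name/_module; the real import happens on first use.
            if self.__dict__['_module'] is None:
                self.__dict__['_module'] = importlib.import_module(self.__dict__['_name'])
            return getattr(self.__dict__['_module'], attr)

    json = LazyModule('json')   # nothing imported yet
    print(json.dumps([1, 2]))   # the import is triggered here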
[issue992389] attribute error after non-from import
Adam Olsen <rha...@gmail.com> added the comment:

It'd probably be sufficient if we raised NameError: lazy import 'foo' not yet complete. That would require a set of the names this module is lazily importing, checked in the failure paths of module attribute lookup and global/builtin lookup.
[issue6326] Add a swap method to list
Adam Olsen <rha...@gmail.com> added the comment:

Fix it at its source: patch your database engine to use the type you want. Or wrap the list without subclassing (__iter__ may be the only method you need to wrap). Obscure performance hacks don't warrant language extensions.

nosy: +Rhamphoryncus
Re: Adding a Par construct to Python?
On May 19, 5:05 am, jer...@martinfamily.freeserve.co.uk wrote:
> Thanks for explaining a few things to me. So it would seem that
> replacing the GIL with something which allows better scalability of
> multi-threaded applications would be very complicated. The paper by
> Jesse Noller which I referenced in my original posting includes the
> following:
>
> "In 1999 Greg Stein created a patch set for the interpreter that
> removed the GIL, but added granular locking around sensitive
> interpreter operations. This patch set had the direct effect of
> speeding up threaded execution, but made single threaded execution two
> times slower."
>
> Source: http://jessenoller.com/2009/02/01/python-threads-and-the-global-inter...
>
> That was ten years ago - do you have any idea as to how things have
> been progressing in this area since then?

https://launchpad.net/python-safethread
Re: binary file compare...
On Apr 17, 5:30 am, Tim Wintle <tim.win...@teamrubber.com> wrote:
> On Thu, 2009-04-16 at 21:44 -0700, Adam Olsen wrote:
>> The Wayback Machine has 150 billion pages, so 2**37. Google's index
>> is a bit larger at over a trillion pages, so 2**40. A little closer
>> than I'd like, but that's still 5.6294995e14 to 1 odds of having
>> *any* collisions between *any* of the files. Step up to SHA-256 and
>> it becomes 1.9156194e53 to 1. Sadly, I can't even give you the odds
>> for SHA-512, Qalculate considers that too close to infinite to
>> display. :)
>
> That might be true as long as your data is completely uniformly
> distributed. For the example you give there's:
>
> a) a high chance that there's html near the top
>
> b) a non-uniform distribution of individual words within the text.
>
> c) a non-uniform distribution of all n-grams within the text (as there
> is in natural language)
>
> So it's very far from uniformly distributed. Just about the only
> situation where I could imagine that holding would be where you are
> hashing uniformly random data for the sake of testing the hash.
>
> I believe the point being made is that comparing hash values is a
> probabilistic algorithm anyway, which is fine if you're ok with that,
> but for mission critical software it's crazy.

Actually, *cryptographic* hashes handle that just fine. Even for files with just a 1 bit change the output is totally different. This is known as the Avalanche Effect. Otherwise they'd be vulnerable to attacks.

Which isn't to say you couldn't *construct* a pattern that it would be vulnerable to. Figuring that out is pretty much the whole point of attacking a cryptographic hash. MD5 has significant vulnerabilities by now, and others will in the future. That's just a risk you need to manage.
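The avalanche effect is easy to see for yourself; a quick demo (mine, not from the thread):

    import hashlib

    a = b'\x00' * 16            # two inputs differing
    b = b'\x01' + b'\x00' * 15  # by a single bit...

    # ...yield completely unrelated digests
    print(hashlib.sha256(a).hexdigest())
    print(hashlib.sha256(b).hexdigest())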
Re: binary file compare...
On Apr 17, 9:59 am, norseman <norse...@hughes.net> wrote:
> The more complicated the math the harder it is to keep a higher form
> of math from checking (or improperly displacing) a lower one. Which,
> of course, breaks the rules. Commonly called improper thinking. A
> number of math teasers make use of that.

Of course, designing a hash is hard. That's why the *recommended* ones get so many years of peer review and attempted attacks first.

I'd love it if Nigel provided evidence that MD5 was broken, I really would. It'd be quite interesting to investigate, assuming malicious content can be ruled out. Of course even he doesn't think that. He claims that his 42 trillion trillion to 1 odds happened not just once, but multiple times.
Re: binary file compare...
On Apr 17, 9:59 am, SpreadTooThin <bjobrie...@gmail.com> wrote:
> You know this is just insane. I'd be satisfied with a CRC16 or
> something in the situation i'm in. I have two large files, one local
> and one remote. Transferring every byte across the internet to be sure
> that the two files are identical is just not feasible. If two servers,
> one on one side and the other on the other side, both calculate the
> CRCs and transmit the CRCs for comparison I'm happy.

Definitely use a hash, ignore Nigel. SHA-256 or SHA-512. Or, if you might need to update one of the files, look at rsync. Rsync still uses MD4 and MD5 (optionally!), but they're fine in a trusted environment.
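A sketch of the remote-comparison idea (my illustration): each server runs something like this locally, and only the short hex digest crosses the network.

    import hashlib

    def file_digest(path, chunk_size=1 << 20):
        # SHA-256 of a file, read in chunks so large files fit in memory.
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                h.update(chunk)
        return h.hexdigest()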
Re: binary file compare...
On Apr 15, 12:56 pm, Nigel Rantor <wig...@wiggly.org> wrote:
> Adam Olsen wrote:
>> The chance of *accidentally* producing a collision, although
>> technically possible, is so extraordinarily rare that it's completely
>> overshadowed by the risk of a hardware or software failure producing
>> an incorrect result.
>
> Not when you're using them to compare lots of files.
>
> Trust me. Been there, done that, got the t-shirt.
>
> Using hash functions to tell whether or not files are identical is an
> error waiting to happen. But please, do so if it makes you feel happy,
> you'll just eventually get an incorrect result and not know it.

Please tell us what hash you used and provide the two files that collided.

If your hash is 256 bits, then you need around 2**128 files to produce a collision. This is known as a Birthday Attack. I seriously doubt you had that many files, which suggests something else went wrong.
Re: binary file compare...
On Apr 16, 3:16 am, Nigel Rantor <wig...@wiggly.org> wrote:
> Adam Olsen wrote:
>> On Apr 15, 12:56 pm, Nigel Rantor <wig...@wiggly.org> wrote:
>>> Adam Olsen wrote:
>>>> The chance of *accidentally* producing a collision, although
>>>> technically possible, is so extraordinarily rare that it's
>>>> completely overshadowed by the risk of a hardware or software
>>>> failure producing an incorrect result.
>>>
>>> Not when you're using them to compare lots of files.
>>>
>>> Trust me. Been there, done that, got the t-shirt.
>>>
>>> Using hash functions to tell whether or not files are identical is
>>> an error waiting to happen. But please, do so if it makes you feel
>>> happy, you'll just eventually get an incorrect result and not know
>>> it.
>>
>> Please tell us what hash you used and provide the two files that
>> collided.
>
> MD5
>
>> If your hash is 256 bits, then you need around 2**128 files to
>> produce a collision. This is known as a Birthday Attack. I seriously
>> doubt you had that many files, which suggests something else went
>> wrong.
>
> Okay, before I tell you about the empirical, real-world evidence I
> have, could you please accept that hashes collide and that no matter
> how many samples you use the probability of finding two files that do
> collide is small but not zero.

I'm afraid you will need to back up your claims with real files. Although MD5 is a smaller, older hash (128 bits, so you only need 2**64 files to find collisions), and it has substantial known vulnerabilities, the scenario you suggest where you *accidentally* find collisions (and you imply multiple collisions!) would be a rather significant finding. Please help us all by justifying your claim.

Mind you, since you use MD5 I wouldn't be surprised if your files were maliciously produced. As I said before, you need to consider upgrading your hash every few years to avoid new attacks.
Re: binary file compare...
On Apr 16, 8:59 am, Grant Edwards <inva...@invalid> wrote:
> On 2009-04-16, Adam Olsen <rha...@gmail.com> wrote:
>> I'm afraid you will need to back up your claims with real files.
>> Although MD5 is a smaller, older hash (128 bits, so you only need
>> 2**64 files to find collisions),
>
> You don't need quite that many to have a significant chance of a
> collision. With only something on the order of 2**61 files, you still
> have about a 1% chance of a collision.

Aye, 2**64 is more of the middle of the curve or so. You can still go either way. What's important is the order of magnitude required.

> For a few million files (we'll say 4e6), the probability of a
> collision is so close to 0 that it can't be calculated using
> double-precision IEEE floats.

≈ 2.3509887e-26. Or 4.2535296e25 to 1. Or 42 trillion trillion to 1.

> Here's the Python function I'm using:
>
> def bp(n, d):
>     return 1.0 - exp(-n*(n-1.)/(2.*d))
>
> I haven't spent much time studying the numerical issues of the way
> that the exponent is calculated, so I'm not entirely confident in the
> results for small n values such that p(n) == 0.0.

Try using Qalculate. I always resort to it for things like this.
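For what it's worth, plain doubles can handle this if the cancellation in 1 - exp(...) is avoided; a variant of Grant's function using math.expm1 (my suggestion, not from the thread):

    from math import expm1

    def bp(n, d):
        # Birthday bound: chance of any collision among n values drawn
        # uniformly from d possibilities; -expm1(-x) is an accurate
        # 1 - exp(-x) for tiny x.
        return -expm1(-n * (n - 1.0) / (2.0 * d))

    p = bp(4e6, 2**128)
    print(p)       # ~2.35e-26
    print(1 / p)   # ~4.25e25, i.e. 42 trillion trillion to 1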
Re: binary file compare...
On Apr 16, 11:15 am, SpreadTooThin <bjobrie...@gmail.com> wrote:
> And yes he is right, CRCs hashing all have a probability of saying
> that the files are identical when in fact they are not.

Here's the bottom line. It is either:

A) Several hundred years of mathematics and cryptography are wrong. The birthday problem as described is incorrect, so a collision is far more likely than 42 trillion trillion to 1. You are simply the first person to have noticed it.

B) Your software was buggy, or possibly the input was maliciously produced. Or, a really tiny chance that your particular files contained a pattern that provoked bad behaviour from MD5.

Finding a specific limitation of the algorithm is one thing. Claiming that the math is fundamentally wrong is quite another.
Re: binary file compare...
On Apr 16, 4:27 pm, Rhodri James <rho...@wildebst.demon.co.uk> wrote:
> On Thu, 16 Apr 2009 10:44:06 +0100, Adam Olsen <rha...@gmail.com> wrote:
>> On Apr 16, 3:16 am, Nigel Rantor <wig...@wiggly.org> wrote:
>>> Okay, before I tell you about the empirical, real-world evidence I
>>> have, could you please accept that hashes collide and that no matter
>>> how many samples you use the probability of finding two files that
>>> do collide is small but not zero.
>>
>> I'm afraid you will need to back up your claims with real files.
>
> So that would be a no then. If the implementation of dicts in Python,
> say, were to assert as you are that the hashes aren't going to
> collide, then I'd have to walk away from it. There's no point in using
> something that guarantees a non-zero chance of corrupting your data.

Python's hash is only 32 bits on a 32-bit box, so even 2**16 keys (or 65 thousand) will give you a decent chance of a collision. In contrast MD5 needs 2**64, and a *good* hash needs 2**128 (SHA-256) or 2**256 (SHA-512). The two are at totally different extremes.

There is *always* a non-zero chance of corruption, due to software bugs, hardware defects, or even operator error. It is only in that broader context that you can realize just how minuscule the risk is. Can you explain to me why you justify great lengths of paranoia, when the risk is so much lower?

> Why are you advocating a solution to the OP's problem that is more
> computationally expensive than a simple byte-by-byte comparison and
> doesn't guarantee to give the correct answer?

For a single, one-off comparison I have no problem with a byte-by-byte comparison. There's a decent chance the files won't be in the OS's cache anyway, so disk IO will be your bottleneck. Only if you're doing multiple comparisons is a hash database justified. Even then, if you expect matching files to be fairly rare I won't lose any sleep if you're paranoid and do a byte-by-byte comparison anyway.

New vulnerabilities are found, and if you don't update promptly there is a small (but significant) chance of a malicious file leading to collision. That's not my concern though. What I'm responding to is Nigel Rantor's grossly incorrect statements about probability. The chance of collision, in our life time, is *insignificant*.

The Wayback Machine has 150 billion pages, so 2**37. Google's index is a bit larger at over a trillion pages, so 2**40. A little closer than I'd like, but that's still 5.6294995e14 to 1 odds of having *any* collisions between *any* of the files. Step up to SHA-256 and it becomes 1.9156194e53 to 1. Sadly, I can't even give you the odds for SHA-512, Qalculate considers that too close to infinite to display. :)

You should worry more about your head spontaneously exploding than you should about a hash collision on that scale. To do otherwise is irrational paranoia.
Re: binary file compare...
On Apr 15, 11:04 am, Nigel Rantor <wig...@wiggly.org> wrote:
> The fact that two md5 hashes are equal does not mean that the sources
> they were generated from are equal. To do that you must still perform
> a byte-by-byte comparison which is much less work for the processor
> than generating an md5 or sha hash.
>
> If you insist on using a hashing algorithm to determine the
> equivalence of two files you will eventually realise that it is a
> flawed plan because you will eventually find two files with different
> contents that nonetheless hash to the same value. The more files you
> test with the quicker you will find out this basic truth.
>
> This is not complex, it's a simple fact about how hashing algorithms
> work.

The only flaw in a cryptographic hash is the increasing number of attacks that are found on it. You need to pick a trusted one when you start and consider replacing it every few years.

The chance of *accidentally* producing a collision, although technically possible, is so extraordinarily rare that it's completely overshadowed by the risk of a hardware or software failure producing an incorrect result.
Re: binary file compare...
On Apr 13, 8:39 pm, Grant Edwards <gra...@visi.com> wrote:
> On 2009-04-13, Peter Otten <__pete...@web.de> wrote:
>> But there's a cache. A change of file contents may go undetected as
>> long as the file stats don't change:
>
> Good point. You can fool it if you force the stats to their old values
> after you modify a file and you don't clear the cache.

The timestamps stored on the filesystem (for ext3 and most other filesystems) are fairly coarse, so it's quite possible for a check/update/check sequence to have the same timestamp at the beginning and end.
Re: Returning different types based on input parameters
On Apr 8, 8:09 am, George Sakkis <george.sak...@gmail.com> wrote:
> On Apr 7, 3:18 pm, Adam Olsen <rha...@gmail.com> wrote:
>> On Apr 6, 3:02 pm, George Sakkis <george.sak...@gmail.com> wrote:
>>> For example, it is common for a function f(x) to expect x to be
>>> simply iterable, without caring of its exact type. Is it ok though
>>> for f to return a list for some types/values of x, a tuple for
>>> others and a generator for everything else (assuming it's
>>> documented), or should it always return the most general (iterator
>>> in this example)?
>>
>> For list/tuple/iterable the correlation with the argument's type is
>> purely superficial, *because* they're so compatible. Why should only
>> tuples and lists get special behaviour? Why shouldn't every other
>> argument type return a list as well?
>
> That's easy; because the result might be infinite. In which case you
> may ask why shouldn't every argument type return an iterator then, and
> the reason is usually performance; if you already need to store the
> whole result sequence (e.g. sorted()), why return just an iterator to
> it and force the client to copy it to another list if he needs
> anything more than iterating once over it?

You've got two different use cases here. sorted() clearly cannot be infinite, so it might as well always return a list. Other functions that can be infinite should always return an iterator.

>> A counter example is python 3.0's str/bytes functions. They're
>> mutually incompatible and there's no default.
>
> As already mentioned, another example is filter() that tries to match
> the input sequence type and falls back to list if it fails.

That's fixed in 3.0. It's always an iterator now.

>>> To take it further, what if f wants to return different types,
>>> differing even in a duck-type sense?
>>
>> At a minimum it's highly undesirable. You lose a lot of readability/
>> maintainability. solve2/solve_ex is a little ugly, but that's less
>> overall, so it's the better option.
>
> That's my feeling too, at least in a dynamic language. For a static
> language that allows overloading, that should be a smaller (or perhaps
> no) issue.

Standard practices may encourage it in a static language, but it's still fairly confusing. Personally, I consider python's switch to a different operator for floor division (//) to be a major step forward over C-like languages.
Re: Returning different types based on input parameters
On Apr 6, 3:02 pm, George Sakkis <george.sak...@gmail.com> wrote:
> For example, it is common for a function f(x) to expect x to be simply
> iterable, without caring of its exact type. Is it ok though for f to
> return a list for some types/values of x, a tuple for others and a
> generator for everything else (assuming it's documented), or should it
> always return the most general (iterator in this example)?

For list/tuple/iterable the correlation with the argument's type is purely superficial, *because* they're so compatible. Why should only tuples and lists get special behaviour? Why shouldn't every other argument type return a list as well?

A counter example is python 3.0's str/bytes functions. They're mutually incompatible and there's no default.

> To take it further, what if f wants to return different types,
> differing even in a duck-type sense? That's easier to illustrate in an
> API-extension scenario. Say that there is an existing function
> `solve(x)` that returns `Result` instances. Later someone wants to
> extend f by allowing an extra optional parameter `foo`, making the
> signature `solve(x, foo=None)`. As long as the return value remains
> backward compatible, everything's fine. However, what if in the
> extended case, solve() has to return some *additional* information
> apart from `Result`, say the confidence that the result is correct?
> In short, the extended API would be:
>
> def solve(x, foo=None):
>     '''
>     @rtype: `Result` if foo is None; (`Result`, confidence) otherwise.
>     '''
>
> Strictly speaking, the extension is backwards compatible; previous
> code that used `solve(x)` will still get back `Result`s. The problem
> is that in new code you can't tell what `solve(x,y)` returns unless
> you know something about `y`. My question is, is this totally
> unacceptable and should better be replaced by a new function
> `solve2(x, foo=None)` that always returns (`Result`, confidence)
> tuples, or might it be a justifiable cost? Any other API extension
> approaches that are applicable to such situations?

At a minimum it's highly undesirable. You lose a lot of readability/maintainability. solve2/solve_ex is a little ugly, but that's less overall, so it's the better option.

If your tuple gets to 3 or more I'd start wondering if you should return a single instance, with the return values as attributes. If Result is already such a thing I'd look even with a tuple of 2 to see if that's appropriate.
[issue1683908] PEP 361 Warnings
Adam Olsen <rha...@gmail.com> added the comment:

Aye. 2.6 has come and gone, with most or all warnings applied using (I believe) a different patch. If any future work is needed it can get a new ticket.

status: open -> closed
[issue5564] os.symlink/os.link docs should say old/new, not src/dst
New submission from Adam Olsen <rha...@gmail.com>:

"destination" is ambiguous. It means opposite things, depending on if it's the symlink creation operation or if it's the symlink itself. In contrast, "old" is clearly what existed before the operation, and "new" is what the operation creates. This terminology is already in use by os.rename.

assignee: georg.brandl
components: Documentation
messages: 84171
nosy: Rhamphoryncus, georg.brandl
severity: normal
status: open
title: os.symlink/os.link docs should say old/new, not src/dst
Re: removing duplication from a huge list.
On Feb 27, 9:55 am, Falcolas <garri...@gmail.com> wrote:
> If order did matter, and the list itself couldn't be stored in memory,
> I would personally do some sort of hash of each item (or something as
> simple as first 5 bytes, last 5 bytes and length), keeping a reference
> to which item the hash belongs, sort and identify duplicates in the
> hash, and using the reference check to see if the actual items in
> question match as well.
>
> Pretty brutish and slow, but it's the first algorithm which comes to
> mind. Of course, I'm assuming that the list items are long enough to
> warrant using a hash and not the values themselves.

Might as well move all the duplication checking to sqlite. Although it seems tempting to stick a layer in front, you will always require either a full comparison or a full update, so there's no potential for a fast path.
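A sketch of what "move it to sqlite" could look like, with a UNIQUE index doing the duplicate detection (my illustration, not from the thread):

    import sqlite3

    conn = sqlite3.connect(':memory:')  # or a file, for lists bigger than RAM
    conn.execute('CREATE TABLE items (value TEXT UNIQUE)')

    def add_if_new(value):
        # True if value was new; False if the index rejected a duplicate.
        try:
            with conn:
                conn.execute('INSERT INTO items (value) VALUES (?)', (value,))
            return True
        except sqlite3.IntegrityError:
            return False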
[issue1975] signals not always delivered to main thread, since other threads have the signal unmasked
Adam Olsen <rha...@gmail.com> added the comment:

issue 960406 broke this as part of a fix for readline. I believe that was motivated by fixing ctrl-C in the main thread, but non-main threads were thrown in as a "why not" measure. msg 46078 is the mention of this. You can go into readlingsigs7.patch and search for SET_THREAD_SIGMASK.
[issue1975] signals not always delivered to main thread, since other threads have the signal unmasked
Adam Olsen <rha...@gmail.com> added the comment:

The readline API just sucks. It's not at all designed to be used simultaneously from multiple threads, so we shouldn't even try. Ban using it in non-main threads, restore the blocking of signals, and go on with our merry lives.

nosy: +Rhamphoryncus
[issue1975] signals not always delivered to main thread, since other threads have the signal unmasked
Changes by Adam Olsen <rha...@gmail.com>:

versions: +Python 2.6, Python 2.7, Python 3.0, Python 3.1
Re: more on unescaping escapes
On Feb 23, 7:18 pm, bvdp <b...@mellowood.ca> wrote:
> Gabriel Genellina wrote:
>> En Mon, 23 Feb 2009 23:31:20 -0200, bvdp <b...@mellowood.ca> escribió:
>>> Gabriel Genellina wrote:
>>>> En Mon, 23 Feb 2009 22:46:34 -0200, bvdp <b...@mellowood.ca> escribió:
>>>>> Chris Rebert wrote:
>>>>>> On Mon, Feb 23, 2009 at 4:26 PM, bvdp <b...@mellowood.ca> wrote:
>>>>>> [problem with Python and Windows paths using backslashes]
>>>>>>
>>>>>> Is there any particular reason you can't just internally use
>>>>>> regular forward-slashes for the paths?
>>>>>
>>>>> [...] you are absolutely right! Just use '/' on both systems and be
>>>>> done with it. Of course I still need to use \x20 for spaces, but
>>>>> that is easy.
>>>>
>>>> Why is that? "\x20" is exactly the same as " ". It's not like %20 in
>>>> URLs, that becomes a space only after decoding.
>>>
>>> I need to use the \x20 because of my parser. I'm reading unquoted
>>> lines from a file. The file creator needs to use the form "foo\x20bar"
>>> (without the quotes) in the file so my parser can read it as a single
>>> token. Later, the string/token needs to be decoded with the \x20
>>> converted to a space. So, in my file "foo bar" (no quotes) is read as
>>> 2 tokens; "foo\x20bar" is one. So, it's not really a problem of what
>>> happens when you assign a string in the form "foo bar", rather how to
>>> convert the \x20 in a string to a space. I think the \\ just
>>> complicates the entire issue.
>>
>> Just thinking, if you were reading the string from a file, why were
>> you worried about \\ and \ in the first place? (Ok, you moved to use /
>> so this is moot now).
>
> Just cruft introduced while I was trying to figure it all out. Having
> to figure the \\ and \x20 at the same time with file and keyboard
> input just confused the entire issue :)
>
> Having the user set a line like c:\\Program\x20File ... works just
> fine. I'll suggest he use c:/program\x20files to make it a bit simpler
> for HIM, not my parser. Unfortunately, due to some bad design decisions
> on my part about 5 years ago I'm afraid I'm stuck with the \x20.
> Thanks.

You're confusing the python source with the actual contents of the string. We already do one pass at decoding, which is why \x20 is quite literally no different from a space:

>>> '\x20'
' '

However, the interactive interpreter uses repr(x), so various characters that are considered formatting, such as a tab, get reescaped when printing:

>>> '\t'
'\t'
>>> len('\t')
1

It really is a tab that gets stored there, not the escape for one. Finally, if you give python an unknown escape it leaves it as an escape. Then, when the interactive interpreter uses repr(x), it is the backslash itself that gets reescaped:

>>> '\P'
'\\P'
>>> len('\P')
2
>>> list('\P')
['\\', 'P']

What does this all mean? If you want to test your parser with python literals you need to escape them twice, like so:

>>> 'c:\\\\Program\\x20Files\\\\test'
'c:\\\\Program\\x20Files\\\\test'
>>> list('c:\\\\Program\\x20Files\\\\test')
['c', ':', '\\', '\\', 'P', 'r', 'o', 'g', 'r', 'a', 'm', '\\', 'x', '2', '0', 'F', 'i', 'l', 'e', 's', '\\', '\\', 't', 'e', 's', 't']
>>> 'c:\\\\Program\\x20Files\\\\test'.decode('string-escape')
'c:\\Program Files\\test'
>>> list('c:\\\\Program\\x20Files\\\\test'.decode('string-escape'))
['c', ':', '\\', 'P', 'r', 'o', 'g', 'r', 'a', 'm', ' ', 'F', 'i', 'l', 'e', 's', '\\', 't', 'e', 's', 't']

However, there's an easier way: use raw strings, which prevent python from unescaping anything:

>>> r'c:\\Program\x20Files\\test'
'c:\\\\Program\\x20Files\\\\test'
>>> list(r'c:\\Program\x20Files\\test')
['c', ':', '\\', '\\', 'P', 'r', 'o', 'g', 'r', 'a', 'm', '\\', 'x', '2', '0', 'F', 'i', 'l', 'e', 's', '\\', '\\', 't', 'e', 's', 't']
Re: What encoding does u'...' syntax use?
On Feb 21, 10:48 am, a...@pythoncraft.com (Aahz) wrote:
> In article <499f397c.7030...@v.loewis.de>, "Martin v. Löwis" <mar...@v.loewis.de> wrote:
>>> Yes, I know that. But every concrete representation of a unicode
>>> string has to have an encoding associated with it, including unicode
>>> strings produced by the Python parser when it parses the ascii string
>>> u'\xb5'
>>>
>>> My question is: what is that encoding?
>>
>> The internal representation is either UTF-16, or UTF-32; which one is
>> a compile-time choice (i.e. when the Python interpreter is built).
>
> Wait, I thought it was UCS-2 or UCS-4? Or am I misremembering the
> countless threads about the distinction between UTF and UCS?

Nope, that's partly mislabeling and partly a bug. UCS-2/UCS-4 refer to Unicode 1.1 and earlier, with no surrogates. We target Unicode 5.1. If you naively encode UCS-2 as UTF-8 you really end up with CESU-8. You miss the step where you combine surrogate pairs (which only exist in UTF-16) into a single supplementary character. Lo and behold, that's actually what current python does in some places. It's not pretty. See bugs #3297 and #3672.
[issue5186] Reduce hash collisions for objects with no __hash__ method
Adam Olsen <rha...@gmail.com> added the comment:

Antoine, x ^= x >> 4 has a higher collision rate than just a rotate. However, it's still lower than a statistically random hash. If you modify the benchmark to randomly discard 90% of its contents this should give you random addresses, reflecting a long-running program. Here's the results I got (I used shift, too lazy to rotate):

XOR, sequential:      20.174627065692999
XOR, random:          30.460708379770004
shift, sequential:    19.148091554626003
shift, random:        30.495631933229998
original, sequential: 23.73646926877
original, random:     33.53617715837

Not massive, but still worth fixing the hash.
[issue5186] Reduce hash collisions for objects with no __hash__ method
Adam Olsen <rha...@gmail.com> added the comment:

The alignment requirements (long double) make it impossible to have anything in those bits. Hypothetically, a custom allocator could lower the alignment requirements to sizeof(void *). However, rotating to the high bits is pointless as they're the least likely to be used — impossible in this case, as only the 2 highest bits would contain anything, and for that you'd need a dictionary with at least 2 billion entries on 32bit, which is more than the 32bit address space. 64-bit is similar.

Note that mixing the bits back in, via XOR or similar, is actually more likely to hurt than help. It's just like ints and strings, whose hash values are very sequential: a simple shift tends to get us sequential hashes. This gives us a far lower collision rate than a statistically random hash.
[issue5186] Reduce hash collisions for objects with no __hash__ method
Adam Olsen <rha...@gmail.com> added the comment:

Antoine, I only meant list() and dict() to be an example of objects with a larger allocation pattern. We get a substantial benefit from the sequentially increasing memory addresses, and I wanted to make sure that benefit wasn't lost on larger allocations than object().

Mark, I concede the point about rotating; I believe the cost on x86 is the same regardless. Why are you still only rotating 3 bits? My results were better with 4 bits, and that should be the sweet spot for the typical use cases. Also, would the use of size_t make this code simpler? It should be the size of the pointer even on windows.
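For reference, the rotate under discussion amounts to something like this (a sketch only; the real patch is C, and the names here are illustrative):

    def pointer_hash(addr, bits=64, shift=4):
        # Rotate the address right by `shift` bits so the always-zero
        # alignment bits stop forcing hash collisions.
        mask = (1 << bits) - 1
        return ((addr >> shift) | (addr << (bits - shift))) & mask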
[issue5186] Reduce hash collisions for objects with no __hash__ method
Adam Olsen <rha...@gmail.com> added the comment:

> At four bits, you may be throwing away information and I don't think
> that's cool. Even if some selected timings are better with more bits
> shifted, all you're really showing is that there is more randomness in
> the upper bits than the lower ones. But that doesn't mean that the
> lower ones contribute nothing at all.

On the contrary, the expected collision rate for a half-full dictionary is about 21%, whereas I'm getting less than 5%. I'm taking advantage of the sequentiality of addresses, just as int and str hashes do for their values.

However, you're right that it's only one use case. Although creating a burst of objects for a throw-away set may itself be common, it's typically with int or str, and doing it with custom objects is presumably fairly rare; certainly not a good microbenchmark for the rest of the interpreter.

Creating a list of 10 objects, then shuffling and picking a few increases my collision rate back up to 21%. That should more accurately reflect a long-running program using custom objects as keys in a dict.

That said, I still prefer the simplicity of a rotate. Adding an arbitrary set of OR, XOR, or add makes me uneasy; I know enough to do them wrong (reduce entropy), but not enough to do them right.
[issue5186] Reduce hash collisions for objects with no __hash__ method
Adam Olsen <rha...@gmail.com> added the comment:

Testing with a large set of ids is a good demonstration, but not proof. Forming a set of *all* possible values within a certain range is proof. However, XOR does work (OR definitely does not) — it's a 1-to-1 transformation (reversible as you say.) Additionally, it still gives the unnaturally low collision rate when using sequential addresses, so there's no objection there.
[issue5186] Reduce hash collisions for objects with no __hash__ method
Adam Olsen rha...@gmail.com added the comment: On my 64-bit linux box there's nothing in the last 4 bits:

>>> [id(o) % 16 for o in [object() for i in range(128)]]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

And with slightly more complicated functions I can determine how much shift gives us the lowest collision rate:

>>> def a(size, shift):
...     return len(set((id(o) >> shift) % (size * 2)
...                    for o in [object() for i in range(size)]))
...
>>> def b(size):
...     return [a(size, shift) for shift in range(11)]
...
>>> def c():
...     for i in range(1, 9):
...         size = 2**i
...         x = ', '.join('% 3s' % count for count in b(size))
...         print('% 3s: %s' % (size, x))
...
>>> c()
  2:   1,   1,   1,   2,   2,   1,   1,   1,   2,   2,   2
  4:   1,   1,   2,   3,   4,   3,   2,   4,   4,   3,   2
  8:   1,   2,   4,   6,   6,   7,   8,   6,   4,   3,   2
 16:   2,   4,   7,   9,  12,  13,  12,   8,   5,   3,   2
 32:   4,   8,  14,  23,  30,  25,  19,  12,   7,   4,   2
 64:   8,  16,  32,  55,  64,  38,  22,  13,   8,   4,   2
128:  16,  32,  64, 114, 128,  71,  39,  22,  12,   6,   3
256:  32,  64, 128, 242, 242, 123,  71,  38,  20,  10,   5

The fifth column (ie 4 bits of shift, a divide of 16) works the best. Although it varies from run to run, probably more than half the results in that column have no collisions at all.

.. although, if I replace object() with list() I get best results with a shift of 6 bits. Replacing it with dict() is best with 8 bits. We may want something more complicated.

-- nosy: +Rhamphoryncus ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue5186 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue5186] Reduce hash collisions for objects with no __hash__ method
Adam Olsen rha...@gmail.com added the comment: Upon further inspection, although a shift of 4 (on a 64-bit linux box) isn't perfect for dict, it's fairly close to it and well beyond random hash values. Mixing things more is just gonna lower it towards random values.

>>> c()
  2:   1,   1,   1,   2,   2,   1,   1,   1,   1,   1,   2
  4:   1,   1,   2,   3,   4,   3,   3,   2,   2,   2,   3
  8:   1,   2,   4,   7,   8,   7,   5,   6,   7,   5,   5
 16:   2,   4,   7,  11,  16,  15,  12,  14,  15,   9,   7
 32:   3,   5,  10,  18,  31,  30,  30,  30,  31,  20,  12
 64:   8,  14,  23,  36,  47,  54,  59,  59,  61,  37,  21
128:  16,  32,  58,  83, 118, 100, 110, 114, 126,  73,  41
256:  32,  64, 128, 195, 227, 197, 203, 240, 253, 150,  78

___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue5186 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue3959] Add Google's ipaddr.py to the stdlib
Changes by Adam Olsen rha...@gmail.com: -- nosy: +Rhamphoryncus ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue3959 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4074] Building a list of tuples has non-linear performance
Adam Olsen rha...@gmail.com added the comment: I didn't test it, but the patch looks okay to me. -- nosy: +Rhamphoryncus ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue4074 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue3999] Real segmentation fault handler
Changes by Adam Olsen [EMAIL PROTECTED]: -- nosy: +Rhamphoryncus ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue3999 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue1215] Python hang when catching a segfault
Adam Olsen [EMAIL PROTECTED] added the comment: I'm in favour of just the doc change now. It's less work and we don't really need to disable that usage. ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue1215 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue4006] os.getenv silently discards env variables with non-UTF-8 values
Changes by Adam Olsen [EMAIL PROTECTED]: -- nosy: +Rhamphoryncus ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue4006 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
Re: 2.6, 3.0, and truly independent intepreters
On Fri, Oct 24, 2008 at 4:48 PM, Glenn Linderman [EMAIL PROTECTED] wrote:
> On approximately 10/24/2008 2:15 PM, came the following characters from the keyboard of Rhamphoryncus:
>> On Oct 24, 2:59 pm, Glenn Linderman [EMAIL PROTECTED] wrote:
>>> On approximately 10/24/2008 1:09 PM, came the following characters from the keyboard of Rhamphoryncus:
>>>> PyE: objects are reclassified as shareable or non-shareable, many types are now only allowed to be shareable. A module and its classes become shareable with the use of a __future__ import, and their shareddict uses a read-write lock for scalability. Most other shareable objects are immutable. Each thread is run in its own private monitor, and thus protected from the normal threading memory module nasties. Alas, this gives you all the semantics, but you still need scalable garbage collection.. and CPython's refcounting needs the GIL.
>>> Hmm. So I think your PyE is an attempt to be more explicit about what I said above in PyC: PyC threads do not share data between threads except by explicit interfaces. I consider your definitions of shared data types somewhat orthogonal to the types of threads, in that both PyA and PyC threads could use these new shared data items.
>> Unlike PyC, there's a *lot* shared by default (classes, modules, functions), but it requires only minimal recoding. It's as close to "have your cake and eat it too" as you're gonna get.
> Yes, but I like my cake frosted with performance; Guido's non-acceptance of granular locks in the blog entry someone referenced was due to the slowdown acquired with granular locking and shared objects. Your PyE model, with highly granular sharing, will likely suffer the same fate.

No, my approach includes scalable performance. Typical paths will involve *no* contention (ie no locking). Classes and modules use shareddict, which is based on a read-write lock built into the interpreter, so it's uncontended for read-only usage patterns. Pretty much everything else is immutable. Of course that doesn't include the cost of garbage collection. CPython's refcounting can't scale.

> The independent threads model, with only slight locking for a few explicitly shared objects, has a much better chance of getting better performance overall. With one thread running, it would be the same as today; with multiple threads, it should scale at the same rate as the system... minus any locking done at the higher level.

So use processes with a little IPC for these expensive-yet-shared objects. multiprocessing does it already.

>>> I think/hope that you meant that many types are now only allowed to be non-shareable? At least, I think that should be the default; they should be within the context of a single, independent interpreter instance, so other interpreters don't even know they exist, much less how to share them. If so, then I understand most of the rest of your paragraph, and it could be a way of providing shared objects, perhaps.
>> There aren't multiple interpreters under my model. You only need one. Instead, you create a monitor, and run a thread on it. A list is not shareable, so it can only be used within the monitor it's created within, but the list type object is shareable.
> The python interpreter code should be sharable, having been written in C, and being/becoming reentrant. So in that sense, there is only one interpreter. Similarly, any other reentrant C extensions would be that way. On the other hand, each thread of execution requires its own interpreter context, so that would have to be independent for the threads to be independent. It is the combination of code+context that I call an interpreter, and there would be one per thread for PyC threads. Bytecode for loaded modules could potentially be shared, if it is also immutable. However, that could be in my mental phase 2, as it would require an extra level of complexity in the interpreter as it creates shared bytecode... there would be a memory savings from avoiding multiple copies of shared bytecode, likely, and maybe also a compilation performance savings. So it sounds like a win, but it is a win that can be deferred for initial simplicity, to prove the concept is or is not workable. A monitor allows a single thread to run at a time; that is the same situation as the present GIL. I guess I don't fully understand your model.

To use your terminology, each monitor is a context. Each thread operates in a different monitor. As you say, most C functions are already thread-safe (reentrant). All I need to do is avoid letting multiple threads modify a single mutable object (such as a list) at a time, which I do by containing it within a single monitor (context).

--
Adam Olsen, aka Rhamphoryncus
--
http://mail.python.org/mailman/listinfo/python-list
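[To make the monitor idea concrete, here's a toy sketch in present-day Python; this is just the shape of the concept, not safethread's actual API or implementation:]

    import threading

    class Monitor(object):
        """A toy monitor: one lock guarding everything created inside it."""
        def __init__(self):
            self._lock = threading.RLock()
        def __enter__(self):
            self._lock.acquire()
            return self
        def __exit__(self, *exc_info):
            self._lock.release()

    class MonitoredList(object):
        """A list confined to a single monitor (non-shareable)."""
        def __init__(self, monitor):
            self.monitor = monitor
            self._items = []
        def append(self, item):
            with self.monitor:  # only one thread inside the monitor at a time
                self._items.append(item)

    shared = Monitor()
    lst = MonitoredList(shared)
    lst.append(42)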
Re: 2.6, 3.0, and truly independent intepreters
On Fri, Oct 24, 2008 at 5:38 PM, Glenn Linderman [EMAIL PROTECTED] wrote:
> On approximately 10/24/2008 2:16 PM, came the following characters from the keyboard of Rhamphoryncus:
>> On Oct 24, 3:02 pm, Glenn Linderman [EMAIL PROTECTED] wrote:
>>> On approximately 10/23/2008 2:24 PM, came the following characters from the keyboard of Rhamphoryncus:
>>>> On Oct 23, 11:30 am, Glenn Linderman [EMAIL PROTECTED] wrote:
>>>>> On approximately 10/23/2008 12:24 AM, came the following characters from the keyboard of Christian Heimes:
>>>>>> Andy wrote:
>>>>>> I'm very - not absolute, but very - sure that Guido and the initial designers of Python would have added the GIL anyway. The GIL makes Python faster on single core machines and more stable on multi core machines.
>>>>> Actually, the GIL doesn't make Python faster; it is a design decision that reduces the overhead of lock acquisition, while still allowing use of global variables. Using finer-grained locks has higher run-time cost; eliminating the use of global variables has a higher programmer-time cost, but would actually run faster and more concurrently than using a GIL. Especially on a multi-core/multi-CPU machine.
>>>> Those globals include classes, modules, and functions. You can't have *any* objects shared. Your interpreters are entirely isolated, much like processes (and we all start wondering why you don't use processes in the first place.)
>>> Indeed; isolated, independent interpreters are one of the goals. It is, indeed, much like processes, but in a single address space. It allows the master process (Python or C for the embedded case) to be coded using memory references and copies and pointer swaps instead of using semaphores, and potentially multi-megabyte message transfers. It is not clear to me that with the use of shared memory between processes, that the application couldn't use processes, and achieve many of the same goals. On the other hand, the code to create and manipulate processes and shared memory blocks is harder to write and has more overhead than the code to create and manipulate threads, which can, when told, access any memory block in the process. This allows the shared memory to be resized more easily, or more blocks of shared memory created more easily. On the other hand, the creation of shared memory blocks shouldn't be a high-use operation in a program that has sufficient number crunching to do to be able to consume multiple cores/CPUs.
>> Or use safethread. It imposes safe semantics on shared objects, so you can keep your global classes, modules, and functions. Still need garbage collection though, and on CPython that means refcounting and the GIL.
> Sounds like safethread has 35-40% overhead. Sounds like too much, to me.

The specific implementation of safethread, which attempts to remove the GIL from CPython, has significant overhead and had very limited success at being scalable. The monitor design proposed by safethread has no inherent overhead and is completely scalable.

--
Adam Olsen, aka Rhamphoryncus
--
http://mail.python.org/mailman/listinfo/python-list
[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created
Adam Olsen [EMAIL PROTECTED] added the comment: Marc, I don't understand what you're saying. UTF-16's surrogates are not optional. Unicode 2.0 and later require them, and Python is supposed to support it. Likewise, UCS-4 originally allowed a much larger range of code points, but it no longer does; allowing them would mean supporting only old, archaic versions of the standards (which is clearly not desirable.) You are right in that I shouldn't have said "a pair of ill-formed code units". I should have said "a pair of unassigned code points", which is how UCS-2 always has and always will classify them. Although Python may allow ill-formed sequences to be created internally (primarily lone surrogates on UTF-16 builds), it cannot encode or decode them. The standard is clear that these are to be treated as errors, which the .decode()'s errors argument controls. You could add a new value for errors to pass through the garbage, but I fail to see a use case for it. ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue3297 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
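[For illustration, on a modern Python 3, where the codecs do treat these sequences as errors, the errors argument behaves as described; this demo is mine, not code from any patch:]

    data = b'\xed\xa0\x81'          # UTF-8-style encoding of lone surrogate U+D801
    try:
        data.decode('utf-8')        # rejected as ill-formed by default
    except UnicodeDecodeError as exc:
        print(exc.reason)
    print(data.decode('utf-8', 'replace'))  # U+FFFD replacement character(s)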
[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created
Adam Olsen [EMAIL PROTECTED] added the comment: I've got another report open about the codecs not properly reporting errors relating to surrogates: issue 3672 ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue3297 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue3672] Ill-formed surrogates not treated as errors during encoding/decoding
New submission from Adam Olsen [EMAIL PROTECTED]: The Unicode FAQ makes it quite clear that any surrogates in UTF-8 or UTF-32 should be treated as errors. Lone surrogates in UTF-16 should probably be treated as errors too (but only during encoding/decoding; unicode objects on UTF-16 builds should allow them to be created through slicing).

http://unicode.org/faq/utf_bom.html#30
http://unicode.org/faq/utf_bom.html#42
http://unicode.org/faq/utf_bom.html#40

Lone surrogate in UTF-8 (effectively CESU-8):

>>> '\xED\xA0\x81'.decode('utf-8')
u'\ud801'

Surrogate pair in UTF-8:

>>> '\xED\xA0\x81\xED\xB0\x80'.decode('utf-8')
u'\ud801\udc00'

On a UTF-32 build, encoding a surrogate pair with UTF-16, then decoding again will produce the proper non-surrogate scalar value. This has security implications, although rare as characters outside the BMP are rare:

>>> u'\ud801\udc00'.encode('utf-16').decode('utf-16')
u'\U00010400'

Also on a UTF-32 build, decoding of a lone surrogate in UTF-16 fails (correctly), but encoding one does not:

>>> u'\ud801'.encode('utf-16')
'\xff\xfe\x01\xd8'

I have gotten a report of a user decoding bad data using x.decode('utf-8', 'replace'), then getting an error from Gtk+ when the ill-formed surrogates reached it. Fixing this would cause issue 3297 to blow up loudly, rather than silently.

-- messages: 71889 nosy: Rhamphoryncus severity: normal status: open title: Ill-formed surrogates not treated as errors during encoding/decoding ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue3672 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue3672] Ill-formed surrogates not treated as errors during encoding/decoding
Changes by Adam Olsen [EMAIL PROTECTED]: -- components: +Unicode type: - behavior ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue3672 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue1758146] Crash in PyObject_Malloc
Adam Olsen [EMAIL PROTECTED] added the comment: Graham, I appreciate the history of sub-interpreters and how entrenched they are. Changing those practises requires a significant investment. This is an important factor to consider. The other factor is the continuing maintenance and development cost. Subinterpreters add substantial complexity, which I can personally vouch for. This is exhibited in the GIL API not supporting them properly and in the various bugs that have been found over the years. Imagine, for a moment, that the situation were reversed; that everything were built on threading. Would you consider even for a moment adding sub-interpreters? How could you justify it? It's not a decision to be taken lightly, but my preference is clear: bite the bullet, make the change. It's easier in the long run. ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue1758146 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue3299] invalid object destruction in re.finditer()
Changes by Adam Olsen [EMAIL PROTECTED]: -- nosy: +Rhamphoryncus ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue3299 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created
Adam Olsen [EMAIL PROTECTED] added the comment: Marc, perhaps Unicode has refined their definitions since you last looked? Valid UTF-8 *cannot* contain surrogates[1]. If it does, you have CESU-8[2][3], not UTF-8. So there are two bugs: first, the UTF-8 codec should refuse to load surrogates. Second, since the original bug showed up before the .pyc is created, something in the parse/compilation/whatever stage is producing CESU-8. [1] 4th bullet point of D92 in http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf [2] http://unicode.org/reports/tr26/ [3] http://en.wikipedia.org/wiki/CESU-8 ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue3297 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created
Adam Olsen [EMAIL PROTECTED] added the comment: Err, to clarify, the parse/compile/whatever stages is producing broken UTF-32 (surrogates are ill-formed there too), and that gets transformed into CESU-8 when the .pyc is saved. ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue3297 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created
Adam Olsen [EMAIL PROTECTED] added the comment: Simpler way to reproduce this (on linux):

$ rm unicodetest.pyc
$
$ python -c 'import unicodetest'
Result: False
Len: 2 1
Repr: u'\ud800\udd23' u'\U00010123'
$
$ python -c 'import unicodetest'
Result: True
Len: 1 1
Repr: u'\U00010123' u'\U00010123'

Storing surrogates in UTF-32 is ill-formed[1], so the first part definitely shouldn't be failing on linux (with a UTF-32 build). The repr could go either way, as unicode doesn't cover escape sequences. We could allow u'\ud800\udd23' literals to magically become u'\U00010123' on UTF-32 builds. We already allow repr(u'\ud800\udd23') to magically become u'\U00010123' on UTF-16 builds (which is why the repr test always passes there, rather than always failing). The bigger problem is how much we prohibit ill-formed character sequences. We already prevent values above U+10FFFF, but not inappropriate surrogates.

[1] Search for D90 in http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf

-- nosy: +Rhamphoryncus Added file: http://bugs.python.org/file10880/unicodetest.py ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue3297 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
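[The attached unicodetest.py isn't preserved in this archive, but judging from the output above it was something along these lines - a Python 2 reconstruction, not the actual file:]

    # Compare a non-BMP literal written as a surrogate pair against
    # the same character written directly.
    a = u'\ud800\udd23'
    b = u'\U00010123'
    print "Result:", a == b
    print "Len:", len(a), len(b)
    print "Repr:", repr(a), repr(b)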
[issue3297] Python interpreter uses Unicode surrogate pairs only before the pyc is created
Adam Olsen [EMAIL PROTECTED] added the comment: No, the configure options are wrong - we do use UTF-16 and UTF-32. Although modern UCS-4 has been restricted down to the range of UTF-32 (it used to be larger!), UCS-2 still doesn't support the supplementary planes (ie no surrogates.) If it really were UCS-2, the repr wouldn't be u'\U00010123' on Windows; it'd be a pair of ill-formed code units instead. ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue3297 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue3329] API for setting the memory allocator used by Python
Adam Olsen [EMAIL PROTECTED] added the comment: Basically you just want to kick the malloc implementation into doing some housekeeping, freeing its caches? I'm kinda surprised you don't add the hook directly to your libc's malloc. IMO, there's no use-case for this until Py_Finalize can completely tear down the interpreter, which requires a lot of special work (killing(!) daemon threads, unloading C modules, etc), and nobody intends to do that at this point. The practical alternative, as I said, is to run python in a subprocess. Let the OS clean up after us. ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue3329 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
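[The child-process alternative is nearly a one-liner; 'mywork' below is a hypothetical stand-in for whatever the embedder actually runs:]

    import subprocess, sys

    # Run the Python work in a child process; when it exits, the OS
    # reclaims every byte, allocator caches and all.
    subprocess.check_call([sys.executable, '-c', 'import mywork; mywork.main()'])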
[issue874900] threading module can deadlock after fork
Adam Olsen [EMAIL PROTECTED] added the comment: In general I suggest replacing the lock with a new lock, rather than trying to release the existing one. Releasing *might* work in this case, only because it's really a semaphore underneath, but it's still easier to think about by just replacing. I also suggest deleting _active and recreating it with only the current thread. I don't understand how test_join_on_shutdown could succeed. The main thread shouldn't be marked as done.. well, ever. The test should hang. I suspect test_join_in_forked_process should call os.waitpid(childpid) so it doesn't exit early, which would cause the original Popen.wait() call to exit before the output is produced. The same problem of test_join_on_shutdown also applies. Ditto for test_join_in_forked_from_thread. ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue874900 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
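[A sketch of the replace-don't-release suggestion, with hypothetical helper and attribute names rather than the actual patch:]

    import threading

    def reinit_after_fork(state):
        # Replace the inherited lock outright: it may be held by a thread
        # that no longer exists in the forked child.
        state.active_lock = threading.Lock()
        # Recreate _active with only the surviving (current) thread.
        current = threading.current_thread()
        state.active = {current.ident: current}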
[issue874900] threading module can deadlock after fork
Adam Olsen [EMAIL PROTECTED] added the comment: Looking over some of the other platforms for thread_*.h, I'm sure replacing the lock is the right thing. ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue874900 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue3329] API for setting the memory allocator used by Python
Adam Olsen [EMAIL PROTECTED] added the comment: How would this allow you to free all memory? The interpreter will still reference it, so you'd have to have called Py_Finalize already, and promise not to call Py_Initialize afterwards. This further supposes the process will live a long time after killing off the interpreter, but in that case I recommend putting python in a child process instead. -- nosy: +Rhamphoryncus ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue3329 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue874900] threading module can deadlock after fork
Changes by Adam Olsen [EMAIL PROTECTED]: -- nosy: +Rhamphoryncus ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue874900 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue1758146] Crash in PyObject_Malloc
Adam Olsen [EMAIL PROTECTED] added the comment: Apparently modwsgi uses subinterpreters because some third-party packages aren't sufficiently thread-safe - modwsgi can't fix those packages, so subinterpreters are the next best thing. http://groups.google.com/group/modwsgi/browse_frm/thread/988bf560a1ae8147/2f97271930870989 This is a weak argument for language design. Subinterpreters should be deprecated, the problems with third-party packages found and fixed, and ultimately subinterpreters ripped out. If you wish to improve the situation, I suggest you help fix the problems in the third-party packages. For example, http://code.google.com/p/modwsgi/wiki/IntegrationWithTrac implies trac is configured with environment variables - clearly not thread-safe. ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue1758146 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue1758146] Crash in PyObject_Malloc
Adam Olsen [EMAIL PROTECTED] added the comment: Ahh, I did miss that bit, but it doesn't really matter. Tell modwsgi to only use the main interpreter (PythonInterpreter main_interpreter), and if you want multiple modules of the same name put them in different packages. Any other problems (trac using env vars for configuration) should be fixed directly. (My previous comment about building your own import mechanism was overkill. Writing a package that uses relative imports is enough - in fact, that's what relative imports are for.) ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue1758146 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue1758146] Crash in PyObject_Malloc
Adam Olsen [EMAIL PROTECTED] added the comment: Franco, you need to look at the line above that check:

    PyThreadState *check = PyGILState_GetThisThreadState();
    if (check && check->interp == newts->interp && check != newts)
        Py_FatalError("Invalid thread state for this thread");

PyGILState_GetThisThreadState returns the original tstate *for that thread*. What it's asserting is that, if there's a second tstate *in that thread*, it must be in a different subinterpreter. It doesn't prevent your second and third tstate from sharing the same subinterpreter, but it probably should, as this check implies it's an invariant. ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue1758146 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue1758146] Crash in PyObject_Malloc
Adam Olsen [EMAIL PROTECTED] added the comment: It's only checking that the original tstate *for the current thread* and the new tstate have a different subinterpreter. A subinterpreter can have multiple tstates, so long as they're all in different threads. The documentation is referring specifically to the PyGILState_Ensure and PyGILState_Release functions. Calling these says I want a tstate, and I don't know if I had one already. The problem is that, with subinterpreters, you may not get a tstate with the subinterpreter you want. subinterpreter references saved in globals may lead to obscure crashes or other errors - some of these have been fixed over the years, but I doubt they all have. ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue1758146 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue3268] Cleanup of tp_basicsize inheritance
New submission from Adam Olsen [EMAIL PROTECTED]: inherit_special contains logic to inherit the base type's tp_basicsize if the new type doesn't have it set. The logic was spread over several lines, but actually does almost nothing (presumably an artifact of previous versions), so here's a patch to clean it up. There was also an incorrect comment which I've removed. A new one should perhaps be added explaining what the other code there does, but it's not affected by what I'm changing, and I'm not sure why it's doing what it's doing anyway, so I'll leave that to someone else. -- files: python-inheritsize.diff keywords: patch messages: 69169 nosy: Rhamphoryncus, nnorwitz severity: normal status: open title: Cleanup of tp_basicsize inheritance Added file: http://bugs.python.org/file10798/python-inheritsize.diff ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue3268 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
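[The field being inherited is visible from Python as __basicsize__, which makes the behaviour easy to poke at - a quick illustration of tp_basicsize inheritance, separate from the patch itself:]

    class Base(object):
        pass              # gains __dict__ and __weakref__ slots, so it's bigger

    class Sub(Base):
        pass              # adds nothing, so tp_basicsize is inherited as-is

    print(object.__basicsize__)  # e.g. 16 on a 64-bit build
    print(Base.__basicsize__)    # larger: room for __dict__ and __weakref__
    print(Sub.__basicsize__)     # same as Base: inherited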
[issue3088] test_multiprocessing hangs on OS X 10.5.3
Adam Olsen [EMAIL PROTECTED] added the comment:

On Wed, Jul 2, 2008 at 3:44 PM, Mark Dickinson [EMAIL PROTECTED] wrote:
> Mark Dickinson [EMAIL PROTECTED] added the comment:
>> Mark, can you try commenting out _TestCondition and seeing if you can still get it to hang?
> I removed the _TestCondition class entirely from test_multiprocessing, and did "make test" again. It didn't hang! :-) It crashed instead. :-(

Try running "ulimit -c unlimited" in the shell before running the test (from the same shell). After it aborts it should dump a core file, which you can then inspect using "gdb ./python core", to which "bt" will give you a stack trace (backtrace).

On a minor note, I'd suggest running "./python -m test.regrtest" explicitly, rather than "make test". The latter runs the test suite twice, deleting all .pyc files before the first run, to detect problems in their creation. ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue3088 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue3088] test_multiprocessing hangs on OS X 10.5.3
Adam Olsen [EMAIL PROTECTED] added the comment:

On Wed, Jul 2, 2008 at 5:08 PM, Mark Dickinson [EMAIL PROTECTED] wrote:
> Mark Dickinson [EMAIL PROTECTED] added the comment:
> Okay. I just got about 5 perfect runs of the test suite, followed by:
>
> Macintosh-3:trunk dickinsm$ ./python.exe -m test.regrtest
> [...]
> test_multiprocessing
> Assertion failed: (bp != NULL), function PyObject_Malloc, file Objects/obmalloc.c, line 746.
> Abort trap (core dumped)
>
> I then did: gdb -c /cores/core.16235
> I've attached the traceback as traceback.txt

Are you sure that's right? That traceback has no mention of PyObject_Malloc or obmalloc.c. Try checking the date. Also, if you use "gdb ./python.exe corefile" to start gdb it should print a warning if the program doesn't match the core. ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue3088 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue3088] test_multiprocessing hangs on OS X 10.5.3
Adam Olsen [EMAIL PROTECTED] added the comment: That looks better. It crashed while deleting an exception, whose args tuple has a bogus refcount. Could be a refcount issue of the exception or the args, or of something that references them, or a dangling pointer, or a buffer overrun, etc. Things to try:

1) Run pystack in gdb, from Misc/gdbinit

2) Print the exception type. Use "up" until you reach BaseException_clear, then do "print self->ob_type->tp_name". Also do "print *self" and make sure the ob_refcnt is at 0 and the other fields look sane.

3) Compile using --without-pymalloc and throw it at a real memory debugger. I'd suggest starting with your libc's own debugging options, as they tend to be less invasive: http://developer.apple.com/documentation/Performance/Conceptual/ManagingMemory/Articles/MallocDebug.html . If that doesn't work, look at Electric Fence, Valgrind, or your tool of choice. ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue3088 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue3088] test_multiprocessing hangs on OS X 10.5.3
Adam Olsen [EMAIL PROTECTED] added the comment: Also, make sure you do a make clean since you last updated the tree or touched any file or ran configure. The automatic dependency checking isn't 100% reliable. ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue3088 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue3154] Quick search box renders too long on FireFox 3
Adam Olsen [EMAIL PROTECTED] added the comment: I've checked it again, using the font preferences rather than the zoom setting, and I can reproduce the problem. Part of the problem stems from using pixels to set the margin, rather than ems (or whatever the text box is based on). However, although the margin (at least visually) scales up evenly, the fonts themselves do not. Arguably this is a defect in Firefox, or maybe even the HTML specs themselves. Additionally, that only seems to control the visual margin. I've yet to figure out what controls the layout (such as wrapping the Go button). ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue3154 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue3112] implement PEP 3134 exception reporting
Adam Olsen [EMAIL PROTECTED] added the comment:

On Sun, Jun 22, 2008 at 2:56 PM, Antoine Pitrou [EMAIL PROTECTED] wrote:
> Le dimanche 22 juin 2008 à 20:40 +, Adam Olsen a écrit :
>> Passing in e.args is probably sufficient.
> I think it's very optimistic :-) Some exception objects can hold dynamic state which is simply not stored in the args tuple. See Twisted's Failure objects for an extreme example: http://twistedmatrix.com/trac/browser/trunk/twisted/python/failure.py (yes, it is used as an exception: see "raise self" in the trap() method)

Failure doesn't have an args tuple and doesn't subclass Exception (or BaseException) - it already needs modification in 3.0. It's heaped full of complexity and implementation details. I wouldn't be surprised if your changes break it in subtle ways too. In short, if forcing Failure to be rewritten is the only consequence of using .args, it's an acceptable tradeoff for not corrupting exception contexts. ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue3112 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue3112] implement PEP 3134 exception reporting
Adam Olsen [EMAIL PROTECTED] added the comment:

* cause/context cycles should be avoided. Naive traceback printing could become confused, and I can't think of any accidental way to provoke it (besides the problem mentioned here.)

* I suspect PyErr_Display handled string exceptions in 2.x, and this is an artifact of that

* No opinion on PyErr_DisplaySingle

* PyErr_Display is used by PyErr_Print, and it must end up with no active exception. Additionally, third party code may depend on this semantic. Maybe PyErr_DisplayEx?

* +1 on standardizing tracebacks

-- nosy: +Rhamphoryncus ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue3112 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue3112] implement PEP 3134 exception reporting
Adam Olsen [EMAIL PROTECTED] added the comment:

On Sun, Jun 22, 2008 at 8:07 AM, Antoine Pitrou [EMAIL PROTECTED] wrote:
> You mean they should be detected when the exception is set? I was afraid that it may make exception raising slower.

Reporting is not performance sensitive in comparison to exception raising.

> (the problem mentioned here is already avoided in the patch, but the detection of other cycles is deferred to exception reporting for the reason given above)

I meant only that trivial cycles should be detected. However, I hadn't read your patch, so I didn't realize you already knew of a way to create a non-trivial cycle. This has placed a niggling doubt in my mind about chaining the exceptions, rather than the tracebacks. Hrm.

>> * PyErr_Display is used by PyErr_Print, and it must end up with no active exception. Additionally, third party code may depend on this semantic. Maybe PyErr_DisplayEx?
> I was not proposing to change the exception swallowing semantics, just to add a return value indicating if any errors had occurred while displaying the exception.

Ahh, harmless then, but to what benefit? Wouldn't the traceback module be better suited to any possible error reporting? ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue3112 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue3112] implement PEP 3134 exception reporting
Adam Olsen [EMAIL PROTECTED] added the comment:

On Sun, Jun 22, 2008 at 1:04 PM, Antoine Pitrou [EMAIL PROTECTED] wrote:
> Antoine Pitrou [EMAIL PROTECTED] added the comment:
> Le dimanche 22 juin 2008 à 17:17 +, Adam Olsen a écrit :
>> I meant only that trivial cycles should be detected. However, I hadn't read your patch, so I didn't realize you already knew of a way to create a non-trivial cycle. This has placed a niggling doubt in my mind about chaining the exceptions, rather than the tracebacks. Hrm.
> Chaining the tracebacks rather than the exceptions loses important information: what is the nature of the exception which is the cause or context of the current exception?

I assumed each leg of the traceback would reference the relevant exception. Although.. this is effectively the same as creating a new exception instance when reraised, rather than modifying the old one. Reusing the old is done for performance, I believe.

> It is improbable to create such a cycle involuntarily, it means you raise an old exception in replacement of a newer one caused by the older, which I think is quite contorted. It is also quite easy to avoid creating the cycle, simply by re-raising outside of any except handler.

I'm not convinced.

    try:
        ...  # Lookup
    except A as a:
        # Lookup failed
        try:
            ...  # Fallback
        except B as b:
            # Fallback failed
            raise a  # The original exception is of the type we want

For this behaviour, this is the most natural way to write it. Conceptually, there shouldn't be a cycle - the traceback should be the lookup, then the fallback, then whatever code is above this - exactly the order the code executed in. ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue3112 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
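[For reference, the implicit chaining that eventually shipped is easy to see on a modern Python 3; this demo is mine, not part of the patch under discussion, and note that released CPython also went on to break the context cycles debated here:]

    try:
        try:
            raise ValueError("lookup failed")      # the original exception
        except ValueError:
            raise KeyError("fallback failed")      # __context__ set implicitly
    except KeyError as exc:
        print(repr(exc.__context__))               # the ValueError is preserved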
[issue3112] implement PEP 3134 exception reporting
Adam Olsen [EMAIL PROTECTED] added the comment:

On Sun, Jun 22, 2008 at 1:48 PM, Antoine Pitrou [EMAIL PROTECTED] wrote:
> Antoine Pitrou [EMAIL PROTECTED] added the comment:
> Le dimanche 22 juin 2008 à 19:23 +, Adam Olsen a écrit :
>> For this behaviour, this is the most natural way to write it. Conceptually, there shouldn't be a cycle
> I agree your example is not far-fetched. How about avoiding cycles for implicit chaining, and letting users shoot themselves in the foot with explicit recursive chaining if they want? Detection would be cheap enough, just a simple loop without any memory allocation.

That's still O(n). I'm not so easily convinced it's cheap enough.

And for that matter, I'm not convinced it's correct. The inner exception's context becomes clobbered when we modify the outer exception's traceback. The inner's context should reference the traceback as it was at that point.

This would all be a lot easier if reraising always created a new exception. Can you think of a way to skip that only when we can be sure it's safe? Maybe as simple as counting the references to it? ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue3112 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue3112] implement PEP 3134 exception reporting
Adam Olsen [EMAIL PROTECTED] added the comment:

On Sun, Jun 22, 2008 at 2:20 PM, Antoine Pitrou [EMAIL PROTECTED] wrote:
> Antoine Pitrou [EMAIL PROTECTED] added the comment:
> Le dimanche 22 juin 2008 à 19:57 +, Adam Olsen a écrit :
>> That's still O(n). I'm not so easily convinced it's cheap enough.
> O(n) when n will almost never be greater than 5 (and very often equal to 1 or 2), and when the unit is the cost of a pointer dereference plus the cost of a pointer comparison, still sounds cheap. We could bench it anyway.

Indeed.

>> And for that matter, I'm not convinced it's correct. The inner exception's context becomes clobbered when we modify the outer exception's traceback. The inner's context should reference the traceback as it was at that point.
> Yes, I've just thought about that, it's a bit annoying... We have to decide what is more annoying: that, or a reference cycle that can delay deallocation of stuff attached to an exception (including local variables attached to the tracebacks)?

The cycle is only created by broken behaviour. The more I think about it, the more I want to fix it (by not reusing the exception).

>> This would all be a lot easier if reraising always created a new exception.
> How do you duplicate an instance of a user-defined exception? Using an equivalent of copy.deepcopy()? It will probably end up much more expensive than the above-mentioned O(n) search.

Passing in e.args is probably sufficient.

> All this would need to be discussed on python-dev (or python-3000?) though.

>> Can you think of a way to skip that only when we can be sure it's safe? Maybe as simple as counting the references to it?
> I don't think so, the exception can be referenced in an unknown number of local variables (themselves potentially referenced by tracebacks).

Can be, or will be? Only the most common behaviour needs to be optimized. ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue3112 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
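[A sketch of what "passing in e.args" means - my illustration; as noted above, any state not stored in args would be lost, which is exactly Antoine's objection:]

    def fresh_copy(exc):
        # rebuild the exception from its args tuple alone
        return type(exc)(*exc.args)

    try:
        raise ValueError("lookup failed", 42)
    except ValueError as e:
        dup = fresh_copy(e)
        print(dup is e, dup.args == e.args)   # False True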
[issue3155] Python should expose a pthread_cond_timedwait API for threading
Changes by Adam Olsen [EMAIL PROTECTED]: -- nosy: +Rhamphoryncus ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue3155 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue3153] sqlite leaks on error
New submission from Adam Olsen [EMAIL PROTECTED]: Found in Modules/_sqlite/cursor.c:

    self->statement = PyObject_New(pysqlite_Statement, pysqlite_StatementType);
    if (!self->statement) {
        goto error;
    }
    rc = pysqlite_statement_create(self->statement, self->connection, operation);
    if (rc != SQLITE_OK) {
        self->statement = 0;
        goto error;
    }

Besides the ugliness of allocating the object before passing it to the create function, if pysqlite_statement_create fails, the object is leaked.

-- components: Extension Modules messages: 68478 nosy: Rhamphoryncus severity: normal status: open title: sqlite leaks on error type: resource usage ___ Python tracker [EMAIL PROTECTED] http://bugs.python.org/issue3153 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com