Re: [Python-Dev] A new dict for Xmas?
Am 22.12.2011 19:15, schrieb Maciej Fijalkowski: >> - I wonder whether the shared keys could be computed at compile >> time, considering all attribute names that get assigned for >> self. The compiler could list those in the code object, and >> class creation could iterate over all methods (taking base >> classes into account). > > This is hard, because sometimes you don't quite know what the self > *is* even, especially if __init__ calls some methods or there is any > sort of control flow. You can however track what gets assigned at > runtime at have shapes associated with objects. Actually, it's fairly easy, as it only needs to be heuristical. I am proposing the exact heuristics as specified above ("attribute names that get assigned for self"). I don't think that __init__ calling methods is much of an issue here, since these methods then still have attributes assigned to self. Regards, Martin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] A new dict for Xmas?
Martin v. Löwis wrote: The current dict implementation is getting pretty old, isn't it time we had a new one (for xmas)? I like the approach, and I think something should be done indeed. If you don't contribute your approach, I'd like to drop at least ma_smalltable for 3.3. A number of things about your branch came to my mind: - it would be useful to have a specialized representation for all-keys-are-strings. In that case, me_hash could be dropped from the representation. You would get savings compared to the status quo even in the non-shared case. It might tricky switching key tables and I dont think it would save much memory as keys that are widely shared take up very little memory anyway, and not many other dicts are long-lived. (It might improve performance for dicts used for keyword arguments) - why does _dictkeys need to be a full-blown Python object? We need refcounting and the size, but not the type slot. It doesn't. It's just a hangover from my original HotPy implementation where all objects needed a type for the GC. So yes, the type slot could be removed. - I wonder whether the shared keys could be computed at compile time, considering all attribute names that get assigned for self. The compiler could list those in the code object, and class creation could iterate over all methods (taking base classes into account). It probably wouldn't be that hard to make a guess at compile time as to what the shared keys would be, but it doesn't really matter. The generation of intermediate shared keys will only happen once per class, so the overhead would be negligible. To cut down on that overhead, we could use a ref-count trick: If the instance being updating and its class hold the only two refs to an immutable keys(-set -table -vector?) then just treat it as mutable. I'll modify the repo to incorporate these changes when I have a chance. Cheers, Mark. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] A new dict for Xmas?
>> - it would be useful to have a specialized representation for >> all-keys-are-strings. In that case, me_hash could be dropped >> from the representation. You would get savings compared to >> the status quo even in the non-shared case. > It might tricky switching key tables and I dont think it would save much > memory as keys that are widely shared take up very little memory anyway, > and not many other dicts are long-lived. Why do you say that? In a plain 3.3 interpreter, I counted 595 dict objects (see script below). Of these, 563 (so nearly of them) had only strings as keys. Among those, I found 286 different key sets, where 231 key sets occurred only once (i.e. wouldn't be shared). Together, the string dictionaries had 13282 keys, and you could save as many pointers (actually more, because there will be more key slots than keys). I'm not sure why you think the string dicts with unshared keys would be short-lived. But even if they were, what matters is the steady-state number of dictionaries - if for every short-lived dictionary that gets released another one is created, any memory savings from reducing the dict size would still materialize. >> - I wonder whether the shared keys could be computed at compile >> time, considering all attribute names that get assigned for >> self. The compiler could list those in the code object, and >> class creation could iterate over all methods (taking base >> classes into account). >> > > It probably wouldn't be that hard to make a guess at compile time as to > what the shared keys would be, but it doesn't really matter. > The generation of intermediate shared keys will only happen once per > class, so the overhead would be negligible. I'm not so much concerned about overhead, but about correctness/ effectiveness of the heuristics. For a class with dynamic attributes, you may well come up with a very large key set. With source analysis, you wouldn't attempt to grow the keyset beyond what likely is being shared. Regards, Martin import sys d = sys.getobjects(0,dict) print(len(d), "dicts") d2 = [] for o in d: keys = o.keys() if not keys:continue types = tuple(set(type(k) for k in keys)) if types != (str,): continue d2.append(tuple(sorted(keys))) print(len(d2), "str dicts") freq = {} for keys in d2: freq[keys] = freq.get(keys,0)+1 print(len(freq), "different key sets") freq = sorted(freq.items(), key=lambda t:t[1]) print(len([o for o in freq if o[1]==1]), "unsharable") print(sum(len(o[0]) for o in freq), "keys") print(freq[-10:]) ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] A new dict for Xmas?
Martin v. Löwis wrote: - it would be useful to have a specialized representation for all-keys-are-strings. In that case, me_hash could be dropped from the representation. You would get savings compared to the status quo even in the non-shared case. It might tricky switching key tables and I dont think it would save much memory as keys that are widely shared take up very little memory anyway, and not many other dicts are long-lived. Why do you say that? In a plain 3.3 interpreter, I counted 595 dict objects (see script below). Of these, 563 (so nearly of them) had only strings as keys. Among those, I found 286 different key sets, where 231 key sets occurred only once (i.e. wouldn't be shared). Together, the string dictionaries had 13282 keys, and you could save as many pointers (actually more, because there will be more key slots than keys). The question is how much memory needs to be saved to be worth adding the complexity, 10kb: No, 100Mb: yes. So data from "real" benchmarks would be useful. Also, I'm assuming that it would be tricky to implement correctly due to implicit assumptions in the rest of the code. If I'm wrong and its easy to implement then please do. I'm not sure why you think the string dicts with unshared keys would be short-lived. Not all, but most. Most dicts with unshared keys would most likely be for keyword parameters. Explicit dicts tend to be few in number. (When I say few I mean up to 1k, not 100k or 1M). Module dicts are very likely to have unshared keys; they number in the 10s or 100s, but they do tend to be large. But even if they were, what matters is the steady-state number of dictionaries - if for every short-lived dictionary that gets released another one is created, any memory savings from reducing the dict size would still materialize. But only a few kb? - I wonder whether the shared keys could be computed at compile time, considering all attribute names that get assigned for self. The compiler could list those in the code object, and class creation could iterate over all methods (taking base classes into account). It probably wouldn't be that hard to make a guess at compile time as to what the shared keys would be, but it doesn't really matter. The generation of intermediate shared keys will only happen once per class, so the overhead would be negligible. I'm not so much concerned about overhead, but about correctness/ effectiveness of the heuristics. For a class with dynamic attributes, you may well come up with a very large key set. With source analysis, you wouldn't attempt to grow the keyset beyond what likely is being shared. I agree some sort of heuristic is required to limit excessive growth and prevent pathological behaviour. The current implementation just has a cut off at a certain size; it could definitely be improved. As I said, I'll update the code soon and then, well what's the phase... Oh yes, "patches welcome" ;) Thanks for the feedback. Cheers, Mark. Regards, Martin import sys d = sys.getobjects(0,dict) print(len(d), "dicts") d2 = [] for o in d: keys = o.keys() if not keys:continue types = tuple(set(type(k) for k in keys)) if types != (str,): continue d2.append(tuple(sorted(keys))) print(len(d2), "str dicts") freq = {} for keys in d2: freq[keys] = freq.get(keys,0)+1 print(len(freq), "different key sets") freq = sorted(freq.items(), key=lambda t:t[1]) print(len([o for o in freq if o[1]==1]), "unsharable") print(sum(len(o[0]) for o in freq), "keys") print(freq[-10:]) ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] A new dict for Xmas?
Mark Shannon, 23.12.2011 12:21: Martin v. Löwis wrote: - it would be useful to have a specialized representation for all-keys-are-strings. In that case, me_hash could be dropped from the representation. You would get savings compared to the status quo even in the non-shared case. It might tricky switching key tables and I dont think it would save much memory as keys that are widely shared take up very little memory anyway, and not many other dicts are long-lived. Why do you say that? In a plain 3.3 interpreter, I counted 595 dict objects (see script below). Of these, 563 (so nearly of them) had only strings as keys. Among those, I found 286 different key sets, where 231 key sets occurred only once (i.e. wouldn't be shared). Together, the string dictionaries had 13282 keys, and you could save as many pointers (actually more, because there will be more key slots than keys). The question is how much memory needs to be saved to be worth adding the complexity, 10kb: No, 100Mb: yes. So data from "real" benchmarks would be useful. Consider taking a parsed MiniDOM tree as a benchmark. It contains so many instances of just a couple of different classes that it just has to make a huge difference if each of those instances is even just a bit smaller. It should also make a clear difference for plain Python ElementTree. I attached a benchmark script that measures the parsing speed as well as the total memory usage of the in-memory tree. You can get data files from the following places, just download them and pass their file names on the command line: http://gnosis.cx/download/hamlet.xml http://www.ibiblio.org/xml/examples/religion/ot/ot.xml Here are some results from my own machine for comparison: http://blog.behnel.de/index.php?p=197 Stefan # $Id: benchmark.py 3248 2007-09-02 15:01:26Z fredrik $ # simple elementtree benchmark program from xml.etree import ElementTree try: from xml.etree import cElementTree except ImportError: cElementTree = None try: from lxml import etree except ImportError: etree = None try: from elementtree import XMLTreeBuilder # xmllib except ImportError: XMLTreeBuilder = None try: from elementtree import SimpleXMLTreeBuilder # xmllib except ImportError: SimpleXMLTreeBuilder = None try: from elementtree import SgmlopXMLTreeBuilder # sgmlop except ImportError: SgmlopXMLTreeBuilder = None try: from xml.dom import minidom # pyexpat+minidom except ImportError: minidom = None try: import resource except ImportError: resource = None import os, sys import traceback from time import time FORK=True def fork(func): if not hasattr(os, 'fork'): return func def wrap(*args, **kwargs): if not FORK: return func(*args, **kwargs) cid = os.fork() if cid: os.waitpid(cid, 0) else: try: func(*args, **kwargs) except Exception: traceback.print_exc() finally: os._exit(0) return wrap def measure_mem(old=0): if resource is None: return used = resource.getrusage(resource.RUSAGE_SELF) print('Memory usage: %s%s' % (used.ru_maxrss, (' (+%s)' % (used.ru_maxrss - old)) if old > 0 else '')) return used.ru_maxrss @fork def benchmark(file, builder_module): oldmem = measure_mem() with open(file, "rb") as source: t = time() try: builder = builder_module.XMLParser except AttributeError: builder = builder_module.TreeBuilder parser = builder() while 1: data = source.read(32768) if not data: break parser.feed(data) tree = parser.close() t = time() - t print("%s.%s.feed(): %d nodes read in %.3f seconds" % ( builder_module.__name__, builder.__name__, len(list(tree.getiterator())), t )) measure_mem(oldmem) del tree @fork def benchmark_parse(file, driver): oldmem = measure_mem() t = time() tree = driver.parse(file) t = time() - t print(driver.__name__ + ".parse done in %.3f seconds" % t) measure_mem(oldmem) del tree @fork def benchmark_minidom(file): oldmem = measure_mem() t = time() dom = minidom.parse(file) t = time() - t print("minidom tree read in %.3f seconds" % t) measure_mem(oldmem) del dom class configure_parser(object): def __init__(self, etree, name, **config): self.__name__ = name self.etree = etree self.parser = etree.XMLParser(**config) def parse(self, input): return self.etree.parse(input, self.parser) def run_benchmark(file): benchmark_parse(file, ElementTree) if cElementTree is not None: benchmark_parse(file, cElementTree) benchmark(file, cElementTree) if etree is not None: benchmark_parse(file, etree) benchmark_parse(file, configure_parser(
Re: [Python-Dev] A new dict for Xmas?
> If I'm wrong and its easy to implement then please do. Ok, so I take it that you are not interested in the idea. No problem. Regards, Martin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] A new dict for Xmas?
> Consider taking a parsed MiniDOM tree as a benchmark. It contains so > many instances of just a couple of different classes that it just has to > make a huge difference if each of those instances is even just a bit > smaller. It should also make a clear difference for plain Python > ElementTree. Of course, for minidom, Mark's current implementation should already save quite a lot of memory, since all elements and text nodes have the same attributes. Still, it would be good to see how Mark's implementation deals with that. Regards, Martin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] A new dict for Xmas?
Martin v. Löwis wrote: If I'm wrong and its easy to implement then please do. Ok, so I take it that you are not interested in the idea. No problem. Its just that I don't think it would yield results commensurate with the effort. Also I think its worth keeping the initial version as simple as reasonably possible. Refinements can be added later. Cheers, Mark. Regards, Martin ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
[Python-Dev] Summary of Python tracker Issues
ACTIVITY SUMMARY (2011-12-16 - 2011-12-23) Python tracker at http://bugs.python.org/ To view or respond to any of the issues listed below, click on the issue. Do NOT respond to this message. Issues counts and deltas: open3168 ( -7) closed 22272 (+52) total 25440 (+45) Open issues with patches: 1358 Issues opened (27) == #13614: `setup.py register` fails if long_description contains RST http://bugs.python.org/issue13614 opened by techtonik #13615: `setup.py register` fails with -r argument http://bugs.python.org/issue13615 opened by techtonik #13617: Reject embedded null characters in wchar* strings http://bugs.python.org/issue13617 opened by haypo #13619: Add a new codec: "locale", the current locale encoding http://bugs.python.org/issue13619 opened by haypo #13621: Unicode performance regression in python3.3 vs python3.2 http://bugs.python.org/issue13621 opened by Boris.FELD #13629: _PyParser_TokenNames does not match up with the token.h number http://bugs.python.org/issue13629 opened by meador.inge #13630: IDLE: Find(ed) text is not highlighted while dialog box is ope http://bugs.python.org/issue13630 opened by marco #13631: readline fails to parse some forms of .editrc under editline ( http://bugs.python.org/issue13631 opened by zvezdan #13632: Update token documentation to reflect actual token types http://bugs.python.org/issue13632 opened by meador.inge #13633: Handling of hex character references in HTMLParser.handle_char http://bugs.python.org/issue13633 opened by ezio.melotti #13636: Python SSL Stack doesn't have a Secure Default set of ciphers http://bugs.python.org/issue13636 opened by naif #13638: PyErr_SetFromErrnoWithFilenameObject is undocumented http://bugs.python.org/issue13638 opened by pitrou #13639: UnicodeDecodeError when creating tar.gz with unicode name http://bugs.python.org/issue13639 opened by jason.coombs #13640: add mimetype for application/vnd.apple.mpegurl http://bugs.python.org/issue13640 opened by Hiroaki.Kawai #13641: decoding functions in the base64 module could accept unicode s http://bugs.python.org/issue13641 opened by pitrou #13642: urllib incorrectly quotes username and password in https basic http://bugs.python.org/issue13642 opened by joneskoo #13643: 'ascii' is a bad filesystem default encoding http://bugs.python.org/issue13643 opened by gz #13644: Python 3 crashes (segfaults) with this code. http://bugs.python.org/issue13644 opened by maniram.maniram #13645: test_import fails after test_coding http://bugs.python.org/issue13645 opened by pitrou #13646: Document poor interaction between multiprocessing and -m on Wi http://bugs.python.org/issue13646 opened by ncoghlan #13647: Python SSL stack doesn't securely validate certificate (as cli http://bugs.python.org/issue13647 opened by naif #13649: termios.ICANON is not documented http://bugs.python.org/issue13649 opened by techtonik #13651: Improve redirection in urllib http://bugs.python.org/issue13651 opened by tom.kel #13653: reorder set.intersection parameters for better performance http://bugs.python.org/issue13653 opened by dalke #13655: Python SSL stack doesn't have a default CA Store http://bugs.python.org/issue13655 opened by naif #13657: IDLE doesn't support sys.ps1 and sys.ps2. http://bugs.python.org/issue13657 opened by maniram.maniram #13658: Extra clause in class grammar documentation http://bugs.python.org/issue13658 opened by Joshua.Landau Most recent 15 issues with no replies (15) == #13658: Extra clause in class grammar documentation http://bugs.python.org/issue13658 #13657: IDLE doesn't support sys.ps1 and sys.ps2. http://bugs.python.org/issue13657 #13649: termios.ICANON is not documented http://bugs.python.org/issue13649 #13642: urllib incorrectly quotes username and password in https basic http://bugs.python.org/issue13642 #13641: decoding functions in the base64 module could accept unicode s http://bugs.python.org/issue13641 #13640: add mimetype for application/vnd.apple.mpegurl http://bugs.python.org/issue13640 #13638: PyErr_SetFromErrnoWithFilenameObject is undocumented http://bugs.python.org/issue13638 #13633: Handling of hex character references in HTMLParser.handle_char http://bugs.python.org/issue13633 #13632: Update token documentation to reflect actual token types http://bugs.python.org/issue13632 #13631: readline fails to parse some forms of .editrc under editline ( http://bugs.python.org/issue13631 #13608: remove born-deprecated PyUnicode_AsUnicodeAndSize http://bugs.python.org/issue13608 #13605: document argparse's nargs=REMAINDER http://bugs.python.org/issue13605 #13594: Aifc markers write fix http://bugs.python.org/issue13594 #13590: Prebuilt python-2.7.2 binaries for macosx can not compile c ex http://bugs.python.org/issue13590 #13574: refresh example in doc for Extending and Embedding http://bugs.python.org/issue13574 Most recent 15 issues