[issue27575] dict viewkeys intersection slow for large dicts
David Su added the comment: ping -- ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue27575> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue27575] dict viewkeys intersection slow for large dicts
Changes by David Su <dysu1...@gmail.com>: Added file: http://bugs.python.org/file43961/performance_test.sh ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue27575> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue27575] dict viewkeys intersection slow for large dicts
Changes by David Su <dysu1...@gmail.com>: Added file: http://bugs.python.org/file43960/dictview_intersection_test.py ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue27575> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue27575] dict viewkeys intersection slow for large dicts
David Su added the comment: Attached is a patch that I had been working on. Could you please review and provide me with some feedback? Thanks! -- keywords: +patch Added file: http://bugs.python.org/file43959/dict_view_intersection.patch ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue27575> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue27575] dict viewkeys intersection slow for large dicts
David Su added the comment: I am trying to implement the following algorithm: http://blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/ This algorithm is used for fast nearest neighbor queries for spell correction. Given a corpus of strings, for each string s, I generate all subsequences (Note: subsequence, not substring) of that string length (len(n) - k), call that subsequence t, where k is a tunable parameter. Then, I create a dict mapping the subsequences t to the original string s. This process creates a very large dict from a relatively small corpus. For example, my corpus of ~500k strings blew up to a dict of ~10M keys, with k=3. Then, for a query string q, you generate all subsequences of (len(q) - k), and perform an intersection with the keys of the dict. Then, the values of the dict corresponding to the key intersection will be possible misspelling of the query string q. In pseudo-python: def get_subseq(s: str, k: int): returns all subsequences of s of length len(s) - k def build_dict(corpus: List[str], k: int): d = defaultdict(list) for orig_str in corpus: for subseq in get_subseq(orig_str, k): d[subseq].append(orig_str) return d def query(d: dict, q: str, k:int): q_subseq = set(get_subseq(q, k)) intersection = d.viewkeys() & q_subseq # this is slow when len(d) is large return [orig_str for k in intersection for orig_str in d[k]] By exposing an intersection interface, a certain degree of performance is expected by the user. It took me a considerable amount of debugging to realize that the dict.viewkeys() intersection was the performance bottleneck. I finally decided to dig into the CPython source code before discovering the root cause of the issue. While I am familiar with C and found the issue relatively quickly, for those that aren't, it should not be necessary to dig into the source code for a mainstream programming language like Python. I am currently looking at how to potentially fix this in the CPython repository. I may submit a patch in the near future if I find a good solution. -- ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue27575> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue27575] dict viewkeys intersection slow for large dicts
David Su added the comment: Here are some tests that I ran: Using current naive viewkeys intersection: $ python -m timeit -n 100 -s "d = dict.fromkeys(xrange(1000), 0)" "d.viewkeys() & {0}" 100 loops, best of 3: 447 msec per loop Nearly half a second per iteration is unacceptably slow. Solution 1 from my previous comment: $ python -m timeit -n 100 -s "d = dict.fromkeys(xrange(1000), 0); s = set(d)" "s & {0}" 100 loops, best of 3: 0.391 usec per loop This solution caches the keys of the dict in a set, then performs all intersections on that set. The initial creation of the set is slow, and it wastes a lot of memory, but each iteration is much faster. The workaround that I ended up using: $ python -m timeit -n 100 -s "d = dict.fromkeys(xrange(1000), 0)" "{item for item in {0} if item in d}" 100 loops, best of 3: 0.629 usec per loop This solution just looks up each value in the set to see if it is also in the dict using a set comprehension. This avoids wasting time and space on creating a set. For my use case, memory was a bigger issue as my dictionary can approach tens of GBs in size, and almost doubling the memory consumption from creating a cache set was unacceptable. Of course, in the set comprehension, the smaller of the dict/set should be iterated over to satisfy the O(min(len(d), len(s))) condition as specified in https://wiki.python.org/moin/TimeComplexity. A implementation would look something like this: def efficient_intersection(smaller_dict_or_set, bigger_dict_or_set): if len(bigger_dict_or_set) < len(smaller_dict_or_set): bigger_dict_or_set, smaller_dict_or_set = smaller_dict_or_set, bigger_dict_or_set return {item for item in smaller_dict_or_set if item in bigger_dict_or_set} In conclusion, porting over the set intersection logic to viewkeys would be the ideal solution. It avoids wasting memory like in the set cache solution, and I am sure the C implementation will be much faster than my workaround of manually performing an intersection using set comprehensions. My view is that dict.viewkeys() & set(...) should be as performant as the syntax suggests. -- ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue27575> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue27575] dict viewkeys intersection slow for large dicts
Changes by David Su <dysu1...@gmail.com>: -- nosy: +rhettinger ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue27575> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue27575] dict viewkeys intersection slow for large dicts
New submission from David Su: In dictobject.c, the function dictviews_and performs a very expensive set creation operation on the keys of the self dict object: >From Python 2.7: ... static PyObject* dictviews_and(PyObject* self, PyObject *other) { PyObject *result = PySet_New(self); ... tmp = PyObject_CallMethod(result, "intersection_update", "(O)", other); ... Similar logic can be found in the Python 3 code as well. For large dicts (~10 000 000 keys), it takes a significant amount of time to copy the keys into a new set, and also wastes a lot of memory. This issue is amplified when the intersection operation is performed many times. This implementation uses O(len(self)) time and space, while it is expected to use O(min(len(self), len(other)) time and space. Solution 1: Manually copy the keys of the dict into a set, and perform all intersection operations on that set. However, still wastes lots of memory. Solution 2 (Recommended): Port the intersection logic from the set class to the dict view class. -- components: Library (Lib) messages: 270836 nosy: David Su2 priority: normal severity: normal status: open title: dict viewkeys intersection slow for large dicts type: performance versions: Python 2.7, Python 3.2, Python 3.3, Python 3.4, Python 3.5, Python 3.6 ___ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue27575> ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com