[issue27575] dict viewkeys intersection slow for large dicts

2016-09-06 Thread David Su

David Su added the comment:

ping

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue27575>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue27575] dict viewkeys intersection slow for large dicts

2016-08-01 Thread David Su

Changes by David Su <dysu1...@gmail.com>:


Added file: http://bugs.python.org/file43961/performance_test.sh

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue27575>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue27575] dict viewkeys intersection slow for large dicts

2016-08-01 Thread David Su

Changes by David Su <dysu1...@gmail.com>:


Added file: http://bugs.python.org/file43960/dictview_intersection_test.py

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue27575>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue27575] dict viewkeys intersection slow for large dicts

2016-08-01 Thread David Su

David Su added the comment:

Attached is a patch that I had been working on. Could you please review and 
provide me with some feedback? Thanks!

--
keywords: +patch
Added file: http://bugs.python.org/file43959/dict_view_intersection.patch

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue27575>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue27575] dict viewkeys intersection slow for large dicts

2016-07-23 Thread David Su

David Su added the comment:

I am trying to implement the following algorithm:

http://blog.faroo.com/2012/06/07/improved-edit-distance-based-spelling-correction/

This algorithm is used for fast nearest neighbor queries for spell correction. 

Given a corpus of strings, for each string s, I generate all subsequences 
(Note: subsequence, not substring) of that string length (len(n) - k), call 
that subsequence t, where k is a tunable parameter. Then, I create a dict 
mapping the subsequences t to the original string s.

This process creates a very large dict from a relatively small corpus. For 
example, my corpus of ~500k strings blew up to a dict of ~10M keys, with k=3.

Then, for a query string q, you generate all subsequences of (len(q) - k), and 
perform an intersection with the keys of the dict. Then, the values of the dict 
corresponding to the key intersection will be possible misspelling of the query 
string q.

In pseudo-python:

def get_subseq(s: str, k: int):
returns all subsequences of s of length len(s) - k

def build_dict(corpus: List[str], k: int):
d = defaultdict(list)

for orig_str in corpus:
for subseq in get_subseq(orig_str, k):
d[subseq].append(orig_str)

return d

def query(d: dict, q: str, k:int):
q_subseq = set(get_subseq(q, k))
intersection = d.viewkeys() & q_subseq # this is slow when len(d) is large
return [orig_str for k in intersection for orig_str in d[k]]


By exposing an intersection interface, a certain degree of performance is 
expected by the user. It took me a considerable amount of debugging to realize 
that the dict.viewkeys() intersection was the performance bottleneck. I finally 
decided to dig into the CPython source code before discovering the root cause 
of the issue. While I am familiar with C and found the issue relatively 
quickly, for those that aren't, it should not be necessary to dig into the 
source code for a mainstream programming language like Python.


I am currently looking at how to potentially fix this in the CPython 
repository. I may submit a patch in the near future if I find a good solution.

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue27575>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue27575] dict viewkeys intersection slow for large dicts

2016-07-20 Thread David Su

David Su added the comment:

Here are some tests that I ran:


Using current naive viewkeys intersection:

$ python -m timeit -n 100 -s "d = dict.fromkeys(xrange(1000), 0)" 
"d.viewkeys() & {0}"
100 loops, best of 3: 447 msec per loop

Nearly half a second per iteration is unacceptably slow.



Solution 1 from my previous comment:

$ python -m timeit -n 100 -s "d = dict.fromkeys(xrange(1000), 0); s = 
set(d)" "s & {0}"
100 loops, best of 3: 0.391 usec per loop

This solution caches the keys of the dict in a set, then performs all 
intersections on that set. 
The initial creation of the set is slow, and it wastes a lot of memory, but 
each iteration is much faster.



The workaround that I ended up using:

$ python -m timeit -n 100 -s "d = dict.fromkeys(xrange(1000), 0)" "{item 
for item in {0} if item in d}"
100 loops, best of 3: 0.629 usec per loop

This solution just looks up each value in the set to see if it is also in the 
dict using a set comprehension.
This avoids wasting time and space on creating a set. For my use case, memory 
was a bigger issue as my dictionary can approach tens of GBs in size, and 
almost doubling the memory consumption from creating a cache set was 
unacceptable. 

Of course, in the set comprehension, the smaller of the dict/set should be 
iterated over to satisfy the O(min(len(d), len(s))) condition as specified in 
https://wiki.python.org/moin/TimeComplexity.
A implementation would look something like this:

def efficient_intersection(smaller_dict_or_set, bigger_dict_or_set):
if len(bigger_dict_or_set) < len(smaller_dict_or_set):
bigger_dict_or_set, smaller_dict_or_set = smaller_dict_or_set, 
bigger_dict_or_set

return {item for item in smaller_dict_or_set if item in bigger_dict_or_set}




In conclusion, porting over the set intersection logic to viewkeys would be the 
ideal solution. It avoids wasting memory like in the set cache solution, and I 
am sure the C implementation will be much faster than my workaround of manually 
performing an intersection using set comprehensions.

My view is that dict.viewkeys() & set(...) should be as performant as the 
syntax suggests.

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue27575>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue27575] dict viewkeys intersection slow for large dicts

2016-07-20 Thread David Su

Changes by David Su <dysu1...@gmail.com>:


--
nosy: +rhettinger

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue27575>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue27575] dict viewkeys intersection slow for large dicts

2016-07-19 Thread David Su

New submission from David Su:

In dictobject.c, the function dictviews_and performs a very expensive set 
creation operation on the keys of the self dict object:

>From Python 2.7:
...
static PyObject*
dictviews_and(PyObject* self, PyObject *other)
{
PyObject *result = PySet_New(self);
...
tmp = PyObject_CallMethod(result, "intersection_update", "(O)", other);
...

Similar logic can be found in the Python 3 code as well.


For large dicts (~10 000 000 keys), it takes a significant amount of time to 
copy the keys into a new set, and also wastes a lot of memory. This issue is 
amplified when the intersection operation is performed many times.

This implementation uses O(len(self)) time and space, while it is expected to 
use O(min(len(self), len(other)) time and space.


Solution 1: Manually copy the keys of the dict into a set, and perform all 
intersection operations on that set. However, still wastes lots of memory. 

Solution 2 (Recommended): Port the intersection logic from the set class to the 
dict view class.

--
components: Library (Lib)
messages: 270836
nosy: David Su2
priority: normal
severity: normal
status: open
title: dict viewkeys intersection slow for large dicts
type: performance
versions: Python 2.7, Python 3.2, Python 3.3, Python 3.4, Python 3.5, Python 3.6

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue27575>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com