[issue9520] Add Patricia Trie high performance container

2010-11-13 Thread Raymond Hettinger

Raymond Hettinger rhettin...@users.sourceforge.net added the comment:

I don't think a Patricia Trie is going to find its way into the code 
distribution.  It has a better chance as a third-party module listed on PyPI 
(much in the same way that people access memcached from Python).

--
resolution:  -> rejected
status: open -> closed

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9520
___



[issue9520] Add Patricia Trie high performance container

2010-11-11 Thread Dirkjan Ochtman

Changes by Dirkjan Ochtman dirk...@ochtman.nl:


--
nosy: +djc

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9520
___



[issue9520] Add Patricia Trie high performance container

2010-11-11 Thread Raymond Hettinger

Changes by Raymond Hettinger rhettin...@users.sourceforge.net:


--
assignee:  -> rhettinger

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9520
___



[issue9520] Add Patricia Trie high performance container

2010-11-11 Thread Łukasz Langa

Changes by Łukasz Langa luk...@langa.pl:


--
nosy: +lukasz.langa

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9520
___



[issue9520] Add Patricia Trie high performance container

2010-08-14 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

> For example, on the x64 machine the following dict() mapping 
> 10,000,000 very short unicode keys (~7 chars) to integers eats 149
> bytes per entry.

This is counting the keys too. Under 3.x:

>>> import sys
>>> d = {}
>>> for i in range(0, 10000000): d[str(i)] = i
... 
>>> sys.getsizeof(d)
402653464
>>> sys.getsizeof(d) / len(d)
40.2653464

So, the dict itself uses ~40 bytes/entry. Since this is a 64-bit Python build, 
each entry uses three words of 8 bytes each, that is 24 bytes per entry (one 
word for the key, one word for the associated value, one word for the cached 
hash value). So, you see, the ratio of allocated entries in the hash table over 
used entries is only about 1.7, which is reasonable.
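
You can see that ratio directly by dividing the reported size by the 24-byte 
entry size (a minimal sketch; it assumes the open-addressing dict layout 
described above, so the exact numbers depend on the interpreter version and 
platform):

import sys

def allocated_to_used(n):
    # getsizeof(d) / 24 approximates the number of *allocated* 24-byte
    # entries; dividing by len(d) gives the allocated/used ratio.
    d = dict((str(i), i) for i in range(n))
    return sys.getsizeof(d) / 24.0 / len(d)

print(allocated_to_used(10**7))   # ~1.7 on the build measured above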

Do note that unicode objects themselves are not that compact:

>>> sys.getsizeof("100")
72

If you have many of them, you might use bytestrings instead:

>>> sys.getsizeof(b"100")
40
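
For instance, if all keys are known to be ASCII, encoding them once before 
insertion trades convenience for memory (a sketch; the savings depend on the 
build and the string contents):

import sys

str_keys = [str(i) for i in range(10**6)]
bytes_keys = [k.encode('ascii') for k in str_keys]
print(sum(map(sys.getsizeof, str_keys)))     # total size of the unicode keys
print(sum(map(sys.getsizeof, bytes_keys)))   # bytestring keys are smaller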

I've modified your benchmark to run under 3.x and will post it in a later 
message (I don't know whether bio.trie exists for 3.x, though).

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9520
___



[issue9520] Add Patricia Trie high performance container

2010-08-14 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

So, here is the modified benchmark. It first creates a cache of the random 
wordlist, because it takes quite a long time to compute (for N up to 10000000). 
The cache is reused in subsequent runs. Holding it takes some memory, though 
(you probably can't run this with less than 2GB). The cache itself is quite 
small on-disk (around 80 MB).

I'm outputting the total dict size and the space occupied per individual key, 
and I've also timed lookups separately. Here are results on my machine:

    1000 words (     961 keys), 3269137 words/s, 16980987 lookups/s, 51 bytes/key (0.0MB)
   10000 words (    9042 keys), 2697648 words/s, 13680052 lookups/s, 87 bytes/key (0.8MB)
  100000 words (   83168 keys), 2462269 words/s,  6956074 lookups/s, 37 bytes/key (3.0MB)
  500000 words (  389442 keys), 1802431 words/s,  4733774 lookups/s, 64 bytes/key (24.0MB)
 1000000 words (  755372 keys), 1702130 words/s,  4485229 lookups/s, 66 bytes/key (48.0MB)
 2000000 words ( 1463359 keys), 1616658 words/s,  4251021 lookups/s, 68 bytes/key (96.0MB)
 5000000 words ( 3501140 keys), 1608889 words/s,  3909212 lookups/s, 57 bytes/key (192.0MB)
10000000 words ( 6764089 keys), 1531136 words/s,  3600395 lookups/s, 59 bytes/key (384.0MB)

As you can see, O(1) behaviour is more or less observed up to the 10000000 
words limit (given the way the data is computed, I'm afraid I can't go further 
than that). Of course, very small dicts are faster because everything fits in 
the CPU caches.
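
The measurement loop has roughly this shape (a sketch under assumptions; the 
actual code is in the attached dcbench-py3k.py and may differ):

import time

def bench(words):
    d = {}
    t0 = time.time()
    for w in words:
        d[w] = d.get(w, 0) + 1            # build/update the counter dict
    build_time = time.time() - t0

    t0 = time.time()
    for w in words:
        d[w]                              # pure lookups
    lookup_time = time.time() - t0

    # words/s and lookups/s, as reported in the table above
    return len(words) / build_time, len(words) / lookup_time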

--
Added file: http://bugs.python.org/file18520/dcbench-py3k.py

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9520
___



[issue9520] Add Patricia Trie high performance container

2010-08-14 Thread Dmitry Chichkov

Dmitry Chichkov dchich...@gmail.com added the comment:

Yes, it looks like you are right. And while there is some slight performance 
degradation, at least nothing drastic is happening up to 30M keys. Using your 
modified test:

    1000 words (     961 keys), 3609555 words/s, 19239926 lookups/s, 51 bytes/key (0.0MB)
   10000 words (    9042 keys), 3390980 words/s, 18180771 lookups/s, 87 bytes/key (0.0MB)
  100000 words (   83168 keys), 3298809 words/s, 12509481 lookups/s, 37 bytes/key (3.0MB)
 1000000 words (  755372 keys), 2477793 words/s,  7537963 lookups/s, 66 bytes/key (48.0MB)
 5000000 words ( 3501140 keys), 2291004 words/s,  6487727 lookups/s, 57 bytes/key (192.0MB)
10000000 words ( 6764089 keys), 2238081 words/s,  6216454 lookups/s, 59 bytes/key (384.0MB)
20000000 words (13061821 keys), 2175688 words/s,  5817085 lookups/s, 61 bytes/key (768.0MB)
50000000 words (31188460 keys), 2117751 words/s,  5137209 lookups/s, 51 bytes/key (1536.0MB)

It looks like internal memory estimates (sys.getsizeof()) are also pretty good:

Before: ['VmPeak:\t10199764 kB', 'VmSize:\t 5552416 kB']
50000000 words (31188460 keys), 2117751 words/s, 5137209 lookups/s, 51 bytes/key (1536.0MB)
After: ['VmPeak:\t10199764 kB', 'VmSize:\t 7125284 kB']
 
7 125 284 kB - 5 552 416 kB = 1 536.00391 MB
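
The VmPeak/VmSize probe used throughout this thread can be wrapped in a small 
helper (a sketch; Linux-specific, since it reads /proc/<pid>/status):

import os, re

def vm_status():
    # returns e.g. ['VmPeak:\t... kB', 'VmSize:\t... kB'] for this process
    with open('/proc/%d/status' % os.getpid()) as f:
        return re.findall(r'VmPeak.*|VmSize.*', f.read())

before = vm_status()
d = dict((str(i), i) for i in range(10**6))   # allocate something measurable
after = vm_status()
print(before, after)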

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9520
___



[issue9520] Add Patricia Trie high performance container

2010-08-13 Thread Dmitry Chichkov

Changes by Dmitry Chichkov dchich...@gmail.com:


Added file: http://bugs.python.org/file18515/dc.dict.bench.0.02.py

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9520
___



[issue9520] Add Patricia Trie high performance container

2010-08-08 Thread Terry J. Reedy

Terry J. Reedy tjre...@udel.edu added the comment:

A decade ago, 8 GB of memory was rare; now it is almost standard for high-end 
consumer machines. People are now doing bigger computations and stressing the 
implementation more. So it is plausible that we need to either tweak the core 
class implementations or (non-exclusively) add specialized structures optimized 
for certain tasks.

However, please do not throw the word 'bug' around lightly. When inappropriate, 
it only gets in the way. The language standard intentionally omits performance 
guarantees, especially when running on fragmented finite memory with multiple 
cache levels and possible OS limitations.

--
nosy: +terry.reedy

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9520
___



[issue9520] Add Patricia Trie high performance container

2010-08-08 Thread Dmitry Chichkov

Dmitry Chichkov dchich...@gmail.com added the comment:

Yes. Data containers optimized for very large datasets, compactness and strict 
adherence to O(1) can be beneficial. 

Python has great high-performance containers, but there is a certain lack of 
compact ones. For example, on an x64 machine the following dict() mapping 
10,000,000 very short unicode keys (~7 chars) to integers eats 149 bytes per 
entry:
>>> import os, re
>>> d = dict()
>>> for i in xrange(0, 10000000): d[unicode(i)] = i
... 
>>> print re.findall("(VmPeak.*|VmSize.*)", open('/proc/%d/status' % os.getpid()).read())
['VmPeak:\t 1458324 kB', 'VmSize:\t 1458324 kB']

I can understand that there are all kinds of reasons why it is so, and even why 
it is good. But having an unobtrusive *compact* container would be nice 
(although it would be most welcome if the default containers could be tweaked 
so that they adjust to large datasets appropriately).

Also, I can't emphasize enough that compactness still matters sometimes. Modern 
datasets are getting larger and larger (literally terabytes), and the 'just add 
more memory' strategy is not always feasible.


Regarding the dict() violation of O(1): so far I have been unable to reproduce 
it in the test. I can certainly see it on the real dataset, and trust me, it 
was very annoying to watch the ETA drop from 10 hours to 8 and then gradually 
climb back up to 17 hours and hang there. This was _solved_ by switching from 
dict() to Bio.trie(), so the problem certainly had something to do with dict(). 
I don't know what is causing it, though.
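
One way to make such a slowdown visible is to time insertions in fixed-size 
chunks and watch whether the per-chunk cost grows (a sketch; 'words' stands 
for any large stream of keys):

import time

def timed_inserts(words, chunk=1000000):
    d = {}
    t_prev = time.time()
    for i, w in enumerate(words, 1):
        d[w] = d.get(w, 0) + 1
        if i % chunk == 0:
            t = time.time()
            print('%9d unique keys, %.2fs for the last chunk'
                  % (len(d), t - t_prev))
            t_prev = t
    return d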

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9520
___



[issue9520] Add Patricia Trie high performance container

2010-08-06 Thread Daniel Urban

Changes by Daniel Urban urban.dani...@gmail.com:


--
nosy: +durban

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9520
___



[issue9520] Add Patricia Trie high performance container

2010-08-05 Thread Éric Araujo

Éric Araujo mer...@netwok.org added the comment:

Thanks for the request, Dmitry. You may want to read 
http://www.python.org/dev/peps/pep-0002/ to know which steps are required to 
add a module to the stdlib. In particular, the module has to be proven in the 
field and the author needs to sign a contributor agreement. It’s not clear 
which implementation you want to add (the one from BioPython, I guess); I don’t 
know if this proposal will fly if you’re not the author of the module. (I don’t 
want to scare you away, just to explain the process.)

If you contact the author of the code you want to include and they agree with 
your idea, then it’ll be time to convince python-ideas that this would be a 
generally useful addition and that the chosen class is the best of breed.

A note about versions: I removed 3.1 since new features don’t go in stable 
releases, and 3.3 since it does not exist yet. Everything in 3.2 will go in 3.3 
too: this value is useful only to mean “won’t go in 3.2, remind later”.

I also edited the title to make it shorter, to make emails and reports more 
readable. Thanks again for your report.

--
nosy: +merwok
title: Add Patricia Trie high performance container (python's defaultdict(int) 
is unusable on datasets with 10,000,000+ keys.) -> Add Patricia Trie high 
performance container
type: feature request -> performance

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9520
___



[issue9520] Add Patricia Trie high performance container

2010-08-05 Thread Éric Araujo

Changes by Éric Araujo mer...@netwok.org:


--
type: performance -> feature request

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9520
___



[issue9520] Add Patricia Trie high performance container

2010-08-05 Thread Dmitry Chichkov

Dmitry Chichkov dchich...@gmail.com added the comment:

Thank you for your comment. Perhaps we should try to separate this into two 
issues:
1) Bug: Python's dict() is unusable on datasets with 10,000,000+ keys. Here I 
should provide a solid test case showing a deviation from O(1);

2) Feature request/idea: Add a Patricia Trie high performance container.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9520
___



[issue9520] Add Patricia Trie high performance container

2010-08-05 Thread Mark Dickinson

Mark Dickinson dicki...@gmail.com added the comment:

> 1) Bug. Python's dict() is unusable on datasets with 10,000,000+ keys.
> Here I should provide a solid test case showing a deviation from O(1);

That would be helpful.  Are you sure that the slow-down you're seeing isn't 
simply due to running out of system memory?

--
nosy: +mark.dickinson

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9520
___



[issue9520] Add Patricia Trie high performance container

2010-08-05 Thread Antoine Pitrou

Antoine Pitrou pit...@free.fr added the comment:

> 1) Bug. Python's dict() is unusable on datasets with 10,000,000+ keys. 
> Here I should provide a solid test case showing a deviation from O(1);

Try to disable the cyclic garbage collector before the test (gc.disable()).
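
In other words (a sketch of the suggestion; the cyclic collector periodically 
scans all tracked containers, which can get expensive during bulk inserts):

import gc

gc.disable()                  # suspend the cyclic garbage collector
try:
    d = {}
    for i in range(10000000):
        d[str(i)] = i         # bulk insert without periodic GC passes
finally:
    gc.enable()               # restore the collector afterwards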

--
nosy: +pitrou

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9520
___



[issue9520] Add Patricia Trie high performance container

2010-08-05 Thread Dmitry Chichkov

Dmitry Chichkov dchich...@gmail.com added the comment:

No, I'm not simply running out of system memory (8GB/x64/Linux), and in my test 
cases I've only seen ~25% of memory utilized. And good idea; I'll try to play 
with the cyclic garbage collector.

It is harder than I thought to make a solid synthetic test case for that issue. 
The trouble is that you need to be able to generate data (e.g. 100,000,000 
words, 5,000,000 unique) with a distribution close to the real-life scenario 
(e.g. word lengths, frequencies and uniqueness as in English text). If somebody 
has a good idea of how to do this nicely, it would be very welcome. 

My best shot so far is in the attachment.
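
Since natural-language word frequencies are roughly Zipf-distributed, one 
plausible approach is to sample vocabulary ranks from a Zipf law (a sketch, 
not the attached script; the exponent s=1.2 is a guess):

import bisect, random, string

def make_vocab(n_unique, min_len=1, max_len=21, seed=42):
    # random lowercase 'words' in the 1..21 character range mentioned above
    rng = random.Random(seed)
    vocab = set()
    while len(vocab) < n_unique:
        length = rng.randint(min_len, max_len)
        vocab.add(''.join(rng.choice(string.ascii_lowercase)
                          for _ in range(length)))
    return sorted(vocab)

def zipf_words(vocab, n_words, s=1.2, seed=42):
    # inverse-CDF sampling of vocabulary ranks from a Zipf(s) distribution
    rng = random.Random(seed)
    cum, total = [], 0.0
    for rank in range(1, len(vocab) + 1):
        total += 1.0 / rank ** s
        cum.append(total)
    for _ in range(n_words):
        i = bisect.bisect_left(cum, rng.random() * total)
        yield vocab[min(i, len(vocab) - 1)]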

--
Added file: http://bugs.python.org/file18406/dc.dict.bench.0.01.py

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9520
___



[issue9520] Add Patricia Trie high performance container (python's defaultdict(int) is unusable on datasets with 10,000,000+ keys.)

2010-08-04 Thread Dmitry Chichkov

New submission from Dmitry Chichkov dchich...@gmail.com:

On large data sets (10-100 million keys) the default Python dictionary 
implementation fails to meet memory and performance constraints. It also 
apparently fails to keep O(1) complexity (after just 1M keys). As such, there 
is a need for a good, optimized, practical implementation of a high-performance 
container that can store such datasets.

One of the alternatives is a regular Patricia Trie data structure, which can 
meet the performance requirements on such datasets (a toy sketch of the idea 
follows the criteria below). Criteria:
* strictly O(1);
* works well on large datasets (~100M keys); memory efficient;
* same or better performance as dict();
* supports regular dict | counter interface;
* supports unicode keys;
* supports any characters in the keys;
* persistent (can be pickled); 
* in-memory library;
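
For a sense of the underlying idea, here is a minimal pure-Python sketch of a 
path-compressed (Patricia-style) trie with a dict-like interface. It is purely 
illustrative and meets none of the criteria above; a practical version would 
be written in C with bit-level compression:

_MISSING = object()   # sentinel: no value stored at this node

def _common_prefix_len(a, b):
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i

class _Node(object):
    __slots__ = ('children', 'value')
    def __init__(self):
        self.children = {}    # edge label (substring) -> child node
        self.value = _MISSING

class PatriciaDict(object):
    # Path compression: chains with no branching collapse into one edge.

    def __init__(self):
        self._root = _Node()

    def __setitem__(self, key, value):
        node = self._root
        while key:
            for label in list(node.children):
                cut = _common_prefix_len(key, label)
                if not cut:
                    continue
                child = node.children[label]
                if cut < len(label):
                    # key diverges mid-edge: split the edge in two
                    mid = _Node()
                    mid.children[label[cut:]] = child
                    del node.children[label]
                    node.children[label[:cut]] = mid
                    child = mid
                node, key = child, key[cut:]
                break
            else:
                # no edge shares a prefix: hang the whole remainder here
                child = _Node()
                node.children[key] = child
                node, key = child, ''
        node.value = value

    def __getitem__(self, key):
        node = self._root
        while key:
            for label in node.children:
                if _common_prefix_len(key, label) == len(label):
                    node, key = node.children[label], key[len(label):]
                    break
            else:
                raise KeyError(key)
        if node.value is _MISSING:
            raise KeyError(key)
        return node.value

Usage mirrors a plain dict:

pd = PatriciaDict()
pd['hello'] = 5
pd['help'] = 2
print(pd['hello'], pd['help'])   # 5 2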

There are a few existing implementations available:
* BioPython trie - http://biopython.org/
* py-radix - http://www.mindrot.org/projects/py-radix/
* PyPi trie - http://pypi.python.org/pypi/trie

A few other relevant alternatives/implementations:
* NIST trie - http://www.itl.nist.gov/div897/sqg/dads/HTML/patriciatree.html
* C++ Trie Library - http://wikipedia-clustering.speedblue.org/trie.php
* PyAvl trie - http://pypi.python.org/pypi/pyavl/1.12_1
* PyJudy tree - http://www.dalkescientific.com/Python/PyJudy.html
* Redis - http://code.google.com/p/redis/
* Memcached - http://memcached.org/

An alternative to a basic Patricia Trie could be some state-of-the-art approach 
from modern research (e.g. 
http://www.aclweb.org/anthology/W/W09/W09-1505.pdf ).


The best existing implementation I've been able to find so far is the one in 
BioPython. Compared to defaultdict(int) on the task of counting words, on a 
dataset of 123,981,712 words (6,504,484 unique), 1..21 characters long:
* Bio.trie - 459 MB/0.13 hours, good O(1) behavior
* defaultdict(int) - 693 MB/0.32 hours, poor, almost O(N) behavior

At 8,000,000 keys Python's defaultdict(int) starts showing almost O(N) behavior 
and becomes unusable with 10,000,000+ unique keys.
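
The comparison boils down to a harness of roughly this shape (a sketch; it 
assumes only the Bio.trie operations demonstrated in the usage notes below, 
and that looking up a missing key raises KeyError):

from collections import defaultdict
from Bio import trie

def count_with_dict(words):
    counts = defaultdict(int)
    for w in words:
        counts[w] += 1
    return counts

def count_with_trie(words):
    counts = trie.trie()
    for w in words:
        try:
            counts[w] += 1    # existing key: bump the counter
        except KeyError:
            counts[w] = 1     # first occurrence
    return counts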



A few usage/installation notes on the BioPython trie:
$ sudo apt-get install python-biopython

>>> from Bio import trie
>>> trieobj = trie.trie()
>>> trieobj["hello"] = 5
>>> trieobj["hello"] += 1
>>> print trieobj["hello"]
>>> print trieobj.keys()

More examples at:
http://python-biopython.sourcearchive.com/documentation/1.54/test__triefind_8py-source.html

--
components: Library (Lib)
messages: 112937
nosy: dmtr
priority: normal
severity: normal
status: open
title: Add Patricia Trie high performance container (python's defaultdict(int) 
is unusable on datasets with 10,000,000+ keys.)
type: performance
versions: Python 3.1, Python 3.2, Python 3.3

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9520
___



[issue9520] Add Patricia Trie high performance container (python's defaultdict(int) is unusable on datasets with 10,000,000+ keys.)

2010-08-04 Thread Ezio Melotti

Ezio Melotti ezio.melo...@gmail.com added the comment:

I think it would be better to propose this on the python-ideas mailing list.

--
nosy: +ezio.melotti, rhettinger
type: performance -> feature request
versions:  -Python 3.1, Python 3.3

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue9520
___