Re: [Python-Dev] A new dict for Xmas?

2011-12-23 Thread Martin v. Löwis
Am 22.12.2011 19:15, schrieb Maciej Fijalkowski:
>> - I wonder whether the shared keys could be computed at compile
>>  time, considering all attribute names that get assigned for
>>  self. The compiler could list those in the code object, and
>>  class creation could iterate over all methods (taking base
>>  classes into account).
> 
> This is hard, because sometimes you don't quite know what the self
> *is* even, especially if __init__ calls some methods or there is any
> sort of control flow. You can, however, track what gets assigned at
> runtime and have shapes associated with objects.

Actually, it's fairly easy, as it only needs to be a heuristic.
I am proposing exactly the heuristic specified above ("attribute
names that get assigned for self").

I don't think that __init__ calling methods is much of an issue here,
since those methods still assign attributes to self.
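For illustration, such a scan could also be done at class-creation time by walking each method's bytecode for attribute stores on the first parameter. This is a toy heuristic of my own (the helper name and the LOAD_FAST/STORE_ATTR pairing are assumptions), not code from any proposed patch:

```python
import dis

def guess_instance_attrs(cls):
    """Heuristically collect attribute names that methods of cls
    (and its bases) assign on their first parameter, usually 'self'."""
    names = set()
    for klass in cls.__mro__:
        for obj in vars(klass).values():
            code = getattr(obj, "__code__", None)
            if code is None or not code.co_varnames:
                continue
            self_name = code.co_varnames[0]
            pending = False  # did the previous instruction load self?
            for ins in dis.get_instructions(code):
                # LOAD_FAST variants may carry a tuple of names (3.13+)
                loaded = ins.argval if isinstance(ins.argval, tuple) else (ins.argval,)
                if ins.opname.startswith("LOAD_FAST") and self_name in loaded:
                    pending = True
                elif ins.opname == "STORE_ATTR" and pending:
                    names.add(ins.argval)  # a 'self.<name> = ...' assignment
                    pending = False
                else:
                    pending = False
    return names

class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def move(self, dx):
        self.x += dx

print(sorted(guess_instance_attrs(Point)))  # typically ['x', 'y']
```

Being a heuristic, this misses attributes set via setattr() or through other variables bound to the instance, which is exactly why it can only ever be a guess.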

Regards,
Martin
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] A new dict for Xmas?

2011-12-23 Thread Mark Shannon

Martin v. Löwis wrote:

The current dict implementation is getting pretty old,
isn't it time we had a new one (for xmas)?


I like the approach, and I think something should be done indeed.
If you don't contribute your approach, I'd like to drop at least
ma_smalltable for 3.3.

A number of things about your branch came to my mind:
- it would be useful to have a specialized representation for
  all-keys-are-strings. In that case, me_hash could be dropped
  from the representation. You would get savings compared to
  the status quo even in the non-shared case.
It might be tricky switching key tables, and I don't think it would save
much memory, as keys that are widely shared take up very little memory
anyway, and not many other dicts are long-lived.

(It might improve performance for dicts used for keyword arguments)


- why does _dictkeys need to be a full-blown Python object?
  We need refcounting and the size, but not the type slot.
It doesn't. It's just a hangover from my original HotPy implementation 
where all objects needed a type for the GC.

So yes, the type slot could be removed.


- I wonder whether the shared keys could be computed at compile
  time, considering all attribute names that get assigned for
  self. The compiler could list those in the code object, and
  class creation could iterate over all methods (taking base
  classes into account).



It probably wouldn't be that hard to make a guess at compile time as to 
what the shared keys would be, but it doesn't really matter.
The generation of intermediate shared keys will only happen once per 
class, so the overhead would be negligible.


To cut down on that overhead, we could use a ref-count trick: if the 
instance being updated and its class hold the only two refs to an 
immutable keys(-set -table -vector?) then just treat it as mutable.


I'll modify the repo to incorporate these changes when I have a chance.

Cheers,
Mark.


Re: [Python-Dev] A new dict for Xmas?

2011-12-23 Thread Martin v. Löwis
>> - it would be useful to have a specialized representation for
>>   all-keys-are-strings. In that case, me_hash could be dropped
>>   from the representation. You would get savings compared to
>>   the status quo even in the non-shared case.
> It might be tricky switching key tables, and I don't think it would save much
> memory, as keys that are widely shared take up very little memory anyway,
> and not many other dicts are long-lived.

Why do you say that? In a plain 3.3 interpreter, I counted 595 dict
objects (see script below). Of these, 563 (so nearly all of them) had
only strings as keys. Among those, I found 286 different key sets,
where 231 key sets occurred only once (i.e. wouldn't be shared).

Together, the string dictionaries had 13282 keys, and you could save
as many pointers (actually more, because there will be more key slots
than keys).
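For concreteness, the saving Martin describes works out roughly as follows (assuming 8-byte machine words; slot over-allocation would push the real figure somewhat higher):

```python
slots = 13282          # string-keyed slots counted above
word = 8               # size of the cached me_hash field, in bytes
saved = slots * word
print(saved, "bytes, ~%.0f KiB" % (saved / 1024))  # 106256 bytes, ~104 KiB
```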

I'm not sure why you think the string dicts with unshared keys would be
short-lived. But even if they were, what matters is the steady-state
number of dictionaries - if for every short-lived dictionary that
gets released another one is created, any memory savings from reducing
the dict size would still materialize.

>> - I wonder whether the shared keys could be computed at compile
>>   time, considering all attribute names that get assigned for
>>   self. The compiler could list those in the code object, and
>>   class creation could iterate over all methods (taking base
>>   classes into account).
>>
> 
> It probably wouldn't be that hard to make a guess at compile time as to
> what the shared keys would be, but it doesn't really matter.
> The generation of intermediate shared keys will only happen once per
> class, so the overhead would be negligible.

I'm not so much concerned about overhead, but about correctness/
effectiveness of the heuristics. For a class with dynamic attributes,
you may well come up with a very large key set. With source analysis,
you wouldn't attempt to grow the keyset beyond what likely is being
shared.

Regards,
Martin

import sys
d = sys.getobjects(0, dict)
print(len(d), "dicts")
d2 = []
for o in d:
    keys = o.keys()
    if not keys:
        continue
    types = tuple(set(type(k) for k in keys))
    if types != (str,):
        continue
    d2.append(tuple(sorted(keys)))
print(len(d2), "str dicts")
freq = {}
for keys in d2:
    freq[keys] = freq.get(keys, 0) + 1
print(len(freq), "different key sets")
freq = sorted(freq.items(), key=lambda t: t[1])
print(len([o for o in freq if o[1] == 1]), "unsharable")
print(sum(len(o[0]) for o in freq), "keys")
print(freq[-10:])


Re: [Python-Dev] A new dict for Xmas?

2011-12-23 Thread Mark Shannon

Martin v. Löwis wrote:

- it would be useful to have a specialized representation for
  all-keys-are-strings. In that case, me_hash could be dropped
  from the representation. You would get savings compared to
  the status quo even in the non-shared case.

It might be tricky switching key tables, and I don't think it would save much
memory, as keys that are widely shared take up very little memory anyway,
and not many other dicts are long-lived.


Why do you say that? In a plain 3.3 interpreter, I counted 595 dict
objects (see script below). Of these, 563 (so nearly all of them) had
only strings as keys. Among those, I found 286 different key sets,
where 231 key sets occurred only once (i.e. wouldn't be shared).

Together, the string dictionaries had 13282 keys, and you could save
as many pointers (actually more, because there will be more key slots
than keys).


The question is how much memory needs to be saved to be worth adding the 
complexity: 10 KB, no; 100 MB, yes.

So data from "real" benchmarks would be useful.

Also, I'm assuming that it would be tricky to implement correctly due to 
implicit assumptions in the rest of the code.

If I'm wrong and it's easy to implement then please do.



I'm not sure why you think the string dicts with unshared keys would be
short-lived. 

Not all, but most. Most dicts with unshared keys would most likely be
for keyword parameters. Explicit dicts tend to be few in number.
(When I say few I mean up to 1k, not 100k or 1M).

Module dicts are very likely to have unshared keys; they number in the 
10s or 100s, but they do tend to be large.



But even if they were, what matters is the steady-state
number of dictionaries - if for every short-lived dictionary that
gets released another one is created, any memory savings from reducing
the dict size would still materialize. 

But only a few KB?




- I wonder whether the shared keys could be computed at compile
  time, considering all attribute names that get assigned for
  self. The compiler could list those in the code object, and
  class creation could iterate over all methods (taking base
  classes into account).


It probably wouldn't be that hard to make a guess at compile time as to
what the shared keys would be, but it doesn't really matter.
The generation of intermediate shared keys will only happen once per
class, so the overhead would be negligible.


I'm not so much concerned about overhead, but about correctness/
effectiveness of the heuristics. For a class with dynamic attributes,
you may well come up with a very large key set. With source analysis,
you wouldn't attempt to grow the keyset beyond what likely is being
shared.

I agree some sort of heuristic is required to limit excessive growth
and prevent pathological behaviour.
The current implementation just has a cut-off at a certain size;
it could definitely be improved.
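A toy model of such a cut-off (the constant and helper are invented for illustration; the branch's actual policy may differ):

```python
SHARED_KEYS_MAX = 16  # illustrative cut-off, not the branch's real value

def insert_key(shared_keys, name):
    """Grow the class-wide shared key list until the cut-off is hit,
    then fall back to a private, unshared copy for this instance."""
    if name in shared_keys:
        return shared_keys, True            # key already shared
    if len(shared_keys) >= SHARED_KEYS_MAX:
        return shared_keys + [name], False  # cap reached: private copy
    shared_keys.append(name)                # under the cap: keep sharing
    return shared_keys, True

keys = ["a", "b"]
keys, shared = insert_key(keys, "c")
print(keys, shared)  # ['a', 'b', 'c'] True
```

The cap is what prevents a class with highly dynamic attributes from dragging every instance along with an ever-growing shared key set.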

As I said, I'll update the code soon and then, well, what's the phrase...
Oh yes, "patches welcome" ;)

Thanks for the feedback.

Cheers,
Mark.





Re: [Python-Dev] A new dict for Xmas?

2011-12-23 Thread Stefan Behnel

Mark Shannon, 23.12.2011 12:21:

Martin v. Löwis wrote:

- it would be useful to have a specialized representation for
all-keys-are-strings. In that case, me_hash could be dropped
from the representation. You would get savings compared to
the status quo even in the non-shared case.

It might be tricky switching key tables, and I don't think it would save much
memory, as keys that are widely shared take up very little memory anyway,
and not many other dicts are long-lived.


Why do you say that? In a plain 3.3 interpreter, I counted 595 dict
objects (see script below). Of these, 563 (so nearly all of them) had
only strings as keys. Among those, I found 286 different key sets,
where 231 key sets occurred only once (i.e. wouldn't be shared).

Together, the string dictionaries had 13282 keys, and you could save
as many pointers (actually more, because there will be more key slots
than keys).


The question is how much memory needs to be saved to be worth adding the
complexity: 10 KB, no; 100 MB, yes.
So data from "real" benchmarks would be useful.


Consider taking a parsed MiniDOM tree as a benchmark. It contains so many 
instances of just a couple of different classes that it just has to make a 
huge difference if each of those instances is even just a bit smaller. It 
should also make a clear difference for plain Python ElementTree.


I attached a benchmark script that measures the parsing speed as well as 
the total memory usage of the in-memory tree. You can get data files from 
the following places, just download them and pass their file names on the 
command line:


http://gnosis.cx/download/hamlet.xml

http://www.ibiblio.org/xml/examples/religion/ot/ot.xml

Here are some results from my own machine for comparison:

http://blog.behnel.de/index.php?p=197

Stefan
# $Id: benchmark.py 3248 2007-09-02 15:01:26Z fredrik $
# simple elementtree benchmark program

from xml.etree import ElementTree
try:
    from xml.etree import cElementTree
except ImportError:
    cElementTree = None
try:
    from lxml import etree
except ImportError:
    etree = None
try:
    from elementtree import XMLTreeBuilder # xmllib
except ImportError:
    XMLTreeBuilder = None
try:
    from elementtree import SimpleXMLTreeBuilder # xmllib
except ImportError:
    SimpleXMLTreeBuilder = None
try:
    from elementtree import SgmlopXMLTreeBuilder # sgmlop
except ImportError:
    SgmlopXMLTreeBuilder = None
try:
    from xml.dom import minidom # pyexpat+minidom
except ImportError:
    minidom = None

try:
    import resource
except ImportError:
    resource = None

import os, sys
import traceback
from time import time

FORK = True

def fork(func):
    if not hasattr(os, 'fork'):
        return func
    def wrap(*args, **kwargs):
        if not FORK:
            return func(*args, **kwargs)
        cid = os.fork()
        if cid:
            os.waitpid(cid, 0)
        else:
            try:
                func(*args, **kwargs)
            except Exception:
                traceback.print_exc()
            finally:
                os._exit(0)
    return wrap

def measure_mem(old=0):
    if resource is None:
        return
    used = resource.getrusage(resource.RUSAGE_SELF)
    print('Memory usage: %s%s' % (
        used.ru_maxrss,
        (' (+%s)' % (used.ru_maxrss - old)) if old > 0 else ''))
    return used.ru_maxrss

@fork
def benchmark(file, builder_module):
    oldmem = measure_mem()
    with open(file, "rb") as source:
        t = time()
        try:
            builder = builder_module.XMLParser
        except AttributeError:
            builder = builder_module.TreeBuilder
        parser = builder()
        while 1:
            data = source.read(32768)
            if not data:
                break
            parser.feed(data)
        tree = parser.close()
        t = time() - t
    print("%s.%s.feed(): %d nodes read in %.3f seconds" % (
        builder_module.__name__, builder.__name__,
        len(list(tree.getiterator())), t
    ))
    measure_mem(oldmem)
    del tree

@fork
def benchmark_parse(file, driver):
    oldmem = measure_mem()
    t = time()
    tree = driver.parse(file)
    t = time() - t
    print(driver.__name__ + ".parse done in %.3f seconds" % t)
    measure_mem(oldmem)
    del tree

@fork
def benchmark_minidom(file):
    oldmem = measure_mem()
    t = time()
    dom = minidom.parse(file)
    t = time() - t
    print("minidom tree read in %.3f seconds" % t)
    measure_mem(oldmem)
    del dom

class configure_parser(object):
    def __init__(self, etree, name, **config):
        self.__name__ = name
        self.etree = etree
        self.parser = etree.XMLParser(**config)
    def parse(self, input):
        return self.etree.parse(input, self.parser)

def run_benchmark(file):
    benchmark_parse(file, ElementTree)
    if cElementTree is not None:
        benchmark_parse(file, cElementTree)
        benchmark(file, cElementTree)
    if etree is not None:
        benchmark_parse(file, etree)
        benchmark_parse(file, configure_parser(

Re: [Python-Dev] A new dict for Xmas?

2011-12-23 Thread Martin v. Löwis
> If I'm wrong and it's easy to implement then please do.

Ok, so I take it that you are not interested in the idea. No problem.

Regards,
Martin


Re: [Python-Dev] A new dict for Xmas?

2011-12-23 Thread Martin v. Löwis
> Consider taking a parsed MiniDOM tree as a benchmark. It contains so
> many instances of just a couple of different classes that it just has to
> make a huge difference if each of those instances is even just a bit
> smaller. It should also make a clear difference for plain Python
> ElementTree.

Of course, for minidom, Mark's current implementation should already
save quite a lot of memory, since all elements and text nodes have the
same attributes.

Still, it would be good to see how Mark's implementation deals with
that.

Regards,
Martin


Re: [Python-Dev] A new dict for Xmas?

2011-12-23 Thread Mark Shannon

Martin v. Löwis wrote:

If I'm wrong and it's easy to implement then please do.


Ok, so I take it that you are not interested in the idea. No problem.


It's just that I don't think it would yield results commensurate with the
effort.
Also, I think it's worth keeping the initial version as simple as
reasonably possible. Refinements can be added later.

Cheers,
Mark.


Regards,
Martin





[Python-Dev] Summary of Python tracker Issues

2011-12-23 Thread Python tracker

ACTIVITY SUMMARY (2011-12-16 - 2011-12-23)
Python tracker at http://bugs.python.org/

To view or respond to any of the issues listed below, click on the issue.
Do NOT respond to this message.

Issues counts and deltas:
  open    3168 ( -7)
  closed 22272 (+52)
  total  25440 (+45)

Open issues with patches: 1358 


Issues opened (27)
==================

#13614: `setup.py register` fails if long_description contains RST
http://bugs.python.org/issue13614  opened by techtonik

#13615: `setup.py register` fails with -r argument
http://bugs.python.org/issue13615  opened by techtonik

#13617: Reject embedded null characters in wchar* strings
http://bugs.python.org/issue13617  opened by haypo

#13619: Add a new codec: "locale", the current locale encoding
http://bugs.python.org/issue13619  opened by haypo

#13621: Unicode performance regression in python3.3 vs python3.2
http://bugs.python.org/issue13621  opened by Boris.FELD

#13629: _PyParser_TokenNames does not match up with the token.h number
http://bugs.python.org/issue13629  opened by meador.inge

#13630: IDLE: Find(ed) text is not highlighted while dialog box is ope
http://bugs.python.org/issue13630  opened by marco

#13631: readline fails to parse some forms of .editrc under editline (
http://bugs.python.org/issue13631  opened by zvezdan

#13632: Update token documentation to reflect actual token types
http://bugs.python.org/issue13632  opened by meador.inge

#13633: Handling of hex character references in HTMLParser.handle_char
http://bugs.python.org/issue13633  opened by ezio.melotti

#13636: Python SSL Stack doesn't have a Secure Default set of ciphers
http://bugs.python.org/issue13636  opened by naif

#13638: PyErr_SetFromErrnoWithFilenameObject is undocumented
http://bugs.python.org/issue13638  opened by pitrou

#13639: UnicodeDecodeError when creating tar.gz with unicode name
http://bugs.python.org/issue13639  opened by jason.coombs

#13640: add mimetype for application/vnd.apple.mpegurl
http://bugs.python.org/issue13640  opened by Hiroaki.Kawai

#13641: decoding functions in the base64 module could accept unicode s
http://bugs.python.org/issue13641  opened by pitrou

#13642: urllib incorrectly quotes username and password in https basic
http://bugs.python.org/issue13642  opened by joneskoo

#13643: 'ascii' is a bad filesystem default encoding
http://bugs.python.org/issue13643  opened by gz

#13644: Python 3 crashes (segfaults) with this code.
http://bugs.python.org/issue13644  opened by maniram.maniram

#13645: test_import fails after test_coding
http://bugs.python.org/issue13645  opened by pitrou

#13646: Document poor interaction between multiprocessing and -m on Wi
http://bugs.python.org/issue13646  opened by ncoghlan

#13647: Python SSL stack doesn't securely validate certificate (as cli
http://bugs.python.org/issue13647  opened by naif

#13649: termios.ICANON is not documented
http://bugs.python.org/issue13649  opened by techtonik

#13651: Improve redirection in urllib
http://bugs.python.org/issue13651  opened by tom.kel

#13653: reorder set.intersection parameters for better performance
http://bugs.python.org/issue13653  opened by dalke

#13655: Python SSL stack doesn't have a default CA Store
http://bugs.python.org/issue13655  opened by naif

#13657: IDLE doesn't support sys.ps1 and sys.ps2.
http://bugs.python.org/issue13657  opened by maniram.maniram

#13658: Extra clause in class grammar documentation
http://bugs.python.org/issue13658  opened by Joshua.Landau



Most recent 15 issues with no replies (15)
==========================================

#13658: Extra clause in class grammar documentation
http://bugs.python.org/issue13658

#13657: IDLE doesn't support sys.ps1 and sys.ps2.
http://bugs.python.org/issue13657

#13649: termios.ICANON is not documented
http://bugs.python.org/issue13649

#13642: urllib incorrectly quotes username and password in https basic
http://bugs.python.org/issue13642

#13641: decoding functions in the base64 module could accept unicode s
http://bugs.python.org/issue13641

#13640: add mimetype for application/vnd.apple.mpegurl
http://bugs.python.org/issue13640

#13638: PyErr_SetFromErrnoWithFilenameObject is undocumented
http://bugs.python.org/issue13638

#13633: Handling of hex character references in HTMLParser.handle_char
http://bugs.python.org/issue13633

#13632: Update token documentation to reflect actual token types
http://bugs.python.org/issue13632

#13631: readline fails to parse some forms of .editrc under editline (
http://bugs.python.org/issue13631

#13608: remove born-deprecated PyUnicode_AsUnicodeAndSize
http://bugs.python.org/issue13608

#13605: document argparse's nargs=REMAINDER
http://bugs.python.org/issue13605

#13594: Aifc markers write fix
http://bugs.python.org/issue13594

#13590: Prebuilt python-2.7.2 binaries for macosx can not compile c ex
http://bugs.python.org/issue13590

#13574: refresh example in doc for Extending and Embedding
http://bugs.python.org/issue13574



Most recent 15 issues