[issue23103] ipaddress should be Flyweight

2014-12-26 Thread Antoine Pitrou

Antoine Pitrou added the comment:

So, there are several aspects to this request:
- add a __slots__: this is Serhiy's patch
- add some weak interning of IP addresses: I believe it is ok to add it as an 
optimization, provided it doesn't otherwise change the API or the classes' 
behaviour

Seth: feel free to write a draft patch of interning, in order to get actual 
performance/memory consumption numbers. Also, perhaps open a specific issue for 
it, so we don't have many patches on the same issue?

--
nosy: +pitrou

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue23103
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23103] ipaddress should be Flyweight

2014-12-24 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Here is a patch which makes IPv4Address and IPv6Address more lightweight (for 
other classes the size is not so important). It makes _version the class 
attribute and adds __slots__.

--
keywords: +patch
resolution: wont fix - 
stage:  - patch review
status: closed - open
versions: +Python 3.5 -Python 3.4
Added file: http://bugs.python.org/file37539/ipaddress_lightweight.patch

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue23103
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23103] ipaddress should be Flyweight

2014-12-24 Thread Eric Snow

Eric Snow added the comment:

@sbromberger, there's no need for your own package.  Just use something like 
this:


_addr_cache = {}

def ipaddr(addr):
try:
return _addr_cache[addr]
except KeyError:
_addr_cache[addr] = ipaddress.ipaddress(addr)
return _addr_cache[addr]


You could even throw weakrefs in there if your use case demanded it.

--
nosy: +eric.snow

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue23103
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23103] ipaddress should be Flyweight

2014-12-24 Thread Josh Rosenberg

Josh Rosenberg added the comment:

Serhiy: I believe you need to add a bunch of __slots__ = () to various base 
classes in the module, even though they lack member variables. Every link in 
the inheritance chain must declare __slots__, or the child class will have 
__dict__ and __weakref__.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue23103
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23103] ipaddress should be Flyweight

2014-12-23 Thread Seth Bromberger

New submission from Seth Bromberger:

ipaddress.ip_address instances should be flyweight. There's really no reason to 
make them mutable:

 a = ipaddress.ip_address(10.1.2.3)
 b = ipaddress.ip_address(10.1.2.3)
 id(a)
140066533772368
 id(b)
140066504622544

Especially with IPv6 and large numbers of addresses, it would be helpful not to 
have separate instances with the same underlying properties.

--
components: Library (Lib)
messages: 233038
nosy: Seth.Bromberger
priority: normal
severity: normal
status: open
title: ipaddress should be Flyweight
type: enhancement
versions: Python 3.6

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue23103
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23103] ipaddress should be Flyweight

2014-12-23 Thread Seth Bromberger

Seth Bromberger added the comment:

Confirmed on 3.4.0; likely exists in previous versions as well.

--
versions: +Python 3.4 -Python 3.6

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue23103
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23103] ipaddress should be Flyweight

2014-12-23 Thread Berker Peksag

Changes by Berker Peksag berker.pek...@gmail.com:


--
nosy: +ncoghlan, pmoody

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue23103
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23103] ipaddress should be Flyweight

2014-12-23 Thread Josh Rosenberg

Josh Rosenberg added the comment:

What is your proposal? WeakValueDictionary mapping raw bytes object to single 
instance of ipaddress that is queried from ipaddress's __new__? No built-in has 
quite that extensive a level of caching and aggressive deduplication to my 
knowledge.

--
nosy: +josh.r

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue23103
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23103] ipaddress should be Flyweight

2014-12-23 Thread Seth Bromberger

Seth Bromberger added the comment:

What is your proposal? WeakValueDictionary mapping raw bytes object to single 
instance of ipaddress that is queried from ipaddress's __new__? No built-in 
has quite that extensive a level of caching and aggressive deduplication to my 
knowledge.

I was thinking along those lines, but since v4 addresses are just UInt32 and v6 
are UInt128, somehow mapping just the address information if we can do that 
via weakref dicts. Right now these objects can't really be used to represent 
addresses parsed from logs with tens of millions of entries generated by a 
couple thousand machines. It would be nice to have a memory-efficient way of 
doing that without having to duplicate the concept in a third-party package.

There are a few other changes I'd like to see/make as well - this package is 
too focused on external functions for manipulating/creating addresses, when 
standard object/method structure would serve it better and be more intuitive. 
There is also no discernible need to modify an IP address' value. They should 
be thought of as immutables to the extent we can enforce that policy.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue23103
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23103] ipaddress should be Flyweight

2014-12-23 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

IP address instances already are immutable and flyweight. There are no mutating 
methods. And a number of address attributes (packed, is_loopback, etc) are 
calculated on fly from integer representation.

But IP address objects can be lighter.

1) Use __slots__.
2) Every instance has the _version attribute. Why this is not class attribute?
3) Memory consumption can be even less if IP addresses would int subclasses. 
But this changes the API (in particular adds the __index__ method) and I doubt 
that we should do this.

--
nosy: +serhiy.storchaka

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue23103
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23103] ipaddress should be Flyweight

2014-12-23 Thread R. David Murray

Changes by R. David Murray rdmur...@bitdance.com:


--
nosy: +r.david.murray

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue23103
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23103] ipaddress should be Flyweight

2014-12-23 Thread Seth Bromberger

Seth Bromberger added the comment:

1) However:

 b = ipaddress.IPv4Address('1.2.3.4')
 a = ipaddress.IPv4Address('1.2.3.4')
 id(a)
4428380928
 id(b)
4428381768
 a == b
True
 b._ip += 6
 id(b)
4428381768
 b
IPv4Address('1.2.3.10')


2) Isn’t _version already a class attribute? It’s set in _Basev4/Basev6.

3) Memory consumption seems particularly wasteful in the current 
implementation: it takes 56 bytes (via sys.getsizeof) to store what amounts to 
4 bytes of data.

I thought about subclassing Int as well, and I agree it’s the wrong approach. 
There are too many int methods that don’t make sense in the context of IP 
addresses.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue23103
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23103] ipaddress should be Flyweight

2014-12-23 Thread R. David Murray

R. David Murray added the comment:

If ipaddress objects are already immutable, it seems to me that having a 
registry that makes it so there is only one instance per ip value isn't a crazy 
idea.  I've written such code in other contexts and it's not complicated.  We 
have caching of identifier strings and common integer constants for a 
reason...this seems a logical extension of that policy in the realm of a 
specific library data type.

The opposite argument is that it may be better left up to the application that 
has to handle lots of ips to do the caching according to what it knows to be an 
optimum pattern.  So consider me +0 on the idea.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue23103
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23103] ipaddress should be Flyweight

2014-12-23 Thread Seth Bromberger

Seth Bromberger added the comment:

The opposite argument is that it may be better left up to the application that 
has to handle lots of ips to do the caching according to what it knows to be 
an optimum pattern.

I'd agree with you wholeheartedly if ipaddress wasn't part of stdlib, but the 
issue is that people generally gravitate to using stdlib over other packages 
when given the choice, and having something that behaves reasonably* in stdlib 
makes sense wherever possible.

*in this case, reasonably means, I think, I shouldn't have to worry that 
there's 1300% overhead for using IPv4Address() over a native 32-bit 
representation of an IP address. This matters little when you're just creating 
a few of these, but best practices regarding reuse might persuade folks to use 
these classes for all sorts of things. Doing things sanely at scale by default 
doesn't necessarily preclude further optimization if some developer thinks it's 
warranted.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue23103
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23103] ipaddress should be Flyweight

2014-12-23 Thread Raymond Hettinger

Raymond Hettinger added the comment:

Adding __slots__ is reasonable.

Adding caching to fold together repeated addresses is dubious (it comes with 
its own space and time cost but won't benefit all users) and, as David said, 
the decision on how to handle this is up to the user.  Also, it is rather late 
in the game to be challenging the design of the module.  IIRC, this code or 
some variant of it already had heavy use prior to landing in the standard 
library (it also took into account what was learned with various other ipaddr 
modules).  The module docs say that the library was designed for speed and 
makes no mention of optimizing for space.

For the most part, once an API is deployed, users can be expected to exercise 
every feature whether intended or not.  Right now, you can add attributes to 
two otherwise indistinct ipaddresses.  The OP proposes to break that code.   
Right now, if a user has concerns about memory use due to duplicate address 
instances, they can easily created their own interning scheme (a = 
intern_dict.setdefault(a, a)).  If the OP get his wish, then those users will 
be worse off (due to double interning).

In other words, this ship has already sailed.

--
nosy: +rhettinger

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue23103
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23103] ipaddress should be Flyweight

2014-12-23 Thread Josh Rosenberg

Josh Rosenberg added the comment:

So, just to be clear, checking the implementation (as of 3.4):

1. All the ipaddress classes leave __slots__ unset. So the overhead is actually 
higher than 56 bytes per instance; the __dict__ for an IPv4Address (ignoring 
the actual keys and values in the dict, just looking at the size of the dict 
itself), on Python 3.4, 64 bit Windows, is 56 + 288 = 344 bytes. Based on the 
size of the dict, I'm guessing some instance attributes aren't being set 
eagerly in __init__, so it's not getting the reduced memory usage from shared 
key dictionaries either.
2. It looks like as of 3.4, _version and _max_prefixlen were both instance 
attributes; the current code base seems to have made _max_prefixlen a class 
attribute, but left _version as an instance attribute (anything set on self is 
an instance attribute; doesn't matter if it's set in __init__ of a base class)
3. Even if we switch to using __slots__, remove support for weak references and 
user added attributes, and store nothing but the raw address encoded as an int, 
you're never going to get true flyweight objects in CPython. The lowest 
instance cost you can get for a class defined in Python (on a system with 64 
bit pointers) inheriting from object, is 40 bytes, plus 8 bytes per slot (plus 
the actual cost of the int object, which is another 16 bytes for an IPv4 
address on a 64 bit OS). If you want it lighter-weight than that, either you 
need to inherit from another built-in that it even lighter-weight, or you need 
to implement it in C. If you built IPv4Address on top of int for instance, you 
could get the instance cost down to 16 bytes. Downside to inheriting from int 
is that you'll interact with many other objects as if you were an int, which 
can cause a lot of unexpected behaviors (as others have noted).

Basically, there is no concept of flyweight object in Python. Aside from 
implementing it in C or inheriting from int, you're going to be using an 
absolute minimum (on 64 bit build) of 48 bytes for the basic object structure, 
plus another 16 for the int itself, 64 bytes total, or about 16x the real 
data being stored. Using a WeakValueDictionary would actually require another 8 
bytes per instance (a slot for __weakref__), and the overhead of the dictionary 
would likely be higher than the memory saved (unless you're regularly creating 
duplicate IP addresses and storing them for the long haul, but I suspect this 
is a less common use case than processing many different IP addresses once or 
twice).

I do think it would be a good idea to use __slots__ properly to get the memory 
down below 100 bytes per instance, but adding automatic IP address interning 
is probably not worth the effort. In the longer term, a C implementation of 
ipaddress might be worth it, not for improvements in computational performance, 
but to get the memory usage down for applications that need to make millions of 
instances.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue23103
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23103] ipaddress should be Flyweight

2014-12-23 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

Josh, you might want to have a look at the lightning talk on this page and the 
associated slides...

http://www.egenix.com/library/presentations/PyCon-UK-2014-When-performance-matters/

After having done the tests, using __slots__ is what I consider the flyweight 
pattern in Python.

--
nosy: +lemburg

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue23103
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23103] ipaddress should be Flyweight

2014-12-23 Thread Seth Bromberger

Seth Bromberger added the comment:

As a test, I tried the following (taken mostly from 
http://codesnipers.com/?q=python-flyweights):


class Foo(object):
_Foo = weakref.WeakValueDictionary()
def __new__(cls, addr):
obj = Foo._Foo.get(addr, None)
if obj is None:
obj = object.__new__(cls)
Foo._Foo[addr] = obj
obj.addr = addr
return obj

I created 10 million instances of Foo(34) in an array. Total space taken: ~80 
MB. Times: CPU times: user 6.93 s, sys: 48.7 ms, total: 6.98 s
Wall time: 6.98 s


I then created 10 million instances of a non-flyweight object, assigning an int 
to an instance variable:

class Bar(object):
pass

Total space taken: ~1.4 GB. Times:
CPU times: user 7.64 s, sys: 794 ms, total: 8.44 s
Wall time: 8.44 s

This corresponds (roughly) to the space taken by 10 million IPAddr objects.

So it appears, informally, that caching / flyweight results in modest time and 
significant memory savings.

I understand that the ship has sailed for a stdlib implementation, but these 
results are compelling enough for me to create a separate package.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue23103
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23103] ipaddress should be Flyweight

2014-12-23 Thread Seth Bromberger

Seth Bromberger added the comment:

Sorry for the additional followup, but I re-ran the test with approximately 
real-world parameters. In this second test, I created 10 million Foo() 
instances with random values from 1-2000 (using random.randint). This 
corresponds to 2000 hosts generating 10 million logs.

Memory use is ~80 MB. Times: CPU times: user 27.5 s, sys: 101 ms, total: 27.6 s 
Wall time: 27.6 s

For the nonoptimized run (10m instances of 2000 random values):
Memory use is ~1.8GB. Times: CPU times: user 28.2 s, sys: 1.29 s, total: 29.5 s 
Wall time: 29.5 s

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue23103
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23103] ipaddress should be Flyweight

2014-12-23 Thread Josh Rosenberg

Josh Rosenberg added the comment:

Marc-Andre: Oh, I know. I proselytize to coworkers on the virtues of using 
__slots__ for classes that will have many instances created (particularly since 
entirely too much of our stuff is still Py2.7, so we don't have the free 
savings from shared key dictionaries). Heck, I particularly enjoy inheriting 
from an inlined collections.namedtuple and an empty __slots__(which does add 8 
bytes to the size over declaring the __slots__ directly, but gets you as close 
to truly immutable instances as you can get when you need them, not just we're 
all adults here immutable instances).

I'm just pointing out that if he thinks 56 bytes per object is too large, he's 
going to be disappointed with what Python has to offer. General purpose 
user-defined Python objects don't optimize for low memory usage, and even 
__slots__ only gets you so far.

--

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue23103
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue23103] ipaddress should be Flyweight

2014-12-23 Thread Seth Bromberger

Seth Bromberger added the comment:

I'm just pointing out that if he thinks 56 bytes per object is too large, he's 
going to be disappointed with what Python has to offer. General purpose 
user-defined Python objects don't optimize for low memory usage, and even 
__slots__ only gets you so far.

He thinks that 1300% overhead is a bit too much for a simple object that can 
be represented by a 32-bit number, and he has been using python for several 
years and understands, generally, what the language has to offer (though not 
nearly so well as some of the respondents here). It may be in the roundoff at n 
 1e5, but when you start using these objects for real-world analysis, the 
overhead becomes problematic. I note with some amusement that the overhead is 
several orders of magnitude greater than those of the protocols these objects 
represent :)

In any case - I have no issues with the decision not to make these objects 
memory-efficient in stdlib and consequently with closing this out. Rolling my 
own package appears to be the best solution in any case.

Thanks for the discussion.

--
resolution:  - wont fix
status: open - closed

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue23103
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com