Re: [Numpy-discussion] allocated memory cache for numpy

2014-02-18 Thread Julian Taylor
On 18.02.2014 16:21, Julian Taylor wrote:
> On Mon, Feb 17, 2014 at 9:42 PM, Nathaniel Smith  wrote:
>> On 17 Feb 2014 15:17, "Sturla Molden"  wrote:
>>>
>>> Julian Taylor  wrote:
>>>
 When an array is created it tries to get its memory from the cache and
 when it's deallocated it returns it to the cache.
>>>
> ...
>>
>> Another optimization we should consider that might help a lot in the same
>> situations where this would help: for code called from the cpython eval
>> loop, it's afaict possible to determine which inputs are temporaries by
>> checking their refcnt. In the second call to __add__ in '(a + b) + c', the
>> temporary will have refcnt 1, while the other arrays will all have refcnt
>> >1. In such cases (subject to various sanity checks on shape, dtype, etc) we
>> could elide temporaries by reusing the input array for the output. The risk
>> is that there may be some code out there that calls these operations
>> directly from C with non-temp arrays that nonetheless have refcnt 1, but we
>> should at least investigate the feasibility. E.g. maybe we can do the
>> optimization for tp_add but not PyArray_Add.
>>
> 
> this seems to be a really good idea; I experimented a bit and it
> solves the temporary problem for these types of arithmetic nicely.
> It's simple to implement: just switch to the in-place operation in the
> array_{add,sub,mul,div} handlers for the Python slots. Doing so does
> not fail the numpy, scipy and pandas test suites, so it seems safe.
> Performance-wise, besides the simple page-zeroing-limited benchmarks
> (a+b+c), it also brings the out-of-place Laplace benchmark to the
> same speed as the in-place benchmark [0]. This is very nice, as the
> in-place variant is significantly harder to read.
> 
> Does anyone see any issue we might be overlooking in this refcount ==
> 1 optimization for the python api?
> I'll post a PR with the change shortly.

here is the PR:
https://github.com/numpy/numpy/pull/4322

probably still lacking some checks, but I think it can be tested


Re: [Numpy-discussion] allocated memory cache for numpy

2014-02-18 Thread Sturla Molden
I am cross-posting this to the Cython user group to make sure they see this.
Sturla


Nathaniel Smith  wrote:
> On 18 Feb 2014 10:21, "Julian Taylor"  wrote:
>> On Mon, Feb 17, 2014 at 9:42 PM, Nathaniel Smith  wrote:
>>> On 17 Feb 2014 15:17, "Sturla Molden"  wrote:
>>>>
>>>> Julian Taylor  wrote:
>>>>
>>>> > When an array is created it tries to get its memory from the cache and
>>>> > when it's deallocated it returns it to the cache.
>>>>
>> ...
>>>
>>> Another optimization we should consider that might help a lot in the same
>>> situations where this would help: for code called from the cpython eval
>>> loop, it's afaict possible to determine which inputs are temporaries by
>>> checking their refcnt. In the second call to __add__ in '(a + b) + c', the
>>> temporary will have refcnt 1, while the other arrays will all have refcnt
>>> >1. In such cases (subject to various sanity checks on shape, dtype,
>>> etc) we could elide temporaries by reusing the input array for the output.
>>> The risk is that there may be some code out there that calls these
>>> operations directly from C with non-temp arrays that nonetheless have
>>> refcnt 1, but we should at least investigate the feasibility. E.g. maybe
>>> we can do the optimization for tp_add but not PyArray_Add.
>>
>> this seems to be a really good idea; I experimented a bit and it solves
>> the temporary problem for these types of arithmetic nicely. It's simple
>> to implement: just switch to the in-place operation in the
>> array_{add,sub,mul,div} handlers for the Python slots. Doing so does not
>> fail the numpy, scipy and pandas test suites, so it seems safe.
>> Performance-wise, besides the simple page-zeroing-limited benchmarks
>> (a+b+c), it also brings the out-of-place Laplace benchmark to the same
>> speed as the in-place benchmark [0]. This is very nice, as the in-place
>> variant is significantly harder to read.
>
> Sweet.
>
>> Does anyone see any issue we might be overlooking in this refcount == 1
>> optimization for the python api? I'll post a PR with the change shortly.
>
> It belatedly occurs to me that Cython code like
>   a = np.arange(10)
>   b = np.arange(10)
>   c = a + b
> might end up calling tp_add with refcnt 1 arrays. Ditto for the same code
> with a and b declared as cdef np.ndarray or cdef object. We should check...
>
> -n



Re: [Numpy-discussion] allocated memory cache for numpy

2014-02-18 Thread Nathaniel Smith
On 18 Feb 2014 10:21, "Julian Taylor"  wrote:
>
> On Mon, Feb 17, 2014 at 9:42 PM, Nathaniel Smith  wrote:
> > On 17 Feb 2014 15:17, "Sturla Molden"  wrote:
> >>
> >> Julian Taylor  wrote:
> >>
> >> > When an array is created it tries to get its memory from the cache and
> >> > when it's deallocated it returns it to the cache.
> >>
> ...
> >
> > Another optimization we should consider that might help a lot in the same
> > situations where this would help: for code called from the cpython eval
> > loop, it's afaict possible to determine which inputs are temporaries by
> > checking their refcnt. In the second call to __add__ in '(a + b) + c', the
> > temporary will have refcnt 1, while the other arrays will all have refcnt
> > >1. In such cases (subject to various sanity checks on shape, dtype,
> > etc) we could elide temporaries by reusing the input array for the output.
> > The risk is that there may be some code out there that calls these
> > operations directly from C with non-temp arrays that nonetheless have
> > refcnt 1, but we should at least investigate the feasibility. E.g. maybe
> > we can do the optimization for tp_add but not PyArray_Add.
> >
>
> this seems to be a really good idea; I experimented a bit and it
> solves the temporary problem for these types of arithmetic nicely.
> It's simple to implement: just switch to the in-place operation in the
> array_{add,sub,mul,div} handlers for the Python slots. Doing so does
> not fail the numpy, scipy and pandas test suites, so it seems safe.
> Performance-wise, besides the simple page-zeroing-limited benchmarks
> (a+b+c), it also brings the out-of-place Laplace benchmark to the
> same speed as the in-place benchmark [0]. This is very nice, as the
> in-place variant is significantly harder to read.

Sweet.

> Does anyone see any issue we might be overlooking in this refcount ==
> 1 optimization for the python api?
> I'll post a PR with the change shortly.

It belatedly occurs to me that Cython code like
  a = np.arange(10)
  b = np.arange(10)
  c = a + b
might end up calling tp_add with refcnt 1 arrays. Ditto for the same code
with a and b declared as cdef np.ndarray or cdef object. We should check...

-n


Re: [Numpy-discussion] allocated memory cache for numpy

2014-02-18 Thread Julian Taylor
On Mon, Feb 17, 2014 at 9:42 PM, Nathaniel Smith  wrote:
> On 17 Feb 2014 15:17, "Sturla Molden"  wrote:
>>
>> Julian Taylor  wrote:
>>
>> > When an array is created it tries to get its memory from the cache and
>> > when it's deallocated it returns it to the cache.
>>
...
>
> Another optimization we should consider that might help a lot in the same
> situations where this would help: for code called from the cpython eval
> loop, it's afaict possible to determine which inputs are temporaries by
> checking their refcnt. In the second call to __add__ in '(a + b) + c', the
> temporary will have refcnt 1, while the other arrays will all have refcnt
> >1. In such cases (subject to various sanity checks on shape, dtype, etc) we
> could elide temporaries by reusing the input array for the output. The risk
> is that there may be some code out there that calls these operations
> directly from C with non-temp arrays that nonetheless have refcnt 1, but we
> should at least investigate the feasibility. E.g. maybe we can do the
> optimization for tp_add but not PyArray_Add.
>

this seems to be a really good idea; I experimented a bit and it
solves the temporary problem for these types of arithmetic nicely.
It's simple to implement: just switch to the in-place operation in the
array_{add,sub,mul,div} handlers for the Python slots. Doing so does
not fail the numpy, scipy and pandas test suites, so it seems safe.
Performance-wise, besides the simple page-zeroing-limited benchmarks
(a+b+c), it also brings the out-of-place Laplace benchmark to the
same speed as the in-place benchmark [0]. This is very nice, as the
in-place variant is significantly harder to read.

Does anyone see any issue we might be overlooking in this refcount ==
1 optimization for the python api?
I'll post a PR with the change shortly.
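To make the effect concrete, here is a rough Python-level picture of what
the change amounts to; this is illustrative only, not the actual C code in
the PR:

  import numpy as np

  a = np.ones(1000000)
  b = np.ones(1000000)
  c = np.ones(1000000)

  # What the user writes:
  d = (a + b) + c

  # What the patched handlers effectively do once they have verified that
  # the left operand is an eligible refcnt-1 temporary (matching shape and
  # dtype, writeable, not a view):
  tmp = a + b                    # the first add still allocates a temporary
  d2 = np.add(tmp, c, out=tmp)   # the second add reuses it instead
  assert np.array_equal(d, d2)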


Regardless of this change, caching memory blocks might still be
worthwhile for fancy indexing and other operations which require
allocations.


[0] http://yarikoptic.github.io/numpy-vbench/vb_vb_app.html#laplace-normal


Re: [Numpy-discussion] allocated memory cache for numpy

2014-02-18 Thread Sturla Molden
Julian Taylor  wrote:
> I was thinking of something much simpler, just a layer of pointer stacks
> for different allocation sizes; the larger the size, the smaller the
> cache, with pessimistic defaults.
> e.g. the largest default cache layer is 128 MB with one or two
> entries, so we can cache temporally close operations like a + (c * b).
> Maybe an age counter can be included that clears the largest old entry
> if it hasn't been used for X allocations.

It would not be difficult if we e.g. used two heaps instead of one. One
would sort the cached blocks according to size (similar to malloc), the
other would be a priority queue for age. Every now and then the expired
entries would be cleared off the cache. 
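A minimal Python sketch of that structure (all names here are invented; a
best-fit lookup needs ordered access, so this uses a sorted list for the
size index and a plain FIFO for age rather than true heaps):

  import bisect
  import time

  class BlockCache:
      def __init__(self, max_age=2.0):
          self._by_size = []   # sorted list of (size, key)
          self._by_age = []    # (timestamp, key) in insertion (= time) order
          self._blocks = {}    # key -> block; a missing key marks a stale entry
          self._key = 0
          self._max_age = max_age

      def release(self, size, block):
          # Called instead of free(): park the block for later reuse.
          self._key += 1
          self._blocks[self._key] = block
          bisect.insort(self._by_size, (size, self._key))
          self._by_age.append((time.monotonic(), self._key))

      def acquire(self, size):
          # Best fit: the smallest cached block of at least `size` bytes.
          i = bisect.bisect_left(self._by_size, (size, 0))
          while i < len(self._by_size):
              _, key = self._by_size.pop(i)
              block = self._blocks.pop(key, None)
              if block is not None:
                  return block   # caller uses the first `size` bytes
          return None            # miss: caller falls back to malloc

      def expire(self):
          # The periodic sweep: clear anything older than a second or two.
          cutoff = time.monotonic() - self._max_age
          while self._by_age and self._by_age[0][0] < cutoff:
              _, key = self._by_age.pop(0)
              self._blocks.pop(key, None)   # the real free() would happen here

NumPy's allocator would call acquire() before malloc and release() instead
of free, with expire() driven by the garbage collector thread above.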

Sturla



Re: [Numpy-discussion] allocated memory cache for numpy

2014-02-18 Thread Julian Taylor
On Tue, Feb 18, 2014 at 1:47 AM, David Cournapeau  wrote:
>
> On Mon, Feb 17, 2014 at 7:31 PM, Julian Taylor
>  wrote:
>>
>> hi,
>> I noticed that during some simplistic benchmarks (e.g.
>> https://github.com/numpy/numpy/issues/4310) a lot of time is spent in
>> the kernel zeroing pages.
>> This is because under linux glibc will always allocate large memory
>> blocks with mmap. As these pages can come from other processes the
>> kernel must zero them for security reasons.
>
>
> Do you have numbers for 'a lot of time'? Is the above script the exact one
> you used for benchmarking this issue?

I saw it in many benchmarks I did over time for the numerous little
improvements I added.
But I'm aware these are overly simplistic; that's why I'm asking for
numbers from real applications.
The paper I found ran somewhat more thorough benchmarks, which seem to
indicate that more applications profit from it.

>
>>
>> For memory within the numpy process this is unnecessary and possibly a
>> large overhead for the many temporaries numpy creates.
>>
>> The behavior of glibc can be tuned to change the threshold at which it
>> starts using mmap, but that would be a platform-specific fix.
>>
>> I was thinking about adding a thread-local cache of pointers to
>> allocated memory.
>> When an array is created it tries to get its memory from the cache and
>> when it's deallocated it returns it to the cache.
>> The threshold and cached memory block sizes could be adaptive depending
>> on the application workload.
>>
>> For simplistic temporary-heavy benchmarks this eliminates the time spent
>> in the kernel (the 'system' time measured with time).
>
>
> For this kind of setup, I would advise looking into perf on linux. It should
> be much more precise than time.
>
> If nobody beats me to it, I can try to look at this this weekend,

I'm using perf for most of my benchmarks.
But in this case time is sufficient, as the system time is all you need to know.
perf confirms that this time is almost all spent zeroing pages.


Re: [Numpy-discussion] allocated memory cache for numpy

2014-02-17 Thread David Cournapeau
On Mon, Feb 17, 2014 at 7:31 PM, Julian Taylor <
jtaylor.deb...@googlemail.com> wrote:

> hi,
> I noticed that during some simplistic benchmarks (e.g.
> https://github.com/numpy/numpy/issues/4310) a lot of time is spent in
> the kernel zeroing pages.
> This is because under linux glibc will always allocate large memory
> blocks with mmap. As these pages can come from other processes the
> kernel must zero them for security reasons.
>

Do you have numbers for 'a lot of time'? Is the above script the exact one
you used for benchmarking this issue?


> For memory within the numpy process this is unnecessary and possibly a
> large overhead for the many temporaries numpy creates.
>
> The behavior of glibc can be tuned to change the threshold at which it
> starts using mmap, but that would be a platform-specific fix.
>
> I was thinking about adding a thread-local cache of pointers to
> allocated memory.
> When an array is created it tries to get its memory from the cache and
> when it's deallocated it returns it to the cache.
> The threshold and cached memory block sizes could be adaptive depending
> on the application workload.
>
> For simplistic temporary-heavy benchmarks this eliminates the time spent
> in the kernel (the 'system' time measured with time).
>

For this kind of setup, I would advise looking into perf on linux. It
should be much more precise than time.

If nobody beats me to it, I can try to look at this this weekend,


> But I don't know how relevant this is for real world applications.
> Have you noticed large amounts of time spent in the kernel in your apps?
>

In my experience, more time is spent on figuring out how to spare memory
than on speeding up this kind of operation for 'real life applications' (TM).

What happens to your benchmark if you tune malloc to not use mmap at all ?

David

>
> I also found this paper which describes pretty much exactly what I'm
> proposing:
> pyhpc.org/workshop/papers/Doubling.pdf
>
> Does anyone know why their changes were never incorporated in numpy? I
> couldn't find a reference in the list archive.


Re: [Numpy-discussion] allocated memory cache for numpy

2014-02-17 Thread Stefan Seefeld
On 02/17/2014 06:56 PM, Nathaniel Smith wrote:
> On Mon, Feb 17, 2014 at 3:55 PM, Stefan Seefeld  wrote:
>> On 02/17/2014 03:42 PM, Nathaniel Smith wrote:
>>> Another optimization we should consider that might help a lot in the
>>> same situations where this would help: for code called from the
>>> cpython eval loop, it's afaict possible to determine which inputs are
>>> temporaries by checking their refcnt. In the second call to __add__ in
>>> '(a + b) + c', the temporary will have refcnt 1, while the other
>>> arrays will all have refcnt >1. In such cases (subject to various
>>> sanity checks on shape, dtype, etc) we could elide temporaries by
>>> reusing the input array for the output. The risk is that there may be
>>> some code out there that calls these operations directly from C with
>>> non-temp arrays that nonetheless have refcnt 1, but we should at least
>>> investigate the feasibility. E.g. maybe we can do the optimization for
>>> tp_add but not PyArray_Add.
>> For element-wise operations such as the above, wouldn't it be even
>> better to use loop fusion, by evaluating the entire compound expression
>> per element, instead of each individual operation? That would require
>> methods such as __add__ to return an operation object, rather than the
>> result value. I believe a technique like that is used in the numexpr
>> package (https://github.com/pydata/numexpr), which I saw announced here
>> recently...
> Hi Stefan (long time no see!),

Indeed! :-)

> Sure, that would be an excellent thing, but adding a delayed
> evaluation engine to numpy is a big piece of new code, and we'd want
> to make it something you opt-in to explicitly. (There are too many
> weird potential issues with e.g. errors showing up at some far away
> place from the actual broken code, due to evaluation being delayed to
> there.) By contrast, the optimization suggested here is a tiny change
> we could do now, and would still be useful even in the hypothetical
> future where we do have lazy evaluation, for anyone who doesn't use
> it.

Sure, I fully agree. I didn't mean to suggest this as an alternative to
a focused memory management optimization.
Still, it seems this would be a nice project (perhaps even under the
GSoC umbrella). It could be controlled by a metaclass (substituting
appropriate ndarray methods), and thus could be enabled separately and
explicitly.

Anyhow, just an idea for someone else to pick up. :-)

Stefan

-- 

  ...ich hab' noch einen Koffer in Berlin...



Re: [Numpy-discussion] allocated memory cache for numpy

2014-02-17 Thread Julian Taylor
On 17.02.2014 22:27, Sturla Molden wrote:
> Nathaniel Smith  wrote:
>> Also, I'd be pretty wary of caching large chunks of unused memory. People
>> already have a lot of trouble understanding their program's memory usage,
>> and getting rid of 'greedy free' will make this even worse.
> 
> A cache would only be needed when there is a lot of computing going on, so
> one could set an "early expire date" on cached chunks. Anything older than
> a second or two could be returned. A cache would thus require a garbage
> collector thread.
> 


I was thinking of something much simpler, just a layer of pointer stacks
for different allocation sizes; the larger the size, the smaller the
cache, with pessimistic defaults.
e.g. the largest default cache layer is 128 MB with one or two
entries, so we can cache temporally close operations like a + (c * b).
Maybe an age counter can be included that clears the largest old entry
if it hasn't been used for X allocations.

We can also monitor the allocations and make the maximum cache layer a
fraction of the biggest allocation that occurred during the runtime.

Devising a good scheme is probably tricky, but we can start out with
something simple like limiting the cache to 512 MB, which will already
benefit many cases while still being an acceptable amount of memory to
waste on a modern machine.
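A hedged sketch of such layered pointer stacks (the size classes, entry
counts and names are invented for illustration; a real version would live
in C and hold raw pointers):

  class PointerStackCache:
      MAX_BLOCK = 128 * 2**20   # largest cacheable block: 128 MB

      def __init__(self):
          self._stacks = {}     # size class -> stack (list) of free blocks

      @staticmethod
      def _size_class(nbytes):
          # Round the request up to the next power of two.
          return 1 << max(nbytes - 1, 0).bit_length()

      def _max_entries(self, size_class):
          # Pessimistic defaults: the larger the class, the smaller the cache.
          return 2 if size_class >= 2**26 else 16

      def allocate(self, nbytes):
          sc = self._size_class(nbytes)
          stack = self._stacks.get(sc)
          if stack:
              return stack.pop()   # cache hit: no mmap, no page zeroing
          return bytearray(sc)     # miss: stands in for malloc(sc)

      def deallocate(self, nbytes, block):
          sc = self._size_class(nbytes)
          stack = self._stacks.setdefault(sc, [])
          if sc <= self.MAX_BLOCK and len(stack) < self._max_entries(sc):
              stack.append(block)  # park it for the next temporary
          # otherwise drop the reference; in C this would be free(block)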


Re: [Numpy-discussion] allocated memory cache for numpy

2014-02-17 Thread Nathaniel Smith
On Mon, Feb 17, 2014 at 3:55 PM, Stefan Seefeld  wrote:
> On 02/17/2014 03:42 PM, Nathaniel Smith wrote:
>> Another optimization we should consider that might help a lot in the
>> same situations where this would help: for code called from the
>> cpython eval loop, it's afaict possible to determine which inputs are
>> temporaries by checking their refcnt. In the second call to __add__ in
>> '(a + b) + c', the temporary will have refcnt 1, while the other
>> arrays will all have refcnt >1. In such cases (subject to various
>> sanity checks on shape, dtype, etc) we could elide temporaries by
>> reusing the input array for the output. The risk is that there may be
>> some code out there that calls these operations directly from C with
>> non-temp arrays that nonetheless have refcnt 1, but we should at least
>> investigate the feasibility. E.g. maybe we can do the optimization for
>> tp_add but not PyArray_Add.
>
> For element-wise operations such as the above, wouldn't it be even
> better to use loop fusion, by evaluating the entire compound expression
> per element, instead of each individual operation? That would require
> methods such as __add__ to return an operation object, rather than the
> result value. I believe a technique like that is used in the numexpr
> package (https://github.com/pydata/numexpr), which I saw announced here
> recently...

Hi Stefan (long time no see!),

Sure, that would be an excellent thing, but adding a delayed
evaluation engine to numpy is a big piece of new code, and we'd want
to make it something you opt-in to explicitly. (There are too many
weird potential issues with e.g. errors showing up at some far away
place from the actual broken code, due to evaluation being delayed to
there.) By contrast, the optimization suggested here is a tiny change
we could do now, and would still be useful even in the hypothetical
future where we do have lazy evaluation, for anyone who doesn't use
it.

-n


Re: [Numpy-discussion] allocated memory cache for numpy

2014-02-17 Thread Sturla Molden
Nathaniel Smith  wrote:
> Also, I'd be pretty wary of caching large chunks of unused memory. People
> already have a lot of trouble understanding their program's memory usage,
> and getting rid of 'greedy free' will make this even worse.

A cache would only be needed when there is a lot of computing going on, so
one could set an "early expire date" on cached chunks. Anything older than
a second or two could be returned. A cache would thus require a garbage
collector thread.

Sturla



Re: [Numpy-discussion] allocated memory cache for numpy

2014-02-17 Thread Stefan Seefeld
On 02/17/2014 03:42 PM, Nathaniel Smith wrote:
> Another optimization we should consider that might help a lot in the
> same situations where this would help: for code called from the
> cpython eval loop, it's afaict possible to determine which inputs are
> temporaries by checking their refcnt. In the second call to __add__ in
> '(a + b) + c', the temporary will have refcnt 1, while the other
> arrays will all have refcnt >1. In such cases (subject to various
> sanity checks on shape, dtype, etc) we could elide temporaries by
> reusing the input array for the output. The risk is that there may be
> some code out there that calls these operations directly from C with
> non-temp arrays that nonetheless have refcnt 1, but we should at least
> investigate the feasibility. E.g. maybe we can do the optimization for
> tp_add but not PyArray_Add. 

For element-wise operations such as the above, wouldn't it be even
better to use loop fusion, by evaluating the entire compound expression
per element, instead of each individual operation? That would require
methods such as __add__ to return an operation object, rather than the
result value. I believe a technique like that is used in the numexpr
package (https://github.com/pydata/numexpr), which I saw announced here
recently...

FWIW,
Stefan

PS: Such a loop-fusion technique would also open the door to other
optimizations, such as vectorization (simd)...
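For reference, numexpr already exposes this kind of fused evaluation; a
minimal usage example (array sizes arbitrary):

  import numpy as np
  import numexpr as ne

  a = np.random.rand(1000000)
  b = np.random.rand(1000000)
  c = np.random.rand(1000000)

  # One blocked pass over the data, without materializing full-size
  # temporaries for b * c or the intermediate sum:
  result = ne.evaluate("2*a + b * c")
  assert np.allclose(result, 2*a + b * c)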

-- 

  ...ich hab' noch einen Koffer in Berlin...



Re: [Numpy-discussion] allocated memory cache for numpy

2014-02-17 Thread Nathaniel Smith
On 17 Feb 2014 15:17, "Sturla Molden"  wrote:
>
> Julian Taylor  wrote:
>
> > When an array is created it tries to get its memory from the cache and
> > when it's deallocated it returns it to the cache.
>
> Good idea, however there is already a C function that does this. It uses a
> heap to keep the cached memory blocks sorted according to size. You know it
> as malloc — which is why we call this allocation from the heap. Which by the
> way is what NumPy already does.  ;-)

Common malloc implementations are not well optimized for programs that have
frequent, short-lived, large-sized allocations. Usually they optimize for
short-lived allocations of small sizes. It's totally plausible
that we could do a better job in the common case of array operations like
'a + b + c + d' that allocate and free a bunch of same-sized temporary
arrays as they go. (Right now, if those arrays are large, that expression
will always generate multiple mmap/munmap calls.) The question is to what
extent numpy programs are bottlenecked by such allocations.

Also, I'd be pretty wary of caching large chunks of unused memory. People
already have a lot of trouble understanding their program's memory usage,
and getting rid of 'greedy free' will make this even worse.

Another optimization we should consider that might help a lot in the same
situations where this would help: for code called from the cpython eval
loop, it's afaict possible to determine which inputs are temporaries by
checking their refcnt. In the second call to __add__ in '(a + b) + c', the
temporary will have refcnt 1, while the other arrays will all have refcnt
>1. In such cases (subject to various sanity checks on shape, dtype, etc)
we could elide temporaries by reusing the input array for the output. The
risk is that there may be some code out there that calls these operations
directly from C with non-temp arrays that nonetheless have refcnt 1, but we
should at least investigate the feasibility. E.g. maybe we can do the
optimization for tp_add but not PyArray_Add.
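The refcount distinction is easy to observe from Python (a small
illustration; the exact counts reported can vary, since sys.getrefcount
itself holds a temporary reference to its argument):

  import sys
  import numpy as np

  a = np.arange(10)
  b = np.arange(10)

  # 'a' is kept alive by its variable binding, so it reports at least 2.
  print(sys.getrefcount(a))       # typically 2
  # The temporary from 'a + b' is held only by this call itself; likewise,
  # at the C level tp_add in '(a + b) + c' sees it with refcnt 1.
  print(sys.getrefcount(a + b))   # typically 1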

-n


Re: [Numpy-discussion] allocated memory cache for numpy

2014-02-17 Thread Julian Taylor
On 17.02.2014 21:16, Sturla Molden wrote:
> Julian Taylor  wrote:
> 
>> When an array is created it tries to get its memory from the cache and
>> when it's deallocated it returns it to the cache.
> 
> Good idea, however there is already a C function that does this. It uses a
> heap to keep the cached memory blocks sorted according to size. You know it
> as malloc — which is why we call this allocation from the heap. Which by the
> way is what NumPy already does.  ;-)
> 

not with glibc, glibc has no cache for mmap-allocated memory.
It does cache sbrk-allocated memory and uses a dynamic threshold for
choosing when to use mmap, but it's tuned for generic applications, so the
maximum threshold is very low, I think it's 32 MB. Far too low for many
numerical applications.


Re: [Numpy-discussion] allocated memory cache for numpy

2014-02-17 Thread Sturla Molden
Julian Taylor  wrote:

> When an array is created it tries to get its memory from the cache and
> when it's deallocated it returns it to the cache.

Good idea, however there is already a C function that does this. It uses a
heap to keep the cached memory blocks sorted according to size. You know it
as malloc — which is why we call this allocation from the heap. Which by the
way is what NumPy already does.  ;-)

Sturla



[Numpy-discussion] allocated memory cache for numpy

2014-02-17 Thread Julian Taylor
hi,
I noticed that during some simplistic benchmarks (e.g.
https://github.com/numpy/numpy/issues/4310) a lot of time is spent in
the kernel zeroing pages.
This is because under linux glibc will always allocate large memory
blocks with mmap. As these pages can come from other processes the
kernel must zero them for security reasons.
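A minimal benchmark of that kind (illustrative, not the exact script from
the issue above); running it under time(1) shows most of the runtime as
system time:

  import numpy as np

  a = np.ones(10 * 1024 * 1024)   # ~80 MB, well above glibc's mmap threshold
  b = np.ones(10 * 1024 * 1024)
  for _ in range(100):
      # each operation mmaps a fresh temporary the kernel must zero
      c = a + b + a + b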

For memory within the numpy process this is unnecessary and possibly a
large overhead for the many temporaries numpy creates.

The behavior of glibc can be tuned to change the threshold at which it
starts using mmap, but that would be a platform-specific fix.
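For the record, glibc exposes this threshold both through the
MALLOC_MMAP_THRESHOLD_ environment variable and through mallopt(), which
can be reached from Python via ctypes (a sketch, Linux/glibc only):

  import ctypes

  libc = ctypes.CDLL("libc.so.6", use_errno=True)
  M_MMAP_THRESHOLD = -3   # constant from glibc's malloc.h
  # Raise the threshold so large blocks come from the (reusable) heap
  # instead of fresh, kernel-zeroed mmap pages.
  libc.mallopt(M_MMAP_THRESHOLD, 128 * 1024 * 1024)

Setting MALLOC_MMAP_THRESHOLD_=134217728 in the environment before starting
Python has the same effect without code changes.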

I was thinking about adding a thread-local cache of pointers to
allocated memory.
When an array is created it tries to get its memory from the cache and
when it's deallocated it returns it to the cache.
The threshold and cached memory block sizes could be adaptive depending
on the application workload.

For simplistic temporary-heavy benchmarks this eliminates the time spent
in the kernel (the 'system' time measured with time).

But I don't know how relevant this is for real world applications.
Have you noticed large amounts of time spent in the kernel in your apps?

I also found this paper which describes pretty much exactly what I'm
proposing:
pyhpc.org/workshop/papers/Doubling.pdf

Does anyone know why their changes were never incorporated in numpy? I
couldn't find a reference in the list archive.