Re: [Numpy-discussion] allocated memory cache for numpy
On 18.02.2014 16:21, Julian Taylor wrote:
> On Mon, Feb 17, 2014 at 9:42 PM, Nathaniel Smith wrote:
>> On 17 Feb 2014 15:17, "Sturla Molden" wrote:
>>> Julian Taylor wrote:
>>>> When an array is created it tries to get its memory from the cache and
>>>> when it's deallocated it returns it to the cache.
>>> ...
>>
>> Another optimization we should consider that might help a lot in the same
>> situations where this would help: for code called from the cpython eval
>> loop, it's afaict possible to determine which inputs are temporaries by
>> checking their refcnt. In the second call to __add__ in '(a + b) + c', the
>> temporary will have refcnt 1, while the other arrays will all have refcnt
>> >1. In such cases (subject to various sanity checks on shape, dtype, etc)
>> we could elide temporaries by reusing the input array for the output. The
>> risk is that there may be some code out there that calls these operations
>> directly from C with non-temp arrays that nonetheless have refcnt 1, but
>> we should at least investigate the feasibility. E.g. maybe we can do the
>> optimization for tp_add but not PyArray_Add.
>
> this seems to be a really good idea, I experimented a bit and it
> solves the temporary problem for these types of arithmetic nicely.
> It's simple to implement: just change to inplace in the
> array_{add,sub,mul,div} handlers for the python slots. Doing so does
> not fail the numpy, scipy and pandas test suites, so it seems safe.
> Performance wise, besides the simple page-zeroing-limited benchmarks
> (a+b+c), it also brings the laplace out-of-place benchmark to the
> same speed as the inplace benchmark [0]. This is very nice as the
> inplace variant is significantly harder to read.
>
> Does anyone see any issue we might be overlooking in this refcount ==
> 1 optimization for the python api?
> I'll post a PR with the change shortly.
here is the PR: https://github.com/numpy/numpy/pull/4322
probably still lacking some checks, but I think it can be tested

___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] allocated memory cache for numpy
I am cross-posting this to the Cython user group to make sure they see this.

Sturla

Nathaniel Smith wrote:
> On 18 Feb 2014 10:21, "Julian Taylor" wrote:
>> On Mon, Feb 17, 2014 at 9:42 PM, Nathaniel Smith wrote:
>>> On 17 Feb 2014 15:17, "Sturla Molden" wrote:
>>>> Julian Taylor wrote:
>>>>> When an array is created it tries to get its memory from the cache and
>>>>> when it's deallocated it returns it to the cache.
>>>> ...
>>>
>>> Another optimization we should consider that might help a lot in the
>>> same situations where this would help: for code called from the cpython
>>> eval loop, it's afaict possible to determine which inputs are
>>> temporaries by checking their refcnt. In the second call to __add__ in
>>> '(a + b) + c', the temporary will have refcnt 1, while the other arrays
>>> will all have refcnt >1. In such cases (subject to various sanity
>>> checks on shape, dtype, etc) we could elide temporaries by reusing the
>>> input array for the output. The risk is that there may be some code out
>>> there that calls these operations directly from C with non-temp arrays
>>> that nonetheless have refcnt 1, but we should at least investigate the
>>> feasibility. E.g. maybe we can do the optimization for tp_add but not
>>> PyArray_Add.
>>
>> this seems to be a really good idea, I experimented a bit and it solves
>> the temporary problem for these types of arithmetic nicely. It's simple
>> to implement: just change to inplace in the array_{add,sub,mul,div}
>> handlers for the python slots. Doing so does not fail the numpy, scipy
>> and pandas test suites, so it seems safe. Performance wise, besides the
>> simple page-zeroing-limited benchmarks (a+b+c), it also brings the
>> laplace out-of-place benchmark to the same speed as the inplace
>> benchmark [0]. This is very nice as the inplace variant is significantly
>> harder to read.
>
> Sweet.
>
>> Does anyone see any issue we might be overlooking in this refcount == 1
>> optimization for the python api?
>> I'll post a PR with the change shortly.
>
> It occurs belatedly that Cython code like
>
>     a = np.arange(10)
>     b = np.arange(10)
>     c = a + b
>
> might end up calling tp_add with refcnt 1 arrays. Ditto for the same with
> cdef np.ndarray or cdef object added. We should check...
>
> -n
Re: [Numpy-discussion] allocated memory cache for numpy
On 18 Feb 2014 10:21, "Julian Taylor" wrote:
> On Mon, Feb 17, 2014 at 9:42 PM, Nathaniel Smith wrote:
>> On 17 Feb 2014 15:17, "Sturla Molden" wrote:
>>> Julian Taylor wrote:
>>>> When an array is created it tries to get its memory from the cache and
>>>> when it's deallocated it returns it to the cache.
>>> ...
>>
>> Another optimization we should consider that might help a lot in the same
>> situations where this would help: for code called from the cpython eval
>> loop, it's afaict possible to determine which inputs are temporaries by
>> checking their refcnt. In the second call to __add__ in '(a + b) + c', the
>> temporary will have refcnt 1, while the other arrays will all have refcnt
>> >1. In such cases (subject to various sanity checks on shape, dtype, etc)
>> we could elide temporaries by reusing the input array for the output. The
>> risk is that there may be some code out there that calls these operations
>> directly from C with non-temp arrays that nonetheless have refcnt 1, but
>> we should at least investigate the feasibility. E.g. maybe we can do the
>> optimization for tp_add but not PyArray_Add.
>
> this seems to be a really good idea, I experimented a bit and it
> solves the temporary problem for these types of arithmetic nicely.
> It's simple to implement: just change to inplace in the
> array_{add,sub,mul,div} handlers for the python slots. Doing so does
> not fail the numpy, scipy and pandas test suites, so it seems safe.
> Performance wise, besides the simple page-zeroing-limited benchmarks
> (a+b+c), it also brings the laplace out-of-place benchmark to the
> same speed as the inplace benchmark [0]. This is very nice as the
> inplace variant is significantly harder to read.

Sweet.

> Does anyone see any issue we might be overlooking in this refcount ==
> 1 optimization for the python api?
> I'll post a PR with the change shortly.
It occurs belatedly that Cython code like

    a = np.arange(10)
    b = np.arange(10)
    c = a + b

might end up calling tp_add with refcnt 1 arrays. Ditto for the same with
cdef np.ndarray or cdef object added. We should check...

-n
Re: [Numpy-discussion] allocated memory cache for numpy
On Mon, Feb 17, 2014 at 9:42 PM, Nathaniel Smith wrote:
> On 17 Feb 2014 15:17, "Sturla Molden" wrote:
>> Julian Taylor wrote:
>>> When an array is created it tries to get its memory from the cache and
>>> when it's deallocated it returns it to the cache.
>> ...
>
> Another optimization we should consider that might help a lot in the same
> situations where this would help: for code called from the cpython eval
> loop, it's afaict possible to determine which inputs are temporaries by
> checking their refcnt. In the second call to __add__ in '(a + b) + c', the
> temporary will have refcnt 1, while the other arrays will all have refcnt
> >1. In such cases (subject to various sanity checks on shape, dtype, etc)
> we could elide temporaries by reusing the input array for the output. The
> risk is that there may be some code out there that calls these operations
> directly from C with non-temp arrays that nonetheless have refcnt 1, but
> we should at least investigate the feasibility. E.g. maybe we can do the
> optimization for tp_add but not PyArray_Add.

this seems to be a really good idea, I experimented a bit and it solves
the temporary problem for these types of arithmetic nicely. It's simple
to implement: just change to inplace in the array_{add,sub,mul,div}
handlers for the python slots. Doing so does not fail the numpy, scipy
and pandas test suites, so it seems safe. Performance wise, besides the
simple page-zeroing-limited benchmarks (a+b+c), it also brings the
laplace out-of-place benchmark to the same speed as the inplace
benchmark [0]. This is very nice as the inplace variant is significantly
harder to read.

Does anyone see any issue we might be overlooking in this refcount == 1
optimization for the python api?

I'll post a PR with the change shortly.

Regardless of this change, caching memory blocks might still be
worthwhile for fancy indexing and other operations which require
allocations.
[0] http://yarikoptic.github.io/numpy-vbench/vb_vb_app.html#laplace-normal
Re: [Numpy-discussion] allocated memory cache for numpy
Julian Taylor wrote:
> I was thinking of something much simpler, just a layer of pointer stacks
> for different allocation sizes, the larger the size the smaller the
> cache, with pessimistic defaults.
> e.g. the largest default cache layer is 128MB and with one or two
> entries so we can cache temporally close operations like a + (c * b).
> Maybe an age counter can be included that clears the largest old entry
> if it hasn't been used for X allocations.

It would not be difficult if we e.g. used two heaps instead of one. One
would sort the cached blocks according to size (similar to malloc), the
other would be a priority queue for age. Every now and then the expired
entries would be cleared off the cache.

Sturla
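The two-structure design described here can be sketched in a few lines of
Python (an illustration only; the class and method names are made up, and a
real allocator would of course do this in C against actual memory blocks):

```python
import heapq
import time
from bisect import insort, bisect_left

class AgedBlockCache:
    """Sketch of the scheme above: a size-sorted list serves best-fit
    lookups, while a min-heap keyed on insertion time lets a sweeper
    drop entries older than max_age seconds."""

    def __init__(self, max_age=2.0):
        self.by_size = []   # sorted list of (size, block_id)
        self.by_age = []    # heap of (timestamp, block_id)
        self.live = set()   # block ids still held by the cache
        self.max_age = max_age

    def put(self, size, block_id):
        insort(self.by_size, (size, block_id))
        heapq.heappush(self.by_age, (time.monotonic(), block_id))
        self.live.add(block_id)

    def get(self, size):
        """Return the smallest cached block of at least `size`, if any."""
        i = bisect_left(self.by_size, (size, -1))
        while i < len(self.by_size):
            _, bid = self.by_size[i]
            del self.by_size[i]          # entry is consumed either way
            if bid in self.live:
                self.live.discard(bid)
                return bid               # fresh hit
            # stale entry (already expired by the sweeper); keep scanning
        return None

    def expire(self):
        """The 'garbage collector' pass: drop entries past max_age."""
        now = time.monotonic()
        while self.by_age and now - self.by_age[0][0] > self.max_age:
            _, bid = heapq.heappop(self.by_age)
            self.live.discard(bid)       # a real cache would free() here

cache = AgedBlockCache(max_age=2.0)
cache.put(1024, block_id=1)
cache.put(4096, block_id=2)
reused = cache.get(2000)   # best fit among cached blocks: the 4096-byte one
```

The expiry pass only marks blocks dead; stale size-list entries are skipped
lazily on lookup, which keeps both structures simple.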
Re: [Numpy-discussion] allocated memory cache for numpy
On Tue, Feb 18, 2014 at 1:47 AM, David Cournapeau wrote:
> On Mon, Feb 17, 2014 at 7:31 PM, Julian Taylor wrote:
>> hi,
>> I noticed that during some simplistic benchmarks (e.g.
>> https://github.com/numpy/numpy/issues/4310) a lot of time is spent in
>> the kernel zeroing pages.
>> This is because under linux glibc will always allocate large memory
>> blocks with mmap. As these pages can come from other processes the
>> kernel must zero them for security reasons.
>
> Do you have numbers for 'a lot of time'? Is the above script the exact
> one you used for benchmarking this issue?

I saw it in many benchmarks I did over time for the numerous little
improvements I added. But I'm aware these are overly simplistic, that's
why I'm asking for numbers from real applications. The paper I found did
somewhat more thorough benchmarks and seems to indicate that more
applications profit from it.

>> For memory within the numpy process this is unnecessary and possibly a
>> large overhead for the many temporaries numpy creates.
>>
>> The behavior of glibc can be tuned to change the threshold at which it
>> starts using mmap but that would be a platform specific fix.
>>
>> I was thinking about adding a thread local cache of pointers to
>> allocated memory.
>> When an array is created it tries to get its memory from the cache and
>> when it's deallocated it returns it to the cache.
>> The threshold and cached memory block sizes could be adaptive depending
>> on the application workload.
>>
>> For simplistic temporary heavy benchmarks this eliminates the time spent
>> in the kernel (the system time reported by time).
>
> For this kind of setup, I would advise to look into perf on linux. It
> should be much more precise than time.
>
> If nobody beats me to it, I can try to look at this this WE,

I'm using perf for most of my benchmarks. But in this case time is
sufficient, as the system time is all you need to know. perf confirms
this time is almost all spent zeroing pages.
Re: [Numpy-discussion] allocated memory cache for numpy
On Mon, Feb 17, 2014 at 7:31 PM, Julian Taylor <jtaylor.deb...@googlemail.com> wrote:
> hi,
> I noticed that during some simplistic benchmarks (e.g.
> https://github.com/numpy/numpy/issues/4310) a lot of time is spent in
> the kernel zeroing pages.
> This is because under linux glibc will always allocate large memory
> blocks with mmap. As these pages can come from other processes the
> kernel must zero them for security reasons.

Do you have numbers for 'a lot of time'? Is the above script the exact
one you used for benchmarking this issue?

> For memory within the numpy process this is unnecessary and possibly a
> large overhead for the many temporaries numpy creates.
>
> The behavior of glibc can be tuned to change the threshold at which it
> starts using mmap but that would be a platform specific fix.
>
> I was thinking about adding a thread local cache of pointers to
> allocated memory.
> When an array is created it tries to get its memory from the cache and
> when it's deallocated it returns it to the cache.
> The threshold and cached memory block sizes could be adaptive depending
> on the application workload.
>
> For simplistic temporary heavy benchmarks this eliminates the time spent
> in the kernel (the system time reported by time).

For this kind of setup, I would advise to look into perf on linux. It
should be much more precise than time.

If nobody beats me to it, I can try to look at this this WE,

> But I don't know how relevant this is for real world applications.
> Have you noticed large amounts of time spent in the kernel in your apps?

In my experience, more time is spent on figuring out how to spare memory
than on speeding up this kind of operation for 'real life applications'
(TM). What happens to your benchmark if you tune malloc to not use mmap
at all?

David

> I also found this paper which describes pretty much exactly what I'm
> proposing:
> pyhpc.org/workshop/papers/Doubling.pdf
>
> Does anyone know why their changes were never incorporated in numpy? I
> couldn't find a reference in the list archive.
Re: [Numpy-discussion] allocated memory cache for numpy
On 02/17/2014 06:56 PM, Nathaniel Smith wrote:
> On Mon, Feb 17, 2014 at 3:55 PM, Stefan Seefeld wrote:
>> On 02/17/2014 03:42 PM, Nathaniel Smith wrote:
>>> Another optimization we should consider that might help a lot in the
>>> same situations where this would help: for code called from the
>>> cpython eval loop, it's afaict possible to determine which inputs are
>>> temporaries by checking their refcnt. In the second call to __add__ in
>>> '(a + b) + c', the temporary will have refcnt 1, while the other
>>> arrays will all have refcnt >1. In such cases (subject to various
>>> sanity checks on shape, dtype, etc) we could elide temporaries by
>>> reusing the input array for the output. The risk is that there may be
>>> some code out there that calls these operations directly from C with
>>> non-temp arrays that nonetheless have refcnt 1, but we should at least
>>> investigate the feasibility. E.g. maybe we can do the optimization for
>>> tp_add but not PyArray_Add.
>>
>> For element-wise operations such as the above, wouldn't it be even
>> better to use loop fusion, by evaluating the entire compound expression
>> per element, instead of each individual operation? That would require
>> methods such as __add__ to return an operation object, rather than the
>> result value. I believe a technique like that is used in the numexpr
>> package (https://github.com/pydata/numexpr), which I saw announced here
>> recently...
>
> Hi Stefan (long time no see!),

Indeed! :-)

> Sure, that would be an excellent thing, but adding a delayed
> evaluation engine to numpy is a big piece of new code, and we'd want
> to make it something you opt in to explicitly. (There are too many
> weird potential issues with e.g. errors showing up at some far away
> place from the actual broken code, due to evaluation being delayed to
> there.) By contrast, the optimization suggested here is a tiny change
> we could do now, and would still be useful even in the hypothetical
> future where we do have lazy evaluation, for anyone who doesn't use
> it.

Sure, I fully agree. I didn't mean to suggest this as an alternative to a
focused memory management optimization. Still, it seems this would be a
nice project (perhaps even under the GSoC umbrella). It could be
controlled by a metaclass (substituting appropriate ndarray methods), and
thus could be enabled separately and explicitly.

Anyhow, just an idea for someone else to pick up. :-)

Stefan

--
      ...ich hab' noch einen Koffer in Berlin...
Re: [Numpy-discussion] allocated memory cache for numpy
On 17.02.2014 22:27, Sturla Molden wrote:
> Nathaniel Smith wrote:
>> Also, I'd be pretty wary of caching large chunks of unused memory. People
>> already have a lot of trouble understanding their program's memory usage,
>> and getting rid of 'greedy free' will make this even worse.
>
> A cache would only be needed when there is a lot of computing going on, so
> one could set an "early expire date" on cached chunks. Anything older than
> a second or two could be returned. A cache would thus require a garbage
> collector thread.

I was thinking of something much simpler, just a layer of pointer stacks
for different allocation sizes, the larger the size the smaller the
cache, with pessimistic defaults.
e.g. the largest default cache layer is 128MB and with one or two
entries so we can cache temporally close operations like a + (c * b).
Maybe an age counter can be included that clears the largest old entry
if it hasn't been used for X allocations.
We can also monitor the allocations and make the maximum cache layer a
fraction of the biggest allocation that occurred during the runtime.
Devising a good scheme is probably tricky, but we can start out with
something simple like limiting the cache to 512MB, which will already
benefit many cases while still being an acceptable amount of memory to
waste on a modern machine.
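The layered pointer-stack idea can be modeled in a few lines (a toy model
with made-up layer sizes and depths, not a proposed implementation; the
real thing would sit next to numpy's C allocation routines):

```python
class PointerStackCache:
    """Per-size-class stacks of freed blocks; the larger the class,
    the fewer entries it may hold (pessimistic defaults)."""

    # (max block size, max cached entries) per layer -- illustrative numbers
    LAYERS = [(1 << 20, 8), (16 << 20, 4), (128 << 20, 2)]

    def __init__(self):
        self.stacks = [[] for _ in self.LAYERS]

    def _layer(self, size):
        for i, (limit, _) in enumerate(self.LAYERS):
            if size <= limit:
                return i
        return None                      # too large to cache at all

    def release(self, size, block):
        """On deallocation: keep the block if its layer has room."""
        i = self._layer(size)
        if i is not None and len(self.stacks[i]) < self.LAYERS[i][1]:
            self.stacks[i].append((size, block))
            return True
        return False                     # cache full or block too big: really free

    def acquire(self, size):
        """On allocation: reuse the most recently freed block that fits."""
        i = self._layer(size)
        if i is None:
            return None
        stack = self.stacks[i]
        for j in range(len(stack) - 1, -1, -1):
            if stack[j][0] >= size:      # cached block is big enough
                return stack.pop(j)[1]
        return None                      # miss: fall through to malloc/mmap

cache = PointerStackCache()
cache.release(64 << 20, "blk")           # a 64 MB temporary is freed
hit = cache.acquire(64 << 20)            # the next same-sized alloc reuses it
```

Searching newest-first within a layer favors exactly the temporally close
reuse pattern of expressions like a + (c * b).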
Re: [Numpy-discussion] allocated memory cache for numpy
On Mon, Feb 17, 2014 at 3:55 PM, Stefan Seefeld wrote:
> On 02/17/2014 03:42 PM, Nathaniel Smith wrote:
>> Another optimization we should consider that might help a lot in the
>> same situations where this would help: for code called from the
>> cpython eval loop, it's afaict possible to determine which inputs are
>> temporaries by checking their refcnt. In the second call to __add__ in
>> '(a + b) + c', the temporary will have refcnt 1, while the other
>> arrays will all have refcnt >1. In such cases (subject to various
>> sanity checks on shape, dtype, etc) we could elide temporaries by
>> reusing the input array for the output. The risk is that there may be
>> some code out there that calls these operations directly from C with
>> non-temp arrays that nonetheless have refcnt 1, but we should at least
>> investigate the feasibility. E.g. maybe we can do the optimization for
>> tp_add but not PyArray_Add.
>
> For element-wise operations such as the above, wouldn't it be even
> better to use loop fusion, by evaluating the entire compound expression
> per element, instead of each individual operation? That would require
> methods such as __add__ to return an operation object, rather than the
> result value. I believe a technique like that is used in the numexpr
> package (https://github.com/pydata/numexpr), which I saw announced here
> recently...

Hi Stefan (long time no see!),

Sure, that would be an excellent thing, but adding a delayed evaluation
engine to numpy is a big piece of new code, and we'd want to make it
something you opt in to explicitly. (There are too many weird potential
issues with e.g. errors showing up at some far away place from the actual
broken code, due to evaluation being delayed to there.) By contrast, the
optimization suggested here is a tiny change we could do now, and would
still be useful even in the hypothetical future where we do have lazy
evaluation, for anyone who doesn't use it.
-n
Re: [Numpy-discussion] allocated memory cache for numpy
Nathaniel Smith wrote:
> Also, I'd be pretty wary of caching large chunks of unused memory. People
> already have a lot of trouble understanding their program's memory usage,
> and getting rid of 'greedy free' will make this even worse.

A cache would only be needed when there is a lot of computing going on, so
one could set an "early expire date" on cached chunks. Anything older than
a second or two could be returned. A cache would thus require a garbage
collector thread.

Sturla
Re: [Numpy-discussion] allocated memory cache for numpy
On 02/17/2014 03:42 PM, Nathaniel Smith wrote:
> Another optimization we should consider that might help a lot in the
> same situations where this would help: for code called from the
> cpython eval loop, it's afaict possible to determine which inputs are
> temporaries by checking their refcnt. In the second call to __add__ in
> '(a + b) + c', the temporary will have refcnt 1, while the other
> arrays will all have refcnt >1. In such cases (subject to various
> sanity checks on shape, dtype, etc) we could elide temporaries by
> reusing the input array for the output. The risk is that there may be
> some code out there that calls these operations directly from C with
> non-temp arrays that nonetheless have refcnt 1, but we should at least
> investigate the feasibility. E.g. maybe we can do the optimization for
> tp_add but not PyArray_Add.

For element-wise operations such as the above, wouldn't it be even
better to use loop fusion, by evaluating the entire compound expression
per element, instead of each individual operation? That would require
methods such as __add__ to return an operation object, rather than the
result value. I believe a technique like that is used in the numexpr
package (https://github.com/pydata/numexpr), which I saw announced here
recently...

FWIW,
Stefan

PS: Such a loop-fusion technique would also open the door to other
optimizations, such as vectorization (simd)...

--
      ...ich hab' noch einen Koffer in Berlin...
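The "return an operation object" idea can be illustrated with a toy
deferred-evaluation wrapper (purely illustrative; numexpr's actual engine
compiles an expression string and evaluates it blockwise, it does not use
a class like this):

```python
import numpy as np

class Expr:
    """A leaf holds an array; an inner node holds a ufunc and operands.
    __add__/__mul__ build an expression tree instead of computing;
    evaluate() walks the tree once at the end."""

    def __init__(self, value=None, op=None, args=()):
        self.value, self.op, self.args = value, op, args

    def __add__(self, other):
        return Expr(op=np.add, args=(self, other))

    def __mul__(self, other):
        return Expr(op=np.multiply, args=(self, other))

    def evaluate(self):
        if self.op is None:
            return self.value
        return self.op(*(a.evaluate() for a in self.args))

a = Expr(np.arange(5.0))
b = Expr(np.ones(5))
c = Expr(np.full(5, 2.0))
result = (a + b * c).evaluate()   # nothing is computed until here
```

A real fusing engine would additionally traverse the tree per block or per
element so no full-size temporaries are materialized; this sketch still
creates them, it only delays the work.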
Re: [Numpy-discussion] allocated memory cache for numpy
On 17 Feb 2014 15:17, "Sturla Molden" wrote:
> Julian Taylor wrote:
>> When an array is created it tries to get its memory from the cache and
>> when it's deallocated it returns it to the cache.
>
> Good idea, however there is already a C function that does this. It uses
> a heap to keep the cached memory blocks sorted according to size. You
> know it as malloc — and is why we call this allocation from the heap.
> Which by the way is what NumPy already does. ;-)

Common malloc implementations are not well optimized for programs that
have frequent, short-lived, large-sized allocations. Usually they optimize
for short-lived allocations of small sizes. It's totally plausible that we
could do a better job in the common case of array operations like
'a + b + c + d' that allocate and free a bunch of same-sized temporary
arrays as they go. (Right now, if those arrays are large, that expression
will always generate multiple mmap/munmap calls.) The question is to what
extent numpy programs are bottlenecked by such allocations.

Also, I'd be pretty wary of caching large chunks of unused memory. People
already have a lot of trouble understanding their program's memory usage,
and getting rid of 'greedy free' will make this even worse.

Another optimization we should consider that might help a lot in the same
situations where this would help: for code called from the cpython eval
loop, it's afaict possible to determine which inputs are temporaries by
checking their refcnt. In the second call to __add__ in '(a + b) + c', the
temporary will have refcnt 1, while the other arrays will all have refcnt
>1. In such cases (subject to various sanity checks on shape, dtype, etc)
we could elide temporaries by reusing the input array for the output. The
risk is that there may be some code out there that calls these operations
directly from C with non-temp arrays that nonetheless have refcnt 1, but
we should at least investigate the feasibility. E.g. maybe we can do the
optimization for tp_add but not PyArray_Add.

-n
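The refcnt-1 condition can be seen from pure Python (a demonstration of
the invariant only; the actual check would live in numpy's C number-slot
handlers):

```python
import sys
import numpy as np

a = np.arange(10)
b = np.arange(10)
c = np.arange(10)

# sys.getrefcount reports one more than the true count (it holds a
# reference to its own argument), so a plain bound name shows at least 2:
print(sys.getrefcount(a))

# Inside (a + b) + c the interpreter holds the only reference to the
# intermediate (a + b), so its true refcnt is 1 when the second __add__
# runs -- exactly the condition the proposed elision keys on.
d = (a + b) + c
```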
Re: [Numpy-discussion] allocated memory cache for numpy
On 17.02.2014 21:16, Sturla Molden wrote:
> Julian Taylor wrote:
>> When an array is created it tries to get its memory from the cache and
>> when it's deallocated it returns it to the cache.
>
> Good idea, however there is already a C function that does this. It uses
> a heap to keep the cached memory blocks sorted according to size. You
> know it as malloc — and is why we call this allocation from the heap.
> Which by the way is what NumPy already does. ;-)

not with glibc, glibc has no cache for mmap allocated memory. It does
cache sbrk allocated memory and uses a dynamic threshold for using it,
but it's tuned for generic applications so the maximum threshold is very
low, I think it's 32 MB. Far too low for many numerical applications.
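For experimentation, the threshold mentioned here can be pinned without
recompiling anything: per glibc's mallopt(3) documentation, the
MALLOC_MMAP_THRESHOLD_ environment variable fixes the mmap cutoff, so
freed blocks below it go back to the heap instead of being munmap'd (a
platform-specific workaround, as noted elsewhere in the thread; other
libcs simply ignore the variable):

```python
import os
import subprocess
import sys

# Raise the cutoff to ~512 MB for a child process, so a temporary-heavy
# workload below that size never touches mmap/munmap on glibc systems:
env = dict(os.environ, MALLOC_MMAP_THRESHOLD_=str(512 * 1024 * 1024))
snippet = "x = [bytearray(8 * 1024 * 1024) for _ in range(10)]"
result = subprocess.run([sys.executable, "-c", snippet], env=env)
```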
Re: [Numpy-discussion] allocated memory cache for numpy
Julian Taylor wrote:
> When an array is created it tries to get its memory from the cache and
> when it's deallocated it returns it to the cache.

Good idea, however there is already a C function that does this. It uses a
heap to keep the cached memory blocks sorted according to size. You know
it as malloc — and is why we call this allocation from the heap. Which by
the way is what NumPy already does. ;-)

Sturla
[Numpy-discussion] allocated memory cache for numpy
hi,

I noticed that during some simplistic benchmarks (e.g.
https://github.com/numpy/numpy/issues/4310) a lot of time is spent in the
kernel zeroing pages. This is because under linux glibc will always
allocate large memory blocks with mmap. As these pages can come from other
processes, the kernel must zero them for security reasons.

For memory within the numpy process this is unnecessary and possibly a
large overhead for the many temporaries numpy creates.

The behavior of glibc can be tuned to change the threshold at which it
starts using mmap, but that would be a platform specific fix.

I was thinking about adding a thread local cache of pointers to allocated
memory. When an array is created it tries to get its memory from the cache
and when it's deallocated it returns it to the cache. The threshold and
cached memory block sizes could be adaptive depending on the application
workload.

For simplistic temporary heavy benchmarks this eliminates the time spent
in the kernel (the system time reported by time). But I don't know how
relevant this is for real world applications. Have you noticed large
amounts of time spent in the kernel in your apps?

I also found this paper which describes pretty much exactly what I'm
proposing: pyhpc.org/workshop/papers/Doubling.pdf

Does anyone know why their changes were never incorporated in numpy? I
couldn't find a reference in the list archive.
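The reported effect can be observed with nothing more than os.times(),
which separates user CPU time from system (kernel) time; for arrays above
glibc's mmap threshold each temporary is a freshly mmap'd block, and the
kernel's page zeroing accumulates as system time (the array size below is
an arbitrary choice, just large enough to clear the default threshold):

```python
import os
import numpy as np

n = 5_000_000                      # ~40 MB per float64 array
a, b, c = (np.ones(n) for _ in range(3))

t0 = os.times()
for _ in range(20):
    d = a + b + c                  # (a + b) makes a large temporary each pass
t1 = os.times()

print(f"user: {t1.user - t0.user:.2f}s  system: {t1.system - t0.system:.2f}s")
```

On a glibc/Linux system a noticeable fraction of the elapsed time shows up
in the system column; with a working allocation cache it should drop to
near zero.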