Re: [Numpy-discussion] memory-efficient loadtxt

2012-10-03 Thread Paul Anton Letnes

On 3. okt. 2012, at 18:22, Chris Barker wrote:

> On Wed, Oct 3, 2012 at 9:05 AM, Paul Anton Letnes
>  wrote:
> 
>>> I'm not sure the problem you are trying to solve -- accumulating in a
>>> list is pretty efficient anyway -- not a whole lot overhead.
>> 
>> Oh, there's significant overhead, since we're not talking of a list - we're 
>> talking of a list-of-lists.
> 
> hmm, a list of numpy scalars (custom dtype) would be a better option,
> though maybe not all that much better -- still an extra pointer and
> Python object for each row.
> 
> 
>> I see your point - but if you're to return a single array, and the file is 
>> close to the total system memory, you've still got a factor of 2 issue when 
>> shuffling the binary data from the accumulator into the result array. That 
>> is, unless I'm missing something here?
> 
> Indeed, I think that's how my current accumulator works -- the
> __array__() method returns a copy of the data buffer, so that you
> won't accidentally re-allocate it under the hood later and screw up
> the returned version.
> 
> But it is indeed accumulating in a numpy array, so it should be
> possible, maybe even easy, to turn it into a regular array without a
> data copy. You'd just have to destroy the original somehow (or mark it
> as never-resize) so you wouldn't have the clash. Messing with the
> OWNDATA flag might take care of that.
> 
> But it seems Wes has a better solution.

Indeed.

> One other note, though -- if you have arrays that are that close to
> max system memory, you are very likely to have other trouble anyway --
> numpy does make a lot of copies!

That's true. Now, I'm not worried about this myself, but several people have 
complained about this on the mailing list, and it seemed like an easy fix. Oh 
well, it's too late for it now, anyway.

Paul
___
NumPy-Discussion mailing list
NumPy-Discussion@scipy.org
http://mail.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] memory-efficient loadtxt

2012-10-03 Thread Chris Barker
On Wed, Oct 3, 2012 at 9:05 AM, Paul Anton Letnes
 wrote:

>> I'm not sure the problem you are trying to solve -- accumulating in a
>> list is pretty efficient anyway -- not a whole lot overhead.
>
> Oh, there's significant overhead, since we're not talking of a list - we're 
> talking of a list-of-lists.

hmm, a list of numpy scalars (custom dtype) would be a better option,
though maybe not all that much better -- still an extra pointer and
Python object for each row.


> I see your point - but if you're to return a single array, and the file is 
> close to the total system memory, you've still got a factor of 2 issue when 
> shuffling the binary data from the accumulator into the result array. That 
> is, unless I'm missing something here?

Indeed, I think that's how my current accumulator works -- the
__array__() method returns a copy of the data buffer, so that you
won't accidentally re-allocate it under the hood later and screw up
the returned version.

But it is indeed accumulating in a numpy array, so it should be
possible, maybe even easy, to turn it into a regular array without a
data copy. You'd just have to destroy the original somehow (or mark it
as never-resize) so you wouldn't have the clash. Messing with the
OWNDATA flag might take care of that.
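That grow-in-place idea might look roughly like this (a minimal sketch, not the actual posted Accumulator code; the class name, the geometric growth factor, and the use of `ndarray.resize(..., refcheck=False)` are all assumptions for illustration):

```python
import numpy as np

class Accumulator:
    """Hypothetical numpy-backed accumulator: grows a 1-D buffer
    geometrically and hands back the filled portion without an
    extra copy of the data."""

    def __init__(self, dtype=float):
        self._buf = np.empty(16, dtype=dtype)
        self._n = 0

    def append(self, value):
        if self._n == len(self._buf):
            # Double the buffer in place; refcheck=False skips the
            # reference-count safety check so resize can succeed.
            self._buf.resize(2 * len(self._buf), refcheck=False)
        self._buf[self._n] = value
        self._n += 1

    def to_array(self):
        # Trim the buffer to the data actually stored and return it.
        # The accumulator must not be reused afterwards -- this is the
        # "destroy the original" idea from the discussion.
        self._buf.resize(self._n, refcheck=False)
        return self._buf

acc = Accumulator()
for v in [1.0, 2.0, 3.0]:
    acc.append(v)
a = acc.to_array()
```

The copy in `__array__()` mentioned above is exactly what `to_array` here avoids, at the cost of invalidating the accumulator.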

But it seems Wes has a better solution.

One other note, though -- if you have arrays that are that close to
max system memory, you are very likely to have other trouble anyway --
numpy does make a lot of copies!

-Chris



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] memory-efficient loadtxt

2012-10-03 Thread Paul Anton Letnes

On 1. okt. 2012, at 21:07, Chris Barker wrote:

> Paul,
> 
> Nice to see someone working on these issues, but:
> 
> I'm not sure the problem you are trying to solve -- accumulating in a
> list is pretty efficient anyway -- not a whole lot overhead.

Oh, there's significant overhead, since we're not talking of a list - we're 
talking of a list-of-lists. My guesstimate from my hacking session (off the top 
of my head - I don't have my benchmarks in working memory :) is around 3-5 
times more memory with the list-of-lists approach for a single column / 1D 
array, which presumably is the worst case (a length 1 list for each line of 
input). Hence, if you want to load a 2 GB file into RAM on a machine with 4 GB 
of the stuff, you're screwed with the old approach and a happy camper with mine.
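To see where that overhead comes from, one can measure it directly (an illustrative sketch only; the exact ratio depends on the Python version and platform):

```python
import sys
import numpy as np

n = 10_000
# Worst case from the discussion: one value per line, so each row is a
# length-1 Python list holding a Python float object.
rows = [[float(i)] for i in range(n)]
arr = np.empty(n, dtype=float)  # contiguous buffer: 8 bytes per value

# Count the outer list, each inner list, and each float object.
list_bytes = sys.getsizeof(rows) + sum(
    sys.getsizeof(r) + sys.getsizeof(r[0]) for r in rows)
arr_bytes = arr.nbytes

print(f"list-of-lists: {list_bytes} bytes, ndarray: {arr_bytes} bytes")
```

On CPython the per-row list and float objects dwarf the 8 bytes the final array needs, which is the overhead the guesstimate above refers to.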

> But if you do want to improve that, it may be better to change the
> accumulating method, rather than doing the double-read thing. I've
> written, and posted here, code that provides an Accumulator that uses
> numpy internally, so not much memory overhead. In the end, it's not
> any faster than accumulating in a list and then converting to an
> array, but it does use less memory.

I see your point - but if you're to return a single array, and the file is 
close to the total system memory, you've still got a factor of 2 issue when 
shuffling the binary data from the accumulator into the result array. That is, 
unless I'm missing something here?

Cheers
Paul



Re: [Numpy-discussion] memory-efficient loadtxt

2012-10-03 Thread Paul Anton Letnes

On 3. okt. 2012, at 17:48, Wes McKinney wrote:

> On Monday, October 1, 2012, Chris Barker wrote:
> Paul,
> 
> Nice to see someone working on these issues, but:
> 
> I'm not sure the problem you are trying to solve -- accumulating in a
> list is pretty efficient anyway -- not a whole lot overhead.
> 
> But if you do want to improve that, it may be better to change the
> accumulating method, rather than doing the double-read thing. I've
> written, and posted here, code that provides an Accumulator that uses
> numpy internally, so not much memory overhead. In the end, it's not
> any faster than accumulating in a list and then converting to an
> array, but it does use less memory.
> 
> I also have a Cython version that is not quite done (darn regular job
> getting in the way) that is both faster and more memory efficient.
> 
> Also, frankly, just writing the array pre-allocation and re-sizing
> code into loadtxt would not be a whole lot of code either, and would
> be both fast and memory efficient.
> 
> Let me know if you want any of my code to play with.
> 
> >  However, I got the impression that someone was
> > working on a More Advanced (TM) C-based file reader, which will
> > replace loadtxt;
> 
> yes -- I wonder what happened with that? Anyone?
> 
> -CHB
> 
> 
> 
> > this patch is intended as a useful thing to have
> > while we're waiting for that to appear.
> >
> > The patch passes all tests in the test suite, and documentation for
> > the kwarg has been added. I've modified all tests to include the
> > seekable kwarg, but that was mostly to check that all tests are passed
> > also with this kwarg. I guess it's a bit too late for 1.7.0 though?
> >
> > Should I make a pull request? I'm happy to take any and all
> > suggestions before I do.
> >
> > Cheers
> > Paul
> 
> I've finally built a new, very fast C-based tokenizer/parser with type 
> inference, NA-handling, etc. for pandas sporadically over the last month-- 
> it's almost ready to ship. It's roughly an order of magnitude faster than 
> loadtxt and uses very little temporary space. Should be easy to push upstream 
> into NumPy to replace the innards of np.loadtxt if I can get a bit of help 
> with the plumbing (it already yields structured arrays in addition to pandas 
> DataFrames so there isn't a great deal that needs doing). 
> 
> Blog post with CPU and memory benchmarks to follow-- will post a link here. 
> 
> - Wes


So Chris, it looks like Wes has us beaten in every conceivable way. Hey, that's a 
good thing :)  I suppose the thing to do now is to make sure Wes's function 
passes the loadtxt test suite?

Paul



Re: [Numpy-discussion] memory-efficient loadtxt

2012-10-03 Thread Wes McKinney
On Monday, October 1, 2012, Chris Barker wrote:

> Paul,
>
> Nice to see someone working on these issues, but:
>
> I'm not sure the problem you are trying to solve -- accumulating in a
> list is pretty efficient anyway -- not a whole lot overhead.
>
> But if you do want to improve that, it may be better to change the
> accumulating method, rather than doing the double-read thing. I've
> written, and posted here, code that provides an Accumulator that uses
> numpy internally, so not much memory overhead. In the end, it's not
> any faster than accumulating in a list and then converting to an
> array, but it does use less memory.
>
> I also have a Cython version that is not quite done (darn regular job
> getting in the way) that is both faster and more memory efficient.
>
> Also, frankly, just writing the array pre-allocation and re-sizing
> code into loadtxt would not be a whole lot of code either, and would
> be both fast and memory efficient.
>
> Let me know if you want any of my code to play with.
>
> >  However, I got the impression that someone was
> > working on a More Advanced (TM) C-based file reader, which will
> > replace loadtxt;
>
> yes -- I wonder what happened with that? Anyone?
>
> -CHB
>
>
>
> > this patch is intended as a useful thing to have
> > while we're waiting for that to appear.
> >
> > The patch passes all tests in the test suite, and documentation for
> > the kwarg has been added. I've modified all tests to include the
> > seekable kwarg, but that was mostly to check that all tests are passed
> > also with this kwarg. I guess it's a bit too late for 1.7.0 though?
> >
> > Should I make a pull request? I'm happy to take any and all
> > suggestions before I do.
> >
> > Cheers
> > Paul
>

I've finally built a new, very fast C-based tokenizer/parser with type
inference, NA-handling, etc. for pandas sporadically over the last month--
it's almost ready to ship. It's roughly an order of magnitude faster than
loadtxt and uses very little temporary space. Should be easy to push
upstream into NumPy to replace the innards of np.loadtxt if I can get a bit
of help with the plumbing (it already yields structured arrays in addition
to pandas DataFrames so there isn't a great deal that needs doing).

Blog post with CPU and memory benchmarks to follow-- will post a link here.

- Wes


Re: [Numpy-discussion] memory-efficient loadtxt

2012-10-01 Thread Chris Barker
Paul,

Nice to see someone working on these issues, but:

I'm not sure the problem you are trying to solve -- accumulating in a
list is pretty efficient anyway -- not a whole lot overhead.

But if you do want to improve that, it may be better to change the
accumulating method, rather than doing the double-read thing. I've
written, and posted here, code that provides an Accumulator that uses
numpy internally, so not much memory overhead. In the end, it's not
any faster than accumulating in a list and then converting to an
array, but it does use less memory.

I also have a Cython version that is not quite done (darn regular job
getting in the way) that is both faster and more memory efficient.

Also, frankly, just writing the array pre-allocation and re-sizing
code into loadtxt would not be a whole lot of code either, and would
be both fast and memory efficient.
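That pre-allocate-and-resize approach might look roughly like this (a hedged sketch; `read_resizing` is a hypothetical helper, not loadtxt's actual code, and real loadtxt has to handle delimiters, comments, and dtype conversion far more generally):

```python
import numpy as np

def read_resizing(lines, ncols, dtype=float):
    """Hypothetical reader: start with a small 2-D block, double the
    row count whenever it fills up, and trim once at the end."""
    out = np.empty((8, ncols), dtype=dtype)
    n = 0
    for line in lines:
        if n == out.shape[0]:
            # Growing the leading dimension keeps existing rows in
            # place, since the data is row-major contiguous.
            out.resize((2 * out.shape[0], ncols), refcheck=False)
        out[n] = [dtype(x) for x in line.split()]
        n += 1
    out.resize((n, ncols), refcheck=False)  # trim to the rows actually read
    return out

data = read_resizing(["1 2", "3 4", "5 6"], ncols=2)
```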

Let me know if you want any of my code to play with.

>  However, I got the impression that someone was
> working on a More Advanced (TM) C-based file reader, which will
> replace loadtxt;

yes -- I wonder what happened with that? Anyone?

-CHB



> this patch is intended as a useful thing to have
> while we're waiting for that to appear.
>
> The patch passes all tests in the test suite, and documentation for
> the kwarg has been added. I've modified all tests to include the
> seekable kwarg, but that was mostly to check that all tests are passed
> also with this kwarg. I guess it's a bit too late for 1.7.0 though?
>
> Should I make a pull request? I'm happy to take any and all
> suggestions before I do.
>
> Cheers
> Paul



-- 

Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/OR&R(206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

chris.bar...@noaa.gov


Re: [Numpy-discussion] memory-efficient loadtxt

2012-09-30 Thread Paul Anton Letnes
For convenience and clarity, this is the diff in question:
https://github.com/Dynetrekk/numpy-1/commit/5bde67531a2005ef80a2690a75c65cebf97c9e00

And this is my numpy fork:
https://github.com/Dynetrekk/numpy-1/

Paul


On Sun, Sep 30, 2012 at 4:14 PM, Paul Anton Letnes
 wrote:
> Hello everyone,
>
> I've modified loadtxt to make it (potentially) more memory efficient.
> The idea is that if a user passes a seekable file, (s)he can also pass
> the 'seekable=True' kwarg. Then, loadtxt will count the number of
> lines (containing data) and allocate an array of exactly the right
> size to hold the loaded data. The downside is that the line counting
> more than doubles the runtime, as it loops over the file twice, and
> there's a sort-of unnecessary np.array function call in the loop. The
> branch is called faster-loadtxt, which is silly due to the runtime
> doubling, but I'm hoping that the false advertising is acceptable :)
> (I naively expected a speedup by removing some needless list
> manipulation.)
>
> I'm pretty sure that the function can be micro-optimized quite a bit
> here and there, and in particular, the main for loop is a bit
> duplicated right now. However, I got the impression that someone was
> working on a More Advanced (TM) C-based file reader, which will
> replace loadtxt; this patch is intended as a useful thing to have
> while we're waiting for that to appear.
>
> The patch passes all tests in the test suite, and documentation for
> the kwarg has been added. I've modified all tests to include the
> seekable kwarg, but that was mostly to check that all tests are passed
> also with this kwarg. I guess it's a bit too late for 1.7.0 though?
>
> Should I make a pull request? I'm happy to take any and all
> suggestions before I do.
>
> Cheers
> Paul
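The two-pass scheme described above can be sketched as follows (a simplified illustration, not the actual patch; `loadtxt_two_pass` is a hypothetical name, and the real code also threads the `seekable` kwarg through loadtxt's existing parsing machinery):

```python
import io
import numpy as np

def loadtxt_two_pass(f, dtype=float):
    """Hypothetical sketch: first pass counts the data lines, so the
    result array can be allocated exactly once; second pass fills it.
    Requires a seekable file -- hence the kwarg."""
    start = f.tell()
    nrows = sum(1 for line in f
                if line.strip() and not line.lstrip().startswith('#'))
    f.seek(start)  # rewind for the second pass

    out = None
    i = 0
    for line in f:
        stripped = line.strip()
        if not stripped or stripped.startswith('#'):
            continue  # skip blank and comment lines, as in the count
        row = [dtype(x) for x in stripped.split()]
        if out is None:
            # Allocate the full result on the first data line.
            out = np.empty((nrows, len(row)), dtype=dtype)
        out[i] = row
        i += 1
    return out

data = loadtxt_two_pass(io.StringIO("# header\n1 2 3\n4 5 6\n"))
```

The trade-off is exactly the one noted above: peak memory drops to a single allocation, but the file is tokenized twice, which roughly doubles the runtime.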