Re: [Numpy-discussion] Subdividing NumPy array into Regular Grid

2014-10-26 Thread Paul Hobson
I think you want np.meshgrid
-paul
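
A minimal sketch of the np.meshgrid route (editor's illustration, not code
from the original reply). Note that a 10x10 grid of cells needs 11x11 corner
coordinates, not 10x10:

import numpy as np

lat_edges = np.linspace(60, 50, 11)    # 11 edges -> 10 rows of cells
lon_edges = np.linspace(110, 120, 11)  # 11 edges -> 10 columns of cells

lon_grid, lat_grid = np.meshgrid(lon_edges, lat_edges)  # each shape (11, 11)

# Corner arrays for all 100 cells at once; each has shape (10, 10).
ulx, uly = lon_grid[:-1, :-1], lat_grid[:-1, :-1]  # upper left
urx, ury = lon_grid[:-1, 1:],  lat_grid[:-1, 1:]   # upper right
lrx, lry = lon_grid[1:, 1:],   lat_grid[1:, 1:]    # lower right
llx, lly = lon_grid[1:, :-1],  lat_grid[1:, :-1]   # lower left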

On Sun, Oct 26, 2014 at 2:09 AM, Artur Bercik  wrote:

> I have a rectangle with the following coordinates:
>
> import numpy as np
>
> ulx, uly = (110, 60)  ## upper left lon, upper left lat
> urx, ury = (120, 60)  ## upper right lon, upper right lat
> lrx, lry = (120, 50)  ## lower right lon, lower right lat
> llx, lly = (110, 50)  ## lower left lon, lower left lat
>
> I want to divide that single rectangle into a regular grid of 100 cells,
> and want to calculate the (ulx, uly), (urx, ury), (lrx, lry), and (llx, lly)
> for each cell separately:
>
> lats = np.linspace(60, 50, 10)
> lons = np.linspace(110, 120, 10)
>
> lats = np.repeat(lats,10).reshape(10,10)
> lons = np.tile(lons,10).reshape(10,10)
>
> I cannot figure out what to do next.
>
> Is somebody familiar with this kind of problem?
>


Re: [Numpy-discussion] Memory efficient alternative for np.loadtxt and np.genfromtxt

2014-10-26 Thread RayS

At 06:32 AM 10/26/2014, you wrote:

On Sun, Oct 26, 2014 at 1:21 PM, Eelco Hoogendoorn
 wrote:
> I'm not sure why the memory doubling is necessary. Isn't it possible to
> preallocate the arrays and write to them?

Not without reading the whole file first to know how many rows to preallocate.


Seems to me that loadtxt()
http://docs.scipy.org/doc/numpy/reference/generated/numpy.loadtxt.html
should have an optional shape argument. I often know how many rows I have
(the number of data samples) from other metadata.

Then:
- if the file is smaller for some reason (you're not sure and pad
  your estimate), it could do one of:
  - zero-pad the array
  - raise an exception
  - return a truncated view
- if larger, it could do one of:
  - raise an exception
  - return the data read so far (this would act like fileObject.read(size))
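
A rough sketch of the behaviour proposed above (editor's illustration; the
function and its on_short/on_long policies are hypothetical, not part of
numpy's API):

import numpy as np

def loadtxt_shaped(fname, nrows, ncols, dtype=float,
                   on_short='pad', on_long='truncate'):
    # Preallocate using the caller's row estimate.
    out = np.zeros((nrows, ncols), dtype=dtype)
    n = 0
    with open(fname) as f:
        for line in f:
            if n >= nrows:                 # file larger than the estimate
                if on_long == 'raise':
                    raise ValueError('file has more than %d rows' % nrows)
                break                      # act like fileObject.read(size)
            out[n] = [dtype(v) for v in line.split()]
            n += 1
    if n < nrows:                          # file smaller than the estimate
        if on_short == 'raise':
            raise ValueError('file has only %d rows' % n)
        if on_short == 'truncate':
            return out[:n]                 # truncated view
    return out                             # zero-padded if short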
- Ray S


Re: [Numpy-discussion] Memory efficient alternative for np.loadtxt and np.genfromtxt

2014-10-26 Thread Saullo Castro
I agree with @Daniele's point: storing huge arrays in text files might
indicate a bad process, but if these functions can be improved, why not?
Unless this turns out to be a burden to change.

Regarding the estimation of the array size, I don't see a big performance
loss when the file iterator is exhausted once more in order to estimate
the number of rows and pre-allocate the proper arrays, avoiding lists
of lists. The hardest part seems to be dealing with arrays of strings
(perhaps easily solved with dtype=object) and with structured arrays.
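
A minimal two-pass sketch of this estimate-then-preallocate idea (editor's
illustration; it assumes a seekable file rather than a stream, and ignores
comments and missing values):

import numpy as np

def loadtxt_two_pass(fname, dtype=float, delimiter=None):
    with open(fname) as f:
        nrows = sum(1 for _ in f)          # pass 1: count the rows
        f.seek(0)                          # rewind for pass 2
        row = f.readline().split(delimiter)
        out = np.empty((nrows, len(row)), dtype=dtype)
        out[0] = [dtype(v) for v in row]
        for i, line in enumerate(f, start=1):
            out[i] = [dtype(v) for v in line.split(delimiter)]
    return out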

Cheers,
Saullo


2014-10-26 18:00 GMT+01:00, the NumPy-Discussion digest (Message 1, Daniele
Nicolodi, quoted in relevant part):

> On 26/10/14 09:46, Saullo Castro wrote:
> > I would like to start working on a memory efficient alternative for
> > np.loadtxt and np.genfromtxt that uses arrays instead of lists to store
> > the data while the file iterator is exhausted.
>
> I'm of the opinion that if your workflow requires you to regularly load
> large arrays from text files, something else needs to be fixed rather
> than the numpy speed and memory usage in reading data from text files.
>
> There are a number of interoperable data formats that allow you to store
> data much more efficiently. HDF5 is one natural choice, maybe with the
> blosc compressor.


[Numpy-discussion] ANN: NumPy 1.9.1 release candidate

2014-10-26 Thread Julian Taylor
Hi,

We have finally finished the first release candidate of NumPy 1.9.1;
sorry for the week's delay.
As usual, the 1.9.1 release will be a bugfix-only release in the 1.9.x
series.
The tarballs and win32 binaries are available on sourceforge:
https://sourceforge.net/projects/numpy/files/NumPy/1.9.1rc1/

If no regressions show up, the final release is planned for next week.
The upgrade is recommended for all users of the 1.9.x series.

The following issues have been fixed:
* gh-5184: restore the linear edge behaviour of gradient as it was in < 1.9.
  The second-order behaviour is available via the `edge_order` keyword
  (see the short example after this list)
* gh-4007: workaround Accelerate sgemv crash on OSX 10.9
* gh-5100: restore object dtype inference from iterable objects without
`len()`
* gh-5163: avoid gcc-4.1.2 (red hat 5) miscompilation causing a crash
* gh-5138: fix nanmedian on arrays containing inf
* gh-5203: copy inherited masks in MaskedArray.__array_finalize__
* gh-2317: genfromtxt did not handle filling_values=0 correctly
* gh-5067: restore api of npy_PyFile_DupClose in python2
* gh-5063: cannot convert invalid sequence index to tuple
* gh-5082: Segmentation fault with argmin() on unicode arrays
* gh-5095: don't propagate subtypes from np.where
* gh-5104: np.inner segfaults with SciPy's sparse matrices
* gh-5136: Import dummy_threading if importing threading fails
* gh-5148: Make numpy import when run with Python flag '-OO'
* gh-5147: Einsum double contraction in particular order causes ValueError
* gh-479: Make f2py work with intent(in out)
* gh-5170: Make python2 .npy files readable in python3
* gh-5027: Use 'll' as the default length specifier for long long
* gh-4896: fix build error with MSVC 2013 caused by C99 complex support
* gh-4465: Make PyArray_PutTo respect writeable flag
* gh-5225: fix crash when using arange on datetime without dtype set
* gh-5231: fix build in c99 mode
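
For the gh-5184 item, a quick illustration of the restored default and the
opt-in keyword (editor's example, based on the description above):

import numpy as np

y = np.linspace(0, 1, 5) ** 2
np.gradient(y, 0.25)                # 1.9.1 default: linear (first-order) edges
np.gradient(y, 0.25, edge_order=2)  # opt back in to second-order edge behaviour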

Source tarballs, windows installers and release notes can be found at
https://sourceforge.net/projects/numpy/files/NumPy/1.9.1rc1/

Cheers,
Julian Taylor







Re: [Numpy-discussion] Memory efficient alternative for np.loadtxt and np.genfromtxt

2014-10-26 Thread Daniele Nicolodi
On 26/10/14 09:46, Saullo Castro wrote:
> I would like to start working on a memory efficient alternative for
> np.loadtxt and np.genfromtxt that uses arrays instead of lists to store
> the data while the file iterator is exhausted.

...

> I would be glad if you could share your experience on this matter.

I'm of the opinion that if your workflow requires you to regularly load
large arrays from text files, something else needs to be fixed rather
than the numpy speed and memory usage in reading data from text files.

There are a number of interoperable data formats that allow you to store
data much more efficiently. HDF5 is one natural choice, maybe with the
blosc compressor.
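
For instance, a small sketch of that route using PyTables, which bundles a
blosc-backed filter (editor's illustration, assuming PyTables is installed;
h5py works similarly but needs an external plugin for blosc):

import numpy as np
import tables

data = np.random.rand(1000, 100)

# Write a compressed, chunked array to an HDF5 file.
filters = tables.Filters(complevel=5, complib='blosc')
with tables.open_file('data.h5', mode='w') as h5:
    h5.create_carray(h5.root, 'data', obj=data, filters=filters)

# Read back only the slice you need -- no text parse, no 6x memory blowup.
with tables.open_file('data.h5', mode='r') as h5:
    first_rows = h5.root.data[:10]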

Cheers,
Daniele



Re: [Numpy-discussion] Memory efficient alternative for np.loadtxt and np.genfromtxt

2014-10-26 Thread Nathaniel Smith
On 26 Oct 2014 11:54, "Jeff Reback"  wrote:
>
> you should have a read here:
> http://wesmckinney.com/blog/?p=543
>
> going below the 2x memory usage on read-in is non-trivial and costly in
terms of performance

On Linux you can probably go below 2x overhead easily, by exploiting the
fact that realloc on large memory blocks is basically O(1) (yes really):
http://blog.httrack.com/blog/2014/04/05/a-story-of-realloc-and-laziness/

Sadly, OS X does not provide anything similar, and I can't tell for sure
about Windows.
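
As a rough sketch of that grow-as-you-read idea (editor's illustration;
np.resize itself may still copy -- the cheap-growth argument applies to the
realloc underneath on Linux):

import numpy as np

def read_growing(fname, ncols, dtype=float):
    cap = 1024                             # initial row capacity
    out = np.empty((cap, ncols), dtype=dtype)
    n = 0
    with open(fname) as f:
        for line in f:
            if n == cap:                   # buffer full: double the capacity
                cap *= 2
                out = np.resize(out, (cap, ncols))
            out[n] = [dtype(v) for v in line.split()]
            n += 1
    return out[:n]                         # view onto the filled rows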

Though on further thought, the numbers Wes quotes there aren't actually the
most informative - massif will tell you how much virtual memory you have
allocated, but a lot of that is going to be a pure VM accounting trick. The
output array memory will actually be allocated incrementally one block at a
time as you fill it in. This means that if you can free each temporary
chunk immediately after you copy it into the output array, then even simple
approaches can have very low overhead. It's possible pandas's actual
overhead is already closer to 1x than 2x, and this is just hidden by the
tools Wes is using to measure it.

-n

> On Oct 26, 2014, at 4:46 AM, Saullo Castro 
wrote:
>
>> I would like to start working on a memory efficient alternative for
np.loadtxt and np.genfromtxt that uses arrays instead of lists to store the
data while the file iterator is exhausted.
>>
>> The motivation came from this SO question:
>>
>> http://stackoverflow.com/q/26569852/832621
>>
>> where for huge arrays the current NumPy ASCII readers are really slow
and require ~6 times more memory. I tested this case with Pandas'
read_csv() and it required ~2 times more memory.
>>
>> I would be glad if you could share your experience on this matter.
>>
>> Greetings,
>> Saullo
>>


Re: [Numpy-discussion] Memory efficient alternative for np.loadtxt and np.genfromtxt

2014-10-26 Thread Jeff Reback
You are describing a special case where you know the data size a priori
(e.g. not streaming), where the dtypes are readily apparent from a small
sample, and where in general your data is not messy.

I would agree that if these conditions are satisfied you can achieve
closer to a 1x memory overhead.

Using bcolz is great, but probably not a realistic option as a dependency
for numpy (you should probably just memory-map the data directly instead);
though this has a big performance impact - so these things need to be
weighed.

Not all cases deserve the same treatment - chunking is often the best
option IMHO - it provides constant memory usage (though ultimately still
2x); combined with memory mapping it can provide fixed resource
utilization. A sketch of the chunked approach follows below.
> On Oct 26, 2014, at 9:41 AM, Daπid  wrote:
> 
> 
>> On 26 October 2014 12:54, Jeff Reback  wrote:
>> you should have a read here:
>> http://wesmckinney.com/blog/?p=543
>>
>> going below the 2x memory usage on read-in is non-trivial and costly in
>> terms of performance
> 
> 
> If you know in advance the number of rows (because it is in the header, 
> counted with wc -l, or any other prior information) you can preallocate the 
> array and fill in the numbers as you read, with virtually no overhead.
> 
> If the number of rows is unknown, an alternative is to use a chunked data
> container like Bcolz [1] (formerly carray) instead of Python structures. It
> may be used as such, or copied back to an ndarray if we want the memory to
> be aligned. Including a bit of compression, we can get the memory overhead
> to somewhere under 2x (depending on the dataset), at the cost of a little
> CPU time, and this could be very useful for large data and slow filesystems.
> 
> 
> /David.
> 
> [1] http://bcolz.blosc.org/


Re: [Numpy-discussion] Memory efficient alternative for np.loadtxt and np.genfromtxt

2014-10-26 Thread Derek Homeier
On 26 Oct 2014, at 02:21 pm, Eelco Hoogendoorn  
wrote:

> I'm not sure why the memory doubling is necessary. Isn't it possible to
> preallocate the arrays and write to them? I suppose this might be inefficient
> though, in case you end up reading only a small subset of rows out of a
> mostly corrupt file? But that seems to be a rather uncommon corner case.
>
> Either way, I'd say a doubling of memory use is fair game for numpy.
> Generality is more important than absolute performance. The most important
> thing is that temporary Python data structures are avoided. That shouldn't be
> too hard to accomplish, and would realize most of the performance and memory
> gains, I imagine.

Preallocation is not straightforward because the parser needs, in general,
to be able to work with streamed input.
I think I even still have a branch on github bypassing this on request (by
keyword argument).
But a factor of 2 is already a huge improvement over the factor of ~6 coming
from the current text readers buffering the entire input as a list of lists
of Python strings, not to speak of the vast performance gain from using a
parser implemented in C like pandas' - in fact, one of the last times this
subject came up, one suggestion was to steal pandas.read_csv and adapt it
as required.
Someone also posted some code, or a draft thereof, for using resizable
arrays quite a while ago, which would reduce the memory overhead for very
large arrays.

Cheers,
Derek





Re: [Numpy-discussion] Memory efficient alternative for np.loadtxt and np.genfromtxt

2014-10-26 Thread Daπid
On 26 October 2014 12:54, Jeff Reback  wrote:

> you should have a read here:
> http://wesmckinney.com/blog/?p=543
>
> going below the 2x memory usage on read-in is non-trivial and costly in
> terms of performance
>


If you know in advance the number of rows (because it is in the header,
counted with wc -l, or any other prior information) you can preallocate the
array and fill in the numbers as you read, with virtually no overhead.

If the number of rows is unknown, an alternative is to use a chunked data
container like Bcolz [1] (formerly carray) instead of Python structures. It
may be used as such, or copied back to an ndarray if we want the memory to
be aligned. Including a bit of compression, we can get the memory overhead
to somewhere under 2x (depending on the dataset), at the cost of a little
CPU time, and this could be very useful for large data and slow
filesystems.


/David.

[1] http://bcolz.blosc.org/


Re: [Numpy-discussion] Memory efficient alternative for np.loadtxt and np.genfromtxt

2014-10-26 Thread Robert Kern
On Sun, Oct 26, 2014 at 1:21 PM, Eelco Hoogendoorn
 wrote:
> I'm not sure why the memory doubling is necessary. Isn't it possible to
> preallocate the arrays and write to them?

Not without reading the whole file first to know how many rows to preallocate.

-- 
Robert Kern


Re: [Numpy-discussion] Memory efficient alternative for np.loadtxt and np.genfromtxt

2014-10-26 Thread Eelco Hoogendoorn
I'm not sure why the memory doubling is necessary. Isn't it possible to
preallocate the arrays and write to them? I suppose this might be
inefficient though, in case you end up reading only a small subset of rows
out of a mostly corrupt file? But that seems to be a rather uncommon corner
case.

Either way, I'd say a doubling of memory use is fair game for numpy.
Generality is more important than absolute performance. The most important
thing is that temporary Python data structures are avoided. That shouldn't
be too hard to accomplish, and would realize most of the performance and
memory gains, I imagine.

On Sun, Oct 26, 2014 at 12:54 PM, Jeff Reback  wrote:

> you should have a read here:
> http://wesmckinney.com/blog/?p=543
>
> going below the 2x memory usage on read-in is non-trivial and costly in
> terms of performance
>
> On Oct 26, 2014, at 4:46 AM, Saullo Castro 
> wrote:
>
> I would like to start working on a memory efficient alternative for
> np.loadtxt and np.genfromtxt that uses arrays instead of lists to store the
> data while the file iterator is exhausted.
>
> The motivation came from this SO question:
>
> http://stackoverflow.com/q/26569852/832621
>
> where for huge arrays the current NumPy ASCII readers are really slow and
> require ~6 times more memory. I tested this case with Pandas' read_csv()
> and it required ~2 times more memory.
>
> I would be glad if you could share your experience on this matter.
>
> Greetings,
> Saullo
>


Re: [Numpy-discussion] Memory efficient alternative for np.loadtxt and np.genfromtxt

2014-10-26 Thread Jeff Reback
you should have a read here:
http://wesmckinney.com/blog/?p=543

going below the 2x memory usage on read-in is non-trivial and costly in
terms of performance

> On Oct 26, 2014, at 4:46 AM, Saullo Castro  wrote:
> 
> I would like to start working on a memory efficient alternative for 
> np.loadtxt and np.genfromtxt that uses arrays instead of lists to store the 
> data while the file iterator is exhausted.
> 
> The motivation came from this SO question:
> 
> http://stackoverflow.com/q/26569852/832621
> 
> where for huge arrays the current NumPy ASCII readers are really slow and
> require ~6 times more memory. I tested this case with Pandas' read_csv()
> and it required ~2 times more memory.
> 
> I would be glad if you could share your experience on this matter.
> 
> Greetings,
> Saullo


[Numpy-discussion] Subdividing NumPy array into Regular Grid

2014-10-26 Thread Artur Bercik
I have a rectangle with the following coordinates:

import numpy as np

ulx, uly = (110, 60)  ## upper left lon, upper left lat
urx, ury = (120, 60)  ## upper right lon, upper right lat
lrx, lry = (120, 50)  ## lower right lon, lower right lat
llx, lly = (110, 50)  ## lower left lon, lower left lat

I want to divide that single rectangle into a regular grid of 100 cells,
and want to calculate the (ulx, uly), (urx, ury), (lrx, lry), and (llx, lly)
for each cell separately:

lats = np.linspace(60, 50, 10)
lons = np.linspace(110, 120, 10)

lats = np.repeat(lats,10).reshape(10,10)
lons = np.tile(lons,10).reshape(10,10)

I cannot figure out what to do next.

Is somebody familiar with this kind of problem?


Re: [Numpy-discussion] npy_log2 undefined on Linux

2014-10-26 Thread Matthew Brett
Hi,

On Sat, Oct 25, 2014 at 11:26 PM, David Cournapeau  wrote:
> Not exactly: if you build numpy with mingw (as is the official binary), you
> need to build everything that uses the numpy C API with it.

Some of the interwebs appear to believe that the mingw .a file is
compatible with visual studio:
http://stackoverflow.com/questions/2096519/from-mingw-static-library-a-to-visual-studio-static-library-lib

I can get the program to compile by copying libnpymath.a as
npymath.lib, followed by many linker errors like:

npymath.lib(npy_math.o) : error LNK2019: unresolved external symbol
_cosf referenced in function _npy_cosf

Perhaps, as the Onion once said "error found on internet"...

See you,

Matthew





[Numpy-discussion] Memory efficient alternative for np.loadtxt and np.genfromtxt

2014-10-26 Thread Saullo Castro
I would like to start working on a memory efficient alternative for
np.loadtxt and np.genfromtxt that uses arrays instead of lists to store the
data while the file iterator is exhausted.

The motivation came from this SO question:

http://stackoverflow.com/q/26569852/832621

where for huge arrays the current NumPy ASCII readers are really slow and
require ~6 times more memory. I tested this case with Pandas' read_csv()
and it required ~2 times more memory.

I would be glad if you could share your experience on this matter.

Greetings,
Saullo