[Numpy-discussion] switching to float32

2009-06-25 Thread John Schulman
I'm trying to reduce the memory used in a calculation, so I'd like to
switch my program to float32 instead of float64. Is it possible to
change the numpy default float size, so I don't have to explicitly
state dtype=np.float32 everywhere?

Thanks,
John


Re: [Numpy-discussion] switching to float32

2009-06-25 Thread Geoffrey Ely
This does not exactly answer your question, but you can use the dtype's
short string code as the positional parameter to make things more
concise. For example:

a = numpy.array( [1.0, 2.0, 3.0], 'f' )

instead of

a = numpy.array( [1.0, 2.0, 3.0], dtype=numpy.float32 )
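
For what it's worth, here is a quick sketch showing the memory
difference (the array contents are just illustrative):

import numpy as np

a64 = np.array([1.0, 2.0, 3.0])        # defaults to float64
a32 = np.array([1.0, 2.0, 3.0], 'f')   # 'f' is the character code for float32

a64.nbytes   # 24 bytes
a32.nbytes   # 12 bytes

# an existing float64 array can also be downcast explicitly
b = a64.astype(np.float32)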

-Geoff

On Jun 25, 2009, at 6:43 AM, John Schulman wrote:

 I'm trying to reduce the memory used in a calculation, so I'd like to
 switch my program to float32 instead of float64. Is it possible to
 change the numpy default float size, so I don't have to explicitly
 state dtype=np.float32 everywhere?

 Thanks,
 John



Re: [Numpy-discussion] loading data

2009-06-25 Thread Anne Archibald
2009/6/25 Mag Gam magaw...@gmail.com:
 Hello.

 I am very new to NumPy and Python. We are doing some research in our
 Physics lab and we need to store massive amounts of data (100 GB
 daily). I am therefore going to use HDF5 and h5py. The problem is that
 I am using np.loadtxt() to create my array and then create a dataset
 from it; np.loadtxt() is reading a file which is about 50 GB.
 This takes a very long time! I was wondering if there was a much
 easier and better way of doing this.

If you are stuck with the text format, you probably can't beat
numpy.loadtxt(); reading a 50 GB text file is going to be slow no
matter how you cut it. So I would take a look at the code that
generates the text file, and see if there's any way you can make it
generate a format that is faster to read. (I assume the code is in C
or FORTRAN and you'd rather not mess with it more than necessary).

Of course, generating hdf5 directly is probably fastest; you might
look at the C and FORTRAN hdf5 libraries and see how hard it would be
to integrate them into the code that currently generates a text file.
Even if you need to have a python script to gather the data and add
metadata, hdf5 will be much much more efficient than text files as an
intermediate format.
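
For reference, writing an HDF5 file from a Python gathering script only
takes a few lines with h5py (a rough sketch; the file and dataset names
are made up):

import numpy as np
import h5py

data = np.random.rand(1000, 8).astype(np.float32)   # stand-in for real measurements

f = h5py.File('run001.h5', 'w')
dset = f.create_dataset('measurements', data=data)
dset.attrs['station'] = 'lab-3'                      # metadata travels with the data
f.close()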

If integrating HDF5 into the generating application is too difficult,
you can try simply generating a binary format. Using numpy's
structured data types, it is possible to read in binary files
extremely efficiently. If you're using the same architecture to
generate the files as read them, you can just write out raw binary
arrays of floats or doubles and then read them into numpy. I think
FORTRAN also has a semi-standard padded binary format which isn't too
difficult to read either. You could even use numpy's native .npy file
format (np.save/np.load), which for a single array is straightforward
and yields portable results.
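
As a rough sketch of the raw-binary route (the record layout and file
names here are invented for illustration):

import numpy as np

# structured dtype matching the fixed record layout written by the C/Fortran code
rec = np.dtype([('t', np.float64), ('x', np.float32), ('y', np.float32)])

data = np.fromfile('run001.dat', dtype=rec)   # reads the whole file in one call

# numpy's native .npy format is just as simple and stays portable
np.save('run001.npy', data)
same = np.load('run001.npy')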

If you really can't modify the code that generates the text files,
your code is going to be slow. But you might be able to make it
slightly less slow. If, for example, the text files are a very
specific format, especially if they're made up of columns of fixed
width, it would be possible to write compiled code to read them
slightly more quickly. (The very easiest way to do this is to write a
little C program that reads the text files and writes out a slightly
friendlier format, as above.) But you may well find that simply
reading a 50 GB file dominates your run time, which would mean that
you're stuck with slowness.


In short: avoid text files if at all possible.


Good luck,
Anne

 TIA



Re: [Numpy-discussion] loading data

2009-06-25 Thread Neil Martinsen-Burrell
On Thu, June 25, 2009 7:59 pm, Mag Gam wrote:
 I am very new to NumPy and Python. We are doing some research in our
 Physics lab and we need to store massive amounts of data (100 GB
 daily). I am therefore going to use HDF5 and h5py. The problem is that
 I am using np.loadtxt() to create my array and then create a dataset
 from it; np.loadtxt() is reading a file which is about 50 GB.
 This takes a very long time! I was wondering if there was a much
 easier and better way of doing this.

50 GB is a *lot* of data to read from a disk into memory (if you really do
have that much memory).  A magnetic hard drive can read at less than 150
MB/s, so just reading the blocks off the disk would take over 5 minutes,
and np.loadtxt has additional processing on top of that.  You may be
interested in PyTables (www.pytables.org) or np.memmap.  Since you have
already settled on HDF5, PyTables would be a natural choice: it can
process on-disk datasets as if they were NumPy arrays, which is handy if
you don't have all 50 GB of memory.
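
A minimal np.memmap sketch, assuming the data were already converted to
a flat binary file of float32 (the file name and shape are made up):

import numpy as np

# map the file without loading it; only the slices you touch are read from disk
data = np.memmap('run001.dat', dtype=np.float32, mode='r', shape=(1000000, 64))

chunk_means = data[:100000].mean(axis=0)   # operate on one chunk at a time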

-Neil

