Re: [Numpy-discussion] switching to float32
John Schulman <joschu at caltech.edu> writes:
> I'm trying to reduce the memory used in a calculation, so I'd like to
> switch my program to float32 instead of float64. Is it possible to
> change the numpy default float size, so I don't have to explicitly
> state dtype=np.float32 everywhere?

Possibly not the nicest way, but np.float64 = np.float32 somewhere at the
beginning should work.

Christian
Re: [Numpy-discussion] loading data
Thanks everyone for the great and well thought out responses!

To make matters worse, this is actually a 50 GB compressed csv file. So it
looks like this: 2009.06.01.plasmasub.csv.gz

We get this data from another lab on the West Coast every night, therefore
I don't have the option to have this file natively in hdf5. We are
sticking with hdf5 because we have other applications that use this data
and we wanted to standardize on hdf5.

Since my file is in csv, would it be better for me to create a TSV file
temporarily and use np.loadtxt() on that?

Also, I am curious about Neil's np.memmap. Do you have some sample code
for mapping a compressed csv file into memory? And loading the dataset
into a dset (hdf5 structure)?

TIA

On Thu, Jun 25, 2009 at 9:50 PM, Anne Archibald
<peridot.face...@gmail.com> wrote:
> 2009/6/25 Mag Gam <magaw...@gmail.com>:
>> Hello. I am very new to NumPy and Python. We are doing some research in
>> our Physics lab and we need to store massive amounts of data (100 GB
>> daily). I am therefore going to use hdf5 and h5py. The problem is I am
>> using np.loadtxt() to create my array and create a dataset according to
>> that. np.loadtxt() is reading a file which is about 50 GB. This takes a
>> very long time! I was wondering if there was a much easier and better
>> way of doing this.
>
> If you are stuck with the text array, you probably can't beat
> numpy.loadtxt(); reading a 50 GB text file is going to be slow no matter
> how you cut it. So I would take a look at the code that generates the
> text file, and see if there's any way you can make it generate a format
> that is faster to read. (I assume the code is in C or FORTRAN and you'd
> rather not mess with it more than necessary.)
>
> Of course, generating hdf5 directly is probably fastest; you might look
> at the C and FORTRAN hdf5 libraries and see how hard it would be to
> integrate them into the code that currently generates a text file. Even
> if you need a python script to gather the data and add metadata, hdf5
> will be much much more efficient than text files as an intermediate
> format.
>
> If integrating HDF5 into the generating application is too difficult,
> you can try simply generating a binary format. Using numpy's structured
> data types, it is possible to read in binary files extremely
> efficiently. If you're using the same architecture to generate the files
> as to read them, you can just write out raw binary arrays of floats or
> doubles and then read them into numpy. I think FORTRAN also has a
> semi-standard padded binary format which isn't too difficult to read
> either. You could even use numpy's native file format, which for a
> single array should be pretty straightforward, and should yield portable
> results.
>
> If you really can't modify the code that generates the text files, your
> code is going to be slow. But you might be able to make it slightly less
> slow. If, for example, the text files are in a very specific format,
> especially if they're made up of columns of fixed width, it would be
> possible to write compiled code to read them slightly more quickly. (The
> very easiest way to do this is to write a little C program that reads
> the text files and writes out a slightly friendlier format, as above.)
> But you may well find that simply reading a 50 GB file dominates your
> run time, which would mean that you're stuck with slowness.
>
> In short: avoid text files if at all possible.
> Good luck,
> Anne
>
>> TIA
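As a rough sketch of the raw-binary route Anne describes (the record
layout and file names here are invented for illustration, not taken from
the thread), reading fixed-size binary records with a structured dtype
looks like this:

    import numpy as np

    # Hypothetical record layout: a timestamp plus three measurements,
    # written by the acquisition code as raw little-endian doubles.
    rec_dtype = np.dtype([('t', '<f8'), ('x', '<f8'),
                          ('y', '<f8'), ('z', '<f8')])

    # Reading raw binary is essentially one big buffer copy, which is far
    # faster than parsing the same data as text.
    data = np.fromfile('2009.06.01.plasmasub.bin', dtype=rec_dtype)

    # NumPy's native .npy format is a portable alternative:
    np.save('plasmasub.npy', data)
    data2 = np.load('plasmasub.npy')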
Re: [Numpy-discussion] loading data
On Friday 26 June 2009 12:38:11, Mag Gam wrote:
> Thanks everyone for the great and well thought out responses!
>
> To make matters worse, this is actually a 50 GB compressed csv file. So
> it looks like this: 2009.06.01.plasmasub.csv.gz
>
> We get this data from another lab on the West Coast every night,
> therefore I don't have the option to have this file natively in hdf5.
> We are sticking with hdf5 because we have other applications that use
> this data and we wanted to standardize on hdf5.

Well, since you are adopting HDF5, the best solution would be for the West
Coast lab to send the file directly in HDF5. That will save you a lot of
headaches.

If this is not possible, then I think the best thing would be to profile
your code and see where the bottleneck is. Using cProfile normally offers
good insight into what's consuming the most time in your converter. There
are three likely hot spots: the decompressor (gzip), np.loadtxt(), and
the HDF5 writer function.

If the problem is gzip, then you won't be able to accelerate the
conversion unless the other lab is willing to use a lighter compressor
(lzop, for example). If it is np.loadtxt(), then you should ask yourself
whether you are trying to load everything in memory; if you are, don't do
that; just try to load and write slice by slice. Finally, if the problem
is in the HDF5 write, try to write array slices (and not
record-by-record writes).

> Also, I am curious about Neil's np.memmap. Do you have some sample code
> for mapping a compressed csv file into memory? And loading the dataset
> into a dset (hdf5 structure)?

No, np.memmap is meant to map *uncompressed binary* files in memory, so
you can't follow this path.

-- 
Francesc Alted
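For reference, profiling the converter as Francesc suggests could look
something like the sketch below; convert() is a made-up placeholder for
whatever function drives the gunzip/parse/write pipeline:

    import cProfile
    import pstats

    # Run the converter under the profiler and dump the statistics.
    cProfile.run('convert("2009.06.01.plasmasub.csv.gz")', 'convert.prof')

    # Show the ten calls with the largest cumulative time; the gzip,
    # loadtxt, or HDF5 layer should stand out clearly here.
    pstats.Stats('convert.prof').sort_stats('cumulative').print_stats(10)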
Re: [Numpy-discussion] loading data
I really like the slice by slice idea! But, I don't know how to implement
the code. Do you have any sample code?

I suspect it's the writing portion that's taking the longest. I did a
simple decompress test and it's fast.

On Fri, Jun 26, 2009 at 7:05 AM, Francesc Alted <fal...@pytables.org> wrote:
> [...]
> If it is np.loadtxt(), then you should ask yourself whether you are
> trying to load everything in memory; if you are, don't do that; just
> try to load and write slice by slice. Finally, if the problem is in the
> HDF5 write, try to write array slices (and not record-by-record
> writes).
> [...]
Re: [Numpy-discussion] loading data
On Friday 26 June 2009 13:09:13, Mag Gam wrote:
> I really like the slice by slice idea!

Hmm, after looking at the np.loadtxt() docstring it seems it works by
loading the complete file at once, so you shouldn't use it directly
(unless you split your big file beforehand, but that will take time too).

So, I'd say that your best bet would be to use Python's `csv.reader()`
iterator to iterate over the lines in your file and set up a buffer (a
NumPy array/recarray would be fine), so that when the buffer is full it is
written to the HDF5 file. That should be pretty optimal. With this you
will not try to load the entire file into memory, which is what I think is
probably killing the performance in your case (unless your machine has
much more memory than 50 GB, that is).

-- 
Francesc Alted
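A rough sketch of the buffered approach Francesc describes, assuming four
float columns, a gzipped CSV, and h5py's resizable datasets (all names and
sizes are illustrative, not tested against the poster's data):

    import csv
    import gzip

    import h5py
    import numpy as np

    NCOLS = 4          # assumed number of CSV columns
    BUFROWS = 100000   # rows per write; tune for your memory budget

    with gzip.open('2009.06.01.plasmasub.csv.gz', 'rt') as f, \
         h5py.File('plasmasub.h5', 'w') as h5:
        dset = h5.create_dataset('data', shape=(0, NCOLS),
                                 maxshape=(None, NCOLS), dtype='f8',
                                 chunks=(BUFROWS, NCOLS))
        buf = np.empty((BUFROWS, NCOLS), dtype='f8')
        n = 0
        for row in csv.reader(f):
            buf[n] = np.array(row, dtype='f8')  # parse one line of text
            n += 1
            if n == BUFROWS:                    # buffer full: append a slice
                dset.resize(dset.shape[0] + n, axis=0)
                dset[-n:] = buf[:n]
                n = 0
        if n:                                   # flush the final partial buffer
            dset.resize(dset.shape[0] + n, axis=0)
            dset[-n:] = buf[:n]

Only one buffer of rows is ever in memory, so peak usage is bounded by
BUFROWS regardless of the file size.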
Re: [Numpy-discussion] loading data
Yes, you are correct! I think this is the best path. However, I need to
learn how to append to an HDF5 dataset. I looked at this,
http://code.google.com/p/h5py/wiki/FAQ#Appending_data_to_a_dataset
but was not able to do so. Do you happen to have any sample code for this,
if you have used HDF5?

On Fri, Jun 26, 2009 at 7:31 AM, Francesc Alted <fal...@pytables.org> wrote:
> [...]
> So, I'd say that your best bet would be to use Python's `csv.reader()`
> iterator to iterate over the lines in your file and set up a buffer (a
> NumPy array/recarray would be fine), so that when the buffer is full it
> is written to the HDF5 file. That should be pretty optimal.
> [...]
Re: [Numpy-discussion] loading data
On Friday 26 June 2009 13:46:13, Mag Gam wrote:
> Yes, you are correct! I think this is the best path. However, I need to
> learn how to append to an HDF5 dataset. I looked at this,
> http://code.google.com/p/h5py/wiki/FAQ#Appending_data_to_a_dataset
> but was not able to do so. Do you happen to have any sample code for
> this, if you have used HDF5?

Well, by looking at the docs, it seems just a matter of using `.resize()`
and then a traditional assignment (à la arr[row1:row2] = slice). But I'd
recommend that you not run too fast and read the documentation carefully:
you will notice that, in the end, you will be far more productive, trust
me ;-)

-- 
Francesc Alted
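In h5py terms, that resize-then-assign pattern is roughly the following
(a sketch assuming a dataset created earlier with maxshape=(None, ...) so
it can grow along the first axis; names and shapes are invented):

    import h5py
    import numpy as np

    with h5py.File('plasmasub.h5', 'a') as h5:
        dset = h5['data']                # created with maxshape=(None, 4)
        chunk = np.random.rand(1000, 4)  # stand-in for the next parsed slice
        oldlen = dset.shape[0]
        dset.resize(oldlen + len(chunk), axis=0)  # grow the first axis
        dset[oldlen:] = chunk            # à la arr[row1:row2] = slice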
[Numpy-discussion] Rec array: numpy.rec vs numpy.array with complex dtype
Dear Numpy list:

We've been using the numpy.rec classes to make record array objects. We've
noticed that in more recent versions of numpy, record-array-like objects
can be made directly with the numpy.ndarray class, by passing a complex
data type. However, it looks like the numpy.rec class is still supported.
So, we have a couple of questions:

1) Which is the preferred way to make a record array: numpy.rec, or
numpy.ndarray with a complex data type? A somewhat detailed explanation of
the comparative properties would be great. (We know it's buried somewhere
in the documentation ... sorry for being lazy!)

2) The individual records in a numpy.rec array have the numpy.record type.
The individual records in the numpy.array approach have the numpy.void
type. Can you tell us a little about how these differ, and what the
advantages of one vs. the other are?

3) We've heard talk about complex data types in numpy in general. Is there
some good place we can read about this more extensively?

Also: one thing we use and like about the numpy.rec constructors is that
they can take a names argument, and the constructor function does some
inferring about what the formats you want are, e.g.:

    img = numpy.rec.fromrecords([(0,1,'a'),(2,0,'b')], names=['A','B','C'])

produces:

    rec.array([(0, 1, 'a'), (2, 0, 'b')],
              dtype=[('A', '<i4'), ('B', '<i4'), ('C', '|S1')])

This is very convenient. My immediate guess for the equivalent thing with
the numpy.ndarray approach:

    img = numpy.array([(0,1,'a'),(2,0,'b')], names=['A','B','C'])

does not work. Is there some syntax for doing this?

Thanks,
Dan
Re: [Numpy-discussion] Rec array: numpy.rec vs numpy.array with complex dtype
On Jun 26, 2009, at 2:51 PM, Dan Yamins wrote:
> We've been using the numpy.rec classes to make record array objects.
> We've noticed that in more recent versions of numpy, record-array-like
> objects can be made directly with the numpy.ndarray class, by passing a
> complex data type.

Hasn't that always been the case?

> However, it looks like the numpy.rec class is still supported. So, we
> have a couple of questions:
>
> 1) Which is the preferred way to make a record array: numpy.rec, or
> numpy.ndarray with a complex data type?

Short answer: a np.recarray is a subclass of ndarray with structured
dtype, where fields can be accessed as attributes (as in
'yourarray.yourfield') instead of as items (as in yourarray['yourfield']).
Under the hood, that means that the __getattribute__ method (and the
corresponding __setattr__) had to be overloaded (you need to check whether
an attribute is a field or not), which slows things down compared to a
standard ndarray.

My favorite way to get a np.recarray is to define a standard ndarray w/
complex dtype, and then take a view as a recarray. Example:

    np.array([(1,10),(2,20)], dtype=[('a',int),('b',int)]).view(np.recarray)

> 2) The individual records in a numpy.rec array have the numpy.record
> type. The individual records in the numpy.array approach have the
> numpy.void type. Can you tell us a little about how these differ, and
> what the advantages of one vs. the other are?

Mmh:

    >>> x = np.array([(1,10),(2,20)], dtype=[('a',int),('b',int)])
    >>> rx = x.view(np.recarray)
    >>> type(x[0])
    <type 'numpy.void'>
    >>> type(rx[0])
    <type 'numpy.void'>

What numpy version are you using?

> 3) We've heard talk about complex data types in numpy in general. Is
> there some good place we can read about this more extensively?

I think the proper term is 'structured data type', or 'structured array'.

> Also: one thing we use and like about the numpy.rec constructors is
> that they can take a names argument, and the constructor function does
> some inferring about what the formats you want are [...] My immediate
> guess for the equivalent thing with the numpy.ndarray approach:
>
>     img = numpy.array([(0,1,'a'),(2,0,'b')], names=['A','B','C'])
>
> does not work. Is there some syntax for doing this?

You have to construct your dtype explicitly, as in dtype=[('A', '<i4'),
('B', '<i4'), ('C', '|S1')]. np.rec.fromrecords processes the array and
tries to guess the best type for each field, but it's slow and not always
correct (what if your third field should have been '|S3'?).
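To make the difference concrete, a small demo of the two access styles
(an illustrative sketch, not from the thread):

    import numpy as np

    x = np.array([(1, 10), (2, 20)], dtype=[('a', int), ('b', int)])
    rx = x.view(np.recarray)

    print(x['a'])    # item access works on any structured ndarray: [1 2]
    print(rx.a)      # attribute access is what the recarray view adds: [1 2]
    print(rx['a'])   # item access still works on the recarray too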
Re: [Numpy-discussion] switching to float32
On Fri, Jun 26, 2009 at 03:39, Christian K. <ckk...@hoc.net> wrote:
> John Schulman <joschu at caltech.edu> writes:
>> I'm trying to reduce the memory used in a calculation, so I'd like to
>> switch my program to float32 instead of float64. Is it possible to
>> change the numpy default float size, so I don't have to explicitly
>> state dtype=np.float32 everywhere?
>
> Possibly not the nicest way, but np.float64 = np.float32 somewhere at
> the beginning should work.

No. There is no way to change the default dtype of ones(), zeros(), etc.

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
  -- Umberto Eco
Re: [Numpy-discussion] Rec array: numpy.rec vs numpy.array with complex dtype
Pierre, thanks for your response. I have some follow up questions.

> Short answer: a np.recarray is a subclass of ndarray with structured
> dtype, where fields can be accessed as attributes (as in
> 'yourarray.yourfield') instead of as items (as in
> yourarray['yourfield']).

Is this the only substantial thing added in the recarray class? The fact
that you can access some fields via attribute notation? We haven't been
using this feature anyhow ... (What happens when the field names have
spaces?) Is the recarray class still being developed actively?

> My favorite way to get a np.recarray is to define a standard ndarray w/
> complex dtype, and then take a view as a recarray. Example:
>
>     np.array([(1,10),(2,20)], dtype=[('a',int),('b',int)]).view(np.recarray)

Is the purpose of this basically to use the property of recarrays of
accessing fields as attributes? Or do you have other reasons why you like
this view?

> What numpy version are you using?

In [18]: x = np.rec.fromrecords([(0,1,'a'),(2,0,'b')], names=['A','B'])

In [19]: x[0]
Out[19]: (0, 1, 'a')

In [20]: type(x[0])
Out[20]: <class 'numpy.core.records.record'>

In [21]: np.version.version
Out[21]: '1.3.0'

> I think the proper term is 'structured data type', or 'structured
> array'.

Do you recommend a place we can learn about the interesting things one can
do with structured data types? Or is the on-line documentation on the
scipy site the best as of now?

> np.rec.fromrecords processes the array and tries to guess the best type
> for each field, but it's slow and not always correct

Evidently. But sometimes (in fact, a lot of times, in our particular
applications), the type inference works fine and the slowdown is not large
enough to be noticeable. And of course in the recarray constructors one
can override the type inference by including a 'dtype' or 'formats'
argument as well. Obviously we can write constructor functions that
include type inference algorithms of our own ... but having a standard
way to do this, with best practices maintained in the numpy core, would be
quite useful nonetheless.

thanks,
Dan
Re: [Numpy-discussion] Rec array: numpy.rec vs numpy.array with complex dtype
On Jun 26, 2009, at 3:59 PM, Dan Yamins wrote:
> Is this the only substantial thing added in the recarray class?

AFAIK, yes.

> The fact that you can access some fields via attribute notation? We
> haven't been using this feature anyhow ... (What happens when the field
> names have spaces?)

Well, spaces in a field name are a bad idea, but nothing prevents you from
using them (I wonder whether we shouldn't check for that in the definition
of the dtype). Anyway, attribute access will of course fail gloriously in
that case.

> Is the recarray class still being developed actively?

I don't know. There's not much you can add to it, is there?

> Is the purpose of this basically to use the property of recarrays of
> accessing fields as attributes? Or do you have other reasons why you
> like this view?

You're correct, it's only to provide a more convenient way to access
fields. I personally stopped using recarrays in favor of the easier
ndarrays w/ structured dtype. If I really need to access fields as
attributes, I'd write a subclass and make each field accessible as a
property.

> Do you recommend a place we can learn about the interesting things one
> can do with structured data types? Or is the on-line documentation on
> the scipy site the best as of now?

http://docs.scipy.org/doc/numpy/reference/arrays.dtypes.html is a good
start. Feel free to start some tutorial page.

> Evidently. But sometimes (in fact, a lot of times, in our particular
> applications), the type inference works fine and the slowdown is not
> large enough to be noticeable. [...] Obviously we can write constructor
> functions that include type inference algorithms of our own ... but
> having a standard way to do this, with best practices maintained in the
> numpy core, would be quite useful nonetheless.

Well, you can always use the functions of the np.rec module (fromfile,
fromstring, fromrecords, ...). You can also have a look at
np.lib.io.genfromtxt, a function to create a ndarray (or recarray, or
MaskedArray) from a text file. I don't think overloading np.array to
support cases like the ones you described is a good idea: I prefer to have
some specific tools (like the np.rec functions) rather than one catch-all
function.
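A rough sketch of the subclass-with-properties idea Pierre mentions
(entirely illustrative; the class and field names are made up):

    import numpy as np

    class FieldArray(np.ndarray):
        """Structured ndarray exposing selected fields as properties."""

        def __new__(cls, data, dtype):
            # Build a structured array, then re-view it as this subclass.
            return np.asarray(data, dtype=dtype).view(cls)

        @property
        def a(self):
            return self['a']

        @property
        def b(self):
            return self['b']

    x = FieldArray([(1, 10), (2, 20)], dtype=[('a', int), ('b', int)])
    print(x.a)   # [1 2], with no per-attribute __getattribute__ overhead

Because only the named properties are intercepted, ordinary attribute
lookups stay as fast as on a plain ndarray.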
Re: [Numpy-discussion] switching to float32
Robert Kern wrote:
> On Fri, Jun 26, 2009 at 03:39, Christian K. <ckk...@hoc.net> wrote:
>> Possibly not the nicest way, but np.float64 = np.float32 somewhere at
>> the beginning should work.
>
> No. There is no way to change the default dtype of ones(), zeros(), etc.

Right, I answered before thinking it through. I thought it was about
replacing the dtype keyword arg with float32 in all places, which could
very well be accomplished by the text editor itself. Sorry.

Christian
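One workaround, not proposed in the thread but a common pattern: wrap the
constructors you actually use, so the dtype is stated exactly once:

    import functools
    import numpy as np

    # Single-precision versions of the constructors this program uses;
    # the dtype is stated once here instead of at every call site.
    zeros32 = functools.partial(np.zeros, dtype=np.float32)
    ones32 = functools.partial(np.ones, dtype=np.float32)
    empty32 = functools.partial(np.empty, dtype=np.float32)

    a = zeros32((1000, 1000))
    print(a.dtype, a.nbytes)   # float32 4000000; half the float64 footprint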
[Numpy-discussion] Import Numpy on Windows Vista x64 AMD
Ticket #1084
(http://projects.scipy.org/numpy/timeline?from=2009-06-09T03%3A01%3A59-0500&precision=second)
says that the numpy import on Windows Vista x64 AMD systems works now. Is
this for Numpy 1.3 or 1.4, and if 1.3, has anyone tried it successfully?

Thanks.

Dinesh
[Numpy-discussion] fromfile and ticket #1152
The question is: what should happen when fewer items are read than
requested? The current behaviour is:

1) An error message is written to stderr (needs to be fixed).
2) If 0 items are read, then a no-memory error is raised ;)

So, should a warning be raised and an array returned with however many
items were read (meaning an empty array if nothing was read)? Or should an
error be raised in this circumstance? The behaviour is currently
undocumented.

Chuck
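For context, the call in question looks like the sketch below (file name
and count are invented); whichever behaviour is chosen, checking the size
of the returned array is the caller-side defence against short reads:

    import numpy as np

    with open('data.bin', 'rb') as f:
        a = np.fromfile(f, dtype=np.float64, count=1000)

    # If the file held fewer than 1000 doubles, the open question above is
    # whether this point is reached with a short array, a warning, or an
    # exception. Checking a.size covers the first two cases.
    if a.size < 1000:
        print('short read: got %d of 1000 items' % a.size)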
Re: [Numpy-discussion] Import Numpy on Windows Vista x64 AMD
On Sat, Jun 27, 2009 at 6:42 AM, Dinesh B Vadhia
<dineshbvad...@hotmail.com> wrote:
> Ticket #1084
> (http://projects.scipy.org/numpy/timeline?from=2009-06-09T03%3A01%3A59-0500&precision=second)
> says that the numpy import on Windows Vista x64 AMD systems works now.

I mistakenly closed it as fixed, but it is just a duplicate. The problem
persists,

David
[Numpy-discussion] stderr
There are a few spots where messages are printed to stderr. Some of these
look almost like debugging stuff, for instance:

    NPY_NO_EXPORT void
    format@name@(char *buf, size_t buflen, @name@ val, unsigned int prec)
    {
        /* XXX: Find a correct size here for format string */
        char format[64], *res;
        size_t i, cnt;

        PyOS_snprintf(format, sizeof(format), _FMT1, prec);
        res = NumPyOS_ascii_format@type@(buf, buflen, format, val, 0);
        if (res == NULL) {
            fprintf(stderr, "Error while formatting\n");
            return;
        }

        /* If nothing but digits after sign, append ".0" */
        cnt = strlen(buf);
        for (i = (val < 0) ? 1 : 0; i < cnt; ++i) {
            if (!isdigit(Py_CHARMASK(buf[i]))) {
                break;
            }
        }
        if (i == cnt && buflen >= cnt + 3) {
            strcpy(&buf[cnt], ".0");
        }
    }

Do we want to raise an error here? Alternatively, we could use an assert.

Chuck