[Numpy-discussion] recfunctions.stack_arrays
Pierre (or anyone else who cares to chime in),

I'm using stack_arrays to combine data from two different files into a single array. In one of these files, the data from one entire record comes back missing, which, thanks to your recent change, ends up having a boolean dtype. There is actual data for this same field in the 2nd file, so it ends up having the dtype of float64. When I try to combine the two arrays, I end up with the following traceback:

    data = stack_arrays((old_data, data))
      File "/home/rmay/.local/lib64/python2.5/site-packages/metpy/cbook.py", line 260, in stack_arrays
        output = ma.masked_all((np.sum(nrecords),), newdescr)
      File "/home/rmay/.local/lib64/python2.5/site-packages/numpy/ma/extras.py", line 79, in masked_all
        a = masked_array(np.empty(shape, dtype),
    ValueError: two fields with the same name

Which is unsurprising. Do you think there is any reasonable way to get stack_arrays() to find a common dtype for fields with the same name? Or another suggestion on how to approach this? If you think coercing one/both of the fields to a common dtype is the way to go, just point me to a function that could figure out the dtype and I'll try to put together a patch.

Thanks,
Ryan

P.S. Thanks so much for your work on putting those utility functions in recfunctions.py. It makes it so much easier to have these functions available in the library itself rather than needing to reinvent the wheel over and over.

--
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion
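As a hypothetical sketch of the coercion asked about above (the helper name is invented, not part of numpy.lib.recfunctions; np.promote_types is the piece that figures out the common dtype):

```python
import numpy as np

# Hypothetical helper: build a merged descriptor for two structured dtypes,
# promoting fields that share a name but disagree on type -- e.g. the
# all-missing bool column vs. the float64 column described above.
def common_descr(dt1, dt2):
    descr = []
    for name in dt1.names:
        t1 = dt1.fields[name][0]
        if name in (dt2.names or ()):
            # Same field name in both: take the common (promoted) dtype.
            descr.append((name, np.promote_types(t1, dt2.fields[name][0])))
        else:
            descr.append((name, t1))
    return descr

dt_a = np.dtype([('temp', np.bool_), ('pres', np.float64)])
dt_b = np.dtype([('temp', np.float64), ('pres', np.float64)])
print(common_descr(dt_a, dt_b))  # 'temp' promoted to float64
```

The merged descriptor could then be used to cast both arrays before stacking.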
Re: [Numpy-discussion] Bug with mafromtxt
Pierre GM wrote:
> On Jan 24, 2009, at 6:23 PM, Ryan May wrote:
>> Ok, thanks. I've dug a little further, and it seems like the problem is that a column of all missing values ends up as a column of all None's. When you create a (masked) array from a list of None's, you end up with an object array. On one hand I'd love for things to behave differently in this case, but on the other I understand why things work this way.
>
> Ryan,
> Mind giving r6434 a try? As usual, don't hesitate to report any problem.

Works great! Thanks for the quick fix. I had racked my brain on how to go about fixing this cleanly, but this is far simpler than what I would have done. It makes sense, since all I really needed for the masked column was something *other* than object.

Thanks a lot,
Ryan
Re: [Numpy-discussion] numpy ndarray questions
Jochen wrote:
> Hi all,
> I just wrote ctypes bindings to fftw3 (see http://projects.scipy.org/pipermail/scipy-user/2009-January/019557.html for the post to scipy). Now I have a couple of numpy related questions:
>
> In order to be able to use simd instructions I create an ndarray subclass, which uses fftw_malloc to allocate the memory and fftw_free to free the memory when the array is deleted. This works fine for in-place operations, however if someone does something like this:
>
>     a = fftw3.AlignedArray(1024, complex)
>     a = a + 1
>
> a.ctypes.data points to a different memory location (this is actually an even bigger problem when executing fftw plans), however type(a) still gives me <class 'fftw3.planning.AlignedArray'>.

This might help some: http://www.scipy.org/Subclasses

Ryan
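A minimal sketch of the behavior Jochen describes (the class name here is invented; any ndarray subclass shows the same thing): arithmetic like `a + 1` preserves the subclass via `__array_finalize__`, but the result lives in a fresh buffer from NumPy's ordinary allocator, so any alignment guarantee from a custom allocator is lost.

```python
import numpy as np

# Toy stand-in for an aligned-allocation subclass (name is invented).
class AlignedishArray(np.ndarray):
    def __new__(cls, shape, dtype=float):
        # A real implementation would allocate via fftw_malloc here.
        return np.ndarray.__new__(cls, shape, dtype=dtype)

a = AlignedishArray(1024, complex)
a[:] = 0
b = a + 1
print(type(b).__name__)                # still AlignedishArray
print(b.ctypes.data == a.ctypes.data)  # False: a new, ordinary buffer
```

This is why rebinding `a = a + 1` defeats the custom allocator: only in-place operations (`a += 1`) reuse the original memory.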
[Numpy-discussion] Bug with mafromtxt
Pierre,

I've found what I consider to be a bug in the new mafromtxt (though apparently it existed in earlier versions as well). If you have an entire column of data in a file that contains only masked data, and try to get mafromtxt to automatically choose the dtype, the dtype gets selected to be object type. In this case, I'd think the better behavior would be float, but I'm not sure how hard it would be to make this the case. Here's a test case:

    import numpy as np
    from StringIO import StringIO
    s = StringIO('1 2 3\n4 5 6\n')
    a = np.mafromtxt(s, missing='2,5', dtype=None)
    print a.dtype

Ryan
Re: [Numpy-discussion] Bug with mafromtxt
Pierre GM wrote:
> Ryan,
> Thanks for reporting. An idea would be to force the dtype of the masked column to the largest dtype of the other columns (in your example, that would be int). I'll try to see how easily it can be done early next week. Meanwhile, you can always give an explicit dtype at creation.

Ok, thanks. I've dug a little further, and it seems like the problem is that a column of all missing values ends up as a column of all None's. When you create a (masked) array from a list of None's, you end up with an object array. On one hand I'd love for things to behave differently in this case, but on the other I understand why things work this way.

Ryan
[Numpy-discussion] Pattern for reading non-simple binary files
Hi,

I'm trying to read in data from a binary-formatted file. I have the data format (available at http://www1.ncdc.noaa.gov/pub/data/documentlibrary/tddoc/td7000.pdf if you're really curious), but it's not what I would consider simple: there are a lot of different blocks and messages, some that are optional and some that have different formats depending on the data type. My question is: has anyone dealt with data like this using numpy? Have you found a good pattern for how to construct a numpy dtype dynamically to decode the different parts of the file appropriately as you go along? Any insight would be appreciated.

Ryan
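One common pattern for this kind of file, sketched below with invented field names and a fabricated two-byte-integer format (not the TD-7000 layout): read a small fixed header to learn what the next block looks like, then build the dtype for that block on the fly and decode it with np.frombuffer.

```python
import io
import struct
import numpy as np

# Fabricate a tiny "file": a header saying 3 records of 2 values each,
# followed by the 6 little-endian int16 values themselves.
raw = struct.pack('<hh', 3, 2) + struct.pack('<6h', 1, 2, 3, 4, 5, 6)
f = io.BytesIO(raw)

# Step 1: decode the fixed-layout header.
hdr_dt = np.dtype([('nrec', '<i2'), ('nval', '<i2')])
hdr = np.frombuffer(f.read(hdr_dt.itemsize), dtype=hdr_dt)[0]

# Step 2: build the record dtype dynamically from the header, then decode.
body_dt = np.dtype([('vals', '<i2', (int(hdr['nval']),))])
body = np.frombuffer(f.read(body_dt.itemsize * int(hdr['nrec'])), dtype=body_dt)
print(body['vals'])
```

Optional blocks fall out naturally: inspect the header, pick (or build) the matching dtype, and advance the file position by that dtype's itemsize.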
Re: [Numpy-discussion] Numpy performance vs Matlab.
Nicolas ROUX wrote:
> Hi,
> I need help ;-) I have here a testcase which works much faster in Matlab than Numpy. The following code takes less than 0.9 sec in Matlab, but 21 sec in Python. Numpy is 24 times slower than Matlab! The big trouble I have is that a large team of people within my company is ready to replace Matlab by Numpy/Scipy/Matplotlib, but I have to demonstrate that this kind of Python code is executed with the same performance as Matlab, without writing a C extension. This is becoming a critical point for us.
>
> This is a testcase that people would like to see working without any code restructuring. The reasons are:
> - this way of writing is fairly natural.
> - the original code which showed me the Matlab/Numpy performance differences is much more complex, and can't benefit from broadcasting or other numpy tips (I can give this code later).
> So I really need to use the code below, without restructuring.
>
> Numpy/Python code:
>
>     import numpy
>     import time
>     print "Start test \n"
>     dim = 3000
>     a = numpy.zeros((dim,dim,3))
>     start = time.clock()
>     for i in range(dim):
>         for j in range(dim):
>             a[i,j,0] = a[i,j,1]
>             a[i,j,2] = a[i,j,0]
>             a[i,j,1] = a[i,j,2]
>     end = time.clock() - start
>     print "Test done, %f sec" % end
>
> Any idea on it? Did I miss something?

I think you may have reduced the complexity a bit too much. The Python code above sets all of the elements equal to a[i,j,1]. Is there any reason you can't use slicing to avoid the loops?

Ryan
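For reference, the nested loops in the testcase collapse to three whole-array slice assignments, which run in C rather than the Python interpreter (a smaller dim is used here just so the sketch runs quickly):

```python
import numpy as np

dim = 300  # the original testcase used 3000
a = np.zeros((dim, dim, 3))
# Equivalent of the nested i/j loops, one assignment per plane:
a[:, :, 0] = a[:, :, 1]
a[:, :, 2] = a[:, :, 0]
a[:, :, 1] = a[:, :, 2]
print(a.shape)
```

Of course, as the reply notes, the testcase only copies a plane of zeros around; the real question is whether the original, more complex code can be expressed with slicing at all.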
Re: [Numpy-discussion] Alternative to record array
Jean-Baptiste Rudant wrote:
> Hello,
> I like to use record arrays to access fields by their name, and because they are easy to use with PyTables. But I think it's not very efficient for what I have to do. Maybe I'm misunderstanding something.
>
> Example:
>
>     import numpy as np
>     age = np.random.randint(0, 99, 10e6)
>     weight = np.random.randint(0, 200, 10e6)
>     data = np.rec.fromarrays((age, weight), names='age, weight')
>     # the kind of operation I do is:
>     data.age += 1
>     # but it's far less efficient than doing:
>     age += 1
>     # because I think the record array stores [(age_0, weight_0) ... (age_n, weight_n)]
>     # and not [age_0 ... age_n] then [weight_0 ... weight_n].
>
> So I think I don't use record arrays for the right purpose. I only need something which would make it easy to manipulate data by accessing fields by their name. Am I wrong? Is there something in numpy for my purpose? Do I have to implement my own class, with something like:
>
>     class FieldArray:
>         def __init__(self, array_dict):
>             self.array_list = array_dict
>         def __getitem__(self, field):
>             return self.array_list[field]
>         def __setitem__(self, field, value):
>             self.array_list[field] = value
>
>     my_arrays = {'age': age, 'weight': weight}
>     data = FieldArray(my_arrays)
>     data['age'] += 1

You can accomplish what your FieldArray class does using numpy dtypes:

    import numpy as np

    dt = np.dtype([('age', np.int32), ('weight', np.int32)])
    N = int(10e6)
    data = np.empty(N, dtype=dt)
    data['age'] = np.random.randint(0, 99, N)
    data['weight'] = np.random.randint(0, 200, N)
    data['age'] += 1

Timing for recarrays (your code):

    In [10]: timeit data.age += 1
    10 loops, best of 3: 221 ms per loop

Timing for my example:

    In [2]: timeit data['age'] += 1
    10 loops, best of 3: 150 ms per loop

Hope this helps.

Ryan
Re: [Numpy-discussion] genloadtxt : last call
Pierre GM wrote:
> Ryan,
> OK, I'll look into that. I won't have time to address it before this next week, however. Option #2 looks like the best.

No hurry, I just want to make sure I raise any issues I see while the design is still up for change.

> In other news, I was considering renaming genloadtxt to genfromtxt, and using ndfromtxt, mafromtxt, recfromtxt, recfromcsv for the function names. That way, loadtxt is untouched.

+1. I know I've changed my tune here, but at this point it seems like there's so much more functionality here that calling it loadtxt would be a disservice to how much the new function can do (and how much work you've done).

Ryan
Re: [Numpy-discussion] Unexpected MaskedArray behavior
Pierre GM wrote:
> On Dec 16, 2008, at 1:57 PM, Ryan May wrote:
>> I just noticed the following and I was kind of surprised:
>>
>>     >>> a = ma.MaskedArray([1,2,3,4,5], mask=[False,True,True,False,False])
>>     >>> b = a*5
>>     >>> b
>>     masked_array(data = [5 -- -- 20 25],
>>                  mask = [False True True False False],
>>                  fill_value=99)
>>     >>> b.data
>>     array([ 5, 10, 15, 20, 25])
>>
>> I was expecting that the underlying data wouldn't get modified while masked. Is this actual behavior expected?
>
> Meh. Masked data shouldn't be trusted anyway, so I guess it doesn't really matter one way or the other. But I tend to agree, it'd make more sense to leave masked data untouched (or at least, reset them to their original value after the operation), which would mimic the behavior of gimp/photoshop. Looks like there's a relatively easy fix. I need time to check that it doesn't break anything elsewhere, and that it doesn't slow things down too much. I won't have time to test all that before next week, though. In any case, that would be for 1.3.x, not for 1.2.x. In the meantime, if you need the functionality, use something like ma.where(a.mask, a, a*5).

I agree that masked values probably shouldn't be trusted; I was just surprised to see the behavior. I just assumed that no operations were taking place on masked values.

Just to clarify what I was doing here: I had a masked array of data, where the mask was set by a variety of different masked values. Later on in the code, after doing some unit conversions, I went back to look at the raw data to find points that had one particular masked value set. Instead, I was surprised to see all of the masked values had changed and I could no longer find any of the special values in the data.

Ryan
[Numpy-discussion] Unexpected MaskedArray behavior
Hi,

I just noticed the following and I was kind of surprised:

    >>> a = ma.MaskedArray([1,2,3,4,5], mask=[False,True,True,False,False])
    >>> b = a*5
    >>> b
    masked_array(data = [5 -- -- 20 25],
                 mask = [False True True False False],
                 fill_value=99)
    >>> b.data
    array([ 5, 10, 15, 20, 25])

I was expecting that the underlying data wouldn't get modified while masked. Is this actual behavior expected?

Ryan
Re: [Numpy-discussion] genloadtxt : last call
Pierre GM wrote:
> All,
> Here's the latest version of genloadtxt, with some recent corrections. With just a couple of tweaks, we end up with some decent speed: it's still slower than np.loadtxt, but only 15% so according to the test at the end of the package.

I have one more usage issue that you may or may not want to fix. My problem is that missing values are specified by their string representation, so a string representing a missing value, while having the same actual numeric value, may not compare equal when represented as a string. For instance, if you specify that -999.0 represents a missing value, but the value written to the file is -999.00, you won't end up masking the -999.00 data point. I'm sure a test case will help here:

    def test_withmissing_float(self):
        data = StringIO.StringIO('A,B\n0,1.5\n2,-999.00')
        test = mloadtxt(data, dtype=None, delimiter=',',
                        missing='-999.0', names=True)
        control = ma.array([(0, 1.5), (2, -1.)],
                           mask=[(False, False), (False, True)],
                           dtype=[('A', np.int), ('B', np.float)])
        print control
        print test
        assert_equal(test, control)
        assert_equal(test.mask, control.mask)

Right now this fails with the latest version of genloadtxt. I've worked around this by specifying a whole bunch of string representations of the values, but I wasn't sure if you knew of a better way this could be handled within genloadtxt. I can only think of two ways, though I'm not thrilled with either:

1) Call the converter on the string form of the missing value and compare against the converted value from the file to determine if missing. (Probably very slow.)

2) Add a list of objects (ints, floats, etc.) to compare against after conversion to determine if they're missing. This might needlessly complicate the function, which I know you've already taken pains to optimize.

If there's no good way to do it, I'm content to live with a workaround.
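Option 2 might look something like this rough sketch (names invented; genloadtxt's real internals differ): compare the *converted* value against a set of numeric missing codes, so that '-999.0' and '-999.00' both get masked.

```python
# Hypothetical sketch of option 2: after the string is converted, test the
# numeric result against numeric missing codes instead of string forms.
missing_codes = {-999.0}

def convert_and_flag(field, converter=float):
    value = converter(field)
    return value, value in missing_codes

print(convert_and_flag('-999.00'))  # masked despite the extra trailing zero
print(convert_and_flag('1.5'))
```

The string-comparison approach would miss '-999.00' here, since only the text '-999.0' was registered as missing.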
Ryan
Re: [Numpy-discussion] genloadtxt : last call
Pierre GM wrote:
> All,
> Here's the latest version of genloadtxt, with some recent corrections. With just a couple of tweaks, we end up with some decent speed: it's still slower than np.loadtxt, but only 15% so according to the test at the end of the package. And so, now what? Should I put the module in numpy.lib.io? Elsewhere? Thanks for any comments and suggestions.

The current version works out of the box for me. Thanks for running point on this.

Ryan
Re: [Numpy-discussion] genloadtxt: second serving
Pierre GM wrote:
> All,
> Here's the second round of genloadtxt. That's a tad cleaner version than the previous one, where I tried to take into account the different comments and suggestions that were posted. So, tabs should be supported and explicit whitespaces are not collapsed.

Looks pretty good, but there's one breakage against what I had working with my local copy (with mods). When adding the filtering of names read from the file using usecols, there's a reason I set a flag and fixed it later: converters specified by name. If we have usecols and converters specified by name, and we read the names from a file, we have the following sequence:

1) Read names.
2) Convert usecols names to column numbers.
3) Filter the name list using usecols. Indices of the names list no longer map to column numbers.
4) Change converters from mapping names->funcs to mapping col#->funcs using indices from names. OOPS.

It's an admittedly complex combination, but it allows flexibly reading text files since you're basing everything on field names, not column numbers. Here's a test case:

    def test_autonames_usecols_and_converter(self):
        "Tests names and usecols"
        data = StringIO.StringIO('A B C D\n 121 45 9.1')
        test = loadtxt(data, usecols=('A', 'C', 'D'), names=True,
                       dtype=None, converters={'C': lambda s: 2 * int(s)})
        control = np.array(('', 90, 9.1),
                           dtype=[('A', '|S4'), ('C', int), ('D', float)])
        assert_equal(test, control)

This fails with your current implementation, but works for me when I:

1) Set a flag when reading names from the header line in the file.
2) Filter names from the file using usecols (if the flag is true) *after* remapping the converters.

There may be a better approach, but this is the simplest I've come up with so far.

> FYI, in the __main__ section, you'll find 2 hotshot tests and a timeit comparison: same input, no missing data, one with genloadtxt, one with np.loadtxt and a last one with matplotlib.mlab.csv2rec. As you'll see, genloadtxt is roughly twice as slow as np.loadtxt, but twice as fast as csv2rec.
>
> One explanation for the slowness is indeed the use of classes for splitting lines and converting values. Instead of a basic function, we use the __call__ method of the class, which itself calls another function depending on the attribute values. I'd like to reduce this overhead; any suggestion is more than welcome, as usual. Anyhow: as we do need speed, I suggest we put genloadtxt somewhere in numpy.ma, with an alias recfromcsv for John, using his defaults. Unless somebody comes up with a brilliant optimization.

Why only in numpy.ma and not somewhere in core numpy itself (missing values aside)? You have a pretty good masked-array-agnostic wrapper that IMO could go in numpy, though maybe not as loadtxt.

Ryan
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
Pierre GM wrote:
>> I think that treating an explicitly-passed-in ' ' delimiter as identical to 'no delimiter' is a bad idea. If I say that ' ' is the delimiter, or '\t' is the delimiter, this should be treated *just* like ',' being the delimiter, where the expected output is: ['1', '2', '3', '4', '', '5']
>
> Valid point. Well, all, stay tuned for yet another yet another implementation...

Found a problem. If you read the names from the file and specify usecols, you end up with the first N names read from the file as the fields in your output (where N is the number of entries in usecols), instead of having the names of the columns you asked for. For instance:

    >>> from StringIO import StringIO
    >>> from genload_proposal import loadtxt
    >>> f = StringIO('stid stnm relh tair\nnrmn 121 45 9.1')
    >>> loadtxt(f, usecols=('stid', 'relh', 'tair'), names=True, dtype=None)
    array(('nrmn', 45, 9.0996),
          dtype=[('stid', '|S4'), ('stnm', 'i8'), ('relh', 'f8')])

What I want to come out is:

    array(('nrmn', 45, 9.0996),
          dtype=[('stid', '|S4'), ('relh', 'i8'), ('tair', 'f8')])

I've attached a version that fixes this by setting a flag internally if the names are read from the file. If this flag is true, at the end the names are filtered down to only the ones that are given in usecols.

I also have one other thought. Is there any way we can make this handle object arrays, or rather, a field containing objects, specifically datetime objects? Right now, this does not work because calling view does not work for object arrays. I'm just looking for a simple way to store date/time in my record array (currently a string field).

Ryan

Proposal: Here's an extension to np.loadtxt, designed to take missing values into account.

    import itertools
    import numpy as np
    import numpy.ma as ma

    def _is_string_like(obj):
        """Check whether obj behaves like a string."""
        try:
            obj + ''
        except (TypeError, ValueError):
            return False
        return True

    def _to_filehandle(fname, flag='r', return_opened=False):
        """
        Returns the filehandle corresponding to a string or a file.
        If the string ends in '.gz', the file is automatically unzipped.

        Parameters
        ----------
        fname : string, filehandle
            Name of the file whose filehandle must be returned.
        flag : string, optional
            Flag indicating the status of the file ('r' for read, 'w' for write).
        return_opened : boolean, optional
            Whether to return the opening status of the file.
        """
        if _is_string_like(fname):
            if fname.endswith('.gz'):
                import gzip
                fhd = gzip.open(fname, flag)
            elif fname.endswith('.bz2'):
                import bz2
                fhd = bz2.BZ2File(fname)
            else:
                fhd = file(fname, flag)
            opened = True
        elif hasattr(fname, 'seek'):
            fhd = fname
            opened = False
        else:
            raise ValueError('fname must be a string or file handle')
        if return_opened:
            return fhd, opened
        return fhd

    def flatten_dtype(ndtype):
        """Unpack a structured data-type."""
        names = ndtype.names
        if names is None:
            return [ndtype]
        else:
            types = []
            for field in names:
                (typ, _) = ndtype.fields[field]
                flat_dt = flatten_dtype(typ)
                types.extend(flat_dt)
            return types

    def nested_masktype(datatype):
        """Construct the dtype of a mask for nested elements."""
        names = datatype.names
        if names:
            descr = []
            for name in names:
                (ndtype, _) = datatype.fields[name]
                descr.append((name, nested_masktype(ndtype)))
            return descr
        # Is this some kind of composite a la (np.float, 2)?
        elif datatype.subdtype:
            mdescr = list(datatype.subdtype)
            mdescr[0] = np.dtype(bool)
            return tuple(mdescr)
        else:
            return np.bool

    class LineSplitter:
        """
        Defines a function to split a string at a given delimiter or at given places.

        Parameters
        ----------
        comment : {'#', string}
            Character used to mark the beginning of a comment.
        delimiter : var
        """
        def __init__(self, delimiter=None, comments='#'):
            self.comments = comments
            # Delimiter is a character
            if delimiter is None:
                self._isfixed = False
                self.delimiter = None
            elif _is_string_like(delimiter):
                self._isfixed = False
                self.delimiter = delimiter.strip() or None
            # Delimiter is a list of field widths
            elif hasattr(delimiter, '__iter__'):
                self._isfixed = True
                idx = np.cumsum([0] + list(delimiter))
                self.slices = [slice(i, j) for (i, j) in zip(idx[:-1], idx[1:])]
            # Delimiter is a single integer
            elif int(delimiter):
                self._isfixed = True
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
Zachary Pincus wrote:
> Specifically, on line 115 in LineSplitter, we have:
>
>     self.delimiter = delimiter.strip() or None
>
> so if I pass in, say, '\t' as the delimiter, self.delimiter gets set to None, which then causes the default behavior of any-whitespace-is-delimiter to be used. This makes lines like "Gene Name\tPubMed ID\tStarting Position" get split wrong, even when I explicitly pass in '\t' as the delimiter!
>
> Similarly, I believe that some of the tests are formulated wrong:
>
>     def test_nodelimiter(self):
>         "Test LineSplitter w/o delimiter"
>         strg = " 1 2 3 4  5 # test"
>         test = LineSplitter(' ')(strg)
>         assert_equal(test, ['1', '2', '3', '4', '5'])
>
> I think that treating an explicitly-passed-in ' ' delimiter as identical to 'no delimiter' is a bad idea. If I say that ' ' is the delimiter, or '\t' is the delimiter, this should be treated *just* like ',' being the delimiter, where the expected output is: ['1', '2', '3', '4', '', '5']
>
> At least, that's what I would expect. Treating contiguous blocks of whitespace as single delimiters is perfectly reasonable when None is provided as the delimiter, but when an explicit delimiter has been provided, it strikes me that the code shouldn't try to further interpret it... Does anyone else have any opinion here?

I agree. If the user explicitly passes something as a delimiter, we should use it and not try to be too smart. +1

Ryan
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
Pierre GM wrote:
> Well, looks like the attachment is too big, so here's the implementation. The tests will come in another message.

A couple of quick nitpicks:

1) On line 186 (in the NameValidator class), you use excludelist.append() to append a list to the end of a list. I think you meant to use excludelist.extend().

2) When validating a list of names, why do you insist on lower-casing them? (I'm referring to the call to lower() on line 207.) On one hand, this would seem nicer than all upper case, but on the other hand this can cause confusion for someone who sees certain casing of names in the file and expects the data to be laid out the same way.

Other than those, it's working fine for me here.

Ryan
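A quick illustration of the append-vs-extend nitpick in point 1 (the list contents here are made up): append() nests the whole list as a single element, while extend() splices its elements in.

```python
excludelist = ['return', 'file', 'print']

a = list(excludelist)
a.append(['self', 'cls'])   # nests the list: last element is a list
b = list(excludelist)
b.extend(['self', 'cls'])   # splices the elements in

print(a)  # ['return', 'file', 'print', ['self', 'cls']]
print(b)  # ['return', 'file', 'print', 'self', 'cls']
```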
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
Stéfan van der Walt wrote:
> Hi Pierre
> 2008/12/1 Pierre GM [EMAIL PROTECTED]:
>> * `genloadtxt` is the base function that does all the work. It outputs 2 arrays, one for the data (missing values being substituted by the appropriate default) and one for the mask. It would go in np.lib.io.
>
> I see the code length increased from 200 lines to 800. This made me wonder about the execution time: initial benchmarks suggest a 3x slow-down. Could this be a problem for loading large text files? If so, should we consider keeping both versions around, or by default bypassing all the extra hooks?

I've wondered about this being an issue. On one hand, you hate to make existing code noticeably slower. On the other hand, if speed is important to you, why are you using ascii I/O?

I personally am not entirely against having two versions of loadtxt-like functions. However, the idea seems a little odd, seeing as how loadtxt was already supposed to be the swiss army knife of text reading.

I'm seeing a similar slowdown with Pierre's version of the code. The version of loadtxt that I cobbled together with the StringConverter class (and no missing value support) shows about a 50% slowdown, so clearly there's a performance penalty for trying to make a generic function that can be all things to all people. On the other hand, this approach reduces code duplication.

I'm not really opinionated on what the right approach is here. My only opinion is that this functionality *really* needs to be in numpy in some fashion. For my own use case, with the old version, I could read a text file and by hand separate out columns and mask values. Now, I open a file and get a structured array with an automatically detected dtype (names and types!) plus masked values.

My $0.02.

Ryan
Re: [Numpy-discussion] More loadtxt() changes
John Hunter wrote:
> On Tue, Nov 25, 2008 at 11:23 PM, Ryan May [EMAIL PROTECTED] wrote:
>> Updated patch attached. This includes:
>> * Updated docstring
>> * New tests
>> * Fixes for previous issues
>> * Fixes to make new tests actually work
>> I appreciate any and all feedback.
>
> I'm having trouble applying your patch, so I haven't tested yet, but do you (and do you want to) handle a case like this:
>
>     from StringIO import StringIO
>     import matplotlib.mlab as mlab
>     f1 = StringIO("""\
>     name   age  weight
>     John   23   145.
>     Harry  43   180.""")
>     for line in f1:
>         print line.split(' ')
>
> Ie, space delimited but using an irregular number of spaces? One place this comes up a lot is when the output files are actually fixed-width using spaces to line up the columns. One could count the columns to figure out the fixed widths and work with that, but it is much easier to simply assume space delimiting and handle the irregular number of spaces, assuming one or more spaces is the delimiter. In csv2rec, we write a custom file object to handle this case. Apologies if you are already handling this and I missed it...

I think line.split(None) handles this case, so *in theory* passing delimiter=None would do it. I *am* interested in this case, so I'll have to give it a try when I get a chance. (I sense this is the same case Manuel just asked about.)

Ryan
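The split(None) behavior mentioned in the reply, shown on a line like those in John's sample:

```python
line = "Harry  43   180."
# Splitting on a literal space keeps an empty string per extra space:
print(line.split(' '))
# Splitting on None collapses runs of whitespace, handling the
# irregular spacing of column-aligned files:
print(line.split(None))  # ['Harry', '43', '180.']
```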
Re: [Numpy-discussion] More loadtxt() changes
Manuel Metz wrote:
> Ryan May wrote:
>> 3) Better support for missing values. The docstring mentions a way of handling missing values by passing in a converter. The problem with this is that you have to pass in a converter for *every column* that will contain missing values. If you have a text file with 50 columns, writing this dictionary of converters seems like ugly and needless boilerplate. I'm unsure of how best to pass in both what values indicate missing values and what values to fill in their place. I'd love suggestions.
>
> Hi Ryan,
> this would be a great feature to have!!!

Thanks for the support!

> One question: I have a datafile in ASCII format that uses a fixed width for each column. If no data is present, the space is left empty (see second row). What is the default behavior of the StringConverter class in this case? Does it ignore the empty entry by default? If so, what is the value in the array in this case? Is it nan?
>
> Example file:
>
>     1| 123.4| -123.4| 00.0
>     2|      |  234.7| 12.2

I don't think this is so much anything to do with StringConverter, but more to do with how to split lines. Maybe we should add an option that, instead of simply specifying characters that delimit the fields, allows one to pass a custom function to split lines? That could be done either by overloading `delimiter` or by adding a new option like `splitter`. I'll have to give that some thought.

Ryan
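A hedged sketch of the `splitter` idea mentioned in the reply (the function and its signature are invented, not part of loadtxt): precompute slice objects from the field widths, so an all-blank field survives as an empty string instead of vanishing the way whitespace-splitting would make it.

```python
def make_fixed_width_splitter(widths):
    # Turn a list of field widths into slice objects, then split by slicing.
    offsets = [0]
    for w in widths:
        offsets.append(offsets[-1] + w)
    slices = [slice(i, j) for i, j in zip(offsets[:-1], offsets[1:])]
    return lambda line: [line[s].strip() for s in slices]

splitter = make_fixed_width_splitter([2, 8, 8, 6])
line = " 2" + " " * 8 + "   234.7" + "  12.2"  # second field is all blanks
print(splitter(line))  # ['2', '', '234.7', '12.2']
```

The empty second field comes back as '', which a converter could then map to a fill value or a mask.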
[Numpy-discussion] More loadtxt() changes
Hi, I have a couple more changes to loadtxt() that I'd like to code up in time for 1.3, but I thought I should run them by the list before doing too much work. These are already implemented in some fashion in matplotlib.mlab.csv2rec(), but the code bases are different enough that pretty much only the idea can be lifted. All of these changes would be done in a manner that is backwards compatible with the current API.

1) Support for setting the names of fields in the returned structured array without using dtype. This can be a passed-in list of names or reading the names of fields from the first line of the file. Many files have a header line that gives a name for each column. Adding this would obviously make loadtxt much more general and allow for more generic code, IMO. My current thinking is to add a *names* keyword parameter that defaults to None, for no support for reading names. Setting it to True would tell loadtxt() to read the names from the first line (after skiprows). The other option would be to set names to a list of strings.

2) Support for automatic dtype inference. Instead of assuming all values are floats, this would try a list of options until one worked. For strings, this would keep track of the longest string within a given field before setting the dtype. This would allow reading of files containing a mixture of types much more easily, without having to go to the trouble of constructing a full dtype by hand. This would work alongside any custom converters one passes in. My current thinking of API would just be to add the option of passing the string 'auto' as the dtype parameter.

3) Better support for missing values. The docstring mentions a way of handling missing values by passing in a converter. The problem with this is that you have to pass in a converter for *every column* that will contain missing values. If you have a text file with 50 columns, writing this dictionary of converters seems like ugly and needless boilerplate.
I'm unsure of how best to pass in both what values indicate missing values and what values to fill in their place. I'd love suggestions. Here's an example of my use case (without 50 columns):

    ID,First Name,Last Name,Homework1,Homework2,Quiz1,Homework3,Final
    1234,Joe,Smith,85,90,,76,
    5678,Jane,Doe,65,99,,78,
    9123,Joe,Plumber,45,90,,92,

Currently, reading in this file requires a bit of boilerplate (declaring dtypes, converters). While it's nothing I can't write, it still would be easier to write it once within loadtxt and have it for everyone. Any support for *any* of these ideas? Any suggestions on how the user should pass in the information? Thanks, Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
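[For concreteness, the per-column boilerplate being complained about looks roughly like this; `missing_to_nan` is a hypothetical helper written for this sketch, not a numpy function, and the column numbers match the example file above.]

```python
import math

def missing_to_nan(text):
    # Empty field -> nan; anything else is parsed as a float.
    text = text.strip()
    return float(text) if text else float('nan')

# One entry per column that may be empty -- with 50 columns this
# dictionary is exactly the boilerplate described above.
converters = {5: missing_to_nan, 7: missing_to_nan}  # Quiz1, Final

print(converters[5]('85'), converters[7]('').__class__.__name__)
```

A dictionary like this is what loadtxt's existing `converters` argument expects, which is why a file-wide "these strings mean missing" option would be so much lighter.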
Re: [Numpy-discussion] More loadtxt() changes
Pierre GM wrote: Ryan, FYI, I've been coding over the last couple of weeks an extension of loadtxt for a better support of masked data, with the option to read column names in a header. Please find an example below (I also have unittest). Most of the work is actually inspired from matplotlib's mlab.csv2rec. It might be worth not duplicating efforts. Cheers, P.

Absolutely! Definitely don't want to duplicate effort here. What I see here meets a lot of what I was looking for. Here are some questions:

1) It looks like the function returns a structured array rather than a rec array, so that fields are obtained by doing a dictionary access. Since it's a dictionary access, is there any reason that the header needs to be munged to replace characters and reserved names? IIUC, csv2rec changes names b/c it returns a rec array, which uses attribute lookup and hence all names need to be valid python identifiers. This is not the case for a structured array.

2) Can we avoid the use of seek() in here? I just posted a patch to change the check to readline, which was the only file function used previously. This allowed the direct use of a file-like object returned by urllib2.urlopen().

3) In order to avoid breaking backwards compatibility, can we change the default for dtype to be float32, and instead use some kind of special value ('auto'?) to use the automatic dtype determination?

I'm currently cooking up some of these changes myself, but thought I would see what you thought first. Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] More loadtxt() changes
Pierre GM wrote: On Nov 25, 2008, at 2:06 PM, Ryan May wrote: 1) It looks like the function returns a structured array rather than a rec array, so that fields are obtained by doing a dictionary access. Since it's a dictionary access, is there any reason that the header needs to be munged to replace characters and reserved names? IIUC, csv2rec changes names b/c it returns a rec array, which uses attribute lookup and hence all names need to be valid python identifiers. This is not the case for a structured array.

Personally, I prefer flexible ndarrays to recarrays, hence the output. However, I still think that names should be as clean as possible to avoid bad surprises down the road.

Ok, I'm not really partial to this, I just thought it would simplify. Your point is valid.

2) Can we avoid the use of seek() in here? I just posted a patch to change the check to readline, which was the only file function used previously. This allowed the direct use of a file-like object returned by urllib2.urlopen().

I coded that a couple of weeks ago, before you posted your patch, and I didn't have time to check it. Yes, we could try getting rid of seek. However, we need to find a way to rewind to the beginning of the file if the dtypes are not given in input (as we parsed the whole file to find the best converter in that case).

What about doing the parsing and type inference in a loop and holding onto the already split lines? Then loop through the lines with the converters that were finally chosen? In addition to making my use case work, this has the benefit of not doing the I/O twice.

3) In order to avoid breaking backwards compatibility, can we change the default for dtype to be float32, and instead use some kind of special value ('auto'?) to use the automatic dtype determination?

I'm not especially concerned w/ backwards compatibility, because we're supporting masked values (something that np.loadtxt shouldn't have to worry about).
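[The hold-the-split-lines idea can be sketched in a few lines; this is illustrative only -- the real implementation lives in loadtxt and handles far more cases.]

```python
def parse_once(lines):
    # First pass: split each line once and cache the pieces while
    # inferring a converter per column.
    rows = [line.split() for line in lines]
    converters = []
    for col in range(len(rows[0])):
        try:
            for row in rows:
                float(row[col])
            converters.append(float)
        except ValueError:
            converters.append(str)
    # Second pass: reuse the cached rows -- the file is read only once,
    # so no seek()/rewind is ever needed.
    return [[conv(val) for conv, val in zip(converters, row)]
            for row in rows]

print(parse_once(['John 23 145.', 'Harry 43 180.']))
```

Since the split rows are kept in memory anyway before the final array is built, this costs no extra temporaries, just an extra loop.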
Initially, I needed a replacement to the fromfile function in the scikits.timeseries.trecords package. I figured it'd be easier and more portable to get a function for generic masked arrays, that could be adapted afterwards to timeseries. In any case, I was more considering the functions I send you to be part of some numpy.ma.io module than a replacement to np.loadtxt. I tried to get the syntax as close as possible to np.loadtxt and mlab.csv2rec, but there'll always be some differences. So, yes, we could try to use a default dtype=float and yes, we could have an extra parameter 'auto'. But is it really that useful ? I'm not sure (well, no, I'm sure it's not...) I understand you're not concerned with backwards compatibility, but with the exception of missing handling, which is probably specific to masked arrays, I was hoping to just add functionality to loadtxt(). Numpy doesn't need a separate text reader for most of this and breaking API for any of this is likely a non-starter. So while, yes, having float be the default dtype is probably not the most useful, leaving it also doesn't break existing code. -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] More loadtxt() changes
On Nov 25, 2008, at 2:37 PM, Ryan May wrote: What about doing the parsing and type inference in a loop and holding onto the already split lines? Then loop through the lines with the converters that were finally chosen? In addition to making my use case work, this has the benefit of not doing the I/O twice.

You mean, filling a list and relooping on it if we need to? Sounds like a plan, but doesn't it create some extra temporaries we may not want?

It shouldn't create any *extra* temporaries since we already make a list of lists before creating the final array. It just introduces an extra looping step. (I'd reuse the existing list of lists.)

Depends on how we do it. We could have a modified np.loadtxt that takes some of the ideas of the file I sent you (the StringConverter, for example), then I could have a numpy.ma.io that would take care of the missing data. And something in scikits.timeseries for the dates... The new np.loadtxt could use the default of the initial one, or we could create yet another function (np.loadfromtxt) that would match what I was suggesting, and np.loadtxt would be a special stripped-down case with dtype=float by default. Thoughts?

My personal opinion is that if it doesn't make loadtxt too unwieldy, to just add a few of the options to loadtxt() itself. I'm working on tweaking loadtxt() to add the auto dtype and the names, relying heavily on your StringConverter class (nice code, btw). If my understanding of StringConverter is correct, tweaking the new loadtxt for ma or timeseries would only require passing in modified versions of StringConverter. I'll post that when I'm done and we can see if it looks like too much functionality stapled together or not. Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] More loadtxt() changes
Pierre GM wrote: Nope, we still need to double check whether there's any missing data in any field of the line we process, independently of the conversion. So there must be some extra loop involved, and I'd need a special function in numpy.ma to take care of that. So our options are * create a new function in numpy.ma and leave np.loadtxt like that * write a new np.loadtxt incorporating most of the ideas of the code I send, but I'd still need to adapt it to support masked values. You couldn't run this loop on the array returned by np.loadtxt() (by masking on the appropriate fill value)? I'll post that when I'm done and we can see if it looks like too much functionality stapled together or not. Sounds like a plan. Wouldn't mind getting more feedback from fellow users before we get too deep, however... Agreed. Anyone? -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
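[For what it's worth, the mask-after-the-fact approach asked about above can be done on the returned array with numpy.ma; a sketch, assuming missing values were filled with nan during conversion.]

```python
import numpy as np

# nan stands in for the missing entries after conversion...
data = np.array([85.0, 90.0, np.nan, 76.0])

# ...and masking them afterwards is a one-liner.
masked = np.ma.masked_invalid(data)
print(masked.mask.tolist())  # [False, False, True, False]
```

For non-float columns one would mask on the fill value instead (np.ma.masked_values), which is presumably why Pierre wants the check done per-field during parsing.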
Re: [Numpy-discussion] More loadtxt() changes
Pierre GM wrote: Sounds like a plan. Wouldn't mind getting more feedback from fellow users before we get too deep, however...

Ok, I've attached, as a first cut, a diff against SVN HEAD that does (I think) what I'm looking for. It passes all of the old tests and passes my own quick test. A more rigorous test suite will follow, but I want this out the door before I need to leave for the day. What this changeset essentially does is just add support for automatic dtypes along with supplying/reading names for flexible dtypes. It leverages StringConverter heavily, using a few tweaks so that old behavior is kept. This is by no means a final version. Probably the biggest change from what I mentioned earlier is that instead of dtype='auto', I've used dtype=None to signal the detection code, since dtype=='auto' causes problems. I welcome any and all suggestions here, both on the code and on the original idea of adding these capabilities to loadtxt(). Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma

Index: lib/io.py
===================================================================
--- lib/io.py	(revision 6099)
+++ lib/io.py	(working copy)
@@ -233,29 +233,138 @@
         for name in todel:
             os.remove(name)
 
-# Adapted from matplotlib
+def _string_like(obj):
+    try: obj + ''
+    except (TypeError, ValueError): return False
+    return True
 
-def _getconv(dtype):
-    typ = dtype.type
-    if issubclass(typ, np.bool_):
-        return lambda x: bool(int(x))
-    if issubclass(typ, np.integer):
-        return lambda x: int(float(x))
-    elif issubclass(typ, np.floating):
-        return float
-    elif issubclass(typ, np.complex):
-        return complex
+def str2bool(value):
+    """
+    Tries to transform a string supposed to represent a boolean to a boolean.
+
+    Raises
+    ------
+    ValueError
+        If the string is not 'True' or 'False' (case independent)
+    """
+    value = value.upper()
+    if value == 'TRUE':
+        return True
+    elif value == 'FALSE':
+        return False
     else:
-        return str
+        return int(bool(value))
 
+class StringConverter(object):
+    """
+    Factory class for function transforming a string into another object (int,
+    float).
 
-def _string_like(obj):
-    try: obj + ''
-    except (TypeError, ValueError): return 0
-    return 1
+    After initialization, an instance can be called to transform a string
+    into another object. If the string is recognized as representing a missing
+    value, a default value is returned.
 
+    Parameters
+    ----------
+    dtype : dtype, optional
+        Input data type, used to define a basic function and a default value
+        for missing data. For example, when `dtype` is float, the :attr:`func`
+        attribute is set to ``float`` and the default value to `np.nan`.
+    missing_values : sequence, optional
+        Sequence of strings indicating a missing value.
+
+    Attributes
+    ----------
+    func : function
+        Function used for the conversion
+    default : var
+        Default value to return when the input corresponds to a missing value.
+    mapper : sequence of tuples
+        Sequence of tuples (function, default value) to evaluate in order.
+    """
+
+    from numpy.core import nan  # To avoid circular import
+    mapper = [(str2bool, None),
+              (lambda x: int(float(x)), -1),
+              (float, nan),
+              (complex, nan+0j),
+              (str, '???')]
+
+    def __init__(self, dtype=None, missing_values=None):
+        if dtype is None:
+            self.func = str2bool
+            self.default = None
+            self._status = 0
+        else:
+            dtype = np.dtype(dtype).type
+            self.func, self.default, self._status = self._get_from_dtype(dtype)
+
+        # Store the list of strings corresponding to missing values.
+        if missing_values is None:
+            self.missing_values = []
+        else:
+            self.missing_values = set(list(missing_values) + [''])
+
+    def __call__(self, value):
+        if value in self.missing_values:
+            return self.default
+        return self.func(value)
+
+    def upgrade(self, value):
+        """
+        Tries to find the best converter for `value`, by testing different
+        converters in order.
+        The order in which the converters are tested is read from the
+        :attr:`_status` attribute of the instance.
+        """
+        try:
+            self.__call__(value)
+        except ValueError:
+            _statusmax = len(self.mapper)
+            if self._status == _statusmax:
+                raise ValueError("Could not find a valid conversion function")
+            elif self._status < _statusmax - 1:
+                self._status += 1
+                (self.func, self.default) = self.mapper[self._status]
+                self.upgrade(value)
+
+    def _get_from_dtype(self, dtype):
+        """
+        Sets the :attr:`func
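[The reading-names-from-the-first-line part of the changeset above can be sketched independently of the patch; `read_names` is a simplified stand-in written for this illustration, not the actual loadtxt code.]

```python
from io import StringIO

def read_names(fh, skiprows=0):
    # Skip the requested rows, then split the next line into field names,
    # mirroring the proposed names=True behavior.
    for _ in range(skiprows):
        fh.readline()
    return fh.readline().split()

f = StringIO('# comment\nname age weight\nJohn 23 145.\n')
print(read_names(f, skiprows=1))  # ['name', 'age', 'weight']
```

The names would then be paired with the per-column dtypes that StringConverter infers to build the flexible dtype.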
Re: [Numpy-discussion] More loadtxt() changes
Pierre GM wrote: Ryan, Quick comments: * I already have some unittests for StringConverter, check the file I attach. Ok, great. * Your str2bool will probably mess things up in upgrade compared to the one JDH had written (the one I send you): you don't wanna use int(bool(value)), as it'll always give you 0 or 1 when you might need a ValueError Ok, I wasn't sure. I was trying to merge what the old code used with the new str2bool you supplied. That's probably not all that necessary. * Your locked version of update won't probably work either, as you force the converter to output a string (you set the status to largest possible, that's the one that outputs strings). Why don't you set the status to the current one (make a tmp one if needed). Looking at the code, it looks like mapper is only used in the upgrade() method. My goal by setting status to the largest possible is to lock the converter to the supplied function. That way for the user supplied converters, the StringConverter doesn't try to upgrade away from it. My thinking was that if the user supplied converter function fails, the user should know. (Though I got this wrong the first time.) * I'd probably get rid of StringConverter._get_from_dtype, as it is not needed outside the __init__. You may wanna stick to the original __init__. Done. Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] More loadtxt() changes
Pierre GM wrote: On Nov 25, 2008, at 10:02 PM, Ryan May wrote: Pierre GM wrote: * Your locked version of update won't probably work either, as you force the converter to output a string (you set the status to largest possible, that's the one that outputs strings). Why don't you set the status to the current one (make a tmp one if needed). Looking at the code, it looks like mapper is only used in the upgrade() method. My goal by setting status to the largest possible is to lock the converter to the supplied function. That way for the user supplied converters, the StringConverter doesn't try to upgrade away from it. My thinking was that if the user supplied converter function fails, the user should know. (Though I got this wrong the first time.) Then, define a _locked attribute in StringConverter, and prevent upgrade to run if self._locked is True. Sure if you're into logic and sound design. I was going more for hackish and obtuse. (No seriously, I don't know why I didn't think of that.) Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
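[A minimal sketch of the _locked idea discussed above; the class name and the simplified mapper are hypothetical, not numpy's actual StringConverter.]

```python
class LockableConverter:
    # Candidate conversion functions, from most to least specific.
    _mapper = [int, float, str]

    def __init__(self, func=None):
        self._locked = func is not None  # user-supplied converters lock
        self._status = 0
        self.func = func if func is not None else self._mapper[0]

    def upgrade(self, value):
        if self._locked:
            # A locked converter fails loudly instead of upgrading
            # away from the user's function.
            return self.func(value)
        try:
            return self.func(value)
        except ValueError:
            self._status += 1
            self.func = self._mapper[self._status]
            return self.upgrade(value)

c = LockableConverter()
print(c.upgrade('1.5'), c.func is float)  # upgraded past int
```

A locked instance given int would raise ValueError on '1.5', which is exactly the "user should know" behavior Ryan wants.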
[Numpy-discussion] Minimum dtype
Hi, I'm running on a 64-bit machine, and see the following:

    >>> numpy.array(64.6).dtype
    dtype('float64')
    >>> numpy.array(64).dtype
    dtype('int64')

Is there any function/setting to make these default to 32-bit types except where necessary? I don't mean by specifying dtype=numpy.float32 or dtype=numpy.int32. Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] More loadtxt() changes
Pierre GM wrote: On Nov 25, 2008, at 10:02 PM, Ryan May wrote: Pierre GM wrote: * Your locked version of update won't probably work either, as you force the converter to output a string (you set the status to largest possible, that's the one that outputs strings). Why don't you set the status to the current one (make a tmp one if needed). Looking at the code, it looks like mapper is only used in the upgrade() method. My goal by setting status to the largest possible is to lock the converter to the supplied function. That way for the user supplied converters, the StringConverter doesn't try to upgrade away from it. My thinking was that if the user supplied converter function fails, the user should know. (Though I got this wrong the first time.)

Updated patch attached. This includes: * Updated docstring * New tests * Fixes for previous issues * Fixes to make new tests actually work

I appreciate any and all feedback. Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma

Index: numpy/lib/io.py
===================================================================
--- numpy/lib/io.py	(revision 6107)
+++ numpy/lib/io.py	(working copy)
@@ -233,29 +233,136 @@
         for name in todel:
             os.remove(name)
 
-# Adapted from matplotlib
+def _string_like(obj):
+    try: obj + ''
+    except (TypeError, ValueError): return False
+    return True
 
-def _getconv(dtype):
-    typ = dtype.type
-    if issubclass(typ, np.bool_):
-        return lambda x: bool(int(x))
-    if issubclass(typ, np.integer):
-        return lambda x: int(float(x))
-    elif issubclass(typ, np.floating):
-        return float
-    elif issubclass(typ, np.complex):
-        return complex
+def str2bool(value):
+    """
+    Tries to transform a string supposed to represent a boolean to a boolean.
+
+    Raises
+    ------
+    ValueError
+        If the string is not 'True' or 'False' (case independent)
+    """
+    value = value.upper()
+    if value == 'TRUE':
+        return True
+    elif value == 'FALSE':
+        return False
     else:
-        return str
+        raise ValueError("Invalid boolean")
 
+class StringConverter(object):
+    """
+    Factory class for function transforming a string into another object (int,
+    float).
 
-def _string_like(obj):
-    try: obj + ''
-    except (TypeError, ValueError): return 0
-    return 1
+    After initialization, an instance can be called to transform a string
+    into another object. If the string is recognized as representing a missing
+    value, a default value is returned.
 
+    Parameters
+    ----------
+    dtype : dtype, optional
+        Input data type, used to define a basic function and a default value
+        for missing data. For example, when `dtype` is float, the :attr:`func`
+        attribute is set to ``float`` and the default value to `np.nan`.
+    missing_values : sequence, optional
+        Sequence of strings indicating a missing value.
+
+    Attributes
+    ----------
+    func : function
+        Function used for the conversion
+    default : var
+        Default value to return when the input corresponds to a missing value.
+    mapper : sequence of tuples
+        Sequence of tuples (function, default value) to evaluate in order.
+    """
+
+    from numpy.core import nan  # To avoid circular import
+    mapper = [(str2bool, None),
+              (int, -1),  # Needs to be int so that it can fail and promote
+                          # to float
+              (float, nan),
+              (complex, nan+0j),
+              (str, '???')]
+
+    def __init__(self, dtype=None, missing_values=None):
+        self._locked = False
+        if dtype is None:
+            self.func = str2bool
+            self.default = None
+            self._status = 0
+        else:
+            dtype = np.dtype(dtype).type
+            if issubclass(dtype, np.bool_):
+                (self.func, self.default, self._status) = (str2bool, 0, 0)
+            elif issubclass(dtype, np.integer):
+                # Needs to be int(float(x)) so that floating point values will
+                # be coerced to int when specified by dtype
+                (self.func, self.default, self._status) = (lambda x: int(float(x)), -1, 1)
+            elif issubclass(dtype, np.floating):
+                (self.func, self.default, self._status) = (float, np.nan, 2)
+            elif issubclass(dtype, np.complex):
+                (self.func, self.default, self._status) = (complex, np.nan + 0j, 3)
+            else:
+                (self.func, self.default, self._status) = (str, '???', -1)
+
+        # Store the list of strings corresponding to missing values.
+        if missing_values is None:
+            self.missing_values = []
+        else:
+            self.missing_values = set(list(missing_values) + [''])
+
+    def __call__(self, value):
+        if value in self.missing_values:
+            return self.default
+        return
[Numpy-discussion] numpy.loadtxt requires seek()?
Hi, Does anyone know why numpy.loadtxt(), in checking the validity of a filehandle, checks for the seek() method, which appears to have no bearing on whether an object will work? I'm trying to use loadtxt() directly with the file-like object returned by urllib2.urlopen(). If I change the check for 'seek' to one for 'readline', using the urlopen object works without a hitch. As far as I can tell, all the filehandle object needs to meet is:

1) Have a readline() method so that loadtxt can skip the first N lines and read the first line of data
2) Be compatible with itertools.chain() (should be any iterable)

At a minimum, I'd ask to change the check for 'seek' to one for 'readline'. On a bit deeper thought, it would seem that loadtxt would work with any iterable that returns individual lines. I'd like then to change the calls to readline() to just getting the next object from the iterable (iter.next()?) and change the check for a file-like object to just a check for an iterable. In fact, we could use the iter() builtin to convert whatever got passed. That would automatically give a next() method and would raise a TypeError if it's incompatible. Thoughts? I'm willing to write up the patch for either. Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
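[The iter()-based dispatch proposed above amounts to just a couple of lines; a sketch of the idea, not the eventual loadtxt code.]

```python
def line_source(fname):
    # iter() accepts open files, lists of strings, generators, ...
    # and raises TypeError for anything non-iterable, for free.
    return iter(fname)

lines = line_source(['# header', '1 2 3'])
print(next(lines))  # replaces the readline() call used for skiprows
```

Skipping header lines then becomes calling next() the requested number of times, after which the remaining iterator feeds the parsing loop directly.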
Re: [Numpy-discussion] numpy.loadtxt requires seek()?
Stéfan van der Walt wrote: 2008/11/20 Ryan May [EMAIL PROTECTED]: Does anyone know why numpy.loadtxt(), in checking the validity of a filehandle, checks for the seek() method, which appears to have no bearing on whether an object will work? I think this is simply a naive mistake on my part. I was looking for a way to identify files; your patch would be welcome.

I've attached a simple patch that changes the check for seek() to a check for readline(). I'll punt on my idea of just using iterators, since that seems like slightly greater complexity for no gain. (I'm not sure how many people end up with data in a list of strings and wish they could pass that to loadtxt.) While you're at it, would you commit my patch to add support for bzipped files as well (attached)? Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma

Index: numpy/lib/io.py
===================================================================
--- numpy/lib/io.py	(revision 5953)
+++ numpy/lib/io.py	(working copy)
@@ -253,8 +253,8 @@
     Parameters
     ----------
     fname : file or string
-        File or filename to read. If the filename extension is ``.gz``,
-        the file is first decompressed.
+        File or filename to read. If the filename extension is ``.gz`` or
+        ``.bz2``, the file is first decompressed.
     dtype : data-type
         Data type of the resulting array. If this is a record data-type,
         the resulting array will be 1-dimensional, and each row will be
@@ -320,6 +320,9 @@
         if fname.endswith('.gz'):
             import gzip
             fh = gzip.open(fname)
+        elif fname.endswith('.bz2'):
+            import bz2
+            fh = bz2.BZ2File(fname)
         else:
             fh = file(fname)
     elif hasattr(fname, 'seek'):

Index: numpy/lib/io.py
===================================================================
--- numpy/lib/io.py	(revision 6085)
+++ numpy/lib/io.py	(working copy)
@@ -333,7 +333,7 @@
             fh = gzip.open(fname)
         else:
             fh = file(fname)
-    elif hasattr(fname, 'seek'):
+    elif hasattr(fname, 'readline'):
         fh = fname
     else:
         raise ValueError('fname must be a string or file handle')

___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] Matlib docstring typos
Hi, Here's a quick diff to fix some typos in the docstrings for matlib.zeros and matlib.ones. They're causing 2 (of many) failures in the doctests for me on SVN HEAD. Filed in trac as #953 (http://www.scipy.org/scipy/numpy/ticket/953) (Unless someone wants to give me SVN rights for fixing/adding small things like this.) Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Matlib docstring typos
Pauli Virtanen wrote: Hi, Wed, 12 Nov 2008 10:16:35 -0600, Ryan May wrote: Here's a quick diff to fix some typos in the docstrings for matlib.zeros and matlib.ones. They're causing 2 (of many) failures in the doctests for me on SVN HEAD. There are probably bound to be more of these. It's possible to fix them using this: http://docs.scipy.org/numpy/ http://docs.scipy.org/numpy/docs/numpy.matlib.zeros/ http://docs.scipy.org/numpy/docs/numpy.matlib.ones/ The changes will propagate from there eventually to SVN, alongside all other documentation improvements. Great, can someone get me edit access? User: rmay Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] setting element
Charles سمير Doutriaux wrote: Hello, I'm wondering if there's a quick way to do the following:

    s[:,5] = value

in a general function:

    def setval(array, index, value, axis=0):
        ## code here

Assuming that axis specifies where the index goes, that would be:

    def setval(array, index, value, axis=0):
        slices = [slice(None)] * len(array.shape)
        slices[axis] = index
        array[slices] = value

(Adapted from the code for numpy.diff) Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
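[For completeness, here is the same function with a usage example; it is restated with the tuple indexing that modern numpy requires, since indexing with a plain list of slices is no longer supported.]

```python
import numpy as np

def setval(array, index, value, axis=0):
    # Build a full-slice index, then replace the entry on the given axis.
    slices = [slice(None)] * array.ndim
    slices[axis] = index
    array[tuple(slices)] = value  # tuple, not list, on modern numpy

a = np.zeros((3, 4))
setval(a, 1, 7.0, axis=1)  # equivalent to a[:, 1] = 7.0
print(a[:, 1])
```

With axis=0 the same call reduces to a[index] = value, so one function covers every axis.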
Re: [Numpy-discussion] numpy array change notifier?
Brent Pedersen wrote: On Mon, Oct 27, 2008 at 1:56 PM, Robert Kern [EMAIL PROTECTED] wrote: On Mon, Oct 27, 2008 at 15:54, Erik Tollerud [EMAIL PROTECTED] wrote: Is there any straightforward way of notifying on change of a numpy array that leaves the numpy arrays still efficient? Not currently, no. -- Robert Kern I have come to believe that the whole world is an enigma, a harmless enigma that is made terrible by our own mad attempt to interpret it as though it had an underlying truth. -- Umberto Eco ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion

out of curiosity, would something like this affect efficiency (and/or work):

    class Notify(numpy.ndarray):
        def __setitem__(self, *args):
            self.notify(*args)
            return super(Notify, self).__setitem__(*args)

        def notify(self, *args):
            print 'notify:', args

with also overriding __setslice__?

I haven't given this much thought, but you'd also likely need to do this for the infix operators (+=, etc.). Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
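[To illustrate the point about the in-place operators: a fuller notifier would also hook __iadd__ and friends, since in-place arithmetic never goes through __setitem__. A sketch in modern-Python syntax; the snippet above is Python 2.]

```python
import numpy as np

class Notify(np.ndarray):
    events = []  # record notifications so the effect is visible

    def __setitem__(self, key, value):
        self.notify('setitem', key, value)
        super().__setitem__(key, value)

    def __iadd__(self, other):
        # In-place ops bypass __setitem__, so they need their own hook.
        self.notify('iadd', other)
        return super().__iadd__(other)

    def notify(self, *args):
        Notify.events.append(args)

a = np.zeros(3).view(Notify)
a[0] = 1.0   # triggers the setitem hook
a += 1.0     # triggers the iadd hook
print(len(Notify.events), float(a[0]))
```

The same pattern would have to be repeated for __isub__, __imul__, etc., which is part of why no general notification mechanism exists.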
Re: [Numpy-discussion] Loadtxt .bz2 support
Charles R Harris wrote: On Tue, Oct 21, 2008 at 1:30 PM, Ryan May [EMAIL PROTECTED] wrote: Hi, I noticed numpy.loadtxt has support for gzipped text files, but not for bz2'd files. Here's a 3-line patch to add bzip2 support to loadtxt. Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma

Index: numpy/lib/io.py
===================================================================
--- numpy/lib/io.py	(revision 5953)
+++ numpy/lib/io.py	(working copy)
@@ -320,6 +320,9 @@
         if fname.endswith('.gz'):
             import gzip
             fh = gzip.open(fname)
+        elif fname.endswith('.bz2'):
+            import bz2
+            fh = bz2.BZ2File(fname)
         else:
             fh = file(fname)
     elif hasattr(fname, 'seek'):

Could you open a ticket for this? Mark it as an enhancement.

Done. #940 http://scipy.org/scipy/numpy/ticket/940 Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] Loadtxt .bz2 support
Hi, I noticed numpy.loadtxt has support for gzipped text files, but not for bz2'd files. Here's a 3-line patch to add bzip2 support to loadtxt. Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma

Index: numpy/lib/io.py
===================================================================
--- numpy/lib/io.py	(revision 5953)
+++ numpy/lib/io.py	(working copy)
@@ -320,6 +320,9 @@
         if fname.endswith('.gz'):
             import gzip
             fh = gzip.open(fname)
+        elif fname.endswith('.bz2'):
+            import bz2
+            fh = bz2.BZ2File(fname)
         else:
             fh = file(fname)
     elif hasattr(fname, 'seek'):

___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
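[With a patch like this applied, usage is symmetric with the existing gzip support. The sketch below shows the intended effect; the file name is illustrative, and current numpy releases do ship this behavior.]

```python
import bz2
import os
import tempfile

import numpy as np

# Write a small bz2-compressed text file, then hand the filename
# straight to loadtxt -- no manual decompression step.
path = os.path.join(tempfile.mkdtemp(), 'data.txt.bz2')
with bz2.open(path, 'wt') as f:
    f.write('1 2\n3 4\n')

a = np.loadtxt(path)
print(a.shape)
```

The extension check mirrors the existing `.gz` branch, so adding further formats later follows the same three-line pattern.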
Re: [Numpy-discussion] Need **working** code example of 2-D arrays
Stéfan van der Walt wrote: Linda, 2008/10/13 Linda Seltzer [EMAIL PROTECTED]: Those statements are not demeaning; lighten up. STOP IT. JUST STOP IT. STOP IT RIGHT NOW. Is there a moderator on the list to put a stop to these kinds of statements? I deserve to be treated with respect. I deserve to have my questions treated with respect. I deserve to receive technical information without personal attacks. I think you'll be hard pressed to find a more friendly, open and relaxed mailing list than this one. We're like having piña coladas while we type. That said, keep in mind that you are asking professionals to donate *their* valuable time to solve *your* problem. They gladly do so, but at the same time they try to be efficient; so if you sometimes receive a curt answer, it certainly wasn't meant to be rude. Many of us also sprinkle our responses with a liberal dose of Tongue In Cheek :) It looks like you received some good answers to your question, but let us know if your problems persist and we'll help you sort it out. Well said. Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Array shape
Kelly Vincent wrote: I'm using Numpy to do some basic array manipulation, and I'm getting some unexpected behavior from shape. Specifically, I have some 3x3 and 2x2 matrices, and shape gives me (5, 3) and (3, 2) for their respective sizes. I was expecting (3, 3) and (2, 2), for number of rows, number of columns. I'm assuming I must either be misunderstanding what shape gives you or doing something wrong. Can anybody give me any advice? I'm using Python 2.5 and Numpy 1.1.0. Can you post a complete, minimal example that shows the problem you have? For an array object A, A.shape should give the shape you're expecting. Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
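For what it's worth, a minimal session showing the expected behavior (my own example, not Kelly's actual data):

```python
import numpy as np

# shape is (number of rows, number of columns) for a 2-D array;
# getting (5, 3) for a "3x3" matrix usually means the array was
# built with more rows than intended.
A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
print(A.shape)   # (3, 3)

B = np.eye(2)
print(B.shape)   # (2, 2)
```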
Re: [Numpy-discussion] profiling line by line
Ondrej Certik wrote: On Thu, Sep 18, 2008 at 1:01 PM, Robert Cimrman [EMAIL PROTECTED] wrote: It requires Cython and a C compiler to build. I'm still debating myself about the desired workflow for using it, but for now, it only profiles functions which you have registered with it. I have made the profiler work as a decorator to make this easy. E.g., many thanks for this! I have wanted to try out the profiler but failed to build it (changeset 6 0de294aa75bf): $ python setup.py install --root=/home/share/software/ running install running build running build_py creating build creating build/lib.linux-i686-2.4 copying line_profiler.py - build/lib.linux-i686-2.4 running build_ext cythoning _line_profiler.pyx to _line_profiler.c building '_line_profiler' extension creating build/temp.linux-i686-2.4 i486-pc-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -fPIC -I/usr/include/python2.4 -c -I/usr/include/python2.4 -c _line_profiler.c -o build/temp.linux-i686-2.4/_line_profiler.o _line_profiler.c:1614: error: 'T_LONGLONG' undeclared here (not in a function) error: command 'i486-pc-linux-gnu-gcc' failed with exit status 1 I have cython-0.9.8.1 and GCC 4.1.2, 32-bit machine. I am telling you all the time Robert to use Debian that it just works and you say, no no, gentoo is the best. :) And what's wrong with that? :) Once you get over the learning curve, Gentoo works just fine. Must be Robert K.'s fault. :) Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] A bug in loadtxt and how to convert a string array (hex data) to decimal?
frank wang wrote: Hi, All, I have found a bug in the loadtxt function. Here is the example. The file name is test.txt and contains:

Thist is test
3FF 3fE
3Ef 3e8
3Df 3d9
3cF 3c7

In Python 2.5.2, I type:

test = loadtxt('test.txt', comments='', dtype='string', converters={0:lambda s:int(s,16)})

test will contain:

array([['102', '3fE'],
       ['100', '3e8'],
       ['991', '3d9'],
       ['975', '3c7']], dtype='|S3')

It's because of how numpy handles string arrays (which I admit I don't understand very well). Basically, it's converting the numbers properly, but truncating them to 3 characters. Try this, which just forces it to expand to strings 4 characters wide:

test = loadtxt('test.txt', comments='', dtype='|S4', converters={0:lambda s:int(s,16)})

HTH, Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
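An alternative that sidesteps the fixed-width string truncation entirely (my own sketch, not using loadtxt) is to convert every hex field to an integer before building the array:

```python
import numpy as np

# Stand-ins for the data rows of test.txt.
rows = ["3FF 3fE", "3Ef 3e8", "3Df 3d9", "3cF 3c7"]

# int(s, 16) parses each hex string; the result is a plain integer
# array, so no '|S3'-style truncation can occur.
data = np.array([[int(v, 16) for v in line.split()] for line in rows])
print(data)
# [[1023 1022]
#  [1007 1000]
#  [ 991  985]
#  [ 975  967]]
```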
Re: [Numpy-discussion] BUG in numpy.loadtxt?
David Huard wrote: Hi Ryan, I applied your patch in r5788 on the trunk. I noticed there was another bug occurring when both converters and usecols are provided. I've added regression tests for both bugs. Could you confirm that everything is fine on your side ? I can confirm that it works fine for me. Can you or someone else backport this to the 1.2 branch so that this bug is fixed in the next release? Thanks, Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] BUG in numpy.loadtxt?
Thanks a bunch for getting these done. David Huard wrote: Done in r5790. On Fri, Sep 5, 2008 at 12:36 PM, Ryan May [EMAIL PROTECTED] mailto:[EMAIL PROTECTED] wrote: David Huard wrote: Hi Ryan, I applied your patch in r5788 on the trunk. I noticed there was another bug occurring when both converters and usecols are provided. I've added regression tests for both bugs. Could you confirm that everything is fine on your side ? I can confirm that it works fine for me. Can you or someone else backport this to the 1.2 branch so that this bug is fixed in the next release? Thanks, Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org mailto:Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
[Numpy-discussion] BUG in numpy.loadtxt?
Stefan (or anyone else who can comment), It appears that the usecols argument to loadtxt no longer accepts numpy arrays:

from StringIO import StringIO
text = StringIO('1 2 3\n4 5 6\n')
data = np.loadtxt(text, usecols=np.arange(1,3))

ValueError                                Traceback (most recent call last)
/usr/lib64/python2.5/site-packages/numpy/lib/io.py in loadtxt(fname, dtype, comments, delimiter, converters, skiprows, usecols, unpack)
    323     first_line = fh.readline()
    324     first_vals = split_line(first_line)
--> 325     N = len(usecols or first_vals)
    326
    327     dtype_types = flatten_dtype(dtype)

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

data = np.loadtxt(text, usecols=np.arange(1,3).tolist())
data
array([[ 2.,  3.],
       [ 5.,  6.]])

Was it a conscious design decision that usecols no longer accepts arrays? The new behavior (in 1.1.1) breaks existing code that one of my colleagues has. Can we get a patch in before 1.2 to get this working with arrays again? Thanks, Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
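As a workaround until that's settled, coercing usecols to a plain list works with any loadtxt version (current NumPy accepts arrays here again, so this example runs as-is):

```python
import io
import numpy as np

text = io.StringIO('1 2 3\n4 5 6\n')
# list() accepts any iterable -- arrays, ranges, tuples -- so this
# works regardless of which loadtxt implementation is installed.
cols = list(np.arange(1, 3))
data = np.loadtxt(text, usecols=cols)
print(data)
# [[2. 3.]
#  [5. 6.]]
```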
Re: [Numpy-discussion] BUG in numpy.loadtxt?
Travis E. Oliphant wrote: Ryan May wrote: Stefan (or anyone else who can comment), It appears that the usecols argument to loadtxt no longer accepts numpy arrays: Could you enter a ticket so we don't lose track of this. I don't remember anything being intentional. Done: #905 http://scipy.org/scipy/numpy/ticket/905 I've attached a patch that does the obvious and coerces usecols to a list when it's not None, so it will work for any iterable. I don't think it was a conscious decision, just a consequence of the rewrite using different methods. There are two problems: 1) It's an API break, technically speaking 2) It currently doesn't even accept tuples, which are used in the docstring. Can we hurry and get this into 1.2? Thanks, Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] min() of array containing NaN
Availability of the NaN functionality in a method of ndarray

The last point is key. The NaN behavior is central to analyzing real data containing unavoidable bad values, which is the bread and butter of a substantial fraction of the user base. In the languages they're switching from, handling NaNs is just part of doing business, and is an option of every relevant routine; there's no need for redundant sets of routines. In contrast, numpy appears to consider data analysis to be secondary, somehow, to pure math, and takes the NaN functionality out of routines like min() and std(). This means it's not possible to use many ndarray methods. If we're ready to handle a NaN by returning it, why not enable the more useful behavior of ignoring it, at user discretion?

Maybe I missed this somewhere, but this seems like a better use for masked arrays, not NaNs. Masked arrays were specifically designed to add functions that work well with masked/invalid data points. Why reinvent the wheel here? Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
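A sketch of the masked-array route Ryan suggests (the data values are my own; masked_invalid is the numpy.ma helper that masks NaNs and infs):

```python
import numpy as np
import numpy.ma as ma

data = np.array([3.0, np.nan, 1.0, 2.0])

# min() on the raw array would propagate the NaN; masking the
# invalid entries lets the reductions ignore it instead.
clean = ma.masked_invalid(data)
print(clean.min())    # 1.0
print(clean.mean())   # 2.0
```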
Re: [Numpy-discussion] 1.1.1rc1 to be tagged tonight
Jarrod Millman wrote: Hello, This is a reminder that 1.1.1rc1 will be tagged tonight. Chuck is planning to spend some time today fixing a few final bugs on the 1.1.x branch. If anyone else is planning to commit anything to the 1.1.x branch today, please let me know immediately. Obviously now is not the time to commit anything to the branch that could break anything, so please be extremely careful if you have to touch the branch. Once the release is tagged, Chris and David will create binary installers for both Windows and Mac. Hopefully, this will give us an opportunity to have much more widespread testing before releasing 1.1.1 final at the end of the month.

Can I get anyone to look at this patch for loadtxt()? I was trying to use loadtxt() today to read in some text data, and I had a problem when I specified a dtype that only contained as many elements as there are columns in usecols. The example below shows the problem:

import numpy as np
import StringIO
data = '''STID RELH TAIR
JOE 70.1 25.3
BOB 60.5 27.9
'''
f = StringIO.StringIO(data)
names = ['stid', 'temp']
dtypes = ['S4', 'f8']
arr = np.loadtxt(f, usecols=(0,2), dtype=zip(names,dtypes), skiprows=1)

With current 1.1 (and SVN head), this yields:

IndexError                                Traceback (most recent call last)
/home/rmay/<ipython console> in <module>()
/usr/lib64/python2.5/site-packages/numpy/lib/io.pyc in loadtxt(fname, dtype, comments, delimiter, converters, skiprows, usecols, unpack)
    309                                 for j in xrange(len(vals))]
    310             if usecols is not None:
--> 311                 row = [converterseq[j](vals[j]) for j in usecols]
    312             else:
    313                 row = [converterseq[j](val) for j,val in enumerate(vals)]

IndexError: list index out of range

I've added a patch that checks for usecols and, if present, correctly creates the converters dictionary to map each specified column to the converter for the corresponding field in the dtype.
With the attached patch, this works fine:

arr
array([('JOE', 25.301), ('BOB', 27.899)],
      dtype=[('stid', '|S4'), ('temp', 'f8')])

Thanks, Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma

--- io.py.bak	2008-07-18 18:12:17.0 -0400
+++ io.py	2008-07-16 22:49:13.0 -0400
@@ -292,8 +292,13 @@
     if converters is None:
         converters = {}
     if dtype.names is not None:
-        converterseq = [_getconv(dtype.fields[name][0]) \
-                        for name in dtype.names]
+        if usecols is None:
+            converterseq = [_getconv(dtype.fields[name][0]) \
+                            for name in dtype.names]
+        else:
+            converters.update([(col,_getconv(dtype.fields[name][0])) \
+                for col,name in zip(usecols, dtype.names)])
+
     for i,line in enumerate(fh):
         if i<skiprows: continue

___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
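For reference, the structured-dtype-plus-usecols combination from the example works out of the box in current NumPy (rewritten for Python 3 by me; 'U4' stands in for the old 'S4' byte strings):

```python
import io
import numpy as np

data = '''STID RELH TAIR
JOE 70.1 25.3
BOB 60.5 27.9
'''
f = io.StringIO(data)

# Two usecols entries line up with the two fields of the dtype.
dt = np.dtype([('stid', 'U4'), ('temp', 'f8')])
arr = np.loadtxt(f, usecols=(0, 2), dtype=dt, skiprows=1)
print(arr['stid'])   # ['JOE' 'BOB']
```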
[Numpy-discussion] Masked array fill_value
Hi, I just noticed this and found it surprising:

In [8]: from numpy import ma
In [9]: a = ma.array([1,2,3,4],mask=[False,False,True,False],fill_value=0)
In [10]: a
Out[10]:
masked_array(data = [1 2 -- 4],
             mask = [False False True False],
             fill_value=0)

In [11]: a[2]
Out[11]: masked_array(data = --, mask = True, fill_value=1e+20)

In [12]: np.__version__
Out[12]: '1.1.0'

Is there a reason that the fill_value isn't inherited from the parent array? Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Masked array fill_value
Eric Firing wrote: Ryan May wrote: Hi, I just noticed this and found it surprising: In [8]: from numpy import ma In [9]: a = ma.array([1,2,3,4],mask=[False,False,True,False],fill_value=0) In [10]: a Out[10]: masked_array(data = [1 2 -- 4], mask = [False False True False], fill_value=0) In [11]: a[2] Out[11]: masked_array(data = --, mask = True, fill_value=1e+20) In [12]: np.__version__ Out[12]: '1.1.0' Is there a reason that the fill_value isn't inherited from the parent array? There was a thread about this a couple months ago, and Pierre GM explained it. I think the point was that indexing is giving you a new masked scalar, which is therefore taking the default mask value of the type. I don't see it as a problem; you can always specify the fill value explicitly when you need to. I thought it sounded familiar. You're right, it's not a big problem, it just seemed unintuitive. Thanks for the explaination. Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
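To make Eric's point concrete (same array as in the original message; the explicit fill argument is the workaround he mentions):

```python
import numpy.ma as ma

a = ma.array([1, 2, 3, 4], mask=[False, False, True, False], fill_value=0)

# The parent array keeps its fill_value...
print(a.fill_value)    # 0
print(a.filled())      # [1 2 0 4]

# ...and you can always pass the fill explicitly where it matters.
print(a.filled(-999))  # [   1    2 -999    4]
```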
[Numpy-discussion] numpy.loadtext() fails with dtype + usecols
Hi, I was trying to use loadtxt() today to read in some text data, and I had a problem when I specified a dtype that only contained as many elements as there are columns in usecols. The example below shows the problem:

import numpy as np
import StringIO
data = '''STID RELH TAIR
JOE 70.1 25.3
BOB 60.5 27.9
'''
f = StringIO.StringIO(data)
names = ['stid', 'temp']
dtypes = ['S4', 'f8']
arr = np.loadtxt(f, usecols=(0,2), dtype=zip(names,dtypes), skiprows=1)

With current 1.1 (and SVN head), this yields:

IndexError                                Traceback (most recent call last)
/home/rmay/<ipython console> in <module>()
/usr/lib64/python2.5/site-packages/numpy/lib/io.pyc in loadtxt(fname, dtype, comments, delimiter, converters, skiprows, usecols, unpack)
    309                                 for j in xrange(len(vals))]
    310             if usecols is not None:
--> 311                 row = [converterseq[j](vals[j]) for j in usecols]
    312             else:
    313                 row = [converterseq[j](val) for j,val in enumerate(vals)]

IndexError: list index out of range

I've added a patch that checks for usecols and, if present, correctly creates the converters dictionary to map each specified column to the converter for the corresponding field in the dtype. With the attached patch, this works fine:

arr
array([('JOE', 25.301), ('BOB', 27.899)],
      dtype=[('stid', '|S4'), ('temp', 'f8')])

Comments? Can I get this in for 1.1.1?
Thanks, Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma

--- io.py.bak	2008-07-18 18:12:17.0 -0400
+++ io.py	2008-07-16 22:49:13.0 -0400
@@ -292,8 +292,13 @@
     if converters is None:
         converters = {}
     if dtype.names is not None:
-        converterseq = [_getconv(dtype.fields[name][0]) \
-                        for name in dtype.names]
+        if usecols is None:
+            converterseq = [_getconv(dtype.fields[name][0]) \
+                            for name in dtype.names]
+        else:
+            converters.update([(col,_getconv(dtype.fields[name][0])) \
+                for col,name in zip(usecols, dtype.names)])
+
     for i,line in enumerate(fh):
         if i<skiprows: continue

___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] A correction to numpy trapz function
Nadav Horesh wrote: The function trapz accepts an x vector only for axis=-1. Here is my modification (correction?) to let it accept a vector x for integration along any axis:

def trapz(y, x=None, dx=1.0, axis=-1):
    """Integrate y(x) using samples along the given axis and the composite
    trapezoidal rule. If x is None, spacing given by dx is assumed. If x is
    an array, it must have either the dimensions of y, or be a vector of
    length matching the dimension of y along the integration axis.
    """
    y = asarray(y)
    nd = y.ndim
    slice1 = [slice(None)]*nd
    slice2 = [slice(None)]*nd
    slice1[axis] = slice(1, None)
    slice2[axis] = slice(None, -1)
    if x is None:
        d = dx
    else:
        x = asarray(x)
        if x.ndim == 1:
            if len(x) != y.shape[axis]:
                raise ValueError('x length (%d) does not match y axis %d length (%d)' %
                                 (len(x), axis, y.shape[axis]))
            d = diff(x)
            return tensordot(d, (y[slice1]+y[slice2])/2.0, (0, axis))
        d = diff(x, axis=axis)
    return add.reduce(d * (y[slice1]+y[slice2])/2.0, axis)

What version were you working with originally? With 1.1, this is what I have:

def trapz(y, x=None, dx=1.0, axis=-1):
    """Integrate y(x) using samples along the given axis and the composite
    trapezoidal rule. If x is None, spacing given by dx is assumed.
    """
    y = asarray(y)
    if x is None:
        d = dx
    else:
        d = diff(x, axis=axis)
    nd = len(y.shape)
    slice1 = [slice(None)]*nd
    slice2 = [slice(None)]*nd
    slice1[axis] = slice(1, None)
    slice2[axis] = slice(None, -1)
    return add.reduce(d * (y[slice1]+y[slice2])/2.0, axis)

For me, this works fine when supplying x for axis != -1. Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] A correction to numpy trapz function
Nadav Horesh wrote: Here is what I get with the original trapz function:

IDLE 1.2.2
>>> import numpy as np
>>> np.__version__
'1.1.0'
>>> y = np.arange(24).reshape(6,4)
>>> x = np.arange(6)
>>> np.trapz(y, x, axis=0)
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    np.trapz(y, x, axis=0)
  File "C:\Python25\Lib\site-packages\numpy\lib\function_base.py", line 1536, in trapz
    return add.reduce(d * (y[slice1]+y[slice2])/2.0,axis)
ValueError: shape mismatch: objects cannot be broadcast to a single shape

(Try not to top post on this list.) I can get it to work like this:

import numpy as np
y = np.arange(24).reshape(6,4)
x = np.arange(6).reshape(-1,1)
np.trapz(y, x, axis=0)

From the text of the error message, you can see this is a problem with broadcasting. Because of the broadcasting rules (which *prepend* dimensions with size 1), you need to manually add an extra dimension to the end of x. Once I reshape x, I can get this to work. You might want to look at this: http://www.scipy.org/EricsBroadcastingDoc

Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
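The shape bookkeeping behind Ryan's fix, spelled out (my own example; the getattr line papers over the trapz-to-trapezoid rename in NumPy 2.0):

```python
import numpy as np

y = np.arange(24).reshape(6, 4)

# diff of a 1-D x has shape (5,); diff of y along axis 0 has shape
# (5, 4). Broadcasting *prepends* size-1 axes, so (5,) aligns as
# (1, 5) against (5, 4) -- a mismatch. Reshaping x to a column,
# (6, 1), makes its diff (5, 1), which broadcasts cleanly.
x = np.arange(6).reshape(-1, 1)

# np.trapz was renamed np.trapezoid in NumPy 2.0.
trapz = getattr(np, 'trapezoid', None) or getattr(np, 'trapz')
result = trapz(y, x, axis=0)
print(result)   # [50. 55. 60. 65.]
```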
Re: [Numpy-discussion] Doctest items
Robert Kern wrote: On Tue, Jul 1, 2008 at 17:50, Fernando Perez [EMAIL PROTECTED] wrote: On Tue, Jul 1, 2008 at 1:41 PM, Pauli Virtanen [EMAIL PROTECTED] wrote: But it's a custom tweak to doctest, so it might break at some point in the future, and I don't love the monkeypatching here... Welcome to the joys of extending doctest/unittest. They hardcoded so much stuff in there that the only way to reuse that code is by copy/paste/monkeypatch. It's absolutely atrocious. We could always just make the plotting section one of those "it's just an example, not a doctest" things and remove the >>> (since it doesn't appear to provide any useful test coverage or anything). If possible, I'd like other possibilities to be considered first before jumping to this route. I think it would be nice to retain the ability to run the matplotlib examples as (optional) doctests too, to make sure they also execute correctly. Also, using two different markups in the documentation to work around a shortcoming of doctest is IMHO not very elegant. How about a much simpler approach? Just pre-populate the globals dict where doctest executes with an object called 'plt' that basically does

def noop(*a, **k):
    pass

class dummy():
    def __getattr__(self, k):
        return noop

plt = dummy()

This would ensure that all calls to plt.anything() silently succeed in the doctests. Granted, we're not testing matplotlib, but it has the benefit of simplicity and of letting us keep consistent formatting, and examples that *users* can still paste into their sessions, where plt refers to the real matplotlib. It's actually easier for users to paste the non-doctestable examples, since they don't have the >>> markers or any stdout the examples produce as a byproduct. I'm with Robert here. It's definitely easier as an example without the >>>. I also don't see the utility of being able to have the matplotlib code as tests of anything. We're not testing matplotlib here, and any behavior that matplotlib relies on (and hence tests) should be captured in a test for that behavior separate from matplotlib code. Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] Doctest items
Robert Kern wrote: On Tue, Jul 1, 2008 at 19:19, Ryan May [EMAIL PROTECTED] wrote: Robert Kern wrote: On Tue, Jul 1, 2008 at 17:50, Fernando Perez [EMAIL PROTECTED] wrote: On Tue, Jul 1, 2008 at 1:41 PM, Pauli Virtanen [EMAIL PROTECTED] wrote: But it's a custom tweak to doctest, so it might break at some point in the future, and I don't love the monkeypatching here... Welcome to the joys of extending doctest/unittest. They hardcoded so much stuff in there that the only way to reuse that code is by copy/paste/monkeypatch. It's absolutely atrocious. We could always just make the plotting section one of those "it's just an example, not a doctest" things and remove the >>> (since it doesn't appear to provide any useful test coverage or anything). If possible, I'd like other possibilities to be considered first before jumping to this route. I think it would be nice to retain the ability to run the matplotlib examples as (optional) doctests too, to make sure they also execute correctly. Also, using two different markups in the documentation to work around a shortcoming of doctest is IMHO not very elegant. How about a much simpler approach? Just pre-populate the globals dict where doctest executes with an object called 'plt' that basically does

def noop(*a, **k):
    pass

class dummy():
    def __getattr__(self, k):
        return noop

plt = dummy()

This would ensure that all calls to plt.anything() silently succeed in the doctests. Granted, we're not testing matplotlib, but it has the benefit of simplicity and of letting us keep consistent formatting, and examples that *users* can still paste into their sessions, where plt refers to the real matplotlib. It's actually easier for users to paste the non-doctestable examples, since they don't have the >>> markers or any stdout the examples produce as a byproduct. I'm with Robert here. It's definitely easier as an example without the >>>. I also don't see the utility of being able to have the matplotlib code as tests of anything. We're not testing matplotlib here, and any behavior that matplotlib relies on (and hence tests) should be captured in a test for that behavior separate from matplotlib code. To be clear, these aren't tests of the numpy code. The tests would be to make sure the examples still run. Right. I just don't think effort should be put into making examples using matplotlib run as doctests. If the behavior is important, numpy should have a standalone test for it. Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
Re: [Numpy-discussion] ndarray methods vs numpy module functions
Robert Kern wrote: On Mon, Jun 23, 2008 at 18:10, Sebastian Haase [EMAIL PROTECTED] wrote: On Mon, Jun 23, 2008 at 10:31 AM, Bob Dowling [EMAIL PROTECTED] wrote: [ I'm new here and this has the feel of an FAQ but I couldn't find anything at http://www.scipy.org/FAQ . If I should have looked somewhere else a URL would be gratefully received. ] What's the reasoning behind functions like sum() and cumsum() being provided both as module functions (numpy.sum(data, axis=1)) and as object methods (data.sum(axis=1)) but other functions - and I stumbled over diff() - only being provided as module functions? Hi Bob, this is a very good question. I think the answers are a) historical reasons AND, more importantly, differing personal preferences b) I would file the missing data.diff() as a bug. It's not. Care to elaborate? -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
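For context, the asymmetry Bob describes is easy to demonstrate (my own example; still true in current NumPy, where diff exists only as a module function):

```python
import numpy as np

data = np.array([1, 3, 6, 10])

# sum and cumsum exist both as methods and as module functions...
print(data.sum(), np.sum(data))         # 20 20
print(data.cumsum(), np.cumsum(data))   # [ 1  4 10 20] twice

# ...but diff is a module function only; ndarray has no .diff method.
print(np.diff(data))                    # [2 3 4]
print(hasattr(data, 'diff'))            # False
```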
Re: [Numpy-discussion] Bug in numpy all() function
Dan Goodman wrote: Hi all, I think this is a bug (I'm running Numpy 1.0.3.1):

from numpy import *
def f(x):
    return False
all(f(x) for x in range(10))
True

I guess the all function doesn't know about generators? That's likely the problem. However, as of Python 2.5 there's a built-in all() that does what you want -- though the `from numpy import *` above masks that builtin with numpy's version. Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
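A sketch of the distinction (my own example; materializing the generator into a list is the portable way to feed numpy's all()):

```python
import numpy as np

def f(x):
    return False

# The Python builtin consumes generators element by element.
print(all(f(x) for x in range(10)))          # False

# For numpy's all(), materialize the values first; a bare generator
# would get wrapped as a single object rather than iterated.
print(np.all([f(x) for x in range(10)]))     # False
```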
Re: [Numpy-discussion] New to ctypes. Some problems with loading shared library.
Lou Pecora wrote: I got ctypes installed and passing its own tests. But I cannot get the shared library to load. I am using Mac OS X 10.4.11, Python 2.4 running through the Terminal. I am using Albert Strasheim's example on http://scipy.org/Cookbook/Ctypes2 except that I had to remove the defined 'extern' for FOO_API since the gcc compiler complained about two 'externs' (I don't really understand what the extern does here anyway). My make file for generating the library is simple:

# Link ---
test1ctypes.so: test1ctypes.o test1ctypes.mak
	gcc -bundle -flat_namespace -undefined suppress -o test1ctypes.so test1ctypes.o

# gcc C compile --
test1ctypes.o: test1ctypes.c test1ctypes.h test1ctypes.mak
	gcc -c test1ctypes.c -o test1ctypes.o

This generates the file test1ctypes.so. But when I try to load it

import numpy as N
import ctypes as C
_test1 = N.ctypeslib.load_library('test1ctypes', '.')

I get the error message,

OSError: dlopen(/Users/loupecora/test1ctypes.dylib, 6): image not found

I've been googling for two hours trying to find the problem or other examples that would give me a clue, but no luck. Any ideas what I'm doing wrong? Thanks for any clues. Well, it's looking for test1ctypes.dylib, which I guess is a Mac OS X shared library? Meanwhile, you made a test1ctypes.so, which is why it can't find it. You could try using this instead:

_test1 = N.ctypeslib.load_library('test1ctypes.so', '.')

or try to get gcc to make a test1ctypes.dylib. Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
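As a general illustration of the extension/name-lookup issue (using libm as a stand-in library; ctypes.util.find_library does the platform-specific extension handling, and the 'libm.so.6' fallback is an assumption for Linux):

```python
import ctypes
import ctypes.util

# Locate the C math library by base name rather than hard-coding
# a platform-specific extension like .so or .dylib.
path = ctypes.util.find_library('m') or 'libm.so.6'
libm = ctypes.CDLL(path)

# Declare the prototype so ctypes converts floats correctly.
libm.cos.restype = ctypes.c_double
libm.cos.argtypes = [ctypes.c_double]
print(libm.cos(0.0))   # 1.0
```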
Re: [Numpy-discussion] Nasty bug using pre-initialized arrays
Charles R Harris wrote: On Jan 7, 2008 8:47 AM, Ryan May [EMAIL PROTECTED] mailto:[EMAIL PROTECTED] wrote: Stuart Brorson wrote: I realize NumPy != Matlab, but I'd wager that most users would think that this is the natural behavior.. Well, that behavior won't happen. We won't mutate the dtype of the array because of assignment. Matlab has copy(-on-write) semantics for things like slices while we have view semantics. We can't safely do the reallocation of memory [1]. That's fair enough. But then I think NumPy should consistently typecheck all assignmetns and throw an exception if the user attempts an assignment which looses information. Yeah, there's no doubt in my mind that this is a bug, if for no other reason than this inconsistency: One place where Numpy differs from MatLab is the way memory is handled. MatLab is always generating new arrays, so for efficiency it is worth preallocating arrays and then filling in the parts. This is not the case in Numpy where lists can be used for things that grow and subarrays are views. Consequently, preallocating arrays in Numpy should be rare and used when either the values have to be generated explicitly, which is what you see when using the indexes in your first example. As to assignment between arrays, it is a mixed question. The problem again is memory usage. For large arrays, it makes since to do automatic conversions, as is also the case in functions taking output arrays, because the typecast can be pushed down into C where it is time and space efficient, whereas explicitly converting the array uses up temporary space. However, I can imagine an explicit typecast function, something like a[...] = typecast(b) that would replace the current behavior. I think the typecast function could be implemented by returning a view of b with a castable flag set to true, that should supply enough information for the assignment operator to do its job. This might be a good addition for Numpy 1.1. 
While that seems like an ok idea, I'm still not sure what's wrong with raising an exception when there will be information loss. The exception is already raised with standard python complex objects. I can think of many times in my code where explicit looping is a necessity, so pre-allocating the array is the only way to go. -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
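The contrast Ryan describes -- silent truncation for floats, an exception for complex -- can be seen directly (my own minimal example):

```python
import numpy as np

a = np.zeros(3, dtype=np.int32)

# Assignment into a preallocated array casts to the array's dtype:
# the fractional part is silently discarded, no error or warning.
a[0] = 3.7
print(a[0])   # 3

# Assigning a complex value, by contrast, does raise instead of
# silently dropping the imaginary part.
try:
    a[1] = 1 + 2j
except TypeError:
    print('TypeError: complex cannot be silently truncated')
```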
Re: [Numpy-discussion] weibull distribution has only one parameter?
D.Hendriks (Dennis) wrote: Alan G Isaac wrote: On Mon, 12 Nov 2007, D.Hendriks (Dennis) apparently wrote: All of this makes me doubt the correctness of the formula you proposed. It is always a good idea to hesitate before doubting Robert. URL:http://en.wikipedia.org/wiki/Weibull_distribution#Generating_Weibull-distributed_random_variates hth, Alan Isaac So, you are saying that it was indeed correct? That still leaves the question why I can't seem to confirm that in the figure I mentioned (red and green lines). Also, if you refer to X = lambda*(-ln(U))^(1/k) as 'proof' for the validity of the formula, I have to ask if Weibull(a,Size) does actually correspond to (-ln(U))^(1/a)? Have you actually looked at a histogram of the random variates generated this way to see if they are wrong? Multiplying the the individual random values by a number changes the distribution differently than multiplying the distribution/density function by a number. Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
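To make the variates-versus-density point concrete, here is a quick numerical check (parameter values and sample size are my own; numpy draws a scale-1 Weibull, so the scale parameter enters by multiplying the variates):

```python
import math
import numpy as np

k, lam = 2.0, 3.0
rng = np.random.default_rng(0)

# weibull(k) draws from the standard (scale-1) Weibull; multiplying
# the *variates* by lam gives a Weibull with scale parameter lam.
samples = lam * rng.weibull(k, size=200_000)

# The sample mean should approach lam * Gamma(1 + 1/k).
theory = lam * math.gamma(1 + 1 / k)
print(samples.mean(), theory)
```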
Re: [Numpy-discussion] Convert array type
Gary Ruben wrote: Try using astype. This works: values = array(wavearray.split()).astype(float) Why not use numpy.fromstring? fromstring(string, dtype=float, count=-1, sep='') Return a new 1d array initialized from the raw binary data in string. If count is positive, the new array will have count elements, otherwise its size is determined by the size of string. If sep is not empty then the string is interpreted in ASCII mode and converted to the desired number type using sep as the separator between elements (extra whitespace is ignored). Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
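Two equivalent routes, for the record (the sample string is made up; note that np.fromstring's text mode is deprecated in current NumPy, so np.fromiter is the closer modern equivalent):

```python
import numpy as np

wavearray = "1.5 2.5 3.5"

# Route 1 (from the thread): split, then cast.
values1 = np.array(wavearray.split()).astype(float)

# Route 2: fromiter skips the intermediate string array.
values2 = np.fromiter((float(v) for v in wavearray.split()), dtype=float)
print(values1, values2)   # [1.5 2.5 3.5] [1.5 2.5 3.5]
```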
Re: [Numpy-discussion] Accessing a numpy array in a mmap fashion
Brian Donovan wrote: Hello all, I'm wondering if there is a way to use a numpy array that uses disk as a memory store rather than ram. I'm looking for something like mmap but which can be used like a numpy array. The general idea is this. I'm simulating a system which produces a large dataset over a few hours of processing time. Rather than store the numpy array in memory during processing I'd like to write the data directly to disk but still be able to treat the array as a numpy array. Is this possible? Any ideas? What you're looking for is numpy.memmap, though the documentation is eluding me at the moment. Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
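numpy.memmap is the standard answer here; a small self-contained sketch (the temporary file path, shape, and dtype are my own choices):

```python
import tempfile
import os
import numpy as np

# Back a float64 array of 1000 elements with a file on disk.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, 'sim.dat')
mm = np.memmap(path, dtype='float64', mode='w+', shape=(1000,))

# Writes behave like normal array assignment but land in the
# file's pages rather than a plain in-memory buffer.
mm[:100] = np.arange(100)
mm.flush()              # force dirty pages out to disk

total = mm[:100].sum()
print(total)            # 4950.0
```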
[Numpy-discussion] 16bit Integer Array/Scalar Inconsistency
Hi, I ran into this while debugging a script today: In [1]: import numpy as N In [2]: N.__version__ Out[2]: '1.0.3' In [3]: d = N.array([32767], dtype=N.int16) In [4]: d + 32767 Out[4]: array([-2], dtype=int16) In [5]: d[0] + 32767 Out[5]: 65534 In [6]: type(d[0] + 32767) Out[6]: <type 'numpy.int64'> In [7]: type(d[0]) Out[7]: <type 'numpy.int16'> It seems that numpy will automatically promote the scalar to avoid overflow, but not in the array case. Is this inconsistency a bug, or just a (known) gotcha? I myself don't have any problem with the array not being promoted automatically, but the inconsistency with the scalar operation made debugging my problem more difficult. Ryan
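The array half of the inconsistency is easy to reproduce today; a short sketch (note that the scalar behavior shown in the session above was specific to the old value-based promotion rules, and NEP 50 in NumPy >= 2.0 later made the scalar case wrap like the array case, removing the inconsistency in the other direction):

```python
import numpy as np

d = np.array([32767], dtype=np.int16)

# array + Python int: the int is treated as int16, so the sum
# overflows and wraps around silently instead of promoting
arr_result = d + 32767
print(arr_result, arr_result.dtype)

# Under the NumPy 1.0.x rules described in the email, d[0] + 32767
# instead promoted to int64 and gave 65534; that divergence between
# the scalar and array paths is exactly the reported gotcha.
```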
[Numpy-discussion] scipy.io.loadmat incompatible with Numpy 1.0.2
Hi, As far as I can tell, the new Numpy 1.0.2 broke scipy.io.loadmat. Here's what I get when I try to open a file using loadmat with numpy 1.0.2 (on gentoo AMD64):

In [2]: loadmat('tep_iqdata.mat')
---------------------------------------------------------------------------
exceptions.AttributeError              Traceback (most recent call last)

/usr/lib64/python2.4/site-packages/scipy/io/mio.py in loadmat(file_name, mdict, appendmat, basename, **kwargs)
     94     '''
     95     MR = mat_reader_factory(file_name, appendmat, **kwargs)
---> 96     matfile_dict = MR.get_variables()
     97     if mdict is not None:
     98         mdict.update(matfile_dict)

/usr/lib64/python2.4/site-packages/scipy/io/miobase.py in get_variables(self, variable_names)
    267             variable_names = [variable_names]
    268         self.mat_stream.seek(0)
---> 269         mdict = self.file_header()
    270         mdict['__globals__'] = []
    271         while not self.end_of_stream():

/usr/lib64/python2.4/site-packages/scipy/io/mio5.py in file_header(self)
    508         hdict = {}
    509         hdr = self.read_dtype(self.dtypes['file_header'])
---> 510         hdict['__header__'] = hdr['description'].strip(' \t\n\000')
    511         v_major = hdr['version'] >> 8
    512         v_minor = hdr['version'] & 0xFF

AttributeError: 'numpy.ndarray' object has no attribute 'strip'

Reverting to numpy 1.0.1 works fine for the same code. So the question is: does scipy need an update, or did something unintended creep into Numpy 1.0.2? (Hence the cross-post) Ryan
Re: [Numpy-discussion] scipy.io.loadmat incompatible with Numpy 1.0.2
Travis Oliphant wrote: Ryan May wrote: Hi, As far as I can tell, the new Numpy 1.0.2 broke scipy.io.loadmat. Yes, it was the one place that scipy used the fact that field selection of a 0-d array returned a scalar. This has been changed in NumPy 1.0.2 to return a 0-d array. The fix is in SciPy SVN. Just get the mio.py file from SVN and drop it in to your distribution and things should work fine. Or wait until a SciPy release is made. -Travis It worked if I also got the new mio5.py (rev. 2893). Ryan
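For anyone hitting this with an unpatched scipy: the underlying change is that selecting a field from a 0-d structured array now returns a 0-d array rather than a scalar, so string methods have to be called on the extracted item. A sketch with a made-up header dtype modeled loosely on the traceback above (field names and sizes are illustrative, not the real .mat header layout):

```python
import numpy as np

# a 0-d structured array standing in for the file header record
hdr = np.zeros((), dtype=[('description', 'S10'), ('version', np.int16)])
hdr['description'] = b'MATLAB 5.0'

# field selection on a 0-d structured array yields a 0-d ndarray,
# which has no .strip() method -- this is what broke mio5.py
field = hdr['description']
print(type(field), field.ndim)

# extract the scalar first, then call string methods on it
text = hdr['description'].item()
print(text.strip(b' \t\n\x00'))
```

This is essentially the pattern the SVN fix applied: pull the scalar out of the 0-d result before treating it as a string.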