Re: [Numpy-discussion] add xirr to numpy financial functions?
On May 25, 2009, at 8:06 PM, josef.p...@gmail.com wrote: The problem is, if the functions are enhanced in the current numpy, then scikits.timeseries is not (yet) available. Mmh, I'm not following you here... The original question was how we can enhance numpy.financial, e.g. np.irr. So we are restricted to use only what is available in numpy and in standard Python. Ah OK. But it seems that you're now running into a pb w/ dates handling, which might be a bit too specialized for numpy. Anyway, the call isn't mine. I looked at your moving functions, autocorrelation function and so on a while ago. That's where I learned how to use np.correlate or the scipy versions of it, and the filter functions. I've written the standard array versions for the moving functions and acf, ccf, in one of my experiments. The moving functions were written in C and they work even w/ timeseries (they work quite OK w/ pure MaskedArrays). We put them in scikits.timeseries because it was easier to have them there than in scipy, for example. If Skipper has enough time in his Google Summer of Code, we would like to include some basic timeseries econometrics (ARMA, VAR, ...?), however most likely only for regularly spaced data. Well, we can easily restrict the functions to the case where there's no missing data nor missing dates. Checking the mask is easy, and we have a method to check the dates (is_valid). Anyhow, if the pb you have is just to specify dates, I really think you should give the scikits a try. And send feedback, of course... Skipper intends to write some examples to show how to work with the extensions to scipy.stats, which, I think, will include examples using time series, besides recarrays and other array types. 
Dealing with TimeSeries is pretty much the same thing as dealing with MaskedArray, with the extra convenience of converting from one frequency to another and so forth. Quite often, an analysis can be performed by dropping the .dates part, working on the .series part (the underlying MaskedArray), and repatching the dates at the end... Is there a timeline for including the timeseries scikits in numpy/scipy? With code that is intended for incorporation in numpy/scipy, we are restricted in our external dependencies. I can't tell, because the decision is not mine. From what I understood, there could be an inclusion in scipy if there's a need for it. For that, we need more users and more feedback, if you catch my drift... Josef ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
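The thread's starting point — an xirr-style function (internal rate of return for cashflows at irregular dates) built from numpy plus the standard library only, per the constraint discussed above — could be sketched roughly as follows. The `xirr` name, the 365-day year-fraction convention, and the Newton iteration are all illustrative choices here, not an actual numpy.financial API:

```python
import datetime
import numpy as np

def xirr(cashflows, dates, guess=0.1, tol=1e-9, maxiter=100):
    """IRR for cashflows at arbitrary dates (a sketch, numpy-only)."""
    t0 = dates[0]
    # year fractions relative to the first date (simple 365-day convention)
    t = np.array([(d - t0).days / 365.0 for d in dates])
    cf = np.asarray(cashflows, dtype=float)
    rate = guess
    for _ in range(maxiter):
        f = np.sum(cf / (1.0 + rate) ** t)              # NPV at current rate
        df = np.sum(-t * cf / (1.0 + rate) ** (t + 1))  # d(NPV)/d(rate)
        step = f / df
        rate -= step
        if abs(step) < tol:
            break
    return rate
```

A single -100/+110 cashflow pair one year apart should recover a rate near 10%.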
Re: [Numpy-discussion] List/location of consecutive integers
On May 22, 2009, at 12:31 PM, Andrea Gavana wrote: Hi All, this should be a very easy question but I am trying to make a script run as fast as possible, so please bear with me if the solution is easy and I just overlooked it. I have a list of integers, like this one: indices = [1,2,3,4,5,6,7,8,9,255,256,257,258,10001,10002,10003,10004] From this list, I would like to find out which values are consecutive and store them in another list of tuples (begin_consecutive, end_consecutive) or a simple list: as an example, the previous list will become: new_list = [(1, 9), (255, 258), (10001, 10004)]

Josef's and Chris's solutions are pretty neat in this case. I've been recently working on a more generic case where integers are grouped depending on some condition (equality, differing by 1 or 2...). A version in pure Python/numpy, the `Cluster` class, is available in scikits.hydroclimpy.core.tools (hydroclimpy.sourceforge.net). Otherwise, here's a Cython version of the same class. Let me know if it works. And I'm not ultra happy with the name, so if you have any suggestions...

cdef class Brackets:
    """
    Groups consecutive data from an array according to a clustering condition.

    A cluster is defined as a group of consecutive values differing by at most
    the increment value. Missing values are **not** handled: the input sequence
    must therefore be free of missing values.

    Parameters
    ----------
    darray : ndarray
        Input data array to clusterize.
    increment : {float}, optional
        Increment between two consecutive values to group.
        By default, use a value of 1.
    operator : {function}, optional
        Comparison operator for the definition of clusters.
        By default, use :func:`numpy.less_equal`.

    Attributes
    ----------
    inishape
        Shape of the argument array (stored for resizing).
    inisize
        Size of the argument array.
    uniques : sequence
        List of unique cluster values, as they appear in chronological order.
    slices : sequence
        List of the slices corresponding to each cluster of data.
    starts : ndarray
        List of the indices at which the clusters start.
    ends : ndarray
        List of the indices at which the clusters end.
    clustered : list
        List of clustered data.

    Examples
    --------
    >>> A = [0, 0, 1, 2, 2, 2, 3, 4, 3, 4, 4, 4]
    >>> klust = Brackets(A, 0)
    >>> [list(_) for _ in klust.clustered]
    [[0, 0], [1], [2, 2, 2], [3], [4], [3], [4, 4, 4]]
    >>> klust.uniques
    array([0, 1, 2, 3, 4, 3, 4])
    >>> x = [ 1.8, 1.3, 2.4, 1.2, 2.5, 3.9, 1. , 3.8, 4.2, 3.3,
    ...       1.2, 0.2, 0.9, 2.7, 2.4, 2.8, 2.7, 4.7, 4.2, 0.4]
    >>> Brackets(x, 1).starts
    array([ 0,  2,  3,  4,  5,  6,  7, 10, 11, 13, 17, 19])
    >>> Brackets(x, 1.5).starts
    array([ 0,  6,  7, 10, 13, 17, 19])
    >>> Brackets(x, 2.5).starts
    array([ 0,  6,  7, 19])
    >>> Brackets(x, 2.5, greater).starts
    array([ 0,  1,  2,  3,  4,  5,  8,  9, 10,
           11, 12, 13, 14, 15, 16, 17, 18])
    >>> y = [ 0, -1, 0, 0, 0, 1, 1, -1, -1, -1, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0]
    >>> Brackets(y, 1).starts
    array([ 0,  1,  2,  5,  7, 10, 12, 16, 18])
    """
    cdef readonly double increment
    cdef readonly np.ndarray data
    cdef readonly list _starts
    cdef readonly list _ends

    def __init__(Brackets self, object data, double increment=1,
                 object operator=np.less_equal):
        cdef int i, n, ifirst, ilast, test
        cdef double last
        cdef list starts, ends
        #
        self.increment = increment
        self.data = np.asanyarray(data)
        data = np.asarray(data)
        #
        n = len(data)
        starts = []
        ends = []
        #
        last = data[0]
        ifirst = 0
        ilast = 0
        for i from 1 <= i < n:
            test = operator(abs(data[i] - last), increment)
            ilast = i
            if not test:
                starts.append(ifirst)
                ends.append(ilast - 1)
                ifirst = i
            last = data[i]
        starts.append(ifirst)
        ends.append(n - 1)
        self._starts = starts
        self._ends = ends

    def __len__(self):
        return len(self.starts)

    property starts:
        #
        def __get__(Brackets self):
            return np.asarray(self._starts)

    property ends:
        #
        def __get__(Brackets self):
            return np.asarray(self._ends)

    property sizes:
        #
        def __get__(Brackets self):
            return np.asarray(self._ends) - np.asarray(self._starts)

    property slices:
        #
        def __get__(Brackets self):
            cdef int i
            cdef list starts = self._starts, ends = self._ends
            cdef list slices = []
            for i from 0 <= i < len(starts):
                slices.append(slice(starts[i], ends[i] + 1))
            return slices
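For Andrea's original consecutive-integers question (without the generic clustering machinery), a plain-numpy sketch using np.diff to locate run boundaries might look like this:

```python
import numpy as np

indices = [1,2,3,4,5,6,7,8,9,255,256,257,258,10001,10002,10003,10004]
a = np.asarray(indices)
# positions where the gap between neighbours exceeds 1 mark the
# boundary between two runs of consecutive integers
breaks = np.nonzero(np.diff(a) != 1)[0]
starts = np.concatenate(([0], breaks + 1))
ends = np.concatenate((breaks, [len(a) - 1]))
runs = list(zip(a[starts], a[ends]))
# runs == [(1, 9), (255, 258), (10001, 10004)]
```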
Re: [Numpy-discussion] skiprows option in loadtxt
On May 20, 2009, at 11:04 AM, Nils Wagner wrote: Hi all, Is the value of skiprows in loadtxt restricted to values in [0-10]? It doesn't work for skiprows=11. Please post an example.
Re: [Numpy-discussion] view takes no keyword arguments exception
On May 20, 2009, at 9:57 PM, Jochen Schroeder wrote: Sorry, maybe I phrased my question wrongly. I don't want to change the code (this was just a short example). I just want to know why it is failing on his system and what he can do so that a.view(dtype='...') is working. I suspected it was an old numpy installation but the person is saying that he installed a new version and is still seeing the same problem (or does he just have an old version of numpy floating around?). Likely to be the second possibility, the ghost of a previous installation. AFAIR, the keywords in .view were introduced in 1.2 or just after. A safe way to check would be to install numpy 1.3 in a virtualenv and check that it works. If it does (expected), then you may want to ask your user to start afresh (remove 1.1.1 and 1.3 and then reinstall 1.3 from a clean slate). My 2c. P.
Re: [Numpy-discussion] Are masked arrays slower for processing than ndarrays?
All, I just committed (r6994) some modifications to numpy.ma.getdata (Eric Firing's patch) and to the ufunc wrappers that were too slow with large arrays. We're roughly 3 times faster than we used to be, but still slower than the equivalent classic ufuncs (no surprise here). Here's the catch: it's basically cheating. I got rid of the pre-processing (where a mask was calculated depending on the domain and the input set to a filling value depending on this mask, before the actual computation). Instead, I force np.seterr(divide='ignore', invalid='ignore') before calling the ufunc on the .data part, then mask the invalid values (if any) and reset the corresponding entries in .data to the input. Finally, I reset the error status. All in all, we're still data-friendly, meaning that the value below a masked entry is the same as the input, but we can't say that values initially masked are discarded (they're used in the computation but reset to their initial value)... This playing around with the error status may (or may not, I don't know) cause some problems down the road. It's still faaar faster than computing the domain (especially _DomainSafeDivide) when the inputs are large... I'd be happy if you could give it a try and send some feedback. Cheers P.

On May 9, 2009, at 8:17 PM, Eric Firing wrote: Eric Firing wrote: Pierre, ... I pressed send too soon. There are test failures with the patch I attached to my last message. I think the basic ideas are correct, but evidently there are wrinkles to be worked out. Maybe putmask() has to be used instead of where() (putmask is much faster) to maintain the ability to do *= and similar, and maybe there are other adjustments. Somehow, though, it should be possible to get decent speed for simple multiplication and division; a 10x penalty relative to ndarray operations is just too much. Eric

Eli Bressert wrote: Hi, I'm using masked arrays to compute large-scale standard deviation, multiplication, gaussian, and weighted averages. 
At first I thought using the masked arrays would be a great way to sidestep looping (which it is), but it's still slower than expected. Here's a snippet of the code that I'm using it for. [...] # Like the spatial_weight section, this takes about 20 seconds W = spatial_weight / Rho2 # Takes less than one second. Ave = np.average(av_good,axis=1,weights=W) Any ideas on why it would take such a long time for processing?

A part of the slowdown is what looks to me like unnecessary copying in _MaskedBinaryOperation.__call__. It is using getdata, which applies numpy.array to its input, forcing a copy. I think the copy is actually unintentional, in at least one sense, and possibly two: first, because the default argument of getattr is always evaluated, even if it is not needed; and second, because the call to np.array is used where np.asarray or equivalent would suffice. The first file attached below shows the kernprof in the case of multiplying two masked arrays, shape (10,50), with no masked elements; 2/3 of the time is taken copying the data. Now, if there are actually masked elements in the arrays, it gets much worse: see the second attachment. The total time has increased by more than a factor of 3, and the culprit is numpy.where(), a very slow function. It looks to me like it is doing nothing useful at all; the numpy binary operation is still being executed for all elements, regardless of mask, contrary to the intention implied by the comment in the code. The third attached file has a patch that fixes the getdata problem and eliminates the where(). With this patch applied we get the profile in the 4th file, to be compared to the second profile. Much better. I am pretty sure it could still be sped up quite a bit, though. It looks like the masks are essentially being calculated twice for no good reason, but I don't completely understand all the mask considerations, so at this point I am not trying to fix that problem. Eric

Especially the spatial_weight and W variables? 
Would there be a faster way to do this? Or is there a way that numpy.std can ignore nan's when processing? Thanks, Eli Bressert
Re: [Numpy-discussion] Are masked arrays slower for processing than ndarrays?
On May 13, 2009, at 7:36 PM, Matt Knox wrote: Here's the catch: it's basically cheating. I got rid of the pre-processing (where a mask was calculated depending on the domain and the input set to a filling value depending on this mask, before the actual computation). Instead, I force np.seterr(divide='ignore', invalid='ignore') before calling the ufunc. This isn't a thread-safe approach and could cause weird side effects in a multi-threaded application. I think modifying global options/variables inside any function where it generally wouldn't be expected by the user is a bad idea. Whine. I was afraid of something like that... 2 options, then:
* We revert to computing a mask beforehand. That looks like the part that takes the most time w/ domained operations (according to Robert K's profiler. Robert, you deserve a statue for this tool). And that doesn't solve the pb of power, anyway: how do you compute the domain of power?
* We reimplement masked versions of the ufuncs in C. Won't happen from me anytime soon (this fall or winter, maybe...).
Also, importing numpy.ma currently calls numpy.seterr(all='ignore') anyway... So that's a -1 from Matt. Anybody else?
Re: [Numpy-discussion] Are masked arrays slower for processing than ndarrays?
On May 13, 2009, at 8:07 PM, Matt Knox wrote: hmm. While this doesn't affect me personally... I wonder if everyone is aware of this. Importing modules generally shouldn't have side effects either, I would think. Has this always been the case for the masked array module? Well, can't remember, actually... I was indeed surprised to see it was there. I guess I must have added it when working on the power section. I will get rid of it in the next commit; this is clearly bad practice on my part. Bad, bad Pierre.
Re: [Numpy-discussion] How to merge or SQL join record arrays in Python?
On May 11, 2009, at 5:44 PM, Wei Su wrote: Coming from SAS and R, this is probably the first thing I want to do now that I can convert my data into record arrays. But I could not find any clues after googling for a while. Any hint or suggestions will be great! That depends what you want, actually, but this should get you started: http://docs.scipy.org/doc/numpy/user/basics.rec.html Note the slight difference between a structured array (fields accessible as items) and a recarray (fields accessible as items and attributes).
Re: [Numpy-discussion] How to merge or SQL join record arrays in Python?
On May 11, 2009, at 6:18 PM, Wei Su wrote: Thanks for the reply. I can now actually turn a big list into a record array. My question is actually how to join related record arrays in Python. This is done in SAS by MERGE and PROC SQL, and by merge() in R. But I have no idea how to do it in Python. OK. Try numpy.lib.recfunctions.join_by, and let me know if you have any problem. It's a rewritten version of an equivalent function in matplotlib (matplotlib.mlab.rec_join), that should work (maybe not, there hasn't been enough testing feedback to judge...).
Re: [Numpy-discussion] How to merge or SQL join record arrays in Python?
On May 11, 2009, at 6:36 PM, Skipper Seabold wrote: On Mon, May 11, 2009 at 6:18 PM, Wei Su taste_o...@yahoo.com wrote: Hi, Pierre: Thanks for the reply. I can now actually turn a big list into a record array. My question is actually how to join related record arrays in Python. This is done in SAS by MERGE and PROC SQL, and by merge() in R. But I have no idea how to do it in Python. Thanks. Wei Su Does merge_arrays in numpy.lib.recfunctions do what you want? Probably not. merge_arrays is close to concatenate, and will raise an exception if 2 fields have the same name (in the flattened version). Testing against R's merge(), join_by looks like the corresponding function.
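A minimal join_by sketch, assuming two structured arrays that share a 'key' field (the field names and sample data here are made up for illustration):

```python
import numpy as np
from numpy.lib import recfunctions as rfn

left = np.array([(1, 10.0), (2, 20.0), (3, 30.0)],
                dtype=[('key', int), ('x', float)])
right = np.array([(1, 100.0), (3, 300.0)],
                 dtype=[('key', int), ('y', float)])

# inner join on the common 'key' field, analogous to R's merge()
# or a SQL inner join: only keys present in both arrays survive
joined = rfn.join_by('key', left, right, jointype='inner')
```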
Re: [Numpy-discussion] Are masked arrays slower for processing than ndarrays?
Short answer to the subject: Oh yes. Basically, MaskedArray in its current implementation is more of a convenience class than anything. Most of the functions manipulating masked arrays create a lot of temporaries. When performance is needed, I must advise you to work directly on the data and the mask. For example, let's examine the division of 2 MaskedArrays a and b:
* We take the 2 ndarrays of data (da and db) and the 2 ndarrays of mask (ma and mb).
* We create a new array for db using np.where, putting 1 where db==0 and keeping db otherwise (if we were not doing that, we would get some NaNs down the road).
* We create a new mask m by combining ma and mb.
* We create the result array using np.where, using da where m is True, da/db otherwise (if we were not doing that, we would be processing the masked data and we may not want that).
* Then, we add the mask to the result array.
I suspect that the np.where functions are sub-optimal, and there might be a smarter way to achieve the same result while keeping all the functionalities (no NaNs (even masked) in the result, data kept where it should be). I agree that these functionalities might be a bit overkill in simpler cases, such as yours. You may then want to use something like ma.masked_array(a.data/b.data, mask=(a.mask | b.mask | (b.data==0))). Using Eric's example, I have 229ms/loop when dividing 2 ndarrays, 2.83s/loop when dividing 2 masked arrays, and down to 493ms/loop when using the quick-and-dirty function above. So anyway, you'll still be slower using MA than ndarrays, but not as slow...

On May 9, 2009, at 5:22 PM, Eli Bressert wrote: Hi, I'm using masked arrays to compute large-scale standard deviation, multiplication, gaussian, and weighted averages. At first I thought using the masked arrays would be a great way to sidestep looping (which it is), but it's still slower than expected. Here's a snippet of the code that I'm using it for. # Computing nearest neighbor distances. 
# Output will be about 270,000 rows long for the index
# and 270,000x50 for the dist array.
tree = ann.kd_tree(np.column_stack([l,b]))
index, dist = tree.search(np.column_stack([l,b]), k=nth)

# Clipping bad values by replacing them with acceptable values
av[np.where(av <= -10)] = -10
av[np.where(av >= 50)] = 50

# Distance clipping and creating mask
dist_arcsec = np.sqrt(dist)*3600
mask = dist_arcsec >= d_thresh

# Creating masked array
av_good = ma.array(av[index], mask=mask)
dist_good = ma.array(dist_arcsec, mask=mask)

# Reason why I'm using masked arrays. If these were
# ndarrays with nan's, then the output would be nan.
Std = np.array(np.std(av_good, axis=1))
Var = Std*Std
Rho = np.zeros((len(av), nth))
Rho2 = np.zeros((len(av), nth))
dist_std = np.std(dist_good, axis=1)
for j in range(nth):
    Rho[:,j] = dist_std
    Rho2[:,j] = Var

# This part takes about 20 seconds to compute for a 270,000x50 masked array.
# Using ndarrays of the same size takes about 2 seconds
spatial_weight = 1.0 / (Rho*np.sqrt(2*np.pi)) * np.exp(-dist_good / (2*Rho**2))

# Like the spatial_weight section, this takes about 20 seconds
W = spatial_weight / Rho2

# Takes less than one second.
Ave = np.average(av_good, axis=1, weights=W)

Any ideas on why it would take such a long time for processing? Especially the spatial_weight and W variables? Would there be a faster way to do this? Or is there a way that numpy.std can ignore nan's when processing? Thanks, Eli Bressert
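The quick-and-dirty division Pierre describes above (operating on .data directly and building the output mask by hand) can be sketched like this; np.errstate is used here instead of a raw seterr call so the error-state change stays scoped:

```python
import numpy as np
import numpy.ma as ma

a = ma.array([1.0, 2.0, 4.0], mask=[False, True, False])
b = ma.array([2.0, 1.0, 0.0], mask=[False, False, False])

# divide the raw .data, then mask anything that was masked in either
# input or that would divide by zero
with np.errstate(divide='ignore', invalid='ignore'):
    quick = ma.masked_array(a.data / b.data,
                            mask=(a.mask | b.mask | (b.data == 0)))
```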
Re: [Numpy-discussion] Are masked arrays slower for processing than ndarrays?
On May 9, 2009, at 8:17 PM, Eric Firing wrote: Eric Firing wrote: A part of the slowdown is what looks to me like unnecessary copying in _MaskedBinaryOperation.__call__. It is using getdata, which applies numpy.array to its input, forcing a copy. I think the copy is actually unintentional, in at least one sense, and possibly two: first, because the default argument of getattr is always evaluated, even if it is not needed; and second, because the call to np.array is used where np.asarray or equivalent would suffice. Yep, good call. The try/except should be better, and yes, I forgot to force copy=False (thought it was on by default...). I didn't know that getattr always evaluated the default; the docs are scarce on that subject... Pierre, ... I pressed send too soon. There are test failures with the patch I attached to my last message. I think the basic ideas are correct, but evidently there are wrinkles to be worked out. Maybe putmask() has to be used instead of where() (putmask is much faster) to maintain the ability to do *= and similar, and maybe there are other adjustments. Somehow, though, it should be possible to get decent speed for simple multiplication and division; a 10x penalty relative to ndarray operations is just too much. Quite agreed. It was a shock to realize that we were that slow. I'm gonna have to start testing w/ large arrays... I'm confident we can significantly speed up the _MaskedOperations without losing any of the features. Yes, putmask may be a better option. We could probably use the following MO:
* result = a.data/b.data
* putmask(result, m, a)
However, I'm gonna need a good couple of weeks before being able to really look into it...
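The putmask MO sketched above — compute on the raw .data, then write the original values back under the mask — might look like this in plain numpy (a sketch of the idea, not the actual numpy.ma internals):

```python
import numpy as np
import numpy.ma as ma

a = ma.array([2.0, 4.0, 6.0], mask=[False, True, False])
b = ma.array([2.0, 2.0, 0.0], mask=[False, False, False])

m = a.mask | b.mask | (b.data == 0)   # combined output mask
with np.errstate(divide='ignore', invalid='ignore'):
    result = a.data / b.data
# putmask writes a's original values back into the masked slots in
# place (no temporary like np.where would create), so the data under
# the mask stays equal to the input
np.putmask(result, m, a.data)
out = ma.masked_array(result, mask=m)
```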
Re: [Numpy-discussion] How to download data directly from SQL into NumPy as a record array or structured array.
On May 5, 2009, at 2:42 PM, Wei Su wrote: Hi, Everyone: This is what I need to do every day. Now I have to first save data as a .csv file and then use csv2rec() to read the data as a record array. Anybody can give me some advice on how to directly get the data as record arrays? It will save me tons of time. Wei, have a look at numpy.lib.io.genfromtxt, that should give you some ideas.
Re: [Numpy-discussion] after building from source, how to register numpy with synaptic?
On Apr 25, 2009, at 5:36 AM, Gael Varoquaux wrote: On Fri, Apr 24, 2009 at 10:11:07PM -0400, Chris Colbert wrote: Like the subject says, is there a way to register numpy with synaptic after building numpy from source? Don't play with the system's packaging system unless you really know what you are doing. Just install the numpy you are building outside of /usr/lib/... (you should never be installing home-built stuff in there). One link: http://www.doughellmann.com/projects/virtualenvwrapper/ I became a fan of virtualenvs, which let you install different packages (not always compatible) without messing up the system's Python. Quite useful for tests and/or having multiple numpy versions in parallel. For instance, install it in /usr/local: sudo python setup.py install --prefix /usr/local Now it will override the system's numpy. So you can install matplotlib, which will drag along the system's numpy, but you won't see it. On a side note, I tend to install home-built packages that override system packages only in my home. I have a $HOME/usr directory, with a small directory hierarchy (usr/lib, usr/bin, ...); it is added to my PATH and PYTHONPATH, and I install there. Gaël
Re: [Numpy-discussion] Masking an array with another array
On Apr 22, 2009, at 5:21 PM, Gökhan SEVER wrote: Hello, Could you please give me some hints about how to mask an array using another array, like in the following example. What about this? numpy.logical_or.reduce([a==i for i in b])
Re: [Numpy-discussion] Masking an array with another array
On Apr 22, 2009, at 9:03 PM, josef.p...@gmail.com wrote: I prefer broadcasting to list comprehension in numpy: Pretty neat! I still don't have the broadcasting reflex. Now, any idea which one is more efficient in terms of speed? In terms of temporaries?
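The two approaches from this thread side by side — the logical_or.reduce over a list comprehension, and a broadcasting version that compares every element of a against every element of b at once (the sample arrays here are made up):

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 2])
b = np.array([2, 5])

# list-comprehension version: one boolean temporary per element of b
mask1 = np.logical_or.reduce([a == i for i in b])

# broadcasting version: a single (len(a), len(b)) comparison table,
# collapsed along b's axis
mask2 = (a[:, None] == b[None, :]).any(axis=-1)
```

Both build a mask marking the elements of a that appear in b; the broadcasting version makes one large temporary instead of many small ones.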
Re: [Numpy-discussion] dtype field renaming
On Apr 8, 2009, at 5:57 PM, Elaine Angelino wrote: hi there -- for a numpy.recarray, is it possible to rename the fields in the dtype? Take a new view: a = np.array([(1,1)], dtype=[('a',int),('b',int)]) b = a.view([('A',int), ('b',int)]) or: use numpy.lib.recfunctions.rename_fields
Re: [Numpy-discussion] dtype field renaming
On Apr 8, 2009, at 6:18 PM, Stéfan van der Walt wrote: 2009/4/9 Pierre GM pgmdevl...@gmail.com: for a numpy.recarray, is it possible to rename the fields in the dtype? Take a new view: a = np.array([(1,1)], dtype=[('a',int),('b',int)]) b = a.view([('A',int), ('b',int)]) or: use numpy.lib.recfunctions.rename_fields Or change the names tuple: a.dtype.names = ('c', 'd') Now that's a wicked neat trick! I love it! Faster than taking a view, for sure. Note that rename_fields should also work w/ nested fields (not that common, true).
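Both renaming routes from this thread in one sketch — the view keeps its own dtype, so renaming a's fields in place doesn't touch it:

```python
import numpy as np

a = np.array([(1, 1)], dtype=[('a', int), ('b', int)])

# route 1: a view with a new dtype (same data, new field names)
b = a.view([('A', int), ('b', int)])

# route 2: rename in place by reassigning the names tuple
a.dtype.names = ('c', 'd')
```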
[Numpy-discussion] Construct symmetric matrix
All, I'm trying to build a relatively complicated symmetric matrix. I can build the upper-right block without pb. What is the fastest way to get the LL corner from the UR one? Thanks a lot in advance for any idea. P.
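One common way to mirror the upper-right block into the lower-left one is to add the transpose of the strictly upper triangle (a sketch with a stand-in matrix; it assumes the diagonal already lives in the upper block):

```python
import numpy as np

n = 4
# stand-in for the upper-triangular block built first
U = np.triu(np.arange(1, n * n + 1, dtype=float).reshape(n, n))

# mirror: the strictly-upper part (k=1 excludes the diagonal) is
# transposed and added, filling the zero lower-left corner
S = U + np.triu(U, k=1).T
```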
Re: [Numpy-discussion] trouble building docs with sphinx-0.6.1
On Apr 1, 2009, at 11:57 AM, David Cournapeau wrote:
preparing documents... done
Exception occurred:
  File /usr/lib64/python2.6/site-packages/docutils/nodes.py, line 471, in __getitem__
    return self.attributes[key]
KeyError: 'entries'
The full traceback has been saved in /tmp/sphinx-err-RDe0NL.log, if you want to report the issue to the author. Please also report this if it was a user error, so that a better error message can be provided next time. Send reports to sphinx-...@googlegroups.com. Thanks! This often happens for a non-clean build. The only solution I got so far was to start the doc build from scratch... David, that won't work here, there's a bug indeed. Part of it comes from numpydoc, which isn't completely compatible w/ Sphinx 0.6.1. In particular, the code doesn't know what to do w/ this 'entries' parameter. Part of it comes from Sphinx. Georg said he made the 'entries' parameter optional, but that doesn't solve everything. Matt Knox actually came across the 'best' solution. Edit Sphinx/environment.py, L1051: replace
refs = [(e[0], str(e[1])) for e in toctreenode['entries']]
by
refs = [(e[0], str(e[1])) for e in toctreenode.get('entries', [])]
Has anyone else tried building numpy's docs with sphinx-0.6.1? Is there any interest in sorting these issues out before 1.3 is released? I am afraid it is too late for the 1.3 release. cheers, David
Re: [Numpy-discussion] Summer of Code: Proposal for Implementing date/time types in NumPy
Ciao Marty, Great idea indeed! However, I'd really like to have an easy way to plug the suggested dtype w/ the existing Date class from the scikits.timeseries package (Date is implemented in C; you can find the sources through the link on http://pytseries.sourceforge.net). I agree that this particular aspect is not a priority, but it'd be nice to keep it in a corner of the mind. In any case, keep me in the loop. Cheers, P.

On Mar 25, 2009, at 12:06 PM, Francesc Alted wrote: Hello Marty, On Tuesday 24 March 2009, Marty Fuhry wrote: Hello, Sorry for any overlap, as I've been referred here from the scipy-dev mailing list. I was reading through the Summer of Code ideas and I'm terribly interested in the date/time proposal (http://projects.scipy.org/numpy/browser/trunk/doc/neps/datetime-proposal3.rst). I would love to work on this for a Google Summer of Code project. I'm a sophomore studying Computer Science and Mathematics at Kent State University in Ohio, so this project directly relates to my studies. Is there anyone looking into this proposal yet? To my knowledge, nobody is actively working on this anymore. As a matter of fact, during the discussions that led to the proposal, many people showed a real interest in the implementation of date/time types in NumPy. So it would be great if you can have a stab at this. Luck! -- Francesc Alted
Re: [Numpy-discussion] Help on subclassing numpy.ma: __array_wrap__
Kevin, Sorry for the delayed answer. (a) Is MA intended to be subclassed? Yes, that's actually the reason why the class was rewritten, to simplify subclassing. As Josef suggested, you can check the scikits.timeseries package that makes an extensive use of MaskedArray as baseclass. (b) If so, perhaps I'm missing something to make this work. Any pointers will be appreciated. As you've run a debugger on your sources, you must have noticed the calls to MaskedArray._update_from. In your case, the simplest is to define DTMA._update_from as such:

def _update_from(self, obj):
    ma.MaskedArray._update_from(self, obj)
    self._attr = getattr(obj, '_attr', {'EmptyDict': []})

Now, because MaskedArray.__array_wrap__() itself calls _update_from, you don't actually need a specific DTMA.__array_wrap__ (unless you have some specific operations to perform, but it doesn't seem to be the case). Now for a word of explanation: __array_wrap__ is intended to transform the output of a numpy function into an object of your class. When we use the numpy.ma functions, we don't need that, we just need to retrieve some of the attributes of the initial MA. That's why _update_from was introduced. Of course, I'm to blame for not having made that aspect explicit in the doc. I'm gonna try to correct that. In any case, let me know how it goes. P.

On Mar 1, 2009, at 10:37 AM, Kevin Dunn wrote: Hi everyone, I'm subclassing Numpy's MaskedArray to create a data class that handles missing data, but adds some extra info I need to carry around. However I've been having problems keeping this extra info attached to the subclass instances after performing operations on them. The bare-bones script that I've copied here shows the basic issue: http://pastebin.com/f69b979b8 There are 2 classes: one where I am able to subclass numpy (with help from the great description at http://www.scipy.org/Subclasses), and the other where I subclass numpy.ma, using the same ideas again. 
When stepping through the code in a debugger (lines 76 to 96), I can see that the numpy subclass, called DT, calls DT.__array_wrap__() after it completes unary and binary operations. But the numpy.ma subclass, called DTMA, does not seem to call DTMA.__array_wrap__() (especially line 111). Just to test this idea, I overrode the __mul__ function in my DTMA subclass to call DTMA.__array_wrap__(), and it returns my extra attributes, in the same way that NumPy did. My questions are: (a) Is MA intended to be subclassed? (b) If so, perhaps I'm missing something to make this work. Any pointers will be appreciated. So far it seems the only way for me to subclass numpy.ma is to override all numpy.ma functions of interest for my class and add a DTMA.__array_wrap__() call to the end of them. Hopefully there is an easier way. Related to this question, was there a particular outcome from this archived discussion (I only joined the list recently): http://article.gmane.org/gmane.comp.python.numeric.general/24315 because that dictionary object would be exactly what I'm after here. Thanks, Kevin ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
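Pierre's fix can be sketched as a minimal, self-contained subclass. The class name DTMA and the `_attr` dictionary follow the thread, but the constructor and the default attribute value are assumptions, since Kevin's original pastebin script isn't available:

```python
import numpy as np
import numpy.ma as ma

class DTMA(ma.MaskedArray):
    """Hypothetical MaskedArray subclass carrying an extra dict attribute."""

    def __new__(cls, data, mask=ma.nomask, attr=None):
        obj = ma.MaskedArray.__new__(cls, data, mask=mask)
        obj._attr = attr if attr is not None else {}
        return obj

    def _update_from(self, obj):
        # Let MaskedArray copy its own internals first...
        ma.MaskedArray._update_from(self, obj)
        # ...then retrieve our extra attribute from the source object.
        self._attr = getattr(obj, '_attr', {})

x = DTMA([1.0, 2.0, 3.0], mask=[0, 1, 0], attr={'units': 'V'})
y = x * 2   # goes through the numpy.ma machinery, which calls _update_from
```

With `_update_from` defined, arithmetic through the numpy.ma machinery carries `_attr` onto the result without needing a custom `__array_wrap__`.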
Re: [Numpy-discussion] Proposed schedule for numpy 1.3.0
David, I also started to update the release notes: http://projects.scipy.org/scipy/numpy/browser/trunk/doc/release/1.3.0-notes.rst I get a 404. Anyhow, on the ma side:
* Structured arrays should now be fully supported by MaskedArray (r6463, r6324, r6305, r6300, r6294...)
* Minor bug fixes (r6356, r6352, r6335, r6299, r6298)
* Improved support for __iter__ (r6326)
* Made baseclass, sharedmask and hardmask accessible to the user (but read-only) + doc update
Re: [Numpy-discussion] problem with assigning to recarrays
As a follow-up to Robert's answer: r[r.field1 == 1].field2 = 1 doesn't work, but r.field2[r.field1 == 1] = 1 does. So far, so good. Now I want to change the value of field2 for those same elements:

    In [128]: r[where(r.field1 == 1.)].field2 = 1

Ok, so now the values of field2 have been changed for those elements, right?

    In [129]: r.field2
    Out[129]: array([ 0., 0., 0., 0., 0.])

Wait. What? That can't be right. Let's check again:

    In [130]: print r[where(r.field1 == 1.)].field2
    [ 0. 0.]

Ok, so it appears that I can *access* fields in this array with an array of indices, but I can't assign new values to fields so accessed. However, I *can* change the values if I use a scalar index. This is different from the behavior of ordinary arrays, for which I can reassign elements' values either way. Moreover, when I try to reassign record array fields by indexing with an array of indices, it would appear that nothing at all happens; the syntax is equivalent to a pass statement. So, my question is this: is there some reason for this behavior in record arrays, which is unexpectedly different from the behavior of normal arrays and rather confusing? If so, why does the attempt to assign values to fields of an indexed subarray not raise some kind of error, rather than doing nothing? I think it's unlikely that I've actually found a bug in numpy, but this behavior does not make sense to me. Thanks for any insights, Brian
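The behavior Brian describes follows from fancy indexing returning a copy. A small sketch (field names as in the thread, sample data assumed) showing which order of operations works:

```python
import numpy as np

r = np.zeros(5, dtype=[('field1', 'f8'), ('field2', 'f8')])
r['field1'][[1, 3]] = 1.0

# Fancy indexing (boolean or integer arrays) returns a *copy*, so this
# assigns into a temporary that is immediately thrown away:
r[r['field1'] == 1.0]['field2'] = 1.0   # silently does nothing to r

# Getting the field first returns a *view*, so indexing it assigns in place:
r['field2'][r['field1'] == 1.0] = 1.0
```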
Re: [Numpy-discussion] possible bug: __array_wrap__ is not called during arithmetic operations in some cases
On Feb 22, 2009, at 6:21 PM, Eric Firing wrote: Darren Dale wrote: Does anyone know why __array_wrap__ is not called for subclasses during arithmetic operations where an iterable like a list or tuple appears to the right of the subclass? When I do mine*[1,2,3], __array_wrap__ is not called and I get an ndarray instead of a MyArray. [1,2,3]*mine is fine, as is mine*array([1,2,3]). I see the same issue with division. The masked array subclass does not show this behavior. That's because MaskedArray.__mul__ and others are redefined. Darren, you can fix your problem by redefining MyArray.__mul__ as:

    def __mul__(self, other):
        return np.ndarray.__mul__(self, np.asanyarray(other))

forcing the second term to be an ndarray (or a subclass thereof). You can do the same thing for the other functions (__add__, __radd__, ...).
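Pierre's suggested fix, as a runnable sketch. The MyArray constructor and `__rmul__` are assumptions, since Darren's original script isn't shown in the thread:

```python
import numpy as np

class MyArray(np.ndarray):
    def __new__(cls, data):
        return np.asarray(data).view(cls)

    # Coerce the right-hand operand so that lists/tuples don't bypass
    # __array_wrap__ and downcast the result to a plain ndarray:
    def __mul__(self, other):
        return np.ndarray.__mul__(self, np.asanyarray(other))

    __rmul__ = __mul__  # elementwise multiplication commutes, so reuse __mul__

mine = MyArray([1, 2, 3])
res = mine * [1, 2, 3]   # stays a MyArray instead of dropping to ndarray
```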
Re: [Numpy-discussion] Filling gaps
On Feb 12, 2009, at 8:22 PM, A B wrote: Hi, Are there any routines to fill in the gaps in an array? The simplest would be by carrying the last known observation forward:

    0,0,10,8,0,0,7,0
    0,0,10,8,8,8,7,7

Or by somehow interpolating the missing values based on the previous and next known observations (mean). Thanks. The functions forward_fill and backward_fill in scikits.timeseries should do what you want. They also work on MaskedArray objects, meaning that you don't need to have actual series. The catch is that you need to install scikits.timeseries, of course. More info here: http://pytseries.sourceforge.net/
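For plain arrays without scikits.timeseries, a loop-free forward fill along the lines of the example above can be sketched as follows. This is not the scikits.timeseries implementation; the function name is hypothetical, and 0 is treated as the missing marker as in A B's example:

```python
import numpy as np

def forward_fill(a, missing=0):
    """Carry the last non-missing observation forward.

    Leading missing entries have nothing to copy from and stay as-is.
    """
    a = np.asarray(a)
    valid = a != missing
    # Index of the most recent valid entry at each position (0 if none yet):
    idx = np.maximum.accumulate(np.where(valid, np.arange(a.size), 0))
    return a[idx]

filled = forward_fill([0, 0, 10, 8, 0, 0, 7, 0])
```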
Re: [Numpy-discussion] genloadtxt: dtype=None and unpack=True
On Feb 11, 2009, at 11:38 PM, Ryan May wrote: Pierre, I noticed that when using dtype=None with a heterogeneous set of data, trying to use unpack=True to get the columns into separate arrays (instead of a structured array) doesn't work. I've attached a patch that, in the case of dtype=None, unpacks the fields in the final array into a list of separate arrays. Does this seem like a good idea to you? Nope, as it breaks consistency: depending on some input parameters, you would either get an array or a list. I think it's better to leave it as it is, maybe adding an extra line in the doc noting that unpack=True doesn't do anything for structured arrays.
Re: [Numpy-discussion] ERROR: Test flat on masked_matrices
On Feb 7, 2009, at 8:03 AM, Nils Wagner wrote:

    ERROR: Test flat on masked_matrices
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File /usr/local/lib64/python2.5/site-packages/numpy/ma/tests/test_core.py, line 1127, in test_flat
        test = ma.array(np.matrix([[1, 2, 3]]), mask=[0, 0, 1])
    NameError: global name 'ma' is not defined

Oops, sorry about that...
Re: [Numpy-discussion] question about ufuncs
On Feb 6, 2009, at 4:25 PM, Darren Dale wrote: I've been looking at how ma implements things like multiply() and MaskedArray.__mul__. I'm surprised that MaskedArray.__mul__ actually calls ma.multiply() rather than calling super(MaskedArray, self).__mul__(). There's some under-the-hood machinery to deal with the data, and we need to be able to manipulate it *before* the operation takes place. The super() approach calls __array_wrap__ on the result, so *after* the operation has taken place, and that's not what we wanted... Maybe that is the way ndarray does it, but I don't think this is the right approach for my quantity subclasses. If I want to make a MaskedQuantity (someday), MaskedQuantity.__mul__ should be calling super(MaskedQuantity, self).__mul__(), not reimplementations of numpy.multiply or ma.multiply, right? You'll end up calling ma.multiply anyway (super(MaskedQuantity, self).__mul__() will call MaskedArray.__mul__, which calls ma.multiply...). So yes, I think you can stick to the super() approach in your case. There are some cases where the default numpy function expects certain units on the way in, like the trig functions, which I think would have to be reimplemented. And you can probably define a generic class to deal with that instead of reimplementing the functions individually (and we're back to the initial advice). But aside from that, is there anything wrong with taking this approach? It seems to allow quantities to integrate pretty well with the numpy builtins. Go and try; the problems (if any) will show up...
Re: [Numpy-discussion] Selection of only a certain number of fields
On Feb 5, 2009, at 6:08 PM, Travis E. Oliphant wrote: Hi all, I've been fairly quiet on this list for awhile due to work and family schedule, but I think about how things can improve regularly. One feature that's been requested by a few people is the ability to select multiple fields from a structured array. [...] +1 for #2. Note that we now have a drop_fields function in np.lib.recfunctions, a reimplementation of the equivalent function in matplotlib. It works along the lines of your proposition #1 (create a new array w/ a new dtype and fill it).
Re: [Numpy-discussion] poly1d left versus right multiplication with np numbers
On Feb 4, 2009, at 11:00 AM, josef.p...@gmail.com wrote: I just had a hard-to-find bug in my program: poly1d treats numpy scalars differently than python numbers when left or right multiplication is used. Essentially, if the first term is the numpy scalar, multiplied by a polynomial, then the result is an np.array. If the order is reversed, then the result is an instance of np.poly1d. (The return types are also the same for numpy arrays, which is at least understandable, although a warning would be good.) When using plain (python) numbers, both left and right multiplication of the number with the polynomial return a polynomial. Is this a bug or a feature? I didn't see it mentioned in the docs. Looks like yet another example of ticket #826: http://scipy.org/scipy/numpy/ticket/826 This one is becoming quite a problem, and I have no idea how to fix it...
Re: [Numpy-discussion] ImportError: No module named dateutil.parser
On Feb 4, 2009, at 3:56 PM, Robert Kern wrote: No, rewrite the test to not use external libraries, please. Test the functionality without needing dateutils. OK then, should be fixed in r6340.
[Numpy-discussion] Renaming a field of an object array
All, I'm a tad puzzled by the following behavior (I'm trying to correct a bug in genfromtxt). I'm creating an empty structured ndarray, using np.object as dtype:

    a = np.empty(1, dtype=[('', np.object)])
    array([(None,)], dtype=[('f0', '|O4')])

Now, I'd like to rename the field:

    a.view([('NAME', np.object)])
    TypeError: Cannot change data-type for object array.

I understand why I can't change the *type* of the field, but not why I can't change its name that way. What would be an option that wouldn't involve creating a new array? Thx in advance.
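One workaround that doesn't involve creating a new array — assuming the goal is only the rename — is to reassign the dtype's field names in place; `dtype.names` is writable for exactly this purpose in NumPy (a sketch, not from the thread):

```python
import numpy as np

a = np.empty(1, dtype=[('', object)])   # the empty name is auto-filled as 'f0'

# A view with a new dtype fails for object arrays, but the field *names*
# can be reassigned directly on the existing dtype:
a.dtype.names = ('NAME',)
```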
Re: [Numpy-discussion] genfromtxt view with object dtype
OK, Brent, try r6341. I fixed genfromtxt for cases like yours (an explicit dtype involving an np.object). Note that the fix won't work if the dtype is nested and involves np.objects (as we would hit the problem of renaming fields we observed...). Let me know how it goes. P.

On Feb 4, 2009, at 4:03 PM, Brent Pedersen wrote: On Wed, Feb 4, 2009 at 9:36 AM, Pierre GM pgmdevl...@gmail.com wrote: On Feb 4, 2009, at 12:09 PM, Brent Pedersen wrote: hi, i am using genfromtxt, with a dtype like this: [('seqid', '|S24'), ('source', '|S16'), ('type', '|S16'), ('start', 'i4'), ('end', 'i4'), ('score', 'f8'), ('strand', '|S1'), ('phase', 'i4'), ('attrs', '|O4')] Brent, Please post a simple, self-contained example with a few lines of the file you want to load. hi pierre, here is an example. thanks, -brent

    ##
    import numpy as np
    from cStringIO import StringIO

    gffstr = """\
    ##gff-version 3
    1\tucb\tgene\t2234602\t2234702\t.\t-\t.\tID = grape_1_2234602_2234702 ;match = EVM_prediction_supercontig_1.248,EVM_prediction_supercontig_1.248.mRNA
    1\tucb\tgene\t2300292\t2302123\t.\t+\t.\tID=grape_1_2300292_2302123;match=EVM_prediction_supercontig_244.8
    1\tucb\tgene\t2303615\t2303967\t.\t+\t.\tID=grape_1_2303615_2303967;match=EVM_prediction_supercontig_244.8
    1\tucb\tgene\t2303616\t2303966\t.\t+\t.\tParent=grape_1_2303615_2303967
    1\tucb\tgene\t3596400\t3596503\t.\t-\t.\tID=grape_1_3596400_3596503;match=evm.TU.supercontig_167.27
    1\tucb\tgene\t3600651\t3600977\t.\t-\t.\tmatch=evm.model.supercontig_1217.1,evm.model.supercontig_1217.1.mRNA
    """

    dtype = {'names': ('seqid', 'source', 'type', 'start', 'end', 'score',
                       'strand', 'phase', 'attrs'),
             'formats': ['S24', 'S16', 'S16', 'i4', 'i4', 'f8', 'S1', 'i4', 'S128']}

    # OK with S128 for attrs
    print np.genfromtxt(StringIO(gffstr), dtype=dtype)

    def _attr(kvstr):
        pairs = [kv.split("=") for kv in kvstr.split(";")]
        return dict(pairs)

    # change S128 to object to have col attrs as dictionary
    dtype['formats'][-1] = 'O'
    converters = {8: _attr}

    # NOT OK
    print np.genfromtxt(StringIO(gffstr), dtype=dtype, converters=converters)
[Numpy-discussion] Numpy 1.3 release date ?
All, When can we expect numpy 1.3 to be released? Sincerely, P.
Re: [Numpy-discussion] genloadtxt question
On Feb 3, 2009, at 11:24 AM, Ryan May wrote: Pierre, Should the following work?

    import numpy as np
    from StringIO import StringIO
    from datetime import datetime
    converter = {'date': lambda s: datetime.strptime(s, '%Y-%m-%d %H:%M:%SZ')}
    data = np.ndfromtxt(StringIO('2009-02-03 12:00:00Z,72214.0'),
                        delimiter=',', names=['date', 'stid'], dtype=None,
                        converters=converter)

Well, yes, it should work. That's indeed a problem with the getsubdtype method of the converter. The problem is that we need to estimate the datatype of the output of the converter. In most cases, trying to convert '0' works properly; not in yours, however. In r6338, I force the type to object if converting '0' does not work. That's a patch till the next corner case...
Re: [Numpy-discussion] Operations on masked items
On Feb 3, 2009, at 4:00 PM, Ryan May wrote: Well, I guess I hit send too soon. Here's one easy solution (consistent with what you did for __radd__): change the code for __rmul__ to do return multiply(self, other) instead of return multiply(other, self). That fixes it for me, and I don't see how it would break anything. Good call, but once again: Thou shalt not put trust in ye masked values [1].

    a = np.ma.array([1, 2, 3], mask=[0, 1, 0])
    b = np.ma.array([10, 20, 30], mask=[0, 1, 0])
    (a*b).data
    array([10,  2, 90])
    (b*a).data
    array([10, 20, 90])

So yes, __mul__ is not commutative when you deal w/ masked arrays (at least when you try to access the data under a mask). Nothing I can do. Remember that preventing the underlying data from being modified is NEVER guaranteed... [1] Epistle of Paul (Dubois).
Re: [Numpy-discussion] question about ufuncs
On Feb 1, 2009, at 6:32 PM, Darren Dale wrote: Is there an analog to __array_wrap__ for preprocessing arrays on their way *into* a ufunc? For example, it would be nice if one could do something like: numpy.sin([1,2,3]*arcseconds) where we have the opportunity to inspect the context, convert the Quantity to units of radians, and then actually call the ufunc. Is this possible, or does one have to reimplement such functions? Just an idea: look at the code for numpy.ma ufuncs (in numpy.ma.core). By defining a few classes for unary, binary and domained functions, you could probably do what you want, without having to recode all the functions by hand. Another idea would be to define some specific __mul__ or __rmul__ rules for your units, so that the list would be transformed into a UnitArray...
Re: [Numpy-discussion] puzzle: generate index with many ranges
On Jan 30, 2009, at 1:11 PM, Raik Gruenberg wrote: Mhm, I got this far. But how do I get from here to a single index array [4, 5, 6, ..., 10, 0, 1, 2, 3, 11, 12, 13, 14]?

    np.concatenate([np.arange(aa, bb) for (aa, bb) in zip(a, b)])
Re: [Numpy-discussion] puzzle: generate index with many ranges
On Jan 30, 2009, at 1:53 PM, Raik Gruenberg wrote: Pierre GM wrote: On Jan 30, 2009, at 1:11 PM, Raik Gruenberg wrote: Mhm, I got this far. But how do I get from here to a single index array [4, 5, 6, ..., 10, 0, 1, 2, 3, 11, 12, 13, 14]? np.concatenate([np.arange(aa,bb) for (aa,bb) in zip(a,b)]) exactly! Now, the question was: is there a way to do this only using numpy functions (sum, repeat, ...), that is, without any python for loop? Can't really see it right now. Make np.arange(max(b)) and take the slices you need? But you still have to look in 2 arrays to find the beginning and end of slices, so... Sorry about being so insistent on this one but, in my experience, eliminating those for loops makes a huge difference in terms of speed. The zip is probably also quite costly on a very large data set. Yeah, but it's in a list comprehension, which may make things a tad faster. If you prefer, use itertools.izip instead of zip, but I wonder where the advantage would be. Anyway, are you sure this particular part is your bottleneck? You know the saying about premature optimization...
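For the record, the loop can indeed be eliminated with only repeat/cumsum/arange, along the lines Raik asked about. This is a sketch (the helper name is invented, not from the thread): each output position gets its offset within its block plus that block's starting value:

```python
import numpy as np

def concat_ranges(a, b):
    """Concatenate arange(a[i], b[i]) for all i, without a Python loop."""
    a = np.asarray(a)
    b = np.asarray(b)
    lens = b - a                                          # length of each block
    starts = np.concatenate(([0], np.cumsum(lens)[:-1]))  # output offset of each block
    # Position within each block, shifted so block i starts at a[i]:
    return np.arange(lens.sum()) - np.repeat(starts, lens) + np.repeat(a, lens)

idx = concat_ranges([4, 0, 11], [11, 4, 15])
```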
Re: [Numpy-discussion] Documentation: objects.inv ?
On Jan 29, 2009, at 3:17 AM, Pauli Virtanen wrote: Thu, 29 Jan 2009 00:28:46 -0500, Pierre GM wrote: Is there an objects.inv lying around for the numpy reference guide, or should I start one from scratch? It's automatically generated by Sphinx, and can be found at http://docs.scipy.org/doc/numpy/objects.inv Let's make the promise that it shall be found there in the future, too. Got it, thanks a lot. Pauli, how often is the documentation on docs.scipy.org updated from SVN? Thx again P.
Re: [Numpy-discussion] optimise operation in array with datetime objects
On Jan 28, 2009, at 3:56 PM, Timmie wrote:

    ### this is the loop I would like to optimize:
    ### looping over arrays is considered inefficient.
    ### what could be a better way?
    hours_array = dates_array.copy()
    for i in range(0, dates_array.size):
        hours_array[i] = dates_array[i].hour

You could try:

    np.fromiter((_.hour for _ in dates_li), dtype=np.int)

or

    np.array([_.hour for _ in dates_li], dtype=np.int)
Re: [Numpy-discussion] optimise operation in array with datetime objects
On Jan 28, 2009, at 5:43 PM, Timmie wrote: You could try: np.fromiter((_.hour for _ in dates_li), dtype=np.int) or np.array([_.hour for _ in dates_li], dtype=np.int) I used dates_li only for the preparation of example data. So let's suppose I have the array dates_array returned from a function. Just use dates_array instead of dates_li, then.

    hours_array = dates_array.copy()
    for i in range(0, dates_array.size):
        hours_array[i] = dates_array[i].hour

* What's the point of making a copy of dates_array? dates_array is an ndarray of objects, right? And you want to take the hours, so you should have an ndarray of integers for hours_array. * The issue I have with this part is that you have several calls to __getitem__ at each iteration. It might be faster to create hours_array as a block:

    hours_array = np.array([_.hour for _ in dates_array], dtype=np.int)
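Pierre's np.fromiter suggestion as a self-contained sketch; the sample dates are invented for illustration:

```python
import numpy as np
from datetime import datetime

# A small object array of datetimes, standing in for Timmie's dates_array:
dates_array = np.array([datetime(2009, 1, 28, h) for h in (0, 6, 12, 18)],
                       dtype=object)

# Build the integer hours array directly, instead of copying the object
# array and filling it element by element:
hours_array = np.fromiter((d.hour for d in dates_array),
                          dtype=int, count=dates_array.size)
```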
Re: [Numpy-discussion] recfunctions.stack_arrays
[Some background: we're talking about numpy.lib.recfunctions, a set of functions to manipulate structured arrays] Ryan, If the two files have the same structure, you can use that fact and specify the dtype of the output directly with the dtype parameter of mafromtxt. That way, you're sure that the two arrays will have the same dtype. If you don't know the structure beforehand, you could try to load one array and use its dtype as input of mafromtxt to load the second one. Now, we could also try to modify stack_arrays so that it would take the largest dtype when several fields have the same name. I'm not completely satisfied by this approach, as it makes dtype conversions under the hood. Maybe we could provide the functionality as an option (w/ a forced_conversion boolean input parameter)? I'm a bit surprised by the error message you get. If I try:

    a = ma.array([(1, 2, 3)], mask=[(0, 1, 0)],
                 dtype=[('a', int), ('b', bool), ('c', float)])
    b = ma.array([(4, 5, 6)], dtype=[('a', int), ('b', float), ('c', float)])
    test = np.stack_arrays((a, b))

I get a TypeError instead (the field 'b' doesn't have the same type in a and b). Now, I get the 'two fields w/ the same name' error when I use np.merge_arrays (with the flatten option). Could you send a small example? P.S. Thanks so much for your work on putting those utility functions in recfunctions.py. It makes it so much easier to have these functions available in the library itself rather than needing to reinvent the wheel over and over. Indeed. Note that most of the job had been done by John Hunter and the matplotlib developers in their matplotlib.mlab module, so you should thank them and not me. I just cleaned up some of the functions.
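Later NumPy versions grew np.promote_types, which implements the "largest dtype" rule discussed here (it returns the smallest dtype to which both inputs can be safely cast). A sketch of picking a common dtype per field — the field layout mirrors Pierre's example, but this is not what stack_arrays actually did at the time:

```python
import numpy as np

da = np.dtype([('a', int), ('b', bool)])
db = np.dtype([('a', int), ('b', float)])

# For each shared field, promote to a dtype both sides can be safely cast to.
# For 'b', bool vs float promotes to float:
merged = np.dtype([(name, np.promote_types(da[name], db[name]))
                   for name in da.names])
```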
Re: [Numpy-discussion] recfunctions.stack_arrays
On Jan 27, 2009, at 4:23 PM, Ryan May wrote: I definitely wouldn't advocate magic by default, but I think it would be nice to be able to get the functionality if one wanted to. OK. Put on the TODO list. There is one problem I noticed, however. I found common_type and lib.mintypecode, but both raise errors when trying to find a dtype to match both bool and float. I don't know if there's another function somewhere that would work for what I want. I'm not familiar with these functions, I'll check that. Apparently, I get my error as a result of my use of titles in the dtype to store an alternate name for the field. (If you're not familiar with titles, they're nice because you can get fields by either name, so for the following example, a['a'] and a['A'] both return array([1]).) The following version of your case gives me the ValueError: Ah OK. You found a bug. There's a frustrating feature of dtypes: dtype.names doesn't always match [_[0] for _ in dtype.descr]. As a side question, do you have some local mods to your numpy SVN so that some of the functions in recfunctions are available in numpy's top level? Probably. I used the develop option of setuptools to install numpy in a virtual environment. On mine, I can't get to them except by importing them from numpy.lib.recfunctions. I don't see any mention of recfunctions in lib/__init__.py. Well, till some problems are ironed out, I'm not really in favor of advertising them too much...
Re: [Numpy-discussion] Academic citation ?
JH, Thx for the links, but I'm afraid I need something more basic than that. For example, I'm referring to Python as: van Rossum, G. and Drake, F. L. (eds), 2006. Python Reference Manual, Python Software Foundation. http://docs.python.org/ref/ref.html. I could indeed use http://www.scipy.org/Citing_SciPy to cite Scipy (although the citation is incomplete), and define something similar for Numpy... Or refer to the Computing in Science and Engineering special issue. I'm just a bit surprised there's no official standard. Thx, P. On Jan 26, 2009, at 10:56 AM, j...@physics.ucf.edu wrote: What is the most up-to-date way to cite Numpy and Scipy in an academic journal? Cite our conference articles here: http://conference.scipy.org/proceedings/SciPy2008/index.html It would be nice if someone involved in the proceedings could post a BibTeX entry on the citations page. And link the citations page to... something... easily navigated to from the front page. This brings up a related point: when someone goes to scipy.org, there is no way to navigate to conferences.scipy.org from scipy.org except by finding the link buried in the intro text. IPython and all the whatever.scipy.org domains, except for docs.scipy.org, are completely absent; you have to know about them to find them. I don't even know where to find a complete list of these. They should all have a presence on at least the front page and maybe the navigation. --jh--
Re: [Numpy-discussion] Bug with mafromtxt
On Jan 24, 2009, at 6:23 PM, Ryan May wrote: Ok, thanks. I've dug a little further, and it seems like the problem is that a column of all missing values ends up as a column of all None's. When you create a (masked) array from a list of None's, you end up with an object array. On one hand I'd love for things to behave differently in this case, but on the other I understand why things work this way. Ryan, Mind giving r6434 a try? As usual, don't hesitate to report any problem.
[Numpy-discussion] Academic citation ?
All, What is the most up-to-date way to cite Numpy and Scipy in an academic journal? Thanks a lot in advance, P.
Re: [Numpy-discussion] Academic citation ?
David, Thanks, but that's only part of what I need. I could also refer to Travis O's paper in Computing in Science and Engineering, but I wondered whether there wasn't something more up-to-date. So, other answers are still welcome. P. On Jan 25, 2009, at 8:17 PM, David Warde-Farley wrote: I believe this is what you're looking for: http://www.scipy.org/Citing_SciPy On 25-Jan-09, at 6:45 PM, Pierre GM wrote: All, What is the most up-to-date way to cite Numpy and Scipy in an academic journal ? Thanks a lot in advance P.
Re: [Numpy-discussion] Bug with mafromtxt
Ryan, Thanks for reporting. An idea would be to force the dtype of the masked column to the largest dtype of the other columns (in your example, that would be int). I'll try to see how easily it can be done early next week. Meanwhile, you can always give an explicit dtype at creation. On Jan 24, 2009, at 5:58 PM, Ryan May wrote: Pierre, I've found what I consider to be a bug in the new mafromtxt (though apparently it existed in earlier versions as well). If you have an entire column of data in a file that contains only masked data, and try to get mafromtxt to automatically choose the dtype, the dtype gets selected to be object type. In this case, I'd think the better behavior would be float, but I'm not sure how hard it would be to make this the case. Here's a test case:

    import numpy as np
    from StringIO import StringIO
    s = StringIO('1 2 3\n4 5 6\n')
    a = np.mafromtxt(s, missing='2,5', dtype=None)
    print a.dtype

Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma
Re: [Numpy-discussion] numpy.array and subok kwarg
Darren, The type returned by np.array is ndarray, unless I specifically set subok=True, in which case I get a MyArray. The default value of subok is True, so I don't understand why I have to specify subok unless I want it to be False. Is my subclass missing something important? Blame the doc: the default for subok in array is False, as is explicit in the _array_fromobject C function (in multiarray). So no, you're not doing anything wrong. Note that by default subok=True for numpy.ma.array.
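The behavior Pierre describes can be checked directly; MyArray here is a bare stand-in for Darren's subclass:

```python
import numpy as np

class MyArray(np.ndarray):
    pass

mine = np.zeros(3).view(MyArray)

plain = np.array(mine)            # subok defaults to False -> plain ndarray
sub = np.array(mine, subok=True)  # the subclass is preserved
```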
Re: [Numpy-discussion] Failures in test_recfunctions.py
Interesting. The tests pass on my machine: OS X, Python version 2.5.4 (r254:67916, Dec 29 2008, 17:02:44) [GCC 4.0.1 (Apple Inc. build 5488)], nose version 0.10.4. For:

    File /home/nwagner/local/lib64/python2.6/site-packages/numpy/lib/tests/test_recfunctions.py, line 34, in test_zip_descr
    np.dtype([('', 'i4'), ('', 'i4')]))

I guess I can change 'i4' to int, which should work. For:

    FAIL: Test the ignoremask option of find_duplicates
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File /home/nwagner/local/lib64/python2.6/site-packages/numpy/lib/tests/test_recfunctions.py, line 186, in test_find_duplicates_ignoremask
        assert_equal(test[-1], control)
    (mismatch 33.33%)
     x: array([0, 1, 3, 4, 2, 6])
     y: array([0, 1, 3, 4, 6, 2])

there's obviously a machine-dependent element somewhere. I'd blame argsort: the last 2 indices that are switched correspond to the masked elements in the input of the test. Note that the result is basically correct. I should have access to a linux box, I'll see what I can do.
Re: [Numpy-discussion] Failures in test_recfunctions.py
On Jan 22, 2009, at 1:31 PM, Nils Wagner wrote: Hi Pierre, Thank you. Works for me. You're welcome, thanks for reporting!
Re: [Numpy-discussion] strange multiplication behavior with numpy.float64 and ndarray subclass
On Jan 21, 2009, at 11:34 AM, Darren Dale wrote: I have a simple test script here that multiplies an ndarray subclass with another number. Can anyone help me understand why each of these combinations returns a new instance of MyArray:

    mine = MyArray()
    print type(np.float32(1)*mine)
    print type(mine*np.float32(1))
    print type(mine*np.float64(1))
    print type(1*mine)
    print type(mine*1)

but this one returns a np.float64 instance? FYI, that's the same behavior as observed in ticket #826. A first thread addressed that issue: http://www.mail-archive.com/numpy-discussion@scipy.org/msg13235.html But so far, no answer has been suggested. Any help welcome.
Re: [Numpy-discussion] genfromtxt
Brent, Currently, no, you won't be able to retrieve the header if it's commented. I'll see what I can do. P.
Re: [Numpy-discussion] genfromtxt
Brent, Mind trying r6330 and letting me know if it works for you? Make sure that you use names=True to detect a header. P.
Re: [Numpy-discussion] Examples for numpy.genfromtxt
Till I write some proper doc, you can check the examples in tests/test_io (the TestFromTxt test case). On Jan 20, 2009, at 4:17 AM, Nils Wagner wrote: Hi all, Where can I find some sophisticated examples for the usage of numpy.genfromtxt ? Nils
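For the archives, a small self-contained example of the kind Nils was after (hypothetical data; `genfromtxt` is the name genloadtxt eventually took in numpy):

```python
import numpy as np
from io import StringIO

# The header line gives the field names; the empty field becomes nan.
data = StringIO("a,b,c\n1,2,3\n4,,6")
arr = np.genfromtxt(data, delimiter=",", names=True)
print(arr["a"])   # fields are accessed by name
print(arr["b"])   # the missing entry shows up as nan
```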
Re: [Numpy-discussion] numpy.testing.asserts and masked array
On Jan 16, 2009, at 10:51 AM, josef.p...@gmail.com wrote: I have a regression result with masked arrays that produces a masked array output, estm5.yhat, and I want to test equality to the benchmark case, estm1.yhat, with the asserts in numpy.testing, but I am getting strange results. ... What's the trick to assert_almost_equal for masked arrays? Use numpy.ma.testutils.assert_almost_equal instead.
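A quick illustration of the difference (a small sketch, assuming the benchmark and the result differ only under the mask):

```python
import numpy.ma as ma
from numpy.ma.testutils import assert_almost_equal

# The masked entries differ wildly, but the ma.testutils comparison
# only checks the unmasked data, so this passes.
a = ma.array([1.0, 2.0, 3.0], mask=[False, False, True])
b = ma.array([1.0, 2.0, 99.0], mask=[False, False, True])
assert_almost_equal(a, b)
print("masked entries are ignored by ma.testutils")
```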
Re: [Numpy-discussion] unique1d and asarray
On Jan 4, 2009, at 4:47 PM, Robert Kern wrote: On Sun, Jan 4, 2009 at 15:44, Pierre GM pgmdevl...@gmail.com wrote: If we used np.asanyarray instead, subclasses are recognized properly, the mask is recognized by argsort and the result correct. Is there a reason why we use np.asarray instead of np.asanyarray ? Probably not. So there wouldn't be any objections to making the switch ? We can wait a couple of days if anybody has a problem with that...
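The difference in a nutshell (a small sketch):

```python
import numpy as np
import numpy.ma as ma

x = ma.array([3, 1, 2], mask=[False, False, True])
# np.asarray casts down to a plain ndarray, discarding the subclass
# (and with it, the mask)...
print(type(np.asarray(x)))
# ...while np.asanyarray lets subclasses such as MaskedArray through.
print(type(np.asanyarray(x)))
```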
[Numpy-discussion] genloadtxt : ready for inclusion
All, You'll probably remember that last December, I started rewriting np.loadtxt and came up with a series of functions that support missing data. I tried to copy/paste the code in numpy.lib.io.py but ran into dependency problems and left it at that. I think that part of the reason is that the code relies on numpy.ma which can't be loaded when numpy.lib gets loaded. As I needed a way to grant access to the code to anybody, I created a small project on launchpad; you can access it at: https://code.launchpad.net/~pierregm/numpy/numpy_addons The loadtxt reimplementation functions can be found in the numpy.io.fromascii module, their unittests in the corresponding test directory. In addition, you'll find several other functions and their unittests to manipulate arrays w/ flexible data-type. They are basically rewritten versions of some functions in matplotlib.mlab. Would anybody be willing to try inserting the new functions in numpy ? I was hoping that genfromtxt and friends would make it to numpy 1.3.x (I'd need the code for the scikits.timeseries package). As usual, I'd need all the feedback you can share. Thanks a lot in advance. P.
Re: [Numpy-discussion] Alternative to record array
Jean-Baptiste, As you stated, everything depends on what you want to do. If you need to keep the correspondence age/weight for each entry, then yes, record arrays, or at least flexible-type arrays, are the best. (The difference between a recarray and a flexible-type array is that fields can be accessed either by attributes (data.age) or by items (data['age']) with recarrays, but only by items with flexible-type arrays.) Using your example, you could very well do:

    data['age'] += 1

and still keep the correspondence age/weight. Your FieldArray class returns an object that is not a ndarray, which may have some undesired side-effects. As Ryan noted, flexible-type arrays are usually faster, because they lack the overhead brought by the possibility of accessing data by attributes. So, if you don't mind using the 'access-by-fields' syntax, you're good to go. On Dec 29, 2008, at 10:58 AM, Jean-Baptiste Rudant wrote: Hello, I like to use record arrays to access fields by their name, and because they are easy to use with pytables. But I think it's not very efficient for what I have to do. Maybe I'm misunderstanding something. Example:

    import numpy as np
    age = np.random.randint(0, 99, 10e6)
    weight = np.random.randint(0, 200, 10e6)
    data = np.rec.fromarrays((age, weight), names='age, weight')
    # The kind of operations I do is:
    data.age += 1
    # but it's far less efficient than doing:
    age += 1
    # because I think the record array stores
    # [(age_0, weight_0) ... (age_n, weight_n)]
    # and not [age_0 ... age_n] then [weight_0 ... weight_n].

So I think I don't use record arrays for the right purpose. I only need something which would make it easy for me to manipulate data by accessing fields by their name. Am I wrong ? Is there something in numpy for my purpose ?
Do I have to implement my own class, with something like:

    class FieldArray:
        def __init__(self, array_dict):
            self.array_list = array_dict
        def __getitem__(self, field):
            return self.array_list[field]
        def __setitem__(self, field, value):
            self.array_list[field] = value

    my_arrays = {'age': age, 'weight': weight}
    data = FieldArray(my_arrays)
    data['age'] += 1

Thank you for the help, Jean-Baptiste Rudant
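Pierre's point about field access can be seen directly: indexing a structured array by field name returns a view, so in-place arithmetic on one field leaves the others alone (small sketch with made-up data):

```python
import numpy as np

data = np.zeros(3, dtype=[('age', int), ('weight', float)])
data['age'] = [10, 20, 30]
data['weight'] = [60.0, 70.0, 80.0]
# data['age'] is a view into the structured array, so this updates the
# ages in place while keeping each (age, weight) pair together.
data['age'] += 1
print(data['age'])     # [11 21 31]
print(data['weight'])  # untouched
```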
Re: [Numpy-discussion] is there a sortrows
On Dec 21, 2008, at 10:19 PM, josef.p...@gmail.com wrote: From the examples that I tried out, np.sort sorts each column separately (with axis=0). If the elements of a row are supposed to stay together, then np.sort doesn't work. Well, if the elements are supposed to stay together, why wouldn't you tie them first, sort, and then untie them ?

    np.sort(a.view([('',int),('',int)]), 0).view(int)

The first view transforms your 2D array into a 1D array of tuples, the second one transforms the 1D array back to 2D. Not sure it's better than your lexsort, haven't timed it.
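Spelled out on a concrete array (the same tie/sort/untie trick as above, with int fields):

```python
import numpy as np

a = np.array([[3, 1],
              [1, 2],
              [3, 0]])
# View each row as one structured element, sort those elements along
# axis 0 (lexicographic by fields), then view back to plain ints.
dt = [('', a.dtype)] * a.shape[1]
rows_sorted = np.sort(a.view(dt), axis=0).view(a.dtype).reshape(a.shape)
print(rows_sorted)   # rows travel together
```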
Re: [Numpy-discussion] Unexpected MaskedArray behavior
On Dec 17, 2008, at 12:13 PM, Jim Vickroy wrote: Sorry for being dense about this, but I really do not understand why masked values should not be trusted. If I apply a procedure to an array with elements designated as untouchable, I would expect that contract to be honored. What am I missing here? Thanks for your patience! -- jv Everything depends on your interpretation of masked data. Traditionally, masked data indicate invalid data, whatever the cause of the invalidity. Operations involving invalid data yield invalid data, hence the presence of a mask on the result. However, the value underneath the mask is still invalid, hence the statement "don't trust masked values". Interpreting a mask as a way to prevent some elements of an array from being processed (designating them as untouchable) is a bit of a stretch. Nevertheless, I agree that this behavior is not intuitive, so I'll check what I can do.
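The traditional semantics in one small example: operating on invalid data yields invalid data, and the result's mask records that.

```python
import numpy.ma as ma

a = ma.array([1.0, 2.0, 3.0], mask=[False, True, False])
b = a + 10
# The invalid (masked) input propagates to an invalid output; the
# values under b's mask should not be trusted.
print(b)
print(b.mask)
```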
Re: [Numpy-discussion] genloadtxt : last call
Ryan, OK, I'll look into that. I won't have time to address it before next week, however. Option #2 looks like the best. In other news, I was considering renaming genloadtxt to genfromtxt, and using ndfromtxt, mafromtxt, recfromtxt, recfromcsv for the function names. That way, loadtxt is untouched. On Dec 16, 2008, at 6:07 PM, Ryan May wrote: Pierre GM wrote: All, Here's the latest version of genloadtxt, with some recent corrections. With just a couple of tweaks, we end up with some decent speed: it's still slower than np.loadtxt, but only by 15% or so according to the test at the end of the package. I have one more usage issue that you may or may not want to fix. My problem is that missing values are specified by their string representation, so that a string representing a missing value, while having the same actual numeric value, may not compare equal when represented as a string. For instance, if you specify that -999.0 represents a missing value, but the value written to the file is -999.00, you won't end up masking the -999.00 data point. I'm sure a test case will help here:

    def test_withmissing_float(self):
        data = StringIO.StringIO('A,B\n0,1.5\n2,-999.00')
        test = mloadtxt(data, dtype=None, delimiter=',',
                        missing='-999.0', names=True)
        control = ma.array([(0, 1.5), (2, -1.)],
                           mask=[(False, False), (False, True)],
                           dtype=[('A', np.int), ('B', np.float)])
        print control
        print test
        assert_equal(test, control)
        assert_equal(test.mask, control.mask)

Right now this fails with the latest version of genloadtxt. I've worked around this by specifying a whole bunch of string representations of the values, but I wasn't sure if you knew of a better way that this could be handled within genloadtxt. I can only think of two ways, though I'm not thrilled with either: 1) Call the converter on the string form of the missing value and compare against the converted value from the file to determine if missing. (Probably very slow) 2) Add a list of objects (ints, floats, etc.)
to compare against after conversion to determine if they're missing. This might needlessly complicate the function, which I know you've already taken pains to optimize. If there's no good way to do it, I'm content to live with a workaround. Ryan -- Ryan May Graduate Research Assistant School of Meteorology University of Oklahoma
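Ryan's workaround — listing every string spelling of the missing value — looks like this with today's np.genfromtxt (the keyword ended up being spelled `missing_values` rather than `missing`; the data is from his test case):

```python
import numpy as np
from io import StringIO

data = StringIO("A,B\n0,1.5\n2,-999.00")
# Matching is done on the raw text, so '-999.0' alone would not catch
# '-999.00'; listing both spellings works around that.
test = np.genfromtxt(data, delimiter=",", names=True,
                     missing_values="-999.0,-999.00", usemask=True)
print(test["B"])       # the -999.00 entry is masked
print(test["B"].mask)
```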
Re: [Numpy-discussion] Unexpected MaskedArray behavior
On Dec 16, 2008, at 1:57 PM, Ryan May wrote: I just noticed the following and I was kind of surprised:

    >>> a = ma.MaskedArray([1,2,3,4,5], mask=[False,True,True,False,False])
    >>> b = a*5
    >>> b
    masked_array(data = [5 -- -- 20 25],
                 mask = [False True True False False],
                 fill_value=999999)
    >>> b.data
    array([ 5, 10, 15, 20, 25])

I was expecting that the underlying data wouldn't get modified while masked. Is this actual behavior expected? Meh. Masked data shouldn't be trusted anyway, so I guess it doesn't really matter one way or the other. But I tend to agree: it'd make more sense to leave masked data untouched (or at least, reset them to their original value after the operation), which would mimic the behavior of gimp/photoshop. Looks like there's a relatively easy fix. I need time to check that it doesn't break anything elsewhere and that it doesn't slow things down too much. I won't have time to test all that before next week, though. In any case, that would be for 1.3.x, not for 1.2.x. In the meantime, if you need the functionality, use something like ma.where(a.mask, a, a*5).
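The suggested stop-gap, spelled out: wherever the mask is set, pick the original array (so those slots stay masked), and take the scaled values elsewhere.

```python
import numpy.ma as ma

a = ma.MaskedArray([1, 2, 3, 4, 5], mask=[False, True, True, False, False])
# Where a is masked, take a itself (the result stays masked there);
# elsewhere, take the scaled values.
b = ma.where(a.mask, a, a * 5)
print(b)
```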
Re: [Numpy-discussion] Superclassing numpy.matrix: got an unexpected keyword argument 'dtype'
Robert, Transforming your matrix to a list before computation isn't very efficient. If you do need some extra parameters in your __init__ to be compatible with other functions such as asmatrix, well, just add them, or use a catch-all **kwargs:

    def __init__(self, instruments, **kwargs)

No guarantee it'll work all the time. Otherwise, please have a look at: http://docs.scipy.org/doc/numpy/user/basics.subclassing.html and the other link at the top of that page. In your case, I'd try to put the initialization in __array_finalize__. On Dec 10, 2008, at 7:15 AM, [EMAIL PROTECTED] wrote: Hello, I'm using numpy-1.1.1 for Python 2.3. I'm trying to create a class that acts just like the numpy.matrix class with my own added methods and attributes. I want to pass my class a list of custom instrument objects and do some math based on these objects to set the matrix. To this end I've done the following:

    from numpy import matrix

    class rcMatrix(matrix):
        def __init__(self, instruments):
            """Do some calculations and set the values of the matrix."""
            self[0,0] = 100  # Just an example:
            self[0,1] = 100  # the real init method
            self[1,0] = 200  # does some math based
            self[1,1] = 300  # on the input objects

        def __new__(cls, instruments):
            """When creating a new instance, begin by creating an NxN matrix of zeroes."""
            len_ = len(instruments)
            return matrix.__new__(cls, [[0.0]*len_]*len_)

It works great and I can, for example, multiply two of my custom matrices seamlessly. I can also get the transpose. However, when I try to get the inverse I get an error:

    >>> rcm = rcMatrix(['instrument1','instrument2'])
    >>> print rcm
    [[ 100. 100.]
     [ 200. 300.]]
    >>> print rcm.T
    [[ 100. 200.]
     [ 100. 300.]]
    >>> print [5,10] * rcm
    [[ 2500. 3500.]]
    >>> print rcm.I
    Traceback (most recent call last):
      File "[Standard]/deleteme", line 29, in ?
      File "C:\Python23\Lib\site-packages\numpy\core\defmatrix.py", line 492, in getI
        return asmatrix(func(self))
      File "C:\Python23\Lib\site-packages\numpy\core\defmatrix.py", line 52, in asmatrix
        return matrix(data, dtype=dtype, copy=False)
    TypeError: __init__() got an unexpected keyword argument 'dtype'

I've had to overwrite the getI function in order for things to work out:

    def getI(self):
        return matrix(self.tolist()).I
    I = property(getI, None, doc="inverse")

Is this the correct way to achieve my goals? Please let me know if anything is unclear. Thanks, Robert Conde
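A sketch of Pierre's suggestion: tolerate the extra keywords during construction, and move attribute propagation into __array_finalize__. The `RCMatrix` below is illustrative, not Robert's actual class; it takes the matrix payload directly instead of computing it from the instruments.

```python
import numpy as np

class RCMatrix(np.matrix):
    # Accept the dtype/copy keywords that asmatrix supplies, so that
    # .I and friends no longer blow up in __init__.
    def __new__(cls, data, dtype=None, copy=False, instruments=None):
        obj = np.matrix(data, dtype=dtype, copy=copy).view(cls)
        obj.instruments = instruments
        return obj

    def __array_finalize__(self, obj):
        # Called on views and templates too: propagate the attribute.
        super().__array_finalize__(obj)
        self.instruments = getattr(obj, "instruments", None)

rcm = RCMatrix([[100.0, 100.0], [200.0, 300.0]],
               instruments=["instrument1", "instrument2"])
print(rcm.I)              # the inverse now works
print(rcm.T.instruments)  # the attribute survives transposition
```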
Re: [Numpy-discussion] genloadtxt : last call
On Dec 9, 2008, at 12:59 PM, Christopher Barker wrote: Jarrod Millman wrote: From the user's perspective, I would like all the NumPy IO code to be in the same place in NumPy, and all the SciPy IO code to be in the same place in SciPy. +1 So, no problem w/ importing numpy.ma and numpy.records in numpy.lib.io ? So I wonder if it would make sense to incorporate AstroAsciiData? Doesn't it overlap a lot with genloadtxt? If so, that's a bit confusing to new users. For the little I browsed, do we need it ? We could get the same thing with record arrays... 3. What about data source? Should we remove datasource? Start using it more? Start using it more -- it sounds very handy. Didn't know it was around. I'll adapt genloadtxt to use it. Documentation -- ("Let me try NumPy; this seems pretty good. Now let's see how to load in some of my data.") Totally key -- I have a colleague that has used Matlab a fair bit in the past and is starting a new project -- he asked me what to use. I, of course, suggested python+numpy+scipy. His first question was -- can I load data in from Excel? So that would go in scipy.io ? One more comment -- for fast reading of lots of ascii data, fromfile() needs some help -- I wish I had more time for it -- maybe some day. I'm afraid you'd have to count me out on this one: I don't speak C (yet), and don't foresee learning it soon enough to be of any help...
[Numpy-discussion] Python2.4 support
All,
* What versions of Python should be supported by what version of numpy ? Are we to expect users to rely on Python 2.5 for the upcoming 1.3.x ? Could we have some kind of timeline on the trac site or elsewhere (and if such a timeline exists already, can I get the link?) ?
* Talking about 1.3.x, what's the timeline? Are we still shooting for a release in 2008 or could we wait till mid Jan. 2009 ?
Thx a lot in advance
Re: [Numpy-discussion] Python2.4 support
On Dec 7, 2008, at 4:21 PM, Jarrod Millman wrote: NumPy 1.3.x should work with Python 2.4, 2.5, and 2.6. At some point we can drop 2.4, but I would like to wait a bit since we just dropped 2.3 support. The timeline is on the trac site: http://projects.scipy.org/scipy/numpy/milestone/1.3.0 OK, great, thanks a lot. * Talking about 1.3.x, what's the timeline? Are we still shooting for a release in 2008 or could we wait till mid Jan. 2009 ? I am fine with pushing the release back, if there is interest in doing that. I have been mainly focusing on getting SciPy 0.7.x out, so I haven't been following the NumPy development closely. But it is good that you are asking for more concrete details about the next NumPy release. We need to start making plans. Does anyone have any suggestions about whether we should push the release back? Is 1 month long enough? What is left to do? Well, on my side, there's some doc to be updated, of course. Then, I'd like to put in the rec_functions that were developed in matplotlib to manipulate record arrays. I haven't started yet, but I might be able to do so before the end of the year (not much to do, just a clean-up and some examples). And what should we do with the genloadtxt function ? Please feel free to update the release notes, which are checked into the trunk: http://scipy.org/scipy/numpy/browser/trunk/doc/release/1.3.0-notes.rst Will do in good time. Thx again
[Numpy-discussion] genloadtxt : last call
All, Here's the latest version of genloadtxt, with some recent corrections. With just a couple of tweaks, we end up with some decent speed: it's still slower than np.loadtxt, but only by 15% or so according to the test at the end of the package. And so, now what ? Should I put the module in numpy.lib.io ? Elsewhere ? Thx for any comments and suggestions.

Proposal: Here's an extension to np.loadtxt, designed to take missing values into account.

    import itertools
    import numpy as np
    import numpy.ma as ma

    def _is_string_like(obj):
        """Check whether obj behaves like a string."""
        try:
            obj + ''
        except (TypeError, ValueError):
            return False
        return True

    def _to_filehandle(fname, flag='r', return_opened=False):
        """
        Returns the filehandle corresponding to a string or a file.
        If the string ends in '.gz', the file is automatically unzipped.

        Parameters
        ----------
        fname : string, filehandle
            Name of the file whose filehandle must be returned.
        flag : string, optional
            Flag indicating the status of the file ('r' for read, 'w' for write).
        return_opened : boolean, optional
            Whether to return the opening status of the file.
        """
        if _is_string_like(fname):
            if fname.endswith('.gz'):
                import gzip
                fhd = gzip.open(fname, flag)
            elif fname.endswith('.bz2'):
                import bz2
                fhd = bz2.BZ2File(fname)
            else:
                fhd = file(fname, flag)
            opened = True
        elif hasattr(fname, 'seek'):
            fhd = fname
            opened = False
        else:
            raise ValueError('fname must be a string or file handle')
        if return_opened:
            return fhd, opened
        return fhd

    def flatten_dtype(ndtype):
        """Unpack a structured data-type."""
        names = ndtype.names
        if names is None:
            return [ndtype]
        else:
            types = []
            for field in names:
                (typ, _) = ndtype.fields[field]
                flat_dt = flatten_dtype(typ)
                types.extend(flat_dt)
            return types

    def nested_masktype(datatype):
        """Construct the dtype of a mask for nested elements."""
        names = datatype.names
        if names:
            descr = []
            for name in names:
                (ndtype, _) = datatype.fields[name]
                descr.append((name, nested_masktype(ndtype)))
            return descr
        # Is this some kind of composite a la (np.float,2)
        elif datatype.subdtype:
            mdescr = list(datatype.subdtype)
            mdescr[0] = np.dtype(bool)
            return tuple(mdescr)
        else:
            return np.bool

    class LineSplitter:
        """
        Defines a function to split a string at a given delimiter or at given places.

        Parameters
        ----------
        comment : {'#', string}
            Character used to mark the beginning of a comment.
        delimiter : var, optional
            If a string, character used to delimit consecutive fields.
            If an integer or a sequence of integers, width(s) of each field.
        autostrip : boolean, optional
            Whether to strip each individual field.
        """
        def autostrip(self, method):
            """Wrapper to strip each member of the output of `method`."""
            return lambda input: [_.strip() for _ in method(input)]
        #
        def __init__(self, delimiter=None, comments='#', autostrip=True):
            self.comments = comments
            # Delimiter is a character
            if (delimiter is None) or _is_string_like(delimiter):
                delimiter = delimiter or None
                _handyman = self._delimited_splitter
            # Delimiter is a list of field widths
            elif hasattr(delimiter, '__iter__'):
                _handyman = self._variablewidth_splitter
                idx = np.cumsum([0] + list(delimiter))
                delimiter = [slice(i, j) for (i, j) in zip(idx[:-1], idx[1:])]
            # Delimiter is a single integer
            elif int(delimiter):
                (_handyman, delimiter) = (self._fixedwidth_splitter, int(delimiter))
            else:
                (_handyman, delimiter) = (self._delimited_splitter, None)
            self.delimiter = delimiter
            if autostrip:
                self._handyman = self.autostrip(_handyman)
            else:
                self._handyman = _handyman
        #
        def _delimited_splitter(self, line):
            line = line.split(self.comments)[0].strip()
            if not line:
                return []
            return line.split(self.delimiter)
        #
        def _fixedwidth_splitter(self, line):
            line = line.split(self.comments)[0]
            if not line:
                return []
            fixed = self.delimiter
            slices = [slice(i, i + fixed) for i in range(len(line))[::fixed]]
            return [line[s] for s in slices]
        #
        def _variablewidth_splitter(self, line):
            line = line.split(self.comments)[0]
            if not line:
                return []
            slices = self.delimiter
            return [line[s] for s in slices]
        #
        def __call__(self, line):
            return self._handyman(line)

    class
[Numpy-discussion] genloadtxt: second serving
All, Here's the second round of genloadtxt. That's a tad cleaner version than the previous one, where I tried to take into account the different comments and suggestions that were posted. So, tabs should be supported and explicit whitespaces are not collapsed. FYI, in the __main__ section, you'll find 2 hotshot tests and a timeit comparison: same input, no missing data, one with genloadtxt, one with np.loadtxt and a last one with matplotlib.mlab.csv2rec. As you'll see, genloadtxt is roughly twice as slow as np.loadtxt, but twice as fast as csv2rec. One of the explanations for the slowness is indeed the use of classes for splitting lines and converting values. Instead of a basic function, we use the __call__ method of the class, which itself calls another function depending on the attribute values. I'd like to reduce this overhead; any suggestion is more than welcome, as usual. Anyhow: as we do need speed, I suggest we put genloadtxt somewhere in numpy.ma, with an alias recfromcsv for John, using his defaults. Unless somebody comes up with a brilliant optimization. Let me know how it goes, Cheers, P.

Proposal: Here's an extension to np.loadtxt, designed to take missing values into account.

    import itertools
    import numpy as np
    import numpy.ma as ma

    def _is_string_like(obj):
        """Check whether obj behaves like a string."""
        try:
            obj + ''
        except (TypeError, ValueError):
            return False
        return True

    def _to_filehandle(fname, flag='r', return_opened=False):
        """
        Returns the filehandle corresponding to a string or a file.
        If the string ends in '.gz', the file is automatically unzipped.

        Parameters
        ----------
        fname : string, filehandle
            Name of the file whose filehandle must be returned.
        flag : string, optional
            Flag indicating the status of the file ('r' for read, 'w' for write).
        return_opened : boolean, optional
            Whether to return the opening status of the file.
        """
        if _is_string_like(fname):
            if fname.endswith('.gz'):
                import gzip
                fhd = gzip.open(fname, flag)
            elif fname.endswith('.bz2'):
                import bz2
                fhd = bz2.BZ2File(fname)
            else:
                fhd = file(fname, flag)
            opened = True
        elif hasattr(fname, 'seek'):
            fhd = fname
            opened = False
        else:
            raise ValueError('fname must be a string or file handle')
        if return_opened:
            return fhd, opened
        return fhd

    def flatten_dtype(ndtype):
        """Unpack a structured data-type."""
        names = ndtype.names
        if names is None:
            return [ndtype]
        else:
            types = []
            for field in names:
                (typ, _) = ndtype.fields[field]
                flat_dt = flatten_dtype(typ)
                types.extend(flat_dt)
            return types

    def nested_masktype(datatype):
        """Construct the dtype of a mask for nested elements."""
        names = datatype.names
        if names:
            descr = []
            for name in names:
                (ndtype, _) = datatype.fields[name]
                descr.append((name, nested_masktype(ndtype)))
            return descr
        # Is this some kind of composite a la (np.float,2)
        elif datatype.subdtype:
            mdescr = list(datatype.subdtype)
            mdescr[0] = np.dtype(bool)
            return tuple(mdescr)
        else:
            return np.bool

    class LineSplitter:
        """
        Defines a function to split a string at a given delimiter or at given places.

        Parameters
        ----------
        comment : {'#', string}
            Character used to mark the beginning of a comment.
        delimiter : var, optional
            If a string, character used to delimit consecutive fields.
            If an integer or a sequence of integers, width(s) of each field.
        autostrip : boolean, optional
            Whether to strip each individual field.
        """
        def autostrip(self, method):
            """Wrapper to strip each member of the output of `method`."""
            return lambda input: [_.strip() for _ in method(input)]
        #
        def __init__(self, delimiter=None, comments='#', autostrip=True):
            self.comments = comments
            # Delimiter is a character
            if (delimiter is None) or _is_string_like(delimiter):
                delimiter = delimiter or None
                _called = self._delimited_splitter
            # Delimiter is a list of field widths
            elif hasattr(delimiter, '__iter__'):
                _called = self._variablewidth_splitter
                idx = np.cumsum([0] + list(delimiter))
                delimiter = [slice(i, j) for (i, j) in zip(idx[:-1], idx[1:])]
            # Delimiter is a single integer
            elif int(delimiter):
                (_called, delimiter) = (self._fixedwidth_splitter, int(delimiter))
            else:
                (_called, delimiter) = (self._delimited_splitter, None)
            self.delimiter = delimiter
            if autostrip:
                self._called = self.autostrip(_called)
            else:
[Numpy-discussion] genloadtxt: second serving (tests)
And now for the tests:

    # pylint disable-msg=E1101, W0212, W0621
    import numpy as np
    import numpy.ma as ma
    from numpy.ma.testutils import *
    from StringIO import StringIO

    from _preview import *

    class TestLineSplitter(TestCase):
        "Tests the LineSplitter class."
        #
        def test_no_delimiter(self):
            "Test LineSplitter w/o delimiter"
            strg = " 1 2 3 4  5 # test"
            test = LineSplitter()(strg)
            assert_equal(test, ['1', '2', '3', '4', '5'])
            test = LineSplitter('')(strg)
            assert_equal(test, ['1', '2', '3', '4', '5'])

        def test_space_delimiter(self):
            "Test space delimiter"
            strg = " 1 2 3 4  5 # test"
            test = LineSplitter(' ')(strg)
            assert_equal(test, ['1', '2', '3', '4', '', '5'])
            test = LineSplitter('  ')(strg)
            assert_equal(test, ['1 2 3 4', '5'])

        def test_tab_delimiter(self):
            "Test tab delimiter"
            strg = " 1\t 2\t 3\t 4\t 5 6"
            test = LineSplitter('\t')(strg)
            assert_equal(test, ['1', '2', '3', '4', '5 6'])
            strg = " 1 2\t 3 4\t 5 6"
            test = LineSplitter('\t')(strg)
            assert_equal(test, ['1 2', '3 4', '5 6'])

        def test_other_delimiter(self):
            "Test LineSplitter on delimiter"
            strg = "1,2,3,4,,5"
            test = LineSplitter(',')(strg)
            assert_equal(test, ['1', '2', '3', '4', '', '5'])
            #
            strg = " 1,2,3,4,,5 # test"
            test = LineSplitter(',')(strg)
            assert_equal(test, ['1', '2', '3', '4', '', '5'])

        def test_constant_fixed_width(self):
            "Test LineSplitter w/ fixed-width fields"
            strg = "  1  2  3  4     5   # test"
            test = LineSplitter(3)(strg)
            assert_equal(test, ['1', '2', '3', '4', '', '5', ''])
            #
            strg = "  1     3  4  5  6# test"
            test = LineSplitter(20)(strg)
            assert_equal(test, ['1     3  4  5  6'])
            #
            strg = "  1     3  4  5  6# test"
            test = LineSplitter(30)(strg)
            assert_equal(test, ['1     3  4  5  6'])

        def test_variable_fixed_width(self):
            strg = "  1     3  4  5  6# test"
            test = LineSplitter((3, 6, 6, 3))(strg)
            assert_equal(test, ['1', '3', '4  5', '6'])
            #
            strg = "  1     3  4  5  6# test"
            test = LineSplitter((6, 6, 9))(strg)
            assert_equal(test, ['1', '3  4', '5  6'])

    #---

    class TestNameValidator(TestCase):
        #
        def test_case_sensitivity(self):
            "Test case sensitivity"
            names = ['A', 'a', 'b', 'c']
            test = NameValidator().validate(names)
            assert_equal(test, ['A', 'a', 'b', 'c'])
            test = NameValidator(case_sensitive=False).validate(names)
            assert_equal(test, ['A', 'A_1', 'B', 'C'])
        #
        def test_excludelist(self):
            "Test excludelist"
            names = ['dates', 'data', 'Other Data', 'mask']
            validator = NameValidator(excludelist=['dates', 'data', 'mask'])
            test = validator.validate(names)
            assert_equal(test, ['dates_', 'data_', 'Other_Data', 'mask_'])

    #---

    class TestStringConverter(TestCase):
        "Test StringConverter"
        #
        def test_creation(self):
            "Test creation of a StringConverter"
            converter = StringConverter(int, -9)
            assert_equal(converter._status, 1)
            assert_equal(converter.default, -9)
        #
        def test_upgrade(self):
            "Tests the upgrade method."
            converter = StringConverter()
            assert_equal(converter._status, 0)
            converter.upgrade('0')
            assert_equal(converter._status, 1)
            converter.upgrade('0.')
            assert_equal(converter._status, 2)
            converter.upgrade('0j')
            assert_equal(converter._status, 3)
            converter.upgrade('a')
            assert_equal(converter._status, len(converter._mapper) - 1)
        #
        def test_missing(self):
            "Tests the use of missing values."
            converter = StringConverter(missing_values=('missing', 'missed'))
            converter.upgrade('0')
            assert_equal(converter('0'), 0)
            assert_equal(converter(''), converter.default)
            assert_equal(converter('missing'), converter.default)
            assert_equal(converter('missed'), converter.default)
            try:
                converter('miss')
            except ValueError:
                pass
        #
        def test_upgrademapper(self):
            "Tests updatemapper"
            import dateutil.parser
            import datetime
            dateparser = dateutil.parser.parse
            StringConverter.upgrade_mapper(dateparser, datetime.date(2000, 1, 1))
            convert = StringConverter(dateparser, datetime.date(2000, 1, 1))
            test = convert('2001-01-01')
            assert_equal(test, datetime.datetime(2001, 01, 01, 00, 00, 00))

    #---

    class TestLoadTxt(TestCase):
        #
        def test_record(self):
            "Test w/ explicit
Re: [Numpy-discussion] genloadtxt: second serving
On Dec 4, 2008, at 7:22 AM, Manuel Metz wrote: Will loadtxt in that case remain as is? Or will the _faulttolerantconv class be used? No idea, we need to discuss it. There's a problem with _faulttolerantconv: using np.nan as default value will not work in Python 2.6 if the output is to be int, as an exception will be raised. Therefore, we'd need to change the default to something else when defining _faulttolerantconv. The easiest would be to define a class and set the argument at instantiation, but then we're getting dangerously close to StringConverter again...
Re: [Numpy-discussion] int(np.nan) on python 2.6
On Nov 25, 2008, at 12:23 PM, Pierre GM wrote: All, Sorry to bump my own post, and I was kinda threadjacking anyway: Some functions of numpy.ma (e.g., ma.max, ma.min...) accept explicit outputs that may not be MaskedArrays. When such an explicit output is not a MaskedArray, a value that should have been masked is transformed into np.nan. That worked great in 2.5, with np.nan automatically transformed to 0 when the explicit output had an int dtype. With Python 2.6, a ValueError is raised instead, as np.nan can no longer be cast to int. What should be the recommended behavior in this case ? Raise a ValueError or some other exception, to follow the new Python 2.6 convention, or silently replace np.nan by some value acceptable to an int dtype (0, or something else) ? Second bump, sorry. Any consensus on what the behavior should be ? Raise a ValueError (even in 2.5, therefore risking to break something) or just go with the flow and switch np.nan to an acceptable value (like 0) under the hood ? I'd like to close the corresponding ticket...
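The behavior change in question, reproducible today (Python 3 kept the 2.6 behavior):

```python
import numpy as np

# On Python <= 2.5, int(nan) silently returned 0; since 2.6 it raises.
try:
    int(np.nan)
    outcome = "silently converted"
except ValueError:
    outcome = "raised ValueError"
print(outcome)
```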
Re: [Numpy-discussion] int(np.nan) on python 2.6
On Dec 4, 2008, at 3:24 PM, [EMAIL PROTECTED] wrote: On Thu, Dec 4, 2008 at 2:40 PM, Jarrod Millman [EMAIL PROTECTED] wrote: On Thu, Dec 4, 2008 at 11:14 AM, Pierre GM [EMAIL PROTECTED] wrote: Raise a ValueError (even in 2.5, therefore risking to break something) +1 +1 OK then, I'll do that and update the SVN later tonight or early tmw...
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
On Dec 3, 2008, at 12:48 PM, Christopher Barker wrote: Pierre GM wrote: I can try, but in that case, please write me a unittest, so that I have a clear and unambiguous idea of what you expect. Fair enough, though I'm not sure when I'll have time to do it. Oh, don't worry, nothing too fancy: give me a couple lines of input data and a line with what you expect. Using Ryan's recent example:

f = StringIO('stid stnm relh tair\nnrmn 121 45 9.1')
test = loadtxt(f, usecols=('stid', 'relh', 'tair'), names=True, dtype=None)
control = array(('nrmn', 45, 9.0996),
                dtype=[('stid', '|S4'), ('relh', 'i8'), ('tair', 'f8')])

That's quite enough for a test. I do wonder if anyone else thinks it would be useful to have multiple delimiters as an option. I got the idea because with fromfile(), if you specify, say, ',' as the delimiter, it won't use '\n', only a comma, so there is no way to quickly read a whole bunch of comma-delimited data like: 1,2,3,4 5,6,7,8 so I'd like to be able to say to use either ',' or '\n' as the delimiter. I'm not quite sure I follow you. Do you want two delimiters, one for the fields of a record (','), one for the records ('\n')? However, if I understand loadtxt() correctly, it's handling the newlines separately anyway (to get a 2-D array), so this use case isn't an issue. So how likely is it that someone would have: 1 2 3, 4, 5 6 7 8, 8, 9 and want to read that into a single 2-D array? With the current behaviour, you're going to get [('1 2 3', '4', '5'), ('6 7 8', '8', '9')] if you use ',' as a delimiter, and [('1','2','3,','4,','5'), ('6','7','8,','8,','9')] if you use ' ' as a delimiter. Mixing delimiters is doable, but I don't think it's that good an idea. I'm in favor of sticking to one and only one field delimiter, and the default line separator for the record delimiter. In other terms: not changing anything.
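Chris's use case (reading `1,2,3,4\n5,6,7,8` as one 2-D array) can be handled without multiple delimiters by folding the record separator into the field separator first. A sketch of that workaround, not a proposed API:

```python
import numpy as np

text = "1,2,3,4\n5,6,7,8"

# Fold the record delimiter ('\n') into the field delimiter (','),
# parse the flat list of fields, then reshape into rows of 4.
flat = np.array([int(x) for x in text.replace("\n", ",").split(",")])
grid = flat.reshape(-1, 4)
```

This only works when every record has the same number of fields, which is exactly the regular case fromfile() is aimed at.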
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
On Dec 3, 2008, at 12:32 PM, Alan G Isaac wrote: If I know my data is already clean and is handled nicely by the old loadtxt, will I be able to turn off the special handling in order to retain the old load speed? Hopefully. I'm looking for the best way to do it. Do you have an example you could send me off-list so that I can play with timers? Thx in advance. P.
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
On Dec 3, 2008, at 1:00 PM, Christopher Barker wrote: By the way, should this work: io.loadtxt('junk.dat', delimiter=' ') for more than one space between numbers, like: 1 2 3 4 5 6 7 8 9 10 On the version I'm working on, both delimiter='' and delimiter=None (the default) would give you the expected output; delimiter=' ' (a single space) would fail, while a delimiter matching the actual run of spaces would work.
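The behaviour Pierre describes mirrors plain `str.split`, which is presumably what the splitter delegates to: `None` splits on any run of whitespace, while an explicit single space produces empty fields between consecutive spaces:

```python
line = "1  2   3"

# delimiter=None behaviour: any run of whitespace is one separator.
whitespace_split = line.split()

# delimiter=' ' behaviour: every single space is a separator, so runs
# of spaces yield empty fields in between.
single_space_split = line.split(" ")
```

That is why a literal single-space delimiter fails on multi-space-separated columns while the whitespace default succeeds.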
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
Manuel, Looks nice; I'm going to see how I can incorporate yours. Note that returning np.nan by default will not work w/ Python 2.6 if you want an int...
Re: [Numpy-discussion] bug in ma.masked_all()?
Eric, That's quite a handful you have with this dtype... So yes, the fix I gave works with nested dtypes and with flexible dtypes whose names are simple (a string, not a tuple). I'm a bit surprised by numpy here. Consider: dt.names ('P', 'D', 'T', 'w', 'S', 'sigtheta', 'theta') So we lose the tuple and get a single string instead, corresponding to the right-hand element of the name... But this single string is one of the keys of dt.fields, whereas the tuple is not. Puzzling. I'm sure there must be some reference in the numpy book, but I can't look for it now. Anyway: prior to revision 6127, make_mask_descr was substituting the 2nd element of each tuple of a dtype.descr with a bool, which failed for nested dtypes. Now, we check the field corresponding to a name, which fails in our particular case. I'll be working on it... On Dec 2, 2008, at 1:59 AM, Eric Firing wrote:

dt = np.dtype([((' Pressure, Digiquartz [db]', 'P'), 'f4'),
               ((' Depth [salt water, m]', 'D'), 'f4'),
               ((' Temperature [ITS-90, deg C]', 'T'), 'f4'),
               ((' Descent Rate [m/s]', 'w'), 'f4'),
               ((' Salinity [PSU]', 'S'), 'f4'),
               ((' Density [sigma-theta, Kg/m^3]', 'sigtheta'), 'f4'),
               ((' Potential Temperature [ITS-90, deg C]', 'theta'), 'f4')])
np.ma.zeros((2,2), dt)
Re: [Numpy-discussion] bug in ma.masked_all()?
On Dec 2, 2008, at 4:26 AM, Eric Firing wrote: From page 132 in the numpy book: The fields dictionary is indexed by keys that are the names of the fields. Each entry in the dictionary is a tuple fully describing the field: (dtype, offset[, title]). If present, the optional title can actually be any object (if it is a string or unicode then it will also be a key in the fields dictionary, otherwise it's meta-data). I should read it more often... I put the titles in as a sort of additional documentation, and thinking that they might be useful for labeling plots; That's actually quite a good idea... but it is rather hard to get the titles back out, since they are not directly accessible as an attribute, like names. Probably I should just omit them. We could perhaps try a function:

def gettitle(dtype, name):
    try:
        field = dtype.fields[name]
    except (TypeError, KeyError):
        return None
    else:
        if len(field) > 2:
            return field[-1]
        return None
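A quick self-contained demonstration of the titles machinery: the dtype below is invented for illustration, and the helper follows the `gettitle` sketch from the message (with an explicit `>` comparison):

```python
import numpy as np

# (title, name) tuples attach a title to each field.
dt = np.dtype([(("Pressure [db]", "P"), "f4"),
               (("Temperature [deg C]", "T"), "f4")])

def gettitle(dtype, name):
    """Return the optional title attached to a field, or None."""
    try:
        field = dtype.fields[name]
    except (TypeError, KeyError):
        return None
    # fields[name] is (dtype, offset[, title]); a title makes it length 3.
    if len(field) > 2:
        return field[-1]
    return None

p_title = gettitle(dt, "P")
missing = gettitle(dt, "X")
```

Note that `dt.names` only reports the short names; the titles live in the `fields` tuples, which is exactly the asymmetry Pierre found puzzling.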
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
On Dec 2, 2008, at 3:12 PM, Ryan May wrote: Pierre GM wrote: Well, looks like the attachment is too big, so here's the implementation. The tests will come in another message. A couple of quick nitpicks: 1) On line 186 (in the NameValidator class), you use excludelist.append() to append a list to the end of a list. I think you meant to use excludelist.extend(). Good call. 2) When validating a list of names, why do you insist on lower-casing them? (I'm referring to the call to lower() on line 207.) On one hand, this would seem nicer than all upper case, but on the other hand this can cause confusion for someone who sees certain casing of names in the file and expects the data to be laid out the same. I recall a life where names were case-insensitive, so 'dates', 'Dates' and 'DATES' were the same field. It should be easy enough to get rid of that limitation, or add a parameter for case-sensitivity. On Dec 2, 2008, at 2:47 PM, Zachary Pincus wrote: Specifically, on line 115 in LineSplitter, we have: self.delimiter = delimiter.strip() or None so if I pass in, say, '\t' as the delimiter, self.delimiter gets set to None, which then causes the default behavior of any-whitespace-is-delimiter to be used. This makes lines like Gene Name\tPubMed ID\tStarting Position get split wrong, even when I explicitly pass in '\t' as the delimiter! OK, I'll check that. I think that treating an explicitly-passed-in ' ' delimiter as identical to 'no delimiter' is a bad idea. If I say that ' ' is the delimiter, or '\t' is the delimiter, this should be treated *just* like ',' being the delimiter, where the expected output is: ['1', '2', '3', '4', '', '5'] Valid point. Well, all, stay tuned for yet another yet another implementation... Other than those, it's working fine for me here. Ryan
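Zachary's tab bug comes down to a falsy string. A sketch of the normalization he quotes from line 115, isolated so the failure mode is visible:

```python
# The normalization used in LineSplitter:
def normalize(delimiter):
    return delimiter.strip() or None

# '\t'.strip() is '', which is falsy, so an explicit tab silently
# becomes the any-whitespace default.
tab = normalize("\t")

line = "Gene Name\tPubMed ID\tStarting Position"
by_tab = line.split("\t")      # what the caller wanted: 3 fields
by_default = line.split(tab)   # what they get: split on all whitespace
```

Since spaces inside field names are also whitespace, the fallback splits the header into twice as many fields as intended.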
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
Chris, I can try, but in that case, please write me a unittest, so that I have a clear and unambiguous idea of what you expect. ANFSCD, have you tried the missing_values option? On Dec 2, 2008, at 5:36 PM, Christopher Barker wrote: Pierre GM wrote: I think that treating an explicitly-passed-in ' ' delimiter as identical to 'no delimiter' is a bad idea. If I say that ' ' is the delimiter, or '\t' is the delimiter, this should be treated *just* like ',' being the delimiter, where the expected output is: ['1', '2', '3', '4', '', '5'] Valid point. Well, all, stay tuned for yet another yet another implementation... While we're at it, it might be nice to be able to pass in more than one delimiter: ('\t', ' '). Though maybe the only combination that I'd really want would be something and '\n', which I think is being treated specially already. -Chris -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/ORR (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception [EMAIL PROTECTED]
[Numpy-discussion] np.loadtxt : yet a new implementation...
All, Please find attached to this message another implementation of np.loadtxt, which focuses on missing values. It's basically a combination of John Hunter et al.'s mlab.csv2rec, Ryan May's patches, and pieces of code I'd been working on over the last few weeks. Besides some helper classes (StringConverter to convert a string into something else, NameValidator to check names...), you'll find 3 functions:
* `genloadtxt` is the base function that does all the work. It outputs 2 arrays, one for the data (missing values being substituted by the appropriate default) and one for the mask. It would go in np.lib.io.
* `loadtxt` would replace the current np.loadtxt. It outputs an ndarray, with missing data filled in. It would also go in np.lib.io.
* `mloadtxt` would go into np.ma.io (to be created) and be renamed `loadtxt`. Right now, I needed a different name to avoid conflicts. It combines the outputs of `genloadtxt` into a single masked array.
You'll also find several series of tests that you can use as examples. Please give it a try and send me some feedback (bugs, wishes, suggestions). I'd like it to make the 1.3.0 release (I need some of the functionalities to improve the corresponding function in scikits.timeseries, currently fubar...) P.
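For context, this genloadtxt proposal is the ancestor of today's np.genfromtxt. A sketch of the intended behaviour using the modern name (the data is invented; details of the original API may differ):

```python
import numpy as np
from io import StringIO

# The second row is missing its last field.
data = StringIO("1,2\n3,\n")

# usemask=True returns a masked array: missing entries are masked and
# filled with a chosen default in the underlying data.
arr = np.genfromtxt(data, delimiter=",", usemask=True, filling_values=-999)
```

This mirrors the proposal's split: the filled data array and the mask are the two outputs `genloadtxt` was designed to produce.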
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
And now for the tests:

"Proposal: Here's an extension to np.loadtxt, designed to take missing values into account."

from genload_proposal import *
from numpy.ma.testutils import *
import StringIO

class TestLineSplitter(TestCase):
    #
    def test_nodelimiter(self):
        "Test LineSplitter w/o delimiter"
        strg = "1 2 3 4 5 # test"
        test = LineSplitter(' ')(strg)
        assert_equal(test, ['1', '2', '3', '4', '5'])
        test = LineSplitter()(strg)
        assert_equal(test, ['1', '2', '3', '4', '5'])
    #
    def test_delimiter(self):
        "Test LineSplitter on delimiter"
        strg = "1,2,3,4,,5"
        test = LineSplitter(',')(strg)
        assert_equal(test, ['1', '2', '3', '4', '', '5'])
        #
        strg = "1,2,3,4,,5 # test"
        test = LineSplitter(',')(strg)
        assert_equal(test, ['1', '2', '3', '4', '', '5'])
        #
        strg = "1 2 3 4 5 # test"
        test = LineSplitter(' ')(strg)
        assert_equal(test, ['1', '2', '3', '4', '5'])
    #
    def test_fixedwidth(self):
        "Test LineSplitter w/ fixed-width fields"
        strg = "1 2 3 4 5 # test"
        test = LineSplitter(3)(strg)
        assert_equal(test, ['1', '2', '3', '4', '', '5', ''])
        #
        strg = "1 3 4 5 6# test"
        test = LineSplitter((3, 6, 6, 3))(strg)
        assert_equal(test, ['1', '3', '4 5', '6'])
        #
        strg = "1 3 4 5 6# test"
        test = LineSplitter((6, 6, 9))(strg)
        assert_equal(test, ['1', '3 4', '5 6'])
        #
        strg = "1 3 4 5 6# test"
        test = LineSplitter(20)(strg)
        assert_equal(test, ['1 3 4 5 6'])
        #
        strg = "1 3 4 5 6# test"
        test = LineSplitter(30)(strg)
        assert_equal(test, ['1 3 4 5 6'])

class TestStringConverter(TestCase):
    "Test StringConverter"
    #
    def test_creation(self):
        "Test creation of a StringConverter"
        converter = StringConverter(int, -9)
        assert_equal(converter._status, 1)
        assert_equal(converter.default, -9)
    #
    def test_upgrade(self):
        "Tests the upgrade method."
        converter = StringConverter()
        assert_equal(converter._status, 0)
        converter.upgrade('0')
        assert_equal(converter._status, 1)
        converter.upgrade('0.')
        assert_equal(converter._status, 2)
        converter.upgrade('0j')
        assert_equal(converter._status, 3)
        converter.upgrade('a')
        assert_equal(converter._status, len(converter._mapper)-1)
    #
    def test_missing(self):
        "Tests the use of missing values."
        converter = StringConverter(missing_values=('missing', 'missed'))
        converter.upgrade('0')
        assert_equal(converter('0'), 0)
        assert_equal(converter(''), converter.default)
        assert_equal(converter('missing'), converter.default)
        assert_equal(converter('missed'), converter.default)
        try:
            converter('miss')
        except ValueError:
            pass
    #
    def test_upgrademapper(self):
        "Tests updatemapper"
        import dateutil.parser
        import datetime
        dateparser = dateutil.parser.parse
        StringConverter.upgrade_mapper(dateparser, datetime.date(2000, 1, 1))
        convert = StringConverter(dateparser, datetime.date(2000, 1, 1))
        test = convert('2001-01-01')
        assert_equal(test, datetime.datetime(2001, 01, 01, 00, 00, 00))

class TestLoadTxt(TestCase):
    #
    def test_record(self):
        "Test w/ explicit dtype"
        data = StringIO.StringIO('1 2\n3 4')
        #data.seek(0)
        test = loadtxt(data, dtype=[('x', np.int32), ('y', np.int32)])
        control = np.array([(1, 2), (3, 4)], dtype=[('x', 'i4'), ('y', 'i4')])
        assert_equal(test, control)
        #
        data = StringIO.StringIO('M 64.0 75.0\nF 25.0 60.0')
        #data.seek(0)
        descriptor = {'names': ('gender', 'age', 'weight'),
                      'formats': ('S1', 'i4', 'f4')}
        control = np.array([('M', 64.0, 75.0), ('F', 25.0, 60.0)],
                           dtype=descriptor)
        test = loadtxt(data, dtype=descriptor)
        assert_equal(test, control)

    def test_array(self):
        "Test outputting a standard ndarray"
        data = StringIO.StringIO('1 2\n3 4')
        control = np.array([[1, 2], [3, 4]], dtype=int)
        test = loadtxt(data, dtype=int)
        assert_array_equal(test, control)
        #
        data.seek(0)
        control = np.array([[1, 2], [3, 4]], dtype=float)
        test = np.loadtxt(data, dtype=float)
        assert_array_equal(test, control)

    def test_1D(self):
        "Test squeezing to 1D"
        control = np.array([1, 2, 3, 4], int)
        #
        data = StringIO.StringIO('1\n2\n3\n4\n')
        test = loadtxt(data, dtype=int)
        assert_array_equal(test, control)
        #
        data = StringIO.StringIO('1,2,3,4\n')
        test = loadtxt(data, dtype=int, delimiter=',')
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
Well, looks like the attachment is too big, so here's the implementation. The tests will come in another message.

"Proposal: Here's an extension to np.loadtxt, designed to take missing values into account."

import itertools
import numpy as np
import numpy.ma as ma

def _is_string_like(obj):
    "Check whether obj behaves like a string."
    try:
        obj + ''
    except (TypeError, ValueError):
        return False
    return True

def _to_filehandle(fname, flag='r', return_opened=False):
    """
    Returns the filehandle corresponding to a string or a file.
    If the string ends in '.gz', the file is automatically unzipped.

    Parameters
    ----------
    fname : string, filehandle
        Name of the file whose filehandle must be returned.
    flag : string, optional
        Flag indicating the status of the file ('r' for read, 'w' for write).
    return_opened : boolean, optional
        Whether to return the opening status of the file.
    """
    if _is_string_like(fname):
        if fname.endswith('.gz'):
            import gzip
            fhd = gzip.open(fname, flag)
        elif fname.endswith('.bz2'):
            import bz2
            fhd = bz2.BZ2File(fname)
        else:
            fhd = file(fname, flag)
        opened = True
    elif hasattr(fname, 'seek'):
        fhd = fname
        opened = False
    else:
        raise ValueError('fname must be a string or file handle')
    if return_opened:
        return fhd, opened
    return fhd

def flatten_dtype(ndtype):
    "Unpack a structured data-type."
    names = ndtype.names
    if names is None:
        return [ndtype]
    else:
        types = []
        for field in names:
            (typ, _) = ndtype.fields[field]
            flat_dt = flatten_dtype(typ)
            types.extend(flat_dt)
        return types

def nested_masktype(datatype):
    "Construct the dtype of a mask for nested elements."
    names = datatype.names
    if names:
        descr = []
        for name in names:
            (ndtype, _) = datatype.fields[name]
            descr.append((name, nested_masktype(ndtype)))
        return descr
    # Is this some kind of composite a la (np.float, 2)
    elif datatype.subdtype:
        mdescr = list(datatype.subdtype)
        mdescr[0] = np.dtype(bool)
        return tuple(mdescr)
    else:
        return np.bool

class LineSplitter:
    """
    Defines a function to split a string at a given delimiter or at given places.

    Parameters
    ----------
    comment : {'#', string}
        Character used to mark the beginning of a comment.
    delimiter : var
    """
    def __init__(self, delimiter=None, comments='#'):
        self.comments = comments
        # Delimiter is a character
        if delimiter is None:
            self._isfixed = False
            self.delimiter = None
        elif _is_string_like(delimiter):
            self._isfixed = False
            self.delimiter = delimiter.strip() or None
        # Delimiter is a list of field widths
        elif hasattr(delimiter, '__iter__'):
            self._isfixed = True
            idx = np.cumsum([0] + list(delimiter))
            self.slices = [slice(i, j) for (i, j) in zip(idx[:-1], idx[1:])]
        # Delimiter is a single integer
        elif int(delimiter):
            self._isfixed = True
            self.slices = None
            self.delimiter = delimiter
        else:
            self._isfixed = False
            self.delimiter = None
    #
    def __call__(self, line):
        "Splits the line at each current delimiter. Comments are stripped beforehand."
        # Strip the comments
        line = line.split(self.comments)[0]
        if not line:
            return []
        # Fixed-width fields
        if self._isfixed:
            # Fields have different widths
            if self.slices is None:
                fixed = self.delimiter
                slices = [slice(i, i + fixed)
                          for i in range(len(line))[::fixed]]
            else:
                slices = self.slices
            return [line[s].strip() for s in slices]
        else:
            return [s.strip() for s in line.split(self.delimiter)]

class NameValidator:
    """
    Validates a list of strings to use as field names.
    The strings are stripped of any non alphanumeric character, and spaces
    are replaced by `_`. During instantiation, the user can define a list of
    names to exclude, as well as a list of invalid characters. Names in the
    exclude list are appended a '_' character. Once an instance has been
    created, it can be called with a list of names and a list of valid names
    will be created. The `__call__` method accepts an optional keyword,
    `default`, that sets the default name in case of ambiguity. By default,
    `default = 'f'`, so that names will default to `f0`, `f1`

    Parameters
    ----------
    excludelist : sequence, optional
        A list of names to
[Numpy-discussion] Fwd: np.loadtxt : yet a new implementation...
(Sorry about that, I pressed Reply instead of Reply all. Not my day for emails...) On Dec 1, 2008, at 1:54 PM, John Hunter wrote: It looks like I am doing something wrong -- trying to parse a CSV file with dates formatted like '2008-10-14', with::

import datetime, sys
import dateutil.parser
StringConverter.upgrade_mapper(dateutil.parser.parse,
                               default=datetime.date(1900,1,1))
r = loadtxt(sys.argv[1], delimiter=',', names=True)

John, The problem you have is that the default dtype is 'float' (for backwards compatibility w/ the original np.loadtxt). What you want is to automatically change the dtype according to the content of your file: you should use dtype=None:

r = loadtxt(sys.argv[1], delimiter=',', names=True, dtype=None)

As you'll want a recarray, we could make a np.records.loadtxt function where dtype=None would be the default...
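Pierre's dtype=None advice can be demonstrated with np.genfromtxt, the function this proposal eventually became (the CSV content below is invented for the example):

```python
import numpy as np
from io import StringIO

csv = StringIO("date,value\n2008-10-14,1.5\n2008-10-15,2.5\n")

# dtype=None asks the reader to infer each column's type from its
# content instead of forcing the float default; names=True takes the
# field names from the header line.
r = np.genfromtxt(csv, delimiter=",", names=True, dtype=None,
                  encoding="utf-8")
```

With the float default, the date column would have come back as NaN; with dtype=None it is kept as a string column alongside the numeric one.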
Re: [Numpy-discussion] Fwd: np.loadtxt : yet a new implementation...
On Dec 1, 2008, at 2:26 PM, John Hunter wrote: OK, that worked great. I do think a default impl in np.rec which returned a recarray would be nice. It might also be nice to have a method like np.rec.fromcsv which defaults to delimiter=',', names=True and dtype=None. Since csv is one of the most common data interchange formats in the world, it would be nice to have some obvious function that works with it with little or no customization required. Quite agreed. Personally, I'd ditch the default dtype=float in favor of dtype=None, but compatibility is an issue. However, if we all agree on genloadtxt, we can use tailor-made versions in different modules, like you suggest. There's an extra issue for which we have a solution I'm not completely satisfied with: names=True. It might be simpler for basic users not to set names=True, and have the first line recognized as a header when needed (by processing the first line after the others, and using it as a header if it's found to be a list of names, or inserting it back at the beginning otherwise)...
Re: [Numpy-discussion] np.loadtxt : yet a new implementation...
I agree, genloadtxt is a bit bloated, and it's no surprise it's slower than the initial one. I think that in order to be fair, comparisons must be performed with matplotlib.mlab.csv2rec, which also implements autodetection of the dtype. I'm quite in favor of keeping a lite version around. On Dec 1, 2008, at 4:47 PM, Stéfan van der Walt wrote: I haven't investigated the code in too much detail, but wouldn't it be possible to implement the current set of functionality in a base class, which is then specialised to add the rest? That way, one could always instantiate TextReader yourself for some added speed. Well, one of the issues is that we need to keep the function compatible w/ urllib.urlretrieve (Ryan, am I right?), which means not being able to go back to the beginning of a file (no call to .seek). Another issue comes from the possibility to define the dtype automatically: you need to keep track of the converters, then do a second loop on the data. Those converters are likely the bottleneck, as you need to check whether each value can be interpreted as missing or not and respond appropriately. I thought about creating a base class, with a specific subclass taking care of the missing values, but I found out it would have duplicated a lot of code. In any case, I think that's secondary: we can always optimize pieces of the code afterwards. I'd like more feedback on corner cases and usage...
Re: [Numpy-discussion] bug in ma.masked_all()?
On Dec 1, 2008, at 6:09 PM, Eric Firing wrote: Pierre, ma.masked_all does not seem to work with fancy dtypes and more than one dimension: Eric, Should be fixed in SVN (r6130). There were indeed problems with nested dtypes. Tricky beasts they are. Thanks for reporting!
Re: [Numpy-discussion] More loadtxt() changes
Manuel, Give me the week-end to come up with something. What you want is already doable with the current implementation of np.loadtxt, through the converter keyword. Support for missing data will be covered in a separate function, most likely to be put in numpy.ma.io in time. On Nov 28, 2008, at 5:42 AM, Manuel Metz wrote: Pierre GM wrote: On Nov 27, 2008, at 3:08 AM, Manuel Metz wrote: Certainly, yes! Dealing with fixed-length fields would be necessary. The case I had in mind had both -- a separator (|) __and__ fixed-length fields -- and is probably very special in that sense. But such data files exist out there... Well, if you have a non-space delimiter, it doesn't matter if the fields have a fixed length or not, does it? Each field is stripped anyway. Yes. It would already be _very_ helpful (without changing loadtxt too much) if the current implementation used a converter like this:

def fval(val):
    try:
        return float(val)
    except:
        return numpy.nan

instead of float(val) by default. mm The real issue is when the delimiter is ' '... I should be able to take care of that over the week-end (which started earlier today over here :)
Re: [Numpy-discussion] More loadtxt() changes
On Nov 27, 2008, at 3:08 AM, Manuel Metz wrote: Certainly, yes! Dealing with fixed-length fields would be necessary. The case I had in mind had both -- a separator (|) __and__ fixed-length fields -- and is probably very special in that sense. But such data files exist out there... Well, if you have a non-space delimiter, it doesn't matter if the fields have a fixed length or not, does it? Each field is stripped anyway. The real issue is when the delimiter is ' '... I should be able to take care of that over the week-end (which started earlier today over here :)
Re: [Numpy-discussion] More loadtxt() changes
On Nov 26, 2008, at 5:55 PM, Ryan May wrote: Manuel Metz wrote: Ryan May wrote: 3) Better support for missing values. The docstring mentions a way of handling missing values by passing in a converter. The problem with this is that you have to pass in a converter for *every column* that will contain missing values. If you have a text file with 50 columns, writing this dictionary of converters seems like ugly and needless boilerplate. I'm unsure of how best to pass in both what values indicate missing values and what values to fill in their place. I'd love suggestions. Hi Ryan, this would be a great feature to have!!! About missing values:
* I don't think missing values should be supported in np.loadtxt. That should go into a specific np.ma.io.loadtxt function, a preview of which I posted earlier. I'll modify it taking Ryan's new function into account, and Christopher's suggestion (defining a dictionary {column name: missing values}).
* StringConverter already defines some default filling values for each dtype. In np.ma.io.loadtxt, these values can be overwritten. Note that you should also be able to define a filling value by specifying a converter (think float(x or 0), for example).
* Missing values in space-separated fields are very tricky to handle: take a line like "a,,,d". With a comma as separator, it's clear that the 2nd and 3rd fields are missing. Now, imagine that the commas are actually spaces ("a d"): 'd' is now seen as the 2nd field of a 2-field record, not as the 4th field of a 4-field record with 2 missing values. I thought about it, and kicked it into touch.
* That said, there should be a way to deal with fixed-length fields, probably by taking consecutive slices of the initial string. That way, we should be able to keep track of missing data...
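The float(x or 0) converter idea mentioned above, written out as a per-field helper (an illustration only; the real StringConverter defaults are richer):

```python
def fill_missing(field, default=0.0):
    """Converter in the spirit of float(x or 0): empty fields map to a
    default instead of raising ValueError."""
    field = field.strip()
    return float(field) if field else default

# A comma-delimited line with a missing middle field:
row = [fill_missing(f) for f in "1,,3.5".split(",")]
```

Passed per column via the converters keyword, this is exactly the boilerplate Ryan would like the missing-values machinery to absorb.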
[Numpy-discussion] What happened to numpy-docs ?
All, I'd like to update routines.ma.rst on the numpy/numpy-docs/trunk SVN, but the whole trunk seems to be MIA... Where has it gone ? How can I (where should I) commit changes ? Thx in advance. P.
Re: [Numpy-discussion] What happened to numpy-docs ?
On Nov 27, 2008, at 12:32 AM, Robert Kern wrote: On Wed, Nov 26, 2008 at 23:27, Pierre GM [EMAIL PROTECTED] wrote: All, I'd like to update routines.ma.rst on the numpy/numpy-docs/trunk SVN, but the whole trunk seems to be MIA... Where has it gone ? How can I (where should I) commit changes ? It got moved into the numpy trunk under docs/. Duh... Guess I fell right at the time of the change. Robert, thx a lot! Pauli, do you think you could put your numpyext in the doc/ directory as well ? Cheers, P.
Re: [Numpy-discussion] What happened to numpy-docs ?
On Nov 27, 2008, at 1:39 AM, Scott Sinclair wrote: Looking at some recent changes made to docstrings in SVN by Pierre (r6110 r6111), these are not yet reflected in the doc wiki. Well, I haven't committed my version yet. I'm polishing a couple of issues with functions that are not recognized as such by inspect (because they're actually instances of a factory class).
Re: [Numpy-discussion] Numpy on Mac OS X python 2.6
FYI, I can't reproduce David's failures on my machine (Intel Core 2 Duo w/ 10.5.5):
* python 2.6 from macports
* numpy svn 6098
* GCC 4.0.1 (Apple Inc. build 5488)
I have only 1 failure:

FAIL: test_umath.TestComplexFunctions.test_against_cmath
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/opt/local/lib/python2.6/site-packages/nose-0.10.4-py2.6.egg/nose/case.py", line 182, in runTest
    self.test(*self.arg)
  File "/Users/pierregm/Computing/.pythonenvs/default26/lib/python2.6/site-packages/numpy/core/tests/test_umath.py", line 423, in test_against_cmath
    assert abs(a - b) < atol, "%s %s: %s; cmath: %s" % (fname, p, a, b)
AssertionError: arcsin 2: (1.57079632679-1.31695789692j); cmath: (1.57079632679+1.31695789692j)
----------------------------------------------------------------------

(Well, there's another one in numpy.ma.min, but that's a different matter.) On Nov 25, 2008, at 2:19 AM, David Cournapeau wrote: On Mon, 2008-11-24 at 22:06 -0700, Charles R Harris wrote: Well, it may not be that easy to figure. The (generated) pyconfig-32.h has:

/* Define to 1 if your processor stores words with the most significant
   byte first (like Motorola and SPARC, unlike Intel and VAX).
   The block below does compile-time checking for endianness on platforms
   that use GCC and therefore allows compiling fat binaries on OSX by
   using '-arch ppc -arch i386' as the compile flags. The phrasing was
   choosen such that the configure-result is used on systems that don't
   use GCC. */
#ifdef __BIG_ENDIAN__
#define WORDS_BIGENDIAN 1
#else
#ifndef __LITTLE_ENDIAN__
/* #undef WORDS_BIGENDIAN */
#endif
#endif

Hm, interesting: just by grepping, I do have WORDS_BIGENDIAN defined to 1 on *both* python 2.5 and python 2.6 on Mac OS X (running Intel). Looking closer, I do have the above code (conditional) in 2.5, but not in 2.6: it is unconditionally defined to BIGENDIAN on 2.6!!
That's actually part of something I have wondered for quite some time about fat binaries: how do you handle config headers, since they are generated only once for every fat binary, but they should really be generated for each arch. And I guess that __BIG_ENDIAN__ is a compiler flag, it isn't in any of the include files. In any case, this looks like a Python bug or the Python folks have switched their API on us. Hm, actually, it is a bug in numpy as much as in python: python should NOT include any config.h in their public namespace, and we should not rely on it. But with this info, it should be relatively easy to fix (by setting the correct endianness by ourselves with some detection code) David
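David's suggested fix (detect endianness ourselves rather than trusting pyconfig.h) can be sketched at the Python level; the C-level detection eventually added to numpy works on the same principle:

```python
import sys
import numpy as np

# Runtime byte-order check, independent of any compile-time config header.
native_is_little = sys.byteorder == "little"

# numpy dtypes expose byte order too: the native dtype '=i4' compares
# equal to the explicit little-endian '<i4' exactly on little-endian
# platforms.
dtype_is_little = np.dtype("<i4") == np.dtype("=i4")
```

Either check gives the true byte order of the running interpreter, regardless of what WORDS_BIGENDIAN was baked into the fat binary's single generated header.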
Re: [Numpy-discussion] More loadtxt() changes
Ryan,
FYI, I've been coding over the last couple of weeks an extension of loadtxt for better support of masked data, with the option to read column names from a header. Please find an example below (I also have unittests). Most of the work is actually inspired by matplotlib's mlab.csv2rec. It might be worth not duplicating efforts.
Cheers,
P.

"""
:mod:`_preview`
===============

A collection of utilities from incoming versions of numpy.ma
"""
import itertools
import numpy as np
import numpy.ma as ma

_string_like = np.lib.io._string_like


def _to_filehandle(fname, flag='r', return_opened=False):
    """
    Returns the filehandle corresponding to a string or a file.
    If the string ends in '.gz', the file is automatically unzipped.

    Parameters
    ----------
    fname : string, filehandle
        Name of the file whose filehandle must be returned.
    flag : string, optional
        Flag indicating the status of the file ('r' for read, 'w' for write).
    return_opened : boolean, optional
        Whether to return the opening status of the file.
    """
    if _string_like(fname):
        if fname.endswith('.gz'):
            import gzip
            fhd = gzip.open(fname, flag)
        else:
            fhd = file(fname, flag)
        opened = True
    elif hasattr(fname, 'seek'):
        fhd = fname
        opened = False
    else:
        raise ValueError('fname must be a string or file handle')
    if return_opened:
        return fhd, opened
    return fhd


def flatten_dtype(dtp):
    """Unpack a structured data-type."""
    if dtp.names is None:
        return [dtp]
    else:
        types = []
        for field in dtp.names:
            (typ, _) = dtp.fields[field]
            flat_dt = flatten_dtype(typ)
            types.extend(flat_dt)
        return types


class LineReader:
    """
    File reader that automatically splits each line.
    This reader behaves like an iterator.

    Parameters
    ----------
    fhd : filehandle
        File handle of the underlying file.
    comment : string, optional
        The character used to indicate the start of a comment.
    delimiter : string, optional
        The string used to separate values. By default, this is any
        whitespace.
    """
    #
    def __init__(self, fhd, comment='#', delimiter=None):
        self.fh = fhd
        self.comment = comment
        self.delimiter = delimiter
        if delimiter == ' ':
            self.delimiter = None
    #
    def close(self):
        "Close the current reader."
        self.fh.close()
    #
    def seek(self, arg):
        """
        Moves to a new position in the file.

        See Also
        --------
        file.seek
        """
        self.fh.seek(arg)
    #
    def splitter(self, line):
        """
        Splits the line at each current delimiter.
        Comments are stripped beforehand.
        """
        line = line.split(self.comment)[0].strip()
        delimiter = self.delimiter
        if line:
            return line.split(delimiter)
        else:
            return []
    #
    def next(self):
        "Moves to the next line or raises :exc:`StopIteration`."
        return self.splitter(self.fh.next())
    #
    def __iter__(self):
        for line in self.fh:
            yield self.splitter(line)

    def readline(self):
        """
        Returns the next line of the file, split at the delimiter and
        stripped of comments.
        """
        return self.splitter(self.fh.readline())

    def skiprows(self, nbrows=1):
        "Skips `nbrows` rows of the file."
        for i in range(nbrows):
            self.fh.readline()

    def get_first_valid_row(self):
        """
        Returns the values in the first valid (uncommented and non-empty)
        line of the file.
        """
        first_values = None
        while not first_values:
            first_line = self.fh.readline()
            if first_line == '':  # EOF reached
                raise IOError('End-of-file reached before encountering data.')
            first_values = self.splitter(first_line)
        return first_values


itemdictionary = {'return': 'return_',
                  'file': 'file_',
                  'print': 'print_'}


def process_header(headers):
    "Validates a list of strings to use as field names."
    # The strings are stripped of any non-alphanumeric character, and
    # spaces are replaced by '_'.
    #
    # Define the characters to delete from the headers
    delete = set(r"""~!@#$%^&*()-=+~\|]}[{';: /?.,""")
    delete.add('"')
    names = []
    seen = dict()
    for i, item in enumerate(headers):
        item = item.strip().lower().replace(' ', '_')
        item = ''.join([c for c in item if c not in delete])
        if not len(item):
            item = 'column%d' % i
        item = itemdictionary.get(item, item)
        cnt = seen.get(item, 0)
        if cnt > 0:
            names.append(item + '_%d' % cnt)
        else:
            names.append(item)
        seen[item] = cnt + 1
    return names
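As a quick illustration of the name-sanitizing behavior described above, here is a hypothetical stand-alone variant (`clean_names`, a sketch that uses a regex in place of the `delete` set, not the code from the message itself):

```python
import re

def clean_names(headers):
    # Stand-alone sketch of the header validation above: lowercase,
    # spaces to underscores, drop other non-alphanumerics, rename
    # reserved words, and deduplicate with numeric suffixes.
    reserved = {'return': 'return_', 'file': 'file_', 'print': 'print_'}
    seen = {}
    names = []
    for i, item in enumerate(headers):
        item = item.strip().lower().replace(' ', '_')
        item = re.sub(r'[^a-z0-9_]', '', item)
        if not item:
            item = 'column%d' % i
        item = reserved.get(item, item)
        cnt = seen.get(item, 0)
        names.append('%s_%d' % (item, cnt) if cnt > 0 else item)
        seen[item] = cnt + 1
    return names

print(clean_names(['Date', 'Close ($)', 'Close ($)', 'return', '']))
# → ['date', 'close_', 'close__1', 'return_', 'column4']
```

Note how duplicates get a `_1` suffix, the Python keyword `return` is renamed, and an empty header falls back to a positional name.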
[Numpy-discussion] int(np.nan) on python 2.6
All,
Sorry to bump my own post, and I was kinda threadjacking anyway:

Some functions of numpy.ma (e.g., ma.max, ma.min...) accept explicit outputs that may not be MaskedArrays. When such an explicit output is not a MaskedArray, a value that should have been masked is transformed into np.nan. That worked great in 2.5, with np.nan automatically transformed to 0 when the explicit output had an int dtype. With Python 2.6, a ValueError is raised instead, as np.nan can no longer be cast to int.

What should be the recommended behavior in this case? Raise a ValueError or some other exception, to follow the new Python 2.6 convention, or silently replace np.nan by some value acceptable to an int dtype (0, or something else)?

Thanks for any suggestion,
P.
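For concreteness, here is a minimal illustration of the behavior change (this is just the underlying cast, not numpy.ma's actual code path):

```python
import numpy as np

out = np.zeros(2, dtype=int)
try:
    # Storing nan into an int container: Python 2.5 silently stored 0,
    # newer versions refuse the cast.
    out[0] = np.nan
    outcome = 'stored %r' % out[0]
except ValueError:
    outcome = 'raised ValueError'
print(outcome)
```

The same ValueError ("cannot convert float NaN to integer") is what surfaces through ma.max/ma.min when the explicit int output receives a masked value.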
Re: [Numpy-discussion] More loadtxt() changes
On Nov 25, 2008, at 12:30 PM, Christopher Barker wrote:

missing : string, optional
    A string representing a missing value, irrespective of the column
    where it appears (e.g., ``'missing'`` or ``'unused'``).

It might be nice if missing could be a sequence of strings, if there is more than one value for missing values that are not clearly mapped to a particular field.

OK, easy enough.

missing_values : {None, dictionary}, optional
    A dictionary mapping a column number to a string indicating whether
    the corresponding field should be masked.

Would it be possible to specify a column header, rather than a number, here?

A la mlab.csv2rec? It could work with a bit more tweaking, basically following John Hunter's et al. path. What happens when the column names are unknown (read from the header) or wrong? Actually, I'd like John to comment on that, hence the CC.

More generally, wouldn't it be useful to push the recarray-manipulating functions from matplotlib.mlab to numpy?
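For what it's worth, keying missing values by column name is exactly what today's np.genfromtxt (which this code eventually grew into) supports; a small sketch, with made-up data and column names:

```python
import io
import numpy as np

# Two rows, with a different missing-value marker per column.
data = io.StringIO("a,b,c\n1,N/A,3\nmissing,5,6\n")

arr = np.genfromtxt(data, delimiter=',', names=True, dtype=float,
                    # Keys are column *names* (read from the header),
                    # not column numbers.
                    missing_values={'a': 'missing', 'b': 'N/A'},
                    usemask=True)

# Each column's mask reflects its own marker.
print(arr['a'].mask.tolist())  # → [False, True]
print(arr['b'].mask.tolist())  # → [True, False]
```

With `usemask=True` the result is a masked record array, so the missing cells stay masked instead of being filled with a sentinel.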