[Numpy-discussion] Py3k and numpy
I noticed that the Python 3000 final was released today... is there any sense of how long it will take to get numpy working under 3k? I would imagine it'll be a lot to adapt given the low-level changes, but is the work already in progress?
[Numpy-discussion] genloadtxt: second serving
All,
Here's the second round of genloadtxt. That's a tad cleaner version than the previous one, where I tried to take into account the different comments and suggestions that were posted. So, tabs should be supported and explicit whitespaces are not collapsed.

FYI, in the __main__ section, you'll find 2 hotshot tests and a timeit comparison: same input, no missing data, one with genloadtxt, one with np.loadtxt and a last one with matplotlib.mlab.csv2rec. As you'll see, genloadtxt is roughly twice as slow as np.loadtxt, but twice as fast as csv2rec. One of the explanations for the slowness is indeed the use of classes for splitting lines and converting values. Instead of a basic function, we use the __call__ method of the class, which itself calls another function depending on the attribute values. I'd like to reduce this overhead; any suggestion is more than welcome, as usual.

Anyhow: as we do need speed, I suggest we put genloadtxt somewhere in numpy.ma, with an alias recfromcsv for John, using his defaults. Unless somebody comes up with a brilliant optimization.

Let me know how it goes,
Cheers,
P.

Proposal: Here's an extension to np.loadtxt, designed to take missing values into account.

import itertools
import numpy as np
import numpy.ma as ma


def _is_string_like(obj):
    """Check whether obj behaves like a string."""
    try:
        obj + ''
    except (TypeError, ValueError):
        return False
    return True


def _to_filehandle(fname, flag='r', return_opened=False):
    """
    Returns the filehandle corresponding to a string or a file.
    If the string ends in '.gz', the file is automatically unzipped.

    Parameters
    ----------
    fname : string, filehandle
        Name of the file whose filehandle must be returned.
    flag : string, optional
        Flag indicating the status of the file ('r' for read, 'w' for write).
    return_opened : boolean, optional
        Whether to return the opening status of the file.
    """
    if _is_string_like(fname):
        if fname.endswith('.gz'):
            import gzip
            fhd = gzip.open(fname, flag)
        elif fname.endswith('.bz2'):
            import bz2
            fhd = bz2.BZ2File(fname)
        else:
            fhd = file(fname, flag)
        opened = True
    elif hasattr(fname, 'seek'):
        fhd = fname
        opened = False
    else:
        raise ValueError('fname must be a string or file handle')
    if return_opened:
        return fhd, opened
    return fhd


def flatten_dtype(ndtype):
    """Unpack a structured data-type."""
    names = ndtype.names
    if names is None:
        return [ndtype]
    else:
        types = []
        for field in names:
            (typ, _) = ndtype.fields[field]
            flat_dt = flatten_dtype(typ)
            types.extend(flat_dt)
        return types


def nested_masktype(datatype):
    """Construct the dtype of a mask for nested elements."""
    names = datatype.names
    if names:
        descr = []
        for name in names:
            (ndtype, _) = datatype.fields[name]
            descr.append((name, nested_masktype(ndtype)))
        return descr
    # Is this some kind of composite a la (np.float, 2) ?
    elif datatype.subdtype:
        mdescr = list(datatype.subdtype)
        mdescr[0] = np.dtype(bool)
        return tuple(mdescr)
    else:
        return np.bool


class LineSplitter:
    """
    Defines a function to split a string at a given delimiter or at given
    places.

    Parameters
    ----------
    comments : {'#', string}
        Character used to mark the beginning of a comment.
    delimiter : var, optional
        If a string, character used to delimit consecutive fields.
        If an integer or a sequence of integers, width(s) of each field.
    autostrip : boolean, optional
        Whether to strip each individual field.
    """
    #
    def autostrip(self, method):
        """Wrapper to strip each member of the output of `method`."""
        return lambda input: [_.strip() for _ in method(input)]
    #
    def __init__(self, delimiter=None, comments='#', autostrip=True):
        self.comments = comments
        # Delimiter is a character
        if (delimiter is None) or _is_string_like(delimiter):
            delimiter = delimiter or None
            _called = self._delimited_splitter
        # Delimiter is a list of field widths
        elif hasattr(delimiter, '__iter__'):
            _called = self._variablewidth_splitter
            idx = np.cumsum([0] + list(delimiter))
            delimiter = [slice(i, j) for (i, j) in zip(idx[:-1], idx[1:])]
        # Delimiter is a single integer
        elif int(delimiter):
            (_called, delimiter) = (self._fixedwidth_splitter, int(delimiter))
        else:
            (_called, delimiter) = (self._delimited_splitter, None)
        self.delimiter = delimiter
        if autostrip:
            self._called = self.autostrip(_called)
        else:
            self._called = _called
[Numpy-discussion] genloadtxt: second serving (tests)
And now for the tests:

# pylint disable-msg=E1101, W0212, W0621

import numpy as np
import numpy.ma as ma
from numpy.ma.testutils import *
from StringIO import StringIO

from _preview import *


class TestLineSplitter(TestCase):
    "Tests the LineSplitter class."
    #
    def test_no_delimiter(self):
        "Test LineSplitter w/o delimiter"
        strg = " 1 2 3 4  5 # test"
        test = LineSplitter()(strg)
        assert_equal(test, ['1', '2', '3', '4', '5'])
        test = LineSplitter('')(strg)
        assert_equal(test, ['1', '2', '3', '4', '5'])

    def test_space_delimiter(self):
        "Test space delimiter"
        strg = " 1 2 3 4  5 # test"
        test = LineSplitter(' ')(strg)
        assert_equal(test, ['1', '2', '3', '4', '', '5'])
        test = LineSplitter('  ')(strg)
        assert_equal(test, ['1 2 3 4', '5'])

    def test_tab_delimiter(self):
        "Test tab delimiter"
        strg = " 1\t 2\t 3\t 4\t 5  6"
        test = LineSplitter('\t')(strg)
        assert_equal(test, ['1', '2', '3', '4', '5  6'])
        strg = " 1  2\t 3  4\t 5  6"
        test = LineSplitter('\t')(strg)
        assert_equal(test, ['1  2', '3  4', '5  6'])

    def test_other_delimiter(self):
        "Test LineSplitter on delimiter"
        strg = "1,2,3,4,,5"
        test = LineSplitter(',')(strg)
        assert_equal(test, ['1', '2', '3', '4', '', '5'])
        #
        strg = " 1,2,3,4,,5 # test"
        test = LineSplitter(',')(strg)
        assert_equal(test, ['1', '2', '3', '4', '', '5'])

    def test_constant_fixed_width(self):
        "Test LineSplitter w/ fixed-width fields"
        strg = "  1  2  3  4     5   # test"
        test = LineSplitter(3)(strg)
        assert_equal(test, ['1', '2', '3', '4', '', '5', ''])
        #
        strg = "  1     3  4  5  6# test"
        test = LineSplitter(20)(strg)
        assert_equal(test, ['1     3  4  5  6'])
        #
        strg = "  1     3  4  5  6# test"
        test = LineSplitter(30)(strg)
        assert_equal(test, ['1     3  4  5  6'])

    def test_variable_fixed_width(self):
        strg = "  1     3  4  5  6# test"
        test = LineSplitter((3, 6, 6, 3))(strg)
        assert_equal(test, ['1', '3', '4  5', '6'])
        #
        strg = "  1     3  4  5  6# test"
        test = LineSplitter((6, 6, 9))(strg)
        assert_equal(test, ['1', '3  4', '5  6'])

#---
class TestNameValidator(TestCase):
    #
    def test_case_sensitivity(self):
        "Test case sensitivity"
        names = ['A', 'a', 'b', 'c']
        test = NameValidator().validate(names)
        assert_equal(test, ['A', 'a', 'b', 'c'])
        test = NameValidator(case_sensitive=False).validate(names)
        assert_equal(test, ['A', 'A_1', 'B', 'C'])
    #
    def test_excludelist(self):
        "Test excludelist"
        names = ['dates', 'data', 'Other Data', 'mask']
        validator = NameValidator(excludelist=['dates', 'data', 'mask'])
        test = validator.validate(names)
        assert_equal(test, ['dates_', 'data_', 'Other_Data', 'mask_'])

#---
class TestStringConverter(TestCase):
    "Test StringConverter"
    #
    def test_creation(self):
        "Test creation of a StringConverter"
        converter = StringConverter(int, -9)
        assert_equal(converter._status, 1)
        assert_equal(converter.default, -9)
    #
    def test_upgrade(self):
        "Tests the upgrade method."
        converter = StringConverter()
        assert_equal(converter._status, 0)
        converter.upgrade('0')
        assert_equal(converter._status, 1)
        converter.upgrade('0.')
        assert_equal(converter._status, 2)
        converter.upgrade('0j')
        assert_equal(converter._status, 3)
        converter.upgrade('a')
        assert_equal(converter._status, len(converter._mapper) - 1)
    #
    def test_missing(self):
        "Tests the use of missing values."
        converter = StringConverter(missing_values=('missing', 'missed'))
        converter.upgrade('0')
        assert_equal(converter('0'), 0)
        assert_equal(converter(''), converter.default)
        assert_equal(converter('missing'), converter.default)
        assert_equal(converter('missed'), converter.default)
        try:
            converter('miss')
        except ValueError:
            pass
    #
    def test_upgrademapper(self):
        "Tests updatemapper"
        import dateutil.parser
        import datetime
        dateparser = dateutil.parser.parse
        StringConverter.upgrade_mapper(dateparser, datetime.date(2000, 1, 1))
        convert = StringConverter(dateparser, datetime.date(2000, 1, 1))
        test = convert('2001-01-01')
        assert_equal(test, datetime.datetime(2001, 01, 01, 00, 00, 00))

#---
class TestLoadTxt(TestCase):
    #
    def test_record(self):
        "Test w/ explicit dtype"
Re: [Numpy-discussion] genloadtxt: second serving
Pierre GM wrote:
> All, Here's the second round of genloadtxt. That's a tad cleaner version
> than the previous one, where I tried to take into account the different
> comments and suggestions that were posted. So, tabs should be supported
> and explicit whitespaces are not collapsed.
> FYI, in the __main__ section, you'll find 2 hotshot tests and a timeit
> comparison: same input, no missing data, one with genloadtxt, one with
> np.loadtxt and a last one with matplotlib.mlab.csv2rec. As you'll see,
> genloadtxt is roughly twice as slow as np.loadtxt, but twice as fast as
> csv2rec. One of the explanations for the slowness is indeed the use of
> classes for splitting lines and converting values. Instead of a basic
> function, we use the __call__ method of the class, which itself calls
> another function depending on the attribute values. I'd like to reduce
> this overhead; any suggestion is more than welcome, as usual.
> Anyhow: as we do need speed, I suggest we put genloadtxt somewhere in
> numpy.ma, with an alias recfromcsv for John, using his defaults. Unless
> somebody comes up with a brilliant optimization.

Will loadtxt in that case remain as is? Or will the _faulttolerantconv class be used?

mm
[Numpy-discussion] Broadcasting question
Hi list,

Suppose I have array a with dimensions (d1, d3) and array b with dimensions (d2, d3). I want to compute array c with dimensions (d1, d2) holding the squared euclidean norms of vectors in a and b with size d3.

My first take was to use a python level loop:

    from numpy import *
    c = array([sum((a_i - b) ** 2, axis=1) for a_i in a])

But this is too slow and allocates a useless temporary list of python references. To avoid the python level loop I then tried to use broadcasting as follows:

    c = sum((a[:,newaxis,:] - b) ** 2, axis=2)

But this builds a useless and huge (d1, d2, d3) temporary array that does not fit in memory for large values of d1, d2 and d3... Do you have any better idea? I would like to simulate a runtime behavior similar to:

    c = dot(a, b.T)

but for squared euclidean norms instead of dot products. I can always write the code in C and wrap it with ctypes, but I wondered whether this is possible with numpy only.

--
Olivier
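A memory-friendly alternative, offered here as a sketch rather than as part of the original thread: expand ||a_i - b_j||**2 = ||a_i||**2 + ||b_j||**2 - 2*<a_i, b_j>, which turns the whole computation into one matrix product plus broadcasting, so only the (d1, d2) result is ever allocated:

    import numpy as np

    def squared_distances(a, b):
        # ||a_i - b_j||**2 = ||a_i||**2 + ||b_j||**2 - 2 * <a_i, b_j>
        aa = (a ** 2).sum(axis=1)      # shape (d1,)
        bb = (b ** 2).sum(axis=1)      # shape (d2,)
        c = np.dot(a, b.T)             # shape (d1, d2)
        c *= -2
        c += aa[:, np.newaxis]
        c += bb
        return c

Rounding can leave tiny negative entries in the result, so clip at zero if that matters downstream.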
Re: [Numpy-discussion] Broadcasting question
Hi Olivier

2008/12/4 Olivier Grisel [EMAIL PROTECTED]:
> To avoid the python level loop I then tried to use broadcasting as follows:
>
>     c = sum((a[:,newaxis,:] - b) ** 2, axis=2)
>
> But this builds a useless and huge (d1, d2, d3) temporary array that does
> not fit in memory for large values of d1, d2 and d3...

Does numpy.lib.broadcast_arrays do what you need?

Regards
Stéfan
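For reference, a minimal sketch of what broadcast_arrays (added in NumPy 1.2) does -- it returns broadcast views without copying the data:

    import numpy as np
    from numpy.lib.stride_tricks import broadcast_arrays  # numpy >= 1.2

    a = np.zeros((3, 1, 2))             # (d1, 1, d3)
    b = np.zeros((1, 4, 2))             # (1, d2, d3)
    aa, bb = broadcast_arrays(a, b)     # views, both with shape (3, 4, 2)

Note that (aa - bb) ** 2 would still materialize the full (3, 4, 2) temporary, so this helps inspect broadcasting rather than avoid the memory cost.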
Re: [Numpy-discussion] Compiler options for mingw?
> I needed it to help me fix a couple of bugs for old CPUs, so it ended up
> being implemented in the nsis script for scipy now (I will add it to
> numpy installers too). So from now on, any new releases of both numpy
> and scipy installers can be overridden:
>
>     installer-name.exe /arch native  - default behavior
>     installer-name.exe /arch nosse   - force installation w/o SSE, even
>                                        if an SSE CPU is detected.
>
> It does not check that the option is valid, so you can end up requesting
> an SSE3 installer on a SSE2 CPU. But well...

Cool! Thanks! This will be really useful...

Zach
Re: [Numpy-discussion] Py3k and numpy
On Thu, Dec 4, 2008 at 1:20 AM, Erik Tollerud [EMAIL PROTECTED] wrote:
> I noticed that the Python 3000 final was released today... is there any
> sense of how long it will take to get numpy working under 3k? I would
> imagine it'll be a lot to adapt given the low-level changes, but is the
> work already in progress?

I read that announcement too. My feeling is that we can only support one branch at a time, i.e., the python 2.x or python 3.x series. So the easiest path to 3.x looked to be waiting until python 2.6 was widely distributed, making it the required version, doing the needed updates to numpy, and then using the automatic conversion to python 3.x. I expect f2py, nose, and other tools will also need fixups. Guido suggests an approach like this for those needing to support both series, and I really don't see an alternative unless someone wants to fork numpy ;)

Chuck
Re: [Numpy-discussion] Broadcasting question
2008/12/4 Stéfan van der Walt [EMAIL PROTECTED]:
> 2008/12/4 Olivier Grisel [EMAIL PROTECTED]:
>> To avoid the python level loop I then tried to use broadcasting as
>> follows:
>>
>>     c = sum((a[:,newaxis,:] - b) ** 2, axis=2)
>>
>> But this builds a useless and huge (d1, d2, d3) temporary array that
>> does not fit in memory for large values of d1, d2 and d3...
>
> Does numpy.lib.broadcast_arrays do what you need?

That looks like exactly what I am looking for. Apparently this is new in 1.2, since I cannot find it in the 1.1 version on my system.

Thanks,

--
Olivier
Re: [Numpy-discussion] Broadcasting question
On Thu, Dec 4, 2008 at 8:26 AM, Olivier Grisel [EMAIL PROTECTED] wrote:
> Hi list,
>
> Suppose I have array a with dimensions (d1, d3) and array b with
> dimensions (d2, d3). I want to compute array c with dimensions (d1, d2)
> holding the squared euclidean norms of vectors in a and b with size d3.

Just to clarify the problem a bit, it looks like you want to compute the squared euclidean distance between every vector in a and every vector in b, i.e., a distance matrix. Is that correct? Also, how big are d1, d2, d3?

If you *are* looking to compute the distance matrix, I suspect your end goal is something beyond that. Could you describe what you are trying to do? It could be that scipy.spatial or scipy.cluster are what you should look at.

[snip]

Chuck
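If the distance matrix itself is the goal, scipy.spatial (new in the then-upcoming scipy 0.7) computes it directly; a small sketch:

    import numpy as np
    from scipy.spatial import distance

    a = np.random.rand(5, 3)
    b = np.random.rand(4, 3)
    c = distance.cdist(a, b, 'sqeuclidean')   # shape (5, 4)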
Re: [Numpy-discussion] genloadtxt: second serving
On Dec 4, 2008, at 7:22 AM, Manuel Metz wrote:
> Will loadtxt in that case remain as is? Or will the _faulttolerantconv
> class be used?

No idea, we need to discuss it. There's a problem with _faulttolerantconv: using np.nan as default value will not work in Python 2.6 if the output is to be int, as an exception will be raised. Therefore, we'd need to change the default to something else when defining _faulttolerantconv. The easiest would be to define a class and set the argument at instantiation, but then we're getting dangerously close to StringConverter...
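For illustration only -- not Pierre's code -- a minimal sketch of a converter class with the default set at instantiation, which is the design he alludes to:

    class FaultTolerantConv:
        """Convert strings with `func`, falling back to `default` on
        failure.  A hypothetical sketch; StringConverter does far more."""
        def __init__(self, func=float, default=0):
            self.func = func
            self.default = default
        def __call__(self, value):
            try:
                return self.func(value)
            except (ValueError, TypeError):
                return self.default

    conv = FaultTolerantConv(int, default=-999)
    # conv('12') -> 12;  conv('N/A') -> -999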
Re: [Numpy-discussion] Broadcasting question
2008/12/4 Charles R Harris [EMAIL PROTECTED]:
> On Thu, Dec 4, 2008 at 8:26 AM, Olivier Grisel [EMAIL PROTECTED] wrote:
>> Suppose I have array a with dimensions (d1, d3) and array b with
>> dimensions (d2, d3). I want to compute array c with dimensions (d1, d2)
>> holding the squared euclidean norms of vectors in a and b with size d3.
>
> Just to clarify the problem a bit, it looks like you want to compute the
> squared euclidean distance between every vector in a and every vector in
> b, i.e., a distance matrix. Is that correct? Also, how big are d1, d2, d3?

I would target d1 >> d2 ~ d3, with d1 as large as possible to fit in memory and d2 and d3 in the order of a couple hundreds or thousands for a start.

> If you *are* looking to compute the distance matrix, I suspect your end
> goal is something beyond that. Could you describe what you are trying
> to do?

My end goal is to compute the activation of an array of Radial Basis Function units, where the activation of the unit with center b_j for data vector a_i is given by:

    f(a_i, b_j) = exp(-||a_i - b_j|| ** 2 / (2 * sigma))

The end goal is to have building blocks of various parameterized arrays of homogeneous units (linear, sigmoid and RBF) along with their gradient in parameter space, so as to build various machine learning algorithms such as multi layer perceptrons with various training strategies such as Stochastic Gradient Descent. That code might be integrated into the Modular Data Processing (MDP toolkit) project [1] at some point.

The current state of the python code is here:
http://www.bitbucket.org/ogrisel/oglab/src/186eab341408/simdkernel/src/simdkernel/scalar.py

You can find an SSE optimized C implementation wrapped with ctypes here:
http://www.bitbucket.org/ogrisel/oglab/src/186eab341408/simdkernel/src/simdkernel/sse.py
http://www.bitbucket.org/ogrisel/oglab/src/186eab341408/simdkernel/src/simdkernel/sse.c

> It could be that scipy.spatial or scipy.cluster are what you should
> look at.

I'll have a look at those, thanks for the pointer.

[1] http://mdp-toolkit.sourceforge.net/

--
Olivier
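Putting the earlier distance trick to work, a hedged sketch (illustrative names, not from the simdkernel code) of computing the RBF activations without the (d1, d2, d3) temporary:

    import numpy as np

    def rbf_activations(a, centers, sigma):
        # f(a_i, b_j) = exp(-||a_i - b_j|| ** 2 / (2 * sigma))
        aa = (a ** 2).sum(axis=1)[:, np.newaxis]     # (d1, 1)
        bb = (centers ** 2).sum(axis=1)              # (d2,)
        d2 = aa + bb - 2 * np.dot(a, centers.T)      # squared distances
        np.clip(d2, 0, np.inf, d2)                   # kill rounding noise
        return np.exp(-d2 / (2.0 * sigma))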
Re: [Numpy-discussion] Py3k and numpy
On Thu, Dec 4, 2008 at 9:39 AM, Charles R Harris [EMAIL PROTECTED] wrote:
> I read that announcement too. My feeling is that we can only support one
> branch at a time, i.e., the python 2.x or python 3.x series. So the
> easiest path to 3.x looked to be waiting until python 2.6 was widely
> distributed, making it the required version, doing the needed updates to
> numpy, and then using the automatic conversion to python 3.x. I expect
> f2py, nose, and other tools will also need fixups. Guido suggests an
> approach like this for those needing to support both series, and I
> really don't see an alternative unless someone wants to fork numpy ;)

Looks like python 2.6 just went into Fedora rawhide, so it should be in the May Fedora 11 release. I expect Ubuntu and other leading edge Linux distros to have it about the same time. This probably means numpy needs to be running on python 2.6 by early Spring. Dropping support for earlier versions of python might be something to look at for next Fall. So I'm guessing about a year will be the earliest we might have Python 3.0 support.

Chuck
Re: [Numpy-discussion] Py3k and numpy
On Thu, Dec 4, 2008 at 12:57, Charles R Harris [EMAIL PROTECTED] wrote:
> Looks like python 2.6 just went into Fedora rawhide, so it should be in
> the May Fedora 11 release. I expect Ubuntu and other leading edge Linux
> distros to have it about the same time. This probably means numpy needs
> to be running on python 2.6 by early Spring.

It does. What problems are people seeing? Is it just the Windows build that causes people to say numpy doesn't work with Python 2.6?

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth." -- Umberto Eco
Re: [Numpy-discussion] int(np.nan) on python 2.6
On Nov 25, 2008, at 12:23 PM, Pierre GM wrote:
> All,
> Some functions of numpy.ma (eg, ma.max, ma.min...) accept explicit
> outputs that may not be MaskedArrays. When such an explicit output is
> not a MaskedArray, a value that should have been masked is transformed
> into np.nan. That worked great in 2.5, with np.nan automatically
> transformed to 0 when the explicit output had a int dtype. With Python
> 2.6, a ValueError is raised instead, as np.nan can no longer be cast to
> int. What should be the recommended behavior in this case? Raise a
> ValueError or some other exception, to follow the new Python 2.6
> convention, or silently replace np.nan by some value acceptable by int
> dtype (0, or something else)?

Sorry to bump my own post, and I was kinda threadjacking anyway. Second bump, sorry. Any consensus on what the behavior should be? Raise a ValueError (even in 2.5, therefore risking to break something), or just go with the flow and switch np.nan to an acceptable value (like 0) under the hood? I'd like to close the corresponding ticket...
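A quick illustration of the underlying change (the 2.5 result was platform-dependent, so take this as a sketch):

    import numpy as np

    # Python 2.5: int(np.nan) silently produced 0, so filling an int
    # output with a masked value "worked".
    # Python 2.6: the same conversion raises
    #     ValueError: cannot convert float NaN to integer
    try:
        value = int(np.nan)
    except ValueError:
        value = 0   # the old silent behavior, now made explicit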
Re: [Numpy-discussion] int(np.nan) on python 2.6
On Thu, Dec 4, 2008 at 11:14 AM, Pierre GM [EMAIL PROTECTED] wrote:
> Raise a ValueError (even in 2.5, therefore risking to break something)

+1

--
Jarrod Millman
Computational Infrastructure for Research Labs
10 Giannini Hall, UC Berkeley
phone: 510.643.4014
http://cirl.berkeley.edu/
Re: [Numpy-discussion] Py3k and numpy
On Dec 4, 2008, at 2:03 PM, Robert Kern wrote:
> It does. What problems are people seeing? Is it just the Windows build
> that causes people to say numpy doesn't work with Python 2.6?

There is currently no official Mac OS X binary of numpy for python 2.6, but you can build it from source. Is there any time table for generating a 2.6 Mac OS X binary?

Cheers
Tommy
Re: [Numpy-discussion] int(np.nan) on python 2.6
On Thu, Dec 4, 2008 at 2:40 PM, Jarrod Millman [EMAIL PROTECTED] wrote:
> On Thu, Dec 4, 2008 at 11:14 AM, Pierre GM [EMAIL PROTECTED] wrote:
>> Raise a ValueError (even in 2.5, therefore risking to break something)
>
> +1

+1

I'm not yet a serious user of numpy/scipy, but when debugging the discrete distributions, it took me a while to figure out that some mysteriously appearing zeros were nans that were silently converted during casting to int.

In matlab, I encode different types of missing values (in the data) by numbers that I know are not in my dataset, e.g. -2**20, -2**21, ... but that depends on the dataset (hand made nan handling, before the data is cleaned). When I then see a weird number, I know that there is a problem; if the nan were zero, I wouldn't know whether it's a missing value or really a zero.

Josef
Re: [Numpy-discussion] Py3k and numpy
On Thu, Dec 4, 2008 at 12:15 PM, Tommy Grav [EMAIL PROTECTED] wrote:
> There is currently no official Mac OS X binary of numpy for python 2.6,
> but you can build it from source. Is there any time table for generating
> a 2.6 Mac OS X binary?

My intention was to make 2.6 Mac binaries for the NumPy 1.3 release. We haven't finalized a timetable for the 1.3 release yet, but the current plan was to try and get the release out near the end of December. Once SciPy 0.7 is out, I will turn my attention to the next NumPy release.

--
Jarrod Millman
Computational Infrastructure for Research Labs
10 Giannini Hall, UC Berkeley
phone: 510.643.4014
http://cirl.berkeley.edu/
Re: [Numpy-discussion] int(np.nan) on python 2.6
On Dec 4, 2008, at 3:24 PM, [EMAIL PROTECTED] wrote:
> On Thu, Dec 4, 2008 at 2:40 PM, Jarrod Millman [EMAIL PROTECTED] wrote:
>> On Thu, Dec 4, 2008 at 11:14 AM, Pierre GM [EMAIL PROTECTED] wrote:
>>> Raise a ValueError (even in 2.5, therefore risking to break something)
>> +1
> +1

OK then, I'll do that and update the SVN later tonight or early tmw...
Re: [Numpy-discussion] genloadtxt: second serving
Pierre GM wrote:
> All, Here's the second round of genloadtxt. That's a tad cleaner version
> than the previous one, where I tried to take into account the different
> comments and suggestions that were posted. So, tabs should be supported
> and explicit whitespaces are not collapsed.

Looks pretty good, but there's one breakage against what I had working with my local copy (with mods). When adding the filtering of names read from the file using usecols, there's a reason I set a flag and fixed it later: converters specified by name. If we have usecols and converters specified by name, and we read the names from a file, we have the following sequence:

1) Read names.
2) Convert usecols names to column numbers.
3) Filter name list using usecols. Indices of the names list no longer map to column numbers.
4) Change converters from mapping names -> funcs to mapping col# -> func using indices from names. OOPS.

It's an admittedly complex combination, but it allows flexibly reading text files, since you're only basing on field names, not column numbers. Here's a test case:

    def test_autonames_usecols_and_converter(self):
        "Tests names and usecols"
        data = StringIO.StringIO('A B C D\n aaaa 121 45 9.1')
        test = loadtxt(data, usecols=('A', 'C', 'D'), names=True,
                       dtype=None, converters={'C': lambda s: 2 * int(s)})
        control = np.array(('aaaa', 90, 9.1),
                           dtype=[('A', '|S4'), ('C', int), ('D', float)])
        assert_equal(test, control)

This fails with your current implementation, but works for me when I:

1) Set a flag when reading names from the header line in the file.
2) Filter names from the file using usecols (if the flag is true) *after* remapping the converters.

There may be a better approach, but this is the simplest I've come up with so far; a sketch follows below.

> FYI, in the __main__ section, you'll find 2 hotshot tests and a timeit
> comparison: [...] Anyhow: as we do need speed, I suggest we put
> genloadtxt somewhere in numpy.ma, with an alias recfromcsv for John,
> using his defaults. Unless somebody comes up with a brilliant
> optimization.

Why only in numpy.ma and not somewhere in core numpy itself (missing values aside)? You have a pretty good masked-array-agnostic wrapper that IMO could go in numpy, though maybe not as loadtxt.

Ryan

--
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma
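A hedged sketch (illustrative names, not Ryan's actual patch) of the reordering described above -- remap name-keyed converters to column indices while `names` still covers every column, and only then filter `names` down to `usecols`:

    # sketch only: assumes `names` was just read from the header line,
    # `usecols` already holds column indices, and `converters` may be
    # keyed by field name
    if names_read_from_file:
        for key in [k for k in converters if isinstance(k, str)]:
            converters[names.index(key)] = converters.pop(key)
        # safe to filter now that converters are index-keyed
        names = [names[i] for i in usecols]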
Re: [Numpy-discussion] int(np.nan) on python 2.6
[EMAIL PROTECTED] wrote:
>> Raise a ValueError (even in 2.5, therefore risking to break something)
> +1

+1 as well

> it took me a while to figure out that some mysteriously appearing zeros
> were nans that were silently converted during casting to int.

and this is why -- a zero is a perfectly valid and useful number; NaN should never get cast to a zero (or any other valid number) unless the user explicitly asks it to be. I think the right choice was made for python 2.6 here.

-Chris

--
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR            (206) 526-6959  voice
7600 Sand Point Way NE  (206) 526-6329  fax
Seattle, WA 98115       (206) 526-6317  main reception

[EMAIL PROTECTED]
Re: [Numpy-discussion] Apply a function to an array elementwise
> I want to apply a function (myfunc, which takes and returns a scalar) to
> each element in a multi-dimensioned array (data). I can do this:
>
>     newdata = numpy.array([myfunc(d) for d in data.flat]).reshape(data.shape)
>
> But I'm wondering if there's a faster, more numpy way. I've looked at
> the vectorize function but can't work it out.

    from numpy import vectorize
    new_func = vectorize(myfunc)
    newdata = new_func(data)

This seems to be some sort of FAQ. Maybe the term vectorize is not known to all (newbie) users. At least finding its application in the docs doesn't seem easy. Here are more threads:

* optimising single value functions for array calculations -
  http://article.gmane.org/gmane.comp.python.numeric.general/26543
* vectorized function inside a class -
  http://article.gmane.org/gmane.comp.python.numeric.general/16438

Most newcomers learn at some point to develop functions for single values (scalars), but connecting this with computation on full arrays while staying efficient is another step. A short note has been written in the cookbook: http://www.scipy.org/Cookbook/Autovectorize

Regards,
Timmie
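A self-contained version of the pattern above (myfunc here is an arbitrary stand-in for the scalar function):

    import numpy as np

    def myfunc(x):
        # placeholder: any function of a single scalar
        return x * x + 1

    data = np.arange(12.0).reshape(3, 4)
    new_func = np.vectorize(myfunc)
    newdata = new_func(data)    # same shape as data, applied elementwise

Keep in mind that vectorize is a convenience wrapper around a Python-level loop: it buys broadcasting and clean syntax, not C-level speed.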
Re: [Numpy-discussion] PyArray_EMPTY and Cython
On Tue, Dec 2, 2008 at 9:57 PM, Gabriel Gellner [EMAIL PROTECTED] wrote:
> After some discussion on the Cython lists I thought I would try my hand
> at writing some Cython accelerators for empty and zeros. This will
> involve using PyArray_EMPTY; I have a simple prototype I would like to
> get working, but currently it segfaults. Any tips on what I might be
> missing?

I took a look at this. I'm admittedly a cython newbie, but I will be using code like this in the future. Have you had any luck?

Kurt

> import numpy as np
> cimport numpy as np
>
> cdef extern from "numpy/arrayobject.h":
>     PyArray_EMPTY(int ndims, np.npy_intp* dims, int type, bint fortran)
>
> cdef np.ndarray empty(np.npy_intp length):
>     cdef np.ndarray[np.double_t, ndim=1] ret
>     cdef int type = np.NPY_DOUBLE
>     cdef int ndims = 1
>     cdef np.npy_intp* dims
>     dims = &length
>     print dims[0]
>     print type
>     ret = PyArray_EMPTY(ndims, dims, type, False)
>     return ret
>
> def test():
>     cdef np.ndarray[np.double_t, ndim=1] y = empty(10)
>     return y
>
> The code seems to print out the correct dims and type info but segfaults
> when the PyArray_EMPTY call is made.
>
> Thanks, Gabriel
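Not an answer from the thread, but a frequent cause of a segfault at the first PyArray_* call is skipping NumPy's C-API initialization: a module that cimports numpy must call np.import_array() at module level before using any PyArray_* function. Offered as a guess rather than a confirmed diagnosis, a sketch of the guard:

    # Cython module
    import numpy as np
    cimport numpy as np

    # Initialize the NumPy C-API function table; without this call,
    # PyArray_EMPTY jumps through a NULL pointer and segfaults.
    np.import_array()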
[Numpy-discussion] checksum on numpy float array
My app reads in one or more float arrays from a binary file. Sometimes, due to network timeouts etc., the array is not read correctly. What would be the best way of checking the validity of the data? Would some sort of checksum approach be a good idea? Would that work with an array of floating point values? Or are checksums more for int, byte, and string type data?
Re: [Numpy-discussion] checksum on numpy float array
On Thu, Dec 4, 2008 at 17:17, Brennan Williams [EMAIL PROTECTED] wrote:
> My app reads in one or more float arrays from a binary file. Sometimes,
> due to network timeouts etc., the array is not read correctly. What
> would be the best way of checking the validity of the data? Would some
> sort of checksum approach be a good idea? Would that work with an array
> of floating point values? Or are checksums more for int, byte, and
> string type data?

Just use a generic hash on the file's bytes (ignoring their format). MD5 is sufficient for these purposes.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth." -- Umberto Eco
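For instance, a minimal sketch of hashing a file's raw bytes with the standard library (the file name is illustrative):

    import hashlib

    def file_md5(path, chunksize=1 << 20):
        m = hashlib.md5()
        f = open(path, 'rb')
        try:
            while True:
                chunk = f.read(chunksize)
                if not chunk:
                    break
                m.update(chunk)
        finally:
            f.close()
        return m.hexdigest()

    # compare file_md5('arrays.bin') before and after the transfer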
Re: [Numpy-discussion] checksum on numpy float array
On Thu, Dec 4, 2008 at 6:17 PM, Brennan Williams [EMAIL PROTECTED] wrote:
> My app reads in one or more float arrays from a binary file. Sometimes,
> due to network timeouts etc., the array is not read correctly. What
> would be the best way of checking the validity of the data?

If you want to verify the file itself, then python provides several more or less secure checksums; my experience was that zlib.crc32 was pretty fast on moderate file sizes. crc32 is common inside archive files and for binary newsgroups. If you have large files transported over the network, e.g. GB size, I would work with par2 repair files, which verify and repair at the same time.

Josef
Re: [Numpy-discussion] genloadtxt: second serving
I am not familiar with this, but it looks quite useful:
http://www.stecf.org/software/PYTHONtools/astroasciidata/
or (http://www.scipy.org/AstroAsciiData)

"Within the AstroAsciiData project we envision a module which can be used to work on all kinds of ASCII tables. The module provides a convenient tool such that the user easily can:
* read in ASCII tables;
* manipulate table elements;
* save the modified ASCII table;
* read and write meta data such as column names and units;
* combine several tables;
* delete/add rows and columns;
* manage metadata in the table headers."

Is anyone familiar with this package? Would it make sense to investigate including this or adopting some of its interface/features?

--
Jarrod Millman
Computational Infrastructure for Research Labs
10 Giannini Hall, UC Berkeley
phone: 510.643.4014
http://cirl.berkeley.edu/
Re: [Numpy-discussion] checksum on numpy float array
[EMAIL PROTECTED] wrote:
> If you want to verify the file itself, then python provides several more
> or less secure checksums; my experience was that zlib.crc32 was pretty
> fast on moderate file sizes. crc32 is common inside archive files and
> for binary newsgroups. If you have large files transported over the
> network, e.g. GB size, I would work with par2 repair files, which verify
> and repair at the same time.
>
> Josef

The file has multiple arrays stored in it, so I want to have some sort of validity check on just the array that I'm reading. I will need to add a check on the file as well, since of course network problems could affect writing to the file as well as reading from it.
Re: [Numpy-discussion] checksum on numpy float array
On Thu, Dec 4, 2008 at 17:43, Brennan Williams [EMAIL PROTECTED] wrote:
> The file has multiple arrays stored in it, so I want to have some sort
> of validity check on just the array that I'm reading.

So do it on the bytes of the individual arrays. Just don't bother implementing new type-specific checksums.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth." -- Umberto Eco
Re: [Numpy-discussion] checksum on numpy float array
I didn't check what this does behind the scenes, but try this:

    m = hashlib.md5()
    m.update(np.array(range(100)))
    m.update(np.array(range(200)))

    m2 = hashlib.md5()
    m2.update(np.array(range(100)))
    m2.update(np.array(range(200)))

    print m.hexdigest()
    print m2.hexdigest()
    assert m.hexdigest() == m2.hexdigest()

    m3 = hashlib.md5()
    m3.update(np.array(range(100)))
    m3.update(np.array(range(199)))
    print m3.hexdigest()
    assert m.hexdigest() == m3.hexdigest()

Josef
Re: [Numpy-discussion] checksum on numpy float array
On Thu, Dec 4, 2008 at 6:57 PM, [EMAIL PROTECTED] wrote:
> I didn't check what this does behind the scenes, but try this

I forgot to paste:

    import hashlib  # standard python library

Josef
Re: [Numpy-discussion] checksum on numpy float array
Thanks

[EMAIL PROTECTED] wrote:
> I didn't check what this does behind the scenes, but try this:
>
>     import hashlib  # standard python library
>     import numpy as np
>
>     m = hashlib.md5()
>     m.update(np.array(range(100)))
>     m.update(np.array(range(200)))
>
>     m2 = hashlib.md5()
>     m2.update(np.array(range(100)))
>     m2.update(np.array(range(200)))
>
>     print m.hexdigest()
>     print m2.hexdigest()
>     assert m.hexdigest() == m2.hexdigest()
>
>     m3 = hashlib.md5()
>     m3.update(np.array(range(100)))
>     m3.update(np.array(range(199)))
>     print m3.hexdigest()
>     assert m.hexdigest() == m3.hexdigest()
>
> Josef
Re: [Numpy-discussion] checksum on numpy float array
On Thu, Dec 4, 2008 at 18:54, Brennan Williams [EMAIL PROTECTED] wrote:
> [EMAIL PROTECTED] wrote:
>> m = hashlib.md5()
>> m.update(np.array(range(100)))
>> m.update(np.array(range(200)))

I would recommend doing this on the strings before you make arrays from them. You don't know if the network cut out in the middle of an 8-byte double.

Of course, sending the lengths and other metadata first, then the data, would let you check without needing to do expensivish hashes or checksums. If truncation is your problem rather than corruption, then that would be sufficient. You may also consider using the NPY format in numpy 1.2 to implement that.

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth." -- Umberto Eco
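To illustrate that last suggestion, a sketch of round-tripping through the NPY format (numpy >= 1.2), whose header records dtype and shape up front, so a truncated stream should fail loudly on load instead of silently yielding a short array:

    import numpy as np
    from StringIO import StringIO

    buf = StringIO()
    np.save(buf, np.arange(1000.0))  # magic string + header + raw data

    buf.seek(0)
    arr = np.load(buf)               # header is checked on the way in

    truncated = StringIO(buf.getvalue()[:-100])
    np.load(truncated)               # raises instead of returning a
                                     # silently shortened array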
Re: [Numpy-discussion] checksum on numpy float array
Robert Kern wrote:
> Of course, sending the lengths and other metadata first, then the data,
> would let you check without needing to do expensivish hashes or
> checksums. If truncation is your problem rather than corruption, then
> that would be sufficient. You may also consider using the NPY format in
> numpy 1.2 to implement that.

Thanks for the ideas. I'm definitely going to add some more basic checks on lengths etc. as well. Unfortunately the problem is happening at a client site, so (a) I can't reproduce it and (b) most of the time they can't reproduce it either. This is a Windows Python app running on Citrix, reading/writing data to a Linux networked drive.

Brennan
Re: [Numpy-discussion] Py3k and numpy
From my experience working on my own projects and Cython:

* The C code making Python C-API calls could be made version-agnostic by using preprocessor macros, and even some compatibility header conditionally included. Perhaps the latter would be the easiest for C-API calls (we have a lot already distilled in the Cython sources). Preprocessor conditionals would still be needed when filling structs.
* Regarding Python code, I believe the only sane way to go is to make the 2to3 tool convert all the 2.x code to 3.x correctly.
* The all-new buffer interface as implemented in core Py3.0 needs careful review and fixes.
* The now-all-strings-are-unicode change is going to cause some headaches ;-)
* No idea how to deal with the now-all-integers-are-python-longs change.

On Thu, Dec 4, 2008 at 5:20 AM, Erik Tollerud [EMAIL PROTECTED] wrote:
> I noticed that the Python 3000 final was released today... is there any
> sense of how long it will take to get numpy working under 3k? I would
> imagine it'll be a lot to adapt given the low-level changes, but is the
> work already in progress?

--
Lisandro Dalcín
---------------
Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
PTLC - Güemes 3450, (3000) Santa Fe, Argentina
Tel/Fax: +54-(0)342-451.1594