[Numpy-discussion] Py3k and numpy

2008-12-04 Thread Erik Tollerud
I noticed that the Python 3000 final was released today... is there
any sense of how long it will take to get numpy working under 3k?  I
would imagine it'll be a lot to adapt given the low-level change, but
is the work already in progress?
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] genloadtxt: second serving

2008-12-04 Thread Pierre GM

All,
Here's the second round of genloadtxt. It's a tad cleaner than the
previous version: I tried to take into account the different comments and
suggestions that were posted. So, tabs should now be supported and
explicit whitespace is not collapsed.
FYI, in the __main__ section, you'll find 2 hotshot tests and a timeit
comparison: same input, no missing data, one with genloadtxt, one with
np.loadtxt, and a last one with matplotlib.mlab.csv2rec.


As you'll see, genloadtxt is roughly half as fast as np.loadtxt, but
twice as fast as csv2rec. One explanation for the slowness is the use
of classes for splitting lines and converting values: instead of a basic
function, we use the __call__ method of the class, which itself calls
another function depending on the attribute values. I'd like to reduce
this overhead; any suggestion is more than welcome, as usual.
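
As a rough illustration of that dispatch overhead (a made-up
micro-benchmark, not from the attached code), compare a plain function
with a callable class:

import timeit

setup = '''
class Splitter(object):
    def __init__(self, delimiter=None):
        self.delimiter = delimiter
    def __call__(self, line):
        return line.split(self.delimiter)
splitter = Splitter()

def split_func(line, delimiter=None):
    return line.split(delimiter)

line = '1 2 3 4  5'
'''
print timeit.Timer('splitter(line)', setup).timeit(100000)
print timeit.Timer('split_func(line)', setup).timeit(100000)

The __call__ path pays an extra attribute lookup and method dispatch on
every line, which adds up over a large file.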


Anyhow: as we do need speed, I suggest we put genloadtxt somewhere in  
numpy.ma, with an alias recfromcsv for John, using his defaults, unless  
somebody comes up with a brilliant optimization.


Let me know how it goes,
Cheers,
P.





Proposal:
Here's an extension to np.loadtxt, designed to take missing values into account.





import itertools
import numpy as np
import numpy.ma as ma


def _is_string_like(obj):
    """
    Check whether obj behaves like a string.
    """
    try:
        obj + ''
    except (TypeError, ValueError):
        return False
    return True

def _to_filehandle(fname, flag='r', return_opened=False):
    """
    Returns the filehandle corresponding to a string or a file.
    If the string ends in '.gz', the file is automatically unzipped.

    Parameters
    ----------
    fname : string, filehandle
        Name of the file whose filehandle must be returned.
    flag : string, optional
        Flag indicating the status of the file ('r' for read, 'w' for write).
    return_opened : boolean, optional
        Whether to return the opening status of the file.
    """
    if _is_string_like(fname):
        if fname.endswith('.gz'):
            import gzip
            fhd = gzip.open(fname, flag)
        elif fname.endswith('.bz2'):
            import bz2
            fhd = bz2.BZ2File(fname)
        else:
            fhd = file(fname, flag)
        opened = True
    elif hasattr(fname, 'seek'):
        fhd = fname
        opened = False
    else:
        raise ValueError('fname must be a string or file handle')
    if return_opened:
        return fhd, opened
    return fhd


def flatten_dtype(ndtype):
    """
    Unpack a structured data-type.
    """
    names = ndtype.names
    if names is None:
        return [ndtype]
    else:
        types = []
        for field in names:
            (typ, _) = ndtype.fields[field]
            flat_dt = flatten_dtype(typ)
            types.extend(flat_dt)
        return types


def nested_masktype(datatype):
    """
    Construct the dtype of a mask for nested elements.
    """
    names = datatype.names
    if names:
        descr = []
        for name in names:
            (ndtype, _) = datatype.fields[name]
            descr.append((name, nested_masktype(ndtype)))
        return descr
    # Is this some kind of composite a la (np.float, 2) ?
    elif datatype.subdtype:
        mdescr = list(datatype.subdtype)
        mdescr[0] = np.dtype(bool)
        return tuple(mdescr)
    else:
        return np.bool



class LineSplitter:
    """
    Defines a function to split a string at a given delimiter or at given places.

    Parameters
    ----------
    delimiter : var, optional
        If a string, character used to delimit consecutive fields.
        If an integer or a sequence of integers, width(s) of each field.
    comments : {'#', string}, optional
        Character used to mark the beginning of a comment.
    autostrip : boolean, optional
        Whether to strip each individual field.
    """
    #
    def autostrip(self, method):
        "Wrapper to strip each member of the output of `method`."
        return lambda input: [_.strip() for _ in method(input)]
    #
    def __init__(self, delimiter=None, comments='#', autostrip=True):
        self.comments = comments
        # Delimiter is a character
        if (delimiter is None) or _is_string_like(delimiter):
            delimiter = delimiter or None
            _called = self._delimited_splitter
        # Delimiter is a list of field widths
        elif hasattr(delimiter, '__iter__'):
            _called = self._variablewidth_splitter
            idx = np.cumsum([0] + list(delimiter))
            delimiter = [slice(i, j) for (i, j) in zip(idx[:-1], idx[1:])]
        # Delimiter is a single integer
        elif int(delimiter):
            (_called, delimiter) = (self._fixedwidth_splitter, int(delimiter))
        else:
            (_called, delimiter) = (self._delimited_splitter, None)
        self.delimiter = delimiter
        if autostrip:
            self._called = self.autostrip(_called)
        else:
            self._called = _called

[Numpy-discussion] genloadtxt: second serving (tests)

2008-12-04 Thread Pierre GM

And now for the tests:

# pylint: disable-msg=E1101, W0212, W0621

import numpy as np
import numpy.ma as ma

from numpy.ma.testutils import *

from StringIO import StringIO

from _preview import *


class TestLineSplitter(TestCase):
    "Tests the LineSplitter class."
    #
    def test_no_delimiter(self):
        "Test LineSplitter w/o delimiter"
        strg = " 1 2 3 4  5 # test"
        test = LineSplitter()(strg)
        assert_equal(test, ['1', '2', '3', '4', '5'])
        test = LineSplitter('')(strg)
        assert_equal(test, ['1', '2', '3', '4', '5'])

    def test_space_delimiter(self):
        "Test space delimiter"
        strg = " 1 2 3 4  5 # test"
        test = LineSplitter(' ')(strg)
        assert_equal(test, ['1', '2', '3', '4', '', '5'])
        test = LineSplitter('  ')(strg)
        assert_equal(test, ['1 2 3 4', '5'])

    def test_tab_delimiter(self):
        "Test tab delimiter"
        strg = " 1\t 2\t 3\t 4\t 5  6"
        test = LineSplitter('\t')(strg)
        assert_equal(test, ['1', '2', '3', '4', '5  6'])
        strg = " 1  2\t 3  4\t 5  6"
        test = LineSplitter('\t')(strg)
        assert_equal(test, ['1  2', '3  4', '5  6'])

    def test_other_delimiter(self):
        "Test LineSplitter on delimiter"
        strg = "1,2,3,4,,5"
        test = LineSplitter(',')(strg)
        assert_equal(test, ['1', '2', '3', '4', '', '5'])
        #
        strg = " 1,2,3,4,,5 # test"
        test = LineSplitter(',')(strg)
        assert_equal(test, ['1', '2', '3', '4', '', '5'])

    def test_constant_fixed_width(self):
        "Test LineSplitter w/ fixed-width fields"
        strg = "  1  2  3  4     5   # test"
        test = LineSplitter(3)(strg)
        assert_equal(test, ['1', '2', '3', '4', '', '5', ''])
        #
        strg = "  1     3  4  5  6# test"
        test = LineSplitter(20)(strg)
        assert_equal(test, ['1     3  4  5  6'])
        #
        strg = "  1     3  4  5  6# test"
        test = LineSplitter(30)(strg)
        assert_equal(test, ['1     3  4  5  6'])

    def test_variable_fixed_width(self):
        strg = "  1     3  4  5  6# test"
        test = LineSplitter((3, 6, 6, 3))(strg)
        assert_equal(test, ['1', '3', '4  5', '6'])
        #
        strg = "  1     3  4  5  6# test"
        test = LineSplitter((6, 6, 9))(strg)
        assert_equal(test, ['1', '3  4', '5  6'])


#-------------------------------------------------------------------------------

class TestNameValidator(TestCase):
    #
    def test_case_sensitivity(self):
        "Test case sensitivity"
        names = ['A', 'a', 'b', 'c']
        test = NameValidator().validate(names)
        assert_equal(test, ['A', 'a', 'b', 'c'])
        test = NameValidator(case_sensitive=False).validate(names)
        assert_equal(test, ['A', 'A_1', 'B', 'C'])
    #
    def test_excludelist(self):
        "Test excludelist"
        names = ['dates', 'data', 'Other Data', 'mask']
        validator = NameValidator(excludelist=['dates', 'data', 'mask'])
        test = validator.validate(names)
        assert_equal(test, ['dates_', 'data_', 'Other_Data', 'mask_'])


#-------------------------------------------------------------------------------

class TestStringConverter(TestCase):
    "Test StringConverter"
    #
    def test_creation(self):
        "Test creation of a StringConverter"
        converter = StringConverter(int, -9)
        assert_equal(converter._status, 1)
        assert_equal(converter.default, -9)
    #
    def test_upgrade(self):
        "Tests the upgrade method."
        converter = StringConverter()
        assert_equal(converter._status, 0)
        converter.upgrade('0')
        assert_equal(converter._status, 1)
        converter.upgrade('0.')
        assert_equal(converter._status, 2)
        converter.upgrade('0j')
        assert_equal(converter._status, 3)
        converter.upgrade('a')
        assert_equal(converter._status, len(converter._mapper) - 1)
    #
    def test_missing(self):
        "Tests the use of missing values."
        converter = StringConverter(missing_values=('missing', 'missed'))
        converter.upgrade('0')
        assert_equal(converter('0'), 0)
        assert_equal(converter(''), converter.default)
        assert_equal(converter('missing'), converter.default)
        assert_equal(converter('missed'), converter.default)
        try:
            converter('miss')
        except ValueError:
            pass
    #
    def test_upgrademapper(self):
        "Tests updatemapper"
        import dateutil.parser
        import datetime
        dateparser = dateutil.parser.parse
        StringConverter.upgrade_mapper(dateparser, datetime.date(2000, 1, 1))
        convert = StringConverter(dateparser, datetime.date(2000, 1, 1))
        test = convert('2001-01-01')
        assert_equal(test, datetime.datetime(2001, 01, 01, 00, 00, 00))


#-------------------------------------------------------------------------------

class TestLoadTxt(TestCase):
    #
    def test_record(self):
        Test w/ explicit 

Re: [Numpy-discussion] genloadtxt: second serving

2008-12-04 Thread Manuel Metz
Pierre GM wrote:
 All,
 Here's the second round of genloadtxt. It's a tad cleaner than the 
 previous version: I tried to take into account the different comments 
 and suggestions that were posted. So, tabs should now be supported and 
 explicit whitespace is not collapsed.
 FYI, in the __main__ section, you'll find 2 hotshot tests and a timeit 
 comparison: same input, no missing data, one with genloadtxt, one with 
 np.loadtxt, and a last one with matplotlib.mlab.csv2rec.
 
 As you'll see, genloadtxt is roughly half as fast as np.loadtxt, but 
 twice as fast as csv2rec. One explanation for the slowness is the use 
 of classes for splitting lines and converting values: instead of a 
 basic function, we use the __call__ method of the class, which itself 
 calls another function depending on the attribute values. I'd like to 
 reduce this overhead; any suggestion is more than welcome, as usual.
 
 Anyhow: as we do need speed, I suggest we put genloadtxt somewhere in 
 numpy.ma, with an alias recfromcsv for John, using his defaults, unless 
 somebody comes up with a brilliant optimization.

Will loadtxt in that case remain as is? Or will the _faulttolerantconv 
class be used?

mm
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] Broadcasting question

2008-12-04 Thread Olivier Grisel
Hi list,

Suppose I have array a with dimensions (d1, d3) and array b with
dimensions (d2, d3). I want to compute array c with dimensions (d1,
d2) holding the squared euclidean norms of vectors in a and b with
size d3.

My first take was to use a python level loop:

 from numpy import *
 c = array([sum((a_i - b) ** 2, axis=1) for a_i in a])

But this is too slow and allocates a useless temporary list of python references.

To avoid the python level loop I then tried to use broadcasting as follows:

 c = sum((a[:,newaxis,:] - b) ** 2, axis=2)

But this builds a useless and huge (d1, d2, d3) temporary array that
does not fit in memory for large values of d1, d2 and d3...

Do you have any better idea? I would like to simulate a runtime
behavior similar to:

 c = dot(a, b.T)

but for squared euclidean norms instead of dot products.

I can always write the code in C and wrap it with ctypes, but I
wondered whether this is possible with numpy alone.
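
(A standard numpy-only sketch, under the assumption that what is wanted
is the matrix of squared distances between rows of a and rows of b:
expand the square so the cross term becomes a single matrix product and
no (d1, d2, d3) temporary is ever built.)

import numpy as np

def squared_distances(a, b):
    # ||a_i - b_j||**2 = ||a_i||**2 + ||b_j||**2 - 2 * a_i . b_j
    aa = (a ** 2).sum(axis=1)[:, np.newaxis]   # shape (d1, 1)
    bb = (b ** 2).sum(axis=1)                  # shape (d2,)
    return aa + bb - 2 * np.dot(a, b.T)        # shape (d1, d2)

The biggest temporary is the (d1, d2) result itself, and the heavy
lifting is done by dot, as wished for above. (Rounding can produce tiny
negative entries; clip to 0 if that matters.)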

-- 
Olivier
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Broadcasting question

2008-12-04 Thread Stéfan van der Walt
Hi Olivier

2008/12/4 Olivier Grisel [EMAIL PROTECTED]:
 To avoid the python level loop I then tried to use broadcasting as follows:

 c = sum((a[:,newaxis,:] - b) ** 2, axis=2)

 But this builds a useless and huge (d1, d2, d3) temporary array that
 does not fit in memory for large values of d1, d2 and d3...

Does numpy.lib.broadcast_arrays do what you need?
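
(For reference, a minimal usage sketch: the returned arrays are views,
so the call itself allocates no large temporaries, although arithmetic
on the broadcast views still will.)

import numpy as np

a = np.arange(6).reshape(2, 3)          # shape (2, 3)
b = np.arange(3)                        # shape (3,)
A, B = np.lib.broadcast_arrays(a, b)    # two views, both with shape (2, 3)
print A.shape, B.shape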

Regards
Stéfan
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Compiler options for mingw?

2008-12-04 Thread Zachary Pincus
 I needed it to help me fix a couple of bugs for old CPUs, so it
 ended up being implemented in the nsis script for scipy now (I will
 add it to the numpy installers too). So from now on, any newly released
 numpy and scipy installers can be overridden:

 installer-name.exe /arch native - default behavior
 installer-name.exe /arch nosse - force installation w/o SSE, even if
 an SSE CPU is detected.

 It does not check that the option is valid, so you can end up
 requesting the SSE3 installer on an SSE2 CPU. But well...

Cool! Thanks! This will be really useful...

Zach
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Py3k and numpy

2008-12-04 Thread Charles R Harris
On Thu, Dec 4, 2008 at 1:20 AM, Erik Tollerud [EMAIL PROTECTED] wrote:

 I noticed that the Python 3000 final was released today... is there
 any sense of how long it will take to get numpy working under 3k?  I
 would imagine it'll be a lot to adapt given the low-level change, but
 is the work already in progress?


I read that announcement too. My feeling is that we can only support one
branch at a time, i.e., the python 2.x or python 3.x series. So the easiest
path to 3.x looked to be waiting until python 2.6 was widely distributed,
making it the required version, doing the needed updates to numpy, and then
using the automatic conversion to python 3.x. I expect f2py, nose, and other
tools will also need fixups. Guido suggests an approach like this for those
needing to support both series and I really don't see an alternative unless
someone wants to fork numpy ;)

Chuck
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Broadcasting question

2008-12-04 Thread Olivier Grisel
2008/12/4 Stéfan van der Walt [EMAIL PROTECTED]:
 Hi Olivier

 2008/12/4 Olivier Grisel [EMAIL PROTECTED]:
 To avoid the python level loop I then tried to use broadcasting as follows:

 c = sum((a[:,newaxis,:] - b) ** 2, axis=2)

 But this builds a useless and huge (d1, d2, d3) temporary array that
 does not fit in memory for large values of d1, d2 and d3...

 Does  numpy.lib.broadcast_arrays do what you need?

That looks like exactly what I am looking for. Apparently this is new in
1.2, since I cannot find it in the 1.1 version on my system.

Thanks,

-- 
Olivier
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Broadcasting question

2008-12-04 Thread Charles R Harris
On Thu, Dec 4, 2008 at 8:26 AM, Olivier Grisel [EMAIL PROTECTED] wrote:

 Hi list,

 Suppose I have array a with dimensions (d1, d3) and array b with
 dimensions (d2, d3). I want to compute array c with dimensions (d1,
 d2) holding the squared euclidean norms of vectors in a and b with
 size d3.


Just to clarify the problem a bit, it looks like you want to compute the
squared euclidean distance between every vector in a and every vector in b,
i.e., a distance matrix. Is that correct? Also, how big are d1, d2, d3?

If you *are* looking to compute the distance matrix, I suspect your end goal
is something beyond that. Could you describe what you are trying to do? It
could be that scipy.spatial or scipy.cluster are what you should look at.

snip

Chuck
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] genloadtxt: second serving

2008-12-04 Thread Pierre GM

On Dec 4, 2008, at 7:22 AM, Manuel Metz wrote:

 Will loadtxt in that case remain as is? Or will the _faulttolerantconv
 class be used?

No idea, we need to discuss it. There's a problem with  
_faulttolerantconv: using np.nan as the default value will not work in  
Python 2.6 if the output is to be int, as an exception will be raised.  
Therefore, we'd need to change the default to something else when  
defining _faulttolerantconv. The easiest would be to define a class  
and set the argument at instantiation, but then we're getting back  
dangerously close to StringConverter...
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Broadcasting question

2008-12-04 Thread Olivier Grisel
2008/12/4 Charles R Harris [EMAIL PROTECTED]:


 On Thu, Dec 4, 2008 at 8:26 AM, Olivier Grisel [EMAIL PROTECTED]
 wrote:

 Hi list,

 Suppose I have array a with dimensions (d1, d3) and array b with
 dimensions (d2, d3). I want to compute array c with dimensions (d1,
 d2) holding the squared euclidean norms of vectors in a and b with
 size d3.

 Just to clarify the problem a bit, it looks like you want to compute the
 squared euclidean distance between every vector in a and every vector in b,
 i.e., a distance matrix. Is that correct? Also, how big are d1, d2, d3?

I would target d1 >> d2 ~ d3, with d1 as large as possible to fit in
memory, and d2 and d3 on the order of a couple hundred or thousand
for a start.

 If you *are* looking to compute the distance matrix I suspect your end goal
 is something beyond that. Could you describe what you are trying to do?

My end goal is to compute the activation of an array of Radial Basis
Function units, where the activation of the unit with center b_j for
data vector a_i is given by:

f(a_i, b_j) = exp(-||a_i - b_j|| ** 2 / (2 * sigma))
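
(With the squared-distance matrix computed via the
||a||^2 + ||b||^2 - 2 a.b expansion, this is a couple of lines; a sketch,
following the sigma convention of the formula above:)

import numpy as np

def rbf_activations(a, b, sigma):
    # squared distances via the expanded form, then the RBF kernel
    sq = (a ** 2).sum(1)[:, np.newaxis] + (b ** 2).sum(1) - 2 * np.dot(a, b.T)
    return np.exp(-sq / (2.0 * sigma))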

The end goal is to have building blocks of various parameterized arrays
of homogeneous units (linear, sigmoid and RBF) along with their
gradients in parameter space, so as to build various machine learning
algorithms such as multi-layer perceptrons with various training
strategies such as Stochastic Gradient Descent. That code might be
integrated into the Modular Data Processing (MDP toolkit) project [1]
at some point.

The current state of the python code is here:

http://www.bitbucket.org/ogrisel/oglab/src/186eab341408/simdkernel/src/simdkernel/scalar.py

You can find an SSE optimized C implementation wrapped with ctypes here:

http://www.bitbucket.org/ogrisel/oglab/src/186eab341408/simdkernel/src/simdkernel/sse.py
http://www.bitbucket.org/ogrisel/oglab/src/186eab341408/simdkernel/src/simdkernel/sse.c

 It could be that scipy.spatial or scipy.cluster are what you should look at.

I'll have a look at those, thanks for the pointer.

[1] http://mdp-toolkit.sourceforge.net/

-- 
Olivier
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Py3k and numpy

2008-12-04 Thread Charles R Harris
On Thu, Dec 4, 2008 at 9:39 AM, Charles R Harris
[EMAIL PROTECTED] wrote:



 On Thu, Dec 4, 2008 at 1:20 AM, Erik Tollerud [EMAIL PROTECTED] wrote:

 I noticed that the Python 3000 final was released today... is there
 any sense of how long it will take to get numpy working under 3k?  I
 would imagine it'll be a lot to adapt given the low-level change, but
 is the work already in progress?


 I read that announcement too. My feeling is that we can only support one
 branch at a time, i.e., the python 2.x or python 3.x series. So the easiest
 path to 3.x looked to be waiting until python 2.6 was widely distributed,
 making it the required version, doing the needed updates to numpy, and then
 using the automatic conversion to python 3.x. I expect f2py, nose, and other
 tools will also need fixups. Guido suggests an approach like this for those
 needing to support both series and I really don't see an alternative unless
 someone wants to fork numpy ;)


Looks like python 2.6 just went into Fedora rawhide, so it should be in the
May Fedora 11 release. I expect Ubuntu and other leading edge Linux distros
to have it about the same time. This probably means numpy needs to be
running on python 2.6 by early Spring. Dropping support for earlier versions
of python might be something to look at for next Fall. So I'm guessing about
a year will be the earliest we might have Python 3.0 support.

Chuck
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Py3k and numpy

2008-12-04 Thread Robert Kern
On Thu, Dec 4, 2008 at 12:57, Charles R Harris
[EMAIL PROTECTED] wrote:


 On Thu, Dec 4, 2008 at 9:39 AM, Charles R Harris [EMAIL PROTECTED]
 wrote:


 On Thu, Dec 4, 2008 at 1:20 AM, Erik Tollerud [EMAIL PROTECTED]
 wrote:

 I noticed that the Python 3000 final was released today... is there
 any sense of how long it will take to get numpy working under 3k?  I
 would imagine it'll be a lot to adapt given the low-level change, but
 is the work already in progress?

 I read that announcement too. My feeling is that we can only support one
 branch at a time, i.e., the python 2.x or python 3.x series. So the easiest
 path to 3.x looked to be waiting until python 2.6 was widely distributed,
 making it the required version, doing the needed updates to numpy, and then
 using the automatic conversion to python 3.x. I expect f2py, nose, and other
 tools will also need fixups. Guido suggests an approach like this for those
 needing to support both series and I really don't see an alternative unless
 someone wants to fork numpy ;)

 Looks like python 2.6 just went into Fedora rawhide, so it should be in the
 May Fedora 11 release. I expect Ubuntu and other leading edge Linux distros
 to have it about the same time. This probably means numpy needs to be
 running on python 2.6 by early Spring.

It does. What problems are people seeing? Is it just the Windows build
that causes people to say numpy doesn't work with Python 2.6?

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
  -- Umberto Eco
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] int(np.nan) on python 2.6

2008-12-04 Thread Pierre GM

On Nov 25, 2008, at 12:23 PM, Pierre GM wrote:

 All,
 Sorry to bump my own post, and I was kinda threadjacking anyway:

 Some functions of numpy.ma (e.g., ma.max, ma.min...) accept explicit  
 outputs that may not be MaskedArrays.
 When such an explicit output is not a MaskedArray, a value that  
 should have been masked is transformed into np.nan.

 That worked great in 2.5, with np.nan automatically transformed to 0  
 when the explicit output had an int dtype. With Python 2.6, a  
 ValueError is raised instead, as np.nan can no longer be cast to  
 int.

 What should be the recommended behavior in this case? Raise a  
 ValueError or some other exception, to follow the new Python 2.6  
 convention, or silently replace np.nan by some value acceptable to an  
 int dtype (0, or something else)?


Second bump, sorry. Any consensus on what the behavior should be?  
Raise a ValueError (even in 2.5, therefore risking breaking something)  
or just go with the flow and switch np.nan to an acceptable value  
(like 0), under the hood? I'd like to close the corresponding ticket...
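
(For illustration, the difference and the proposed under-the-hood option
boil down to this sketch, with 0 as a stand-in fill value:)

import numpy as np

x = np.array([1.0, np.nan, 3.0])
# Python 2.5: int(np.nan) silently gave 0; Python 2.6 raises
# ValueError: cannot convert float NaN to integer.
# The silent-replacement option is an explicit fill before the cast:
x[np.isnan(x)] = 0
out = x.astype(int)          # array([1, 0, 3])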
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] int(np.nan) on python 2.6

2008-12-04 Thread Jarrod Millman
On Thu, Dec 4, 2008 at 11:14 AM, Pierre GM [EMAIL PROTECTED] wrote:
 Raise a ValueError (even in 2.5, therefore risking breaking something)

+1

-- 
Jarrod Millman
Computational Infrastructure for Research Labs
10 Giannini Hall, UC Berkeley
phone: 510.643.4014
http://cirl.berkeley.edu/
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Py3k and numpy

2008-12-04 Thread Tommy Grav
On Dec 4, 2008, at 2:03 PM, Robert Kern wrote:
 It does. What problems are people seeing? Is it just the Windows build
 that causes people to say numpy doesn't work with Python 2.6?

There is currently no official Mac OS X binary of numpy for python 2.6,
but you can build it from source. Is there any timetable for generating
a 2.6 Mac OS X binary?

Cheers
  Tommy


___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] int(np.nan) on python 2.6

2008-12-04 Thread josef . pktd
On Thu, Dec 4, 2008 at 2:40 PM, Jarrod Millman [EMAIL PROTECTED] wrote:
 On Thu, Dec 4, 2008 at 11:14 AM, Pierre GM [EMAIL PROTECTED] wrote:
 Raise a ValueError (even in 2.5, therefore risking breaking something)

 +1


+1

I'm not yet a serious user of numpy/scipy, but when debugging the
discrete distributions, it took me a while to figure out that some
mysteriously appearing zeros were nans that were silently converted
during casting to int.

In matlab, I encode different types of missing values (in the data) by
numbers that I know are not in my dataset, e.g. -2**20, -2**21,... but
that depends on the dataset (hand-made nan handling, before the data is
cleaned). When I then see a weird number, I know that there is a
problem; if the nan is zero, I wouldn't know whether it's a missing
value or really a zero.

Josef
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Py3k and numpy

2008-12-04 Thread Jarrod Millman
On Thu, Dec 4, 2008 at 12:15 PM, Tommy Grav [EMAIL PROTECTED] wrote:
 On Dec 4, 2008, at 2:03 PM, Robert Kern wrote:
 It does. What problems are people seeing? Is it just the Windows build
 that causes people to say numpy doesn't work with Python 2.6?

 There is currently no official Mac OSX binary for numpy for python 2.6,
 but you can build it from source. Is there any time table for generating
 a 2.6 Mac OS X binary?

My intention was to make 2.6 Mac binaries for the NumPy 1.3 release.
We haven't finalized a timetable for the 1.3 release yet, but the
current plan was to try and get the release out near the end of
December.  Once SciPy 0.7 is out, I will turn my attention to the next
NumPy release.

-- 
Jarrod Millman
Computational Infrastructure for Research Labs
10 Giannini Hall, UC Berkeley
phone: 510.643.4014
http://cirl.berkeley.edu/
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] int(np.nan) on python 2.6

2008-12-04 Thread Pierre GM

On Dec 4, 2008, at 3:24 PM, [EMAIL PROTECTED] wrote:

 On Thu, Dec 4, 2008 at 2:40 PM, Jarrod Millman  
 [EMAIL PROTECTED] wrote:
 On Thu, Dec 4, 2008 at 11:14 AM, Pierre GM [EMAIL PROTECTED]  
 wrote:
 Raise a ValueError (even in 2.5, therefore risking breaking  
 something)

 +1


 +1

OK then, I'll do that and update the SVN later tonight or early tmw...
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] genloadtxt: second serving

2008-12-04 Thread Ryan May
Pierre GM wrote:
 All,
 Here's the second round of genloadtxt. It's a tad cleaner than the 
 previous version: I tried to take into account the different comments 
 and suggestions that were posted. So, tabs should now be supported 
 and explicit whitespace is not collapsed.

Looks pretty good, but there's one breakage against what I had working 
with my local copy (with mods). When adding the filtering of names read 
from the file using usecols, there's a reason I set a flag and fixed it 
later: converters specified by name. If we have usecols and converters 
specified by name, and we read the names from a file, we have the 
following sequence:

1) Read names.
2) Convert usecols names to column numbers.
3) Filter the name list using usecols. Indices of the names list no 
longer map to column numbers.
4) Change converters from mapping names->funcs to mapping col#->funcs, 
using indices from names. OOPS.

It's an admittedly complex combination, but it allows flexibly reading 
text files, since you're relying only on field names, not column numbers. 
Here's a test case:

def test_autonames_usecols_and_converter(self):
    "Tests names and usecols"
    data = StringIO.StringIO('A B C D\n  121 45 9.1')
    test = loadtxt(data, usecols=('A', 'C', 'D'), names=True,
                   dtype=None, converters={'C': lambda s: 2 * int(s)})
    control = np.array(('', 90, 9.1),
                       dtype=[('A', '|S4'), ('C', int), ('D', float)])
    assert_equal(test, control)

This fails with your current implementation, but works for me when I:

1) set a flag when reading names from the header line in the file;
2) filter names from the file using usecols (if the flag is true) *after*
remapping the converters.

There may be a better approach, but this is the simplest I've come up 
with so far.

 FYI, in the __main__ section, you'll find 2 hotshot tests and a timeit 
 comparison: same input, no missing data, one with genloadtxt, one with 
 np.loadtxt, and a last one with matplotlib.mlab.csv2rec.
 
 As you'll see, genloadtxt is roughly half as fast as np.loadtxt, but 
 twice as fast as csv2rec. One explanation for the slowness is the use 
 of classes for splitting lines and converting values: instead of a 
 basic function, we use the __call__ method of the class, which itself 
 calls another function depending on the attribute values. I'd like to 
 reduce this overhead; any suggestion is more than welcome, as usual.
 
 Anyhow: as we do need speed, I suggest we put genloadtxt somewhere in 
 numpy.ma, with an alias recfromcsv for John, using his defaults, unless 
 somebody comes up with a brilliant optimization.

Why only in numpy.ma and not somewhere in core numpy itself (missing 
values aside)? You have a pretty good masked-array-agnostic wrapper 
that IMO could go into numpy, though maybe not as loadtxt.

Ryan

-- 
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] int(np.nan) on python 2.6

2008-12-04 Thread Christopher Barker
[EMAIL PROTECTED] wrote:
 Raise a ValueError (even in 2.5, therefore risking breaking something)

+1 as well

 it took me a while to figure out that some
 mysteriously appearing zeros were nans that were silently converted
 during casting to int.

and this is why -- a zero is a perfectly valid and useful number, NaN 
should never get cast to a zero (or any other valid number) unless the 
user explicitly asks it to be.

I think the right choice was made for python 2.6 here.

-Chris



-- 
Christopher Barker, Ph.D.
Oceanographer

Emergency Response Division
NOAA/NOS/ORR             (206) 526-6959   voice
7600 Sand Point Way NE   (206) 526-6329   fax
Seattle, WA  98115   (206) 526-6317   main reception

[EMAIL PROTECTED]
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Apply a function to an array elementwise

2008-12-04 Thread Tim Michelsen
 I want to apply a function (myfunc which takes and returns a scalar) to each
 element in a multi-dimensioned array (data):

 I can do this:

 newdata = numpy.array([myfunc(d) for d in data.flat]).reshape(data.shape)

 But I'm wondering if there's a faster, more numpy way. I've looked at the
 vectorize function but can't work it out.

 from numpy import vectorize
 
 new_func = vectorize(myfunc)
 newdata = new_func(data)

This seems to be some sort of FAQ. Maybe the term vectorize is not known 
to all (newbie) users. At least finding its application in the docs 
doesn't seem easy.

Here are more threads:
* optimising single value functions for array calculations - 
http://article.gmane.org/gmane.comp.python.numeric.general/26543
* vectorized function inside a class - 
http://article.gmane.org/gmane.comp.python.numeric.general/16438

Most newcomers learn at some point to develop functions for single 
values (scalars), but connecting this with computation on full arrays, 
and doing so efficiently, is another step.
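
A minimal self-contained sketch of that step (myfunc is just a stand-in
scalar function here):

import numpy as np

def myfunc(x):
    # works on one scalar at a time
    if x > 0:
        return x * x
    return 0.0

vfunc = np.vectorize(myfunc, otypes=[np.float64])
data = np.array([[-1.0, 0.5], [2.0, -3.0]])
newdata = vfunc(data)     # applied elementwise, preserves data.shape

Note that vectorize is mainly a convenience: it still calls myfunc once
per element, so it won't match the speed of a true array expression.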

A short note has been written in the cookbook:
http://www.scipy.org/Cookbook/Autovectorize

Regards,
Timmie

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] PyArray_EMPTY and Cython

2008-12-04 Thread Kurt Smith
On Tue, Dec 2, 2008 at 9:57 PM, Gabriel Gellner [EMAIL PROTECTED] wrote:

 After some discussion on the Cython lists I thought I would try my hand
 at writing some Cython accelerators for empty and zeros. This will
 involve using PyArray_EMPTY. I have a simple prototype I would like to
 get working, but currently it segfaults. Any tips on what I might be
 missing?


I took a look at this; I'm admittedly a Cython newbie, but I will be using
code like this in the future. Have you had any luck?

Kurt




 import numpy as np
 cimport numpy as np

 cdef extern from "numpy/arrayobject.h":
    object PyArray_EMPTY(int ndims, np.npy_intp* dims, int type, bint fortran)

 cdef np.ndarray empty(np.npy_intp length):
    cdef np.ndarray[np.double_t, ndim=1] ret
    cdef int type = np.NPY_DOUBLE
    cdef int ndims = 1

    cdef np.npy_intp* dims
    dims = &length

    print dims[0]
    print type

    # Note: the numpy C API must be initialized with np.import_array()
    # once at module level before any PyArray_* call; forgetting that
    # is a classic cause of exactly this kind of segfault.
    ret = PyArray_EMPTY(ndims, dims, type, False)

    return ret

 def test():
    cdef np.ndarray[np.double_t, ndim=1] y = empty(10)

    return y


 The code seems to print out the correct dims and type info but segfaults
 when the PyArray_EMPTY call is made.

 Thanks,

 Gabriel

 ___
 Numpy-discussion mailing list
 Numpy-discussion@scipy.org
 http://projects.scipy.org/mailman/listinfo/numpy-discussion

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


[Numpy-discussion] checksum on numpy float array

2008-12-04 Thread Brennan Williams
My app reads in one or more float arrays from a binary file.

Sometimes due to network timeouts etc the array is not read correctly.

What would be the best way of checking the validity of the data?

Would some sort of checksum approach be a good idea?
Would that work with an array of floating point values?
Or are checksums more for int,byte,string type data?



___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] checksum on numpy float array

2008-12-04 Thread Robert Kern
On Thu, Dec 4, 2008 at 17:17, Brennan Williams
[EMAIL PROTECTED] wrote:
 My app reads in one or more float arrays from a binary file.

 Sometimes due to network timeouts etc the array is not read correctly.

 What would be the best way of checking the validity of the data?

 Would some sort of checksum approach be a good idea?
 Would that work with an array of floating point values?
 Or are checksums more for int,byte,string type data?

Just use a generic hash on the file's bytes (ignoring their format).
MD5 is sufficient for these purposes.
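
(A sketch of that, hashing the raw bytes of the file in chunks;
file_md5 is a made-up helper name:)

import hashlib

def file_md5(path, chunksize=1 << 20):
    # hash the file's bytes, ignoring their format
    m = hashlib.md5()
    f = open(path, 'rb')
    try:
        chunk = f.read(chunksize)
        while chunk:
            m.update(chunk)
            chunk = f.read(chunksize)
    finally:
        f.close()
    return m.hexdigest()

Compare the digest computed by the writer with one computed by the
reader; any mismatch means the bytes differ.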

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
  -- Umberto Eco
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] checksum on numpy float array

2008-12-04 Thread josef . pktd
On Thu, Dec 4, 2008 at 6:17 PM, Brennan Williams
[EMAIL PROTECTED] wrote:
 My app reads in one or more float arrays from a binary file.

 Sometimes due to network timeouts etc the array is not read correctly.

 What would be the best way of checking the validity of the data?

 Would some sort of checksum approach be a good idea?
 Would that work with an array of floating point values?
 Or are checksums more for int,byte,string type data?


If you want to verify the file itself, then python provides several
more or less secure checksums; my experience was that zlib.crc32 was
pretty fast on moderate file sizes. crc32 is common inside archive
files and for binary newsgroups. If you have large files transported
over the network, e.g. GB size, I would work with par2 repair files,
which verify and repair at the same time.

Josef
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] genloadtxt: second serving

2008-12-04 Thread Jarrod Millman
I am not familiar with this, but it looks quite useful:
http://www.stecf.org/software/PYTHONtools/astroasciidata/
or (http://www.scipy.org/AstroAsciiData)

Within the AstroAsciiData project we envision a module which can be
used to work on all kinds of ASCII tables. The module provides a
convenient tool such that the user easily can:

* read in ASCII tables;
* manipulate table elements;
* save the modified ASCII table;
* read and write meta data such as column names and units;
* combine several tables;
* delete/add rows and columns;
* manage metadata in the table headers.

Is anyone familiar with this package? Would it make sense to investigate
including this or adopting some of its interface/features?

-- 
Jarrod Millman
Computational Infrastructure for Research Labs
10 Giannini Hall, UC Berkeley
phone: 510.643.4014
http://cirl.berkeley.edu/
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] checksum on numpy float array

2008-12-04 Thread Brennan Williams
[EMAIL PROTECTED] wrote:
 On Thu, Dec 4, 2008 at 6:17 PM, Brennan Williams
 [EMAIL PROTECTED] wrote:
 My app reads in one or more float arrays from a binary file.

 Sometimes due to network timeouts etc the array is not read correctly.

 What would be the best way of checking the validity of the data?

 Would some sort of checksum approach be a good idea?
 Would that work with an array of floating point values?
 Or are checksums more for int,byte,string type data?

 If you want to verify the file itself, then python provides several
 more or less secure checksums; my experience was that zlib.crc32 was
 pretty fast on moderate file sizes. crc32 is common inside archive
 files and for binary newsgroups. If you have large files transported
 over the network, e.g. GB size, I would work with par2 repair files,
 which verify and repair at the same time.

The file has multiple arrays stored in it.

So I want to have some sort of validity check on just the array that I'm 
reading.

I will need to add a check on the file as well, as of course network 
problems could affect writing to the file as well as reading from it.


 Josef
 ___
 Numpy-discussion mailing list
 Numpy-discussion@scipy.org
 http://projects.scipy.org/mailman/listinfo/numpy-discussion

   

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] checksum on numpy float array

2008-12-04 Thread Robert Kern
On Thu, Dec 4, 2008 at 17:43, Brennan Williams
[EMAIL PROTECTED] wrote:
 [EMAIL PROTECTED] wrote:
 On Thu, Dec 4, 2008 at 6:17 PM, Brennan Williams
 [EMAIL PROTECTED] wrote:

 My app reads in one or more float arrays from a binary file.

 Sometimes due to network timeouts etc the array is not read correctly.

 What would be the best way of checking the validity of the data?

 Would some sort of checksum approach be a good idea?
 Would that work with an array of floating point values?
 Or are checksums more for int,byte,string type data?



 If you want to verify the file itself, then python provides several
 more or less secure checksums; my experience was that zlib.crc32 was
 pretty fast on moderate file sizes. crc32 is common inside archive
 files and for binary newsgroups. If you have large files transported
 over the network, e.g. GB size, I would work with par2 repair files,
 which verify and repair at the same time.


 The file has multiple arrays stored in it.

 So I want to have some sort of validity check on just the array that I'm
 reading.

So do it on the bytes of the individual arrays. Just don't bother
implementing new type-specific checksums.

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
  -- Umberto Eco
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] checksum on numpy float array

2008-12-04 Thread josef . pktd
I didn't check what this does behind the scenes, but try this

m = hashlib.md5()
m.update(np.array(range(100)))
m.update(np.array(range(200)))

m2 = hashlib.md5()
m2.update(np.array(range(100)))
m2.update(np.array(range(200)))

print m.hexdigest()
print m2.hexdigest()

assert  m.hexdigest() == m2.hexdigest()

m3 = hashlib.md5()
m3.update(np.array(range(100)))
m3.update(np.array(range(199)))

print m3.hexdigest()

assert  m.hexdigest() == m3.hexdigest()

Josef
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] checksum on numpy float array

2008-12-04 Thread josef . pktd
On Thu, Dec 4, 2008 at 6:57 PM,  [EMAIL PROTECTED] wrote:
 I didn't check what this does behind the scenes, but try this


I forgot to paste:

import hashlib #standard python library

Josef
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] checksum on numpy float array

2008-12-04 Thread Brennan Williams
Thanks

[EMAIL PROTECTED] wrote:
 I didn't check what this does behind the scenes, but try this

   
import hashlib #standard python library
import numpy as np
 m = hashlib.md5()
 m.update(np.array(range(100)))
 m.update(np.array(range(200)))

 m2 = hashlib.md5()
 m2.update(np.array(range(100)))
 m2.update(np.array(range(200)))

 print m.hexdigest()
 print m2.hexdigest()

 assert  m.hexdigest() == m2.hexdigest()

 m3 = hashlib.md5()
 m3.update(np.array(range(100)))
 m3.update(np.array(range(199)))

 print m3.hexdigest()

 assert  m.hexdigest() == m3.hexdigest()

 Josef
 ___
 Numpy-discussion mailing list
 Numpy-discussion@scipy.org
 http://projects.scipy.org/mailman/listinfo/numpy-discussion

   

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] checksum on numpy float array

2008-12-04 Thread Robert Kern
On Thu, Dec 4, 2008 at 18:54, Brennan Williams
[EMAIL PROTECTED] wrote:
 Thanks

 [EMAIL PROTECTED] wrote:
 I didn't check what this does behind the scenes, but try this


 import hashlib #standard python library
 import numpy as np
 m = hashlib.md5()
 m.update(np.array(range(100)))
 m.update(np.array(range(200)))

I would recommend doing this on the strings before you make arrays
from them. You don't know if the network cut out in the middle of an
8-byte double.

Of course, sending the lengths and other metadata first, then the data
would let you check without needing to do expensivish hashes or
checksums. If truncation is your problem rather than corruption, then
that would be sufficient. You may also consider using the NPY format
in numpy 1.2 to implement that.
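
(A rough sketch of that length-plus-checksum framing; the function names
are hypothetical, and the NPY format already stores shape and dtype
metadata for you:)

import struct
import zlib
import numpy as np

def write_array(fh, arr):
    payload = arr.tostring()                 # raw bytes of the array
    header = struct.pack('<Ii', len(payload), zlib.crc32(payload))
    fh.write(header)
    fh.write(payload)

def read_array(fh, dtype):
    n, crc = struct.unpack('<Ii', fh.read(8))
    payload = fh.read(n)
    if len(payload) != n or zlib.crc32(payload) != crc:
        raise IOError("array data truncated or corrupted")
    return np.fromstring(payload, dtype=dtype)

Truncation shows up as a short read; corruption shows up as a crc
mismatch.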

-- 
Robert Kern

I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth.
  -- Umberto Eco
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] checksum on numpy float array

2008-12-04 Thread Brennan Williams
Robert Kern wrote:
 On Thu, Dec 4, 2008 at 18:54, Brennan Williams
 [EMAIL PROTECTED] wrote:
 Thanks

 [EMAIL PROTECTED] wrote:
 I didn't check what this does behind the scenes, but try this

 import hashlib #standard python library
 import numpy as np

 m = hashlib.md5()
 m.update(np.array(range(100)))
 m.update(np.array(range(200)))

 I would recommend doing this on the strings before you make arrays
 from them. You don't know if the network cut out in the middle of an
 8-byte double.

 Of course, sending the lengths and other metadata first, then the data
 would let you check without needing to do expensivish hashes or
 checksums. If truncation is your problem rather than corruption, then
 that would be sufficient. You may also consider using the NPY format
 in numpy 1.2 to implement that.

   
Thanks for the ideas. I'm definitely going to add some more basic checks 
on lengths etc. as well.
Unfortunately the problem is happening at a client site, so (a) I can't 
reproduce it and (b) most of the time they can't reproduce it either. 
This is a Windows Python app running on Citrix, reading/writing data 
to a Linux networked drive.

Brennan


___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion


Re: [Numpy-discussion] Py3k and numpy

2008-12-04 Thread Lisandro Dalcin
From my experience working on my own projects and Cython:

* The C code making Python C-API calls could be made version-agnostic
by using preprocessor macros, and even some compatibility header
conditionally included. Perhaps the latter would be the easiest for
C-API calls (we have a lot already distilled in the Cython sources).
Preprocessor conditionals would still be needed when filling structs.

* Regarding Python code, I believe the only sane way to go is to make
the 2to3 tool convert all the 2.x code to 3.x correctly.

* The all-new buffer interface as implemented in core Py3.0 needs
careful review and fixes.

* The now-all-strings-are-unicode change is going to cause some headaches ;-)

* No idea how to deal with the now-all-integers-are-python-longs.



On Thu, Dec 4, 2008 at 5:20 AM, Erik Tollerud [EMAIL PROTECTED] wrote:
 I noticed that the Python 3000 final was released today... is there
 any sense of how long it will take to get numpy working under 3k?  I
 would imagine it'll be a lot to adapt given the low-level change, but
 is the work already in progress?
 ___
 Numpy-discussion mailing list
 Numpy-discussion@scipy.org
 http://projects.scipy.org/mailman/listinfo/numpy-discussion




-- 
Lisandro Dalcín
---
Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
PTLC - Güemes 3450, (3000) Santa Fe, Argentina
Tel/Fax: +54-(0)342-451.1594
___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion