Re: [Numpy-discussion] Should object arrays have a buffer interface?

2008-12-29 Thread Andreas Klöckner
On Monday 29 December 2008, Robert Kern wrote:
 You could wrap the wrappers in Python and check the dtype. You'd have
 a similar bug if you passed a wrong non-object dtype, too.
 Checking/communicating the dtype is something you always have to do
 when using the 2.x buffer protocol. I'm inclined not to make object a
 special case. When you ask for the raw bytes, you should get the raw
 bytes.
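
A minimal sketch of the suggested Python-side check (the function names here are
made up for illustration, not an existing API):

    import numpy as np

    def _raw_view(arr):
        # Stand-in for the C-level consumer of the buffer: reinterpret
        # the array's memory as raw bytes.
        return np.frombuffer(arr.data, dtype=np.uint8)

    def pass_raw_bytes(arr):
        # Object arrays hold PyObject pointers, not the objects' data,
        # so refuse them here rather than special-casing them in the
        # buffer protocol itself.
        if arr.dtype == np.object_:
            raise TypeError("refusing to expose an object array's raw bytes")
        return _raw_view(arr)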

Ok, fair enough.

Andreas




[Numpy-discussion] Alternative to record array

2008-12-29 Thread Jean-Baptiste Rudant
Hello,

I like to use record arrays to access fields by their name, and because they
are easy to use with PyTables. But I think it's not very efficient for what I
have to do. Maybe I'm misunderstanding something.

Example:

import numpy as np
age = np.random.randint(0, 99, 10e6)
weight = np.random.randint(0, 200, 10e6)
data = np.rec.fromarrays((age, weight), names='age, weight')
# the kind of operation I do is:
data.age += 1
# but it's far less efficient than doing:
age += 1
# because I think the record array stores [(age_0, weight_0) ... (age_n, weight_n)]
# and not [age_0 ... age_n] then [weight_0 ... weight_n].

So I think I'm not using record arrays for the right purpose. I only need
something that would make it easy for me to manipulate data by accessing
fields by their name.

Am I wrong? Is there something in numpy for my purpose? Do I have to
implement my own class, with something like:



class FieldArray:
    def __init__(self, array_dict):
        self.array_list = array_dict

    def __getitem__(self, field):
        return self.array_list[field]

    def __setitem__(self, field, value):
        self.array_list[field] = value

my_arrays = {'age': age, 'weight': weight}
data = FieldArray(my_arrays)

data['age'] += 1

Thank you for the help,

Jean-Baptiste Rudant




Re: [Numpy-discussion] Alternative to record array

2008-12-29 Thread Jim Vickroy

Jean-Baptiste Rudant wrote:

Hello,

I like to use record arrays to access fields by their name, and
because they are easy to use with PyTables. But I think it's not very
efficient for what I have to do. Maybe I'm misunderstanding something.


Example:

import numpy as np
age = np.random.randint(0, 99, 10e6)
weight = np.random.randint(0, 200, 10e6)
data = np.rec.fromarrays((age, weight), names='age, weight')
# the kind of operation I do is:
data.age += 1
# but it's far less efficient than doing:
age += 1
# because I think the record array stores [(age_0, weight_0) ... (age_n, weight_n)]
# and not [age_0 ... age_n] then [weight_0 ... weight_n].
Sorry, I am not able to answer your question; I am really a new user of
numpy myself.


It does seem the addition operation is more than 4 times slower when
using record arrays, based on the following:


>>> import numpy, sys, timeit
>>> sys.version
'2.5.2 (r252:60911, Feb 21 2008, 13:11:45) [MSC v.1310 32 bit (Intel)]'
>>> numpy.__version__
'1.2.1'
>>> count = 10e6
>>> ages  = numpy.random.randint(0,100,count)
>>> weights = numpy.random.randint(1,200,count)
>>> data = numpy.rec.fromarrays((ages,weights),names='ages,weights')

>>> timer = timeit.Timer('data.ages += 1','from __main__ import data')
>>> timer.timeit(number=100)
30.110649537860262

>>> timer = timeit.Timer('ages += 1','from __main__ import ages')
>>> timer.timeit(number=100)
6.9850710076280507



So I think I'm not using record arrays for the right purpose. I only
need something that would make it easy for me to manipulate data by
accessing fields by their name.

Am I wrong? Is there something in numpy for my purpose? Do I have to
implement my own class, with something like:



class FieldArray:
    def __init__(self, array_dict):
        self.array_list = array_dict

    def __getitem__(self, field):
        return self.array_list[field]

    def __setitem__(self, field, value):
        self.array_list[field] = value

my_arrays = {'age': age, 'weight': weight}
data = FieldArray(my_arrays)

data['age'] += 1

Thank you for the help,

Jean-Baptiste Rudant









Re: [Numpy-discussion] Alternative to record array

2008-12-29 Thread Ryan May
Jean-Baptiste Rudant wrote:
 Hello,
 
 I like to use record arrays to access fields by their name, and because
 they are easy to use with PyTables. But I think it's not very efficient
 for what I have to do. Maybe I'm misunderstanding something.
 
 Example:
 
 import numpy as np
 age = np.random.randint(0, 99, 10e6)
 weight = np.random.randint(0, 200, 10e6)
 data = np.rec.fromarrays((age, weight), names='age, weight')
 # the kind of operation I do is:
 data.age += 1
 # but it's far less efficient than doing:
 age += 1
 # because I think the record array stores [(age_0, weight_0) ... (age_n, weight_n)]
 # and not [age_0 ... age_n] then [weight_0 ... weight_n].
 
 So I think I'm not using record arrays for the right purpose. I only need
 something that would make it easy for me to manipulate data by accessing
 fields by their name.
 
 Am I wrong? Is there something in numpy for my purpose? Do I have to
 implement my own class, with something like:
 
 
 class FieldArray:
     def __init__(self, array_dict):
         self.array_list = array_dict
 
     def __getitem__(self, field):
         return self.array_list[field]
 
     def __setitem__(self, field, value):
         self.array_list[field] = value
 
 my_arrays = {'age': age, 'weight': weight}
 data = FieldArray(my_arrays)
 
 data['age'] += 1

You can accomplish what your FieldArray class does using numpy dtypes:

import numpy as np
dt = np.dtype([('age', np.int32), ('weight', np.int32)])
N = int(10e6)
data = np.empty(N, dtype=dt)
data['age'] = np.random.randint(0, 99, N)
data['weight'] = np.random.randint(0, 200, N)

data['age'] += 1

Timing for recarrays (your code):

In [10]: timeit data.age += 1
10 loops, best of 3: 221 ms per loop

Timing for my example:

In [2]: timeit data['age']+=1
10 loops, best of 3: 150 ms per loop

Hope this helps.

Ryan

-- 
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma


Re: [Numpy-discussion] Alternative to record array

2008-12-29 Thread Pierre GM
Jean-Baptiste,
As you stated, everything depends on what you want to do.
If you need to keep the correspondence age <-> weight for each entry,
then yes, record arrays, or at least flexible-type arrays, are the
best. (The difference between a recarray and a flexible-type array is
that fields can be accessed by attributes (data.age) or by items
(data['age']) with recarrays, but only by items with flexible-type
arrays.)
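
A small illustration of the two access styles (a sketch, reusing the field
names from this thread):

    import numpy as np

    dt = np.dtype([('age', np.int32), ('weight', np.int32)])
    flex = np.zeros(3, dtype=dt)     # flexible-type (structured) array
    rec = flex.view(np.recarray)     # recarray view of the same memory

    flex['age'] += 1                 # item access works for both
    rec.age += 1                     # attribute access only for recarrays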

Using your example, you could very well do:
data['age'] += 1
and still keep the correspondence age <-> weight.

Your FieldArray class returns an object that is not an ndarray, which
may have some undesired side effects.
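
For instance, a small sketch of the kind of thing that stops working (the
FieldArray here is the one defined earlier in the thread, trimmed to two
methods):

    import numpy as np

    class FieldArray:
        def __init__(self, array_dict):
            self.array_list = array_dict
        def __getitem__(self, field):
            return self.array_list[field]

    data = FieldArray({'age': np.arange(5), 'weight': np.arange(5)})
    print data['age'].mean()   # fine: each column is still an ndarray
    print np.asarray(data)     # not an ndarray: you get a 0-d object array
    # data[2:4]                # TypeError: the dict cannot be indexed by a slice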

As Ryan noted, flexible-type arrays are usually faster, because they
lack the overhead brought by the possibility of accessing data by
attributes. So, if you don't mind using the 'access-by-field' syntax,
you're good to go.


On Dec 29, 2008, at 10:58 AM, Jean-Baptiste Rudant wrote:

 Hello,

 I like to use record arrays to access fields by their name, and
 because they are easy to use with PyTables. But I think it's not
 very efficient for what I have to do. Maybe I'm misunderstanding
 something.

 Example:

 import numpy as np
 age = np.random.randint(0, 99, 10e6)
 weight = np.random.randint(0, 200, 10e6)
 data = np.rec.fromarrays((age, weight), names='age, weight')
 # the kind of operation I do is:
 data.age += 1
 # but it's far less efficient than doing:
 age += 1
 # because I think the record array stores [(age_0, weight_0) ... (age_n, weight_n)]
 # and not [age_0 ... age_n] then [weight_0 ... weight_n].

 So I think I'm not using record arrays for the right purpose. I only
 need something that would make it easy for me to manipulate data by
 accessing fields by their name.

 Am I wrong? Is there something in numpy for my purpose? Do I have
 to implement my own class, with something like:


 class FieldArray:
     def __init__(self, array_dict):
         self.array_list = array_dict

     def __getitem__(self, field):
         return self.array_list[field]

     def __setitem__(self, field, value):
         self.array_list[field] = value

 my_arrays = {'age': age, 'weight': weight}
 data = FieldArray(my_arrays)

 data['age'] += 1

 Thank you for the help,

 Jean-Baptiste Rudant
   






Re: [Numpy-discussion] Thoughts on persistence/object tracking in scientific code

2008-12-29 Thread Luis Pedro Coelho
Hello,

I coincidentally started my own implementation of a system to manage
intermediate results last week, which I called jug. I wasn't planning to make
such an alpha version public just now, but it seems to be on topic.

The main idea is to use hashes to map function arguments to paths on the
filesystem, which store the result (nothing extraordinary here). I also added
the capability of having tasks (the basic unit) take the results of other
tasks, which defines an implicit dependency DAG. A simple locking mechanism
enables light-weight task-level parallelization (this was the second of my
goals: help me make my stuff parallel).

A trick that helps is that I don't actually use the argument values for the
hash (which would be unwieldy for big arrays). I use the computation path
(e.g., "this is the value obtained from f(g('something'), 2)"). Since, at
least in my problems, things tend to always map back into simple file-system
paths, the hash computation doesn't even need to load the intermediate
results.

I will make the git repository publicly available once I figure out how to do 
that.

I append the tutorial I wrote, which explains the system.

HTH,
Luís Pedro Coelho
PhD Student in Computational Biology
Carnegie Mellon University


Jug Tutorial
============

What is jug?
------------

Jug is a simple way to write easily parallelisable programs in Python. It also 
handles intermediate results for you.

Example
-------

This is a simple worked-through example which illustrates what jug does.

Problem
~~~~~~~

Assume that I want to do the following to a collection of images:

(1) for each image, compute some features
(2) cluster these features using k-means. In order to find out the number 
of clusters, I try several values and pick the best result. For each value of 
k, because of the random initialisation, I run the clustering 10 times.

I could write the following simple code:

::

    imgs = glob('*.png')
    features = [computefeatures(img, parameter=2) for img in imgs]
    clusters = []
    bics = []
    for k in xrange(2, 200):
        for repeat in xrange(10):
            clusters.append(kmeans(features, k=k, random_seed=repeat))
            bics.append(compute_bic(clusters[-1]))
    Nr_clusters = argmin(bics) // 10

Very simple and solves the problem. However, if I want to take advantage of 
the obvious parallelisation of the problem, then I need to write much more 
complicated code. My traditional approach is to break this down into smaller 
scripts. I'd have one to compute features for some images, I'd have another to 
merge all the results together and do some of the clustering, and, finally, one 
to merge all the results of the different clusterings. These would need to be 
called with different parameters to explore different areas of the parameter 
space, so I'd have a couple of scripts just for calling the main computation 
scripts. Intermediate results would be saved and loaded by the different 
processes.

This has several problems. The biggest are

(1) The need to manage intermediate files. These are normally files with 
long names like *features_for_img_0_with_parameter_P.pp*.
(2) The code gets much more complex.

There are minor issues with having to issue several jobs (and having the 
cluster be idle in the meanwhile), or deciding on how to partition the jobs so 
that they take roughly the same amount of time, but the two above are the main 
ones.

Jug solves all these problems!

Tasks
~~~~~

The main unit of jug is a Task. Any function can be used to generate a Task. A 
Task can depend on the results of other Tasks.

The original idea for jug was a Makefile-like environment for declaring Tasks. 
I have moved beyond that, but it might help you think about what Tasks are.

You create a Task by giving it a function which performs the work and its 
arguments. The arguments can be either literal values or other tasks (in which 
case, the function will be called with the *result* of those tasks!). Jug also 
understands lists of tasks (all standard Python containers will be supported 
in a later version). For example, the following code declares the necessary 
tasks for our problem:

::

    imgs = glob('*.png')
    feature_tasks = [Task(computefeatures, img, parameter=2) for img in imgs]
    cluster_tasks = []
    bic_tasks = []
    for k in xrange(2, 200):
        for repeat in xrange(10):
            cluster_tasks.append(Task(kmeans, feature_tasks, k=k, random_seed=repeat))
            bic_tasks.append(Task(compute_bic, cluster_tasks[-1]))
    Nr_clusters = Task(argmin, bic_tasks)

Task Generators
~~~~~~~~~~~~~~~

In the code above, there is a lot of code of the form *Task(function,args)*, 
so maybe it should read *function(args)*.  A simple helper function aids this 
process:

::

    from jug.task import Task

    def TaskGenerator(function):
        def gen(*args, **kwargs):
            return Task(function, *args, **kwargs)
        return gen

computefeatures = 

Re: [Numpy-discussion] Thoughts on persistence/object tracking in scientific code

2008-12-29 Thread Luis Pedro Coelho
On Monday 29 December 2008 14:51:48 Luis Pedro Coelho wrote:
 I will make the git repository publicly available once I figure out how to
 do that.

You can get my code with:

git clone http://coupland.cbi.cmu.edu/jug

As I said, I consider this alpha code and am only making it publicly available 
at this stage because it came up. The license is LGPL.

bye,
Luis


Re: [Numpy-discussion] Thoughts on persistence/object tracking in scientific code

2008-12-29 Thread Zachary Pincus
This looks really cool -- thanks Luis.

Definitely keep us posted as this progresses, too.

Zach


On Dec 29, 2008, at 4:41 PM, Luis Pedro Coelho wrote:

 On Monday 29 December 2008 14:51:48 Luis Pedro Coelho wrote:
 I will make the git repository publicly available once I figure out  
 how to
 do that.

 You can get my code with:

 git clone http://coupland.cbi.cmu.edu/jug

 As I said, I consider this alpha code and am only making it publicly  
 available
 at this stage because it came up. The license is LGPL.

 bye,
 Luis


Re: [Numpy-discussion] Thoughts on persistence/object tracking in scientific code

2008-12-29 Thread Gael Varoquaux
Hi Luis,

On Mon, Dec 29, 2008 at 02:51:48PM -0500, Luis Pedro Coelho wrote:
 I coincidently started my own implementation of a system to manage
 intermediate results last week, which I called jug. I wasn't planning
 to make such an alpha version public just now, but it seems to be on
 topic.

Thanks for your input. This confirms my hunch that these problems
are universal.

It is interesting to see that you take a slightly different approach than
the others already discussed. This probably stems from the fact that you
are mostly interested in parallelism, whereas there are other adjacent
problems that can be solved by similar abstractions. In particular, I
have the impression that you do not deal with what I call
lazy re-evaluation. In other words, I am not sure if you track results
enough to know whether an intermediate result should be re-run, or if you
run a 'clean' between each run to avoid this problem.

I must admit I moved away from using hashes to store objects on disk
because I am very much interested in traceability, and I wanted my
objects to have meaningful names and to be stored in convenient formats
(pickle, numpy .npy, HDF5, or domain-specific). I have now realized that
explicit naming is convenient, but it should be optional.

Your task-based approach, and the API you have built around it, reminds
me a bit of Twisted's Deferred. Have you studied that API?

 A trick that helps is that I don't actually use the argument values for the
 hash (which would be unwieldy for big arrays). I use the computation path
 (e.g., "this is the value obtained from f(g('something'), 2)"). Since, at
 least in my problems, things tend to always map back into simple file-system
 paths, the hash computation doesn't even need to load the intermediate
 results.

I did notice too that using the argument values for hashing was bound to
fail in all but the simplest cases. This is the immediate limitation of
the famous memoize pattern when applied to scientific code. If I
understand correctly, what you do is track the 'history' of the
object and use it as a hash for the object, right? I had come to the
conclusion that the history of objects should be tracked, but I hadn't
realized that using it as a hash was also a good way to solve the scoping
problem. Thanks for the trick.

Would you consider making the code BSD? Because I want to be able to
reuse my code in non-open-source projects, and because I do not want to
lock out contributors or to ask for copyright assignment, I like to
keep all my code BSD, as do all the mainstream scientific Python projects.

I'll start writing up a wiki page with all the different lessons learned and
use cases that come from all this interesting feedback.

Cheers,

Gaël


Re: [Numpy-discussion] Alternative to record array

2008-12-29 Thread Francesc Alted
On Monday 29 December 2008, Jean-Baptiste Rudant wrote:
 Hello,

 I like to use record arrays to access fields by their name, and
 because they are easy to use with PyTables. But I think it's not very
 efficient for what I have to do. Maybe I'm misunderstanding
 something.

 Example:

 import numpy as np
 age = np.random.randint(0, 99, 10e6)
 weight = np.random.randint(0, 200, 10e6)
 data = np.rec.fromarrays((age, weight), names='age, weight')
 # the kind of operation I do is:
 data.age += 1
 # but it's far less efficient than doing:
 age += 1
 # because I think the record array stores [(age_0, weight_0) ... (age_n, weight_n)]
 # and not [age_0 ... age_n] then [weight_0 ... weight_n].

 So I think I'm not using record arrays for the right purpose. I only
 need something that would make it easy for me to manipulate data by
 accessing fields by their name.

 Am I wrong? Is there something in numpy for my purpose? Do I have
 to implement my own class, with something like:



 class FieldArray:
     def __init__(self, array_dict):
         self.array_list = array_dict

     def __getitem__(self, field):
         return self.array_list[field]

     def __setitem__(self, field, value):
         self.array_list[field] = value

 my_arrays = {'age': age, 'weight': weight}
 data = FieldArray(my_arrays)

 data['age'] += 1

That's a very good question.  What you are observing are the effects of
arranging a dataset by records (row-wise) or by columns (column-wise).
A record array in numpy arranges data by record, so that in your 'data'
array the data is placed in memory as follows:

data['age'][0] --> data['weight'][0] -->
data['age'][1] --> data['weight'][1] -->
...

while in your 'FieldArray' class, data is arranged by column and is
placed in memory as:

data['age'][0] --> data['age'][1] --> ... -->
data['weight'][0] --> data['weight'][1] --> ...

The difference between the two approaches is that the row-wise arrangement
is more efficient when data is iterated record by record, while the
column-wise one is more efficient when data is iterated column by column,
as you do here.  This is why you are seeing the 4x increase in performance.
Incidentally, by looking at both data arrangements, I'd expect an increase
of just 2x (the stride count is 2 in this case), but I suspect that there
are hidden copies during the increment operation for the record array case.
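
A quick way to see the layout difference is to compare the strides of a
column taken from a record array with those of a contiguous copy (a small
sketch, not from the original message):

    import numpy as np

    dt = np.dtype([('age', np.int32), ('weight', np.int32)])
    rec = np.zeros(5, dtype=dt)

    print rec['age'].strides         # (8,): age values are 8 bytes apart,
                                     # interleaved with weight (row-wise)
    print rec['age'].copy().strides  # (4,): a contiguous column (column-wise)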

So you are perfectly right.  In some situations you may want to use a
row-wise arrangement (record array) and in other situations a
column-wise one.  So, it would be handy to have some code to convert
back and forth between both data arrangements.  Here are a couple
of classes for doing this (they are a quick-and-dirty generalization of
your code):

import numpy as np

class ColArray:
    def __init__(self, recarray):
        dictarray = {}
        if isinstance(recarray, np.ndarray):
            fields = recarray.dtype.fields
        elif isinstance(recarray, RecArray):
            fields = recarray.fields
        else:
            raise TypeError, "Unrecognized input type!"
        for colname in fields:
            # For optimum performance you should 'copy' the column!
            dictarray[colname] = recarray[colname].copy()
        self.dictarray = dictarray

    def __getitem__(self, field):
        return self.dictarray[field]

    def __setitem__(self, field, value):
        self.dictarray[field] = value

    def iteritems(self):
        return self.dictarray.iteritems()


class RecArray:
    def __init__(self, dictarray):
        ldtype = []
        fields = []
        for colname, column in dictarray.iteritems():
            ldtype.append((colname, column.dtype))
            fields.append(colname)
            collen = len(column)
        dt = np.dtype(ldtype)
        recarray = np.empty(collen, dtype=dt)
        for colname, column in dictarray.iteritems():
            recarray[colname] = column
        self.recarray = recarray
        self.fields = fields

    def __getitem__(self, field):
        return self.recarray[field]

    def __setitem__(self, field, value):
        self.recarray[field] = value

So, ColArray takes as a parameter a record array or a RecArray instance that
has a row-wise arrangement and returns an object that is column-wise.
RecArray does the inverse trip on the ColArray that it takes as a parameter.

A small example of use:

from time import time

N = int(10e6)
age = np.random.randint(0, 99, N)
weight = np.random.randint(0, 200, N)

# Get an initial record array
dt = np.dtype([('age', np.int_), ('weight', np.int_)])
data = np.empty(N, dtype=dt)
data['age'] = age
data['weight'] = weight

t1 = time()
data['age'] += 1
print "time for initial recarray:", round(time()-t1, 3)

data = ColArray(data)
t1 = time()
data['age'] += 1
print "time for ColArray:", round(time()-t1, 3)

data = RecArray(data)
t1 = time()
data['age'] += 1
print "time for reconstructed RecArray:", round(time()-t1, 3)

data = ColArray(data)
t1 = time()
data['age'] += 1
print "time for reconstructed ColArray:", round(time()-t1, 3)
and the 

Re: [Numpy-discussion] Thoughts on persistence/object tracking in scientific code

2008-12-29 Thread Luis Pedro Coelho
Hello all,

On Monday 29 December 2008 17:40:07 Gael Varoquaux wrote:
 It is interesting to see that you take a slightly different approach than
 the others already discussed. This probably stems from the fact that you
 are mostly interested in parallelism, whereas there are other adjacent
 problems that can be solved by similar abstractions. In particular, I
 have the impression that you do not deal with what I call
 lazy re-evaluation. In other words, I am not sure if you track results
 enough to know whether an intermediate result should be re-run, or if you
 run a 'clean' between each run to avoid this problem.

I do. As long as the hash (the arguments to the function) is the same, the 
code loads objects from disk instead of computing results. I don't track the 
actual source code, though, only whether parameters have changed (but this 
could be a later addition).
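
A rough sketch of that load-or-compute behaviour (not jug's actual code;
Task.run(), task_hash() and the 'jugdata' directory name are made-up
stand-ins):

    import os
    import cPickle as pickle

    def run_or_load(task, cache_dir='jugdata'):
        # Results are keyed by the task's hash; if a result file already
        # exists, load it from disk instead of recomputing.
        path = os.path.join(cache_dir, task_hash(task))
        if os.path.exists(path):
            return pickle.load(open(path, 'rb'))
        result = task.run()
        if not os.path.isdir(cache_dir):
            os.makedirs(cache_dir)
        pickle.dump(result, open(path, 'wb'), -1)
        return result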

 I must admit I moved away from using hashes to store objects on disk
 because I am very much interested in traceability, and I wanted my
 objects to have meaningful names and to be stored in convenient formats
 (pickle, numpy .npy, HDF5, or domain-specific). I have now realized that
 explicit naming is convenient, but it should be optional.

But using a hash is not so impenetrable as long as you can easily get to the 
files you want.

If I want to load the results of a partial computation, all I have to do is to 
generate the same Task objects as the initial computation and load those: I 
can run the jugfile.py inside ipython and call the appropriate load() methods.

ipython jugfile.py

: interesting = [t for t in tasks if t.name == 'something.other']
: intermediate = interesting[0].load()

 I did notice too that using the argument values for hashing was bound to
 fail in all but the simplest cases. This is the immediate limitation of
 the famous memoize pattern when applied to scientific code. If I
 understand correctly, what you do is track the 'history' of the
 object and use it as a hash for the object, right? I had come to the
 conclusion that the history of objects should be tracked, but I hadn't
 realized that using it as a hash was also a good way to solve the scoping
 problem. Thanks for the trick.

Yes, let's say I have the following:

feats = [Task(features, img) for img in glob('*.png')]
cluster = Task(kmeans, feats, k=10)

then the hash for cluster is computed from its arguments:

* kmeans: the function name
* feats: this is a list of tasks, therefore I use their hashes, which are
  in turn defined by their arguments (each a simple filename string).
* k=10: this is a literal.

I don't need to use the value computed by feats to compute the hash for
cluster.
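
A minimal sketch of that recursive hashing idea (not jug's actual code; the
Task class below is a bare stand-in):

    import hashlib

    class Task(object):
        def __init__(self, f, *args, **kwargs):
            self.name = f.__name__
            self.args = args
            self.kwargs = kwargs

    def task_hash(obj):
        # Hash the computation path, not computed values: a task hashes to
        # its function name plus the hashes of its arguments; literals hash
        # to their repr(); lists hash element by element.
        h = hashlib.sha1()
        if isinstance(obj, Task):
            h.update(obj.name)
            for arg in obj.args:
                h.update(task_hash(arg))
            for key in sorted(obj.kwargs):
                h.update(key)
                h.update(task_hash(obj.kwargs[key]))
        elif isinstance(obj, (list, tuple)):
            for item in obj:
                h.update(task_hash(item))
        else:
            h.update(repr(obj))
        return h.hexdigest()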

 Your task-based approach, and the API you have built around it, reminds
 me a bit of Twisted's Deferred. Have you studied that API?

No. I will look into it. Thanks.

bye,
Luis


Re: [Numpy-discussion] formatting issues, locale and co

2008-12-29 Thread David Cournapeau
On Mon, Dec 29, 2008 at 4:36 PM, Charles R Harris
charlesr.har...@gmail.com wrote:


 On Sun, Dec 28, 2008 at 10:35 PM, David Cournapeau
 da...@ar.media.kyoto-u.ac.jp wrote:

 Charles R Harris wrote:
 
 
 
  I put my yesterday work in the fix_float_format branch:
   - it fixes the locale issue
   - it fixes the long double issue on windows.
   - it also fixes some tests (we were not testing single precision
  formatting but twice double precision instead - the single precision
  test fails on the trunk BTW).
 
 
  Curious, I don't see any test failures here. Were the tests actually
  being run or is something else different in your test setup? Or do you
  mean the fixed up test fails.

 The latter: if you look at numpy/core/tests/test_print, you will see that
  the types tested are np.float, np.double and np.longdouble, but at least
  on linux, np.float == np.double, and np.float32 is what we want to test
  here instead, I suppose.

 
  Expected, but I would like to see it change because it is kind of
  frustrating. Fixing it probably involves setting a function pointer in
  the type definition but I am not sure about that.

 Hm, it took me a while to get this, but print np.float32(value) can be
 controlled through tp_print. Still, it does not work in all cases:

 print np.float32(a)        -> calls tp_print
 print '%f' % np.float32(a) -> does not call tp_print (nor
 tp_str/tp_repr). I have no idea what's going on there.

 I'll bet it's calling a conversion to python float, i.e., double, because of
 the %f.

Yes, I meant that I did not understand the code path in that case. I
realize that I don't know how to get the (C) call graph between two
code points in python; that would be useful. Where are you, dtrace on
linux, when I need you :)


 In [1]: '%s' % np.float32(1)
 Out[1]: '1.0'

 In [2]: '%f' % np.float32(1)
 Out[2]: '1.000000'

 I don't see any way to work around that without changing the way the python
 formatting works.

Yes, I think you're right. Especially since python itself is not
consistent. On python 2.6, windows:

a = complex('inf')
print a          # -> prints inf
print '%s' % a   # -> prints inf
print '%f' % a   # -> prints 1.#INF

Which suggests that in that case, it gets directly to stdio without
much formatting work from python. Maybe it is an oversight? Anyway, I
think it would be useful to override the tp_print member (to avoid
'print a' printing 1.#INF).

cheers,

David


Re: [Numpy-discussion] formatting issues, locale and co

2008-12-29 Thread Charles R Harris
On Mon, Dec 29, 2008 at 8:12 PM, David Cournapeau courn...@gmail.com wrote:

 On Mon, Dec 29, 2008 at 4:36 PM, Charles R Harris
 charlesr.har...@gmail.com wrote:
 
 
  On Sun, Dec 28, 2008 at 10:35 PM, David Cournapeau
  da...@ar.media.kyoto-u.ac.jp wrote:
 
  Charles R Harris wrote:
  
  
  
   I put my yesterday work in the fix_float_format branch:
- it fixes the locale issue
- it fixes the long double issue on windows.
- it also fixes some tests (we were not testing single precision
   formatting but twice double precision instead - the single
 precision
   test fails on the trunk BTW).
  
  
   Curious, I don't see any test failures here. Were the tests actually
   being run or is something else different in your test setup? Or do you
   mean the fixed up test fails.
 
  The latter: if you look at numpy/core/tests/test_print, you will see that
   the types tested are np.float, np.double and np.longdouble, but at least
   on linux, np.float == np.double, and np.float32 is what we want to test
   here instead, I suppose.
 
  
   Expected, but I would like to see it change because it is kind of
   frustrating. Fixing it probably involves setting a function pointer in
   the type definition but I am not sure about that.
 
  Hm, it took me a while to get this, but print np.float32(value) can be
  controlled through tp_print. Still, it does not work in all cases:

  print np.float32(a)        -> calls tp_print
  print '%f' % np.float32(a) -> does not call tp_print (nor
  tp_str/tp_repr). I have no idea what's going on there.
 
  I'll bet it's calling a conversion to python float, i.e., double, because
  of the %f.

 Yes, I meant that I did not understand the code path in that case. I
 realize that I don't know how to get the (C) call graph between two
 code points in python; that would be useful. Where are you, dtrace on
 linux, when I need you :)


I'm not sure we are quite on the same page here. The float32 object has a
convert-to-python-float method (which I don't recall at the moment and I
don't have the source to hand). So when %f appears in the format string, that
method is called and the resulting python float is formatted in the python
way. Same with %s, only __str__ is called instead.
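
That conversion can be checked interactively (a small sketch; the conversion
used by '%f' goes through __float__):

    import numpy as np

    x = np.float32(1)
    print x.__float__()    # the Python float that '%f' ends up formatting
    print '%f' % x         # same result as '%f' % float(x)
    print '%s' % x         # goes through numpy's own __str__ instead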


 
  In [1]: '%s' % np.float32(1)
  Out[1]: '1.0'
 
  In [2]: '%f' % np.float32(1)
  Out[2]: '1.000000'
 
  I don't see any way to work around that without changing the way the
  python formatting works.

 Yes, I think you're right. Especially since python itself is not
 consistent. On python 2.6, windows:

 a = complex('inf')
 print a          # -> prints inf
 print '%s' % a   # -> prints inf
 print '%f' % a   # -> prints 1.#INF


How does a python inf display on windows?



 Which suggests that in that case, it gets directly to stdio without
 much formatting work from python. Maybe it is an oversight? Anyway, I
 think it would be useful to override the tp_print member (to avoid
 'print a' printing 1.#INF).


Sounds like the sort of thing the python folks would want to clean up, just
as you have for numpy.

Chuck


Re: [Numpy-discussion] formatting issues, locale and co

2008-12-29 Thread David Cournapeau
Charles R Harris wrote:



 Yes, I meant that I did not understand the code path in that case. I
 realize that I don't know how to get the (C) call graph between two
 code points in python; that would be useful. Where are you, dtrace on
 linux, when I need you :)


 I'm not sure we are quite on the same page here.

Yep, indeed. I think my bogus example did not help :) The right test
script uses float('inf'), not complex('inf').

 The float32 object has a convert-to-python-float method (which I
 don't recall at the moment and I don't have the source to hand). So
 when %f appears in the format string, that method is called and the
 resulting python float is formatted in the python way.

I think that's not the case for '%f', because the 'python' way is to
print 'inf', not '1.#INF' (at least on 2.6 - on 2.5, it is always
'1.#INF' on windows). If you use a pure C program on windows, you will
get '1.#INF', etc. instead of 'inf'. repr, str and print all call the C
format_float function, which takes care of formatting 'inf' and co the
'python' way.

So getting '1.#INF' from python suggests to me that python does not format
it in the '%f' case - and I don't know the code path at that point. For
'%s', it goes through tp_str, and for print a, it goes through tp_print, but
for '%f'?



 a = complex('inf')
 print a          # -> prints inf
 print '%s' % a   # -> prints inf
 print '%f' % a   # -> prints 1.#INF


 How does a python inf display on windows?

As stated: it depends. 'inf' or '1.#INF', the latter being the same as
the formatting done within the MS runtime.

  


 Which suggests that in that case, it gets directly to stdio without
 much formatting work from python. Maybe it is an oversight? Anyway, I
 think it would be useful to override the tp_print member (to avoid
 'print a' printing 1.#INF).


 Sounds like the sort of thing the python folks would want to clean up,
 just as you have for numpy.

The thing is, since I don't understand what happens in the print '%f'
case, I don't know how to clean it up, if it is at all possible. But
anyway, it means that with my changes we are not worse than python
itself, and I think we are better than before.

cheers,

David