Re: [Numpy-discussion] Should object arrays have a buffer interface?
On Monday 29 December 2008, Robert Kern wrote:

> You could wrap the wrappers in Python and check the dtype. You'd have a similar bug if you passed a wrong non-object dtype, too. Checking/communicating the dtype is something you always have to do when using the 2.x buffer protocol. I'm inclined not to make object a special case. When you ask for the raw bytes, you should get the raw bytes.

Ok, fair enough.

Andreas
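A minimal sketch of the Python-level wrapping Robert describes; checked_buffer and its guard are illustrative, not an actual numpy API:

import numpy as np

def checked_buffer(arr):
    # Refuse object arrays up front: their "raw bytes" are just
    # PyObject pointers, not the underlying data.
    if arr.dtype == object:
        raise TypeError("object arrays expose pointers, not data bytes")
    return arr.data  # the buffer the 2.x protocol hands out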
[Numpy-discussion] Alternative to record array
Hello,

I like to use record arrays to access fields by their name, and because they are easy to use with pytables. But I think it's not very efficient for what I have to do. Maybe I'm misunderstanding something. Example:

import numpy as np
age = np.random.randint(0, 99, int(10e6))
weight = np.random.randint(0, 200, int(10e6))
data = np.rec.fromarrays((age, weight), names='age, weight')
# the kind of operations I do is:
data.age += 1
# but it's far less efficient than doing:
age += 1
# because I think the record array stores [(age_0, weight_0) ... (age_n, weight_n)]
# and not [age_0 ... age_n] then [weight_0 ... weight_n].

So I think I don't use record arrays for the right purpose. I only need something which would make it easy for me to manipulate data by accessing fields by their name. Am I wrong? Is there something in numpy for my purpose? Do I have to implement my own class, with something like:

class FieldArray:
    def __init__(self, array_dict):
        self.array_list = array_dict

    def __getitem__(self, field):
        return self.array_list[field]

    def __setitem__(self, field, value):
        self.array_list[field] = value

my_arrays = {'age': age, 'weight': weight}
data = FieldArray(my_arrays)
data['age'] += 1

Thank you for the help,

Jean-Baptiste Rudant
Re: [Numpy-discussion] Alternative to record array
Jean-Baptiste Rudant wrote:

> I like to use record arrays to access fields by their name, and because they are easy to use with pytables. But I think it's not very efficient for what I have to do. [...]

Sorry I am not able to answer your question; I am really a new user of numpy also. It does seem the addition operation is more than 4 times slower when using record arrays, based on the following:

>>> import numpy, sys, timeit
>>> sys.version
'2.5.2 (r252:60911, Feb 21 2008, 13:11:45) [MSC v.1310 32 bit (Intel)]'
>>> numpy.__version__
'1.2.1'
>>> count = int(10e6)
>>> ages = numpy.random.randint(0, 100, count)
>>> weights = numpy.random.randint(1, 200, count)
>>> data = numpy.rec.fromarrays((ages, weights), names='ages,weights')
>>> timer = timeit.Timer('data.ages += 1', 'from __main__ import data')
>>> timer.timeit(number=100)
30.110649537860262
>>> timer = timeit.Timer('ages += 1', 'from __main__ import ages')
>>> timer.timeit(number=100)
6.9850710076280507
Re: [Numpy-discussion] Alternative to record array
Jean-Baptiste Rudant wrote:

> I like to use record arrays to access fields by their name, and because they are easy to use with pytables. But I think it's not very efficient for what I have to do. [...] Do I have to implement my own class?

You can accomplish what your FieldArray class does using numpy dtypes:

import numpy as np

dt = np.dtype([('age', np.int32), ('weight', np.int32)])
N = int(10e6)
data = np.empty(N, dtype=dt)
data['age'] = np.random.randint(0, 99, N)
data['weight'] = np.random.randint(0, 200, N)
data['age'] += 1

Timing for recarrays (your code):

In [10]: timeit data.age += 1
10 loops, best of 3: 221 ms per loop

Timing for my example:

In [2]: timeit data['age'] += 1
10 loops, best of 3: 150 ms per loop

Hope this helps.

Ryan

--
Ryan May
Graduate Research Assistant
School of Meteorology
University of Oklahoma
Re: [Numpy-discussion] Alternative to record array
Jean-Baptiste,

As you stated, everything depends on what you want to do. If you need to keep the correspondence age <-> weight for each entry, then yes, record arrays, or at least flexible-type arrays, are the best. (The difference between a recarray and a flexible-type array is that fields can be accessed by attributes (data.age) or items (data['age']) with recarrays, but only with items with flexible-type arrays.) Using your example, you could very well do:

data['age'] += 1

and still keep the correspondence age <-> weight. Your FieldArray class returns an object that is not a ndarray, which may have some undesired side-effects. As Ryan noted, flexible-type arrays are usually faster, because they lack the overhead brought by the possibility of accessing data by attributes. So, if you don't mind using the 'access-by-fields' syntax, you're good to go.

On Dec 29, 2008, at 10:58 AM, Jean-Baptiste Rudant wrote:

> I like to use record arrays to access fields by their name, and because they are easy to use with pytables. [...]
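The recarray/flexible-type distinction is easy to see side by side; a minimal sketch, using a view to move between the two flavours:

import numpy as np

dt = np.dtype([('age', np.int32), ('weight', np.int32)])
flex = np.zeros(5, dtype=dt)    # flexible-type array: item access only
flex['age'] += 1                # works
# flex.age                      # would raise AttributeError

rec = flex.view(np.recarray)    # recarray view: adds attribute access
rec.age += 1                    # works, at some attribute-lookup overhead
rec['age'] += 1                 # item access still available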
Re: [Numpy-discussion] Thoughts on persistence/object tracking in scientific code
Hello,

I coincidentally started my own implementation of a system to manage intermediate results last week, which I called jug. I wasn't planning to make such an alpha version public just now, but it seems to be on topic.

The main idea is to use hashes to map function arguments to paths on the filesystem, which store the result (nothing extraordinary here). I also added the capability of having tasks (the basic unit) take the results of other tasks, defining an implicit dependency DAG. A simple locking mechanism enables light-weight task-level parallelisation (this was the second of my goals: help me make my stuff parallel).

A trick that helps is that I don't really use the argument values to hash (which would be unwieldy for big arrays). I use the computation path (e.g., this is the value obtained from f(g('something'), 2)). Since, at least in my problems, things tend to always map back into simple file-system paths, the hash computation doesn't even need to load the intermediate results.

I will make the git repository publicly available once I figure out how to do that. I append the tutorial I wrote, which explains the system.

HTH,
Luís Pedro Coelho
PhD Student in Computational Biology
Carnegie Mellon University

Jug Tutorial
============

What is jug?
------------

Jug is a simple way to write easily parallelisable programs in Python. It also handles intermediate results for you.

Example
-------

This is a simple worked-through example which illustrates what jug does.

Problem
~~~~~~~

Assume that I want to do the following to a collection of images:

(1) for each image, compute some features
(2) cluster these features using k-means.

In order to find out the number of clusters, I try several values and pick the best result. For each value of k, because of the random initialisation, I run the clustering 10 times.

I could write the following simple code::

    imgs = glob('*.png')
    features = [computefeatures(img, parameter=2) for img in imgs]
    clusters = []
    bics = []
    for k in xrange(2, 200):
        for repeat in xrange(10):
            clusters.append(kmeans(features, k=k, random_seed=repeat))
            bics.append(compute_bic(clusters[-1]))
    Nr_clusters = argmin(bics) // 10

Very simple, and it solves the problem. However, if I want to take advantage of the obvious parallelisation of the problem, then I need to write much more complicated code.

My traditional approach is to break this down into smaller scripts. I'd have one to compute features for some images, I'd have another to merge all the results together and do some of the clustering, and, finally, one to merge all the results of the different clusterings. These would need to be called with different parameters to explore different areas of the parameter space, so I'd have a couple of scripts just for calling the main computation scripts. Intermediate results would be saved and loaded by the different processes.

This has several problems. The biggest are

(1) The need to manage intermediate files. These are normally files with long names like *features_for_img_0_with_parameter_P.pp*.
(2) The code gets much more complex.

There are minor issues with having to issue several jobs (and having the cluster be idle in the meanwhile), or deciding on how to partition the jobs so that they take roughly the same amount of time, but the two above are the main ones.

Jug solves all these problems!

Tasks
~~~~~

The main unit of jug is a Task. Any function can be used to generate a Task. A Task can depend on the results of other Tasks. The original idea for jug was a Makefile-like environment for declaring Tasks. I have moved beyond that, but it might help you think about what Tasks are.

You create a Task by giving it a function which performs the work and its arguments. The arguments can be either literal values or other tasks (in which case, the function will be called with the *result* of those tasks!). Jug also understands lists of tasks (all standard Python containers will be supported in a later version).

For example, the following code declares the necessary tasks for our problem::

    imgs = glob('*.png')
    feature_tasks = [Task(computefeatures, img, parameter=2) for img in imgs]
    cluster_tasks = []
    bic_tasks = []
    for k in xrange(2, 200):
        for repeat in xrange(10):
            cluster_tasks.append(Task(kmeans, feature_tasks, k=k, random_seed=repeat))
            bic_tasks.append(Task(compute_bic, cluster_tasks[-1]))
    Nr_clusters = Task(argmin, bic_tasks)

Task Generators
~~~~~~~~~~~~~~~

In the code above, there is a lot of code of the form *Task(function, args)*, so maybe it should read *function(args)*. A simple helper function aids this process::

    from jug.task import Task

    def TaskGenerator(function):
        def gen(*args, **kwargs):
            return Task(function, *args, **kwargs)
        return gen

    computefeatures = TaskGenerator(computefeatures)
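With the functions wrapped this way, the task declarations read almost exactly like the original sequential code. A sketch, assuming kmeans, compute_bic, and argmin have each been wrapped with TaskGenerator as well (each call then returns a Task rather than a result)::

    imgs = glob('*.png')
    feature_tasks = [computefeatures(img, parameter=2) for img in imgs]
    cluster_tasks = []
    bic_tasks = []
    for k in xrange(2, 200):
        for repeat in xrange(10):
            cluster_tasks.append(kmeans(feature_tasks, k=k, random_seed=repeat))
            bic_tasks.append(compute_bic(cluster_tasks[-1]))
    Nr_clusters = argmin(bic_tasks)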
Re: [Numpy-discussion] Thoughts on persistence/object tracking in scientific code
On Monday 29 December 2008 14:51:48 Luis Pedro Coelho wrote:

> I will make the git repository publicly available once I figure out how to do that.

You can get my code with:

git clone http://coupland.cbi.cmu.edu/jug

As I said, I consider this alpha code and am only making it publicly available at this stage because it came up. The license is LGPL.

bye,
Luis
Re: [Numpy-discussion] Thoughts on persistence/object tracking in scientific code
This looks really cool -- thanks Luis. Definitely keep us posted as this progresses, too.

Zach

On Dec 29, 2008, at 4:41 PM, Luis Pedro Coelho wrote:

> You can get my code with: git clone http://coupland.cbi.cmu.edu/jug [...]
Re: [Numpy-discussion] Thoughts on persistence/object tracking in scientific code
Hi Luis,

On Mon, Dec 29, 2008 at 02:51:48PM -0500, Luis Pedro Coelho wrote:

> I coincidentally started my own implementation of a system to manage intermediate results last week, which I called jug. I wasn't planning to make such an alpha version public just now, but it seems to be on topic.

Thanks for your input. This comforts me in my hunch that these problems are universal. It is interesting to see that you take a slightly different approach than the others already discussed. This probably stems from the fact that you are mostly interested in parallelism, whereas there are other adjacent problems that can be solved by similar abstractions. In particular, I have the impression that you do not deal with what I call lazy re-evaluation. In other words, I am not sure if you track results enough to know whether an intermediate result should be re-run, or if you run a 'clean' between each run to avoid this problem.

I must admit I went away from using hashes to store objects to disk because I am very much interested in traceability, and I wanted my objects to have meaningful names, and to be stored in convenient formats (pickle, numpy .npy, hdf5, or domain-specific). I have now realized that explicit naming is convenient, but it should be optional.

Your task-based approach, and the API you have built around it, reminds me a bit of Twisted's Deferred. Have you studied this API?

> A trick that helps is that I don't really use the argument values to hash (which would be unwieldy for big arrays). I use the computation path (e.g., this is the value obtained from f(g('something'), 2)). Since, at least in my problems, things tend to always map back into simple file-system paths, the hash computation doesn't even need to load the intermediate results.

I did notice too that using the argument value to hash was bound to fail in all but the simplest cases. This is the immediate limitation to the famous memoize pattern when applied to scientific code. If I understand well, what you do is that you track the 'history' of the object and use it as a hash to the object, right? I had come to the conclusion that the history of objects should be tracked, but I hadn't realized that using it as a hash was also a good way to solve the scoping problem. Thanks for the trick.

Would you consider making the code BSD? Because I want to be able to reuse my code in non open-source projects, and because I do not want to lock out contributors, or to ask for copyright assignment, I like to keep all my code BSD, as do all the mainstream scientific Python projects.

I'll start writing up a wiki page with all the different learnings and use cases that come from all this interesting feedback.

Cheers,
Gaël
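For contrast, the 'famous memoize pattern' Gaël mentions, persisted to disk, might look like the sketch below (names are illustrative, not from any of the packages discussed); hashing the pickled argument *values* is exactly the step that becomes unwieldy for large arrays:

import hashlib
import os
import pickle

def disk_memoize(cachedir):
    # Cache results on disk, keyed by a hash of the argument values.
    if not os.path.isdir(cachedir):
        os.makedirs(cachedir)
    def decorator(func):
        def wrapper(*args, **kwargs):
            # Hashing the pickled values is the costly part for big arrays.
            raw = pickle.dumps((func.__name__, args, sorted(kwargs.items())))
            path = os.path.join(cachedir, hashlib.sha1(raw).hexdigest() + '.pkl')
            if os.path.exists(path):
                with open(path, 'rb') as f:
                    return pickle.load(f)
            result = func(*args, **kwargs)
            with open(path, 'wb') as f:
                pickle.dump(result, f)
            return result
        return wrapper
    return decorator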
Re: [Numpy-discussion] Alternative to record array
On Monday 29 December 2008, Jean-Baptiste Rudant wrote:

> I like to use record arrays to access fields by their name, and because they are easy to use with pytables. But I think it's not very efficient for what I have to do. [...] # because I think the record array stores [(age_0, weight_0) ... (age_n, weight_n)] # and not [age_0 ... age_n] then [weight_0 ... weight_n].

That's a very good question. What you are observing are the effects of arranging a dataset by fields (row-wise) or by columns (column-wise). A record array in numpy arranges data by field, so that in your 'data' array the data is placed in memory as follows:

data['age'][0] -- data['weight'][0] -- data['age'][1] -- data['weight'][1] -- ...

while in your 'FieldArray' class, data is arranged by column and is placed in memory as:

data['age'][0] -- data['age'][1] -- ... -- data['weight'][0] -- data['weight'][1] -- ...

The difference between the two approaches is that the row-wise arrangement is more efficient when data is iterated by record, while the column-wise one is more efficient when data is iterated by column. This is why you are seeing the 4x difference in performance -- incidentally, by looking at both data arrangements, I'd expect a difference of just 2x (the stride count is 2 in this case), but I suspect that there are hidden copies during the increment operation in the record array case.

So you are perfectly right. In some situations you may want to use a row-wise arrangement (record array) and in other situations a column-wise one. So, it would be handy to have some code to convert back and forth between both data arrangements. Here go a couple of classes for doing this (they are a quick-and-dirty generalization of your code):

class ColArray:
    def __init__(self, recarray):
        dictarray = {}
        if isinstance(recarray, np.ndarray):
            fields = recarray.dtype.fields
        elif isinstance(recarray, RecArray):
            fields = recarray.fields
        else:
            raise TypeError("Unrecognized input type!")
        for colname in fields:
            # For optimum performance you should 'copy' the column!
            dictarray[colname] = recarray[colname].copy()
        self.dictarray = dictarray

    def __getitem__(self, field):
        return self.dictarray[field]

    def __setitem__(self, field, value):
        self.dictarray[field] = value

    def iteritems(self):
        return self.dictarray.iteritems()

class RecArray:
    def __init__(self, dictarray):
        ldtype = []
        fields = []
        for colname, column in dictarray.iteritems():
            ldtype.append((colname, column.dtype))
            fields.append(colname)
            collen = len(column)
        dt = np.dtype(ldtype)
        recarray = np.empty(collen, dtype=dt)
        for colname, column in dictarray.iteritems():
            recarray[colname] = column
        self.recarray = recarray
        self.fields = fields

    def __getitem__(self, field):
        return self.recarray[field]

    def __setitem__(self, field, value):
        self.recarray[field] = value

So, ColArray takes as parameter a record array or a RecArray instance that has a row-wise arrangement and returns an object that is column-wise. RecArray does the inverse trip on the ColArray that it takes as parameter. A small example of use:

from time import time

N = int(10e6)
age = np.random.randint(0, 99, N)
weight = np.random.randint(0, 200, N)

# Get an initial record array
dt = np.dtype([('age', np.int_), ('weight', np.int_)])
data = np.empty(N, dtype=dt)
data['age'] = age
data['weight'] = weight

t1 = time()
data['age'] += 1
print "time for initial recarray:", round(time()-t1, 3)

data = ColArray(data)
t1 = time()
data['age'] += 1
print "time for ColArray:", round(time()-t1, 3)

data = RecArray(data)
t1 = time()
data['age'] += 1
print "time for reconstructed RecArray:", round(time()-t1, 3)

data = ColArray(data)
t1 = time()
data['age'] += 1
print "time for reconstructed ColArray:", round(time()-t1, 3)
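The stride argument above can be checked directly: in the record array a field's values are interleaved with the other field, so the field view steps over 8 bytes per element instead of 4 and is not contiguous. A minimal sketch:

import numpy as np

dt = np.dtype([('age', np.int32), ('weight', np.int32)])
rec = np.zeros(1000, dtype=dt)           # row-wise: fields interleaved
col = np.zeros(1000, dtype=np.int32)     # column-wise: one contiguous column

print(rec['age'].strides)                # (8,): each step skips the weight field
print(col.strides)                       # (4,): contiguous int32
print(rec['age'].flags['C_CONTIGUOUS'])  # False: strided view, not a copy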
Re: [Numpy-discussion] Thoughts on persistence/object tracking in scientific code
Hello all,

On Monday 29 December 2008 17:40:07 Gael Varoquaux wrote:

> It is interesting to see that you take a slightly different approach than the others already discussed. [...] In particular, I have the impression that you do not deal with what I call lazy re-evaluation. In other words, I am not sure if you track results enough to know whether an intermediate result should be re-run, or if you run a 'clean' between each run to avoid this problem.

I do. As long as the hash (the arguments to the function) is the same, the code loads objects from disk instead of computing results. I don't track the actual source code, though, only whether parameters have changed (but this could be a later addition).

> I must admit I went away from using hashes to store objects to disk because I am very much interested in traceability, and I wanted my objects to have meaningful names, and to be stored in convenient formats (pickle, numpy .npy, hdf5, or domain-specific). I have now realized that explicit naming is convenient, but it should be optional.

But using a hash is not so impenetrable as long as you can easily get to the files you want. If I want to load the results of a partial computation, all I have to do is to generate the same Task objects as the initial computation and load those: I can run the jugfile.py inside ipython and call the appropriate load() methods.

$ ipython jugfile.py
In [1]: interesting = [t for t in tasks if t.name == 'something.other']
In [2]: intermediate = interesting[0].load()

> I did notice too that using the argument value to hash was bound to fail in all but the simplest cases. This is the immediate limitation to the famous memoize pattern when applied to scientific code. If I understand well, what you do is that you track the 'history' of the object and use it as a hash to the object, right? I had come to the conclusion that the history of objects should be tracked, but I hadn't realized that using it as a hash was also a good way to solve the scoping problem. Thanks for the trick.

Yes, let's say I have the following:

feats = [Task(features, img) for img in glob('*.png')]
cluster = Task(kmeans, feats, k=10)

then the hash for cluster is computed from its arguments:

* kmeans: the function name
* feats: this is a list of tasks, therefore I use its hash, which is defined by its arguments, which are simple strings.
* k=10: this is a literal.

I don't need to use the value computed by feats to compute the hash for cluster.

> Your task-based approach, and the API you have built around it, reminds me a bit of Twisted's Deferred. Have you studied this API?

No. I will look into it. Thanks.

bye,
Luis
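A minimal sketch of that hashing scheme, assuming nothing about jug's actual internals: a task's hash mixes the function name with the hashes of task arguments (recursively) and the reprs of literals, so no intermediate result ever has to be loaded:

import hashlib

class Task(object):
    def __init__(self, func, *args, **kwargs):
        self.func, self.args, self.kwargs = func, args, kwargs

    def hash(self):
        # Function name first, then each argument: tasks contribute
        # their own hash; lists recurse one level; literals use repr.
        h = hashlib.sha1(self.func.__name__.encode())
        for arg in self.args:
            if isinstance(arg, Task):
                part = arg.hash()
            elif isinstance(arg, list):
                part = ''.join(a.hash() if isinstance(a, Task) else repr(a)
                               for a in arg)
            else:
                part = repr(arg)
            h.update(part.encode())
        for key in sorted(self.kwargs):
            h.update(('%s=%r' % (key, self.kwargs[key])).encode())
        return h.hexdigest()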
Re: [Numpy-discussion] formatting issues, locale and co
On Mon, Dec 29, 2008 at 4:36 PM, Charles R Harris charlesr.har...@gmail.com wrote:

> On Sun, Dec 28, 2008 at 10:35 PM, David Cournapeau da...@ar.media.kyoto-u.ac.jp wrote:
> Charles R Harris wrote:
> I put yesterday's work in the fix_float_format branch:
> - it fixes the locale issue
> - it fixes the long double issue on windows
> - it also fixes some tests (we were not testing single precision formatting but twice double precision instead; the single precision test fails on the trunk, BTW)
> Curious, I don't see any test failures here. Were the tests actually being run or is something else different in your test setup? Or do you mean the fixed-up test fails.
> The latter: if you look at numpy/core/tests/test_print, you will see that the types tested are np.float, np.double and np.longdouble, but at least on linux, np.float == np.double, and np.float32 is what we want to test here instead, I suppose.
> Expected, but I would like to see it change because it is kind of frustrating. Fixing it probably involves setting a function pointer in the type definition, but I am not sure about that.
> Hm, it took me a while to get this, but print np.float32(value) can be controlled through tp_print. Still, it does not work in all cases:
> print np.float32(a)        -> calls tp_print
> print '%f' % np.float32(a) -> does not call tp_print (nor tp_str/tp_repr)
> I have no idea what is going on there.
> I'll bet it's calling a conversion to python float, i.e., double, because of the %f.

Yes, I meant that I did not understand the code path in that case. I realize that I don't know how to get the (C) call graph between two code points in python; that would be useful. Where are you, dtrace on linux, when I need you :)

> In [1]: '%s' % np.float32(1)
> Out[1]: '1.0'
>
> In [2]: '%f' % np.float32(1)
> Out[2]: '1.000000'
>
> I don't see any way to work around that without changing the way the python formatting works.

Yes, I think you're right. Especially since python itself is not consistent. On python 2.6, windows:

a = complex('inf')
print a        # -> prints inf
print '%s' % a # -> prints inf
print '%f' % a # -> prints 1.#INF

Which suggests that in that case, it gets directly to stdio without much formatting work from python. Maybe it is an oversight? Anyway, I think it would be useful to override the tp_print member (to avoid 'print a' printing 1.#INF).

cheers,

David
Re: [Numpy-discussion] formatting issues, locale and co
On Mon, Dec 29, 2008 at 8:12 PM, David Cournapeau courn...@gmail.com wrote:

> [...]
>
> Yes, I meant that I did not understand the code path in that case. I realize that I don't know how to get the (C) call graph between two code points in python; that would be useful. Where are you, dtrace on linux, when I need you :)

I'm not sure we are quite on the same page here. The float32 object has a convert-to-python-float method (which I don't recall at the moment, and I don't have the source to hand). So when %f appears in the format string, that method is called and the resulting python float is formatted in the python way. Same with %s, only __str__ is called instead.

> Yes, I think you're right. Especially since python itself is not consistent. On python 2.6, windows:
>
> a = complex('inf')
> print a        # -> prints inf
> print '%s' % a # -> prints inf
> print '%f' % a # -> prints 1.#INF

How does a python inf display on windows?

> Which suggests that in that case, it gets directly to stdio without much formatting work from python. Maybe it is an oversight? Anyway, I think it would be useful to override the tp_print member (to avoid 'print a' printing 1.#INF).

Sounds like the sort of thing the python folks would want to clean up, just as you have for numpy.

Chuck
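Chuck's description of the %f path can be illustrated without numpy: for %f, Python converts the operand via its __float__ hook and then formats the resulting double itself, while %s dispatches to __str__, so a type's own print hooks never see the %f case. A minimal sketch:

class Probe(object):
    def __float__(self):
        print('__float__ called')
        return 1.0
    def __str__(self):
        print('__str__ called')
        return 'probe'

print('%f' % Probe())  # __float__ called, then '1.000000'
print('%s' % Probe())  # __str__ called, then 'probe'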
Re: [Numpy-discussion] formatting issues, locale and co
Charles R Harris wrote:

>> Yes, I meant that I did not understand the code path in that case. I realize that I don't know how to get the (C) call graph between two code points in python; that would be useful. Where are you, dtrace on linux, when I need you :)
>
> I'm not sure we are quite on the same page here.

Yep, indeed. I think my bogus example did not help :) The right test script uses float('inf'), not complex('inf').

> The float32 object has a convert-to-python-float method (which I don't recall at the moment, and I don't have the source to hand). So when %f appears in the format string, that method is called and the resulting python float is formatted in the python way.

I think that's not the case for '%f', because the 'python' way is to print 'inf', not '1.#INF' (at least on 2.6 -- on 2.5, it is always '1.#INF' on windows). If you use a pure C program on windows, you will get '1.#INF', etc., instead of 'inf'. repr, str and print all call the C format_float function, which takes care of formatting 'inf' and co. the 'python' way. So getting '1.#INF' from python suggests to me that python does not format it in the '%f' case, and I don't know the code path at that point. For '%s' it goes through tp_str, for print a it goes through tp_print, but for '%f'?

>> a = complex('inf')
>> print a        # -> prints inf
>> print '%s' % a # -> prints inf
>> print '%f' % a # -> prints 1.#INF
>
> How does a python inf display on windows?

As stated: it depends. 'inf' or '1.#INF', the latter being the same as the formatting done within the MS runtime.

>> Which suggests that in that case, it gets directly to stdio without much formatting work from python. Maybe it is an oversight? Anyway, I think it would be useful to override the tp_print member (to avoid 'print a' printing 1.#INF).
>
> Sounds like the sort of thing the python folks would want to clean up, just as you have for numpy.

The thing is, since I don't understand what happens in the print '%f' case, I don't know how to clean it up, if it is possible at all. But in any case, it means that with my changes we are not worse than python itself, and I think we are better than before.

cheers,

David