Re: [Numpy-discussion] np.longlong casts to int
Hi, On 23/02/2012 02:24, Matthew Brett wrote: Luckily I was in fact using longdouble in the live code. I had never used exotic floating point precision, so thanks for your post, which made me take a look at the docstring and documentation. If I got it right from the docstring, 'np.longdouble' and 'np.longfloat' are in fact 'np.float128' (numpy 1.5). However, I was surprised that float128 is not mentioned in the table of available types in the user guide: http://docs.scipy.org/doc/numpy/user/basics.types.html Is there a specific reason for this absence, or is it just a matter of visiting the documentation wiki ;-) ? Additionally, I don't know the writing guidelines of the user guide, but would it make sense to add some "New in numpy 1.x" notes, as in the Python docs? I'm thinking here of np.float16: I know it exists from messages on this mailing list, but my 1.5 doesn't have it. Best, Pierre PS: I found float128 mentioned in the reference http://docs.scipy.org/doc/numpy/reference/arrays.scalars.html#built-in-scalar-types However, it is not as easily readable as the user guide (which makes sense!). Do the following statements mean that those types are not available on all platforms? float96: 96 bits, platform? float128: 128 bits, platform?
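A quick way to check what longdouble means on a given machine (the alias is platform-dependent, so the exact output varies):

    import numpy as np

    # What does longdouble alias to here?  Platform-dependent.
    print(np.longdouble)                          # e.g. numpy.float96 or numpy.float128
    print(np.dtype(np.longdouble).itemsize * 8)   # storage width in bits (96 or 128)
    print(np.finfo(np.longdouble))                # the precision actually provided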
Re: [Numpy-discussion] python geospatial package?
On 02/22/2012 10:45 PM, Chao YUE wrote: Hi all, Is anyone using a Python geospatial package that can do jobs like intersection, etc.? The job is, for example, automatically extracting a region from a global map. thanks and cheers, Chao Chao, shapely would do this, though I found it had a bit of a steep learning curve. Or you could go the gdal/ogr way, which uses the geos library under the hood (if present) to do geometrical operations like intersections etc. cheers, Vincent.
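A minimal Shapely sketch of the kind of intersection job described above; the coordinates are made up for illustration:

    from shapely.geometry import Polygon

    # A lon/lat box for the region of interest, and some feature to clip.
    region = Polygon([(0, 40), (20, 40), (20, 60), (0, 60)])
    feature = Polygon([(10, 30), (30, 30), (30, 50), (10, 50)])

    overlap = region.intersection(feature)
    print(overlap.area)     # area of the overlapping part
    print(overlap.bounds)   # (minx, miny, maxx, maxy)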
Re: [Numpy-discussion] np.longlong casts to int
On Feb 23, 2012, at 3:06 AM, Pierre Haessig wrote: [clip] If I got it right from the docstring, 'np.longdouble' and 'np.longfloat' are in fact 'np.float128' (numpy 1.5). That in fact depends on the platform you are using. Typically, on 32-bit platforms 'np.longfloat' and 'np.longdouble' are bound to 'np.float96', while on 64-bit ones they are bound to 'np.float128'. However, I was surprised that float128 is not mentioned in the table of available types in the user guide. http://docs.scipy.org/doc/numpy/user/basics.types.html Is there a specific reason for this absence, or is it just a matter of visiting the documentation wiki ;-) ? The reason is most probably that you cannot get a float96 or float128 whenever you want (it depends on your architecture), so adding these types to the manual could be misleading. However, I'd advocate documenting them while warning about platform portability issues. Additionally, would it make sense to add some "New in numpy 1.x" notes, as in the Python docs? I'm thinking here of np.float16. float16 was introduced in NumPy 1.6, IIRC. PS: I found float128 mentioned in the reference http://docs.scipy.org/doc/numpy/reference/arrays.scalars.html#built-in-scalar-types Do the following statements mean that those types are not available on all platforms? float96: 96 bits, platform? float128: 128 bits, platform? Exactly. I'd update this to read: float96: 96 bits. Only available on 32-bit (i386) platforms. float128: 128 bits. Only available on 64-bit (AMD64) platforms. -- Francesc Alted
Re: [Numpy-discussion] np.longlong casts to int
On Thu, Feb 23, 2012 at 11:40 AM, Francesc Alted franc...@continuum.io wrote: Exactly. I'd update this to read: float96: 96 bits. Only available on 32-bit (i386) platforms. float128: 128 bits. Only available on 64-bit (AMD64) platforms. Except float96 is actually 80 bits. (Usually?) Plus some padding... -- Nathaniel
Re: [Numpy-discussion] np.longlong casts to int
On 23/02/2012 12:40, Francesc Alted wrote: [clip] Exactly. I'd update this to read: float96: 96 bits. Only available on 32-bit (i386) platforms. float128: 128 bits. Only available on 64-bit (AMD64) platforms. Thanks for the enlightenment! I was not aware of this 96 bits / 128 bits relationship. -- Pierre
Re: [Numpy-discussion] np.longlong casts to int
On Feb 23, 2012, at 5:43 AM, Nathaniel Smith wrote: [clip] Except float96 is actually 80 bits. (Usually?) Plus some padding… Good point. The thing is that they actually use 96 bits for storage purposes (this is due to alignment requirements). Another quirk related to this is that MSVC automatically maps long double to 64-bit doubles: http://msdn.microsoft.com/en-us/library/9cx8xs15.aspx Not sure why they did that (portability issues?). -- Francesc Alted
Re: [Numpy-discussion] np.longlong casts to int
On Feb 23, 2012, at 6:06 AM, Francesc Alted wrote: [clip] Good point. The thing is that they actually use 96 bits for storage purposes (this is due to alignment requirements). [clip] Hmm, yet another quirk (this time in NumPy itself). On 32-bit platforms:

In [16]: np.longdouble
Out[16]: numpy.float96
In [17]: np.finfo(np.longdouble).eps
Out[17]: 1.084202172485504434e-19

while on 64-bit ones:

In [8]: np.longdouble
Out[8]: numpy.float128
In [9]: np.finfo(np.longdouble).eps
Out[9]: 1.084202172485504434e-19

i.e. NumPy is saying that the eps (machine epsilon) is the same on both platforms, despite the fact that one uses 80-bit precision and the other 128-bit precision. For the 80-bit, the eps should be:

In [5]: 1 / 2**63.
Out[5]: 1.0842021724855044e-19

[http://en.wikipedia.org/wiki/Extended_precision]

which is correctly stated by NumPy, while for 128-bit (quad precision), eps should be:

In [6]: 1 / 2**113.
Out[6]: 9.62964972193618e-35

[http://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format]

If nobody objects, I'll file a bug about this. -- Francesc Alted
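For reference, np.finfo defines eps as 2**(1 - p) for a p-bit significand, so the expected values can be checked directly (the 80-bit extended format carries a 64-bit significand, IEEE binary128 a 113-bit one):

    import numpy as np

    print(np.finfo(np.longdouble).eps)   # what NumPy reports for longdouble

    # Expected eps = 2**(1 - p) for significand width p:
    print(2.0 ** (1 - 64))    # 80-bit extended  -> 1.084...e-19
    print(2.0 ** (1 - 113))   # IEEE binary128   -> 1.925...e-34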
Re: [Numpy-discussion] python geospatial package?
2012/2/23 Vincent Schut sc...@sarvision.nl [clip] On 02/22/2012 10:45 PM, Chao YUE wrote: Hi all, Is anyone using a Python geospatial package that can do jobs like intersection, etc.? The job is, for example, automatically extracting a region from a global map. thanks and cheers, Chao Depending on what you want to do: Shapely, GDAL/OGR, pyproj, Mapnik, Basemap, ...
[Numpy-discussion] Special matrices with structure?
Hi! I was wondering whether it would be easy/possible/reasonable to have classes for arrays that have special structure, in order to use less memory and speed up some computations? For instance:

- a symmetric matrix could be stored in almost half the memory required by a non-symmetric matrix
- a diagonal matrix only needs to store the diagonal vector
- a Toeplitz matrix only needs to store one or two vectors
- a sparse matrix only needs to store the non-zero elements (some implementations exist in scipy.sparse)
- and so on

If such classes were implemented, it would be nice if they worked with numpy functions (dot, diag, ...) and operations (+, *, +=, ...) easily. I believe this has been discussed before, but Google didn't help a lot. Regards, Jaakko
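As a sketch of the idea, a hypothetical diagonal-matrix class (the name and interface here are made up, not an existing NumPy API) that stores only the diagonal and exploits the structure in products:

    import numpy as np

    class DiagonalMatrix(object):
        """Hypothetical n x n diagonal matrix storing only the diagonal: O(n) memory."""
        def __init__(self, diag):
            self.diag = np.asarray(diag, dtype=float)

        def dot(self, x):
            # Diagonal matvec is an elementwise product: O(n) instead of O(n^2).
            return self.diag * x

        def __add__(self, other):
            if isinstance(other, DiagonalMatrix):
                return DiagonalMatrix(self.diag + other.diag)
            # Structure is lost when adding a dense array: fall back to dense.
            return np.diag(self.diag) + np.asarray(other)

    d = DiagonalMatrix([1.0, 2.0, 3.0])
    print(d.dot(np.ones(3)))         # [ 1.  2.  3.]
    print((d + d).dot(np.ones(3)))   # [ 2.  4.  6.]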
Re: [Numpy-discussion] np.longlong casts to int
Hi, On Thu, Feb 23, 2012 at 4:23 AM, Francesc Alted franc...@continuum.io wrote: [clip] i.e. NumPy is saying that the eps (machine epsilon) is the same on both platforms, despite the fact that one uses 80-bit precision and the other 128-bit precision. [clip] If nobody objects, I'll file a bug about this. There was half a proposal for renaming these guys in the interests of clarity: http://mail.scipy.org/pipermail/numpy-discussion/2011-October/058820.html I'd be happy to write this up as a NEP. Best, Matthew
Re: [Numpy-discussion] np.longlong casts to int
On Thu, Feb 23, 2012 at 5:23 AM, Francesc Alted franc...@continuum.io wrote: [clip] i.e. NumPy is saying that the eps (machine epsilon) is the same on both platforms, despite the fact that one uses 80-bit precision and the other 128-bit precision. For the 80-bit, the eps should be: That's correct. They are both extended precision (80 bits), but aligned on 32-bit/64-bit boundaries respectively. Sun provides a true quad precision, also called float128, while on PPC long double is an odd combination of two doubles. Chuck
Re: [Numpy-discussion] Special matrices with structure?
On 02/23/2012 05:50 AM, Jaakko Luttinen wrote: [clip] I'm currently working on a library for this. The catch is that I'm doing it as a work project, not a hobby project -- so only the features I strictly need for my PhD thesis really get priority. That means that it will only really be developed for use on clusters/MPI, not so much for single-node LAPACK. I'd love to pair up with someone who could make sure the library is more generally useful, which is my real goal (if I ever get spare time again...). The general idea of my approach is to have lazily evaluated expressions:

    A = ...  # diagonal matrix
    B = ...  # dense matrix
    L = (give(A) + give(B)).cholesky()  # only symbolic!
    # give means: overwrite if you want to
    explain(L)  # prints what it will do if it computes L
    L = compute(L)  # does the computation

What the code above would do is:
- First, determine that the fastest way of doing + is to take the elements in A and += them in place to the diagonal of B
- Then, do the Cholesky in

Note that if you change the types of ... The goal is to facilitate writing general code which doesn't know the types of the matrices, yet still string together the optimal chain of calls. This requires waiting with evaluation until an explicit compute call (which essentially does a compilation). Adding matrix types and operations is done through pattern matching. One can provide code like this to give optimized code for weird special cases:

    @computation(RowMajorDense + ColMajorDense, RowMajorDense)
    def add(a, b):
        # provide an optimized case for row-major + col-major,
        # resulting in row-major

    @cost(add)
    def add_cost(a, b):
        # provide an estimate for the cost of the above routine

The compiler looks at all the provided @computation and determines the cheapest path. My code is at https://github.com/dagss/oomatrix, but I certainly haven't done anything yet to make the codebase useful to anyone but me, so you probably shouldn't look at it, but rather ask me here. Dag
Re: [Numpy-discussion] Special matrices with structure?
On 02/23/2012 09:47 AM, Dag Sverre Seljebotn wrote: [clip] What the code above would do is: - First, determine that the fastest way of doing + is to take the elements in A and += them in place to the diagonal of B - Then, do the Cholesky in Sorry: Then, do the Cholesky in place in the buffer of B, and use that for L. Dag
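In plain NumPy terms, the plan described above amounts to something like the following sketch; note that np.linalg.cholesky allocates its result, so a truly in-place factorization would have to call LAPACK directly:

    import numpy as np

    n = 4
    a_diag = np.ones(n)    # the diagonal matrix A, stored as a vector
    B = 5.0 * np.eye(n)    # a dense symmetric positive-definite B

    # Step 1: A + B touches only n entries when done in place on B's diagonal.
    B[np.diag_indices(n)] += a_diag

    # Step 2: factor the updated B.
    L = np.linalg.cholesky(B)
    print(np.allclose(np.dot(L, L.T), B))   # True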
Re: [Numpy-discussion] np.longlong casts to int
On 23/02/2012 17:28, Charles R Harris wrote: That's correct. They are both extended precision (80 bits), but aligned on 32-bit/64-bit boundaries respectively. Sun provides a true quad precision, also called float128, while on PPC long double is an odd combination of two doubles. This is insane ! ;-) -- Pierre
Re: [Numpy-discussion] np.longlong casts to int
Hi, On Thu, Feb 23, 2012 at 10:11 AM, Pierre Haessig pierre.haes...@crans.org wrote: [clip] This is insane ! ;-) I don't know if it's insane, but it is certainly very confusing, as this thread and the previous one show. The question is, what would be less confusing? Best, Matthew
Re: [Numpy-discussion] np.longlong casts to int
On Thu, Feb 23, 2012 at 10:42 AM, Matthew Brett matthew.br...@gmail.com wrote: [clip] I don't know if it's insane, but it is certainly very confusing, as this thread and the previous one show. The question is, what would be less confusing? One approach would be to never alias longdouble as float###. Especially float128 seems to imply that it's the IEEE standard binary128 float, which it is on some platforms, but not on most. Cheers, Mark
Re: [Numpy-discussion] np.longlong casts to int
Hi, On Thu, Feb 23, 2012 at 10:45 AM, Mark Wiebe mwwi...@gmail.com wrote: [clip] One approach would be to never alias longdouble as float###. Especially float128 seems to imply that it's the IEEE standard binary128 float, which it is on some platforms, but not on most. It's virtually never IEEE binary128. Yarik Halchenko found a real one on an s/390 running Debian. Some docs seem to suggest there are Sun machines out there with binary128, as Chuck said. So the vast majority of numpy users with float128 have Intel 80-bit, and some have PPC twin-float. Do we all agree then that 'float128' is a bad name? In the last thread, I had the feeling there was some consensus on renaming the Intel 80s to:

float128 -> float80_128
float96 -> float80_96

For those platforms implementing it, maybe:

float128 -> float128_ieee

Maybe for PPC:

float128 -> float_pair_128

and, personally, I still think it would be preferable, and less confusing, to encourage use of 'longdouble' instead of the various platform-specific aliases. What do you think? Best, Matthew
Re: [Numpy-discussion] np.longlong casts to int
On Thu, Feb 23, 2012 at 10:55 AM, Matthew Brett matthew.br...@gmail.com wrote: [clip] Do we all agree then that 'float128' is a bad name? In the last thread, I had the feeling there was some consensus on renaming the Intel 80s to: float128 -> float80_128, float96 -> float80_96. For those platforms implementing it, maybe float128 -> float128_ieee. Maybe for PPC: float128 -> float_pair_128. And, personally, I still think it would be preferable, and less confusing, to encourage use of 'longdouble' instead of the various platform-specific aliases. +1, I think it's good for its name to correspond to the name in C/C++, so that when people search for information on it they will find the relevant information more easily. With a bunch of NumPy-specific aliases, it just creates more hassle for everybody. -Mark
[Numpy-discussion] Possible roadmap addendum: building better text file readers
dear all, I haven't read all 180 e-mails, but I didn't see this on Travis's initial list. All of the existing flat file reading solutions I have seen are not suitable for many applications, and they compare very unfavorably to tools present in other languages, like R. Here are some of the main issues I see:

- Memory usage: creating millions of Python objects when reading a large file results in horrendously bad memory utilization, which the Python interpreter is loath to return to the operating system. Any solution using the csv module (like pandas's parsers -- which are a lot faster than anything else I know of in Python) suffers from this problem because the data come out boxed in tuples of PyObjects. Try loading a 1,000,000 x 20 CSV file into a structured array using np.genfromtxt or into a DataFrame using pandas.read_csv and you will immediately see the problem. R, by contrast, uses very little memory.

- Performance: post-processing of Python objects results in poor performance. Also, for the actual parsing, anything regular-expression based (like the loadtable effort over the summer, all apologies to those who worked on it) is doomed to failure. I think having a tool with a high degree of compatibility and intelligence for parsing unruly small files does make sense, though, but it's not appropriate for large, well-behaved files.

- Need to factorize: as soon as there is an enum dtype in NumPy, we will want to enable the file parsers for structured arrays and DataFrame to factorize / convert to enum certain columns (for example, all string columns) during the parsing process, and not afterward. This is very important for enabling fast groupby on large datasets and for reducing unnecessary memory usage up front (imagine a column with a million values, with only 10 unique values occurring). This would be trivial to implement using a C hash table implementation like khash.h (see the sketch below).

To be clear: I'm going to do this eventually whether or not it happens in NumPy, because it's an existing problem for heavy pandas users. I see no reason why the code can't emit structured arrays, too, so we might as well have a common library component that I can use in pandas and specialize to the DataFrame internal structure. It seems clear to me that this work needs to be done at the lowest level possible, probably all in C (or C++?) or maybe Cython plus C utilities. If anyone wants to get involved in this particular problem right now, let me know! best, Wes
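To make the factorize point concrete, here is a sketch of the idea with a plain Python dict standing in for a C hash table like khash:

    import numpy as np

    def factorize(values):
        # Map each value to a small integer code; keep one copy of each unique value.
        codes = np.empty(len(values), dtype=np.int32)
        seen = {}
        uniques = []
        for i, v in enumerate(values):
            code = seen.get(v)
            if code is None:
                code = seen[v] = len(uniques)
                uniques.append(v)
            codes[i] = code
        return codes, uniques

    codes, uniques = factorize(['low', 'high', 'low', 'mid', 'low'])
    print(codes)     # [0 1 0 2 0]
    print(uniques)   # ['low', 'high', 'mid']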
[Numpy-discussion] mkl usage
Is MKL only used for linear algebra? Will it speed up, e.g., elementwise transcendental functions?
Re: [Numpy-discussion] mkl usage
On Feb 23, 2012, at 1:33 PM, Neal Becker wrote: Is MKL only used for linear algebra? Will it speed up, e.g., elementwise transcendental functions? Yes, MKL comes with VML, which has this type of optimization: http://software.intel.com/sites/products/documentation/hpc/mkl/vml/vmldata.htm Also, see some speedups in numexpr linked against MKL here: http://code.google.com/p/numexpr/wiki/NumexprVML See also how the native multi-threading implementation in numexpr beats MKL's (at least for this particular example). -- Francesc Alted
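A small numexpr sketch of the kind of elementwise transcendental expression this concerns; when numexpr is linked against MKL the sin/cos go through VML, and either way the evaluation is blocked and multi-threaded:

    import numpy as np
    import numexpr as ne

    x = np.linspace(0.0, 10.0, 10000000)

    y1 = np.sin(x) + np.cos(x)            # plain NumPy: temporaries, single-threaded
    y2 = ne.evaluate("sin(x) + cos(x)")   # numexpr: no large temporaries
    print(np.allclose(y1, y2))            # True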
Re: [Numpy-discussion] mkl usage
On 23.02.2012 20:44, Francesc Alted wrote: [clip] Yes, MKL comes with VML, which has this type of optimization. And also no, in the sense that NumPy and SciPy don't use VML. -- Pauli Virtanen
Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers
Hi, On 23.02.2012 20:32, Wes McKinney wrote: [clip] To be clear: I'm going to do this eventually whether or not it happens in NumPy, because it's an existing problem for heavy pandas users. I see no reason why the code can't emit structured arrays, too, so we might as well have a common library component that I can use in pandas and specialize to the DataFrame internal structure. If you do this, one useful aim could be to design the code such that it can be used in loadtxt, at least as a fast path for common cases. I'd really like to avoid increasing the number of APIs for text file loading. -- Pauli Virtanen
Re: [Numpy-discussion] np.longlong casts to int
On 23/02/2012 20:08, Mark Wiebe wrote: +1, I think it's good for its name to correspond to the name in C/C++, so that when people search for information on it they will find the relevant information more easily. With a bunch of NumPy-specific aliases, it just creates more hassle for everybody. I don't fully agree. First, this assumes that people are C-educated, at least a bit. I got some C education, but I have spent most of my scientific programming time sitting in front of Python, Matlab, and a bit of R (in that order). In this context, double, float, long and short are all esoteric incantations. Second, the C/C++ names are very imprecise with regard to their memory content, and sometimes platform-dependent. On the other hand, float64 is very informative. Also, how do these names scale with extended precision (where it's available... ;-) )? I wonder what may come after longdouble/longfloat: what about hyperlongsuperfancyextendeddoublefloat? I find float1024 simpler ;-) Now, because of all the specificities you described, this seems to be a complex topic. I guess that well-documented aliases would help people understand this very complexity. Best, Pierre
Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers
This is actually on my short-list as well --- it just didn't make it to the list. In fact, we have someone starting work on it this week. It is his first project, so it will take him a little time to get up to speed, but he will contact Wes, work with him, and report progress to this list. Integration with np.loadtxt is a high priority. I think loadtxt is now the 3rd or 4th text-reading interface I've seen in NumPy. I have no interest in making a new one if we can avoid it. But we do need to make it faster, with less memory overhead, for simple cases like Wes describes. -Travis On Feb 23, 2012, at 1:53 PM, Pauli Virtanen wrote: [clip] If you do this, one useful aim could be to design the code such that it can be used in loadtxt, at least as a fast path for common cases. I'd really like to avoid increasing the number of APIs for text file loading.
Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers
On Thu, Feb 23, 2012 at 3:08 PM, Travis Oliphant tra...@continuum.io wrote: [clip] Integration with np.loadtxt is a high priority. I think loadtxt is now the 3rd or 4th text-reading interface I've seen in NumPy. I have no interest in making a new one if we can avoid it. But we do need to make it faster, with less memory overhead, for simple cases like Wes describes. Yeah, what I envision is just an infrastructural parsing engine to replace the pure Python guts of np.loadtxt, np.genfromtxt, and the csv module + Cython guts of pandas.read_{csv, table, excel}. It needs to be somewhat adaptable to some of the domain-specific decisions of structured arrays vs. DataFrames -- for example, I use Python objects for strings, and one consequence of this is that I can intern strings (only one PyObject per unique string value occurring) where structured arrays cannot, so you get much better performance and memory usage that way. That's soon to change, though, I gather, at which point I'll almost definitely (!) move to pointer arrays instead of dtype=object arrays. - Wes
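The interning trick is easy to sketch with a dict cache guaranteeing one PyObject per unique value in an object array (an illustration of the idea, not pandas's actual parser code):

    import numpy as np

    raw = ['AAPL', 'MSFT', 'AAPL', 'GOOG', 'AAPL'] * 1000

    cache = {}
    # setdefault returns the stored object, so equal strings share one PyObject.
    interned = np.array([cache.setdefault(s, s) for s in raw], dtype=object)

    print(interned[0] is interned[2])   # True: same object, not merely equal
    print(len(cache))                   # 3 unique strings kept alive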
Re: [Numpy-discussion] mkl usage
Pauli Virtanen wrote: [clip] And also no, in the sense that NumPy and SciPy don't use VML. My question is: should I purchase MKL? To what extent will it speed up my existing Python code, without my having to exert (much) effort? So that would be numpy/scipy. I'd entertain trying other things, if it weren't much effort.
Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers
On Thu, Feb 23, 2012 at 2:08 PM, Travis Oliphant tra...@continuum.io wrote: [clip] Integration with np.loadtxt is a high priority. I have a proof-of-concept CSV reader written in C (with a Cython wrapper). I'll put it on github this weekend. Warren
Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers
Wes - I designed the recfile package to fill this need. It might be a start. Some features:

- the ability to efficiently read any subset of the data without loading the whole file
- reads directly into a recarray, so no overheads
- an object-oriented interface, mimicking recarray slicing
- also supports writing

Currently it is fixed-width fields only. It is C++, but wouldn't be hard to convert to C if that is a requirement. Also, it works for binary or ascii. http://code.google.com/p/recfile/ The trunk is pretty far past the most recent release. -- Erin Scott Sheldon, Brookhaven National Laboratory
Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers
On Thu, Feb 23, 2012 at 3:19 PM, Warren Weckesser warren.weckes...@enthought.com wrote: [clip] I have a proof-of-concept CSV reader written in C (with a Cython wrapper). I'll put it on github this weekend. Sweet -- between this, the Continuum folks, and me and my guys, I think we can come up with something good that suits all our needs. We should set up some realistic performance test cases that we can monitor via vbench (wesm/vbench) while we work on the project. - W
Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers
On Thu, Feb 23, 2012 at 3:23 PM, Erin Sheldon erin.shel...@gmail.com wrote: [clip] I designed the recfile package to fill this need. It might be a start. Can you relicense as BSD-compatible?
Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers
On Thursday 23 February 2012 21:24:28, Wes McKinney wrote: [clip] Sweet -- between this, the Continuum folks, and me and my guys, I think we can come up with something good that suits all our needs. That would indeed be great. Reading large files is a real pain whatever the Python method used. BTW, could you tell us what you mean by large files? cheers, Éric.
Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers
Excerpts from Wes McKinney's message of Thu Feb 23 15:24:44 -0500 2012: On Thu, Feb 23, 2012 at 3:23 PM, Erin Sheldon erin.shel...@gmail.com wrote: I designed the recfile package to fill this need. It might be a start. Can you relicense as BSD-compatible? If required, that would be fine with me. -e
Re: [Numpy-discussion] mkl usage
On Feb 23, 2012, at 2:19 PM, Neal Becker wrote: [clip] My question is: should I purchase MKL? To what extent will it speed up my existing Python code, without my having to exert (much) effort? So that would be numpy/scipy. Pauli already answered you. If you are restricted to numpy/scipy and your aim is to accelerate the evaluation of transcendental functions, then there is no point in purchasing MKL. If you can broaden your scope and use numexpr, then I think you should consider it. -- Francesc Alted
Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers
On 23/02/2012 20:32, Wes McKinney wrote: If anyone wants to get involved in this particular problem right now, let me know! Hi Wes, I'm totally outside the implementation issues you described, but I have some million-line-long CSV files, so I experience some slowdown when loading those. I'll be very glad to use any upgraded loadtxt/genfromtxt/whatever function once it's out! Best, Pierre (and this reminds me, shamefully, that I still haven't taken the time to give a serious try to your pandas...)
Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers
On 23/02/2012 21:08, Travis Oliphant wrote: I think loadtxt is now the 3rd or 4th text-reading interface I've seen in NumPy. Ok, now I understand why I got confused ;-) -- Pierre
Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers
On Thu, Feb 23, 2012 at 3:31 PM, Éric Depagne e...@depagne.org wrote: [clip] BTW, could you tell us what you mean by large files? Reasonably wide CSV files with hundreds of thousands to millions of rows. I have a separate interest in JSON handling, but that is a different kind of problem, and probably just a matter of forking ultrajson and having it not produce Python-object-based data structures. - Wes
Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers
Excerpts from Wes McKinney's message of Thu Feb 23 15:45:18 -0500 2012: Reasonably wide CSV files with hundreds of thousands to millions of rows. [clip] As a benchmark, recfile can read an uncached file with 350,000 lines and 32 columns in about 5 seconds. File size ~220M. -e -- Erin Scott Sheldon, Brookhaven National Laboratory
Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers
On Thu, Feb 23, 2012 at 3:55 PM, Erin Sheldon erin.shel...@gmail.com wrote: [clip] As a benchmark, recfile can read an uncached file with 350,000 lines and 32 columns in about 5 seconds. File size ~220M. That's pretty good. That's almost certainly faster than pandas's csv-module+Cython approach (but I haven't run your code to get a read on how much my hardware makes a difference), and that's not shocking at all:

In [1]: df = DataFrame(np.random.randn(350000, 32))
In [2]: df.to_csv('/home/wesm/tmp/foo.csv')
In [3]: %time df2 = read_csv('/home/wesm/tmp/foo.csv')
CPU times: user 6.62 s, sys: 0.40 s, total: 7.02 s
Wall time: 7.04 s

I must think that skipping the process of creating 11.2 mm Python string objects and then individually converting each of them to float accounts for the difference. Note for reference (I'm skipping the first row, which has the column labels, from above):

In [2]: %time arr = np.genfromtxt('/home/wesm/tmp/foo.csv', dtype=None, delimiter=',', skip_header=1)
CPU times: user 24.17 s, sys: 0.48 s, total: 24.65 s
Wall time: 24.67 s

In [6]: %time arr = np.loadtxt('/home/wesm/tmp/foo.csv', delimiter=',', skiprows=1)
CPU times: user 11.08 s, sys: 0.22 s, total: 11.30 s
Wall time: 11.32 s

In this last case, for example, around 500 MB of RAM is taken up for an array that should only be about 80-90MB. If you're a data scientist working in Python, this is _not good_. -W
Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers
On Thu, Feb 23, 2012 at 04:07:04PM -0500, Wes McKinney wrote: In this last case, for example, around 500 MB of RAM is taken up for an array that should only be about 80-90 MB. If you're a data scientist working in Python, this is _not good_. But why, oh why, are people storing big data in CSV? G
Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers
But why, oh why, are people storing big data in CSV? Well, that's what scientists do :-) Éric. -- Un clavier azerty en vaut deux ('an AZERTY keyboard is worth two') -- Éric Depagne e...@depagne.org
Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers
On Thu, Feb 23, 2012 at 21:09, Gael Varoquaux gael.varoqu...@normalesup.org wrote: On Thu, Feb 23, 2012 at 04:07:04PM -0500, Wes McKinney wrote: In this last case, for example, around 500 MB of RAM is taken up for an array that should only be about 80-90 MB. If you're a data scientist working in Python, this is _not good_. But why, oh why, are people storing big data in CSV? Because everyone can read it. It's not so much storage as transmission. -- Robert Kern
Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers
Excerpts from Wes McKinney's message of Thu Feb 23 16:07:04 -0500 2012: That's pretty good. It's almost certainly faster than pandas's csv-module+Cython approach (though I haven't run your code to get a read on how much my hardware makes a difference), but that's not shocking at all:

In [1]: df = DataFrame(np.random.randn(350000, 32))
In [2]: df.to_csv('/home/wesm/tmp/foo.csv')
In [3]: %time df2 = read_csv('/home/wesm/tmp/foo.csv')
CPU times: user 6.62 s, sys: 0.40 s, total: 7.02 s
Wall time: 7.04 s

I have to think that skipping the process of creating 11.2 million Python string objects and then individually converting each of them to float would speed things up substantially. Note for reference (I'm skipping the first row, which has the column labels from above):

In [2]: %time arr = np.genfromtxt('/home/wesm/tmp/foo.csv', dtype=None, delimiter=',', skip_header=1)
CPU times: user 24.17 s, sys: 0.48 s, total: 24.65 s
Wall time: 24.67 s

In [6]: %time arr = np.loadtxt('/home/wesm/tmp/foo.csv', delimiter=',', skiprows=1)
CPU times: user 11.08 s, sys: 0.22 s, total: 11.30 s
Wall time: 11.32 s

In this last case, for example, around 500 MB of RAM is taken up for an array that should only be about 80-90 MB. If you're a data scientist working in Python, this is _not good_.

It might be good to compare on recarrays, which are a bit more complex. Can you try one of these .dat files? http://www.cosmo.bnl.gov/www/esheldon/data/lensing/scat/05/ The dtype is [('ra', 'f8'), ('dec', 'f8'), ('g1', 'f8'), ('g2', 'f8'), ('err', 'f8'), ('scinv', 'f8', 27)] -- Erin Scott Sheldon Brookhaven National Laboratory
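To make the record layout concrete, here is a small sketch; the genfromtxt call mirrors the one Wes uses below, and the local file name assumes one of the .dat files has been downloaded first:

import numpy as np

# The record layout from the .dat files above. The 'scinv' field is a
# 27-element sub-array, so each record holds 5 + 27 = 32 float64 values.
dtype = np.dtype([('ra', 'f8'), ('dec', 'f8'), ('g1', 'f8'), ('g2', 'f8'),
                  ('err', 'f8'), ('scinv', 'f8', 27)])
print(dtype.itemsize)  # 256 bytes per record (32 * 8)

# One simple (if slow) way to read such a file into a structured array:
# arr = np.genfromtxt('scat-05-000.dat', dtype=dtype, delimiter=' ')
# arr['scinv'].shape  # -> (nrows, 27)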
Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers
On Thu, Feb 23, 2012 at 3:14 PM, Robert Kern robert.k...@gmail.com wrote: On Thu, Feb 23, 2012 at 21:09, Gael Varoquaux gael.varoqu...@normalesup.org wrote: On Thu, Feb 23, 2012 at 04:07:04PM -0500, Wes McKinney wrote: In this last case, for example, around 500 MB of RAM is taken up for an array that should only be about 80-90 MB. If you're a data scientist working in Python, this is _not good_. But why, oh why, are people storing big data in CSV? Because everyone can read it. It's not so much storage as transmission. Because their labmate/officemate/advisor is using Excel... Ben Root
Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers
On Thu, Feb 23, 2012 at 4:20 PM, Erin Sheldon erin.shel...@gmail.com wrote: It might be good to compare on recarrays, which are a bit more complex. Can you try one of these .dat files? http://www.cosmo.bnl.gov/www/esheldon/data/lensing/scat/05/ The dtype is [('ra', 'f8'), ('dec', 'f8'), ('g1', 'f8'), ('g2', 'f8'), ('err', 'f8'), ('scinv', 'f8', 27)] -- Erin Scott Sheldon Brookhaven National Laboratory

Forgot this one, which is also widely used:

In [28]: %time recs = matplotlib.mlab.csv2rec('/home/wesm/tmp/foo.csv', skiprows=1)
CPU times: user 65.16 s, sys: 0.30 s, total: 65.46 s
Wall time: 65.55 s

OK, with one of those .dat files and the dtype I get:

In [18]: %time arr = np.genfromtxt('/home/wesm/Downloads/scat-05-000.dat', dtype=dtype, skip_header=0, delimiter=' ')
CPU times: user 17.52 s, sys: 0.14 s, total: 17.66 s
Wall time: 17.67 s

The difference is not so stark in this case. I don't produce structured arrays, though:

In [26]: %time arr = read_table('/home/wesm/Downloads/scat-05-000.dat', header=None, sep=' ')
CPU times: user 10.15 s, sys: 0.10 s, total: 10.25 s
Wall time: 10.26 s

- Wes
Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers
On 23/02/2012 22:38, Benjamin Root wrote: labmate/officemate/advisor is using Excel... ... or an industrial partner with its Windows-based software that can export (when it works) some very nice field data from a proprietary Honeywell data logger. CSV data is better than no data! (And better than XLS data!) About the *big* data aspect of Gael's question, this reminds me of a software-project saying [1] that I would distort the following way: ''Q: How does a CSV data file get to be a million lines long? A: One line at a time!'' And my experience with some time-series measurements was really about this: small changes in the data rate, a slightly longer acquisition period, and that's it! Pierre (I shamefully confess I spent several hours writing *ad hoc* Python scripts full of regexps and generators just to fix various tiny details of those CSV files... but in the end it worked!) [1] I quickly googled 'one day at a time' for a reference and ended up on http://en.wikipedia.org/wiki/The_Mythical_Man-Month
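For what it's worth, the regexp-and-generator trick fits NumPy's readers nicely, since np.loadtxt accepts any iterable of lines. A minimal sketch, with a hypothetical file name and an assumed semicolon-delimiter quirk standing in for whatever the data logger actually emits:

import numpy as np

def fixed_lines(path):
    """Yield cleaned-up lines from a slightly broken CSV export."""
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith('#'):
                continue                  # drop blanks and comments
            yield line.replace(';', ',')  # e.g. normalize the delimiter

# The cleanup is streamed straight into the parser, so no intermediate
# fixed-up file ever has to be written to disk:
data = np.loadtxt(fixed_lines('logger_export.csv'), delimiter=',')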
[Numpy-discussion] Problem Building Numpy with Python 2.7.1 and OS X 10.7.3
Hi there, I'm having a problem building NumPy with Python 2.7.1 on OS X 10.7.3. Here is my build log: https://gist.github.com/1895377 I get a very similar error when compiling with clang, and installing a binary really isn't an option for me due to some specifics of my project. Does anyone have an idea what might be going wrong? Thanks. --patrick
[Numpy-discussion] Announcing Theano 0.5
=== Announcing Theano 0.5 ===

This is a major version, with lots of new features, bug fixes, and some interface changes (deprecated or potentially misleading features were removed). Upgrading to Theano 0.5 is recommended for everyone, but you should first make sure that your code does not raise deprecation warnings with Theano 0.4.1; otherwise, in one case the results can change. In other cases, the warnings are turned into errors (see below for details).

For those using the bleeding-edge version in the git repository, we encourage you to update to the `0.5` tag. If you have updated to 0.5rc1 or 0.5rc2, you are highly encouraged to update to 0.5, as some bugs introduced in those versions have now been fixed; see items marked with '#' in the lists below.

What's New
----------

Highlights:
* Moved to GitHub: http://github.com/Theano/Theano/
* Old Trac tickets moved to Assembla tickets: http://www.assembla.com/spaces/theano/tickets
* Theano vision: http://deeplearning.net/software/theano/introduction.html#theano-vision (Many people)
* Theano with GPU now works in some cases on Windows. Still experimental. (Sebastian Urban)
* Faster dot() call: new/better direct calls to CPU and GPU ger, gemv, gemm and dot(vector, vector). (James, Frédéric, Pascal)
* C implementation of Alloc. (James, Pascal)
* theano.grad() now also works with sparse variables. (Arnaud)
* Macro to implement the Jacobian/Hessian with theano.tensor.{jacobian,hessian}. (Razvan)
* See the interface changes below.

Interface Behavior Changes:
* The default value of the axis parameter of theano.{max,min,argmax,argmin,max_and_argmax} is now the same as in numpy: None, i.e. operate on all dimensions of the tensor. (Frédéric Bastien, Olivier Delalleau) (This had been deprecated, with a warning, since Theano 0.3, released Nov. 23rd, 2010.)
* The output dtype of sum with input dtype [u]int* is now always [u]int64. You can specify the output dtype with a new dtype parameter to sum; the output dtype is the one used for the summation. Previous Theano versions gave no warning about this. The consequence is that the sum is done in a dtype with more precision than before, so it could be slower but will be more resistant to overflow. This new behavior is the same as numpy's. (Olivier, Pascal)
# When using a GPU, detect faulty NVIDIA drivers. This was detected when running the Theano tests; now it is always tested. Faulty drivers result in wrong results for reduce operations. (Frederic B.)

Interface Features Removed (most were deprecated):
* The string modes FAST_RUN_NOGC and STABILIZE are no longer accepted. They were accepted only by theano.function(). Use Mode(linker='c|py_nogc') or Mode(optimizer='stabilize') instead.
* tensor.grad(cost, wrt) now always returns an object of the same type as wrt (list/tuple/TensorVariable). (Ian Goodfellow, Olivier)
* The few remaining uses of tag.shape and Join.vec_length have been removed. (Frederic)
* The .value attribute of shared variables is removed; use shared.set_value() or shared.get_value() instead. (Frederic)
* The Theano config option home is no longer used, as it was redundant with base_compiledir. If you use it, Theano will now raise an error. (Olivier D.)
* scan interface changes: (Razvan Pascanu)
  * The use of `return_steps` for specifying how many entries of the output to return has been removed. Instead, apply a subtensor to the output returned by scan to select a certain slice.
  * The inner function (that scan receives) should return its outputs and updates following this order: [outputs], [updates], [condition]. One can skip any of the three if not used, but the order has to stay unchanged.

Interface bug fix:
* Rop should in some cases have returned a list of one Theano variable, but returned the variable itself. (Razvan)

New deprecations (will be removed in Theano 0.6; a warning is generated if you use them):
* tensor.shared() renamed to tensor._shared(). You probably want to call theano.shared() instead! (Olivier D.)

Bug fixes (incorrect results):
* On the CPU, if a convolution had received explicit shape information, it was not checked at runtime. This caused wrong results if the input shape was not the one expected. (Frederic, reported by Sander Dieleman)
* Theoretical bug: in some cases GPUSum could have returned a bad value. We were not able to reproduce this problem.
  * Patterns affected ({0,1}*nb dim, 0 = no reduction on this dim, 1 = reduction on this dim): 01, 011, 0111, 010, 10, 001, 0011, 0101 (Frederic)
* Division by zero in verify_grad. This hid a bug in the grad of Images2Neibs. (James)
* theano.sandbox.neighbors.Images2Neibs grad was returning a wrong value. The grad is now disabled and returns an error. (Frederic)
* An expression of the form 1 / (exp(x) +- constant) was systematically matched to 1 /
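Two of the behavior changes above are easy to check interactively. A minimal sketch, assuming Theano 0.5 and NumPy are installed (the variable names are just for illustration):

import numpy as np
import theano
import theano.tensor as T

x = T.matrix('x')

# As of 0.5 the axis parameter defaults to None, as in NumPy: the
# reduction runs over all dimensions unless an axis is given.
f_all  = theano.function([x], T.max(x))
f_cols = theano.function([x], T.max(x, axis=0))

a = np.arange(6.0).reshape(2, 3)
print(f_all(a))   # 5.0 -- max over the whole tensor
print(f_cols(a))  # [3. 4. 5.] -- max over axis 0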
Re: [Numpy-discussion] np.longlong casts to int
Hi, On Thu, Feb 23, 2012 at 2:56 PM, Pierre Haessig pierre.haes...@crans.org wrote: On 23/02/2012 20:08, Mark Wiebe wrote: +1, I think it's good for its name to correspond to the name in C/C++, so that when people search for information on it they will find the relevant information more easily. With a bunch of NumPy-specific aliases, it just creates more hassle for everybody. I don't fully agree. First, this assumes that people are C-educated, at least a bit. I got some C education, but I have spent most of my scientific programming time sitting in front of Python, Matlab, and a bit of R (in that order). In this context, double, float, long and short are all esoteric incantations. Second, the C/C++ names are very imprecise with regard to their memory content, and sometimes platform dependent. On the other hand, float64 is very informative. Right - there is no proposal to change float64, because it's not ambiguous - it is both the IEEE binary64 floating-point format and 64 bits wide. The confusion here is over float128 - which is very occasionally IEEE binary128 and can be at least two other things (PPC twin double, and Intel 80-bit padded to 128 bits). Some of us were also surprised to find that float96 has the same precision as float128 (being an 80-bit Intel type padded to 96 bits). The renaming is an attempt to make it less confusing. Do you agree the renaming is less confusing? Do you have another proposal? Preferring 'longdouble' is intended precisely to flag to people that they may need to do some more research to find out what exactly that is. Which is correct :) Best, Matthew
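As a practical footnote, whatever the names end up being, one can check what longdouble actually means on a given platform. A small sketch:

import numpy as np

ld = np.dtype(np.longdouble)
print(ld)               # e.g. float96 on 32-bit x86, float128 on x86-64
print(ld.itemsize * 8)  # storage width in bits, padding included

# finfo reveals the actual precision: on x86 this is the 80-bit extended
# format, so nmant is 63 rather than the 112 of true IEEE binary128.
print(np.finfo(np.longdouble))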
[Numpy-discussion] Test survey that I have been putting together
Hey all, I would like to gather concrete information about NumPy users and have some data to look at regarding the user base and the features that are of interest. We have been putting together a survey that I would love feedback on from members of this list. If you have time and are interested in helping us gather information for improving NumPy, please fill out the following survey: https://www.surveymonkey.com/s/numpy_list_survey After you complete the survey, I would really appreciate any feedback on questions that could be improved, removed, or added. Once we incorporate your feedback, we will distribute the survey more broadly and will report back the main results to this list. Thank you, -Travis
Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers
For convenience, here's a link to the mailing list thread on this topic from a couple of months ago: http://thread.gmane.org/gmane.comp.python.numeric.general/47094 Drew
Re: [Numpy-discussion] Possible roadmap addendum: building better text file readers
Like others on this list, I've been confused a bit by the prolific NumPy interfaces to reading text. Would it be an idea to create some sort of object-oriented solution for this purpose?

reader = np.FileReader('my_file.txt')
reader.loadtxt()  # for backwards compat.; np.loadtxt could instantiate a reader and call this method if one wants to keep the interface
reader.very_general_and_typically_slow_reading(missing_data=True)
reader.my_files_look_like_this_plz_be_fast(fmt='%20.8e', separator=',', ncol=2)
reader.csv_read()  # same as above, but with sensible defaults
reader.lazy_read()  # returns a generator/iterator, so you can slice out a small part of a huge array, for instance, even when working with text (yes, inefficient)
reader.convert_line_by_line(myfunc)  # call myfunc line by line, letting the user easily convert to his/her format of choice: netcdf, hdf5, ... Not fast, but convenient

Another option is to create a hierarchy of readers implemented as classes. Not sure if the benefits outweigh the disadvantages. Just a crazy idea - it would at least gather all the file-reading interfaces into one place (or one object hierarchy) so folks know where to look. The whole numpy namespace is a bit cluttered, imho, and for newbies it would be beneficial to use submodules to a greater extent than today - but that's a more long-term discussion. Paul

On 23. feb. 2012, at 21:08, Travis Oliphant wrote: This is actually on my short list as well --- it just didn't make it onto the list. In fact, we have someone starting work on it this week. It is his first project, so it will take him a little time to get up to speed, but he will contact Wes, work with him, and report progress to this list. Integration with np.loadtxt is a high priority. I think loadtxt is now the 3rd or 4th text-reading interface I've seen in NumPy. I have no interest in making a new one if we can avoid it. But we do need to make it faster, with less memory overhead, for simple cases like Wes describes. -Travis

On Feb 23, 2012, at 1:53 PM, Pauli Virtanen wrote: Hi, on 23.02.2012 20:32, Wes McKinney wrote: [clip] To be clear: I'm going to do this eventually whether or not it happens in NumPy, because it's an existing problem for heavy pandas users. I see no reason why the code can't emit structured arrays too, so we might as well have a common library component that I can use in pandas and specialize to the DataFrame internal structure. If you do this, one useful aim could be to design the code such that it can be used in loadtxt, at least as a fast path for common cases. I'd really like to avoid increasing the number of APIs for text-file loading. -- Pauli Virtanen
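To make the idea concrete, here is a minimal sketch of what such an object could look like. Everything here is hypothetical -- the class name and methods come from the sketch above, not from any existing NumPy API:

import numpy as np

class FileReader(object):
    """Hypothetical unified text reader, sketching the interface above."""

    def __init__(self, fname, delimiter=None):
        self.fname = fname
        self.delimiter = delimiter

    def loadtxt(self, **kw):
        # Backwards-compatible path: delegate to the existing function.
        return np.loadtxt(self.fname, delimiter=self.delimiter, **kw)

    def lazy_read(self):
        # Generator variant: yields one parsed row at a time, so a
        # slice of a huge text file never has to fit in memory at once.
        with open(self.fname) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith('#'):
                    continue
                yield np.array(line.split(self.delimiter), dtype=float)

# Usage sketch:
# reader = FileReader('my_file.txt', delimiter=',')
# first_ten = [row for _, row in zip(range(10), reader.lazy_read())]

A real implementation would presumably dispatch to the fast C parser being discussed in this thread; the point is only that one entry point could front several reading strategies.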