[Numpy-discussion] Type of 1st argument in Numexpr where()
Hi all, I noticed that the set of ``where()`` functions defined by Numexpr all have a signature like ``xfxx``, i.e. the first argument is a float and the return, second and third arguments are of the same type (whatever it is). Since the first argument effectively represents a condition, wouldn't it make more sense for it to be a boolean? Booleans are already supported by Numexpr; maybe the old signatures are just a legacy from the time when Numexpr didn't support them. I have attached a patch against the latest version of Numexpr which implements this. Cheers,

PS: It seems that http://numpy.scipy.org/ still points to the old SourceForge list address.

:: Ivan Vilata i Balaguer qo http://www.carabos.com/ Cárabos Coop. V. V V Enjoy Data

Index: interp_body.c
===================================================================
--- interp_body.c	(revision 2439)
+++ interp_body.c	(working copy)
@@ -155,7 +155,7 @@
         case OP_POW_III: VEC_ARG2(i_dest = (i2 < 0) ? (1 / i1) : (long)pow(i1, i2));
         case OP_MOD_III: VEC_ARG2(i_dest = i1 % i2);
-        case OP_WHERE_IFII: VEC_ARG3(i_dest = f1 ? i2 : i3);
+        case OP_WHERE_IBII: VEC_ARG3(i_dest = b1 ? i2 : i3);
         case OP_CAST_FB: VEC_ARG1(f_dest = (long)b1);
         case OP_CAST_FI: VEC_ARG1(f_dest = (double)(i1));
@@ -175,7 +175,7 @@
         case OP_SQRT_FF: VEC_ARG1(f_dest = sqrt(f1));
         case OP_ARCTAN2_FFF: VEC_ARG2(f_dest = atan2(f1, f2));
-        case OP_WHERE_: VEC_ARG3(f_dest = f1 ? f2 : f3);
+        case OP_WHERE_FBFF: VEC_ARG3(f_dest = b1 ? f2 : f3);
         case OP_FUNC_FF: VEC_ARG1(f_dest = functions_f[arg2](f1));
         case OP_FUNC_FFF: VEC_ARG2(f_dest = functions_ff[arg3](f1, f2));
@@ -206,8 +206,8 @@
         case OP_EQ_BCC: VEC_ARG2(b_dest = (c1r == c2r && c1i == c2i) ? 1 : 0);
         case OP_NE_BCC: VEC_ARG2(b_dest = (c1r != c2r || c1i != c2i) ? 1 : 0);
-        case OP_WHERE_CFCC: VEC_ARG3(cr_dest = f1 ? c2r : c3r;
-                                     ci_dest = f1 ? c2i : c3i);
+        case OP_WHERE_CBCC: VEC_ARG3(cr_dest = b1 ? c2r : c3r;
+                                     ci_dest = b1 ? c2i : c3i);
         case OP_FUNC_CC: VEC_ARG1(ca.real = c1r; ca.imag = c1i; functions_cc[arg2](ca, ca);

Index: tests/test_numexpr.py
===================================================================
--- tests/test_numexpr.py	(revision 2439)
+++ tests/test_numexpr.py	(working copy)
@@ -186,8 +186,8 @@
     'sinh(a)',
     '2*a + (cos(3)+5)*sinh(cos(b))',
     '2*a + arctan2(a, b)',
-    'where(a, 2, b)',
-    'where((a-10).real, a, 2)',
+    'where(a != 0.0, 2, b)',
+    'where((a-10).real != 0.0, a, 2)',
     'cos(1+1)',
     '1+1',
     '1',

Index: interpreter.c
===================================================================
--- interpreter.c	(revision 2439)
+++ interpreter.c	(working copy)
@@ -45,7 +45,7 @@
     OP_DIV_III,
     OP_POW_III,
     OP_MOD_III,
-    OP_WHERE_IFII,
+    OP_WHERE_IBII,
     OP_CAST_FB,
     OP_CAST_FI,
@@ -63,7 +63,7 @@
     OP_TAN_FF,
     OP_SQRT_FF,
     OP_ARCTAN2_FFF,
-    OP_WHERE_,
+    OP_WHERE_FBFF,
     OP_FUNC_FF,
     OP_FUNC_FFF,
@@ -80,7 +80,7 @@
     OP_SUB_CCC,
     OP_MUL_CCC,
     OP_DIV_CCC,
-    OP_WHERE_CFCC,
+    OP_WHERE_CBCC,
     OP_FUNC_CC,
     OP_FUNC_CCC,
@@ -148,9 +148,9 @@
         case OP_POW_III:
             if (n == 0 || n == 1 || n == 2) return 'i';
             break;
-        case OP_WHERE_IFII:
+        case OP_WHERE_IBII:
             if (n == 0 || n == 2 || n == 3) return 'i';
-            if (n == 1) return 'f';
+            if (n == 1) return 'b';
             break;
         case OP_CAST_FB:
             if (n == 0) return 'f';
@@ -178,8 +178,9 @@
         case OP_ARCTAN2_FFF:
             if (n == 0 || n == 1 || n == 2) return 'f';
             break;
-        case OP_WHERE_:
-            if (n == 0 || n == 1 || n == 2 || n == 3) return 'f';
+        case OP_WHERE_FBFF:
+            if (n == 0 || n == 2 || n == 3) return 'f';
+            if (n == 1) return 'b';
             break;
         case OP_FUNC_FF:
             if (n == 0 || n == 1) return 'f';
@@ -217,9 +218,9 @@
         case OP_DIV_CCC:
             if (n == 0 || n == 1 || n == 2) return 'c';
             break;
-        case OP_WHERE_CFCC:
+        case OP_WHERE_CBCC:
             if (n == 0 || n == 2 || n == 3) return 'c';
-            if (n == 1) return 'f';
+            if (n == 1) return 'b';
             break;
         case OP_FUNC_CC:
             if (n == 0 || n == 1) return 'c';
@@ -1320,7 +1321,7 @@
     add_op(div_iii, OP_DIV_III);
     add_op(pow_iii, OP_POW_III);
     add_op(mod_iii, OP_MOD_III);
-    add_op(where_ifii, OP_WHERE_IFII);
+    add_op(where_ibii, OP_WHERE_IBII);
     add_op
Re: [Numpy-discussion] Type of 1st argument in Numexpr where()
Tim Hochberg (on 2006-12-20 at 09:20:01 -0700) said::

    Actually, this is on purpose. Numpy.where (and most other switching
    constructs in Python) will switch on almost anything. In particular, any
    number that is nonzero is considered True, and zero is considered False.
    By changing the signature, you're restricting where to accepting only
    booleans. Since booleans and ints can be freely cast to doubles in
    numexpr, always using float for the condition saves us a couple of
    opcodes. [...]

Yes, I understand the reasons you expose here. Now that you've brought the topic up, I'm curious about what "always using float for the condition saves us a couple of opcodes" means. Could you explain it? Just out of curiosity. :)

:: Ivan Vilata i Balaguer qo http://www.carabos.com/ Cárabos Coop. V. V V Enjoy Data signature.asc Description: Digital signature ___ Numpy-discussion mailing list Numpy-discussion@scipy.org http://projects.scipy.org/mailman/listinfo/numpy-discussion
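Tim's point about truthy conditions is easy to see with plain NumPy; a minimal sketch (the array values are made up for illustration):

```python
import numpy as np

cond = np.array([0.0, 1.5, -2.0, 0.0])   # a float "condition"
a = np.array([10, 20, 30, 40])
b = np.array([1, 2, 3, 4])

# numpy.where switches on truthiness: any nonzero float counts as True.
loose = np.where(cond, a, b)             # -> [1, 20, 30, 4]

# A boolean-only signature would just force the comparison to be explicit:
strict = np.where(cond != 0.0, a, b)     # same result
assert (loose == strict).all()
```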
[Numpy-discussion] Fixes to Numexpr under 64 bit platforms
Hi all, here you have a patch that fixes some type declaration bugs which caused Numexpr to crash on 64-bit platforms. All of them are confusions between the ``int`` and ``intp`` types, which happen to be the same on 32-bit platforms but not on 64-bit ones, so garbage values were being used as shapes and strides. The errors were easy to spot by looking at the warnings emitted by the compiler. The changes have been tested on a Dual Core AMD Opteron 270 running SuSE 10.0 x86-64 with Python 2.4 and 2.5. Have nice holidays,

:: Ivan Vilata i Balaguer qo http://www.carabos.com/ Cárabos Coop. V. V V Enjoy Data

Index: interpreter.c
===================================================================
--- interpreter.c	(revision 2465)
+++ interpreter.c	(working copy)
@@ -704,7 +704,7 @@
     rawmemsize = BLOCK_SIZE1 * (size_from_sig(constsig) + size_from_sig(tempsig));
     mem = PyMem_New(char *, 1 + n_inputs + n_constants + n_temps);
     rawmem = PyMem_New(char, rawmemsize);
-    memsteps = PyMem_New(int, 1 + n_inputs + n_constants + n_temps);
+    memsteps = PyMem_New(intp, 1 + n_inputs + n_constants + n_temps);
     if (!mem || !rawmem || !memsteps) {
         Py_DECREF(constants);
         Py_DECREF(constsig);
@@ -822,8 +822,8 @@
     int count;
     int size;
     int findex;
-    int *shape;
-    int *strides;
+    intp *shape;
+    intp *strides;
     int *index;
     char *buffer;
 };
@@ -956,7 +956,7 @@
     PyObject *output = NULL, *a_inputs = NULL;
     struct index_data *inddata = NULL;
     unsigned int n_inputs, n_dimensions = 0;
-    int shape[MAX_DIMS];
+    intp shape[MAX_DIMS];
     int i, j, size, r, pc_error;
     char **inputs = NULL;
     intp strides[MAX_DIMS]; /* clean up XXX */
@@ -1032,7 +1032,7 @@
     for (i = 0; i < n_inputs; i++) {
         PyObject *a = PyTuple_GET_ITEM(a_inputs, i);
         PyObject *b;
-        int strides[MAX_DIMS];
+        intp strides[MAX_DIMS];
         int delta = n_dimensions - PyArray_NDIM(a);
         if (PyArray_NDIM(a)) {
             for (j = 0; j < n_dimensions; j++)
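The size mismatch behind these bugs is easy to check from Python; a small sketch (not part of the patch) showing why ``int`` and ``intp`` diverge on 64-bit builds:

```python
import ctypes
import numpy as np

# ``intp`` is defined to be pointer-sized, so it always matches the
# platform's address width; a C ``int`` does not grow with it on LP64
# platforms, which is exactly the 32-bit/64-bit confusion described above.
print(np.dtype(np.intc).itemsize)  # C int: typically 4 bytes everywhere
print(np.dtype(np.intp).itemsize)  # pointer-sized: 8 on 64-bit builds

# The invariant the fix relies on:
assert np.dtype(np.intp).itemsize == ctypes.sizeof(ctypes.c_void_p)
```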
[Numpy-discussion] [ANN] PyTables 2.0 alpha2
Hi all, I'm posting this message to announce the availability of the *second alpha release of PyTables 2.0*, the new and shiny major version of PyTables. This release settles the file format used in this major version, removing the need to use pickled objects in order to store system attributes, so we expect no more changes to the on-disk format in future 2.0 releases. The storage and handling of group filters has also been streamlined. The new release also allows running the complete test suite from within Python, enables new tests and fixes some problems with test data installation, among other fixes. We expect to have the documentation revised and the API definitely settled very soon in order to release the first beta version. The official announcement follows. Enjoy data!

:: Ivan Vilata i Balaguer qo http://www.carabos.com/ Cárabos Coop. V. V V Enjoy Data

=== Announcing PyTables 2.0a2 ===

This is the second *alpha* version of PyTables 2.0. This release, although fairly stable in operation, is tagged as alpha because the API can still change a bit (though hopefully not a great deal), so it is meant basically for developers and people who want to get a taste of the new exciting features in this major version. You can download a source package of version 2.0a2 with generated PDF and HTML docs from http://www.pytables.org/download/preliminary/ You can also get the latest sources from the Subversion repository at http://pytables.org/svn/pytables/trunk/ If you are afraid of Subversion (you shouldn't be), you can always download the latest, daily updated, packed sources from http://www.pytables.org/download/snapshot/ Please bear in mind that some sections of the manual may be out of date (especially the Optimization tips chapter). The reference chapter should be fairly up to date, though.
You may also want to have an in-depth read of the ``RELEASE-NOTES.txt`` file, where you will find an entire section devoted to migrating your existing PyTables 1.x apps to the 2.0 version. You can find an HTML version of this document at http://www.pytables.org/moin/ReleaseNotes/Release_2.0a2

Changes more in depth
=====================

Improvements:

- NumPy is finally at the core! That means that PyTables no longer needs numarray in order to operate, although it continues to be supported (as well as Numeric). This also means that you should be able to run PyTables in scenarios combining Python 2.5 and 64-bit platforms (these are a source of problems with numarray/Numeric because they don't support this combination yet).

- Most of the operations in PyTables have seen noticeable speed-ups (sometimes up to 2x, as in regular Python table selections). This is a consequence of both using NumPy internally and a considerable effort in terms of refactoring and optimization of the new code.

- Numexpr has been integrated in all in-kernel selections. So, now it is possible to perform complex selections like::

      result = [ row['var3'] for row in
                 table.where('(var2 < 20) | (var1 == "sas")') ]

  or::

      complex_cond = '((%s <= col5) & (col2 <= %s)) ' \
                     '| (sqrt(col1 + 3.1*col2 + col3*col4) > 3)'
      result = [ row['var3'] for row in
                 table.where(complex_cond % (inf, sup)) ]

  and run them at full C-speed (or perhaps more, due to the cache-tuned computing kernel of Numexpr).

- Now, it is possible to get fields of the ``Row`` iterator by specifying their position, or even ranges of positions (extended slicing is supported).
For example, you can do::

    result = [ row[4] for row in table   # fetch field #4
               if row[1] < 20 ]
    result = [ row[:] for row in table   # fetch all fields
               if row['var2'] < 20 ]
    result = [ row[1::2] for row in      # fetch odd fields
               table.iterrows(2, 3000, 3) ]

in addition to the classical::

    result = [row['var3'] for row in table.where('var2 < 20')]

- ``Row`` has received a new method called ``fetch_all_fields()`` in order to easily retrieve all the fields of a row in situations like::

      [row.fetch_all_fields() for row in table.where('column1 < 0.3')]

  The difference between ``row[:]`` and ``row.fetch_all_fields()`` is that the former will return all the fields as a tuple, while the latter will return them in a NumPy void type and should be faster. Choose whichever fits your needs better.

- Now, all data that is read from disk is converted, if necessary, to the native byteorder of the hosting machine (before, this only happened with ``Table`` objects). This should help accelerate apps that have to do computations with data generated on platforms with a byteorder different from the user machine's.

- All the leaf constructors have
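The tuple-vs-void distinction described above maps onto plain NumPy types; a sketch using a hypothetical two-column table modelled as a structured array:

```python
import numpy as np

# A stand-in for a two-column PyTables table (names are illustrative):
table = np.array([(1, 2.5), (3, 4.5)],
                 dtype=[('var1', 'i4'), ('var2', 'f8')])

row = table[0]
# A structured-array row is a NumPy void scalar, the kind of object
# ``fetch_all_fields()`` returns...
assert isinstance(row, np.void)
# ...while ``row[:]`` is described as returning the fields as a plain tuple:
assert tuple(row) == (1, 2.5)
```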
[Numpy-discussion] [ANN] PyTables 2.0rc1 released
__setitem__) now doesn't make a copy of the value when the shape of the value passed is the same as that of the slice to be overwritten. This results in considerable memory savings when you are modifying disk objects with big array values.

- All leaf constructors (except for ``Array``) have received a new ``chunkshape`` argument that lets the user explicitly select the chunksizes for the underlying HDF5 datasets (only for advanced users).

- All leaf constructors have received a new parameter called ``byteorder`` that lets the user specify the byteorder of their data *on disk*. This effectively allows creating datasets with a byteorder other than the platform's native one.

- Native HDF5 datasets with ``H5T_ARRAY`` datatypes are now fully supported for reading.

- The test suites for the different packages are now installed, so you don't need a copy of the PyTables sources to run the tests. Besides, you can run the test suite from the Python console by using::

    tables.tests()

Resources
=========

Go to the PyTables web site for more details: http://www.pytables.org About the HDF5 library: http://hdf.ncsa.uiuc.edu/HDF5/ About NumPy: http://numpy.scipy.org/ To know more about the company behind the development of PyTables, see: http://www.carabos.com/

Acknowledgments
===============

Thanks to the many users who provided feature improvements, patches, bug reports, support and suggestions. See the ``THANKS`` file in the distribution package for an (incomplete) list of contributors. Many thanks also to SourceForge, who have helped to make and distribute this package! And last but not least, thanks a lot to the HDF5 and NumPy (and numarray!) makers. Without them, PyTables simply would not exist.

Share your experience
=====================

Let us know of any bugs, suggestions, gripes, kudos, etc. you may have. **Enjoy data!** -- The PyTables Team

:: Ivan Vilata i Balaguer qo http://www.carabos.com/ Cárabos Coop. V.
V V Enjoy Data
[Numpy-discussion] [ANN] PyTables & PyTables Pro 2.0 released
Announcing PyTables & PyTables Pro 2.0

PyTables is a library for managing hierarchical datasets, designed to efficiently cope with extremely large amounts of data, with support for full 64-bit file addressing. PyTables runs on top of the HDF5 library and the NumPy package to achieve maximum throughput and convenient use.

After more than one year of continuous development and about five months of alphas, betas and release candidates, we are very happy to announce that PyTables and PyTables Pro 2.0 are here. We are pretty confident that the 2.0 versions are ready to be used in production scenarios, bringing higher performance, better portability (especially in 64-bit environments) and more stability than the 1.x series. You can download a source package of PyTables 2.0 with generated PDF and HTML docs, and binaries for Windows, from http://www.pytables.org/download/stable/ For an on-line version of the manual, visit: http://www.pytables.org/docs/manual-2.0 In case you want to know in more detail what has changed in this version, have a look at ``RELEASE_NOTES.txt``. Find the HTML version of this document at: http://www.pytables.org/moin/ReleaseNotes/Release_2.0 If you are a user of PyTables 1.x, it is probably worth looking at the ``MIGRATING_TO_2.x.txt`` file, where you will find directions on migrating your existing PyTables 1.x apps to the 2.0 version. You can find an HTML version of this document at http://www.pytables.org/moin/ReleaseNotes/Migrating_To_2.x

Introducing PyTables Pro 2.0

The main difference between PyTables Pro and regular PyTables is that the Pro version includes OPSI, a new indexing technology that allows performing data lookups in tables exceeding 10 gigarows (10**10 rows) in less than one tenth of a second.
With more than 15000 tests, and having passed the complete test suite on the most common platforms (Windows, Mac OS X, Linux 32-bit and Linux 64-bit), we are pretty confident that PyTables Pro 2.0 is ready to be used in production scenarios, bringing maximum stability and top performance to those users who need it. For more info about PyTables Pro, see: http://www.carabos.com/products/pytables-pro For the operational details and benchmarks, see the OPSI white paper: http://www.carabos.com/docs/OPSI-indexes.pdf

Coinciding with the publication of PyTables Pro, we are introducing an innovative liberation process that will ultimately allow releasing the PyTables Pro 2.x series as open source. You may want to know that, by buying a PyTables Pro license, you are contributing to this process. For details, see: http://www.carabos.com/liberation

New features of PyTables 2.0 series
===================================

- A complete refactoring of many, many modules in PyTables. With this, the different parts of the code are much better integrated and code redundancy is kept to a minimum. A lot of new optimizations have been included as well, making working with it a smoother experience than ever before.

- NumPy is finally at the core! That means that PyTables no longer needs numarray in order to operate, although it continues to be supported (as well as Numeric). This also means that you should be able to run PyTables in scenarios combining Python 2.5 and 64-bit platforms (these are a source of problems with numarray/Numeric because they don't support this combination as of this writing).

- Most of the operations in PyTables have seen noticeable speed-ups (sometimes up to 2x, as in regular Python table selections). This is a consequence of both using NumPy internally and a considerable effort in terms of refactoring and optimization of the new code.

- Combined conditions are finally supported for in-kernel selections.
So, now it is possible to perform complex selections like::

      result = [ row['var3'] for row in
                 table.where('(var2 < 20) | (var1 == "sas")') ]

  or::

      complex_cond = '((%s <= col5) & (col2 <= %s)) ' \
                     '| (sqrt(col1 + 3.1*col2 + col3*col4) > 3)'
      result = [ row['var3'] for row in
                 table.where(complex_cond % (inf, sup)) ]

  and run them at full C-speed (or perhaps more, due to the cache-tuned computing kernel of Numexpr, which has been integrated into PyTables).

- Now, it is possible to get fields of the ``Row`` iterator by specifying their position, or even ranges of positions (extended slicing is supported). For example, you can do::

      result = [ row[4] for row in table   # fetch field #4
                 if row[1] < 20 ]
      result = [ row[:] for row in table   # fetch all fields
                 if row['var2'] < 20 ]
      result = [ row[1::2] for row in      # fetch odd fields
                 table.iterrows(2, 3000, 3) ]

  in addition to the classical::

      result = [row['var3'] for row in
Re: [Numpy-discussion] Pickle, pytables, and sqlite - loading and saving recarray's
Gael Varoquaux (on 2007-07-20 at 11:24:34 +0200) said::

    I knew I really should put these things online, I have just been wanting
    to iron them out a bit, but it has been almost two years since I have
    touched these, so ... http://scipy.org/Cookbook/hdf5_in_Matlab

Wow, that looks like really sweet, simple and useful code. Great!

:: Ivan Vilata i Balaguer qo http://www.carabos.com/ Cárabos Coop. V. V V Enjoy Data
Re: [Numpy-discussion] Pickle, pytables, and sqlite - loading and saving recarray's
Vincent Nijs (on 2007-07-22 at 10:21:18 -0500) said::

    [...] I would assume the NULLs could be treated as missing values (?)
    Don't know about the different types in one column, however.

Maybe a masked array would do the trick, with the NULL values masked out.

:: Ivan Vilata i Balaguer qo http://www.carabos.com/ Cárabos Coop. V. V V Enjoy Data
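A minimal sketch of that masked-array idea with ``numpy.ma`` (the column values and the ``None``-for-NULL convention are made up for illustration):

```python
import numpy as np
import numpy.ma as ma

# A column as it might come back from a database, None standing for NULL:
raw = [1.5, None, 3.0, None]

data = np.array([x if x is not None else 0.0 for x in raw])
mask = np.array([x is None for x in raw])
col = ma.masked_array(data, mask=mask)

# Computations simply skip the masked (NULL) entries:
print(col.mean())   # -> 2.25
print(col.count())  # -> 2 valid values
```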
[Numpy-discussion] [ANN] Release of the first PyTables video
= Release of the first PyTables video =

`Carabos http://www.carabos.com/`_ is very proud to announce the first of a series of videos dedicated to introducing the main features of PyTables to the public in a visual and easy-to-grasp manner. http://www.carabos.com/videos/pytables-1-intro

`PyTables http://www.pytables.org/`_ is a Free/Open Source package designed to handle massive amounts of data in a simple but highly efficient way, using the HDF5 file format and NumPy data containers. This first video is an introductory overview of PyTables, covering the following topics:

* HDF5 file creation
* the object tree
* homogeneous array storage
* natural naming
* working with attributes

With a running length of a little more than 10 minutes, you may sit back and watch it during any short break. More videos about PyTables will be published in the near future. Stay tuned to www.pytables.org for the announcement of new videos. We would like to hear your opinion on the video so we can do better next time. We are also open to suggestions for the topics of future videos. Best regards,

:: Ivan Vilata i Balaguer qo http://www.carabos.com/ Cárabos Coop. V. V V Enjoy Data
Re: [Numpy-discussion] [ANN] Release of the first PyTables video
Steve Lianoglou (on 2007-11-17 at 10:10:15 -0500) said::

    `Carabos http://www.carabos.com/`_ is very proud to announce the
    first of a series of videos dedicated to introducing the main features
    of PyTables to the public in a visual and easy to grasp manner.

    I just got a chance to watch the video and wanted to thank you for
    putting that together.

Oh, it's a pleasure to hear that people enjoyed our video, especially since it's our very first published attempt at screencasts. :)

    I've always been meaning to check out PyTables but haven't had the time
    to figure out how to work it in to potentially replace my hacked-together
    data storage schemes, so these videos are a great help.

Though we put a lot of effort into the written tutorials, we also thought that watching an interactive session would be an incomparable way of showing how it really feels to work with PyTables. It's also quite fun, so I recommend that both developers and users try recording and publishing their own videos of their favorite tools! :)

    Looking forward to your next video!

We're currently working on it and it may get released in a couple of weeks or so. We'll probably only announce it on the PyTables lists and on our websites, to avoid flooding other lists with announcements. Best regards,

:: Ivan Vilata i Balaguer qo http://www.carabos.com/ Cárabos Coop. V. V V Enjoy Data
[Numpy-discussion] ANN: PyTables & PyTables Pro 2.0.2 are out!
Hi everyone, We at Carabos are happy to announce the simultaneous release of the new 2.0.2 versions of both PyTables and PyTables Pro. They are mainly bugfix releases, and users of previous versions are encouraged to upgrade. And now the official announcement:

Announcing PyTables and PyTables Pro 2.0.2

PyTables is a library for managing hierarchical datasets, designed to efficiently cope with extremely large amounts of data, with support for full 64-bit file addressing. PyTables runs on top of the HDF5 library and the NumPy package to achieve maximum throughput and convenient use. PyTables Pro adds OPSI, a powerful indexing engine for executing very fast queries on large tables.

In this version some bugs have been fixed, the most important being a problem when moving or renaming a group. Some small improvements have been added as well. Besides, a *critical* bug has been fixed in the Pro version (the problem arose when doing repeated queries using the same index). Because of this, an upgrade is strongly recommended.

In case you want to know in more detail what has changed in this version, have a look at ``RELEASE_NOTES.txt``. Find the HTML version of this document at: http://www.pytables.org/moin/ReleaseNotes/Release_2.0.2 You can download a source package of version 2.0.2 with generated PDF and HTML docs, and binaries for Windows, from http://www.pytables.org/download/stable/ For an on-line version of the manual, visit: http://www.pytables.org/docs/manual-2.0.2

Migration Notes for PyTables 1.x users
======================================

If you are a user of PyTables 1.x, it is probably worth looking at the ``MIGRATING_TO_2.x.txt`` file, where you will find directions on migrating your existing PyTables 1.x apps to the 2.x versions.
You can find an HTML version of this document at http://www.pytables.org/moin/ReleaseNotes/Migrating_To_2.x

Resources
=========

Go to the PyTables web site for more details: http://www.pytables.org About the HDF5 library: http://hdfgroup.org/HDF5/ About NumPy: http://numpy.scipy.org/ To know more about the company behind the development of PyTables, see: http://www.carabos.com/

Acknowledgments
===============

Thanks to the many users who provided feature improvements, patches, bug reports, support and suggestions. See the ``THANKS`` file in the distribution package for an (incomplete) list of contributors. Many thanks also to SourceForge, who have helped to make and distribute this package! And last but not least, thanks a lot to the HDF5 and NumPy (and numarray!) makers. Without them, PyTables simply would not exist.

Share your experience
=====================

Let us know of any bugs, suggestions, gripes, kudos, etc. you may have. **Enjoy data!** -- The PyTables Team

:: Ivan Vilata i Balaguer qo http://www.carabos.com/ Cárabos Coop. V. V V Enjoy Data
Re: [Numpy-discussion] Loading a GB file into array
Martin Spacek (on 2007-11-30 at 00:47:41 -0800) said::

    [...] I find that if I load the file in two pieces into two arrays, say
    1GB and 0.3GB respectively, I can avoid the memory error. So it seems
    that it's not that windows can't allocate the memory, just that it can't
    allocate enough contiguous memory. I'm OK with this, but for indexing
    convenience, I'd like to be able to treat the two arrays as if they were
    one. Specifically, this file is movie data, and the array I'd like to get
    out of this is of shape (nframes, height, width). [...]

Well, one thing you could do is dump your data into a PyTables_ ``CArray`` dataset, which you may afterwards access as if it were a NumPy array, getting slices that are actually NumPy arrays. PyTables has no problem working with datasets exceeding memory size. For instance::

    h5f = tables.openFile('foo.h5', 'w')
    carray = h5f.createCArray( '/', 'bar', atom=tables.UInt8Atom(),
                               shape=(TOTAL_NROWS, 3) )
    base = 0
    for array in your_list_of_partial_arrays:
        carray[base:base+len(array)] = array
        base += len(array)
    carray.flush()

    # Now you can access ``carray`` as a NumPy array.
    carray[42]     --> a (3,) uint8 NumPy array
    carray[10:20]  --> a (10, 3) uint8 NumPy array
    carray[42,2]   --> a NumPy uint8 scalar, width for row 42

(You may use an ``EArray`` dataset if you want to enlarge it with new rows afterwards, or a ``Table`` if you want a different type for each field.)

.. _PyTables: http://www.pytables.org/

HTH,

:: Ivan Vilata i Balaguer qo http://www.carabos.com/ Cárabos Coop. V. V V Enjoy Data
Re: [Numpy-discussion] Loading a GB file into array
Ivan Vilata i Balaguer (on 2007-11-30 at 19:19:38 +0100) said::

    Well, one thing you could do is dump your data into a PyTables_
    ``CArray`` dataset, which you may afterwards access as if it was a NumPy
    array to get slices which are actually NumPy arrays. PyTables datasets
    have no problem in working with datasets exceeding memory size. [...]

I've put together the simple script I've attached, which dumps a binary file into a PyTables ``CArray`` or loads it back, measuring the time taken to load each frame. I've run it on my laptop, which has a not very fast 4200 RPM laptop hard disk, and I've reached average times of 16 ms per frame, after dropping caches with::

    # sync
    # echo 1 > /proc/sys/vm/drop_caches

This I've done with the standard chunkshape and no compression. Your data may lend itself very well to bigger chunkshapes and compression, which should lower access times even further. Since (as David pointed out) 200 Hz may be a little exaggerated for the human eye, loading individual frames from disk may prove more than enough for your problem. HTH,

:: Ivan Vilata i Balaguer qo http://www.carabos.com/ Cárabos Coop. V. V V Enjoy Data

from __future__ import with_statement
from time import time
from contextlib import nested

import numpy as np
from tables import openFile, UInt8Atom, Filters

width, height = 640, 480  # 300 KiB per (greyscale) frame

def dump_frames_1(npfname, h5fname, nframes):
    """Dump `nframes` frames to a ``CArray`` dataset."""
    with nested(file(npfname, 'rb'), openFile(h5fname, 'w')) as (npf, h5f):
        frames = h5f.createCArray(
            '/', 'frames', atom=UInt8Atom(),
            shape=(nframes, height, width),
            chunkshape=(1, height/2, width),
            # filters=Filters(complib='lzo'),
        )
        framesize = width * height * 1
        for framei in xrange(nframes):
            frame = np.fromfile(npf, np.uint8, count=framesize)
            frame.shape = (height, width)
            frames[framei] = frame

def dump_frames_2(npfname, h5fname, nframes):
    """Dump `nframes` frames to an ``EArray`` dataset."""
    with nested(file(npfname, 'rb'), openFile(h5fname, 'w')) as (npf, h5f):
        frames = h5f.createEArray(
            '/', 'frames', atom=UInt8Atom(),
            shape=(0, height, width), expectedrows=nframes,
            # chunkshape=(1, height/2, width),
            # filters=Filters(complib='lzo'),
        )
        framesize = width * height * 1
        for framei in xrange(nframes):
            frame = np.fromfile(npf, np.uint8, count=framesize)
            frame.shape = (1, height, width)
            frames.append(frame)

def load_frames(h5fname):
    with openFile(h5fname, 'r') as h5f:
        frames = h5f.root.frames
        nframes = len(frames)
        times = np.zeros(nframes, float)
        for framei in xrange(nframes):
            t0 = time()
            frame = frames[framei]
            t1 = time()
            times[framei] = t1 - t0
        print ( "Load times for %d frames: min=%.4f avg=%.4f max=%.4f"
                % (nframes, np.min(times), np.average(times), np.max(times)) )

if __name__ == '__main__':
    import sys
    if sys.argv[1] == 'dump':
        npfname, h5fname, nframes = sys.argv[2:]
        nframes = int(nframes, 10)
        dump_frames_1(npfname, h5fname, nframes)
    elif sys.argv[1] == 'load':
        load_frames(sys.argv[2])
    else:
        print >> sys.stderr, \
            "Usage: script dump NP_FILE H5_FILE NFRAMES" \
            "   or: script load H5_FILE"
        sys.exit(1)
Re: [Numpy-discussion] RFC: A (second) proposal for implementing some date/time types in NumPy
Francesc Alted (on 2008-07-16 at 18:44:36 +0200) said::

    After the tons of excellent feedback received for our first proposal
    about the date/time types in NumPy, Ivan and I have had another
    brainstorming session and ended up with a new proposal for your
    consideration.

After re-reading the proposal, Francesc and I found some points that needed small corrections and some clarifications or enhancements. Here you have a new version of the proposal. The changes aren't fundamental:

* Reference to POSIX-like treatment of leap seconds.
* Notes on default resolutions.
* Meaning of the stored values.
* Usage examples for the scalar constructor.
* Using an ISO 8601 string as a date value.
* Fixed str() and repr() representations.
* Note on operations with mixed resolutions.
* Other small corrections.

Thanks for the feedback!

A (second) proposal for implementing some date/time types in NumPy

:Author: Francesc Alted i Abad
:Contact: [EMAIL PROTECTED]
:Author: Ivan Vilata i Balaguer
:Contact: [EMAIL PROTECTED]
:Date: 2008-07-18

Executive summary
=================

A date/time mark is something very handy to have in many fields where one has to deal with data sets. While Python has several modules that define a date/time type (like the integrated ``datetime`` [1]_ or ``mx.DateTime`` [2]_), NumPy lacks one. In this document we propose the addition of a series of date/time types to fill this gap. The requirements for the proposed types are twofold: 1) they have to be fast to operate with, and 2) they have to be as compatible as possible with the existing ``datetime`` module that comes with Python.

Types proposed
==============

To start with, it is virtually impossible to come up with a single date/time type that fills the needs of every use case. So, after pondering different possibilities, we have settled on *two* different types, namely ``datetime64`` and ``timedelta64`` (these names are preliminary and can be changed), which can have different resolutions so as to cover different needs.
.. Important:: The resolution is conceived here as metadata that *complements* a date/time dtype, *without changing the base type*. It provides information about the *meaning* of the stored numbers, not about their *structure*.

Now follows a detailed description of the proposed types.

``datetime64``
--------------

It represents a time that is absolute (i.e. not relative). It is implemented internally as an ``int64`` type. The internal epoch is the POSIX epoch (see [3]_). Like POSIX, the representation of a date doesn't take leap seconds into account.

Resolution
~~~~~~~~~~

It accepts different resolutions, each of them implying a different time span. The table below describes the supported resolutions with their corresponding time spans.

====  ===========  ========================
Code  Meaning      Time span (years)
====  ===========  ========================
Y     year         [9.2e18 BC, 9.2e18 AD]
Q     quarter      [3.0e18 BC, 3.0e18 AD]
M     month        [7.6e17 BC, 7.6e17 AD]
W     week         [1.7e17 BC, 1.7e17 AD]
d     day          [2.5e16 BC, 2.5e16 AD]
h     hour         [1.0e15 BC, 1.0e15 AD]
m     minute       [1.7e13 BC, 1.7e13 AD]
s     second       [ 2.9e9 BC,  2.9e9 AD]
ms    millisecond  [ 2.9e6 BC,  2.9e6 AD]
us    microsecond  [290301 BC, 294241 AD]
ns    nanosecond   [  1678 AD,   2262 AD]
====  ===========  ========================

When a resolution is not provided, the default resolution of microseconds is used. The value of an absolute date is thus *an integer number of units of the chosen resolution* passed since the internal epoch.

Building a ``datetime64`` dtype
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The proposed ways to specify the resolution in the dtype constructor are:

Using parameters in the constructor::

    dtype('datetime64', res="us")  # the default res. is microseconds

Using the long string notation::

    dtype('datetime64[us]')  # equivalent to dtype('datetime64')

Using the short string notation::

    dtype('T8[us]')  # equivalent to dtype('T8')

Compatibility issues
~~~~~~~~~~~~~~~~~~~~

This will be fully compatible with the ``datetime`` class of the ``datetime`` module of Python only when using a resolution of microseconds. For other resolutions, the conversion process will lose precision or overflow as needed.
The conversion from/to a ``datetime`` object doesn't take leap seconds into account.

``timedelta64``
---------------

It represents a time that is relative (i.e. not absolute). It is implemented internally as an ``int64`` type.

Resolution
~~~~~~~~~~

It accepts different resolutions
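As a historical aside (not part of the original proposal), the bracketed-unit spelling sketched above is essentially what later landed in NumPy, minus the ``Q`` (quarter) resolution and the ``T8``/``res=`` spellings, and with ``D`` rather than ``d`` for days. A minimal sketch with today's API:

```python
import numpy as np

# Long string notation: the unit (resolution) goes in brackets.
dt_us = np.dtype('datetime64[us]')  # microsecond resolution
dt_d = np.dtype('datetime64[D]')    # day resolution ('D', not 'd', in released NumPy)

# The unit of a scalar is inferred from the precision of the ISO 8601
# string, much as discussed later in this thread.
y = np.datetime64('1970')              # year resolution
m = np.datetime64('1970-03-12T12:00')  # minute resolution

print(dt_us, y.dtype, m.dtype)
```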
Re: [Numpy-discussion] The date/time dtype and the casting issue
David Huard (on 2008-07-29 at 12:31:54 -0400) said: Silent casting is often a source of bugs and I appreciate the strict rules you want to enforce. However, I think there should be a simpler mechanism for operations between different types than creating a copy of a variable with the correct type. My suggestion is to have a dtype argument for methods such as add and subs::

    numpy.ones(3, dtype=t8[Y]).add(numpy.zeros(3, dtype=t8[fs]), dtype=t8[fs])

This way, `implicit` operations (+, -) enforce strict rules, and `explicit` operations (add, subs) let you do what you want at your own risk.

Umm, that looks like a big change (or addition) to the NumPy interface. I think similar "include a dtype argument for method X" issues have been discussed before on the list. Given the big change of adding the new explicit operation methods, I think your proposal falls beyond the scope of the project being discussed. However, since yours isn't necessarily a time-related proposal, you may ask what people think of it in a separate thread.

::
	Ivan Vilata i Balaguer   @ Intellectual Monopoly hinders Innovation! @
	http://www.selidor.net/  @     http://www.nosoftwarepatents.com/     @

___
Numpy-discussion mailing list
Numpy-discussion@scipy.org
http://projects.scipy.org/mailman/listinfo/numpy-discussion
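As an aside (not in the original thread), David's "explicit dtype" idea already has a rough analogue in NumPy's ufunc machinery: the functional forms of the operators accept a ``dtype`` argument. A small sketch using plain numeric types, since the ``t8[...]`` time dtypes here are only proposed:

```python
import numpy as np

a = np.ones(3, dtype=np.int8)
b = np.full(3, 300, dtype=np.int64)

# Implicit '+' follows NumPy's promotion rules (result is int64 here).
implicit = a + b

# Explicit: np.add is the ufunc behind '+', and it takes a dtype
# argument, letting the caller pick the type at their own risk
# (here, a lossy downcast back to int8).
explicit = np.add(a, b, dtype=np.int8)

print(implicit.dtype, explicit.dtype)
```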
Re: [Numpy-discussion] The date/time dtype and the casting issue
Tom Denniston (on 2008-07-29 at 12:21:39 -0500) said: [...] I think it would be ideal if things like the following worked::

    series = numpy.array(['1970-02-01', '1970-09-01'], dtype='datetime64[D]')
    series == '1970-02-01'
    [True, False]

I view this as similar to::

    series = numpy.array([1, 2, 3], dtype=float)
    series == 2
    [False, True, False]

1. However, numpy recognizes that an int is comparable with a float and does the float cast. I think you want the same behavior between strings that parse into dates and date arrays. Some might object that the relationship between string and date is more tenuous than float and int, which is true, but having used my own homespun date array numpy extension for over a year, I've found that the first thing I did was wrap it into an object that handles these string-date translations elegantly, and that made it infinitely more usable from an ipython session.

That may be feasible as long as there is a very clear rule for what time unit you get given a string. For instance, '1970' could yield years and '1970-03-12T12:00' minutes, but then we don't have a way of creating a time in business days... However, it looks interesting. Any more people interested in this behaviour?

2. Even more important to me, however, is the issue of date parsing. The mx library does many things badly, but it does do a great job of parsing dates of many formats. When you parse '1/1/95' or '1995-01-01' it knows that you mean 19950101, which is really nice. I believe the scipy timeseries code for parsing dates is based on it. I would highly suggest starting with that level of functionality. The one major issue with it is that an uninterpretable date doesn't throw an error but becomes whatever date is right now. That is obviously unfavorable.

Umm, that may get quite complex. E.g. does '1/2/95' refer to February 1st or January 2nd?
There are so many date formats and standards that maybe using external parser code (like mx, TimeSeries or even datetime/strptime) for them would be preferable. I think ISO 8601 is enough for basic, well-defined time string support. At least to start with.

3. Finally, my current implementation uses floats and uses nan to represent an invalid date. When you assign an element of a date array to None, it uses nan as the value. When you assign a real date, it puts in the equivalent floating point value. I have found this to be hugely beneficial and just wanted to float the idea of reserving a value to indicate the equivalent of the floating point nan. People might prefer masked arrays as a solution, but I just wanted to float the idea. [...]

Good news! Our next proposal includes a Not a Time (NaT) value, which came around due to the impossibility of converting some times into business days. Stay tuned. However, I should point out that the NaT value isn't as powerful as the floating-point NaN, since the former is completely meaningless to the hardware, and patching that in all cases would make computations quite slower. Using floating point values doesn't look like an option anymore, since they don't have a fixed precision given a time unit. Cheers,
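As a historical aside (not in the original thread), both behaviours discussed here eventually landed in NumPy: date arrays compare against ISO 8601 dates, and a ``NaT`` value marks invalid entries. A minimal sketch with today's API:

```python
import numpy as np

series = np.array(['1970-02-01', '1970-09-01'], dtype='datetime64[D]')

# A datetime64 scalar built from an ISO 8601 string is cast for the
# comparison, just as an int is cast against a float array.
mask = series == np.datetime64('1970-02-01')
print(mask)  # [ True False]

# NaT ("Not a Time") plays the role nan plays for floats.
with_gap = np.array(['1970-02-01', 'NaT'], dtype='datetime64[D]')
print(np.isnat(with_gap))  # [False  True]
```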
Re: [Numpy-discussion] The date/time dtype and the casting issue
Pierre GM (on 2008-07-29 at 12:38:19 -0400) said:

Relative time versus relative time: This case would be the same as the previous one (absolute vs absolute). Our proposal is to forbid this operation if the time units of the operands are different.

Mmh, less sure about this one. Can't we use a hierarchy of time units, and force to the lowest? For example::

    numpy.ones(3, dtype=t8[Y]) + 3*numpy.ones(3, dtype=t8[M])
    array([15, 15, 15], dtype=t8['M'])

I agree that adding ns to years makes no sense, but ns to s? min to hr or days? In short: systematically raising an exception looks a bit too drastic. There are some simple unambiguous cases that should be allowed (Y+M, Y+Q, M+Q, H+D...).

Do you mean using the most precise unit for operations with near enough, different units? I see the point, but what makes me doubt about it is giving the user the false impression that the most precise unit is *always* expected. I'd rather spare the user as many surprises as possible, by simplifying rules in favour of explicitness (but that may be debated).

Introducing a time casting function::

    change_unit(time_object, new_unit, reference)

where 'time_object' is the time object whose unit is to be changed, 'new_unit' is the desired new time unit, and 'reference' is an absolute date that will be used to allow the conversion of relative times when using time units with an uncertain number of smaller time units (relative years or months cannot be expressed in days).

reference defaults to the POSIX epoch, right? So this function could be a first step towards our problem of frequency conversion...

Note: we refused to use the ``.astype()`` method because of the additional 'time_reference' parameter, which would sound strange for other typical uses of ``.astype()``. A method would be really, really helpful, though... [...]

Yeah, but what doesn't seem to fit for me is that the method would only make sense for time values.
NumPy is pretty orthogonal in that every method and attribute applies to every type. However, if units were to be adopted by NumPy, the method would fit in well. In fact, we are thinking of adding a ``unit`` attribute to dtypes to support time units (being ``None`` for normal NumPy types). But full unit support in NumPy looks so far away that I'm not sure about adopting the method. Thanks for the insights. Cheers,
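As a historical aside (not in the original thread), the ``.astype()`` route is roughly what released NumPy ended up taking, without any ``reference`` parameter: unambiguous unit conversions work, while ambiguous combinations of relative units raise instead of guessing. A small sketch:

```python
import numpy as np

# Absolute times: moving to a finer unit is always unambiguous.
t = np.datetime64('2008-07-29T12:38')
print(t.astype('datetime64[s]'))  # same instant, second resolution

# Relative times: a year is always 12 months, so this converts cleanly.
twelve = np.timedelta64(1, 'Y').astype('timedelta64[M]')
print(twelve)  # 12 months

# But a year has no fixed number of days, so mixing Y and D in
# arithmetic raises TypeError rather than silently picking a ratio.
try:
    np.timedelta64(1, 'Y') + np.timedelta64(1, 'D')
except TypeError as e:
    print('refused:', e)
```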
Re: [Numpy-discussion] The date/time dtype and the casting issue
Pierre GM (on 2008-07-29 at 15:47:52 -0400) said: On Tuesday 29 July 2008 15:14:13 Ivan Vilata i Balaguer wrote: [the exchange above on forbidding operations between relative times with different units, quoted in full]

Let me rephrase: adding different relative time units should be allowed when there's no ambiguity in the output. For example, a relative year timedelta is always 12 month timedeltas, or 4 quarter timedeltas. In that case, I should be able to do::

    numpy.ones(3, dtype=t8[Y]) + 3*numpy.ones(3, dtype=t8[M])
    array([15, 15, 15], dtype=t8['M'])
    numpy.ones(3, dtype=t8[Y]) + 3*numpy.ones(3, dtype=t8[Q])
    array([7, 7, 7], dtype=t8['Q'])

Similarly:

* An hour is always 3600 s, so I could add relative s/ms/us/ns timedeltas to hour timedeltas, and get the result in s/ms/us/ns.
* A day is always 24 h, so I could add relative hour and day timedeltas and get an hour timedelta.
* A week is always 7 d, so W+D gives D.

However:

* We can't tell beforehand how many days are in any month, so adding relative day and month timedeltas would raise an exception.
* Same thing with weeks and months/quarters/years.

There'll be only a limited number of time units, and therefore a limited number of potential combinations between time units. It'd be just a matter of listing which ones are allowed and which ones will raise an exception.

That's "keep the precision" over "keep the range". At first Francesc and I opted for "keep the range" because that's what NumPy does, e.g. when operating on an int64 with a uint64. Then, since we weren't sure about what the best choice would be for the majority of users, we decided upon letting (or forcing) the user to be explicit. However, the use of time units and integer values is precisely intended to keep the precision, and overflow won't be so frequent given the correct time unit and the span of uint64, so you may be right in the end. :)

Note: we refused to use the ``.astype()`` method because of the additional 'time_reference' parameter, which would sound strange for other typical uses of ``.astype()``. A method would be really, really helpful, though... [...] Yeah, but what doesn't seem to fit for me is that the method would only make sense for time values.

Well, what about a ``.tounit(new_unit, reference=None)``? By default, the reference would be None and default to the POSIX epoch. We could also go for ``.totunit`` (for "to time unit").

Yes, that'd be the signature for a method. The ``reference`` argument shouldn't be allowed for ``datetime64`` values (absolute times, no ambiguities), but it should be mandatory for ``timedelta64`` ones. Sorry, but I can't see the use of having a default reference, unless one wanted to work with epoch-based deltas, which looks like an extremely particular case. Could you please show me a use case for having a reference defaulting to the POSIX epoch? Cheers,
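As a historical aside (not in the thread), Pierre's "allow it when unambiguous" rule is essentially what released NumPy implements: operands are promoted to the finer unit whenever the coarser one divides into it exactly, and incompatible pairs raise. A sketch:

```python
import numpy as np

# 1 year + 3 months: a year is always 12 months, so NumPy promotes
# to months and returns 15, just as in Pierre's example.
y_plus_m = np.timedelta64(1, 'Y') + np.timedelta64(3, 'M')
print(y_plus_m)  # 15 months

# 1 week + 1 day: a week is always 7 days, so W+D gives D.
w_plus_d = np.timedelta64(1, 'W') + np.timedelta64(1, 'D')
print(w_plus_d)  # 8 days

# Months and days have no fixed ratio, so this raises TypeError.
try:
    np.timedelta64(1, 'M') + np.timedelta64(1, 'D')
except TypeError as e:
    print('refused:', e)
```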