[Numpy-discussion] Type of 1st argument in Numexpr where()

2006-12-20 Thread Ivan Vilata i Balaguer
Hi all,

I noticed that the set of ``where()`` functions defined by Numexpr all
have a signature like ``xfxx``, i.e. the first argument is a float and
the return, second and third arguments are of the same type (whatever it
is).

Since the first argument effectively represents a condition, wouldn't it
make more sense for it to be a boolean?  Booleans are already supported
by Numexpr, maybe the old signatures are just a legacy from the time
when Numexpr didn't support them.
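
For illustration, here is a minimal sketch of the usage the patch
expects, with an explicit boolean condition (array names are made up;
``numexpr.evaluate()`` is the usual entry point)::

    import numpy as np
    import numexpr as ne

    a = np.array([1.0, -2.0, 3.0])
    b = np.array([10.0, 20.0, 30.0])

    # The condition is now a boolean expression instead of a float one:
    print ne.evaluate("where(a > 0, b, -b)")   # [ 10. -20.  30.]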

I have attached a patch to the latest version of Numexpr which
implements this.

Cheers,

PS: It seems that http://numpy.scipy.org/ still points to the old
SourceForge list address.

::

Ivan Vilata i Balaguer   qo   http://www.carabos.com/
   Cárabos Coop. V.  V  V   Enjoy Data
  
Index: interp_body.c
===================================================================
--- interp_body.c   (revision 2439)
+++ interp_body.c   (working copy)
@@ -155,7 +155,7 @@
 case OP_POW_III: VEC_ARG2(i_dest = (i2 < 0) ? (1 / i1) : (long)pow(i1, i2));
 case OP_MOD_III: VEC_ARG2(i_dest = i1 % i2);
 
-case OP_WHERE_IFII: VEC_ARG3(i_dest = f1 ? i2 : i3);
+case OP_WHERE_IBII: VEC_ARG3(i_dest = b1 ? i2 : i3);
 
 case OP_CAST_FB: VEC_ARG1(f_dest = (long)b1);
 case OP_CAST_FI: VEC_ARG1(f_dest = (double)(i1));
@@ -175,7 +175,7 @@
 case OP_SQRT_FF: VEC_ARG1(f_dest = sqrt(f1));
 case OP_ARCTAN2_FFF: VEC_ARG2(f_dest = atan2(f1, f2));
 
-case OP_WHERE_: VEC_ARG3(f_dest = f1 ? f2 : f3);
+case OP_WHERE_FBFF: VEC_ARG3(f_dest = b1 ? f2 : f3);
 
 case OP_FUNC_FF: VEC_ARG1(f_dest = functions_f[arg2](f1));
 case OP_FUNC_FFF: VEC_ARG2(f_dest = functions_ff[arg3](f1, f2));
@@ -206,8 +206,8 @@
 case OP_EQ_BCC: VEC_ARG2(b_dest = (c1r == c2r && c1i == c2i) ? 1 : 0);
 case OP_NE_BCC: VEC_ARG2(b_dest = (c1r != c2r || c1i != c2i) ? 1 : 0);
 
-case OP_WHERE_CFCC: VEC_ARG3(cr_dest = f1 ? c2r : c3r;
- ci_dest = f1 ? c2i : c3i);
+case OP_WHERE_CBCC: VEC_ARG3(cr_dest = b1 ? c2r : c3r;
+ ci_dest = b1 ? c2i : c3i);
 case OP_FUNC_CC: VEC_ARG1(ca.real = c1r;
   ca.imag = c1i;
   functions_cc[arg2](ca, ca);
Index: tests/test_numexpr.py
===================================================================
--- tests/test_numexpr.py   (revision 2439)
+++ tests/test_numexpr.py   (working copy)
@@ -186,8 +186,8 @@
   'sinh(a)',
   '2*a + (cos(3)+5)*sinh(cos(b))',
   '2*a + arctan2(a, b)',
-  'where(a, 2, b)',
-  'where((a-10).real, a, 2)',
+  'where(a != 0.0, 2, b)',
+  'where((a-10).real != 0.0, a, 2)',
   'cos(1+1)',
   '1+1',
   '1',
Index: interpreter.c
===================================================================
--- interpreter.c   (revision 2439)
+++ interpreter.c   (working copy)
@@ -45,7 +45,7 @@
 OP_DIV_III,
 OP_POW_III,
 OP_MOD_III,
-OP_WHERE_IFII,
+OP_WHERE_IBII,
 
 OP_CAST_FB,
 OP_CAST_FI,
@@ -63,7 +63,7 @@
 OP_TAN_FF,
 OP_SQRT_FF,
 OP_ARCTAN2_FFF,
-OP_WHERE_,
+OP_WHERE_FBFF,
 OP_FUNC_FF,
 OP_FUNC_FFF,
 
@@ -80,7 +80,7 @@
 OP_SUB_CCC,
 OP_MUL_CCC,
 OP_DIV_CCC,
-OP_WHERE_CFCC,
+OP_WHERE_CBCC,
 OP_FUNC_CC,
 OP_FUNC_CCC,
 
@@ -148,9 +148,9 @@
 case OP_POW_III:
 if (n == 0 || n == 1 || n == 2) return 'i';
 break;
-case OP_WHERE_IFII:
+case OP_WHERE_IBII:
 if (n == 0 || n == 2 || n == 3) return 'i';
-if (n == 1) return 'f';
+if (n == 1) return 'b';
 break;
 case OP_CAST_FB:
 if (n == 0) return 'f';
@@ -178,8 +178,9 @@
 case OP_ARCTAN2_FFF:
 if (n == 0 || n == 1 || n == 2) return 'f';
 break;
-case OP_WHERE_:
-if (n == 0 || n == 1 || n == 2 || n == 3) return 'f';
+case OP_WHERE_FBFF:
+if (n == 0 || n == 2 || n == 3) return 'f';
+if (n == 1) return 'b';
 break;
 case OP_FUNC_FF:
 if (n == 0 || n == 1) return 'f';
@@ -217,9 +218,9 @@
 case OP_DIV_CCC:
 if (n == 0 || n == 1 || n == 2) return 'c';
 break;
-case OP_WHERE_CFCC:
+case OP_WHERE_CBCC:
 if (n == 0 || n == 2 || n == 3) return 'c';
-if (n == 1) return 'f';
+if (n == 1) return 'b';
 break;
 case OP_FUNC_CC:
 if (n == 0 || n == 1) return 'c';
@@ -1320,7 +1321,7 @@
 add_op(div_iii, OP_DIV_III);
 add_op(pow_iii, OP_POW_III);
 add_op(mod_iii, OP_MOD_III);
-add_op(where_ifii, OP_WHERE_IFII);
+add_op(where_ibii, OP_WHERE_IBII);
 
 add_op

Re: [Numpy-discussion] Type of 1st argument in Numexpr where()

2006-12-20 Thread Ivan Vilata i Balaguer
Tim Hochberg (on 2006-12-20 at 09:20:01 -0700) wrote::

 Actually, this is on purpose. Numpy.where (and most other switching 
 constructs in Python) will switch on almost anything. In particular, any 
 number that is nonzero is considered True, zero is considered False. By 
 changing the signature, you're restricting where to accepting only 
 booleans. Since booleans and ints can be freely cast to doubles in 
 numexpr, always using float for the condition saves us a couple of opcodes.
 [...]

Yes, I understand the reasons you expose here.  Now that you've brought
the topic up, I'm curious about what "always using float for the
condition saves us a couple of opcodes" means.  Could you explain this?
Just out of curiosity. :)

::

Ivan Vilata i Balaguer   qo   http://www.carabos.com/
   Cárabos Coop. V.  V  V   Enjoy Data
  




[Numpy-discussion] Fixes to Numexpr under 64 bit platforms

2006-12-26 Thread Ivan Vilata i Balaguer
Hi all, here you have a patch that fixes some type declaration bugs
which caused Numexpr to crash on 64-bit platforms.  All of them are
confusions between the ``int`` and ``intp`` types, which happen to be
the same on 32-bit platforms but not on 64-bit ones, causing
garbage values to be used as shapes and strides.

The errors were easy to spot by looking at the warnings emitted by the
compiler.  The changes have been tested on a Dual Core AMD Opteron 270
running SuSE 10.0 x86-64 with Python 2.4 and 2.5.
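
As a quick illustration of the underlying mismatch (a sketch using
NumPy's type aliases, independent of the patch itself)::

    import numpy as np

    # intp is the integer type wide enough to hold a pointer; shapes
    # and strides use it.  A C int stays 4 bytes on LP64 platforms:
    print np.dtype(np.intc).itemsize   # 4 on both 32- and 64-bit
    print np.dtype(np.intp).itemsize   # 8 on 64-bit, 4 on 32-bit
    # Reading an 8-byte stride through a 4-byte int yields garbage.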

Have nice holidays,

::

Ivan Vilata i Balaguer   qo   http://www.carabos.com/
   Cárabos Coop. V.  V  V   Enjoy Data
  
Index: interpreter.c
===================================================================
--- interpreter.c   (revision 2465)
+++ interpreter.c   (working copy)
@@ -704,7 +704,7 @@
 rawmemsize = BLOCK_SIZE1 * (size_from_sig(constsig) + 
size_from_sig(tempsig));
 mem = PyMem_New(char *, 1 + n_inputs + n_constants + n_temps);
 rawmem = PyMem_New(char, rawmemsize);
-memsteps = PyMem_New(int, 1 + n_inputs + n_constants + n_temps);
+memsteps = PyMem_New(intp, 1 + n_inputs + n_constants + n_temps);
 if (!mem || !rawmem || !memsteps) {
 Py_DECREF(constants);
 Py_DECREF(constsig);
@@ -822,8 +822,8 @@
 int count;
 int size;
 int findex;
-int *shape;
-int *strides;
+intp *shape;
+intp *strides;
 int *index;
 char *buffer;
 };
@@ -956,7 +956,7 @@
 PyObject *output = NULL, *a_inputs = NULL;
 struct index_data *inddata = NULL;
 unsigned int n_inputs, n_dimensions = 0;
-int shape[MAX_DIMS];
+intp shape[MAX_DIMS];
 int i, j, size, r, pc_error;
 char **inputs = NULL;
 intp strides[MAX_DIMS]; /* clean up XXX */
@@ -1032,7 +1032,7 @@
 for (i = 0; i < n_inputs; i++) {
 PyObject *a = PyTuple_GET_ITEM(a_inputs, i);
 PyObject *b;
-int strides[MAX_DIMS];
+intp strides[MAX_DIMS];
 int delta = n_dimensions - PyArray_NDIM(a);
 if (PyArray_NDIM(a)) {
 for (j = 0; j < n_dimensions; j++)




[Numpy-discussion] [ANN] PyTables 2.0 alpha2

2007-03-01 Thread Ivan Vilata i Balaguer
Hi all,

I'm posting this message to announce the availability of the *second
alpha release of PyTables 2.0*, the new and shiny major version of
PyTables.

This release settles the file format used in this major version,
removing the need to use pickled objects in order to store system
attributes, so we expect that no more changes will happen to the on-disk
format for future 2.0 releases.  The storage and handling of group
filters has also been streamlined.  The new release also allows running
the complete test suite from within Python, enables new tests and fixes
some problems with test data installation, among other fixes.

We expect to have the documentation revised and the API definitely
settled very soon in order to release the first beta version.

The official announcement follows.  Enjoy data!

::

Ivan Vilata i Balaguer   qo   http://www.carabos.com/
   Cárabos Coop. V.  V  V   Enjoy Data
  


===========================
 Announcing PyTables 2.0a2
===========================

This is the second *alpha* version of PyTables 2.0.  This release,
although fairly stable in operation, is tagged as alpha because the API
can still change a bit (though hopefully not a great deal), so it is
meant basically for developers and people who want to get a taste of
the exciting new features in this major version.

You can download a source package of the version 2.0a2 with
generated PDF and HTML docs from
http://www.pytables.org/download/preliminary/

You can also get the latest sources from the Subversion repository at
http://pytables.org/svn/pytables/trunk/

If you are afraid of Subversion (you shouldn't), you can always download
the latest, daily updated, packed sources from
http://www.pytables.org/download/snapshot/

Please bear in mind that some sections of the manual may be obsolete
(especially the "Optimization tips" chapter).  The reference chapter
should be fairly up to date, though.

You may also want to have an in-depth read of the ``RELEASE-NOTES.txt``
file, where you will find an entire section devoted to how to migrate
your existing PyTables 1.x apps to the 2.0 version.  You can find an
HTML version of this document at
http://www.pytables.org/moin/ReleaseNotes/Release_2.0a2


Changes more in depth
=====================

Improvements:

- NumPy is finally at the core!  That means that PyTables no longer
  needs numarray in order to operate, although it continues to be
  supported (as well as Numeric).  This also means that you should be
  able to run PyTables in scenarios combining Python 2.5 and 64-bit
  platforms (these are a source of problems with numarray/Numeric
  because they don't support this combination yet).

- Most of the operations in PyTables have experienced noticeable
  speed-ups (sometimes up to 2x, as in regular Python table
  selections).  This is a consequence of both using NumPy internally and
  a considerable effort in terms of refactoring and optimization of
  the new code.

- Numexpr has been integrated in all in-kernel selections.  So, now it
  is possible to perform complex selections like::

  result = [ row['var3'] for row in
             table.where('(var2 < 20) | (var1 == "sas")') ]

  or::

  complex_cond = '((%s <= col5) & (col2 <= %s)) ' \
                 '| (sqrt(col1 + 3.1*col2 + col3*col4) > 3)'
  result = [ row['var3'] for row in
 table.where(complex_cond % (inf, sup)) ]

  and run them at full C-speed (or perhaps more, due to the cache-tuned
  computing kernel of Numexpr).

- Now, it is possible to get fields of the ``Row`` iterator by
  specifying their position, or even ranges of positions (extended
  slicing is supported).  For example, you can do::

  result = [ row[4] for row in table    # fetch field #4
             if row[1] < 20 ]
  result = [ row[:] for row in table    # fetch all fields
             if row['var2'] < 20 ]
  result = [ row[1::2] for row in       # fetch odd fields
             table.iterrows(2, 3000, 3) ]

  in addition to the classical::

  result = [row['var3'] for row in table.where('var2 < 20')]

- ``Row`` has received a new method called ``fetch_all_fields()`` in
  order to easily retrieve all the fields of a row in situations like::

  [row.fetch_all_fields() for row in table.where('column1 < 0.3')]

  The difference between ``row[:]`` and ``row.fetch_all_fields()`` is
  that the former will return all the fields as a tuple, while the
  latter will return them in a NumPy void scalar and should be
  faster.  Choose whichever fits your needs better.

- Now, all data that is read from disk is converted, if necessary, to
  the native byteorder of the hosting machine (before, this only
  happened with ``Table`` objects).  This should help to accelerate apps
  that have to do computations with data generated on platforms with a
  byteorder different from that of the user's machine.

- All the leaf constructors have

[Numpy-discussion] [ANN] PyTables 2.0rc1 released

2007-04-26 Thread Ivan Vilata i Balaguer
 __setitem__)
  now doesn't make a copy of the value in the case that the shape of the
  value passed is the same as the slice to be overwritten. This results
  in considerable memory savings when you are modifying disk objects
  with big array values.

- All leaf constructors (except for ``Array``) have received a new
  ``chunkshape`` argument that lets the user explicitly select the
  chunksizes for the underlying HDF5 datasets (only for advanced users).

- All leaf constructors have received a new parameter called
  ``byteorder`` that lets the user specify the byteorder of their data
  *on disk*.  This effectively allows creating datasets with byteorders
  other than the native one (see the sketch after this list).

- Native HDF5 datasets with ``H5T_ARRAY`` datatypes are fully supported
  for reading now.

- The test suites for the different packages are installed now, so you
  don't need a copy of the PyTables sources to run the tests.  Besides,
  you can run the test suite from the Python console by using::

   >>> tables.tests()
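
As a quick sketch of the new ``chunkshape`` and ``byteorder`` arguments
described above (file name and values are illustrative only)::

    import tables

    h5f = tables.openFile('data.h5', 'w')
    # ``chunkshape`` picks the HDF5 chunk size explicitly, while
    # ``byteorder`` sets the on-disk byte order of the data.
    carr = h5f.createCArray('/', 'x', atom=tables.Float64Atom(),
                            shape=(1000, 64),
                            chunkshape=(100, 64),
                            byteorder='big')
    h5f.close()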


Resources
=========

Go to the PyTables web site for more details:

http://www.pytables.org

About the HDF5 library:

http://hdf.ncsa.uiuc.edu/HDF5/

About NumPy:

http://numpy.scipy.org/

To know more about the company behind the development of PyTables, see:

http://www.carabos.com/


Acknowledgments
===============

Thanks to many users who provided feature improvements, patches, bug
reports, support and suggestions.  See the ``THANKS`` file in the
distribution package for an (incomplete) list of contributors.  Many
thanks also to SourceForge, who have helped to make and distribute this
package!  And last but not least, thanks a lot to the HDF5 and NumPy
(and numarray!) makers.  Without them PyTables simply would not exist.


Share your experience
=====================

Let us know of any bugs, suggestions, gripes, kudos, etc. you may
have.




  **Enjoy data!**

  -- The PyTables Team

::

Ivan Vilata i Balaguer   qo   http://www.carabos.com/
   Cárabos Coop. V.  V  V   Enjoy Data
  




[Numpy-discussion] [ANN] PyTables & PyTables Pro 2.0 released

2007-07-13 Thread Ivan Vilata i Balaguer

 Announcing PyTables & PyTables Pro 2.0


PyTables is a library for managing hierarchical datasets, designed to
efficiently cope with extremely large amounts of data with support for
full 64-bit file addressing.  PyTables runs on top of the HDF5 library
and NumPy package for achieving maximum throughput and convenient use.

After more than one year of continuous development and about five months
of alpha, beta and release candidates, we are very happy to announce
that PyTables and PyTables Pro 2.0 are here.  We are pretty
confident that the 2.0 versions are ready to be used in production
scenarios, bringing higher performance, better portability (especially
in 64-bit environments) and more stability than the 1.x series.

You can download a source package of the PyTables 2.0 with generated PDF
and HTML docs and binaries for Windows from
http://www.pytables.org/download/stable/

For an on-line version of the manual, visit:
http://www.pytables.org/docs/manual-2.0

In case you want to know more in detail what has changed in this
version, have a look at ``RELEASE_NOTES.txt``.  Find the HTML version
for this document at:
http://www.pytables.org/moin/ReleaseNotes/Release_2.0

If you are a user of PyTables 1.x, it is probably worth your while to
look at the ``MIGRATING_TO_2.x.txt`` file, where you will find directions
on how to migrate your existing PyTables 1.x apps to the 2.0 version.  You can
find an HTML version of this document at
http://www.pytables.org/moin/ReleaseNotes/Migrating_To_2.x


Introducing PyTables Pro 2.0


The main difference between PyTables Pro and regular PyTables is that
the Pro version includes OPSI, a new indexing technology that allows
performing data lookups in tables exceeding 10 gigarows (10**10 rows) in
less than one tenth of a second.  Having run more than 15000 tests and
passed the complete test suite on the most common platforms (Windows,
Mac OS X, Linux 32-bit and Linux 64-bit), we are pretty confident that
PyTables Pro 2.0 is ready to be used in production scenarios, bringing
maximum stability and top performance to those users who need it.
For more info about PyTables Pro, see:
http://www.carabos.com/products/pytables-pro
For the operational details and benchmarks see the OPSI white paper:
http://www.carabos.com/docs/OPSI-indexes.pdf

Coinciding with the publication of PyTables Pro, we are introducing an
innovative liberation process that will ultimately allow releasing the
PyTables Pro 2.x series as open source.  You may want to know that, by
buying a PyTables Pro license, you are contributing to this process.  For
details, see: http://www.carabos.com/liberation


New features of PyTables 2.0 series
===================================

- A complete refactoring of many, many modules in PyTables.  With this,
  the different parts of the code are much better integrated and code
  redundancy is kept to a minimum.  A lot of new optimizations have
  been included as well, making working with it a smoother experience
  than ever before.

- NumPy is finally at the core!  That means that PyTables no longer
  needs numarray in order to operate, although it continues to be
  supported (as well as Numeric).  This also means that you should be
  able to run PyTables in scenarios combining Python 2.5 and 64-bit
  platforms (these are a source of problems with numarray/Numeric
  because they don't support this combination as of this writing).

- Most of the operations in PyTables have experienced noticeable
  speed-ups (sometimes up to 2x, as in regular Python table
  selections).  This is a consequence of both using NumPy internally and
  a considerable effort in terms of refactoring and optimization of
  the new code.

- Combined conditions are finally supported for in-kernel selections.
  So, now it is possible to perform complex selections like::

  result = [ row['var3'] for row in
             table.where('(var2 < 20) | (var1 == "sas")') ]

  or::

  complex_cond = '((%s <= col5) & (col2 <= %s)) ' \
                 '| (sqrt(col1 + 3.1*col2 + col3*col4) > 3)'
  result = [ row['var3'] for row in
 table.where(complex_cond % (inf, sup)) ]

  and run them at full C-speed (or perhaps more, due to the cache-tuned
  computing kernel of Numexpr, which has been integrated into PyTables).

- Now, it is possible to get fields of the ``Row`` iterator by
  specifying their position, or even ranges of positions (extended
  slicing is supported).  For example, you can do::

  result = [ row[4] for row in table    # fetch field #4
             if row[1] < 20 ]
  result = [ row[:] for row in table    # fetch all fields
             if row['var2'] < 20 ]
  result = [ row[1::2] for row in       # fetch odd fields
             table.iterrows(2, 3000, 3) ]

  in addition to the classical::

  result = [row['var3'] for row in 

Re: [Numpy-discussion] Pickle, pytables, and sqlite - loading and saving recarray's

2007-07-20 Thread Ivan Vilata i Balaguer
Gael Varoquaux (on 2007-07-20 at 11:24:34 +0200) wrote::

 I knew I really should put these things online, I have just been wanting
 to iron them out a bit, but it has been almost two years since I have
 touched these, so ...
 
 http://scipy.org/Cookbook/hdf5_in_Matlab

Wow, that looks really sweet and simple, useful code.  Great!

::

Ivan Vilata i Balaguer   qo   http://www.carabos.com/
   Cárabos Coop. V.  V  V   Enjoy Data
  




Re: [Numpy-discussion] Pickle, pytables, and sqlite - loading and saving recarray's

2007-07-23 Thread Ivan Vilata i Balaguer
Vincent Nijs (on 2007-07-22 at 10:21:18 -0500) wrote::

 [...]
 I would assume the NULLs could be treated as missing values (?)  Don't know
 about the different types in one column, however.

Maybe a masked array would do the trick, with NULL values masked out.
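
A minimal sketch of that idea (column values and the NULL placeholder
are made up)::

    import numpy as np
    import numpy.ma as ma

    raw = [1.5, None, 3.0]   # NULLs coming from SQL as None
    col = np.array([np.nan if v is None else v for v in raw])
    masked = ma.masked_where(np.isnan(col), col)
    print masked.mean()      # 2.25, the masked NULL is ignored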

::

Ivan Vilata i Balaguer   qo   http://www.carabos.com/
   Cárabos Coop. V.  V  V   Enjoy Data
  




[Numpy-discussion] [ANN] Release of the first PyTables video

2007-11-14 Thread Ivan Vilata i Balaguer
=====================================
 Release of the first PyTables video
=====================================

`Carabos <http://www.carabos.com/>`_ is very proud to announce the
first of a series of videos dedicated to introducing the main features
of PyTables to the public in a visual and easy to grasp manner.

  http://www.carabos.com/videos/pytables-1-intro

`PyTables <http://www.pytables.org/>`_ is a Free/Open Source package
designed to handle massive amounts of data in a simple, but highly
efficient way, using the HDF5 file format and NumPy data containers.

This first video is an introductory overview of PyTables, covering the
following topics:

  * HDF5 file creation
  * the object tree
  * homogeneous array storage
  * natural naming
  * working with attributes

With a running length of little more than 10 minutes, you may sit back
and watch it during any short break.

More videos about PyTables will be published in the near future.  Stay
tuned on www.pytables.org for the announcement of the new videos.

We would like to hear your opinion on the video so we can do better
next time.  We are also open to suggestions for the topics of
future videos.

Best regards,

::

Ivan Vilata i Balaguer   qo   http://www.carabos.com/
   Cárabos Coop. V.  V  V   Enjoy Data
  




Re: [Numpy-discussion] [ANN] Release of the first PyTables video

2007-11-18 Thread Ivan Vilata i Balaguer
Steve Lianoglou (on 2007-11-17 at 10:10:15 -0500) wrote::

  `Carabos <http://www.carabos.com/>`_ is very proud to announce the
  first of a series of videos dedicated to introducing the main features
  of PyTables to the public in a visual and easy to grasp manner.
 
 I just got a chance to watch the video and wanted to thank you for  
 putting that together.

Oh, it's a pleasure to hear that people enjoyed our video, especially
since it's our very first published attempt at screencasts. :)

 I've always been meaning to check out PyTables but haven't had the  
 time to figure out how to work it on to potentially replace my hacked- 
 together data storage schemes, so these videos are a great help.

Though we put a lot of effort into the written tutorials, we also thought
that watching an interactive session would be an incomparable way of
showing how it really feels to work with PyTables.  It's also quite fun,
so I recommend that both developers and users try recording and
publishing videos of their own favorite tools! :)

 Looking forward to your next video!

We're currently working on it, and it may get released in a couple of
weeks or so.  We'll probably only announce it on the PyTables lists
and on our websites, to avoid flooding other lists with announcements.

Best regards,

::

Ivan Vilata i Balaguer   qo   http://www.carabos.com/
   Cárabos Coop. V.  V  V   Enjoy Data
  




[Numpy-discussion] ANN: PyTables & PyTables Pro 2.0.2 are out!

2007-11-27 Thread Ivan Vilata i Balaguer
Hi everyone,

We at Carabos are happy to announce the simultaneous release of the new
2.0.2 versions of both PyTables and PyTables Pro.  They are mainly
bugfix releases, and users of previous versions are encouraged to
upgrade.

And now, the official announcement:


 Announcing PyTables and PyTables Pro 2.0.2


PyTables is a library for managing hierarchical datasets, designed to
efficiently cope with extremely large amounts of data with support for
full 64-bit file addressing.  PyTables runs on top of the HDF5 library
and NumPy package for achieving maximum throughput and convenient use.
PyTables Pro adds OPSI, a powerful indexing engine for executing
very fast queries in large tables.

In this version some bugs have been fixed, the most important being a
problem when moving or renaming a group.  Some small improvements have
been added as well.  Besides, a *critical* bug has been fixed in the Pro
version (the problem arose when doing repeated queries using the same
index).  Because of this, an upgrade is strongly recommended.

In case you want to know more in detail what has changed in this
version, have a look at ``RELEASE_NOTES.txt``.  Find the HTML version
for this document at:
http://www.pytables.org/moin/ReleaseNotes/Release_2.0.2

You can download a source package of the version 2.0.2 with
generated PDF and HTML docs and binaries for Windows from
http://www.pytables.org/download/stable/

For an on-line version of the manual, visit:
http://www.pytables.org/docs/manual-2.0.2


Migration Notes for PyTables 1.x users
======================================

If you are a user of PyTables 1.x, it is probably worth your while to
look at the ``MIGRATING_TO_2.x.txt`` file, where you will find directions
on how to migrate your existing PyTables 1.x apps to the 2.x versions.  You can
find an HTML version of this document at
http://www.pytables.org/moin/ReleaseNotes/Migrating_To_2.x


Resources
=========

Go to the PyTables web site for more details:

http://www.pytables.org

About the HDF5 library:

http://hdfgroup.org/HDF5/

About NumPy:

http://numpy.scipy.org/

To know more about the company behind the development of PyTables, see:

http://www.carabos.com/

Acknowledgments
===============

Thanks to many users who provided feature improvements, patches, bug
reports, support and suggestions.  See the ``THANKS`` file in the
distribution package for an (incomplete) list of contributors.  Many
thanks also to SourceForge, who have helped to make and distribute this
package!  And last but not least, thanks a lot to the HDF5 and NumPy
(and numarray!) makers.  Without them, PyTables simply would not exist.

Share your experience
=====================

Let us know of any bugs, suggestions, gripes, kudos, etc. you may
have.



  **Enjoy data!**

  -- The PyTables Team

::

Ivan Vilata i Balaguer   qo   http://www.carabos.com/
   Cárabos Coop. V.  V  V   Enjoy Data
  




Re: [Numpy-discussion] Loading a GB file into array

2007-11-30 Thread Ivan Vilata i Balaguer
Martin Spacek (on 2007-11-30 at 00:47:41 -0800) wrote::

[...]
 I find that if I load the file in two pieces into two arrays, say 1GB
 and 0.3GB respectively, I can avoid the memory error. So it seems that
 it's not that windows can't allocate the memory, just that it can't
 allocate enough contiguous memory. I'm OK with this, but for indexing
 convenience, I'd like to be able to treat the two arrays as if they were
 one. Specifically, this file is movie data, and the array I'd like to
 get out of this is of shape (nframes, height, width).
[...]

Well, one thing you could do is dump your data into a PyTables_
``CArray`` dataset, which you may afterwards access as if it was a
NumPy array to get slices which are actually NumPy arrays.  PyTables
datasets have no problem in working with datasets exceeding memory size.
For instance::

  h5f = tables.openFile('foo.h5', 'w')
  carray = h5f.createCArray(
  '/', 'bar', atom=tables.UInt8Atom(), shape=(TOTAL_NROWS, 3) )
  base = 0
  for array in your_list_of_partial_arrays:
  carray[base:base+len(array)] = array
  base += len(array)
  carray.flush()

  # Now you can access ``carray`` as a NumPy array.
  carray[42]    --> a (3,) uint8 NumPy array
  carray[10:20] --> a (10, 3) uint8 NumPy array
  carray[42,2]  --> a NumPy uint8 scalar, width for row 42

(You may use an ``EArray`` dataset if you want to enlarge it with new
rows afterwards, or a ``Table`` if you want a different type for each
field.)

.. _PyTables: http://www.pytables.org/

HTH,

::

Ivan Vilata i Balaguer   qo   http://www.carabos.com/
   Cárabos Coop. V.  V  V   Enjoy Data
  




Re: [Numpy-discussion] Loading a GB file into array

2007-12-01 Thread Ivan Vilata i Balaguer
Ivan Vilata i Balaguer (on 2007-11-30 at 19:19:38 +0100) wrote::

 Well, one thing you could do is dump your data into a PyTables_
 ``CArray`` dataset, which you may afterwards access as if it was a
 NumPy array to get slices which are actually NumPy arrays.  PyTables
 datasets have no problem in working with datasets exceeding memory size.
[...]

I've put together the simple script I've attached, which dumps a binary
file into a PyTables ``CArray``, or loads it back to measure the time
taken to load each frame.  I've run it on my laptop, which has a not
very fast 4200 RPM hard disk, and I've reached average times of 16 ms
per frame, after dropping caches with::

# sync && echo 1 > /proc/sys/vm/drop_caches

This I've done with the standard chunkshape and no compression.  Your
data may lend itself very well to bigger chunkshapes and compression,
which should lower access times even further.  Since (as David pointed
out) 200 Hz may be a little exaggerated for the human eye, loading
individual frames from disk may prove more than enough for your problem.

HTH,

::

Ivan Vilata i Balaguer   qo   http://www.carabos.com/
   Cárabos Coop. V.  V  V   Enjoy Data
  
from __future__ import with_statement
from time import time
from contextlib import nested

import numpy as np
from tables import openFile, UInt8Atom, Filters


width, height = 640, 480  # 300 KiB per (greyscale) frame

def dump_frames_1(npfname, h5fname, nframes):
 """Dump `nframes` frames to a ``CArray`` dataset."""
 with nested(file(npfname, 'rb'), openFile(h5fname, 'w')) as (npf, h5f):
  frames = h5f.createCArray( '/', 'frames', atom=UInt8Atom(),
 shape=(nframes, height, width),
 chunkshape=(1, height/2, width),
 # filters=Filters(complib='lzo'),
 )
  framesize = width * height * 1
  for framei in xrange(nframes):
   frame = np.fromfile(npf, np.uint8, count=framesize)
   frame.shape = (height, width)
   frames[framei] = frame

def dump_frames_2(npfname, h5fname, nframes):
 """Dump `nframes` frames to an ``EArray`` dataset."""
 with nested(file(npfname, 'rb'), openFile(h5fname, 'w')) as (npf, h5f):
  frames = h5f.createEArray( '/', 'frames', atom=UInt8Atom(),
 shape=(0, height, width),
 expectedrows=nframes,
 # chunkshape=(1, height/2, width),
 # filters=Filters(complib='lzo'),
 )
  framesize = width * height * 1
  for framei in xrange(nframes):
   frame = np.fromfile(npf, np.uint8, count=framesize)
   frame.shape = (1, height, width)
   frames.append(frame)

def load_frames(h5fname):
 with openFile(h5fname, 'r') as h5f:
  frames = h5f.root.frames
  nframes = len(frames)
  times = np.zeros(nframes, float)
  for framei in xrange(nframes):
   t0 = time()
   frame = frames[framei]
   t1 = time()
   times[framei] = t1 - t0
 print ( "Load times for %d frames: min=%.4f avg=%.4f max=%.4f"
         % (nframes, np.min(times), np.average(times), np.max(times)) )

if __name__ == '__main__':
 import sys

 if sys.argv[1] == 'dump':
  npfname, h5fname, nframes = sys.argv[2:]
  nframes = int(nframes, 10)
  dump_frames_1(npfname, h5fname, nframes)
 elif sys.argv[1] == 'load':
  load_frames(sys.argv[2])
 else:
  print >> sys.stderr, \
"""Usage: script dump NP_FILE H5_FILE NFRAMES
   or: script load H5_FILE"""
  sys.exit(1)




Re: [Numpy-discussion] RFC: A (second) proposal for implementing some date/time types in NumPy

2008-07-18 Thread Ivan Vilata i Balaguer
Francesc Alted (on 2008-07-16 at 18:44:36 +0200) wrote::

 After tons of excellent feedback received for our first proposal about
 the date/time types in NumPy, Ivan and I have had another brainstorming
 session and ended up with a new proposal for your consideration.

After re-reading the proposal, Francesc and I found some points that
needed small corrections and some clarifications or enhancements.  Here
you have a new version of the proposal.  The changes aren't fundamental:

* Reference to POSIX-like treatment of leap seconds.
* Notes on default resolutions.
* Meaning of the stored values.
* Usage examples for scalar constructor.
* Using an ISO 8601 string as a date value.
* Fixed str() and repr() representations.
* Note on operations with mixed resolutions.
* Other small corrections.

Thanks for the feedback!




 A (second) proposal for implementing some date/time types in NumPy


:Author: Francesc Alted i Abad
:Contact: [EMAIL PROTECTED]
:Author: Ivan Vilata i Balaguer
:Contact: [EMAIL PROTECTED]
:Date: 2008-07-18


Executive summary
=================

A date/time mark is something very handy to have in many fields where
one has to deal with data sets.  While Python has several modules that
define a date/time type (like the integrated ``datetime`` [1]_ or
``mx.DateTime`` [2]_), NumPy lacks one.

In this document we propose the addition of a series of date/time
types to fill this gap.  The requirements for the proposed types are
twofold: 1) they have to be fast to operate with, and 2) they have to
be as compatible as possible with the existing ``datetime`` module that
comes with Python.


Types proposed
==

To start with, it is virtually impossible to come up with a single
date/time type that fills the needs of every use case.  So, after
pondering different possibilities, we have settled on *two*
different types, namely ``datetime64`` and ``timedelta64`` (these names
are preliminary and may change), which can have different resolutions
so as to cover different needs.

.. Important:: the resolution is conceived here as metadata that
  *complements* a date/time dtype, *without changing the base type*.  It
  provides information about the *meaning* of the stored numbers, not
  about their *structure*.

Now follows a detailed description of the proposed types.


``datetime64``
--------------

It represents a time that is absolute (i.e. not relative).  It is
implemented internally as an ``int64`` type.  The internal epoch is the
POSIX epoch (see [3]_).  Like POSIX, the representation of a date
doesn't take leap seconds into account.

Resolution
~~~~~~~~~~

It accepts different resolutions, each of them implying a different time
span.  The table below describes the resolutions supported with their
corresponding time spans.

 ======== ============ ==========================
  Resolution            Time span (years)
 --------------------- --------------------------
  Code     Meaning
 ======== ============ ==========================
    Y      year         [9.2e18 BC, 9.2e18 AC]
    Q      quarter      [3.0e18 BC, 3.0e18 AC]
    M      month        [7.6e17 BC, 7.6e17 AC]
    W      week         [1.7e17 BC, 1.7e17 AC]
    d      day          [2.5e16 BC, 2.5e16 AC]
    h      hour         [1.0e15 BC, 1.0e15 AC]
    m      minute       [1.7e13 BC, 1.7e13 AC]
    s      second       [ 2.9e9 BC,  2.9e9 AC]
    ms     millisecond  [ 2.9e6 BC,  2.9e6 AC]
    us     microsecond  [290301 BC, 294241 AC]
    ns     nanosecond   [  1678 AC,   2262 AC]
 ======== ============ ==========================

When a resolution is not provided, the default resolution of
microseconds is used.

The value of an absolute date is thus *an integer number of units of the
chosen resolution* passed since the internal epoch.
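
For example, under these semantics the stored integer for a date with
second resolution can be computed with the stdlib (a sketch)::

    import calendar, datetime

    # Seconds elapsed since the POSIX epoch, leap seconds ignored:
    value = calendar.timegm(datetime.datetime(2008, 7, 18).utctimetuple())
    print value   # 1216339200, the int64 stored for 2008-07-18T00:00:00Z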

Building a ``datetime64`` dtype
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The proposed ways to specify the resolution in the dtype constructor
are:

Using parameters in the constructor::

  dtype('datetime64', res="us")  # the default res. is microseconds

Using the long string notation::

  dtype('datetime64[us]')   # equivalent to dtype('datetime64')

Using the short string notation::

  dtype('T8[us]')   # equivalent to dtype('T8')

Compatibility issues
~~~~~~~~~~~~~~~~~~~~


This will be fully compatible with the ``datetime`` class of the
``datetime`` module of Python only when using a resolution of
microseconds.  For other resolutions, the conversion process will lose
precision or will overflow as needed.  The conversion from/to a
``datetime`` object doesn't take leap seconds into account.


``timedelta64``
---------------

It represents a time that is relative (i.e. not absolute).  It is
implemented internally as an ``int64`` type.

Resolution
~~~~~~~~~~

It accepts different resolutions

Re: [Numpy-discussion] The date/time dtype and the casting issue

2008-07-29 Thread Ivan Vilata i Balaguer
David Huard (on 2008-07-29 at 12:31:54 -0400) wrote::

 Silent casting is often a source of bugs and I appreciate the strict
 rules you want to enforce.  However, I think there should be a simpler
 mechanism for operations between different types than creating a copy
 of a variable with the correct type.
 
 My suggestion is to have a dtype argument for methods such as add and subs:
 
  numpy.ones(3, dtype="t8[Y]").add(numpy.zeros(3, dtype="t8[fs]"),
                                   dtype="t8[fs]")
 
 This way, `implicit` operations (+,-) enforce strict rules, and
 `explicit` operations (add, subs) let you do what you want at your
 own risk.

Umm, that looks like a big change (or addition) to the NumPy interface.
I think similar "include a dtype argument for method X" issues have been
discussed before on the list.  However, given the big change of adding
the new explicit operation methods, I think your proposal falls beyond
the scope of the project being discussed.

However, since yours isn't necessarily a time-related proposal, you may
want to ask what people think of it in a separate thread.

::

  Ivan Vilata i Balaguer   @ Intellectual Monopoly hinders Innovation! @
  http://www.selidor.net/  @ http://www.nosoftwarepatents.com/ @




Re: [Numpy-discussion] The date/time dtype and the casting issue

2008-07-29 Thread Ivan Vilata i Balaguer
Tom Denniston (on 2008-07-29 at 12:21:39 -0500) wrote::

 [...]
 I think it would be ideal if things like the following worked:
 
  series = numpy.array(['1970-02-01','1970-09-01'], dtype = 'datetime64[D]')
  series == '1970-02-01'
 [True, False]
 
 I view this as similar to:
 
  series = numpy.array([1,2,3], dtype=float)
  series == 2
 [False,True,False]
 
 1. However, numpy recognizes that an int is comparable with a
 float and does the float cast.  I think you want the same behavior
 between strings that parse into dates and date arrays.  Some might
 object that the relationship between string and date is more tenuous
 than float and int, which is true, but having used my own homespun
 date array numpy extension for over a year, I've found that the first
 thing I did was wrap it into an object that handles these string-date
 translations elegantly and that made it infinitely more usable from an
 ipython session.

That may be feasible as long as there is a very clear rule for what time
units you get given a string.  For instance, '1970' could yield years
and '1970-03-12T12:00' minutes, but then we don't have a way of creating
a time in business days...  However, it looks interesting.  Any more
people interested in this behaviour?
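
A hypothetical sketch of such a rule (the mapping below is made up for
illustration, not part of the proposal)::

    # Infer a proposed time unit from the granularity of an ISO 8601 string.
    def infer_time_unit(s):
        if len(s) == 4:   return 'Y'   # '1970'
        if len(s) == 7:   return 'M'   # '1970-03'
        if len(s) == 10:  return 'd'   # '1970-03-12'
        if len(s) == 16:  return 'm'   # '1970-03-12T12:00'
        if len(s) == 19:  return 's'   # '1970-03-12T12:00:00'
        return 'us'                    # anything finer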

 2. Even more important to me, however, is the issue of date parsing.
 The mx library does many things badly but it does do a great job of
 parsing dates of many formats.  When you parse '1/1/95' or '1995-01-01'
 it knows that you mean 19950101, which is really nice.  I believe the
 scipy timeseries code for parsing dates is based on it.  I would
 highly suggest starting with that level of functionality.  The one
 major issue with it is that an uninterpretable date doesn't throw an
 error but becomes whatever date is right now.  That is obviously
 unfavorable.

Umm, that may get quite complex.  E.g. does '1/2/95' refer to February
the 1st or January the 2nd?  There are so many date formats and
standards that maybe using external parser code (like mx, TimeSeries
or even datetime/strptime) for them would be preferable.  I think
ISO 8601 is enough for basic, well-defined time string support.  At
least to start with.
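
For instance, sticking to ISO 8601 keeps parsing trivial and
unambiguous (a sketch with the stdlib; rejecting everything else is an
assumption of the sketch)::

    import time

    # An ISO 8601 date has exactly one well-defined reading:
    time.strptime('1995-01-01', '%Y-%m-%d')
    # Ambiguous local formats such as '1/2/95' would simply be rejected:
    try:
        time.strptime('1/2/95', '%Y-%m-%d')
    except ValueError:
        print 'not an ISO 8601 date'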

 3. Finally, my current implementation uses floats and uses nan to represent
 an invalid date.  When you assign an element of a date array to None
 it uses nan as the value.  When you assign a real date it puts in the
 equivalent floating point value.  I have found this to be hugely
 beneficial and just wanted to float the idea of reserving a value to
 indicate the floating point equivalent of nan.  People might prefer
 masked arrays as a solution, but I just wanted to float the idea.
 [...]

Good news!  Our next proposal includes a "Not a Time" (NaT) value, which
came about due to the impossibility of converting some times into
business days.  Stay tuned.

However, I should point out that the NaT value isn't as powerful as the
floating-point NaN, since the former is completely meaningless to the
hardware, and patching that in all cases would make computations quite
a bit slower.  Using floating-point values doesn't look like an option
anymore, since they don't have a fixed precision for a given time unit.
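
The asymmetry is easy to see (a sketch; the NaT sentinel below is
hypothetical)::

    import numpy as np

    a = np.array([1.0, np.nan, 3.0])
    print a + 1.0                  # NaN propagates for free in FP hardware

    NAT = np.iinfo(np.int64).min   # a hypothetical reserved int64 sentinel
    t = np.array([10, NAT, 30], dtype=np.int64)
    print t + 1                    # the sentinel is not preserved; keeping
                                   # it would need an explicit check per op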

Cheers,

::

  Ivan Vilata i Balaguer   @ Intellectual Monopoly hinders Innovation! @
  http://www.selidor.net/  @ http://www.nosoftwarepatents.com/ @




Re: [Numpy-discussion] The date/time dtype and the casting issue

2008-07-29 Thread Ivan Vilata i Balaguer
Pierre GM (on 2008-07-29 at 12:38:19 -0400) wrote::

  Relative time versus relative time
  --
 
  This case would be the same than the previous one (absolute vs
  absolute).  Our proposal is to forbid this operation if the time units
  of the operands are different.  
 
 Mmh, less sure on this one. Can't we use a hierarchy of time units, and force 
 to the lowest ? 
 For example:
 numpy.ones(3, dtype="t8[Y]") + 3*numpy.ones(3, dtype="t8[M]")
 array([15,15,15], dtype="t8[M]")
 
 I agree that adding ns to years makes no sense, but ns to s?  min to
 hr or days?  In short: systematically raising an exception looks a
 bit too drastic.  There are some simple unambiguous cases that should
 be allowed (Y+M, Y+Q, M+Q, H+D...)

Do you mean using the most precise unit for operations with "near
enough", different units?  I see the point, but what makes me doubt
about it is giving the user the false impression that the most precise
unit is *always* expected.  I'd rather spare the user as many surprises
as possible, by simplifying rules in favour of explicitness (but that
may be debated).

  Introducing a time casting function
  ---
 
  change_unit(time_object, new_unit, reference)
 
  where 'time_object' is the time object whose unit is to be
  changed, 'new_unit' is the desired new time unit, and 'reference' is an
  absolute date that will be used to allow the conversion of relative
  times in case of using time units with an uncertain number of smaller
  time units (relative years or months cannot be expressed in days).  
 
 "reference" defaults to the POSIX epoch, right?
 So this function could be a first step towards our problem of frequency
 conversion...
 
  Note: we refused to use the ``.astype()`` method because of the
  additional 'time_reference' parameter that will sound strange for other
  typical uses of ``.astype()``.
 
 A method would be really, really helpful, though...
 [...]

Yay, but what doesn't seem to fit for me is that the method would only
make sense for time values.  NumPy is pretty orthogonal in that every
method and attribute applies to every type.  However, if units were to
be adopted by NumPy, the method would fit in well.  In fact, we are
thinking of adding a ``unit`` attribute to dtypes to support time units
(being ``None`` for normal NumPy types).  But full unit support in NumPy
looks so far away that I'm not sure about adopting the method.

Thanks for the insights.  Cheers,

::

  Ivan Vilata i Balaguer   @ Intellectual Monopoly hinders Innovation! @
  http://www.selidor.net/  @ http://www.nosoftwarepatents.com/ @




Re: [Numpy-discussion] The date/time dtype and the casting issue

2008-07-30 Thread Ivan Vilata i Balaguer
Pierre GM (on 2008-07-29 at 15:47:52 -0400) wrote::

 On Tuesday 29 July 2008 15:14:13 Ivan Vilata i Balaguer wrote:
  Pierre GM (on 2008-07-29 at 12:38:19 -0400) wrote::
Relative time versus relative time
--
   
This case would be the same than the previous one (absolute vs
absolute).  Our proposal is to forbid this operation if the time units
of the operands are different.
  
   Mmh, less sure on this one. Can't we use a hierarchy of time units, and
   force to the lowest ?
  
   For example:
   numpy.ones(3, dtype="t8[Y]") + 3*numpy.ones(3, dtype="t8[M]")
   array([15,15,15], dtype="t8[M]")
  
   I agree that adding ns to years makes no sense, but ns to s ? min to
   hr or days ?  In short: systematically raising an exception looks a
   bit too drastic. There are some simple unambiguous cases that should be
   allowed (Y+M, Y+Q, M+Q, H+D...)
 
  Do you mean using the most precise unit for operations with near
  enough, different units?  I see the point, but what makes me doubt
  about it is giving the user the false impression that the most precise
  unit is *always* expected.  I'd rather spare the user as many surprises
  as possible, by simplifying rules in favour of explicitness (but that
  may be debated).
 
 Let me rephrase:
 Adding different relative time units should be allowed when there's no 
 ambiguity in the output:
 For example, a relative year timedelta is always 12 month timedeltas, or 4 
 quarter timedeltas. In that case, I should be able to do:
 
 numpy.ones(3, dtype="t8[Y]") + 3*numpy.ones(3, dtype="t8[M]")
 array([15,15,15], dtype="t8[M]")
 numpy.ones(3, dtype="t8[Y]") + 3*numpy.ones(3, dtype="t8[Q]")
 array([7,7,7], dtype="t8[Q]")
 
 Similarly:
 * An hour is always 3600s, so I could add relative s/ms/us/ns timedeltas to
 hour timedeltas, and get the result in s/ms/us/ns.
 * A day is always 24h, so I could add relative hours and days timedeltas and
 get an hour timedelta.
 * A week is always 7d, so W+D -> D.
 
 However:
 * We can't tell beforehand how many days are in any month, so adding relative
 days and months would raise an exception.
 * Same thing with weeks and months/quarters/years
 
 There'll be only a limited number of time units, and therefore a limited
 number of potential combinations between time units.  It'd be just a matter
 of listing which ones are allowed and which ones will raise an exception.

That's "keep the precision" over "keep the range".  At first Francesc
and I opted for "keep the range" because that's what NumPy does, e.g.
when operating an int64 with an uint64.  Then, since we weren't sure
about what the best choice would be for the majority of users, we
decided upon letting (or forcing) the user to be explicit.  However, the
use of time units and integer values is precisely intended to keep the
precision, and overflow won't be so frequent given the correct time
unit and the span of uint64, so you may be right in the end. :)

Note: we refused to use the ``.astype()`` method because of the
additional 'time_reference' parameter that will sound strange for other
typical uses of ``.astype()``.
  
   A method would be really, really helpful, though...
   [...]
 
   Yay, but what doesn't seem to fit for me is that the method would only
   make sense for time values.
 
 Well, what about a .tounit(new_unit, reference=None)?
 By default, the reference would be None and default to the POSIX epoch.
 We could also go for .totunit (for "to time unit").

Yes, that'd be the signature for a method.  The ``reference`` argument
shouldn't be allowed for ``datetime64`` values (absolute times, no
ambiguities) but it should be mandatory for ``timedelta64`` ones.
Sorry, but I can't see the use of having a default reference, unless one
wanted to work with Epoch-based deltas, which looks like an extremely
particular case.  Could you please show me a use case for having a
reference defaulting to the POSIX epoch?
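
To see why the reference is mandatory for relative times, here is a
small self-contained sketch (the helper is hypothetical and merely
mirrors the proposed semantics)::

    import datetime

    def months_to_days(n_months, reference):
        # Converting a relative delta of months into days needs an
        # absolute reference date to count from.
        months = reference.month - 1 + n_months
        end = reference.replace(year=reference.year + months // 12,
                                month=months % 12 + 1)
        return (end - reference).days

    print months_to_days(1, datetime.date(2008, 1, 1))  # 31
    print months_to_days(1, datetime.date(2008, 2, 1))  # 29 (leap year)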

Cheers,

::

  Ivan Vilata i Balaguer   @ Intellectual Monopoly hinders Innovation! @
  http://www.selidor.net/  @ http://www.nosoftwarepatents.com/ @

