Re: [Pytables-users] ANN: PyTables 3.0 final

2013-06-03 Thread Seref Arikan
Many thanks for keeping such a great piece of work up and running. I've
just seen some features in the release notes that I'm going to need in
the very near future!
Great job!

Best regards
Seref Arikan



On Sat, Jun 1, 2013 at 12:33 PM, Antonio Valentino antonio.valent...@tiscali.it wrote:

 ===========================
   Announcing PyTables 3.0.0
 ===========================

 We are happy to announce PyTables 3.0.0.

 PyTables 3.0.0 comes about 5 years after the last major release (2.0)
 and 7 months after the last stable release (2.4.0).

 This is a new major release and an important milestone for the PyTables
 project, since it provides the long-awaited support for Python 3.x, which
 has been around for 4 years.

 Almost all of the core numeric/scientific packages for Python already
 support Python 3, so we are very happy that PyTables can now provide
 this important feature as well.


 What's new
 ==========

 A short summary of main new features:

 - Since this release, PyTables provides full support for Python 3.
 - The entire code base is now more compliant with the coding style
guidelines described in PEP 8.
 - Basic support for HDF5 drivers.  It is now possible to open/create an
HDF5 file using one of the SEC2, DIRECT, LOG, WINDOWS, STDIO or CORE
drivers.
 - Basic support for in-memory image files.  An HDF5 file can be set
from or copied into a memory buffer (see the sketch after this list).
 - Implemented methods to get/set the user block size in an HDF5 file.
 - All read methods now have an optional *out* argument that allows a
pre-allocated array to be passed in to store the data.
 - Added support for floating point data types with extended
precision (Float96, Float128, Complex192 and Complex256).
 - Consistent ``create_xxx()`` signatures.  It is now possible to create
all dataset types (Array, CArray, EArray, VLArray and Table) from
existing Python objects.
 - Complete rewrite of the `nodes.filenode` module.  It is now fully
compliant with the interfaces defined in the standard `io` module.
Currently only non-buffered binary I/O is supported.
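
 Here is a minimal sketch of a few of these features in action (the file
 and node names are invented for illustration; in-memory file images
 require a sufficiently recent HDF5 library):

    import numpy as np
    import tables as tb

    # Create an HDF5 file entirely in memory with the CORE driver;
    # backing_store=0 means nothing is written to disk on close.
    f = tb.open_file("in-memory.h5", mode="w", driver="H5FD_CORE",
                     driver_core_backing_store=0)

    # Consistent create_xxx() signatures: build a dataset directly
    # from a Python object via the new *obj* keyword.
    arr = f.create_array(f.root, "x", obj=np.arange(10, dtype="float64"))

    # The new *out* argument: read into a pre-allocated array.
    buf = np.empty(10, dtype="float64")
    arr.read(out=buf)

    # The rewritten filenode module: a binary file-like object
    # stored as a node inside the HDF5 file.
    from tables.nodes import filenode
    fnode = filenode.new_node(f, where="/", name="log")
    fnode.write(b"hello world\n")
    fnode.close()

    # In-memory image support: serialize the whole file to a buffer.
    image = f.get_file_image()
    f.close()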

 Please refer to the RELEASE_NOTES document for a more detailed list of
 changes in this release.

 As always, a large number of bugs have been addressed and squashed as well.

 In case you want to know more in detail what has changed in this
 version, please refer to: http://pytables.github.io/release_notes.html

 You can download a source package with generated PDF and HTML docs, as
 well as binaries for Windows, from:
 http://sourceforge.net/projects/pytables/files/pytables/3.0.0

 For an online version of the manual, visit:
 http://pytables.github.io/usersguide/index.html


 What is it?
 ===========

 PyTables is a library for managing hierarchical datasets, designed to
 efficiently cope with extremely large amounts of data, with support for
 full 64-bit file addressing.  PyTables runs on top of the HDF5 library
 and the NumPy package to achieve maximum throughput and convenient use.
 PyTables includes OPSI, a new indexing technology that allows data
 lookups in tables exceeding 10 gigarows (10**10 rows) in less than a
 tenth of a second.
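
 For instance, an indexed in-kernel query looks roughly like this (a
 minimal sketch; the file, table and column names are invented):

    import tables as tb

    f = tb.open_file("big.h5", mode="r+")
    tbl = f.root.readings                 # hypothetical large table
    tbl.cols.pressure.create_index()      # build an OPSI index on a column
    hits = [row["pressure"] for row in tbl.where("pressure > 1000")]
    f.close()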


 Resources
 =========

 About PyTables: http://www.pytables.org

 About the HDF5 library: http://hdfgroup.org/HDF5/

 About NumPy: http://numpy.scipy.org/


 Acknowledgments
 ===============

 Thanks to the many users who provided feature improvements, patches,
 bug reports, support and suggestions.  See the ``THANKS`` file in the
 distribution package for an (incomplete) list of contributors.  Most
 especially, a lot of kudos go to the HDF5 and NumPy makers.
 Without them, PyTables simply would not exist.


 Share your experience
 =====================

 Let us know of any bugs, suggestions, gripes, kudos, etc. you may have.


 

**Enjoy data!**

-- The PyTables Developers




[Pytables-users] Chunk selection for optimized data access

2013-06-03 Thread Andreas Hilboll
Hi,

I'm storing large datasets (5760 x 2880 x ~150) in a compressed EArray
(the last dimension represents time; once per month, one more 5760x2880
slice will be appended at the end).

Now, extracting a timeseries at one index location is slow; e.g., for
four index locations it takes several seconds:

   In [19]: idx = ((5000, 600, 800, 900), (1000, 2000, 500, 1))

   In [20]: %time AA = np.vstack([_a[i,j] for i,j in zip(*idx)])
   CPU times: user 4.31 s, sys: 0.07 s, total: 4.38 s
   Wall time: 7.17 s

I have the feeling that this performance could be improved, but I'm not
sure about how to properly use the `chunkshape` parameter in my case.

Any help is greatly appreciated :)

Cheers, Andreas.



Re: [Pytables-users] Chunk selection for optimized data access

2013-06-03 Thread Andreas Hilboll
On 03.06.2013 14:43, Andreas Hilboll wrote:
 [...]

PS: If I could get significant performance gains by not using an EArray
(and therefore re-creating the whole database each month), that would
also be an option.

-- Andreas.




Re: [Pytables-users] Chunk selection for optimized data access

2013-06-03 Thread Anthony Scopatz
Hi Andreas,

First off, nothing should be this bad, but...

What is the data type of the array?  Also, are you selecting the
chunksize manually or letting PyTables figure it out?

Here are some things that you can try:

1.  Query with fancy indexing, once.  That is, rather than using a list
comprehension, just say `_a[zip(*idx)]`.

2.  Set `_a.nrowsinbuf` [1] to a much smaller value (1, 5, or 10), which is
more appropriate for pulling out individual indices.  (Both suggestions
are sketched below.)
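
A minimal sketch of both suggestions (the file and node names are
hypothetical, and whether the single fancy-indexing call is accepted
depends on the point-selection support in your PyTables version):

    import tables as tb

    f = tb.open_file("data.h5", mode="r")
    _a = f.root.a          # the EArray from the original post
    idx = ((5000, 600, 800, 900), (1000, 2000, 500, 1))

    _a.nrowsinbuf = 5      # 2. smaller internal buffer for point reads
    AA = _a[zip(*idx)]     # 1. one indexing call instead of a Python loop
    f.close()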

Lastly, it is my opinion that the iteration mechanics are slower than
they can / should be.  I have a bunch of ideas about how to make them
faster AND clean up the code base, but I won't have a ton of time to work
on them in the near term.  However, if this is something that you are
interested in, that would be great!  I'd love to help out anyone willing
to take this on.

Be Well
Anthony

1.
http://pytables.github.io/usersguide/libref/hierarchy_classes.html#tables.Leaf.nrowsinbuf


On Mon, Jun 3, 2013 at 7:45 AM, Andreas Hilboll li...@hilboll.de wrote:

 [...]



[Pytables-users] Anyone want to present at PyData Boston, July 27-28th

2013-06-03 Thread Anthony Scopatz
Hey everyone,

Leah Silen (CC'd) of NumFOCUS was wondering if anyone wanted to give a talk
or tutorial about PyTables at PyData Boston [1].

I don't think that I'll be able to make it, but I highly encourage others
to take her up on this.  This sort of thing shouldn't be too hard to put
together, since I have already assembled a repo of slides and exercises
for a 4-hour tutorial [2].  Feel free to use them!

Be Well
Anthony

1. http://pydata.org/bos2013/
2. https://github.com/scopatz/hdf5-is-for-lovers


Re: [Pytables-users] Chunk selection for optimized data access

2013-06-03 Thread Tim Burgess
My thoughts are:

- try it without any compression.  Assuming 32-bit floats, your monthly
5760 x 2880 slice is only about 65MB.  Uncompressed data may perform well,
and at the least it will give you a baseline to work from - and will help
if you are investigating IO tuning.

- I have found with CArray that the auto chunksize works fairly well.
Experiment with that chunksize and with some chunksizes that you think
are more appropriate (maybe temporal rather than spatial in your case);
see the sketch below.

On Jun 03, 2013, at 10:45 PM, Andreas Hilboll li...@hilboll.de wrote:

 [...]
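
A minimal sketch of the temporally-oriented chunkshape idea above (the
shapes mirror the original post; the file name and the 32 x 32 spatial
chunk are invented starting points to experiment with, not tuned values):

    import numpy as np
    import tables as tb

    f = tb.open_file("data.h5", mode="w")
    # Chunks that are small spatially but span the full time axis, so a
    # single-pixel timeseries read touches only one chunk per 32x32 tile.
    a = f.create_earray(f.root, "a", atom=tb.Float32Atom(),
                        shape=(5760, 2880, 0), expectedrows=150,
                        chunkshape=(32, 32, 150))
    # Monthly append of one slice; note that deep time-chunks make these
    # appends rewrite partially-filled chunks, trading write speed for
    # much faster timeseries reads.
    a.append(np.zeros((5760, 2880, 1), dtype=np.float32))
    f.close()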


Re: [Pytables-users] Chunk selection for optimized data access

2013-06-03 Thread Anthony Scopatz
Oops!  I forgot to mention CArray!


On Mon, Jun 3, 2013 at 10:35 PM, Tim Burgess timburg...@mac.com wrote:

 [...]




Re: [Pytables-users] Chunk selection for optimized data access

2013-06-03 Thread Tim Burgess
and for the record...yes, it should be much faster than 4 seconds.

    foo = np.empty([5760, 2880, 150], dtype=np.float32)
    idx = ((5000, 600, 800, 900), (1000, 2000, 500, 1))
    import time
    t0 = time.time(); bar = np.vstack([foo[i,j] for i,j in zip(*idx)]); t1 = time.time()
    print t1-t0

    0.000144004821777

On Jun 03, 2013, at 10:45 PM, Andreas Hilboll li...@hilboll.de wrote:

 [...]