
Re: [Pytables-users] Chunk selection for optimized data access

2013-06-05 Thread Antonio Valentino

Hi list,

On 05/06/2013 00:38, Anthony Scopatz wrote:
 On Tue, Jun 4, 2013 at 12:30 PM, Seref Arikan serefari...@gmail.com wrote:

 I think I've seen this in the release notes of 3.0. This is actually
 something that I'm looking into as well. So any experience/feedback about
 creating files in memory would be much appreciated.


 I think that you want to set parameters.DRIVER to H5DF_CORE [1].  I haven't
 ever used this personally, but it would be great to have an example script,
 if someone wants to write one ;)

 Be Well
 Anthony

 1.
 http://pytables.github.io/usersguide/parameter_files.html#hdf5-driver-management



There is also a small usage example in the cookbook [1]


[1] http://pytables.github.io/cookbook/inmemory_hdf5_files.html


ciao

-- 
Antonio Valentino

--
How ServiceNow helps IT people transform IT departments:
1. A cloud service to automate IT design, transition and operations
2. Dashboards that offer high-level views of enterprise services
3. A single system of record for all IT processes
http://p.sf.net/sfu/servicenow-d2d-j
___
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users

Re: [Pytables-users] Chunk selection for optimized data access

2013-06-05 Thread Seref Arikan

You would be surprised to see how convenient HDF5 can be for small-scale data
:) There are cases where one needs binary serialization of only a few
thousand items, but still wants the metadata, indexing and other nice
features provided by HDF5/PyTables.




On Wed, Jun 5, 2013 at 2:29 AM, Tim Burgess timburg...@mac.com wrote:

 I looked over the docs and it does mention that there is an option to
 throw away the 'file' rather than write it to disk.
 Not sure how to do that and can't actually think of a use case where I
 would want to :-)

Re: [Pytables-users] Chunk selection for optimized data access

2013-06-05 Thread Andreas Hilboll

On 05.06.2013 10:31, Andreas Hilboll wrote:
 Thanks Tim,
 
 I adapted your example for my use case (I'm using the EArray class,
 because I need to continuously update my database), and it works well.
 
 However, when I use this with my own data (but also creating the arrays
 like you did), I'm running into errors like "Could not wait on barrier".
 It seems like the HDF library is spawning several threads.
 
 Any idea what's going wrong? Can I somehow avoid HDF5 multithreading at
 runtime?

Update:

When setting max_blosc_threads=2 and max_numexpr_threads=2, everything
seems to work as expected (but a bit on the slow side ...). With
max_blosc_threads=4, the error pops up.

Cheers, Andreas.
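
For reference, the thread limits quoted above can be set per file, since PyTables accepts its parameters as keyword arguments to open_file. A minimal sketch, assuming PyTables 3.x with its bundled Blosc; the file name, array shape and compression settings are made up for illustration, and the in-memory driver without a backing store is used only so that nothing is written to disk. Whether these limits avoid the barrier error in a given setup is not something this sketch can show.

```python
import numpy as np
import tables

# Hypothetical file name; the thread limits are the values from the message above.
with tables.open_file('threads_demo.h5', 'w',
                      driver='H5FD_CORE', driver_core_backing_store=0,
                      max_blosc_threads=2, max_numexpr_threads=2) as h5f:
    # Blosc-compressed EArray that can grow along its first axis
    filters = tables.Filters(complevel=5, complib='blosc')
    earr = h5f.create_earray('/', 'data', tables.Float32Atom(),
                             shape=(0, 16), filters=filters)
    earr.append(np.ones((4, 16), dtype=np.float32))
    total = float(earr[:].sum())
```

The same limits can also be placed in a PyTables parameter file if they should apply to every file a program opens.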



Re: [Pytables-users] Chunk selection for optimized data access

2013-06-05 Thread Francesc Alted

On 6/5/13 11:45 AM, Andreas Hilboll wrote:
 When setting max_blosc_threads=2 and max_numexpr_threads=2, everything
 seems to work as expected (but a bit on the slow side ...). With
 max_blosc_threads=4, the error pops up.

Hmm, this looks like a bad interaction between the threads in numexpr and
Blosc.  I'm not sure why it is being triggered, because the two libraries
should execute at different times.  Is your app multi-threaded?

Although Blosc has implemented a lock to prevent this situation in
the latest releases, numexpr still lacks this protection.  As the
multithreading engine is the same for both packages, it should be
relatively easy to add lock support to numexpr too. Volunteers?

-- 
Francesc Alted



Re: [Pytables-users] Chunk selection for optimized data access

2013-06-05 Thread Francesc Alted

On 6/5/13 11:45 AM, Andreas Hilboll wrote:
 When setting max_blosc_threads=2 and max_numexpr_threads=2, everything
 seems to work as expected (but a bit on the slow side ...).

BTW, can you really notice the difference between using 1, 2 or 4 
threads?  Can you show some figures?  Just curious.

-- 
Francesc Alted



Re: [Pytables-users] Chunk selection for optimized data access

2013-06-05 Thread Anthony Scopatz

Thanks Antonio and Tim!

These are great. I think that one of these should definitely make it into
the examples/ dir.

Be Well
Anthony



Re: [Pytables-users] Chunk selection for optimized data access

2013-06-05 Thread Tim Burgess

On Jun 06, 2013, at 04:19 AM, Anthony Scopatz scop...@gmail.com wrote:

 Thanks Antonio and Tim!

 These are great. I think that one of these should definitely make it into
 the examples/ dir.

 Be Well
 Anthony

OK. I have put up a pull request with the code added.

https://github.com/PyTables/PyTables/pull/266

Cheers, Tim


Re: [Pytables-users] Chunk selection for optimized data access

2013-06-04 Thread Andreas Hilboll

On 04.06.2013 05:35, Tim Burgess wrote:
 - I have found with CArray that the auto chunksize works fairly well.
 Experiment with that chunksize and with some chunksizes that you think
 are more appropriate (maybe temporal rather than spatial in your case).

Thanks a lot, Anthony and Tim! I was able to bring the readout time down
considerably using chunkshape=(32, 32, 256) for my 5760x2880x150 array.
Now, reading times are about as fast as I expected.

The downside is that building up the database now takes a lot of
time, because I get the data in slices of 5760x2880x1. So I guess that
writing the data to disk like this causes a load of IO operations ...

My new question: Is there a way to create a file in-memory? If possible,
I could then build up my database in-memory and then, once it's done,
just copy the arrays to an on-disk file. Is that possible? If so, how?

Thanks a lot for your help!

-- Andreas.
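
The read/write trade-off described above can be made concrete by counting how many chunks one access overlaps. A rough sketch in plain Python; the helper chunks_touched is hypothetical (not part of PyTables) and assumes the selection starts on a chunk boundary.

```python
import math

def chunks_touched(selection, chunkshape):
    """Count the chunks overlapped by a block that starts on a chunk boundary.

    `selection` is the extent of the read or write along each axis.
    """
    return math.prod(math.ceil(s / c) for s, c in zip(selection, chunkshape))

chunkshape = (32, 32, 256)

# Appending one monthly 5760 x 2880 x 1 slice
write_chunks = chunks_touched((5760, 2880, 1), chunkshape)

# Reading the full time series (150 steps) at a single grid point
read_chunks = chunks_touched((1, 1, 150), chunkshape)

print(write_chunks, read_chunks)  # 16200 1
```

With this chunkshape a time series at one point lives in a single chunk, while every monthly append has to update all 180 x 90 chunks of the grid, which is a plausible reason for the slow build-up.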



Re: [Pytables-users] Chunk selection for optimized data access

2013-06-04 Thread Seref Arikan

I think I've seen this in the release notes of 3.0. This is actually
something that I'm looking into as well. So any experience/feedback about
creating files in memory would be much appreciated.

Best regards
Seref



On Tue, Jun 4, 2013 at 2:09 PM, Andreas Hilboll li...@hilboll.de wrote:

 My new question: Is there a way to create a file in-memory? If possible,
 I could then build up my database in-memory and then, once it's done,
 just copy the arrays to an on-disk file. Is that possible? If so, how?

Re: [Pytables-users] Chunk selection for optimized data access

2013-06-04 Thread Tim Burgess

I was playing around with in-memory HDF5 prior to the 3.0 release. Here's
an example based on what I was doing.
I looked over the docs and it does mention that there is an option to
throw away the 'file' rather than write it to disk.
Not sure how to do that and can't actually think of a use case where I
would want to :-)

And be wary, it is H5FD_CORE.

On Jun 05, 2013, at 08:38 AM, Anthony Scopatz scop...@gmail.com wrote:

 I think that you want to set parameters.DRIVER to H5DF_CORE [1].  I
 haven't ever used this personally, but it would be great to have an example
 script, if someone wants to write one ;)

    import numpy as np
    import tables

    CHUNKY = 30
    CHUNKX = 8640

    if __name__ == '__main__':

        # create dataset and add global attrs
        file_path = 'demofile_chunk%sx%d.h5' % (CHUNKY, CHUNKX)

        with tables.open_file(file_path, 'w',
                              title='PyTables HDF5 In-memory example',
                              driver='H5FD_CORE') as h5f:

            # dummy some data
            lats = np.empty([4320])
            lons = np.empty([8640])

            # create some simple arrays
            lat_node = h5f.create_array('/', 'lat', lats, title='latitude')
            lon_node = h5f.create_array('/', 'lon', lons, title='longitude')

            # create a 365 x 4320 x 8640 CArray of 32bit float
            shape = (365, 4320, 8640)
            atom = tables.Float32Atom(dflt=np.nan)

            # chunk into daily slices and then further chunk days
            sst_node = h5f.create_carray(h5f.root, 'sst', atom, shape,
                                         chunkshape=(1, CHUNKY, CHUNKX))

            # dummy up an ndarray
            sst = np.empty([4320, 8640], dtype=np.float32)
            sst.fill(30.0)

            # write ndarray to a 2D plane in the HDF5
            sst_node[0] = sst

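
On the option to throw the 'file' away instead of writing it to disk: as far as the PyTables parameter documentation goes, this is controlled by DRIVER_CORE_BACKING_STORE. A minimal sketch, assuming PyTables 3.x; the file name and the small array shape are made up for illustration.

```python
import os

import numpy as np
import tables

# driver_core_backing_store=0 asks the CORE driver not to save the
# in-memory image on close; 'scratch.h5' is a hypothetical name.
with tables.open_file('scratch.h5', 'w', driver='H5FD_CORE',
                      driver_core_backing_store=0) as h5f:
    node = h5f.create_carray(h5f.root, 'sst',
                             tables.Float32Atom(dflt=np.nan),
                             shape=(4, 60, 120), chunkshape=(1, 30, 120))
    node[0] = np.full((60, 120), 30.0, dtype=np.float32)
    first_plane_mean = float(node[0].mean())

# With the backing store disabled, no file is expected on disk afterwards.
on_disk = os.path.exists('scratch.h5')
```

A throwaway file of this kind could serve as a scratch area while building up data that is later copied to a regular on-disk file.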

[Pytables-users] Chunk selection for optimized data access

2013-06-03 Thread Andreas Hilboll

Hi,

I'm storing large datasets (5760 x 2880 x ~150) in a compressed EArray
(the last dimension represents time, and once per month there'll be one
more 5760x2880 array to add to the end).

Now, extracting timeseries at one index location is slow; e.g., for four
indices, it takes several seconds:

   In [19]: idx = ((5000, 600, 800, 900), (1000, 2000, 500, 1))

   In [20]: %time AA = np.vstack([_a[i,j] for i,j in zip(*idx)])
   CPU times: user 4.31 s, sys: 0.07 s, total: 4.38 s
   Wall time: 7.17 s

I have the feeling that this performance could be improved, but I'm not
sure about how to properly use the `chunkshape` parameter in my case.

Any help is greatly appreciated :)

Cheers, Andreas.

--
Get 100% visibility into Java/.NET code with AppDynamics Lite
It's a free troubleshooting tool designed for production
Get down to code-level detail for bottlenecks, with 2% overhead.
Download for free and get started troubleshooting in minutes.
http://p.sf.net/sfu/appdyn_d2d_ap2
___
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users

Re: [Pytables-users] Chunk selection for optimized data access

2013-06-03 Thread Andreas Hilboll

PS: If I could get significant performance gains by not using an EArray
and therefore re-creating the whole database each month, then this would
also be an option.

-- Andreas.



Re: [Pytables-users] Chunk selection for optimized data access

2013-06-03 Thread Anthony Scopatz

Hi Andreas,

First off, nothing should be this bad, but

What is the data type of the array?  Also are you selecting chunksize
manually or letting PyTables figure it out?

Here are some things that you can try:

1. Query with fancy indexing, once. That is, rather than using a list
comprehension, just say _a[zip(*idx)].

2. Set _a.nrowsinbuf [1] to a much smaller value (1, 5, or 10), which is
more appropriate for pulling out individual indexes.

Lastly, it is my opinion that the iteration mechanics are slower than they
can / should be.  I have a bunch of ideas about how to make them faster AND
clean up the code base but I won't have a ton of time to work on them in
the near term.  However, if this is something that you are interested in,
that would be great!  I'd love to help out anyone who was willing to take
this on.

Be Well
Anthony

1.
http://pytables.github.io/usersguide/libref/hierarchy_classes.html#tables.Leaf.nrowsinbuf
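
To illustrate the first suggestion, here is what a single point selection looks like with plain NumPy, compared with the list comprehension from the original session. The small array and the indices are stand-ins chosen for illustration; PyTables' own support for this kind of selection may be more limited than NumPy's, so the Array.__getitem__ documentation is worth checking before relying on it.

```python
import numpy as np

# Small in-memory stand-in for the (x, y, time) array; the shape is illustrative.
a = np.arange(6 * 5 * 4, dtype=np.float32).reshape(6, 5, 4)
idx = ((5, 0, 3, 2), (1, 4, 0, 2))

# One read per point, as in the original session
looped = np.vstack([a[i, j] for i, j in zip(*idx)])

# A single point selection: one index sequence per leading axis
rows, cols = idx
fancy = a[rows, cols]  # shape (4, 4): one time series per point

same = np.array_equal(looped, fancy)
```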



Re: [Pytables-users] Chunk selection for optimized data access

2013-06-03 Thread Tim Burgess

My thoughts are:

- try it without any compression. Assuming 32 bit floats, your monthly
  5760 x 2880 is only about 65MB. Uncompressed data may perform well and
  at the least it will give you a baseline to work from - and will help if
  you are investigating IO tuning.

- I have found with CArray that the auto chunksize works fairly well.
  Experiment with that chunksize and with some chunksizes that you think
  are more appropriate (maybe temporal rather than spatial in your case).

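
The size estimate above is easy to check. A quick sketch of the arithmetic, assuming 4 bytes per value (32-bit floats) and the roughly 150 time steps mentioned in the original question:

```python
nx, ny = 5760, 2880
bytes_per_value = 4  # 32-bit float

slab_bytes = nx * ny * bytes_per_value
print(slab_bytes)                           # 66355200
print(round(slab_bytes / 1e6, 1))           # 66.4 (MB)
print(round(slab_bytes / 2**20, 1))         # 63.3 (MiB)

# Whole array with about 150 monthly slices, uncompressed
print(round(slab_bytes * 150 / 1e9, 2))     # 9.95 (GB)
```

So one monthly slice is in the mid-60 MB range, and the full uncompressed array comes to roughly 10 GB.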

Re: [Pytables-users] Chunk selection for optimized data access

2013-06-03 Thread Anthony Scopatz

Oops! I forgot to mention CArray!


On Mon, Jun 3, 2013 at 10:35 PM, Tim Burgess timburg...@mac.com wrote:

 My thoughts are:

 - try it without any compression. Assuming 32 bit floats, your monthly
 5760 x 2880 is only about 65MB. Uncompressed data may perform well and at
 the least it will give you a baseline to work from - and will help if you
 are investigating IO tuning.

 - I have found with CArray that the auto chunksize works fairly well.
 Experiment with that chunksize and with some chunksizes that you think are
 more appropriate (maybe temporal rather than spatial in your case).


 On Jun 03, 2013, at 10:45 PM, Andreas Hilboll li...@hilboll.de wrote:

 On 03.06.2013 14:43, Andreas Hilboll wrote:
  Hi,
 
  I'm storing large datasets (5760 x 2880 x ~150) in a compressed EArray
  (the last dimension represents time, and once per month there'll be one
  more 5760x2880 array to add to the end).
 
  Now, extracting timeseries at one index location is slow; e.g., for four
  indices, it takes several seconds:
 
  In [19]: idx = ((5000, 600, 800, 900), (1000, 2000, 500, 1))
 
  In [20]: %time AA = np.vstack([_a[i,j] for i,j in zip(*idx)])
  CPU times: user 4.31 s, sys: 0.07 s, total: 4.38 s
  Wall time: 7.17 s
 
  I have the feeling that this performance could be improved, but I'm not
  sure about how to properly use the `chunkshape` parameter in my case.
 
  Any help is greatly appreciated :)
 
  Cheers, Andreas.

 PS: If I could get significant performance gains by not using an EArray
 and therefore re-creating the whole database each month, then this would
 also be an option.

 -- Andreas.




Re: [Pytables-users] Chunk selection for optimized data access

2013-06-03 Thread Tim Burgess

and for the record... yes, it should be much faster than 4 seconds.

    import time
    import numpy as np

    foo = np.empty([5760, 2880, 150], dtype=np.float32)
    idx = ((5000, 600, 800, 900), (1000, 2000, 500, 1))

    # Time the four point reads on a plain in-memory NumPy array.
    t0 = time.time()
    bar = np.vstack([foo[i, j] for i, j in zip(*idx)])
    t1 = time.time()
    print(t1 - t0)

    0.000144004821777


On Jun 03, 2013, at 10:45 PM, Andreas Hilboll li...@hilboll.de wrote:

On 03.06.2013 14:43, Andreas Hilboll wrote:
 Hi,

 I'm storing large datasets (5760 x 2880 x ~150) in a compressed EArray
 (the last dimension represents time, and once per month there'll be one
 more 5760x2880 array to add to the end).

 Now, extracting timeseries at one index location is slow; e.g., for four
 indices, it takes several seconds:

 In [19]: idx = ((5000, 600, 800, 900), (1000, 2000, 500, 1))

 In [20]: %time AA = np.vstack([_a[i,j] for i,j in zip(*idx)])
 CPU times: user 4.31 s, sys: 0.07 s, total: 4.38 s
 Wall time: 7.17 s

 I have the feeling that this performance could be improved, but I'm not
 sure about how to properly use the `chunkshape` parameter in my case.

 Any help is greatly appreciated :)

 Cheers, Andreas.

PS: If I could get significant performance gains by not using an EArray
and therefore re-creating the whole database each month, then this would
also be an option.

-- Andreas.
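For anyone who wants to reproduce the in-memory baseline above without allocating the full array, here is a scaled-down sketch. The array sizes and index values are hypothetical stand-ins, and the single fancy-indexing call is a NumPy feature shown only for comparison; it may not carry over unchanged to a PyTables array, whose indexing rules are more restrictive.

```python
import time
import numpy as np

# Scaled-down stand-in for the 5760 x 2880 x 150 array (hypothetical sizes).
foo = np.arange(60 * 30 * 150, dtype=np.float32).reshape(60, 30, 150)
idx = ((50, 6, 8, 9), (10, 20, 5, 1))

# Per-point reads, as in the original question.
t0 = time.perf_counter()
stacked = np.vstack([foo[i, j] for i, j in zip(*idx)])
t1 = time.perf_counter()

# The same four time series in one NumPy fancy-indexing call.
rows, cols = (list(axis) for axis in idx)
fancy = foo[rows, cols]
t2 = time.perf_counter()

print("per-point reads: %.6f s" % (t1 - t0))
print("fancy indexing:  %.6f s" % (t2 - t1))
print("result shape:", stacked.shape)
```

Both approaches return one row per requested location and one column per time step, so the timing from either can serve as the lower bound to compare on-disk reads against.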