Re: [Pytables-users] Storing large images in PyTable

2013-07-04 Thread Tim Burgess

Hi Mathieu,

As Anthony indicates, it's hard to discern the exact issue when you don't 
provide much in the way of code to look at.

If it helps, here is an example of creating an HDF5 file with a float32 array of
the dimensions you specified. Note that the shape value should be a tuple.

>>> import numpy as np
>>> import tables
>>> x = np.random.random((121, 145, 121))
>>> x.shape
(121, 145, 121)
>>> f = tables.openFile('patient.h5', 'w')
>>> atom = tables.Float32Atom()
>>> image1 = f.createCArray(f.root, 'image1', atom, x.shape)
>>> image1[:] = x
>>> f.close()

You could have an HDF5 file per patient and keep the three float32 arrays and 
the CSV data as separate nodes.
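For example, one per-patient file might look like the sketch below; the node names, the
SubjectInfo columns and 'patient001.h5' are just placeholders, not a recommendation.

import numpy as np
import tables

IMAGE_SHAPE = (121, 145, 121)

# placeholder description for the per-subject CSV row
class SubjectInfo(tables.IsDescription):
    Id  = tables.UInt16Col()
    Age = tables.UInt8Col()

with tables.openFile('patient001.h5', 'w') as f:
    atom = tables.Float32Atom()
    # one CArray node per image
    for name in ('image1', 'image2', 'image3'):
        f.createCArray(f.root, name, atom, IMAGE_SHAPE)
    # one small table holding the subject's CSV row
    info = f.createTable(f.root, 'info', SubjectInfo)
    row = info.row
    row['Id'] = 1
    row['Age'] = 42
    row.append()
    info.flush()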

I would suggest that you have the HDF5 structure and resulting PyTables code 
worked out before thinking about how to wrap it in an object.

Cheers, Tim 


On 05/07/2013, at 7:13 AM, Mathieu Dubois duboismathieu_g...@yahoo.fr wrote:

> Hello,
>
> I'm a beginner with PyTables.
>
> I wanted to store a database in an HDF5 file using PyTables. The DB is
> made of a CSV file (which contains the subject information) and a lot of
> images (I work on MRI, so the images are 3-dimensional float32 arrays of
> shape (121, 145, 121)). The relation is very simple: there are 3
> images per subject.
>
> My first idea was to create a class Subject like this:
>
> class Subject(tables.IsDescription):
>     # Subject information
>     Id    = tables.UInt16Col()
>     ...
>     Image = tables.Float32Col(shape=IMAGE_SIZE)
>
> And then proceed as in the tutorial (open a file, create a group and a
> table associated with the Subject class, and then append data to this table).
>
> Unfortunately I got an error when creating the table (even before
> inserting data):
>
> HDF5-DIAG: Error detected in HDF5 (1.8.4-patch1) thread 140612945950464:
>   #000: ../../../src/H5Ddeprec.c line 170 in H5Dcreate1(): unable to

> [snip]




Re: [Pytables-users] Chunk selection for optimized data access

2013-06-05 Thread Tim Burgess
On Jun 06, 2013, at 04:19 AM, Anthony Scopatz scop...@gmail.com wrote:

> Thanks Antonio and Tim! These are great. I think that one of these should
> definitely make it into the examples/ dir.
>
> Be Well
> Anthony

OK. I have put up a pull request with the code added:

https://github.com/PyTables/PyTables/pull/266

Cheers, Tim


Re: [Pytables-users] Chunk selection for optimized data access

2013-06-04 Thread Tim Burgess
I was playing around with in-memory HDF5 prior to the 3.0 release. Here's an example based
on what I was doing.

I looked over the docs and it does mention that there is an option to throw away the 'file'
rather than write it to disk. Not sure how to do that and can't actually think of a use case
where I would want to :-)

And be wary, it is H5FD_CORE.

On Jun 05, 2013, at 08:38 AM, Anthony Scopatz scop...@gmail.com wrote:

> I think that you want to set parameters.DRIVER to H5DF_CORE [1]. I haven't ever used this
> personally, but it would be great to have an example script, if someone wants to write one ;)

import numpy as np
import tables

CHUNKY = 30
CHUNKX = 8640

if __name__ == '__main__':

    # create dataset and add global attrs
    file_path = 'demofile_chunk%sx%d.h5' % (CHUNKY, CHUNKX)
    with tables.open_file(file_path, 'w', title='PyTables HDF5 In-memory example',
                          driver='H5FD_CORE') as h5f:

        # dummy some data
        lats = np.empty([4320])
        lons = np.empty([8640])

        # create some simple arrays
        lat_node = h5f.create_array('/', 'lat', lats, title='latitude')
        lon_node = h5f.create_array('/', 'lon', lons, title='longitude')

        # create a 365 x 4320 x 8640 CArray of 32bit float
        shape = (365, 4320, 8640)
        atom = tables.Float32Atom(dflt=np.nan)

        # chunk into daily slices and then further chunk days
        sst_node = h5f.create_carray(h5f.root, 'sst', atom, shape,
                                     chunkshape=(1, CHUNKY, CHUNKX))

        # dummy up an ndarray
        sst = np.empty([4320, 8640], dtype=np.float32)
        sst.fill(30.0)

        # write ndarray to a 2D plane in the HDF5
        sst_node[0] = sst
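If you did want to throw the file away rather than write it, the relevant knob looks to be
the CORE driver's backing store. A minimal sketch, assuming open_file's
driver_core_backing_store parameter behaves as described in the 3.0 docs (untested here):

import numpy as np
import tables

# assumption: driver_core_backing_store=0 tells the H5FD_CORE driver to
# discard the in-memory file contents on close instead of writing them out
with tables.open_file('throwaway.h5', 'w', driver='H5FD_CORE',
                      driver_core_backing_store=0) as h5f:
    node = h5f.create_array('/', 'x', np.arange(10), title='scratch data')
    print(node[:])
# no 'throwaway.h5' should appear on disk afterwards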


Re: [Pytables-users] Chunk selection for optimized data access

2013-06-03 Thread Tim Burgess
My thoughts are:

- try it without any compression. Assuming 32 bit floats, your monthly 5760 x 2880 is only
  about 65MB. Uncompressed data may perform well and at the least it will give you a baseline
  to work from - and will help if you are investigating IO tuning.

- I have found with CArray that the auto chunksize works fairly well. Experiment with that
  chunksize and with some chunksizes that you think are more appropriate (maybe temporal
  rather than spatial in your case); a rough sketch follows after the quoted message below.

On Jun 03, 2013, at 10:45 PM, Andreas Hilboll li...@hilboll.de wrote:

> On 03.06.2013 14:43, Andreas Hilboll wrote:
>> Hi,
>>
>> I'm storing large datasets (5760 x 2880 x ~150) in a compressed EArray
>> (the last dimension represents time, and once per month there'll be one
>> more 5760x2880 array to add to the end).
>>
>> Now, extracting timeseries at one index location is slow; e.g., for four
>> indices, it takes several seconds:
>>
>> In [19]: idx = ((5000, 600, 800, 900), (1000, 2000, 500, 1))
>> In [20]: %time AA = np.vstack([_a[i,j] for i,j in zip(*idx)])
>> CPU times: user 4.31 s, sys: 0.07 s, total: 4.38 s
>> Wall time: 7.17 s
>>
>> I have the feeling that this performance could be improved, but I'm not
>> sure about how to properly use the `chunkshape` parameter in my case.
>>
>> Any help is greatly appreciated :)
>>
>> Cheers, Andreas.
>
> PS: If I could get significant performance gains by not using an EArray
> and therefore re-creating the whole database each month, then this would
> also be an option.
>
> -- Andreas.
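To make the second point concrete, here is a minimal sketch of the sort of experiment I mean;
the file name and the (16, 16, 150) chunkshape are made-up placeholders for illustration,
not a recommendation.

import numpy as np
import tables

shape = (5760, 2880, 150)   # lat x lon x time, as in the EArray described above
atom = tables.Float32Atom()

with tables.open_file('chunkshape_test.h5', 'w') as h5f:
    # let PyTables pick the chunkshape automatically
    auto_node = h5f.create_carray(h5f.root, 'auto', atom, shape)
    print('auto chunkshape: %s' % (auto_node.chunkshape,))

    # chunks that span the whole time axis, so one pixel's full timeseries
    # lives in a single chunk
    temporal_node = h5f.create_carray(h5f.root, 'temporal', atom, shape,
                                      chunkshape=(16, 16, 150))
    ts = temporal_node[600, 1000, :]   # reads exactly one chunk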


Re: [Pytables-users] Chunk selection for optimized data access

2013-06-03 Thread Tim Burgess
and for the record... yes, it should be much faster than 4 seconds.

>>> foo = np.empty([5760, 2880, 150], dtype=np.float32)
>>> idx = ((5000, 600, 800, 900), (1000, 2000, 500, 1))
>>> import time
>>> t0 = time.time(); bar = np.vstack([foo[i,j] for i,j in zip(*idx)]); t1 = time.time(); print t1-t0
0.000144004821777

On Jun 03, 2013, at 10:45 PM, Andreas Hilboll li...@hilboll.de wrote:

> [snip - same message as quoted in the previous post]


Re: [Pytables-users] Writing to CArray

2013-03-11 Thread Tim Burgess
>> The netCDF library gives me a masked array so I have to explicitly transform that into
>> a regular numpy array.
>
> Ahh interesting. Depending on the netCDF version the file was made with, you should be able
> to read the file directly from PyTables. You could thus directly get a normal numpy array.
> This *should* be possible, but I have never tried it ;)

I think the netCDF3 functionality has been taken out or at least deprecated
(https://github.com/PyTables/PyTables/issues/68). Using the python-netCDF4 module allows me
to pull from pretty much any netCDF file, and the inherent masking is sometimes very useful
where the dataset is smaller and I can live with the lower performance of masks. I've looked
under the covers and have seen that the ma masked implementation is all pure Python, so there
is a performance drawback. I'm not up to speed yet on where the numpy.na masking
implementation is (started a new job here).

>> I tried to do an implementation in memory (except for the final write) and found that I
>> have about 2GB of indices when I extract the quality indices. Simply using those indexes,
>> memory usage grows to over 64GB and I eventually run out of memory and start churning away
>> in swap.
>>
>> For the moment, I have pulled down the latest git master and am using the new in-memory
>> HDF feature. This seems to give me better performance and is code-wise pretty simple, so
>> for the moment it's good enough.
>
> Awesome! I am glad that this is working for you.

Yes - appears to work great!
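For reference, the explicit transform mentioned above is just np.ma.filled(); a minimal
sketch, where the file and variable names are placeholders:

import numpy as np
from netCDF4 import Dataset

# hypothetical file and variable names, just to show the masked -> plain conversion
ds = Dataset('some_granule.nc')
sst_masked = ds.variables['sea_surface_temperature'][:]   # comes back as numpy.ma.MaskedArray
ds.close()

# replace masked points with NaN and drop the (pure Python) mask machinery
sst = np.ma.filled(sst_masked, fill_value=np.nan).astype(np.float32)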


[Pytables-users] Writing to CArray

2013-03-06 Thread Tim Burgess
I'm producing a large chunked HDF5 using CArray and want to clarify that the performance I'm
getting is what would normally be expected.

The source data is a large annual satellite dataset - 365 days x 4320 latitude by 8640
longitude of 32bit floats. I'm only interested in pixels of a certain quality, so I am
iterating over the source data (which is in daily files) and then determining the indices of
all quality pixels in that day. There are usually about 2 million quality pixels in a day.

I then set the equivalent CArray locations to the value of the quality pixels. As you can see
in the code below, the source numpy array is a 1 x 4320 x 8640. So for addressing the CArray,
I simply take the first index and set it to the current day to map indices into the
365 x 4320 x 8640 CArray.

I've tried a couple of different chunkshapes. As I will be reading the HDF sequentially day
by day and as the data comes from a polar orbit, I'm using a 1 x 1080 x 240 chunk to try and
optimize for chunks that will have no data (and therefore reduce the total filesize). You can
see an image of an example day at
http://data.nodc.noaa.gov/pathfinder/Version5.2/browse_images/2011/sea_surface_temperature/20110101001735-NODC-L3C_GHRSST-SSTskin-AVHRR_Pathfinder-PFV5.2_NOAA19_G_2011001_night-v02.0-fv01.0-sea_surface_temperature.png

To produce a day takes about 2.5 minutes on a Linux (Ubuntu 12.04) machine with two SSDs in
RAID 0. The system has 64GB of RAM but I don't think memory is a constraint here. Looking at
a profile, most of that 2.5 minutes is spent in _g_writeCoords in tables.hdf5Extension.Array.

Here's the pertinent code:

for year in range(2011, 2012):

    # create dataset and add global attrs
    annualfile_path = '%sPF4km/V5.2/hdf/annual/PF52-%d-c1080x240-test.h5' % (crwdir, year)
    print 'Creating ' + annualfile_path

    with tables.openFile(annualfile_path, 'w', title=('Pathfinder V5.2 %d' % year)) as h5f:

        # write lat lons
        lat_node = h5f.createArray('/', 'lat', lats, title='latitude')
        lon_node = h5f.createArray('/', 'lon', lons, title='longitude')

        # glob all the region summaries in a year
        files = [glob.glob('%sPF4km/V5.2/%d/*night*' % (crwdir, year))[0]]
        print 'Found %d days' % len(files)
        files.sort()

        # create a 365 x 4320 x 8640 array
        shape = (NUMDAYS, 4320, 8640)
        atom = tables.Float32Atom(dflt=np.nan)

        # we chunk into daily slices and then further chunk days
        sst_node = h5f.createCArray(h5f.root, 'sst', atom, shape, chunkshape=(1, 1080, 240))

        for filename in files:

            # get day
            day = int(filename[-25:-22])
            print 'Processing %d day %d' % (year, day)

            ds = Dataset(filename)
            kelvin64 = ds.variables['sea_surface_temperature'][:]
            qual = ds.variables['quality_level'][:]
            ds.close()

            # convert sst to single precision with nan as mask
            kelvin32 = np.ma.filled(kelvin64, fill_value=np.nan).astype(np.float32)
            sst = kelvin32 - 273.15

            # find all quality 4 locations
            qual_indices = np.where(np.ma.filled(qual) >= 4)
            print 'Found %d quality pixels' % len(qual_indices[0])

            # qual_indices is actually a 3D index, so set the first index
            # to match the current day and write to sst_node
            qual_indices[0].flags.writeable = True
            qual_indices[0][:] = day
            sst_node[qual_indices] = sst[0, qual_indices[1], qual_indices[2]]

            # sanity check that max values are the same in sst_node as source sst data
            print 'sst max %4.1f node max %4.1f' % (np.nanmax(sst[0, qual_indices[1], qual_indices[2]]),
                                                    np.nanmax(sst_node[day]))

Would value any comments on this :-)

Thanks,
Tim Burgess