I'm producing a large chunked HDF5 file using a CArray and want to confirm that the performance I'm seeing is what would normally be expected.
The source data is a large annual satellite dataset: 365 days x 4320 (latitude) x 8640 (longitude) of 32-bit floats. I'm only interested in pixels of a certain quality, so I iterate over the source data (which is in daily files) and determine the indices of all quality pixels in that day. There are usually about 2 million quality pixels in a day.
I then set the equivalent CArray locations to the values of those quality pixels. As you can see in the code below, the source numpy array is 1 x 4320 x 8640, so to address the CArray I simply set the first index array to the current day, which maps the indices into the 365 x 4320 x 8640 CArray.
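The index mapping can be sketched with plain numpy (scaled-down shapes for illustration; `3` stands in for the current day, and building a fresh index array avoids mutating the read-only output of np.where):

```python
import numpy as np

# Scaled-down stand-in for one day's quality array (1 x 5 x 5 instead of 1 x 4320 x 8640)
qual = np.array([[[5 if (i + j) % 3 == 0 else 1 for j in range(5)]
                  for i in range(5)]])

# np.where returns a tuple of three index arrays for a 3D array;
# the first array is all zeros because the day axis has length 1.
idx = np.where(qual >= 4)

# Replace the leading index array with the current day to address the annual array.
day = 3
full_idx = (np.full_like(idx[0], day), idx[1], idx[2])
```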
I've tried a couple of different chunkshapes. As I will be reading the HDF5 file sequentially day by day, and as the data comes from a polar orbit, I'm using a 1 x 1080 x 240 chunkshape to try to maximize the number of chunks that contain no data (and therefore reduce the total file size). You can see an image of an example day at
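For reference, the chunk arithmetic works out as follows (a quick sketch: each 1 x 1080 x 240 float32 chunk is about 1 MB, and a single day slice spans 144 chunks):

```python
import numpy as np

chunkshape = (1, 1080, 240)

# Bytes per chunk: 1 * 1080 * 240 elements * 4 bytes per float32
chunk_bytes = int(np.prod(chunkshape)) * np.dtype(np.float32).itemsize

# Chunks touched by one full day slice of the 365 x 4320 x 8640 array
chunks_per_day = (4320 // 1080) * (8640 // 240)
```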
Producing a single day takes about 2.5 minutes on a Linux (Ubuntu 12.04) machine with two SSDs in RAID 0. The system has 64 GB of RAM, but I don't think memory is a constraint here.
Looking at a profile, most of that 2.5 minutes is spent in _g_writeCoords in tables.hdf5Extension.Array.
Here's the pertinent code:
for year in range(2011, 2012):

    # create dataset and add global attrs
    annualfile_path = '%sPF4km/V5.2/hdf/annual/PF52-%d-c1080x240-test.h5' % (crwdir, year)
    print 'Creating ' + annualfile_path

    with tables.openFile(annualfile_path, 'w', title=('Pathfinder V5.2 %d' % year)) as h5f:

        # write lat lons
        lat_node = h5f.createArray('/', 'lat', lats, title='latitude')
        lon_node = h5f.createArray('/', 'lon', lons, title='longitude')

        # glob all the region summaries in a year
        files = [glob.glob('%sPF4km/V5.2/%d/*night*' % (crwdir, year))[0]]
        print 'Found %d days' % len(files)
        files.sort()

        # create a 365 x 4320 x 8640 array
        shape = (NUMDAYS, 4320, 8640)
        atom = tables.Float32Atom(dflt=np.nan)

        # we chunk into daily slices and then further chunk days
        sst_node = h5f.createCArray(h5f.root, 'sst', atom, shape,
                                    chunkshape=(1, 1080, 240))

        for filename in files:
            # get day
            day = int(filename[-25:-22])
            print 'Processing %d day %d' % (year, day)

            ds = Dataset(filename)
            kelvin64 = ds.variables['sea_surface_temperature'][:]
            qual = ds.variables['quality_level'][:]
            ds.close()

            # convert sst to single precision with nan as mask
            kelvin32 = np.ma.filled(kelvin64, fill_value=np.nan).astype(np.float32)
            sst = kelvin32 - 273.15

            # find all quality >= 4 locations
            qual_indices = np.where(np.ma.filled(qual) >= 4)
            print 'Found %d quality pixels' % len(qual_indices[0])

            # qual_indices is actually a 3D index, so set the first index array
            # to the current day and write to sst_node
            qual_indices[0].flags.writeable = True
            qual_indices[0][:] = day
            sst_node[qual_indices] = sst[0, qual_indices[1], qual_indices[2]]

            # sanity check that max values are the same in sst_node as source sst data
            print 'sst max %4.1f node max %4.1f' % (
                np.nanmax(sst[0, qual_indices[1], qual_indices[2]]),
                np.nanmax(sst_node[day]))
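As a point of comparison, the scattered per-pixel update above could instead assemble the full day slice in memory and write it in one call; here's the numpy part sketched with scaled-down shapes (the commented-out line marks where the single PyTables slice write would go):

```python
import numpy as np

# Scaled-down stand-ins for one day's data (real shapes are 1 x 4320 x 8640)
sst = np.arange(1 * 6 * 8, dtype=np.float32).reshape(1, 6, 8)
qual = (np.arange(1 * 6 * 8).reshape(1, 6, 8) % 5) + 1  # quality levels 1..5

# Build the complete day slice: NaN everywhere, data only at quality pixels
day_slice = np.full(sst.shape[1:], np.nan, dtype=np.float32)
mask = qual[0] >= 4
day_slice[mask] = sst[0][mask]

# sst_node[day] = day_slice  # one contiguous slice write instead of ~2M point updates
```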
I'd value any comments on this :-)
Thanks,
Tim Burgess
_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users