I'm producing a large chunked HDF5 file using a CArray and want to confirm that the performance I'm seeing is what would normally be expected.
The source data is a large annual satellite dataset: 365 days x 4320 (latitude) x 8640 (longitude) of 32-bit floats. I'm only interested in pixels of a certain quality, so I iterate over the source data (which is in daily files) and determine the indices of all quality pixels in that day. There are usually about 2 million quality pixels in a day.
I then set the equivalent CArray locations to the values of those quality pixels. As you can see in the code below, the source numpy array is 1 x 4320 x 8640, so to address the CArray I simply set the first index array to the current day, which maps the indices into the 365 x 4320 x 8640 CArray.
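The index mapping can be sketched with plain numpy (scaled-down shapes for illustration; `3` stands in for the current day, and building a fresh index array avoids mutating the read-only output of np.where):

```python
import numpy as np

# Scaled-down stand-in for one day's quality array (1 x 5 x 5 instead of 1 x 4320 x 8640)
qual = np.array([[[5 if (i + j) % 3 == 0 else 1 for j in range(5)]
                  for i in range(5)]])

# np.where returns a tuple of three index arrays for a 3D array;
# the first array is all zeros because the day axis has length 1.
idx = np.where(qual >= 4)

# Replace the leading index array with the current day to address the annual array.
day = 3
full_idx = (np.full_like(idx[0], day), idx[1], idx[2])
```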
I've tried a couple of different chunkshapes. As I will be reading the HDF5 file sequentially day by day, and as the data comes from a polar orbit, I'm using a 1 x 1080 x 240 chunkshape to try to maximize the number of chunks that contain no data (and therefore reduce the total file size). You can see an image of an example day at
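For reference, the chunk arithmetic works out as follows (a quick sketch: each 1 x 1080 x 240 float32 chunk is about 1 MB, and a single day slice spans 144 chunks):

```python
import numpy as np

chunkshape = (1, 1080, 240)

# Bytes per chunk: 1 * 1080 * 240 elements * 4 bytes per float32
chunk_bytes = int(np.prod(chunkshape)) * np.dtype(np.float32).itemsize

# Chunks touched by one full day slice of the 365 x 4320 x 8640 array
chunks_per_day = (4320 // 1080) * (8640 // 240)
```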
Producing a single day takes about 2.5 minutes on a Linux (Ubuntu 12.04) machine with two SSDs in RAID 0. The system has 64 GB of RAM, but I don't think memory is a constraint here.
Looking at a profile, most of that 2.5 minutes is spent in _g_writeCoords in tables.hdf5Extension.Array.
Here's the pertinent code:
for year in range(2011, 2012):

    # create dataset and add global attrs
    annualfile_path = '%sPF4km/V5.2/hdf/annual/PF52-%d-c1080x240-test.h5' % (crwdir, year)
    print 'Creating ' + annualfile_path

    with tables.openFile(annualfile_path, 'w', title=('Pathfinder V5.2 %d' % year)) as h5f:

        # write lat lons
        lat_node = h5f.createArray('/', 'lat', lats, title='latitude')
        lon_node = h5f.createArray('/', 'lon', lons, title='longitude')

        # glob all the region summaries in a year
        files = [glob.glob('%sPF4km/V5.2/%d/*night*' % (crwdir, year))[0]]
        print 'Found %d days' % len(files)
        files.sort()

        # create a 365 x 4320 x 8640 array
        shape = (NUMDAYS, 4320, 8640)
        atom = tables.Float32Atom(dflt=np.nan)

        # we chunk into daily slices and then further chunk days
        sst_node = h5f.createCArray(h5f.root, 'sst', atom, shape,
                                    chunkshape=(1, 1080, 240))

        for filename in files:
            # get day
            day = int(filename[-25:-22])
            print 'Processing %d day %d' % (year, day)

            ds = Dataset(filename)
            kelvin64 = ds.variables['sea_surface_temperature'][:]
            qual = ds.variables['quality_level'][:]
            ds.close()

            # convert sst to single precision with nan as mask
            kelvin32 = np.ma.filled(kelvin64, fill_value=np.nan).astype(np.float32)
            sst = kelvin32 - 273.15

            # find all quality >= 4 locations
            qual_indices = np.where(np.ma.filled(qual) >= 4)
            print 'Found %d quality pixels' % len(qual_indices[0])

            # qual_indices is actually a 3D index, so set the first index array
            # to the current day and write to sst_node
            qual_indices[0].flags.writeable = True
            qual_indices[0][:] = day
            sst_node[qual_indices] = sst[0, qual_indices[1], qual_indices[2]]

            # sanity check that max values are the same in sst_node as source sst data
            print 'sst max %4.1f node max %4.1f' % (
                np.nanmax(sst[0, qual_indices[1], qual_indices[2]]),
                np.nanmax(sst_node[day]))
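As a point of comparison, the scattered per-pixel update above could instead assemble the full day slice in memory and write it in one call; here's the numpy part sketched with scaled-down shapes (the commented-out line marks where the single PyTables slice write would go):

```python
import numpy as np

# Scaled-down stand-ins for one day's data (real shapes are 1 x 4320 x 8640)
sst = np.arange(1 * 6 * 8, dtype=np.float32).reshape(1, 6, 8)
qual = (np.arange(1 * 6 * 8).reshape(1, 6, 8) % 5) + 1  # quality levels 1..5

# Build the complete day slice: NaN everywhere, data only at quality pixels
day_slice = np.full(sst.shape[1:], np.nan, dtype=np.float32)
mask = qual[0] >= 4
day_slice[mask] = sst[0][mask]

# sst_node[day] = day_slice  # one contiguous slice write instead of ~2M point updates
```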
I'd value any comments on this :-)
Thanks,
Tim Burgess
_______________________________________________
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users