I'm producing a large chunked HDF5 file using a CArray and want to check whether the performance I'm getting is what would normally be expected.

The source data is a large annual satellite dataset: 365 days x 4320 latitude x 8640 longitude of 32-bit floats. I'm only interested in pixels of a certain quality, so I iterate over the source data (which is in daily files) and determine the indices of all quality pixels in each day. There are usually about 2 million quality pixels in a day (roughly 5% of the 4320 x 8640 = ~37 million pixels per day).

I then set the equivalent CArray locations to the values of the quality pixels. As you can see in the code below, the source numpy array is 1 x 4320 x 8640, so to address the CArray I simply set the first index to the current day, mapping the indices into the 365 x 4320 x 8640 CArray.

I've tried a couple of different chunkshapes. As I will be reading the HDF5 sequentially day by day, and as the data comes from a polar orbit, I'm using a 1 x 1080 x 240 chunk to try to optimize for chunks that will contain no data (and therefore reduce the total file size). You can see an image of an example day at

http://data.nodc.noaa.gov/pathfinder/Version5.2/browse_images/2011/sea_surface_temperature/20110101001735-NODC-L3C_GHRSST-SSTskin-AVHRR_Pathfinder-PFV5.2_NOAA19_G_2011001_night-v02.0-fv01.0-sea_surface_temperature.png     
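
For reference, here's the back-of-envelope arithmetic for that chunkshape (just sanity-checking my own numbers):

    # each 4320 x 8640 day slice tiles into 1080 x 240 chunks
    chunks_per_day = (4320 // 1080) * (8640 // 240)   # 4 * 36 = 144 chunks
    chunk_bytes = 1 * 1080 * 240 * 4                  # float32: 1,036,800 bytes, ~1 MiB
    print '%d chunks/day, %d bytes/chunk' % (chunks_per_day, chunk_bytes)

So each day is 144 chunks of roughly 1 MiB, and any chunk with no quality pixels should never be allocated on disk.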

Producing a day takes about 2.5 minutes on a Linux (Ubuntu 12.04) machine with two SSDs in RAID 0. The system has 64GB of RAM, but I don't think memory is a constraint here. Looking at a profile, most of that 2.5 minutes is spent in _g_writeCoords in tables.hdf5Extension.Array, i.e. the coordinate-selection write path (a slab-write alternative is sketched after the code below).

Here's the pertinent code:

    import glob

    import numpy as np
    import tables
    from netCDF4 import Dataset  # the daily GHRSST files are netCDF

    # crwdir, NUMDAYS, lats and lons are defined earlier in the script

    for year in range(2011, 2012):
    
        # create dataset and add global attrs
        annualfile_path = '%sPF4km/V5.2/hdf/annual/PF52-%d-c1080x240-test.h5' % (crwdir, year)
        print 'Creating ' + annualfile_path
            
 
        with tables.openFile(annualfile_path, 'w', title=('Pathfinder V5.2 %d' % year)) as h5f:
             
            # write lat lons
            lat_node = h5f.createArray('/', 'lat', lats, title='latitude')
            lon_node = h5f.createArray('/', 'lon', lons, title='longitude')
            
            
            # glob the region summaries in a year (just the first night file for this timing test)
            files = [glob.glob('%sPF4km/V5.2/%d/*night*' % (crwdir, year))[0]]
            print 'Found %d days' % len(files)
            files.sort()
            
            
            # create a 365 x 4320 x 8640 array
            shape = (NUMDAYS, 4320, 8640)
            atom = tables.Float32Atom(dflt=np.nan)
            # chunk into daily slices, then tile each day slice into 1080 x 240 chunks
            sst_node = h5f.createCArray(h5f.root, 'sst', atom, shape, chunkshape=(1, 1080, 240))
            
            
            for filename in files:
                
                # get day
                day = int(filename[-25:-22])
                print 'Processing %d day %d' % (year, day)
                
                ds = Dataset(filename)
                kelvin64 = ds.variables['sea_surface_temperature'][:]
                qual = ds.variables['quality_level'][:]
                ds.close()
                # convert sst to single precision with nan as mask
                kelvin32 = np.ma.filled(kelvin64, fill_value=np.nan).astype(np.float32)
                sst = kelvin32 - 273.15
                
                # find all quality >= 4 locations
                qual_indices = np.where(np.ma.filled(qual) >= 4)
                print 'Found %d quality pixels' % len(qual_indices[0])

                # qual_indices is a 3-tuple of index arrays; overwrite the first
                # (day) index with the current day and write to sst_node
                qual_indices[0].flags.writeable = True
                qual_indices[0][:] = day

                sst_node[qual_indices] = sst[0,qual_indices[1],qual_indices[2]]

                # sanity check that max values are the same in sst_node as source sst data
                print 'sst max %4.1f node max %4.1f' % (np.nanmax(sst[0,qual_indices[1],qual_indices[2]]), np.nanmax(sst_node[day]))

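Since the profile points at the coordinate-selection write path, one alternative I'm considering is building a full NaN-filled day slab in memory and writing it with a single slice assignment, so each write touches whole chunks rather than ~2 million scattered coordinates. A minimal sketch, using the same sst and qual arrays as above:

    # boolean mask of quality pixels for this day
    good = np.ma.filled(qual)[0] >= 4

    # full-day slab: NaN everywhere except the quality pixels
    slab = np.empty((4320, 8640), dtype=np.float32)
    slab.fill(np.nan)
    slab[good] = sst[0][good]

    # one contiguous slab write instead of a point-selection write
    sst_node[day, :, :] = slab

The slab is about 142 MB, so no problem with 64GB of RAM, though it does mean writing every chunk for the day, so the file-size benefit of skipping empty chunks would be lost unless compression is enabled.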
I'd value any comments on this :-)
Thanks,

Tim Burgess
