On 09/24/2014 05:19 AM, questions anon wrote:
Ok, I am continuing to get stuck. I think I was asking the wrong question
so I have posted the entire script (see below).
What I really want to do is find the daily maximum for a dataset (e.g.
Temperature) that is contained in monthly netcdf files where the data are
separated by hour.
The major steps are:
1. Open the monthly netcdf files and use the timestamp to extract the hourly data for the first day.
2. Append the data for each hour of that day to a list, concatenate, find the max, and plot it.
3. Loop through and do the same for the following day.

I can do some of the steps separately, but I run into trouble working out how to loop through each hour, separate the data by day, and then repeat all the steps for the following day.

Any feedback will be greatly appreciated!



This is NOT the whole program. You don't seem to define ncvariablename, Dataset, np, Basemap, plt, and probably others.


oneday=[]
all_the_dates=[]
onedateperday=[]


#open folders and open relevant netcdf files that are monthly files containing hourly data across a region
for (path, dirs, files) in os.walk(MainFolder):
         for ncfile in files:

If there is more than one file, or more than one directory, there is no guarantee that this will give you the files in the order you want; you may need to do some sorting to make it work out. But on further study, you seem to assume you'll read everything in from all the files, and then process it. Have you estimated how big that might get, and whether you'll have enough memory to make it viable?
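For example, a deterministic walk can be had by sorting both lists that os.walk() hands you (sorting dirs in place also controls the descent order) -- a minimal sketch:

```python
import os

def walk_sorted(root):
    # os.walk() makes no promise about ordering; sort dirs in place so
    # the descent order is deterministic, and sort files before yielding.
    for path, dirs, files in os.walk(root):
        dirs.sort()
        for name in sorted(files):
            yield path, name
```
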

             if ncfile.endswith(ncvariablename+'.nc'):
                 print "dealing with ncfile:", os.path.join(path, ncfile)
                 ncfile=os.path.join(path,ncfile)
                 ncfile=Dataset(ncfile, 'r+', 'NETCDF4')
                 variable=ncfile.variables[ncvariablename][:,:,:]
                 TIME=ncfile.variables['time'][:]

You've given the name ncfile three different meanings in four lines: a bare filename, a full path, and an open Dataset. The close() happens to land on the Dataset here, but that kind of name reuse is fragile and confusing. Variables are cheap, use lots of them, and give them good names. In this particular case, maybe using a with statement would make sense.
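The pattern looks like this -- one name per meaning, and a with-block so the close happens automatically. Shown with a plain text file so it runs anywhere; recent versions of netCDF4.Dataset support the same protocol (with Dataset(path) as nc: ...), but check your version:

```python
def read_lines(path):
    # 'path' is the path, 'fh' is the open handle, 'lines' is the data.
    # The with-block closes fh even if an exception is raised inside it.
    with open(path) as fh:
        lines = fh.read().splitlines()
    return lines
```
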

                 ncfile.close()

                 #combine variable and time so I can make calculations based on hours or days
                 for v, time in zip(variable, TIME):

                     cdftime=utime('seconds since 1970-01-01 00:00:00')
                     ncfiletime=cdftime.num2date(time)
                     timestr=str(ncfiletime)
                     utc_dt = dt.strptime(timestr, '%Y-%m-%d %H:%M:%S')
                     au_tz = pytz.timezone('Australia/Sydney')
                     local_dt = utc_dt.replace(tzinfo=pytz.utc).astimezone(au_tz)
                     strp_local=local_dt.strftime('%Y-%m-%d_%H') #strips time down to date and hour
                     local_date=local_dt.strftime('%Y-%m-%d') #strips time down to just date

                     all_the_dates.append(local_date)

#make a list that strips down the dates to only have one date per day (rather than one for each hour)
onedateperday = sorted(set(all_the_dates))

#loop through each day and combine data (v) where the hours occur on the same day
for days in onedateperday:
     if strp_local.startswith(days):
         oneday.append(v)


And just where does v come from? You're no longer in the loop that says "for v, time in..."
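One way to keep each hourly array paired with its date is to group them as you read, inside the loop that produces them, rather than in a separate loop afterwards. A sketch, where records stands in for the (local_date, v) pairs your reading loop produces:

```python
from collections import defaultdict

def group_by_day(records):
    # records: iterable of (date_string, hourly_data) pairs.
    # Appending inside the loop keeps each v attached to its own date,
    # instead of only the last v surviving the loop.
    per_day = defaultdict(list)
    for local_date, v in records:
        per_day[local_date].append(v)
    return dict(per_day)
```
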

big_array=np.ma.dstack(oneday) #concatenate data
v_DailyMax=big_array.max(axis=2) # find max

#then go on to plot v_DailyMax for each day
map = Basemap(projection='merc',llcrnrlat=-40,urcrnrlat=-33,
         llcrnrlon=139.0,urcrnrlon=151.0,lat_ts=0,resolution='i')
map.drawcoastlines()
map.drawstates()
map.readshapefile(shapefile1, 'REGIONS')
x,y=map(*np.meshgrid(LON,LAT))
plottitle=ncvariablename+'v_DailyMax'+days
cmap=plt.cm.jet
CS = map.contourf(x,y,v_DailyMax, 15, cmap=cmap)
l,b,w,h =0.1,0.1,0.8,0.8
cax = plt.axes([l+w+0.025, b, 0.025, h])
plt.colorbar(CS,cax=cax, drawedges=True)
plt.savefig((os.path.join(OutputFolder, plottitle+'.png')))
plt.show()
plt.close()



It looks to me like you've jumbled together a bunch of different code fragments. Have you learned about writing functions yet? Or generators? By making everything global, you're masking a lot of the mistakes you've made. For example, v has a value, but it's just the data from the last record of the last file.

I think you need to back off and design this, without worrying much at first about how to code it. It's generally a bad idea to try to collect all data before processing it and winnowing it down. If you know something about the filenaming (or directory naming), and can assume that all data from one day will be processed before encountering the next day, you can greatly simplify things. Likewise if one hour is complete before the next begins.
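If you can rely on that ordering, itertools.groupby turns the whole thing into a small generator that never holds more than one day's data. A sketch, assuming the hourly records arrive as (date, value) pairs already sorted by date:

```python
from itertools import groupby

def daily_max(hourly):
    # hourly: iterable of (date_string, value) pairs, sorted by date.
    # groupby emits one group per consecutive run of equal dates, so only
    # one day's worth of values is ever in memory at a time.
    for day, group in groupby(hourly, key=lambda record: record[0]):
        yield day, max(value for _, value in group)
```
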

Is there one file per day? Is there one record per hour and are they in order? If some such assumptions can be made, then you might be able to factor the problem down to a set of less grandiose functions.

Whoever wrote Dataset() has produced some data from a single file that you can work with. What does that data look like, and how much of it are you really needing to save past the opening of the next one?

Sometimes when the data is large and not sorted, it pays off to make two or more passes through it. In the extreme, you might take one pass to make a sorted list of all the hours involved in all the files. Then for each of those hours, you'd take an additional pass looking for data related to that particular hour. Of course if you're talking years, you'll be opening each file many thousands of times. So if you know ANYTHING about the sorting of the data, you can make that more efficient.
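The two-pass idea can be sketched with a plain list standing in for re-reading the files; with real files, pass two would re-open and re-scan them, which is why knowing anything about the sort order pays off:

```python
def distinct_days(records):
    # Pass one: collect just the sorted set of day keys, nothing else.
    return sorted({day for day, _ in records})

def values_for_day(records, wanted):
    # Pass two: pull only the values belonging to one particular day.
    return [value for day, value in records if day == wanted]
```
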

I can't help you at all with numpy (presumably what np stands for), or the plotting stuff.



--
DaveA
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
