Re: [CF-metadata] Usage of histogram_of_X_over_Z

2016-10-27 Thread Jonathan Gregory
Dear Martin

> In broad usage, I have the impression that a "histogram" can be expressed as 
> either a count or a percentage, so we should be explicit in the convention if 
> we want a narrower definition here. A narrower definition is probably needed, 
> as there would otherwise be no way of distinguishing between the two.

I agree with that but the idea is that a standard name of histogram would be
for a count, while probability would be for a fraction. The latter could be
0-1 or 0-100% - they are dimensionally equivalent but different units. We
could clarify that in the guidelines.
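
A minimal sketch of this count/fraction distinction, assuming made-up air temperature samples and bin edges (none of the numbers come from a real dataset). The same binned data is expressed as a count, as a fraction in the range 0-1 and as a percentage:

```python
# Hypothetical sample of air temperatures (K); the values and bin edges are
# made up for illustration only.
samples = [271.2, 273.9, 274.1, 276.5, 278.0, 279.3, 281.7, 284.4]
edges = [270.0, 275.0, 280.0, 285.0]  # three bins, each 5 K wide


def histogram_counts(values, bin_edges):
    """Count the values falling in each half-open bin [lo, hi)."""
    counts = [0] * (len(bin_edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if bin_edges[i] <= v < bin_edges[i + 1]:
                counts[i] += 1
                break
    return counts


counts = histogram_counts(samples, edges)   # "histogram": a count per bin
total = sum(counts)
fraction = [c / total for c in counts]      # "probability" with units "1"
percent = [100.0 * f for f in fraction]     # "probability" with units "%"

print(counts)
print(fraction)
print(percent)
```

The fraction and percentage differ only by a factor of 100, which is why they can share one name and be told apart by units, whereas the count needs the total number of samples to be converted to either.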

> There are two further CMIP variables, both of which are bi-variate 
> distributions, with bins of spectral bands and cloud top height ranges, which 
> I'd like to bring into the discussion, but it might be useful to transfer the 
> conclusions of the exchange so far into a ticket first. I think the two 
> additional variables could be covered by a simple extension to 
> "probability_density_function_of_X_and_Y" ... though you might want to insert 
> "joint_" at the beginning of the term.

OK, that's interesting. I agree that it would fit.

Best wishes

Jonathan
> 
> Dear Martin and Alejandro  (following off-list discussions)
> 
> > The CF definitions say ''"histogram_of_X[_over_Z]" means histogram (i.e. 
> > number of counts for each range of X) of variations (over Z) of X.'
> 
> Yes, that's in the guidelines for construction of standard names, and there
> are only two of them at present, as you say. The simplest case is when you
> have some quantity Q depending on only one dimension, Q(Z). Then the histogram
> H(Q) is the number of values of Q which fall into each interval of Q,
> considering variation over Z.  In general there could be more than one
> dimension retained, and more than one removed. If the original field was
> Q(P,Y,Z,T), we might construct a histogram H(Q,Z,T), for instance, containing
> the frequencies of values of Q falling into joint intervals of Q, Z and T, for
> variation over P and Y. Following the guideline above, we would call this a
> histogram of Q over P and Y, I think.
> 
> It is not necessary to indicate in the standard name the dimensions which
> the histogram depends on (Z and T in my example) because the coordinate
> variables (of Z and T) make that clear. Martin suggests that by this argument
> we could also omit Q from the standard name, and just call it a histogram
> (or frequency distribution) rather than a histogram of Q, where Q is air
> temperature, precipitation amount, backscattering ratio, etc. I think there
> are two reasons why we include Q in the standard name,
> 
> * I think a histogram of air temperature is not the same geophysical quantity
> as a histogram of precipitation amount, for instance, so they should be
> distinguished by standard name.
> 
> * Although histograms are pure numbers, and so are probabilities, probability
> densities are not. Histograms, probability distributions and probability
> density functions are all related ways of expressing the same information.
> In the guidelines, we foresee that we might need names for all of them (though
> so far we have only histograms) and it would make sense to give them consistent
> names. The probability density function of air temperature has units of K-1,
> and of precipitation amount kg-1 m2, for instance. Because they have different
> canonical units, they must have different standard names, so Q needs to be
> included in the standard name.
> 
> Cell methods describe how the values represent variation within the cells.
> The transformation from the values of a quantity to a histogram of the
> quantity makes the original quantity into a dimension. This seems more of
> a radical transformation than computing a mean or a standard deviation, which
> doesn't change the dimensions of the variable, but just reduces their size
> (to unity if completely collapsed). A frequency distribution of Q is
> regarded as a different geophysical quantity from Q itself, so we have not
> used cell methods to describe the relationship. Of course, this is a bit
> arbitrary (like everything else in the CF convention!).
> 
> I agree with Martin that we could omit the "over" part of the standard name for
> histograms, probabilities and probability densities. It is useful to retain the
> collapsed dimensions as size-1 dimensions, so that their original range can
> be recorded. They could be assigned cell_method of "sum", the default for
> extensive quantities, because the histogram applies to their entire range.
> The same applies to the variable which has been histogrammed and is now a
> dimension; the histogram is a sum for each of its cells.
> 
> For example, in the 1D case, suppose the original field is air_temperature
> as a function of time only. Then the histogram variable is
>   float hair(tair);
>     hair:standard_name="histogram_of_air_temperature";
>     hair:units="1";
>     hair:cell_methods="time: sum tair: sum";
>   

Re: [CF-metadata] Usage of histogram_of_X_over_Z

2016-10-27 Thread martin.juckes
Dear Jonathan,

thanks for that detailed overview. I accept your justification for having the 
key physical quantity of interest in the standard name.

In broad usage, I have the impression that a "histogram" can be expressed as 
either a count or a percentage, so we should be explicit in the convention if 
we want a narrower definition here. A narrower definition is probably needed, 
as there would otherwise be no way of distinguishing between the two.

Would you support the addition of a paragraph in the convention to explain the 
usage you have described below?

There are two further CMIP variables, both of which are bi-variate 
distributions, with bins of spectral bands and cloud top height ranges, which 
I'd like to bring into the discussion, but it might be useful to transfer the 
conclusions of the exchange so far into a ticket first. I think the two 
additional variables could be covered by a simple extension to 
"probability_density_function_of_X_and_Y" ... though you might want to insert 
"joint_" at the beginning of the term.

regards,
Martin



Dear Martin and Alejandro  (following off-list discussions)

> The CF definitions say ''"histogram_of_X[_over_Z]" means histogram (i.e. 
> number of counts for each range of X) of variations (over Z) of X.'

Yes, that's in the guidelines for construction of standard names, and there
are only two of them at present, as you say. The simplest case is when you
have some quantity Q depending on only one dimension, Q(Z). Then the histogram
H(Q) is the number of values of Q which fall into each interval of Q,
considering variation over Z.  In general there could be more than one
dimension retained, and more than one removed. If the original field was
Q(P,Y,Z,T), we might construct a histogram H(Q,Z,T), for instance, containing
the frequencies of values of Q falling into joint intervals of Q, Z and T, for
variation over P and Y. Following the guideline above, we would call this a
histogram of Q over P and Y, I think.
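
A minimal sketch of this construction, assuming a synthetic field Q(P,Y,Z,T) with made-up dimension sizes and bin edges. The function `q_value` is a hypothetical stand-in for real data. The histogram H(Q,Z,T) counts values of Q per interval of Q for each retained (Z,T), with the variation over P and Y summed out:

```python
from itertools import product

# Hypothetical sizes for the dimensions P, Y, Z, T of the original field.
NP, NY, NZ, NT = 3, 2, 2, 2
q_edges = [0.0, 10.0, 20.0, 30.0]  # made-up bin edges for Q
NQ = len(q_edges) - 1


def q_value(p, y, z, t):
    """Synthetic Q(P,Y,Z,T); any real field could stand in here."""
    return 4.0 * p + 3.0 * y + 10.0 * z + 1.0 * t


def q_bin(value):
    """Index of the half-open interval [lo, hi) of q_edges holding value."""
    for i in range(NQ):
        if q_edges[i] <= value < q_edges[i + 1]:
            return i
    raise ValueError("value outside the histogram range")


# H(Q,Z,T): P and Y are summed over, Z and T are retained.
hist = [[[0] * NT for _ in range(NZ)] for _ in range(NQ)]
for p, y, z, t in product(range(NP), range(NY), range(NZ), range(NT)):
    hist[q_bin(q_value(p, y, z, t))][z][t] += 1

# Each (Z,T) column holds one histogram whose counts sum to NP * NY.
column_totals = [[sum(hist[q][z][t] for q in range(NQ)) for t in range(NT)]
                 for z in range(NZ)]
print(column_totals)
```

Every retained (Z,T) point carries the same total count, NP * NY, because each histogram column summarises the full variation over P and Y.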

It is not necessary to indicate in the standard name the dimensions which
the histogram depends on (Z and T in my example) because the coordinate
variables (of Z and T) make that clear. Martin suggests that by this argument
we could also omit Q from the standard name, and just call it a histogram
(or frequency distribution) rather than a histogram of Q, where Q is air
temperature, precipitation amount, backscattering ratio, etc. I think there
are two reasons why we include Q in the standard name,

* I think a histogram of air temperature is not the same geophysical quantity
as a histogram of precipitation amount, for instance, so they should be
distinguished by standard name.

* Although histograms are pure numbers, and so are probabilities, probability
densities are not. Histograms, probability distributions and probability
density functions are all related ways of expressing the same information.
In the guidelines, we foresee that we might need names for all of them (though
so far we have only histograms) and it would make sense to give them consistent
names. The probability density function of air temperature has units of K-1,
and of precipitation amount kg-1 m2, for instance. Because they have different
canonical units, they must have different standard names, so Q needs to be
included in the standard name.
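
A minimal sketch of the units argument, assuming a made-up air temperature histogram with unequal bin widths. Dividing the probability in each bin by the bin width gives a density in K-1, and integrating that density over the temperature axis gives back a pure number:

```python
# Made-up histogram of air temperature: bin edges in K and counts per bin.
edges_k = [270.0, 272.0, 276.0, 284.0]  # unequal widths: 2 K, 4 K, 8 K
counts = [4, 8, 4]                       # pure numbers, units "1"

total = sum(counts)
widths_k = [hi - lo for lo, hi in zip(edges_k[:-1], edges_k[1:])]

# Probability per bin is dimensionless (units "1").
probability = [c / total for c in counts]

# Probability density divides by the bin width, so its units are K-1.
density_per_k = [p / w for p, w in zip(probability, widths_k)]

# Integrating the density over the temperature axis recovers a pure number.
integral = sum(d * w for d, w in zip(density_per_k, widths_k))

print(probability)
print(density_per_k)
print(integral)
```

The first two bins hold different probabilities but the same density, which shows that the density carries information about the bin width in the units of Q and is not interchangeable with the count or the probability.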

Cell methods describe how the values represent variation within the cells.
The transformation from the values of a quantity to a histogram of the
quantity makes the original quantity into a dimension. This seems more of
a radical transformation than computing a mean or a standard deviation, which
doesn't change the dimensions of the variable, but just reduces their size
(to unity if completely collapsed). A frequency distribution of Q is
regarded as a different geophysical quantity from Q itself, so we have not
used cell methods to describe the relationship. Of course, this is a bit
arbitrary (like everything else in the CF convention!).

I agree with Martin that we could omit the "over" part of the standard name for
histograms, probabilities and probability densities. It is useful to retain the
collapsed dimensions as size-1 dimensions, so that their original range can
be recorded. They could be assigned cell_method of "sum", the default for
extensive quantities, because the histogram applies to their entire range.
The same applies to the variable which has been histogrammed and is now a
dimension; the histogram is a sum for each of its cells.

For example, in the 1D case, suppose the original field is air_temperature
as a function of time only. Then the histogram variable is
  float hair(tair);
    hair:standard_name="histogram_of_air_temperature";
    hair:units="1";
    hair:cell_methods="time: sum tair: sum";
    hair:coordinates="time";
  float time; // scalar coordinate variable with bounds
  float tair(tair);
    tair:units="K";

As a multidimensional example, suppose the original field is
  float tair(time,altitude,latitude,long

Re: [CF-metadata] Usage of histogram_of_X_over_Z

2016-10-13 Thread Bodas-Salcedo, Alejandro
Dear Martin,

You are right, those definitions are not correct.

> From your reply I understand now that these are univariate distributions
> giving the frequency of different radar reflectivities in different height
> bands. Coming from radar/lidar instruments (or an emulator of these
> instruments), there are multiple observations in each GCM-scale height band.
> Presumably, there are also multiple profiles in the GCM-scale grid square, so
> that we have a frequency distribution over sub-grid scale variability in the
> vertical and the horizontal? Or is it actually evaluated at a spatial point?
>
There is a sub-grid distribution of vertical profiles from which they are 
constructed.

The definition that you propose seems accurate to me. Thanks again for your 
time spent clarifying this.

Regards,

Alejandro

> -Original Message-
> From: CF-metadata [mailto:cf-metadata-boun...@cgd.ucar.edu] On Behalf Of
> martin.juc...@stfc.ac.uk
> Sent: 13 October 2016 13:05
> To: cf-metadata@cgd.ucar.edu
> Subject: [CF-metadata] Usage of histogram_of_X_over_Z
> 
> Dear Alejandro,
> 
> The two CMIP variables which I'm talking about are cfadDbze94 currently
> defined as "CFAD (Cloud Frequency Altitude Diagrams) are joint height - radar
> reflectivity (or lidar scattering ratio) distributions." and cfadLidarsr532,
> which has the same definition. If they are not joint distributions we clearly
> have a problem with these definitions.
> 
> From your reply I understand now that these are univariate distributions
> giving the frequency of different radar reflectivities in different height
> bands. Coming from radar/lidar instruments (or an emulator of these
> instruments), there are multiple observations in each GCM-scale height band.
> Presumably, there are also multiple profiles in the GCM-scale grid square, so
> that we have a frequency distribution over sub-grid scale variability in the
> vertical and the horizontal? Or is it actually evaluated at a spatial point?
> 
> If this is the case, you are right and we just need to correct the
> definitions in the CMIP tables (though there is still a case for introducing a
> frequency_distribution for other variables, but that should be another
> thread). I would favour a slightly more verbose and explicit definition, e.g.
> "CFAD (Cloud Frequency Altitude Diagrams) are frequency distributions of radar
> reflectivity (or lidar scattering ratio) as a function of altitude.
> cfadDbze94 is defined as the simulated relative frequency of radar
> reflectivity in sampling volumes defined by altitude bins and model grid
> cells."
> 
> Note that I'm using "altitude" rather than "height" to match the standard
> names: in the CF Convention, "altitude" means height above the geoid, and
> "height" means height above the surface.
> 
> Is that an accurate definition?
> 
> regards,
> Martin
> 
> 
> Dear Martin,
> 
> Thanks for your detailed explanation. I'd like to add a bit more information.
> These variables are not joint distributions, they are 1D distributions for
> different ranges of Z. The question is, does "histogram_of_X[_over_Z]" mean
> that the Z coordinate has to be completely collapsed? It is not clear to me
> that the current definition implies that. If Z is not completely collapsed,
> you can then end up with a function of the form frequency(lat,lon,X,Z2), where
> the coordinate Z is only partially collapsed into bins described by Z2. I'm
> using here Z2 to explicitly show when the Z coordinate represents bins. This
> would look like a joint histogram, but it is not. I think that your proposal
> of dropping "_over_Z" from the standard name works for a joint distribution,
> but not for a collection of 1D distributions along Z, unless there is a way of
> distinguishing between both cases with the use of attributes.
> 
> Another detail is that these histograms provide relative frequencies (values
> between 0 and 1, not counts), not absolute frequencies. Is that inconsistent
> with the current definition of histogram in CF?
> 
> Regards,
> 
> Alejandro
> 
> > -Original Message-
> > From: martin.juckes at stfc.ac.uk [mailto:martin.juckes at stfc.ac.uk]
> > Sent: 12 October 2016 19:05
> > To: cf-metadata at cgd.ucar.edu
> > Cc: Bodas-Salcedo, Alejandro
> > Subject: Usage of histogram_of_X_over_Z
> >
> > Hello,
> >
> > There are two standard names of the form histogram_of_... in the CF Standard
> > Name list (at version 36):
> > histogram_of_backscattering_ratio_over_height_above_reference_ellipsoid and
> > histogram_of_equivalent_reflectivity_factor_over_height_above_reference_ellipsoid.
> > Both of these were used in CMIP5 and set to be used in CMIP6, but the usage
> > does not appea

Re: [CF-metadata] Usage of histogram_of_X_over_Z

2016-10-13 Thread Bodas-Salcedo, Alejandro
Dear Martin,

Thanks for your detailed explanation. I'd like to add a bit more information. 
These variables are not joint distributions, they are 1D distributions for 
different ranges of Z. The question is, does "histogram_of_X[_over_Z]" mean 
that the Z coordinate has to be completely collapsed? It is not clear to me that 
the current definition implies that. If Z is not completely collapsed, you can 
then end up with a function of the form frequency(lat,lon,X,Z2), where the 
coordinate Z is only partially collapsed into bins described by Z2. I'm using 
here Z2 to explicitly show when the Z coordinate represents bins. This would 
look like a joint histogram, but it is not. I think that your proposal of 
dropping "_over_Z" from the standard name works for a joint distribution, but 
not for a collection of 1D distributions along Z, unless there is a way of 
distinguishing between both cases with the use of attributes.
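
A minimal sketch of why the two cases look alike, under the assumption (made for this sketch, and not stated in the CF definitions) that the difference shows up in how the relative frequencies are normalised. The counts are made up and represent one grid cell of frequency(lat,lon,X,Z2):

```python
# Made-up counts[x_bin][z2_bin] for one grid cell: 3 bins of X, 2 bins of Z2.
counts = [
    [1, 3],
    [2, 6],
    [1, 3],
]
n_x, n_z2 = len(counts), len(counts[0])

# Joint distribution of X and Z2: normalised by the grand total.
grand_total = sum(sum(row) for row in counts)
joint = [[c / grand_total for c in row] for row in counts]

# Collection of 1D distributions of X, one per Z2 bin: each Z2 column is
# normalised by its own total.
level_totals = [sum(counts[x][z] for x in range(n_x)) for z in range(n_z2)]
per_level = [[counts[x][z] / level_totals[z] for z in range(n_z2)]
             for x in range(n_x)]

joint_sum = sum(sum(row) for row in joint)
per_level_sums = [sum(per_level[x][z] for x in range(n_x))
                  for z in range(n_z2)]
print(joint_sum)
print(per_level_sums)
```

Both arrays have the same shape and the same coordinates, so the array alone does not say which case applies: the joint form sums to 1 over X and Z2 together, while the collection of 1D distributions sums to 1 over X within each Z2 bin.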

Another detail is that these histograms provide relative frequencies (values 
between 0 and 1, not counts), not absolute frequencies. Is that inconsistent 
with the current definition of histogram in CF?

Regards,

Alejandro

> -Original Message-
> From: martin.juc...@stfc.ac.uk [mailto:martin.juc...@stfc.ac.uk]
> Sent: 12 October 2016 19:05
> To: cf-metadata@cgd.ucar.edu
> Cc: Bodas-Salcedo, Alejandro
> Subject: Usage of histogram_of_X_over_Z
> 
> Hello,
> 
> There are two standard names of the form histogram_of_... in the CF Standard
> Name list (at version 36):
> histogram_of_backscattering_ratio_over_height_above_reference_ellipsoid and
> histogram_of_equivalent_reflectivity_factor_over_height_above_reference_ellipsoid.
> Both of these were used in CMIP5 and set to be used in CMIP6, but the usage
> does not appear to match the standard name descriptions.
> 
> The possible confusion is over the role of different coordinates. The CF
> definitions say ''"histogram_of_X[_over_Z]" means histogram (i.e. number of
> counts for each range of X) of variations (over Z) of X.' This implies to me
> that you start with a function of Z and possibly other coordinates and end up
> with a function of X and the other coordinates. E.g. if the source data is
> X(lat,lon,Z), then the histogram data will be of the form
> frequency(lat,lon,X).
> 
> In the two CMIP5/CMIP6 draft variables (cfadLidarsr532, cfadDbze94) using
> these standard names the "Z" coordinate which is included in the standard name
> ("height_above_reference_ellipsoid") is one of the coordinates of the
> histogram data variable. Both these variables appear to be joint distributions
> (frequency of X and Y values) over sub-grid variability as a function of
> latitude, longitude and time.
> 
> I've been reviewing these existing definitions in some detail because there
> are some new distribution variables in the request and I'd like to make sure
> that we have a consistent approach.
> 
> If we need to describe a variable which carries a joint distribution of X and
> Y, then the variable will have to use X and Y as coordinates, so perhaps we
> can simplify the process by leaving them out of the standard name. Similarly
> the "over_Z" part of the name would be better expressed as a cell_methods
> construct. This line of reasoning suggests using a new standard name such as
> "frequency_distribution" (units "1"). The only difficulty is that the
> frequency distribution might be a function of the quantities X and Y
> (scattering ratio and cloud top height for cfadLidarsr532) and also of
> latitude, longitude and time. There should be some way of distinguishing the
> different roles of these 5 coordinates: it is the distribution of X and Y as a
> function of latitude, longitude and time. I think this could be done
> conveniently by introducing a single new attribute, e.g. "bin_coords: X Y".
> 
> "frequency_distribution" could be used for single or joint distributions.
> 
> My questions to the list are:
> (1) am I missing something in my interpretation of the existing
> histogram_of_... names?
> (2) if not, is the adoption of a "frequency_distribution" standard name an
> appropriate way forward?
> 
> regards,
> Martin
> 
___
CF-metadata mailing list
CF-metadata@cgd.ucar.edu
http://mailman.cgd.ucar.edu/mailman/listinfo/cf-metadata