Re: [R] density of hist(freq = FALSE) inversely affected by data magnitude

2013-01-23 Thread William Dunlap
I think it is a fair bit of work to interpret the freq=TRUE (prob=FALSE)
version of hist() when the bins have unequal sizes.  E.g.,
in the following the bins are sized so that each contains
an equal number of observations.  The resulting flat
frequency plot is hard for me to interpret.  The density plot
is easy.

   x <- rnorm(1000, sd=50)
   hist(x, breaks=quantile(x,(0:10)/10), prob=TRUE)
   hist(x, breaks=quantile(x,(0:10)/10), prob=FALSE)
  Warning message:
  In plot.histogram(r, freq = freq1, col = col, border = border, angle = angle,  :
    the AREAS in the plot are wrong -- rather use freq=FALSE
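
A minimal sketch of why the density version is easier to read, assuming an
arbitrary seed (42) and using plot = FALSE so the numbers can be inspected
directly. With decile breaks every bin holds the same count, so the frequency
bars carry no shape information, while the density bars have areas that sum
to 1:

```r
set.seed(42)  # arbitrary seed, only for reproducibility
x <- rnorm(1000, sd = 50)

# Same decile breaks as above, but return the histogram object without plotting
h <- hist(x, breaks = quantile(x, (0:10) / 10), plot = FALSE)

h$counts                  # every bin holds the same number of observations
widths <- diff(h$breaks)  # bin widths: narrow near the mean, wide in the tails
h$density                 # counts / (n * width): tall where bins are narrow
sum(h$density * widths)   # total bar area
```

The densities are large where the bins are narrow and small in the wide tail
bins, which is the shape that prob=TRUE draws.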

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] density of hist(freq = FALSE) inversely affected by data magnitude

2013-01-22 Thread J Toll
Hi,

I have a couple of observations, a question or two, and perhaps a
suggestion related to the plotting of density on the y-axis within the
hist() function when freq=FALSE.  I was using the function and trying
to develop an intuitive understanding of what the density is telling
me.  After reading through this fairly helpful post:

http://stats.stackexchange.com/questions/17258/odd-problem-with-a-histogram-in-r-with-a-relative-frequency-axis

I finally realized that in the case where freq = FALSE, the y-axis
isn't really telling me the density.  It's actually indicating the
density multiplied by the bin size.  I assume this is for the case
where the bins may be of non-regular size.

from hist.default:

dens <- counts/(n * diff(breaks))

So the count in each bin is divided by the total number of
observations (n) multiplied by the size of the bin.  The problem, as I
see it, is that the density ends up being scaled by the size of the
bins, which is inversely proportional to the magnitude of the data.
Therefore the magnitude of the data is directly affecting the density,
which seems problematic.
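
The formula quoted from hist.default can be checked by hand. A minimal sketch,
assuming an arbitrary seed and default breaks; it recomputes the heights from
the counts and compares them with the density component returned by hist():

```r
set.seed(1)  # arbitrary seed, only for reproducibility
x <- runif(100)

h <- hist(x, plot = FALSE)  # default breaks, no plot
n <- length(x)

# Recompute the bar heights from the counts and the bin widths
dens <- h$counts / (n * diff(h$breaks))

all.equal(dens, h$density)   # compare with the heights hist() reports
sum(dens * diff(h$breaks))   # height times width, summed over bins
```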

For example*:

set.seed()
x <- runif(100)
y <- x / 1000

par(mfrow = c(2, 1))
hist(x, prob = TRUE)
hist(y, prob = TRUE)

From this example, you see that the density for the y histogram is
1000 times larger, simply because the y data is 1000 times smaller.
Again, that seems problematic.  It seems to me, that the density
should be unit-less, but here it's affected by the magnitude of the
data.
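
The factor of 1000 can be reproduced numerically. A minimal sketch, assuming an
arbitrary seed and explicit breaks (bx is a hypothetical name) so that both
histograms use matching bins; it also shows that the total bar area is 1 in
both cases:

```r
set.seed(1)  # arbitrary seed, only for reproducibility
x <- runif(100)
y <- x / 1000

bx <- seq(0, 1, by = 0.1)  # explicit breaks so the two histograms line up
hx <- hist(x, breaks = bx, plot = FALSE)
hy <- hist(y, breaks = bx / 1000, plot = FALSE)

all.equal(hy$density, 1000 * hx$density)  # heights scale by 1000

# Height times width: the total area is the same for both
area_x <- sum(hx$density * diff(hx$breaks))
area_y <- sum(hy$density * diff(hy$breaks))
c(area_x, area_y)
```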

So, my question is, why is density calculated this way?

For the case where all the bins are of the same size, I would think
density should simply be calculated as:

dens <- counts / n

Of course, that might be somewhat misleading for the case where the
bin sizes vary.  So then why not calculate density as:

dens <- counts / (n * diff(breaks) / min(diff(breaks)))

Dividing diff(breaks) by min(diff(breaks)) removes the scaling effect
of the magnitude of the data, and simply leaves the relative
difference in bin size.

For the case where all the bins are the same size, the calculation is
equivalent to dens <- counts / n

For all other cases, the density is scaled by the size of the bin, but
unaffected by the magnitude of the data.
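
A sketch of what the proposed rescaling would do, assuming equal-width bins and
an arbitrary seed (rel, w, and bx are illustrative names). The heights become
bin proportions that no longer depend on the magnitude of the data, but the
total bar area then equals the smallest bin width rather than 1:

```r
set.seed(1)  # arbitrary seed, only for reproducibility
y <- runif(100) / 1000

bx <- seq(0, 1, by = 0.1) / 1000  # ten equal-width bins covering the data
h <- hist(y, breaks = bx, plot = FALSE)
n <- length(y)
w <- diff(h$breaks)

# Proposed alternative: scale widths relative to the narrowest bin
rel <- h$counts / (n * w / min(w))

sum(rel)      # heights sum to about 1 when all bins are equal
sum(rel * w)  # bar area is about min(w), so it does not integrate to 1
```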

So, what am I misunderstanding?  Why is density calculated as it is,
and what does it mean?

Thanks,


James


*example from http://stats.stackexchange.com/questions/17258/odd-problem-with-a-histogram-in-r-with-a-relative-frequency-axis



Re: [R] density of hist(freq = FALSE) inversely affected by data magnitude

2013-01-22 Thread William Dunlap
The probability density function is not unitless - it is the derivative of the
[cumulative] probability distribution function, so it has units of
delta-probability-mass over delta-x.  It must integrate to 1 (over all
possible x).  hist(freq=FALSE,x) or hist(prob=TRUE,x) displays an estimate of
the density function, and the following example shows how the scale matches
what you get from the presumed population density function.

f <- function (n, sd)
{
    x <- rnorm(n, sd = sd)
    hist(x, freq = FALSE) # estimated density
    s <- seq(min(x), max(x), len = 129)
    lines(s, dnorm(s, sd = sd), col = "red") # overlay expected density for this sample
}
f(1e6, sd=1)
f(100, sd=1)
f(100, sd=0.0001)
f(1e6, sd=0.0001)
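
The scale seen in these plots follows from the normal density itself. A minimal
sketch (sds, peak, and area are illustrative names): the peak height
dnorm(0, sd) equals 1/(sd * sqrt(2*pi)), so it grows as sd shrinks, while the
area under the curve stays at 1. The integration limits of +/- 10 sd are an
assumption chosen to keep integrate() numerically stable:

```r
sds <- c(1, 0.0001)

# Peak height of the normal density for each standard deviation
peak <- dnorm(0, sd = sds)
peak         # much larger for the smaller sd
peak * sds   # constant: 1 / sqrt(2 * pi) in both cases

# Area under each curve over +/- 10 standard deviations
area <- sapply(sds, function(s) {
    integrate(dnorm, lower = -10 * s, upper = 10 * s, sd = s)$value
})
area
```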

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com





Re: [R] density of hist(freq = FALSE) inversely affected by data magnitude

2013-01-22 Thread J Toll
Bill,

Thank you.  I got it.  That can require a fair amount of work to
interpret the density, especially with odd or irregular bin sizes.

Thanks again,

James



