[R] Markov chain resources and questions

2003-10-19 Thread Paul Meagher
Can someone give me a pointer to where I should be looking for markov chain
resources in R?

Longer term, I am also interested in the question of whether explanatory
variables can coupled to a probability transition matrix to assist in
predicting the next state that a an object/system will go into.  I'm
imagining that this gets kind of ugly when you have nominal data (i.e., the
next state) that you are trying to predict using a transition matrix and you
want to try to boost your predictive power by incorporating other regressor
variables.  Can this be done, how, and is there something in R that does
this?

Another issue is how to assess the potential usefulness of the probability
transition matrix and the corresponding frequency matrix.  If your frequency
matrix only has a few observations in each cell then it is not too useful
for predictive purposes.  Also, if the probabilties are close to .50 in all
the cells, again it is not useful because it is not informative about the
next state.  Is their research that speaks to the issue of assessing the
utility or informativeness of the transition matrix for predictive
purposes.

Regards,
Paul Meagher

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


Re: [R] R Production Performance

2003-09-23 Thread Paul Meagher
Hi Zitan,

Below is the test I ran awhile back on invoking R as a system call.  It
might be faster if you had a c-extension to R but before I went that route I
would want to know 1) roughly how fast Python and Perl are in returning
results with their c-bindings/embedded stuff/dcom stuff, 2) whether R can be
run as a daemon process so you don't incur start up costs, and 3) whether R
can act as a math server in the sense that it will fork children or threads
as multiple users establish sessions with it.  I agree it would be nice to
have a better interface to R than via a system call.

Regards,
Paul Meagher

=

I just timed how long it took to pipe a file containing 2 lines below into R
(via a PHP script executed from my browser):

input.R
--
x = cbind(4,5,3,2,3,4)
x


?php
// timer.php

function getmicrotime(){
  list($usec, $sec) = explode( ,microtime());
  return ((float)$usec + (float)$sec);
}

$time_start = getmicrotime();

$Input  = ./input.R;
$Output = ./output.R;
$RPath = /usr/local/bin/R;
system($RPath --no-save  $Input  $Output);
$fp = fopen($Output, r);
while (!feof($fp)) {
  $line = fgets($fp,4096);
  echo $line .br;
}
fclose($fp);

$time_end = getmicrotime();
$time = $time_end - $time_start;
echo pTime To Execute: $time seconds/p;
?

The time to execute this script was 3.1081320047379 seconds (if I execute
the script a few times this around what I get).

I then removed the line that calls R only.  There was something in the
output.R file so, in essense, the only difference between the original
script and modified script is the removal of the system call to R.

The time to execute was 0.0010089874267578 seconds

By subtractive logic, this means the call to R incurs an overhead of around
3 seconds on a average web server box using the php-apache module.








- Original Message - 
From: Zitan Broth [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Wednesday, September 24, 2003 2:46 PM
Subject: Re: [R] R Production Performance


 Hi James,

 Thanks for your response :-)

 - Original Message -
  It is like anything else that you want to run as part of web services:
 what
  do you want it to do?  Yes, it is fast in doing computations, but what
 will
  you have it do?  It is probably as fast as anything else that you will
 find
  out there that is fairly general purpose.

 I just want to use R for mathematical computations, and will call it via
PHP
 from the commandline with infile. We'll need to obviously test this
 ourselves, but I just thought I'd raise the question :-))

  Are you going to be creating a lot of graphics that have to be displayed
  back on the screen?  How is the user going to input data (flat files,
XML,
  Excel worksheets, Oracle database, ...)?  Will you be invoking a unique
  process each time a user calls, or will you be using a 'daemon' that
will
  communicate with DCOM and such?  How many people will be trying to
access
  it once and what is the mix of transactions that they will use?

 Well for sure the rest of the app needs to scale as well and be fast,
 failsafe etc..., but I am just asking about R.

 I was imagining using a unique process call each time I access R, which is
 how the apache/php/*nix environment works best (although keeping processes
 in memory is achievable as well).  My experience to date on integration
with
 C packages deploying to *nix is that this works quite effectively although
 certain packages require process management that are not multiprocess (to
 ensure that R for example only executes one computation at a time), but
this
 is no problem. There are ways to call c packages directly with PHP (swig)
 and I am investigating this at present.

  You can probably get a real good feel by enclosing the operations that
you
  want to do in a system.time function to see how long it will take.
This
  really depends on what you are trying to do.  I can definitely say that
it
  is faster than trying to code the algorithm in PERL or another scripting
  language.

 Makes sense because R is written in C, where PERL and PHP are also written
 in C, so R is a layer deep so to speak :-)

 Thanks again,
 Z.

  Greetings All,
 
  Been playing with R and it is very easy to get going with the UI or
infile
  batch commands :-)
 
  What I am wondering is how scalable and fast R is for running as part of
a
  web service.  I believe R is written in C which is a great start, but
what
  are peoples general thoughts on this?
 
  Thanks greatly,
  Z.
 
   [[alternative HTML version deleted]]
 
  __
  [EMAIL PROTECTED] mailing list
  https://www.stat.math.ethz.ch/mailman/listinfo/r-help
 
 
 
 
  --
  NOTICE:  The information contained in this electronic mail transmission
 is
  intended by Convergys Corporation for the use of the named individual or
  entity to which it is directed and may contain information

[R] Updating a linear model

2003-09-22 Thread Paul Meagher
Say I have collected data and used it to construct a linear model.

I now have a new observation and want to use it to update my linear model.

Is there a more efficient way to update the model than recomputing the
linear model from the complete data set?

In other words, can I incrementally update a linear model as new
observations come in without recomputing the linear model from the full data
set?   I've looked at the help for the lm package but can't see anything
obvious that answers my question.

I am trying to use a set of regressors to predict what the next observation
will be.  I would like my coeeficient and intercept estimates to improve as
the data comes in - it will need to work well under limited information at
first (or not at all if there is some lower bound of information quantity
that is required).  If the next data point comes in, I will have (binary or
quantitative) information about whether the prediction was successful or
not.  Not sure if the option to use feedback would dictate a different
approach than the one I am thinking of.   Not sure if a backprop neural
network can be used to estimate the coefficients in a dynamically evolving
multiple linear regression equation.

If anyone has any pointers to packages I might want to look at, suggestions,
etc... I would appreciate it.

Regards,

Paul Meagher

Datavore Productions
50 Wood Grove Drive
Truro, Nova Scotia
B2N-6Y4
1.902.895.9671
www.datavore.com

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


Re: [R] Updating a linear model

2003-09-22 Thread Paul Meagher
My google search for Plackett's Algorithm didn't return too much except that
Plackett's algorithm appears to be useful in Control Theory - it is
elaborated as Plackett's algorithm for on-line recursive least squares
estimation.  Sounds something like what I want.

I am looking at developing a user modelling type app (new data points coming
in and wanting to dynamically update regression co-efficients for each user)
which could be viewed as a type of control problem.

Regards,
Paul

- Original Message - 
From: Rolf Turner [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Monday, September 22, 2003 3:36 PM
Subject: Re: [R] Updating a linear model



 As Thomas Lumley just said, there's nothing implemented in R.  The
 updating procedure is well-known and worked out; it is (in my
 experience) referred to as ``Plackett's Algorithm''.

 The reference I have to hand for this --- should you wish to roll
 your own --- is:

 Rao, C. Radhakrishna, Linear Statistical Inference and Its
 Applications, John Wiley \ Sons, New York, 1968,  page 29,
 Exercise 2.8

 There are --- almost surely --- better references, but I don't
 know them.
 cheers,

 Rolf Turner
 [EMAIL PROTECTED]


__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


Re: [R] Overlaying graphs

2003-09-05 Thread Paul Meagher
From: Damon Wischik [EMAIL PROTECTED]

 Paul Meagher wrote:
  2. Does R have a suite of best-fit tools for finding the best
  fitting-probability distribution for any observed probability
distribution?

 I think that the best-fitting probability distribution for an observed
 probability distribution is the empirical distribution of your
 observations.

 (Perhaps you have some other criteria than just goodness of fit?)

You can certainly use the empirical distribution of observations to
construct your probability distribution and you are correct that, in some
sense, this would be the best fitting probability distribution.

Lately I have been asking myself why we bother in the first place to use
theoretical probability distributions to model our empirically
distributions.  Why not construct the probability distribution directly from
the data itself?  I think that in some cases, this is the correct route to
go.  Computers allow us to make inferences about the probability of certain
outcomes using these irregularly shaped distributions.  These inferences may
be more accurate than using any of the available theoretical probability
distributions.

The main reasons I can come up with for not using the empirical distribution
itself as your probability distribution are:

1. Over-fitting which limits your ability to generalize to new situations.
This, I think, is most important reason for engaging in the exercise of
fitting your data to a theoretical distribution.

2. It is easier to derive inferences about your random variable.  This is
the second most important reason.

3. Anyone who plays with numbers constitutionally tends towards platonism.

Regards,
Paul

 Damon.



__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


Re: [R] Overlaying graphs

2003-09-04 Thread Paul Meagher

- Original Message - 
From: Richard A. O'Keefe [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Thursday, September 04, 2003 2:56 AM
Subject: Re: [R] Overlaying graphs


 I do not know how to overlay the curve graphic on top of hist graphic.

 Do you know about the add=TRUE option for plot()?

 I am hoping to show visually that the normal curve overlays the obtained
 probability distribution when plotted on the same graph.  Unfortunately, I
 an not sure how to overlay them. Can anyone point me in the right
direction
 or show me the code.

 This is a bad way to do it anyway.  What you want is a qqnorm plot.
 See ?qqnorm.



__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


[R] Overlaying graphs

2003-09-03 Thread Paul Meagher
I am wanting to construct a probability distribution for height and then,
hopefully, visually and analytically demonstrate that it is normally
distributed.

These are the commands I have developed so far:

fat   - read.table(fat.dat, header=TRUE)
mu- mean(fat$height)
sdev  - sd(fat$height)
hist(fat$height, br=20, freq=FALSE, xlab=Male Height in Inches)
curve(dnorm(x, mu, sdev), from=64, to=78)

I do not know how to overlay the curve graphic on top of hist graphic.

I am hoping to show visually that the normal curve overlays the obtained
probability distribution when plotted on the same graph.  Unfortunately, I
an not sure how to overlay them. Can anyone point me in the right direction
or show me the code.

The data set is here:

http://www.datavore.com/fat.dat

Regards,
Paul Meagher

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


[R] Generating routine for Poisson random numbers

2003-08-27 Thread Paul Meagher
 You can generate Poisson random numbers from a Poisson process like this:

 rfishy-function(lambda){
 t - 0
 i - -1
 while(t=lambda){
 t-t-log(runif(1))
 i-i+1
 }
 return(i)
 }

This is a nice compact algorithm for generating Poisson random numbers.  It
appears to work.  I am used to seeing a Poisson counting process implemented
using a num frames parameter, a success probability per frame, and
counting the unform numbers in the 0 to 1 range that fall below the success
probability over num frames.  It is interesting to see an implementation
that bypasses having to use a num_frames parameter, just the lambda value,
and relies instead on incrementing a t value on each iteration and counting
the number of times you can do so (while t = lambda).

 The name of this generator is descriptive, not a pub. It is very slow for
 large lambda, and incorrect for extremely large lambda (and possibly for
 extremely small lambda). If you only wanted a fairly small number of
 random variates with, say, 1e-6lambda100, then it's not too bad.

One could impose a lambda range check so that you can only invoke the
function using a lamba range where the Poisson RNG is expected to be
reasonably accurate.  The range you are giving is probably the most commonly
used range where a Poisson random number generator might be used?  Brian
Ripley also mentioned that the counting process based implementation would
not work well for large lambdas.  Do you encounter such large lambdas in
practice?   Can't you always, in theory, avoid such large lambdas by
changing the size of the time interval you want to consider?

 But why would anyone *want* to code their own Poisson random number
generator, except perhaps as an interesting student exercise?

Yes this is meant as an interesting exercise for someone who wants to
understand how to implement probability distributions in an object oriented
way (I am writing an article introducing people to probability modelling).
I am looking for a compact algorithm that I can easily explain to people how
it works and which will be a good enough rpois() approximation in many
cases.  I don't want to be blown out of the water for suggesting such an
algorithm to represent a Poisson RNG so if you think it is inappropriate to
learn about what how a Poisson RNG works using the above described
generating process,  then I would be interested in your views.

Thank you for your thoughts on this matter.

Regards,
Paul Meagher




 -thomas



__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


[R] Re: Generating routine for Poisson random numbers

2003-08-27 Thread Paul Meagher
 I do think that in an article you should also point out to people that
 there is a lot of numerical code available out there, written by people
 who know a lot more than we do about what they are doing. It's often
 easier than writing your own code and the results are better. One
 advantage of an object-oriented approach is that you can just rip out your
 implementation and slot in a new one if it is better.

Exactly.  That is another reason I do not feel the need to implement the RNG
algorithm perfectly.  If someone really wants a more fool proof rpois-like
algorithm with better running time characteristics they can reimplement
method using the rpois.c normal deviates approach.

BTW, here is what my final Poisson RNG method looks like coded in PHP - it
is modelled after the JSci library approach.  I am only showing the
constructor and the RNG method:

class PoissonDistribution extends ProbabilityDistribution {

  var $lambda;

  function PoissonDistribution($lambda=1) {
if($lambda = 0.0) {
  die(Lamda parameter should be positive.);
}
$this-lambda = $interval;
  }

  function RNG($num_vals=1) {
if ($num_vals  1) {
  die(Number of random numbers to return must be 1 or greater);
}
for ($i=0; $i  $num_vals; $i++) {
  $temp  = 0;
  $count = -1;
  while($temp = $this-lambda) {
$rand_val = mt_rand() / mt_getrandmax();
$temp  = $temp - log($rand_val);
$count++;
  }
  // record count value(s)
  if ($num_vals == 1) {
$counts = $count;
  } else {
$counts[$i] = $count;
  }
}
return $counts;
  }

My simple eyeball tests indicate that the algorithm appears to generate
unbiased estimates of the expected mean and variance given lambas in the
range of .02 and 900.  I guess to confirm the unbiasedness I would need to
generate a bunch of sample estimates of the mean and variance from my
Poisson random number sequences, plot the relative frequency of these
estimates, and see if the central tendency of the estimates correspond to
the mean and variance expected theoretically for a poisson random variable
(i.e., mean and variance = lambda).

I see why the performance characteristics get bad when lambda is big - the
counting process involves more iterations.  Most of the text book examples
never use a lambda this big, often lambda is less than 100 and often not
less than .02 or so.  In other words, the typical parameter space for the
algorithm may be such that areas where it breaks down are not that common in
practice.

I think this will be a perfectly acceptable RNG for a Poisson random
variable provided you don't use unusually large or small lambda values - if
I knew the break down range, I could implement a check-range test to
disallow usage of the function for that range.

Not sure yet exactly what characteristics of the algorithm would lead it to
behave incorrectly at extremely small or large lambda values?

BTW, is this simple method of generating a poisson random number discussed
in detail in any other books or papers that I might consult?

Regards,
Paul Meagher

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


[R] Viewing function source

2003-08-26 Thread Paul Meagher
I know I should probably RTFM for this question, but could someone tell me
if R supports the idea of viewing source on any particular function you
want to use?

If I want to view source on the rpois() function, for example, can I do
somethink like:

source(rpois)

To see how the function is implemented?

Regards,

Paul Meagher

Datavore Productions
50 Wood Grove Drive
Truro, Nova Scotia
B2N-6Y4
1.902.895.9671
www.datavore.com

__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help


Re: [R] Viewing function source

2003-08-26 Thread Paul Meagher
 Just type the name of the function to see the R code
rpois
   function (n, lambda)
   .Internal(rpois(n, lambda))

 But in this case it tells you that rpois is implemented in C code :(

  By convention, it is likely to be a function called do_rpois,
  however in this case we have an exception to the convention.  You can
 look in src/main.names.c for the name of the C function.

 abacus% fgrep rpois names.c
 {rpois,   do_random1, 3,  11, 2,  {PP_FUNCALL,
 PREC_FN,0}},

 so the function do_random1 is actually involved. If you grep for that, you
 are directed to src/main/random.c and you will eventually find that the
 underlying code is in src/nmath/rpois.c and is commented with a reference
 to the ACM Transactions on Mathematical Software.

Thank you for helping me with the research on this.

I am interested in the rpois source in particular because I am wanting to
implement a compact algorithm for it.

The rpois.c code is estimated at around 140 lines of code when you subtract
out line spaces and commenting.   Fairly inscrutable if you haven't read the
article or don't have a mathematical programming background.

I had hoped that I would see a compact algorithm of around 20 lines of code
based on a counting process sensitive to a lambda parameter.  Instead it
uses a normal deviates method that is probably much more optimal in many
mathematical senses.

Personally, I would like to see a counting process implementation of an
poisson random number generator.  I suspect it would be much slower than
rpois.c (because it would likely depend upon setting a num_frames iteration
counter) and less accurate, but would be much more compact and give more
intuitive insight and understanding of a numerical process that generates
the poisson random deviates.  I am not very fluent yet in R programming to
attempt this.  I could take a stab at it with PHP which I am much more
fluent in if I am not barking up the wrong tree on this conjecture.

Regards,
Paul Meagher



 -thomas



__
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help