[R] Markov chain resources and questions
Can someone give me a pointer to where I should be looking for markov chain resources in R? Longer term, I am also interested in the question of whether explanatory variables can coupled to a probability transition matrix to assist in predicting the next state that a an object/system will go into. I'm imagining that this gets kind of ugly when you have nominal data (i.e., the next state) that you are trying to predict using a transition matrix and you want to try to boost your predictive power by incorporating other regressor variables. Can this be done, how, and is there something in R that does this? Another issue is how to assess the potential usefulness of the probability transition matrix and the corresponding frequency matrix. If your frequency matrix only has a few observations in each cell then it is not too useful for predictive purposes. Also, if the probabilties are close to .50 in all the cells, again it is not useful because it is not informative about the next state. Is their research that speaks to the issue of assessing the utility or informativeness of the transition matrix for predictive purposes. Regards, Paul Meagher __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] R Production Performance
Hi Zitan, Below is the test I ran awhile back on invoking R as a system call. It might be faster if you had a c-extension to R but before I went that route I would want to know 1) roughly how fast Python and Perl are in returning results with their c-bindings/embedded stuff/dcom stuff, 2) whether R can be run as a daemon process so you don't incur start up costs, and 3) whether R can act as a math server in the sense that it will fork children or threads as multiple users establish sessions with it. I agree it would be nice to have a better interface to R than via a system call. Regards, Paul Meagher = I just timed how long it took to pipe a file containing 2 lines below into R (via a PHP script executed from my browser): input.R -- x = cbind(4,5,3,2,3,4) x ?php // timer.php function getmicrotime(){ list($usec, $sec) = explode( ,microtime()); return ((float)$usec + (float)$sec); } $time_start = getmicrotime(); $Input = ./input.R; $Output = ./output.R; $RPath = /usr/local/bin/R; system($RPath --no-save $Input $Output); $fp = fopen($Output, r); while (!feof($fp)) { $line = fgets($fp,4096); echo $line .br; } fclose($fp); $time_end = getmicrotime(); $time = $time_end - $time_start; echo pTime To Execute: $time seconds/p; ? The time to execute this script was 3.1081320047379 seconds (if I execute the script a few times this around what I get). I then removed the line that calls R only. There was something in the output.R file so, in essense, the only difference between the original script and modified script is the removal of the system call to R. The time to execute was 0.0010089874267578 seconds By subtractive logic, this means the call to R incurs an overhead of around 3 seconds on a average web server box using the php-apache module. - Original Message - From: Zitan Broth [EMAIL PROTECTED] To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Wednesday, September 24, 2003 2:46 PM Subject: Re: [R] R Production Performance Hi James, Thanks for your response :-) - Original Message - It is like anything else that you want to run as part of web services: what do you want it to do? Yes, it is fast in doing computations, but what will you have it do? It is probably as fast as anything else that you will find out there that is fairly general purpose. I just want to use R for mathematical computations, and will call it via PHP from the commandline with infile. We'll need to obviously test this ourselves, but I just thought I'd raise the question :-)) Are you going to be creating a lot of graphics that have to be displayed back on the screen? How is the user going to input data (flat files, XML, Excel worksheets, Oracle database, ...)? Will you be invoking a unique process each time a user calls, or will you be using a 'daemon' that will communicate with DCOM and such? How many people will be trying to access it once and what is the mix of transactions that they will use? Well for sure the rest of the app needs to scale as well and be fast, failsafe etc..., but I am just asking about R. I was imagining using a unique process call each time I access R, which is how the apache/php/*nix environment works best (although keeping processes in memory is achievable as well). My experience to date on integration with C packages deploying to *nix is that this works quite effectively although certain packages require process management that are not multiprocess (to ensure that R for example only executes one computation at a time), but this is no problem. There are ways to call c packages directly with PHP (swig) and I am investigating this at present. You can probably get a real good feel by enclosing the operations that you want to do in a system.time function to see how long it will take. This really depends on what you are trying to do. I can definitely say that it is faster than trying to code the algorithm in PERL or another scripting language. Makes sense because R is written in C, where PERL and PHP are also written in C, so R is a layer deep so to speak :-) Thanks again, Z. Greetings All, Been playing with R and it is very easy to get going with the UI or infile batch commands :-) What I am wondering is how scalable and fast R is for running as part of a web service. I believe R is written in C which is a great start, but what are peoples general thoughts on this? Thanks greatly, Z. [[alternative HTML version deleted]] __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help -- NOTICE: The information contained in this electronic mail transmission is intended by Convergys Corporation for the use of the named individual or entity to which it is directed and may contain information
[R] Updating a linear model
Say I have collected data and used it to construct a linear model. I now have a new observation and want to use it to update my linear model. Is there a more efficient way to update the model than recomputing the linear model from the complete data set? In other words, can I incrementally update a linear model as new observations come in without recomputing the linear model from the full data set? I've looked at the help for the lm package but can't see anything obvious that answers my question. I am trying to use a set of regressors to predict what the next observation will be. I would like my coeeficient and intercept estimates to improve as the data comes in - it will need to work well under limited information at first (or not at all if there is some lower bound of information quantity that is required). If the next data point comes in, I will have (binary or quantitative) information about whether the prediction was successful or not. Not sure if the option to use feedback would dictate a different approach than the one I am thinking of. Not sure if a backprop neural network can be used to estimate the coefficients in a dynamically evolving multiple linear regression equation. If anyone has any pointers to packages I might want to look at, suggestions, etc... I would appreciate it. Regards, Paul Meagher Datavore Productions 50 Wood Grove Drive Truro, Nova Scotia B2N-6Y4 1.902.895.9671 www.datavore.com __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] Updating a linear model
My google search for Plackett's Algorithm didn't return too much except that Plackett's algorithm appears to be useful in Control Theory - it is elaborated as Plackett's algorithm for on-line recursive least squares estimation. Sounds something like what I want. I am looking at developing a user modelling type app (new data points coming in and wanting to dynamically update regression co-efficients for each user) which could be viewed as a type of control problem. Regards, Paul - Original Message - From: Rolf Turner [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Monday, September 22, 2003 3:36 PM Subject: Re: [R] Updating a linear model As Thomas Lumley just said, there's nothing implemented in R. The updating procedure is well-known and worked out; it is (in my experience) referred to as ``Plackett's Algorithm''. The reference I have to hand for this --- should you wish to roll your own --- is: Rao, C. Radhakrishna, Linear Statistical Inference and Its Applications, John Wiley \ Sons, New York, 1968, page 29, Exercise 2.8 There are --- almost surely --- better references, but I don't know them. cheers, Rolf Turner [EMAIL PROTECTED] __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] Overlaying graphs
From: Damon Wischik [EMAIL PROTECTED] Paul Meagher wrote: 2. Does R have a suite of best-fit tools for finding the best fitting-probability distribution for any observed probability distribution? I think that the best-fitting probability distribution for an observed probability distribution is the empirical distribution of your observations. (Perhaps you have some other criteria than just goodness of fit?) You can certainly use the empirical distribution of observations to construct your probability distribution and you are correct that, in some sense, this would be the best fitting probability distribution. Lately I have been asking myself why we bother in the first place to use theoretical probability distributions to model our empirically distributions. Why not construct the probability distribution directly from the data itself? I think that in some cases, this is the correct route to go. Computers allow us to make inferences about the probability of certain outcomes using these irregularly shaped distributions. These inferences may be more accurate than using any of the available theoretical probability distributions. The main reasons I can come up with for not using the empirical distribution itself as your probability distribution are: 1. Over-fitting which limits your ability to generalize to new situations. This, I think, is most important reason for engaging in the exercise of fitting your data to a theoretical distribution. 2. It is easier to derive inferences about your random variable. This is the second most important reason. 3. Anyone who plays with numbers constitutionally tends towards platonism. Regards, Paul Damon. __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] Overlaying graphs
- Original Message - From: Richard A. O'Keefe [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Thursday, September 04, 2003 2:56 AM Subject: Re: [R] Overlaying graphs I do not know how to overlay the curve graphic on top of hist graphic. Do you know about the add=TRUE option for plot()? I am hoping to show visually that the normal curve overlays the obtained probability distribution when plotted on the same graph. Unfortunately, I an not sure how to overlay them. Can anyone point me in the right direction or show me the code. This is a bad way to do it anyway. What you want is a qqnorm plot. See ?qqnorm. __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
[R] Overlaying graphs
I am wanting to construct a probability distribution for height and then, hopefully, visually and analytically demonstrate that it is normally distributed. These are the commands I have developed so far: fat - read.table(fat.dat, header=TRUE) mu- mean(fat$height) sdev - sd(fat$height) hist(fat$height, br=20, freq=FALSE, xlab=Male Height in Inches) curve(dnorm(x, mu, sdev), from=64, to=78) I do not know how to overlay the curve graphic on top of hist graphic. I am hoping to show visually that the normal curve overlays the obtained probability distribution when plotted on the same graph. Unfortunately, I an not sure how to overlay them. Can anyone point me in the right direction or show me the code. The data set is here: http://www.datavore.com/fat.dat Regards, Paul Meagher __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
[R] Generating routine for Poisson random numbers
You can generate Poisson random numbers from a Poisson process like this: rfishy-function(lambda){ t - 0 i - -1 while(t=lambda){ t-t-log(runif(1)) i-i+1 } return(i) } This is a nice compact algorithm for generating Poisson random numbers. It appears to work. I am used to seeing a Poisson counting process implemented using a num frames parameter, a success probability per frame, and counting the unform numbers in the 0 to 1 range that fall below the success probability over num frames. It is interesting to see an implementation that bypasses having to use a num_frames parameter, just the lambda value, and relies instead on incrementing a t value on each iteration and counting the number of times you can do so (while t = lambda). The name of this generator is descriptive, not a pub. It is very slow for large lambda, and incorrect for extremely large lambda (and possibly for extremely small lambda). If you only wanted a fairly small number of random variates with, say, 1e-6lambda100, then it's not too bad. One could impose a lambda range check so that you can only invoke the function using a lamba range where the Poisson RNG is expected to be reasonably accurate. The range you are giving is probably the most commonly used range where a Poisson random number generator might be used? Brian Ripley also mentioned that the counting process based implementation would not work well for large lambdas. Do you encounter such large lambdas in practice? Can't you always, in theory, avoid such large lambdas by changing the size of the time interval you want to consider? But why would anyone *want* to code their own Poisson random number generator, except perhaps as an interesting student exercise? Yes this is meant as an interesting exercise for someone who wants to understand how to implement probability distributions in an object oriented way (I am writing an article introducing people to probability modelling). I am looking for a compact algorithm that I can easily explain to people how it works and which will be a good enough rpois() approximation in many cases. I don't want to be blown out of the water for suggesting such an algorithm to represent a Poisson RNG so if you think it is inappropriate to learn about what how a Poisson RNG works using the above described generating process, then I would be interested in your views. Thank you for your thoughts on this matter. Regards, Paul Meagher -thomas __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
[R] Re: Generating routine for Poisson random numbers
I do think that in an article you should also point out to people that there is a lot of numerical code available out there, written by people who know a lot more than we do about what they are doing. It's often easier than writing your own code and the results are better. One advantage of an object-oriented approach is that you can just rip out your implementation and slot in a new one if it is better. Exactly. That is another reason I do not feel the need to implement the RNG algorithm perfectly. If someone really wants a more fool proof rpois-like algorithm with better running time characteristics they can reimplement method using the rpois.c normal deviates approach. BTW, here is what my final Poisson RNG method looks like coded in PHP - it is modelled after the JSci library approach. I am only showing the constructor and the RNG method: class PoissonDistribution extends ProbabilityDistribution { var $lambda; function PoissonDistribution($lambda=1) { if($lambda = 0.0) { die(Lamda parameter should be positive.); } $this-lambda = $interval; } function RNG($num_vals=1) { if ($num_vals 1) { die(Number of random numbers to return must be 1 or greater); } for ($i=0; $i $num_vals; $i++) { $temp = 0; $count = -1; while($temp = $this-lambda) { $rand_val = mt_rand() / mt_getrandmax(); $temp = $temp - log($rand_val); $count++; } // record count value(s) if ($num_vals == 1) { $counts = $count; } else { $counts[$i] = $count; } } return $counts; } My simple eyeball tests indicate that the algorithm appears to generate unbiased estimates of the expected mean and variance given lambas in the range of .02 and 900. I guess to confirm the unbiasedness I would need to generate a bunch of sample estimates of the mean and variance from my Poisson random number sequences, plot the relative frequency of these estimates, and see if the central tendency of the estimates correspond to the mean and variance expected theoretically for a poisson random variable (i.e., mean and variance = lambda). I see why the performance characteristics get bad when lambda is big - the counting process involves more iterations. Most of the text book examples never use a lambda this big, often lambda is less than 100 and often not less than .02 or so. In other words, the typical parameter space for the algorithm may be such that areas where it breaks down are not that common in practice. I think this will be a perfectly acceptable RNG for a Poisson random variable provided you don't use unusually large or small lambda values - if I knew the break down range, I could implement a check-range test to disallow usage of the function for that range. Not sure yet exactly what characteristics of the algorithm would lead it to behave incorrectly at extremely small or large lambda values? BTW, is this simple method of generating a poisson random number discussed in detail in any other books or papers that I might consult? Regards, Paul Meagher __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
[R] Viewing function source
I know I should probably RTFM for this question, but could someone tell me if R supports the idea of viewing source on any particular function you want to use? If I want to view source on the rpois() function, for example, can I do somethink like: source(rpois) To see how the function is implemented? Regards, Paul Meagher Datavore Productions 50 Wood Grove Drive Truro, Nova Scotia B2N-6Y4 1.902.895.9671 www.datavore.com __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] Viewing function source
Just type the name of the function to see the R code rpois function (n, lambda) .Internal(rpois(n, lambda)) But in this case it tells you that rpois is implemented in C code :( By convention, it is likely to be a function called do_rpois, however in this case we have an exception to the convention. You can look in src/main.names.c for the name of the C function. abacus% fgrep rpois names.c {rpois, do_random1, 3, 11, 2, {PP_FUNCALL, PREC_FN,0}}, so the function do_random1 is actually involved. If you grep for that, you are directed to src/main/random.c and you will eventually find that the underlying code is in src/nmath/rpois.c and is commented with a reference to the ACM Transactions on Mathematical Software. Thank you for helping me with the research on this. I am interested in the rpois source in particular because I am wanting to implement a compact algorithm for it. The rpois.c code is estimated at around 140 lines of code when you subtract out line spaces and commenting. Fairly inscrutable if you haven't read the article or don't have a mathematical programming background. I had hoped that I would see a compact algorithm of around 20 lines of code based on a counting process sensitive to a lambda parameter. Instead it uses a normal deviates method that is probably much more optimal in many mathematical senses. Personally, I would like to see a counting process implementation of an poisson random number generator. I suspect it would be much slower than rpois.c (because it would likely depend upon setting a num_frames iteration counter) and less accurate, but would be much more compact and give more intuitive insight and understanding of a numerical process that generates the poisson random deviates. I am not very fluent yet in R programming to attempt this. I could take a stab at it with PHP which I am much more fluent in if I am not barking up the wrong tree on this conjecture. Regards, Paul Meagher -thomas __ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help