Re: [R] Question on output of vectors from a for loop into a matrix
On Thursday 16 August 2007 15:35, Ryan Briscoe Runquist wrote: > Hello R-help, > > I am a recent convert to R from SAS and I am having some difficulty with > output of a for loop into a matrix. I have searched the help manuals and > the archives, but I can't get my code to work. It is probably a syntax > error that I am not spotting. I am trying to make a distance matrix by > extracting the distances from each point to all others in a vector (the for > loop). I can get them all to print and can save each individually to a > file, but I cannot get them to be bound into a matrix. > > Here is the code that I have tried: > > m<-length(Day.1.flower.coords$x) #31 grid points > > output.matix<-matrix(0.0,m,m) > for(i in 1:m){ > dist<-spDistsN1(Day.1.coords.matrix, Day.1.coords.matrix[i,]) > dist<-as.vector(dist) > output.matrix[i,]<-dist > print(dist)} > > The error message that I get is: > > Error in output.matrix[i,] <- dist : incorrect number of subscripts on > matrix > > Thanks for your help. > > Ryan > > ~~ > Ryan D. Briscoe Runquist > Population Biology Graduate Group > University of California, Davis > [EMAIL PROTECTED] > Hi Bryan, for all UCD students you have the luxury of not using a loop! :) Would something like this work - for the creation of a 'distance matrix' : # example dataset: 2 x 2 grid d <- expand.grid(x=1:2, y=1:2) # an instructive plot plot(d, type='n') text(d, rownames(d)) # create distance object and convert to matrix: m <- as.matrix(dist(d)) m 1234 1 0.00 1.00 1.00 1.414214 2 1.00 0.00 1.414214 1.00 3 1.00 1.414214 0.00 1.00 4 1.414214 1.00 1.00 0.00 # is that the kind of distance matrix you are looking for : ?dist Distance Matrix Computation Description: This function computes and returns the distance matrix computed by using the specified distance measure to compute the distances between the rows of a data matrix. [...] cheers, Dylan > __ > R-help@stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html and provide commented, minimal, > self-contained, reproducible code. -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] ROC curve in R
On Thursday 26 July 2007 10:45, Frank E Harrell Jr wrote: > Dylan Beaudette wrote: > > On Thursday 26 July 2007 06:01, Frank E Harrell Jr wrote: > >> Note that even though the ROC curve as a whole is an interesting > >> 'statistic' (its area is a linear translation of the > >> Wilcoxon-Mann-Whitney-Somers-Goodman-Kruskal rank correlation > >> statistics), each individual point on it is an improper scoring rule, > >> i.e., a rule that is optimized by fitting an inappropriate model. Using > >> curves to select cutoffs is a low-precision and arbitrary operation, and > >> the cutoffs do not replicate from study to study. Probably the worst > >> problem with drawing an ROC curve is that it tempts analysts to try to > >> find cutoffs where none really exist, and it makes analysts ignore the > >> whole field of decision theory. > >> > >> Frank Harrell > > > > Frank, > > > > This thread has caught may attention for a couple reasons, possibly > > related to my novice-level experience. > > > > 1. in a logistic regression study, where i am predicting the probability > > of the response being 1 (for example) - there exists a continuum of > > probability values - and a finite number of {1,0} realities when i either > > look within the original data set, or with a new 'verification' data set. > > I understand that drawing a line through the probabilities returned from > > the logistic regression is a loss of information, but there are times > > when a 'hard' decision requiring prediction of {1,0} is required. I have > > found that the ROCR package (not necessarily the ROC Curve) can be useful > > in identifying the probability cutoff where accuracy is maximized. Is > > this an unreasonable way of using logistic regression as a predictor? Thanks for the detailed response Frank. My follow-up questions are below: > Logistic regression (with suitable attention to not assuming linearity > and to avoiding overfitting) is a great way to estimate P[Y=1]. Given > good predicted P[Y=1] and utilities (losses, costs) for incorrect > positive and negative decisions, an optimal decision is one that > optimizes expected utility. The ROC curve does not play a direct role > in this regard. Ok. > If per-subject utilities are not available, the analyst > may make various assumptions about utilities (including the unreasonable > but often used assumption that utilities do not vary over subjects) to > find a cutoff on P[Y=1]. Can you elaborate on what exactly a "per-subject utility" is? In my case, I am trying to predict the occurance of specific soil features based on two predictor variables: 1 continuous, the other categorical. Thus far my evaluation of how well this method works is based on how often I can correctly predict (a categorical) quality. > A very nice feature of P[Y=1] is that error > probabilities are self-contained. For example if P[Y=1] = .02 for a > single subject and you predict Y=0, the probability of an error is .02 > by definition. One doesn't need to compute an overall error probability > over the whole distribution of subjects' risks. If the cost of a false > negative is C, the expected cost is .02*C in this example. Interesting. The hang-up that I am having is that I need to predict from {O,1}, as the direct users of this information are not currently interested in in raw probabilities. As far as I know, in order to predict a class from a probability I need use a cutoff... How else can I accomplish this without imposing a cutoff on the entire dataset? One thought, identify a cutoff for each level of the categorical predictor term in the model... (?) > > 2. The ROC curve can be a helpful way of communicating false positives / > > false negatives to other users who are less familiar with the output and > > interpretation of logistic regression. > > What is more useful than that is a rigorous calibration curve estimate > to demonstrate the faithfulness of predicted P[Y=1] and a histogram > showing the distribution of predicted P[Y=1] Ok. I can make that histogram - how would one go about making the 'rigorous calibration curve' ? Note that I have a training set, from which the model is built, and a smaller testing set for evaluation. > . Models that put a lot of > predictions near 0 or 1 are the most discriminating. Calibration curves > and risk distributions are easier to explain than ROC curves. By 'risk discrimination' do you mean said histogram ? > Too often > a statistician will solve for a cutoff on P[Y=1], imposing her own > utility function without querying any subjects. in this case I have
Re: [R] ROC curve in R
On Thursday 26 July 2007 06:01, Frank E Harrell Jr wrote: > Note that even though the ROC curve as a whole is an interesting > 'statistic' (its area is a linear translation of the > Wilcoxon-Mann-Whitney-Somers-Goodman-Kruskal rank correlation > statistics), each individual point on it is an improper scoring rule, > i.e., a rule that is optimized by fitting an inappropriate model. Using > curves to select cutoffs is a low-precision and arbitrary operation, and > the cutoffs do not replicate from study to study. Probably the worst > problem with drawing an ROC curve is that it tempts analysts to try to > find cutoffs where none really exist, and it makes analysts ignore the > whole field of decision theory. > > Frank Harrell Frank, This thread has caught may attention for a couple reasons, possibly related to my novice-level experience. 1. in a logistic regression study, where i am predicting the probability of the response being 1 (for example) - there exists a continuum of probability values - and a finite number of {1,0} realities when i either look within the original data set, or with a new 'verification' data set. I understand that drawing a line through the probabilities returned from the logistic regression is a loss of information, but there are times when a 'hard' decision requiring prediction of {1,0} is required. I have found that the ROCR package (not necessarily the ROC Curve) can be useful in identifying the probability cutoff where accuracy is maximized. Is this an unreasonable way of using logistic regression as a predictor? 2. The ROC curve can be a helpful way of communicating false positives / false negatives to other users who are less familiar with the output and interpretation of logistic regression. 3. I have been using the area under the ROC Curve, kendall's tau, and cohen's kappa to evaluate the accuracy of a logistic regression based prediction, the last two statistics based on a some probability cutoff identified before hand. How does the topic of decision theory relate to some of the circumstances described above? Is there a better way to do some of these things? Cheers, Dylan > > [EMAIL PROTECTED] wrote: > > http://search.r-project.org/cgi-bin/namazu.cgi?query=ROC&max=20&result=no > >rmal&sort=score&idxname=Rhelp02a&idxname=functions&idxname=docs > > > > there is a lot of help try help.search("ROC curve") gave > > Help files with alias or concept or title matching 'ROC curve' using > > fuzzy matching: > > > > > > > > granulo(ade4) Granulometric Curves > > plot.roc(analogue)Plot ROC curves and associated > > diagnostics > > roc(analogue) ROC curve analysis > > colAUC(caTools) Column-wise Area Under ROC > > Curve (AUC) > > DProc(DPpackage) Semiparametric Bayesian ROC > > curve analysis > > cv.enet(elasticnet) Computes K-fold cross-validated > > error curve for elastic net > > ROC(Epi) Function to compute and draw > > ROC-curves. > > lroc(epicalc) ROC curve > > cv.lars(lars) Computes K-fold cross-validated > > error curve for lars > > roc.demo(TeachingDemos) Demonstrate ROC curves by > > interactively building one > > > > HTH > > see the help and examples those will suffice > > > > Type 'help(FOO, package = PKG)' to inspect entry 'FOO(PKG) TITLE'. > > > > > > > > Regards, > > > > Gaurav Yadav > > +++ > > Assistant Manager, CCIL, Mumbai (India) > > Mob: +919821286118 Email: [EMAIL PROTECTED] > > Bhagavad Gita: Man is made by his Belief, as He believes, so He is > > > > > > > > "Rithesh M. Mohan" <[EMAIL PROTECTED]> > > Sent by: [EMAIL PROTECTED] > > 07/26/2007 11:26 AM > > > > To > > > > cc > > > > Subject > > [R] ROC curve in R > > > > > > > > > > > > > > Hi, > > > > > > > > I need to build ROC curve in R, can you please provide data steps / code > > or guide me through it. > > > > > > > > Thanks and Regards > > > > Rithesh M Mohan > > > > > > [[alternative HTML version deleted]] > > - > Frank E Harrell Jr Professor and Chair School of Medicine > Department of Biostatistics Vanderbilt University > > __ > R-help@stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html and provide commented, minimal, > self-contained, reproducible code. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] DF and intercept term meaning for mixed (lme) models
Hi, I am using the lme package to fit mixed effects models to a set of data. I am having a difficult time understanding the *meaning* of the numDF (degrees of freedom in the numerator), denDF (DF in the denomenator), as well as the Intercept term in the output. For example: I have a groupedData object called 'Soil', and am fitting an lme model as follows: ## fit a simple model # errors partitioned among replicates fit1 <- lme( log(ksat) ~ log(conc) + ordered(sar) + soil_id , random = ~ 1 | rep, data=Soil ) ## check significance of model terms anova(fit1) numDF denDF F-value p-value (Intercept) 1 1253 64313.21 <.0001 log(conc) 1 597 173.34 <.0001 ordered(sar) 2 597 13.87 <.0001 soil_id 29 597 54.92 <.0001 I am pretty sure that I am interpreting the p-values for the predictor terms to mean that these terms contribute significantly to the variation in the response variable, (?) . I am not sure what the significance of the Intercept term really means. Does it have something to do with the significance of the random effects in the model? Also, from a practical standpoint, how can I best describe / interpret the numDF and denDF terms to others... or do they even matter to a person who is looking to see if the 'treatment' predictor terms had any effect on the response term? Thanks in advance. Dylan __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] reversing the x-axis terms on a boxplot
Hi, I am able to reverse the order of plotting on regular plots (i.e. with the plot() function) by manually setting the xlim variable. Is there some trick like this which will work for a boxplot? * for example: l <- sample(letters, 500, replace=TRUE) n <- runif(500) boxplot(n ~ l) this will produce a plot with the x-axis ranging from a->z ... i know that these are labels, associated with an integer index-- but is there someway to reverse the plotting order? Thanks in advance, Dylan __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] inter-rater agreement index kappa
On Tuesday 26 June 2007 10:14, Nair, Murlidharan T wrote: > Is there a function that calculates the inter-rater agreement index > (kappa) in R? > > Thanks ../Murli I have found a couple useful approaches: # PCC, kappa, rand index require(e1701) classAgreement(2x2.table) # kendall's tau cor(x,y, method='kendall') cheers, -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] off-topic: affine transformation matrix
Thanks for the prompt and clear reply! The simplicity of the solution may have been why I initially overlooked this approach... The results look convincing (http://169.237.35.250/~dylan/temp/affine.png), now I just need to verify that the output from coef() is in the format that I need it in. l <- lm(cbind(nx,ny) ~ x + y, data=g) coef(l) nx ny (Intercept) 6.87938629 5.515261158 x1.01158806 -0.005449152 y -0.04481893 0.996895878 ## convert to format needed for affine() function in postGIS? t(coef(l)) (Intercept)x y nx6.879386 1.011588063 -0.04481893 ny5.515261 -0.005449152 0.99689588 note that the format that I am looking for looks something like the matrix defined on this page: http://www.geom.uiuc.edu/docs/reference/CRC-formulas/node15.html cheers, dylan On Monday 28 May 2007 15:18, Prof Brian Ripley wrote: > Isn't this just a regression (hopefully with a near-zero error). > > coef(lm(cbind(xnew, ynew) ~ xold + yold)) > > should do what I think you are asking for. (I am not clear which > direction you want the transformation, so choose 'old' and 'new' > accordingly.) > > On Mon, 28 May 2007, Dylan Beaudette wrote: > > This may sound like a very naive question, but... > > > > give two lists of coordinate pairs (x,y - Cartesian space) is there any > > simple way to compute the affine transformation matrix in R. > > > > I have a set of data which is offset from where i know it should be. I > > have coordinates of the current data, and matching coordinates of where > > the data should be. I need to compute the composition of the affine > > transformation matrix, so that I can apply an affine transform the entire > > dataset. > > > > any ideas? > > > > thanks in advance! -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] off-topic: affine transformation matrix
This may sound like a very naive question, but... give two lists of coordinate pairs (x,y - Cartesian space) is there any simple way to compute the affine transformation matrix in R. I have a set of data which is offset from where i know it should be. I have coordinates of the current data, and matching coordinates of where the data should be. I need to compute the composition of the affine transformation matrix, so that I can apply an affine transform the entire dataset. any ideas? thanks in advance! -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] cross-validation / sensitivity anaylsis for logistic regression model
Hi, I have developed a logistic regression model in the form of (factor_1~ numeric + factor_2) and would like to perform a cross-validation or some similar form of sensitivity analysis on this model. using cv.glm() from the boot package: # dataframe from which model was built in 'z' # model is called 'm_geo.lrm' # as suggested in the man page for a binomial model: cost <- function(r, pi=0) mean(abs(r-pi)>0.5) cv.10.err <- cv.glm(z, m_geo.lrm, cost, K=10)$delta I get the following: cv.10.err 1 1 0.275 0.281 Am I correct in interpreting that this is the mean estimated error percentage for this specified model, after 10 runs of the cross-validation? any tips on understanding the output from cv.glm() would be greatly appreciated. I am mostly looking to perform a sensitivity analysis with this model and dataset - perhaps there are other methods? thanks -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] aggregate similar to SPSS
?table On Wednesday 25 April 2007 14:32, Natalie O'Toole wrote: > Hi, > > Does anyone know if: with R can you take a set of numbers and aggregate > them like you can in SPSS? For example, could you calculate the percentage > of people who smoke based on a dataset like the following: > > smoke = 1 > non-smoke = 2 > > variable > 1 > 1 > 1 > 2 > 2 > 1 > 1 > 1 > 2 > 2 > 2 > 2 > 2 > 2 > > > When aggregated, SPSS can tell you what percentage of persons are smokers > based on the frequency of 1's and 2's. Can R statistical package do a > similar thing? > > Thanks, > > Nat > > __ > R-help@stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html and provide commented, minimal, > self-contained, reproducible code. -- Dylan Beaudette Soil Resource Laboratory http://casoilresource.lawr.ucdavis.edu/ University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] locate nearest value in lookup table
Hi everyone, I have a pile of data derived from an analytical device, which reports values as a continuous distribution. I need to associate classes (based on the Munsell color system) using a standard look-up table - the problem is that I would like to find the *closest* matching entry in the lookup table. I have attempted to do this by first creating as difference vector between color space coordinates associated with a single analytical value and the entire lookup table. Then I found the smallest difference vector, and used that entry in the lookup table. Hoerver, This does not always produce logical results... here are the steps: # read in the data soil <- read.table("http://169.237.35.250/~dylan/temp/munsell-all.dat";, header=T) x <- read.csv('http://169.237.35.250/~dylan/temp/data_dump_4-7-2007.csv', header=FALSE) # extract important variables X,Y,Z c <- data.frame(id=x$V1, X=x$V2, Y=x$V3, Z=x$V4) # convert munsell to XYZ soil$X <- soil$x * (soil$Y/soil$y) soil$Y <- soil$Y soil$Z <- (1 - soil$x - soil$y) * (soil$Y / soil$y) ## init some vars and try to find the closest in the table res <- list() for( i in as.numeric(rownames(c)) ) { # compute a difference for each color component d <- cbind( abs(soil$X - c$X[i]), abs(soil$Y - c$Y[i]), abs(soil$Z - c$Z[i]) ) # find the difference vector # 3D distance formula b <- sqrt( sqrt( d[,1]^2 + d[,2]^2) + d[,3]^2) # get the smallest difference vector i.closest <- head( soil[order(b), ], 1) res$input_row[i] <- i res$H[i] <- as.vector(unlist(i.closest[1])) res$V[i] <- as.vector(unlist(i.closest[2])) res$C[i] <- as.vector(unlist(i.closest[3])) res$diff_vect[i] <- min(b) } # summarize the conversions paste(res$input_row, res$H, res$V, res$C) Does this approach even make sense? thanks, dylan __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] prop.test or chisq.test ..?
On Wednesday 28 February 2007 01:07, Christoph Buser wrote: > Hi > > Some comments are inside. > > Dylan Beaudette writes: > > Hi everyone, > > > > Suppose I have a count the occurrences of positive results, and the > > total number of occurrences: > > > > > > pos <- 14 > > total <- 15 > > > > testing that the proportion of positive occurrences is greater than 0.5 > > gives a p-value and confidence interval: > > > > prop.test( pos, total, p=0.5, alternative='greater') > > > > 1-sample proportions test with continuity correction > > > > data: 14 out of 15, null probability 0.5 > > X-squared = 9.6, df = 1, p-value = 0.0009729 > > alternative hypothesis: true p is greater than 0.5 > > 95 percent confidence interval: > > 0.706632 1.00 > > sample estimates: > > p > > 0.933 > > First of all by default there is a continuity correction in > prop.test(). If you use > > prop.test(pos, total, p=0.5, alternative="greater", correct = FALSE) > > 1-sample proportions test without continuity correction > > data: pos out of total, null probability 0.5 > X-squared = 11.2667, df = 1, p-value = 0.0003946 > alternative hypothesis: true p is greater than 0.5 > 95 percent confidence interval: > 0.7492494 1.000 > sample estimates: > p > 0.933 > > Remark that know the X-squared is identical to your result from > chisq.test(), but the p-value is 0.0007891/2 > > The reason is that you are testing here the against the > alternative "greater" > > If you use a two sided test > > prop.test(pos, total, p=0.5, alternative="two.sided", correct = FALSE) > > then you reporduce the results form chisq.test(). > > > My question is how does the use of chisq.test() differ from the above > > operation. For example: > > > > chisq.test(table( c(rep('pos', 14), rep('neg', 1)) )) > > > > Chi-squared test for given probabilities > > > > data: table(c(rep("pos", 14), rep("neg", 1))) > > X-squared = 11.2667, df = 1, p-value = 0.0007891 > > > > ... gives slightly different results. I am corrent in interpreting that > > the chisq.test() function in this case is giving me a p-value associated > > with the test that the probabilities of pos are *different* than the > > probabilities of neg -- and thus a larger p-value than the prop.test(... > > , p=0.5, alternative='greater') ? > > Yes. In your example chisq.test() tests the null hypothesis if > all population probabilities are equal > > ?chisq.test says: > "In this case, the hypothesis tested is whether the population > probabilities equal those in 'p', or are all equal if 'p' is not > given." > > which means p1 = p2 = 0.5 in your two population case against > the alternative p1 != p2. > > This is similar to the test in prop.test() p=0.5 against p!=0.5 > and therefore you get identical results if you choose > alternative="two.sided" in prop.test(). > > > I realize that this is a rather elementary question, and references to a > > text would be just as helpful. Ideally, I would like a measure of how > > much I can 'trust' that a larger proportion is also statistically > > meaningful. Thus far the results from prop.test() match my intuition, > > but > > affirmation would be > > Your intuition was correct. Nevertheless in your original > results (with the continuity correction), the p-value of > prop.test() (0.0009729) was larger than the p-value of > chisq.test() (0.0007891) and therefore against your intuition. > > > great. > > > > Cheers, > > > > > > -- > > Dylan Beaudette > > Soils and Biogeochemistry Graduate Group > > University of California at Davis > > 530.754.7341 > > > > __ > > R-help@stat.math.ethz.ch mailing list > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide > > http://www.R-project.org/posting-guide.html and provide commented, > > minimal, self-contained, reproducible code. > > Hope this helps > > Christoph Buser > Thanks for the tips Christoph, this was the help that I was looking for. cheers, -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] prop.test or chisq.test ..?
Hi everyone, Suppose I have a count the occurrences of positive results, and the total number of occurrences: pos <- 14 total <- 15 testing that the proportion of positive occurrences is greater than 0.5 gives a p-value and confidence interval: prop.test( pos, total, p=0.5, alternative='greater') 1-sample proportions test with continuity correction data: 14 out of 15, null probability 0.5 X-squared = 9.6, df = 1, p-value = 0.0009729 alternative hypothesis: true p is greater than 0.5 95 percent confidence interval: 0.706632 1.00 sample estimates: p 0.933 My question is how does the use of chisq.test() differ from the above operation. For example: chisq.test(table( c(rep('pos', 14), rep('neg', 1)) )) Chi-squared test for given probabilities data: table(c(rep("pos", 14), rep("neg", 1))) X-squared = 11.2667, df = 1, p-value = 0.0007891 ... gives slightly different results. I am corrent in interpreting that the chisq.test() function in this case is giving me a p-value associated with the test that the probabilities of pos are *different* than the probabilities of neg -- and thus a larger p-value than the prop.test(... , p=0.5, alternative='greater') ? I realize that this is a rather elementary question, and references to a text would be just as helpful. Ideally, I would like a measure of how much I can 'trust' that a larger proportion is also statistically meaningful. Thus far the results from prop.test() match my intuition, but affirmation would be great. Cheers, -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] model diagnostics for logistic regression
Greetings, I am using both the lrm() {Design} and glm( , family=binomial()) to perform a a logisitic regression in R. Apart from the typical summary() methods, what other methods of diagnosing logistic regression models does R provide? i.e. plotting an 'lm' object, etc. Secondly, is there any facility to calculate the R^{2)_{L} as suggested by Menard in "Applied Logistic Regression Analysis" (2002) ? Any thought would be greatly appreciated. Cheers, -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] sample size calculations
Greetings, I have experimented with the MBESS and pwr packages for the estimation of sample size for a given CV, precision, and confidence interval. Thus far I have found the ss.aipe.cv {MBESS} (Sample size planning for the coefficient of variation given the goal of Accuracy in Parameter Estimation approach to sample size planning.) function to be best suited for my needs. However, the data from which I am calculating my CV is approximately log-normally distributed- and thus has a large CV (1.4). Using this CV, precision (20% within the pop mean) and confidence interval (95%) parameters I obviously get a suggested sample size that is very large (n = 1182). By reducing my precision and confidence interval requirements to something like: ss.aipe.cv(C.of.V=1.4, width=0.5, conf.level=0.9) ... the function still suggests about 230 samples which is near the upper limit of feasibility. I would like to deduce an optimal number of samples, however the log-normal distribution of this data suggests that the above approach is not well suited to this task. Are there any better approaches or references which might send me in the right direction? Thanks in advance, -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] extra panel arguments to plot.nmGroupedData {nlme}
Greetings, I have a groupedData (nmGroupedData) object created with the following syntax: Soil <- groupedData( ksat ~ conc | soil_id/sar/rep, data=soil.data, labels=list(x='Solution Concentration', y='Saturated Hydraulic Conductivity'), units=list(x='(cmol_c)', y='(cm/s)') ) the original data represents longitudinal observations in the form of: 'data.frame': 1197 obs. of 5 variables: $ soil_id: Factor w/ 19 levels "Arbuckle","Campbell",..: 16 16 16 16 16 16 16 16 16 16 ... $ sar: int 12 12 12 12 12 12 12 6 6 6 ... $ conc : num 500 100 50 10 5 1 0.003 500 100 50 ... $ rep: Factor w/ 3 levels "C1","C2","C3": 1 1 1 1 1 1 1 1 1 1 ... $ ksat : num 0.000214 0.000207 0.000198 0.000160 0.000108 ... the default plotting behaviour of this groupedData object works as expected: plot( Soil, collapse=1, inner=~sar, aspect='fill', scales=list( x=list(log=TRUE), y=list(log=TRUE)) ) ... however, there is no way for me to alter the panels... for example, attempting to add a horizontal line with a call to panel.abline= ... : plot( Soil, collapse=1, inner=~sar, aspect='fill', scales=list( x=list(log=TRUE), y=list(log=TRUE)), FUN=mean, panel= function(x,y, subscripts, groups) { panel.xyplot(x, y, type='o') panel.abline(h=0.005, col='black', lty=2) } ) ... results in an identical plot. Trying to re-create the results of plot.nmGroupedData with a direct call to xyplot() has thus far been unsucsessful- as I cannot figure out how to specifiy the original formula in a meaningful way to xyplot: ksat ~ conc | soil_id/sar/rep . Any tips on passing panel arguments to plot.nmGroupedData() would be greatly appreciated. Cheers, Dylan __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] memory problem [cluster]
Hi Stephano, Looks like you used my example verbatim (http://casoilresource.lawr.ucdavis.edu/drupal/node/221) :) While my approach has not *yet* been published, the original source [4] by Roger Bivand certainly has. Just a reminder. That said, I would highly recommend reading up on the background literature assocated with both the cluster package [1] and terrain classificartion i.e. [2] and [3]. Note that although the clara() function was created to work on massive datasets, it is still possible to overwhelm the available memory with multiple gridded objects- recall that all R objects are held in memory. I have asked the maintainer of the cluster package, Martin Maechler, about integrating a known medoid option into the clara() function- which would be extremely useful in adding some 'supervision' to landscape classification with clara(). Hopefully there will be enough requests for the feature, that Martin will kindly add it :) . 1. Kaufman, L. & Rousseeuw, P.J. Finding Groups in Data An Introduction to Cluster Analysis Wiley-Interscience, 2005 2. Blaszczynski, J. Landform characterization with geographical information systems Photogrammetric Engineering and Remote Sensing, 1997, 63, 183-191 3. Wood, W.F. & Snell, J.B. A Quatitative system for classifying landforms U.S. Quatermaster Research & Engineering Center, 1960 4. Bivand, R. Integrating GRASS 5.0 and R: GIS and modern statistics Computers & Geosciences, 2000, 26, 1043–1052 On Friday 01 December 2006 14:04, Massimo Di Stefano wrote: > hi to all, > frustated for this error, to day i buy a 1 GB memory > slot for my laptop > now it have 1,28GB instead the old 512, but i've the > same error :-( > damn!damn!how can i do? > repeat for a little area (about 20X20 km and res=20m) > it work fine! > have you any suggestion? > is ther a method for look if this error depend from my > ram or other? > thanks foe any suggestion! > i need your help. > thanks. > Massimo > > > Il giorno 01/dic/06, alle ore 16:05, massimodisasha ha > scritto: > hi, > i'm trying to perform a clustering on a big dataframe > the code is this: > > > print("load required R packages") > > require(spgrass6) > > require(cluster) > > gmeta6 <- gmeta6() > > print("read in our 7 raster files from GRASS") > > x <- > readFLOAT6sp(c("er","crosc","longc","slope","profc","minic","maxic")) > > print("assemble a matrix of our terrain variables") > > morph <- data.frame(cbind(x$er, x$crosc, x$longc, > x$slope, x$profc, x$minic, x$maxic)) > > print("normailize slope by dividing my max(slope)") > > morph <- data.frame(cbind(x$er, x$crosc, x$longc, > x$slope/max(x$slope), x$profc, x$minic, x$maxic)) > > names(morph) <- > c("er","crosc","longc","slope_n","profc","minic","maxic") > > print("perform the clustering") > > morph.clara <- clara(morph, k=5, stand=F) > > x$morph_class <- morph.clara$clustering > > print("send result back to GRASS") > > rast.put6(x,"morph", zcol="morph_class") > > > > during the step : perform the clustering > after a lot of time, > i've this error: > > > > > Errore in sprintf(fmt, ...) : La lunghezza della > stringa eccede la dimensione del buffer di 8192 > Inoltre: Warning messages: > 1: perl = TRUE è implementato solo nei locale UTF-8 > 2: perl = TRUE è implementato solo nei locale UTF-8 > 3: perl = TRUE è implementato solo nei locale UTF-8 > 4: perl = TRUE è implementato solo nei locale UTF-8 > 5: perl = TRUE è implementato solo nei locale UTF-8 > 6: perl = TRUE è implementato solo nei locale UTF-8 > 7: perl = TRUE è implementato solo nei locale UTF-8 > 8: La stringa di caratteri verrà probabilmente > troncata > Esecuzione interrotta > > > > if i try the same code on a subregion of my data, it > works very fine! > but for a large region i've this error :-( > > obviously i think that is a memory problem, right ? > (i'm working with a notebook PPC-1.33-512ram) > my data are : 7 raster-map on a region of about 50X40 > km at a resolution of 20m. > is there some wolkaround about the memory problems? > > an other question is: > what is this : > Warning messages: > 1: perl = TRUE è implementato solo nei locale UTF-8 > 2: perl = TRUE è implementato solo nei locale UTF-8 > 3: perl = TRUE è implementato solo nei locale UTF-8 > 4: perl = TRUE è implementato solo nei locale UTF-8 > 5: perl = TRUE è implementato solo nei locale UTF-8 > 6: perl = TR
[R] evaluation of 2 matrices: categorical comparison
Greetings, I have two matrices, read in from raster data stored in GRASS. Each matrix represents the output of a aggregation process on categorical data (Nomial / Ordinal) - derived from imagery. I have compared these two data by the following methods: * pretending the data is continuous approaches (flawed i am sure...) 1. computing a difference map by computing the absolute difference (cell-by-cell) between the two outputs 2. simple linear regression / scatter plot of the two outputs 3. 2D histogram via. hist2d() in the gplots package *** categorical comparisons 4. conversion of each matrix to a vector of factors and: 5. frequency tabulation for each level 6. mosiacplot 7. computation of Cohen's Kappa (using the cohen.kappa function from the 'concord' package) : # a and b are vectors with the 'factor-ized' matrices, stored as vectors cohen.kappa(table(a,b), type='count') A lengthy summary of this small adventure has been recorded here: http://casoilresource.lawr.ucdavis.edu/drupal/node/275/ Please note that I am looking for a *better?* way to compare these two matrices. It could be that the Cohen's Kappa, and difference maps are the best that I am going to get. Thanks in advance, -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] any good way to convert sp class objects to splancs object
On Wednesday 25 October 2006 12:02, Roger Bivand wrote: > On Wed, 25 Oct 2006, Dylan Beaudette wrote: > > Greetings: > > > > are there any simple ways to convert an sp class object to a splancs ppp > > class object, outside of reading the coordinates and computing a > > bounding box? > > Aren't ppp class objects from the spatstat package? Have you tried the > spspatstat wrapper package on the R-spatial sourceforge repository? The > as() coerce method should get you there. > > If you really mean splancs, there are not wrappers yet, but are on their > way. > > (Consider posting on R-sig-geo for specific questions like this). > > Roger > > > cheers, Thanks for the reply Roger. I made a slight mistake. I meant to say: convert sp class object (point / polygon) to a *spatstat* ppp class object. in the mean time i had come up with the following hack: # extract data from sp coords slot # extract data from sp bbox slot #manually build ppp class object with the ppp() function. #note that in spatstat the ordering of vertices in a polygon must be in counter-clockwise direction, so i had to reverse my vertex vectors. i will post the worked up examples with data in a bit. thanks! -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] any good way to convert sp class objects to splancs object
Greetings: are there any simple ways to convert an sp class object to a splancs ppp class object, outside of reading the coordinates and computing a bounding box? cheers, -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] compiling rgdal package on windows / macos
On Thursday 05 October 2006 02:06, Roger Bivand wrote: > On Wed, 4 Oct 2006, Dylan Beaudette wrote: > > Greetings: > > > > As I am not a windows user, I cannot try this: is it possible to install > > rgdal on windows without having to compile it from source ? > > Andy Jaworski already replied that rgdal Windows binaries are available > from CRAN mirrors, thanks to Uwe Ligges. > > > Compilation on MacOS is within my abilities, however each time i try and > > install the rgdal package it dies complaining that it cannot find > > gdal-config --- which was recently installed with GRASS. I have updated > > my PATH environment variable, logged out, but R still cannot find the > > gdal-config program. > > For Mac OSX, please consult the detailed instructions on: > > http://www.r-project.org/Rgeo -> Maps (in the left navigation bar). > > Hint: see the configure.args= argument of install.packages() and use > --with-gdal-config= to get the address right. > > No Mac OSX binaries will be provided from CRAN, there are simply too many > Mac OSX varieties out there to be sure that the installed external > software harmonises with the rgdal binary build. However, one way of > getting there is referenced in Rgeo, using William Kyngesburye's > frameworks (rgdal binaries are provided in harmony with PROJ.4 and GDAL). > > By the way: > > RSiteSearch("rgdal macosx") > > first hit takes you to a recent reply on this topic on the R-sig-geo list. > > > any tips on getting the rgdal package up and running on MacOS or Windows > > would be greatly appreciated. > > > > Cheers, Thanks Roger. i had been reading some data information in regards to the rgdal package, and thought I would put out a question as soon as possible. Thanks for all of the hard work! PS: I asked this question as I am currently leading a seminar on open source software tools for soil scientists, and wanted to highlight the GRASS-R coupling. Since many of the students are using macs, I wanted to make sure that I would be able to give them an example that they could implement themselves. Cheers, -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] compiling rgdal package on windows / macos
Greetings: As I am not a windows user, I cannot try this: is it possible to install rgdal on windows without having to compile it from source ? Compilation on MacOS is within my abilities, however each time i try and install the rgdal package it dies complaining that it cannot find gdal-config --- which was recently installed with GRASS. I have updated my PATH environment variable, logged out, but R still cannot find the gdal-config program. any tips on getting the rgdal package up and running on MacOS or Windows would be greatly appreciated. Cheers, -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] mysterious error on compile R 2.3.1
On Wednesday 20 September 2006 12:38, Peter Dalgaard wrote: > Dylan Beaudette <[EMAIL PROTECTED]> writes: > > Getting a very strange error with a new install of R from source on x86; > > > > make[3]: Leaving directory `/tmp/R.INSTALL.r20887/cluster/src' > > ** R > > ** data > > ** moving datasets to lazyload DB > > Error in factor(c(1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1), > > : invalid labels; length 2 should be 1 or 1 > > Execution halted > > ERROR: lazydata failed for package 'cluster' > > ** Removing '/home/dylan/src/R-2.3.1/library/cluster' > > make[2]: *** [cluster.ts] Error 1 > > make[2]: Leaving directory > > `/home/dylan/src/R-2.3.1/src/library/Recommended' make[1]: *** > > [recommended-packages] Error 2 > > make[1]: Leaving directory > > `/home/dylan/src/R-2.3.1/src/library/Recommended' make: *** > > [stamp-recommended] Error 2 > > > > note that i am using the GCC flags: > > CFLAGS=-march=opteron -ffast-math > > CXXFLAGS=-march=opteron -ffast-math > > > > any ideas? > > Don't do that > > Seriously! > > -ffast-math will allow the compiler to break IEEE math specifications, > which in turn will break R all over the place. yikes! i wont do that anymore. this fixed the problem. thanks! cheers, -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] mysterious error on compile R 2.3.1
Getting a very strange error with a new install of R from source on x86; make[3]: Leaving directory `/tmp/R.INSTALL.r20887/cluster/src' ** R ** data ** moving datasets to lazyload DB Error in factor(c(1, 2, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1), : invalid labels; length 2 should be 1 or 1 Execution halted ERROR: lazydata failed for package 'cluster' ** Removing '/home/dylan/src/R-2.3.1/library/cluster' make[2]: *** [cluster.ts] Error 1 make[2]: Leaving directory `/home/dylan/src/R-2.3.1/src/library/Recommended' make[1]: *** [recommended-packages] Error 2 make[1]: Leaving directory `/home/dylan/src/R-2.3.1/src/library/Recommended' make: *** [stamp-recommended] Error 2 note that i am using the GCC flags: CFLAGS=-march=opteron -ffast-math CXXFLAGS=-march=opteron -ffast-math any ideas? -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] questions regarding spline functions
examples lines(spline(y.2 ~ x.2), col="blue", lty=2) lines(new_x.2, new_y.2, col='red', lty=2) lines(ispline_y.2, col='green', lty=2) legend(8.4,24, legend=c('soil data', 'splines()', 'interpSpline()', 'bs()'), col=c('black', 'blue', 'red', 'green'), lty=c(1,1,1,1), lwd=c(2,1,1,1), cex=0.7) Cheers, Dylan > Brent > > D. Brenton Myers > Graduate Fellow > University of Missouri > Soil Environmental and Atmospheric Sciences > 269 Agriculture Engineering Building > Columbia, MO 65211 > Work: (573) 882-1146 > Home: (573) 446-6856 > Cell: (573) 881-6855 > email: [EMAIL PROTECTED] > > > -Original Message- > From: [EMAIL PROTECTED] > [mailto:[EMAIL PROTECTED] On Behalf Of Dylan Beaudette > Sent: Monday, July 31, 2006 6:43 PM > To: RHELP > Subject: [R] questions regarding spline functions > > Greetings, > > A couple general questions regarding the use of splines to interpolate > depth > profile data. > > Here is an example of a set of depths, with associated attributes for a > given > soil profile, along with a function for calculating midpoints from a > set of > soil horizon boundaries: > > #calculate midpoints: > mid <- function(x) { > for( i in 1:length(x)) { > if( i > 1) { >a[i] = (x[i] - x[i-1]) / 2 + x[i-1] > } > } > #reurn the results > a[which(!is.na(a))] > } > > #horizon depth bounds > z <- c(0,2,18,24,68,160,170,192,200) > > #horizon midpoints, associated with horizon attribute > x <- mid(z) > > #clay pct > y <- c(0,1,2,2,4,7,6,1) > > #plot them > plot(y ~ x, xlab="Depth", ylab="Percent Clay", type="s") > points(y ~ x, cex=0.5, pch=16) > > These point pairs usually represent a trend with depth, which I would > like to > model with splines - or some similar approach, as they have been found > to > work better than other methods such as a fitted polynomial. > > Using the B Spline function from the 'splines' package, it is possible > to fit > a model of some property with depth based on the bs() function: > > #natual, B-Splines > library(splines) > > #fit a b-spline model: > fm <- lm(y ~ bs(x, df=5) ) > > I am able to predict a soil property with depth, at unsampled locations > with > this model with: > > new_x <- seq(range(x)[1], range(x)[2], len = 200) > > #predict attribute at unsampled depths: > new_y <- predict(fm, data.frame(x=new_x) ) > > #plot the predicted attribute at the unsampled depths > lines(new_x, new_y, col='red') > > This tends to work fairly well (see attached), but I am wondering if I > can use > the 'knots' parameter in the bs() function for incorporation of horizon > boundary information into the creation of the spline. Moreover, it would > be > nice if the spline-based model 'fm' would predict a set of values with > similar mean and range as the original data points: i.e > > #summary of clay values from original data: > summary(y) >Min. 1st Qu. MedianMean 3rd Qu.Max. > 0.000 1.000 2.000 2.875 4.500 7.00 > > #see above > summary(new_y) > Min. 1st Qu. Median Mean 3rd Qu. Max. > -0.05786 2.09500 3.13200 3.62800 5.17100 7.08700 > > > This is based on an article I read : > http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6V67-3WWRDYY-3 > &_user=4421&_handle=V-WA-A-W-AU-MsSAYWW-UUA-U-AACZEDZWBC-AACVCCDUBC-BZAY > UEWB-AU-U&_fmt=summary&_coverDate=08%2F31%2F1999&_rdoc=3&_orig=browse&_s > rch=%23toc%235807%231999%23999089998%23108393!&_cdi=5807&view=c&_acct=C0 > 00059598&_version=1&_urlVersion=0&_userid=4421&md5=488f1e114d8d64265ff65 > 506e9587e71 > > where the author talks about a so-called 'equal-area quadratic smoothing > > spline' approach to describing a soil property depth function. > Unfortunately > the author did not provide sample code > > Any thoughts / input would be greatly appreciated! > > Cheers, -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 spline2.png Description: PNG image __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] questions regarding spline functions
Greetings, A couple general questions regarding the use of splines to interpolate depth profile data. Here is an example of a set of depths, with associated attributes for a given soil profile, along with a function for calculating midpoints from a set of soil horizon boundaries: #calculate midpoints: mid <- function(x) { for( i in 1:length(x)) { if( i > 1) { a[i] = (x[i] - x[i-1]) / 2 + x[i-1] } } #reurn the results a[which(!is.na(a))] } #horizon depth bounds z <- c(0,2,18,24,68,160,170,192,200) #horizon midpoints, associated with horizon attribute x <- mid(z) #clay pct y <- c(0,1,2,2,4,7,6,1) #plot them plot(y ~ x, xlab="Depth", ylab="Percent Clay", type="s") points(y ~ x, cex=0.5, pch=16) These point pairs usually represent a trend with depth, which I would like to model with splines - or some similar approach, as they have been found to work better than other methods such as a fitted polynomial. Using the B Spline function from the 'splines' package, it is possible to fit a model of some property with depth based on the bs() function: #natual, B-Splines library(splines) #fit a b-spline model: fm <- lm(y ~ bs(x, df=5) ) I am able to predict a soil property with depth, at unsampled locations with this model with: new_x <- seq(range(x)[1], range(x)[2], len = 200) #predict attribute at unsampled depths: new_y <- predict(fm, data.frame(x=new_x) ) #plot the predicted attribute at the unsampled depths lines(new_x, new_y, col='red') This tends to work fairly well (see attached), but I am wondering if I can use the 'knots' parameter in the bs() function for incorporation of horizon boundary information into the creation of the spline. Moreover, it would be nice if the spline-based model 'fm' would predict a set of values with similar mean and range as the original data points: i.e #summary of clay values from original data: summary(y) Min. 1st Qu. MedianMean 3rd Qu.Max. 0.000 1.000 2.000 2.875 4.500 7.00 #see above summary(new_y) Min. 1st Qu. Median Mean 3rd Qu. Max. -0.05786 2.09500 3.13200 3.62800 5.17100 7.08700 This is based on an article I read : http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6V67-3WWRDYY-3&_user=4421&_handle=V-WA-A-W-AU-MsSAYWW-UUA-U-AACZEDZWBC-AACVCCDUBC-BZAYUEWB-AU-U&_fmt=summary&_coverDate=08%2F31%2F1999&_rdoc=3&_orig=browse&_srch=%23toc%235807%231999%23999089998%23108393!&_cdi=5807&view=c&_acct=C59598&_version=1&_urlVersion=0&_userid=4421&md5=488f1e114d8d64265ff65506e9587e71 where the author talks about a so-called 'equal-area quadratic smoothing spline' approach to describing a soil property depth function. Unfortunately the author did not provide sample code Any thoughts / input would be greatly appreciated! Cheers, -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 spline1.png Description: PNG image __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] how to use large data set ?
> > > > > > > Yohan C. > > > > > > > > -- > > Ce message est confidentiel. Son contenu ne represente en aucun cas un > > engagement de la part du Groupe Soft Computing sous reserve de tout > > accord conclu par ecrit entre vous et le Groupe Soft Computing. Toute > > publication, utilisation ou diffusion, meme partielle, doit etre > > autorisee prealablement. > > Si vous n'etes pas destinataire de ce message, merci d'en avertir > > immediatement l'expediteur. > > This message is confidential. Its content does not constitute a > > commitment by Soft Computing Group except where provided for in a > > written agreement between you and Soft Computing Group. Any unauthorised > > disclosure, use or dissemination, either whole or partial, is > > prohibited. If you are not the intended recipient of this message, > > please notify the sender immediately. > > ------ > > > > > > > > [[alternative HTML version deleted]] > > > > __ > > R-help@stat.math.ethz.ch mailing list > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide > > http://www.R-project.org/posting-guide.html > > and provide commented, minimal, self-contained, reproducible code. > > > > __ > > R-help@stat.math.ethz.ch mailing list > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide > > http://www.R-project.org/posting-guide.html > > and provide commented, minimal, self-contained, reproducible code. > > [[alternative HTML version deleted]] > > __ > R-help@stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html and provide commented, minimal, > self-contained, reproducible code. -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] coloring leaves in a hclust or dendrogram plot [solved]
On Saturday 11 March 2006 07:37, Romain Francois wrote: > Le 11.03.2006 15:56, Gabor Grothendieck a écrit : > > Where does one find A2Rplot that is called in that example? > > > > A2R does not appear to be on CRAN. I did find this: > > > > http://addictedtor.free.fr/Download/A2R.zip > > > > which is a zip file containing some R code but it is not a package, > > which the line library(A2R) seems to need, and it does not include > > A2Rplot in any case. > > (...) > > Hi Gabor and all, > > the package A2R is really highly experimental now and i don't have much > time to improve it to such an extent that it would be possible to commit > it to CRAN. For now on, the source of the package is there : > http://addictedtor.free.fr/packages/A2R/ > For people that can't build it, it is possible to source the files from > here : > http://addictedtor.free.fr/packages/A2R/lastVersion/R/ > > Plus, i am (slowly) preparing an rgg package, so the content of A2R will > probably go there > > Romain Thanks for the update Romain, I have successfully installed the A2R package, and I nearly have a working graph from A2Rplot. sample output here: http://169.237.35.250/~dylan/temp/a2rplot.pdf however, the labels at the bottom of the dendrogram are truncated. Is there any way to prevent this - some parameter that I can pass to A2Rplot() ? Also, is it possible to add more factors to the bottom of the plot? Thanks! Dylan -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] standardization of values before call to pam() or clara()
Greetings, Experimenting with the cluster package, and am starting to scratch my head in regards to the *best* way to standardize my data. Both functions can pre-standardize columns in a dataframe. according to the manual: Measurements are standardized for each variable (column), by subtracting the variable's mean value and dividing by the variable's mean absolute deviation. This works well when input variables are all in the same units. When I include new variables with a different intrinsic range, the ones with the largest relative values tend to be _weighted_ . this is certainly not surprising, but complicates things. Does there exist a robust technique to effectively re-scale each of the variables, regardless of their intrinsic range to some set range, say from {0,1} ? I have tried dividing a variable by the maximum value of that variable, but I am not sure if this is statistically correct. Any ideas, thoughts would be greatly appreciated. Cheers, -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] passing known medoids to clara() in the cluster package
Martin, Just wanted to check on the status of including known medoids into calls to the clara() function within the cluster package. Cheers, Dylan On Monday 10 April 2006 14:25, Dylan Beaudette wrote: > Thanks for the reply. > > On Sunday 09 April 2006 11:46 pm, Martin Maechler wrote: > > >>>>> "DylanB" == Dylan Beaudette <[EMAIL PROTECTED]> > > >>>>> on Sun, 9 Apr 2006 19:28:44 -0700 writes: > > > > DylanB> Greetings, I have had good success using the clara() > > DylanB> function to perform a simple cluster analysis on a > > DylanB> large dataset (1 million+ records with 9 variables). > > > > DylanB> Since the clara function is a wrapper to pam(), > > DylanB> which will accept known medoid data - I am wondering > > DylanB> if this too is possible with clara() ... The > > DylanB> documentation does not suggest that this is > > DylanB> possible. > > > > indeed, it doesn't -- because it's not yet possible. > > I (as maintainer of "cluster") had added the ``known medoid'' > > option to pam() a while ago last June (for cluster version 1.10.0), > > and had left a note my TODO file to do the same for clara(). > > Ah. that would explain things ! :) . I will check back periodically to see > when this feature is completed. > > > Unfortunately it's not true that clara() was a wrapper to pam() > > as you state above. > > I must have misread the manual pages... > > > Given your wish and clear "use case" situation, I'm more > > motivated to approach this particular 'TODO' item! > > > > Martin Maechler, ETH Zurich > > > > DylanB> Essentially I am trying to implement a "supervised > > DylanB> classification" of numerous geographic data > > DylanB> layers. The "unsupervised" approach using clara() > > DylanB> works well, but I feel the output classes would be > > DylanB> more meaningful if I were able to let clara() know > > DylanB> about the classes that I have in mind. > > > > DylanB> Is this at all feasible, or am I trying to > > DylanB> accomplish something that is not possible? > > Thanks Martin! > > I will give pam() a try, and see if it can handle the large dataset that I > am currently using clara() for -- usually only about 5 seconds are required > for clara() to complete. -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] passing known medoids to clara() in the cluster package
Thanks for the reply. On Sunday 09 April 2006 11:46 pm, Martin Maechler wrote: > >>>>> "DylanB" == Dylan Beaudette <[EMAIL PROTECTED]> > >>>>> on Sun, 9 Apr 2006 19:28:44 -0700 writes: > > DylanB> Greetings, I have had good success using the clara() > DylanB> function to perform a simple cluster analysis on a > DylanB> large dataset (1 million+ records with 9 variables). > > DylanB> Since the clara function is a wrapper to pam(), > DylanB> which will accept known medoid data - I am wondering > DylanB> if this too is possible with clara() ... The > DylanB> documentation does not suggest that this is > DylanB> possible. > > indeed, it doesn't -- because it's not yet possible. > I (as maintainer of "cluster") had added the ``known medoid'' > option to pam() a while ago last June (for cluster version 1.10.0), > and had left a note my TODO file to do the same for clara(). Ah. that would explain things ! :) . I will check back periodically to see when this feature is completed. > Unfortunately it's not true that clara() was a wrapper to pam() > as you state above. I must have misread the manual pages... > Given your wish and clear "use case" situation, I'm more > motivated to approach this particular 'TODO' item! > > Martin Maechler, ETH Zurich > > DylanB> Essentially I am trying to implement a "supervised > DylanB> classification" of numerous geographic data > DylanB> layers. The "unsupervised" approach using clara() > DylanB> works well, but I feel the output classes would be > DylanB> more meaningful if I were able to let clara() know > DylanB> about the classes that I have in mind. > > DylanB> Is this at all feasible, or am I trying to > DylanB> accomplish something that is not possible? Thanks Martin! I will give pam() a try, and see if it can handle the large dataset that I am currently using clara() for -- usually only about 5 seconds are required for clara() to complete. -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] passing known medoids to clara() in the cluster package
Greetings, I have had good success using the clara() function to perform a simple cluster analysis on a large dataset (1 million+ records with 9 variables). Since the clara function is a wrapper to pam(), which will accept known medoid data - I am wondering if this too is possible with clara() ... The documentation does not suggest that this is possible. Essentially I am trying to implement a "supervised classification" of numerous geographic data layers. The "unsupervised" approach using clara() works well, but I feel the output classes would be more meaningful if I were able to let clara() know about the classes that I have in mind. Is this at all feasible, or am I trying to accomplish something that is not possible? Cheers, -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] transform shapefiles in UTM to lat/long
On Tuesday 21 March 2006 08:13 am, Fredy O. Mejia wrote: > Dear all: > > I have a shapefile in UTM coordinate system and I would like to transform > it to Lat/Log coordinates (WSG84). The package PBSmapping contains > function convUL to transform between the two coordinate systems when data > is in the form of a data frame with attributes specifying the coordinate > system. However, when shapefiles are imported using function read.shape > (package maptools), a list instead of a data frame is generated. I could > extract all the polygons in the list and transform them, individually, but > I hope there is an easier way of doing that. Besides, the shapefile i'm > trying to convert has two different UTM zones (is from Guatemala, and it > covers zones 15 and 16). Any sugestions would be greatly appreciated. > > Thanks, > > Fredy > > [[alternative HTML version deleted]] > Fredy, I would recommend performing the inverse projection with the GDAL/OGR library first. -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] coloring leaves in a hclust or dendrogram plot [solved]
On Thursday 09 March 2006 06:12 pm, Dylan Beaudette wrote: > Greetings, > > I have perused the r-help mailing list archives for an answer to this > question, without avail. > > I would like to color the "leaves" of a dendrogram plot based on a cutoff > in one of the variables involved in the initial clustering. > > My input data is in the form of: > B K > Alameda 0.2475770 0.7524230 > Alpine0.4546784 0.5453216 > Amador0.6278610 0.3721390 > > essentially rows labeled by county name, with two variables: percent voted > for B and percent voted for K. While it is obvious that this is somewhat of > a contrived example, I intend to use this as a learning device. > > Here is the code used to create and plot the dendrogram: > hc <- hclust(dist(y), "ave") > dend <- as.dendrogram(hc) > plot(dend, main="CA 2004 Election Results by County") > > An example of the output can be found here: > http://casoilresource.lawr.ucdavis.edu/drupal/node/206?size=_original > > > I have experimented with the edgePar and nodePar parameters for the > plot.dendrogram() method, but have not been able to make sense of the > output. > > The basis for setting the colors of the leaves in the dendrogram is a > simple majority calculation: > > reds <- y[y$B > 0.5, ] > blues <- y[y$K > 0.5, ] > > Such that leaves in the tree will be colored based on the membership in > either of the two above groups. > > Is there a resource documenting how this might be accomplished? > > Any thoughts or ideas would be greatly appreciated. > > Cheers, > > Dylan Replying to my own post... Discovered the dendapply() function: reds <<- as.factor(row.names(y[y$B > 0.5, ])) blues <<- as.factor(row.names(y[y$K > 0.5, ])) #define a function for coloring and sizing node elements: colLab <- function(n) { if(is.leaf(n)) { a <- attributes(n) if ( length(which(blues == a$label)) == 1 ) { attr(n, "nodePar") <- c(a$nodePar, list(lab.col = "blue", lab.cex=.7, col="blue", cex=pop[n], pch=16 )) } else { attr(n, "nodePar") <- c(a$nodePar, list(lab.col = "red", lab.cex=.7, col="red", cex=pop[n], pch=16)) } } n } #modfiy dendrogram nodes and re-plot dend_colored <- dendrapply(dend, colLab) ...which did the trick http://casoilresource.lawr.ucdavis.edu/drupal/node/210 -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] coloring leaves in a hclust or dendrogram plot
Greetings, I have perused the r-help mailing list archives for an answer to this question, without avail. I would like to color the "leaves" of a dendrogram plot based on a cutoff in one of the variables involved in the initial clustering. My input data is in the form of: B K Alameda 0.2475770 0.7524230 Alpine0.4546784 0.5453216 Amador0.6278610 0.3721390 essentially rows labeled by county name, with two variables: percent voted for B and percent voted for K. While it is obvious that this is somewhat of a contrived example, I intend to use this as a learning device. Here is the code used to create and plot the dendrogram: hc <- hclust(dist(y), "ave") dend <- as.dendrogram(hc) plot(dend, main="CA 2004 Election Results by County") An example of the output can be found here: http://casoilresource.lawr.ucdavis.edu/drupal/node/206?size=_original I have experimented with the edgePar and nodePar parameters for the plot.dendrogram() method, but have not been able to make sense of the output. The basis for setting the colors of the leaves in the dendrogram is a simple majority calculation: reds <- y[y$B > 0.5, ] blues <- y[y$K > 0.5, ] Such that leaves in the tree will be colored based on the membership in either of the two above groups. Is there a resource documenting how this might be accomplished? Any thoughts or ideas would be greatly appreciated. Cheers, Dylan -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] extracting RGB values from a colorspace class object
Greetings, After pouring over the documentation for the 'colorspace' package, I have not been able to figure out how the plot() method converts colorspace coordinates to RGB values for display on the screen. I am convert between colorspaces with the various as() methods... but cannot seem to find a way to extract RGB (i.e. for displaying on a computer screen) triplets from color space coordinates. Here is a sample of the data that I am working with: H V C x y Y 2.5Y0.2 2 1.43 0.97 0.237 2.5Y0.4 2 0.729 0.588 0.467 2.5Y0.6 2 0.563 0.491 0.699 2.5Y0.6 4 0.995 0.734 0.699 2.5Y0.8 2 0.479 0.439 0.943 2.5Y0.8 4 0.82 0.64 0.943 2.5Y 1 20.43620.4177 1.21 I have converted xyY coordinates to XYZ coordinates, based on the equations presented here: http://www.brucelindbloom.com/index.html?ColorCalcHelp.html #read in the soil colors: munsell + xyY soil <- read.table("soil_colors", header=F, col.names=c("H","V","C","x","y","Y")) #convert to XYZ X <- (soil$x * soil$Y ) / soil$y Y <- soil$Y Z <- ( (1- soil$x - soil$y) * soil$Y ) / soil$y #make an XYZ colorspace object: require(colorspace) soil_XYZ <- XYZ(X,Y,Z) #visualize the new XYZ colorspace object in the L*U*V colorspace: plot(as(soil_XYZ, "LUV"), cex=2) here is an example of the output: http://169.237.35.250/~dylan/temp/soil_colors-LUV_space.png Somehow the plot() method in the colorspace package is converting color space coordinates to their RGB values... Does anyone have an idea as to how to access these RGB triplets? Thanks in advance! -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] understanding patterns in categorical vs. continuous data
Thanks to all for the helpful suggestions, I was able to get good start from there. Cheers, Dylan On Thursday 26 January 2006 12:03 pm, Gabor Grothendieck wrote: > Would this do? > > boxplot(Sepal.Length ~ Species, iris, horizontal = TRUE) > library(Hmisc) > summary(Sepal.Length ~ Species, iris, fun = summary) > > On 1/26/06, Dylan Beaudette <[EMAIL PROTECTED]> wrote: > > Greetings, > > > > I have a set of bivariate data: one variable (vegetation type) which is > > categorical, and one (computed annual insolation) which is continuous. > > Plotting veg_type ~ insolation produces a nice overview of the patterns > > that I can see in the source data. However, due to the large number of > > samples (1,000), and the apparent "spread" in the distribution of a > > single vegetation type over a range of insolation values- I having a hard > > time quantitatively describing the relationship between the two > > variables. > > > > Here is a link to a sample graph: > > http://casoilresource.lawr.ucdavis.edu/drupal/node/162 > > > > Since the data along each vegetation type "line" is not a distribution in > > the traditional sense, I am having problems applying descriptive > > statistical methods. Conceptually, I would like to some how describe the > > variation with insolation, along each vegetation type "line". > > > > Any guidance, or suggested reading material would be greatly appreciated. > > > > > > -- > > Dylan Beaudette > > Soils and Biogeochemistry Graduate Group > > University of California at Davis > > 530.754.7341 > > > > ______ > > R-help@stat.math.ethz.ch mailing list > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide! > > http://www.R-project.org/posting-guide.html -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] understanding patterns in categorical vs. continuous data
Greetings, I have a set of bivariate data: one variable (vegetation type) which is categorical, and one (computed annual insolation) which is continuous. Plotting veg_type ~ insolation produces a nice overview of the patterns that I can see in the source data. However, due to the large number of samples (1,000), and the apparent "spread" in the distribution of a single vegetation type over a range of insolation values- I having a hard time quantitatively describing the relationship between the two variables. Here is a link to a sample graph: http://casoilresource.lawr.ucdavis.edu/drupal/node/162 Since the data along each vegetation type "line" is not a distribution in the traditional sense, I am having problems applying descriptive statistical methods. Conceptually, I would like to some how describe the variation with insolation, along each vegetation type "line". Any guidance, or suggested reading material would be greatly appreciated. -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] overlay additional axes
Greetings, I am trying to add an extra labled axis in position 3 (top x-axis), with numbers that do not match up with the existing axes. Surely this must be possible, and I am just doing it incorectly. So far I have tried the following: #make a plot plot(TIK, type="l", cex=.25, xlim=c(2,32), ylim=c(0,1600)) #try and add a new axis with different numbers in position 3 axis(3,0.154/(2*sin(TIK[,1]/2*pi/180))) ...obviously the nature of the numbers in both axes is quite different. is it possible to have the bottom axis (degrees 2Theta) line up with a corosponding top axis of 0.154/(2*sin(TIK[,1]/2*pi/180)) ? thanks in advance! -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] finding peaks in a simple dataset with R
On Nov 25, 2005, at 10:27 AM, Martin Maechler wrote: > Let me try to summarize my view on this: > > - I still it would make sense to have a *simple* peaks() function > in R which provides the same (or more) functionality as the > corresponding S-plus one.From > For a proper data analysis situation, I think one would have to > do something more sophisticated, based on a model (with a random > component), such as nonparametric regression, time-series, > Hence peaks() should be kept as simple as reasonable. > > - Of course I know that which() or %in% can be used to deal > with logicals containing NAs {As a matter of fact, I've had > my fingers in both implementations for R!}. > Still, the main use of logical vectors in S often is for > situations where NAs only appear because of missing data: > > Indexing ([]), all(), any(), sum() are all very nice and > useful for logical vectors particularly when there are no NAs. > > - I agree that a different more flexible function returning > values from {-1,0,1} would be desirable, "for symmetry reasons". > ===> added a peaksign() function > > Here's code that implements the above {and other concerns > mentioned in this thread}, including some ``consistency > checking'' : > > peaks <- function(series, span = 3, do.pad = TRUE) { > if((span <- as.integer(span)) %% 2 != 1) stop("'span' must be odd") > s1 <- 1:1 + (s <- span %/% 2) > if(span == 1) return(rep.int(TRUE, length(series))) > z <- embed(series, span) > v <- apply(z[,s1] > z[, -s1, drop=FALSE], 1, all) > if(do.pad) { > pad <- rep.int(FALSE, s) > c(pad, v, pad) > } else v > } > > peaksign <- function(series, span = 3, do.pad = TRUE) > { > ## Purpose: return (-1 / 0 / 1) if series[i] is ( trough / > "normal" / peak ) > ## > -- > ## Author: Martin Maechler, Date: 25 Nov 2005 > > if((span <- as.integer(span)) %% 2 != 1 || span == 1) > stop("'span' must be odd and >= 3") > s1 <- 1:1 + (s <- span %/% 2) > z <- embed(series, span) > d <- z[,s1] - z[, -s1, drop=FALSE] > ans <- rep.int(0:0, nrow(d)) > ans[apply(d > 0, 1, all)] <- as.integer(1) > ans[apply(d < 0, 1, all)] <- as.integer(-1) > if(do.pad) { > pad <- rep.int(0:0, s) > c(pad, ans, pad) > } else ans > } > > > check.pks <- function(y, span = 3) > stopifnot(identical(peaks( y, span), peaksign(y, span) == 1), > identical(peaks(-y, span), peaksign(y, span) == -1)) > > for(y in list(1:10, rep(1,10), c(11,2,2,3,4,4,6,6,6))) { > for(sp in c(3,5,7)) > check.pks(y, span = sp) > stopifnot(peaksign(y) == 0) > } > > y <- c(1,4,1,1,6,1,5,1,1) ; (ii <- which(peaks(y))); y[ii] > ##- [1] 2 5 7 > ##- [1] 4 6 5 > check.pks(y) > > set.seed(7) > y <- rpois(100, lambda = 7) > check.pks(y) > py <- peaks(y) > plot(y, type="o", cex = 1/4, main = "y and peaks(y,3)") > points(seq(y)[py], y[py], col = 2, cex = 1.5) > > p7 <- peaks(y,7) > points(seq(y)[p7], y[p7], col = 3, cex = 2) > mtext("peaks(y,7)", col=3) > > set.seed(2) > x <- round(rnorm(500), 2) > y <- cumsum(x) > check.pks(y) > > plot(y, type="o", cex = 1/4) > p15 <- peaks(y,15) > points(seq(y)[p15], y[p15], col = 3, cex = 2) > mtext("peaks(y,15)", col=3) > > __ > R-help@stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide! > http://www.R-project.org/posting-guide.html > > Wow! This was the exact sort of simple peak finding algorithm I was looking for. Would it be ok for me to post this to our dept. webpage so that others may use it? Cheers, -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] finding peaks in a simple dataset with R
On Wednesday 23 November 2005 10:15 am, Tuszynski, Jaroslaw W. wrote: > >> I am looking for some way to locate peaks in a simple x,y data set. > > See my 'msc.peaks.find' function in 'caMassClass', it has a simple peak > finding algorithm. > > Jarek Tuszynski > > __ > R-help@stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide! > http://www.R-project.org/posting-guide.html Jarek, Thanks for the tip. I was able to install the caMassClass package and all of its dependancies. In addition, I was able to run the examples on the manual pages. However, The format of the input data to the 'msc.peaks.find' function is not apparent to me. In its simplest form, my data looks something like this: 2.00 233 2.04 220 ... 11.60 540 12.00 600 <-- a peak! 12.04 450 ... Here is an example R session, trying out the function you suggested: #importing my data like this: X <- read.table("input.dat", header=TRUE) #from the example: Peaks = msc.peaks.find(X) #errors with: Error in sort(x, partial = unique(c(lo, hi))) : 'x' must be atomic Also: I have tried one of the functions ( 'getPeaks' ) listed on the 'msc.peaks.find' manual page, however I am still having a problem with the format of my data vs. what the function is expecting. #importing my data like this: X <- read.table("input.dat", header=TRUE) #setup an output file for peak information peakfile <- paste("peakinfo.csv", sep="/") #run the analysis: getPeaks(X,peakfile) #errors with: Error in area/max(area) : non-numeric argument to binary operator In addition: Warning message: no finite arguments to max; returning -Inf any ideas would be greatly appreciated! -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] finding peaks in a simple dataset with R
On Nov 23, 2005, at 8:10 AM, Martin Maechler wrote: >>>>>> "Marc" == Marc Kirchner <[EMAIL PROTECTED]> >>>>>> on Wed, 23 Nov 2005 14:33:28 + writes: > >>> >>> I wonder if we shouldn't polish that a bit and add to R's >>> standard 'utils' package. >>> > > Marc> Hm, I figured out there are (at least) two versions out > there, one being > Marc> the "original" idea and a modification. > > Marc> === Petr Pikal in 2001 (based on Brian Ripley's idea)== > Marc> peaks <- function(series, span=3) { > Marc> z <- embed(series, span) > Marc> result <- max.col(z) == 1 + span %/% 2 > Marc> result > Marc> } > > Marc> versus > > Marc> === Petr Pikal in 2004 == > Marc> peaks2<-function(series,span=3) { > Marc> z <- embed(series, span) > Marc> s <- span%/%2 > Marc> v<- max.col(z) == 1 + s > Marc> result <- c(rep(FALSE,s),v) > Marc> result <- result[1:(length(result)-s)] > Marc> result > Marc> } > > Thank you, Marc, > > Marc> Comparison shows >>> peaks(c(1,4,1,1,6,1,5,1,1),3) > Marc> [1] TRUE FALSE FALSE TRUE FALSE TRUE FALSE > Marc> which is a logical vector for elements 2:N-1 and > >>> peaks2(c(1,4,1,1,6,1,5,1,1),3) > Marc> [1] FALSE TRUE FALSE FALSE TRUE FALSE TRUE > Marc> which is a logical vector for elements 1:N-2. > > Marc> As I would expect to "lose" (span-1)/2 elements on each side > Marc> of the vector, to me the 2001 version feels more natural. > > I think for the function to be more useful it the result should > have the original vector length and hence I'd propose to also > pad with FALSE at the upper end. > Hi, been lurking for a while, and thought that I would chime in. I have similar need: x,y data with many "peaks" - finding them with R would be a nice feature. However, as mentioned above the ability to find more than one peak would be especially helpful. In addition preserving the length of the input vector would help with linking peak locations to their real index. In my case the data is generated by an X-ray diffraction machine: x-axis in Degrees, y-axis in relative intensity. The data looks like this: http://casoilresource.lawr.ucdavis.edu/drupal/node/71 > Marc> Also, both "suffer" from being non-deterministic in the > Marc> multiple-maxima-case (the two 4s here) > > yes, of course, because of max.col(). > Note that Venables & Ripley would consider this to be rather a feature. > > Marc> This could (should?) be fixed by modifying the call to > max.col() > Marc> result <- max.col(z, "first") == 1 + span %/% 2; > > Actually I think it should become an option, but I'd use "first" > as default. > I would second that. > Marc> Just my two cents, > > Thank you. > > Here is my current proposal which also demonstrates why it's > useful to pad with FALSE : > > peaks <-function(series, span = 3, ties.method = "first") { > if((span <- as.integer(span)) %% 2 != 1) stop("'span' must be odd") > z <- embed(series, span) > s <- span %/% 2 > v <- max.col(z, ties.method = ties.method) == s + 1:1 > pad <- rep(FALSE, s) > c(pad, v, pad) > } > > y <- c(1,4,1,1,6,1,5,1,1) ; (ii <- which(peaks(y))); y[ii] > ##- [1] 2 5 7 > ##- [1] 4 6 5 > > set.seed(7) > y <- rpois(100, lambda = 7) > py <- peaks(y) > plot(y, type="o", cex = 1/4, main = "y and peaks(y,3)") > points(seq(y)[py], y[py], col = 2, cex = 1.5) > > p7 <- peaks(y,7) > points(seq(y)[p7], y[p7], col = 3, cex = 2) > mtext("peaks(y,7)", col=3) Thanks for working on this, as I would imagine there are other lurkers out there who are waiting for a solution to this problem. -- Dylan Beaudette Soils and Biogeochemistry Graduate Group University of California at Davis 530.754.7341 __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html