RE: [R] Rd Files?
Wolski [EMAIL PROTECTED] writes: I have seen the output, and it no longer matters to me whether prompt or package.skeleton works on any platform. I hope that wasn't too big a heresy. If someone asked me what the weak points of R are, the only one that pops up immediately is that the documentation for functions has to be stored in a separate file from the code. I am a big R/S fan, but it is a pity that comments above or below the function declaration are not recognized by the help system. Hence prompt. Tony Rossini answered: Doug Bates commented on the possibility of patching Doxygen to do this, once long ago. Not sure if anyone took it anywhere, though. It's a reasonable system for assisting with documentation in a number of languages, though it could be improved. That would be nice, since then one would get C docs for free. Another alternative is to write a noweb lit-prog file and then generate your package via noweb (NOT Sweave, though you get double duty, since if it's written right, you can stick the original doc in as a vignette). Well, writing quick-and-dirty help for a function with a few lines of comment above or below the function code (a la Matlab) would be nice. I don't think it would be a good idea to provide a complex alternative to the current mechanism for documenting functions, which is both powerful and efficient (but, of course, a little complex). Here is a quick-and-dirty implementation of a mechanism to include quick-and-dirty help messages inside the code of an R function. I guess this is enough.
qhelp <- function(topic) {
    if (is.character(topic)) topic <- get(topic)
    if (!is.function(topic))
        stop("'topic' must be a function, or the name of a function")
    fcode <- sub("    ", "", deparse(topic))  # Because 4 spaces are added by deparse
    # Look for quick help text, that is, strings starting with "#"
    qhlp <- fcode[grep('^"#', fcode)]
    qhlp <- as.character(parse(text = qhlp))
    cat(paste(qhlp, "\n", sep = ""), sep = "")
    return(invisible())
    "# Quick help"
    "# 'qhelp()' provides a mechanism to include \"quick help\""
    "# embedded inside the code of an R function."
    "#"
    "# Just end the function code with return(res)"
    "# and add some strings starting with '#' after it"
    "# with the content of your quick help message..."
}

# An example of a very simple function with quick help
cube <- function(x) {
    # This is some comment that will appear only when I print the function...
    return(x^3)
    "# Quick help"
    "# 'cube(x)' returns the cube of its 'x' argument"
    "# Version 0.1, by Ph. Grosjean ([EMAIL PROTECTED])"
}

qhelp(cube)    # Should return quick help
qhelp("qhelp") # Strings also allowed for the 'topic' argument
qhelp(log)     # No quick help; should print just an empty line

Best, Philippe
Prof. Philippe Grosjean
Numerical Ecology Laboratory, Mons-Hainaut University
8, Av. du Champ de Mars, 7000 Mons, Belgium
phone: 00-32-65.37.34.97; email: [EMAIL PROTECTED]; [EMAIL PROTECTED]
SciViews project coordinator (http://www.sciviews.org)
__ [EMAIL PROTECTED] mailing list https://www.stat.math.ethz.ch/mailman/listinfo/r-help
Re: [R] Rd Files?
Philippe Grosjean [EMAIL PROTECTED] writes: Here is a quick and dirty implementation of a mechanism to include quick and dirty help messages inside the code of an R function. I guess this is enough. That's a nice quick-and-dirty solution. It works in simple cases, but fails the "works in all cases" test. Still, a 90-percent solution is probably enough for the task at hand, especially for software limited to individual deployment. However, note that this is the basic idea behind the Doxygen framework, which does a more robust job of parsing and documenting. best, -tony -- [EMAIL PROTECTED] http://www.analytics.washington.edu/ Biomedical and Health Informatics, University of Washington; Biostatistics, SCHARP/HVTN, Fred Hutchinson Cancer Research Center. UW (Tu/Th/F): 206-616-7630; FHCRC (M/W): 206-667-7025.
[R] Simulating correlated distributions
Hi, how can one simulate correlated distributions in R for Windows? Coomaren P. Vencatasawmy
[R] Re: question regarding variance components
Assuming you are measuring Y and you have factor A fixed and factor B random, I would create a model like:

mod <- lme(Y ~ A, random = ~ 1 | B/A, mydata)
VarCorr(mod)

The term random = ~ 1 | B tells the model that B is a random factor; adding the /A to get random = ~ 1 | B/A tells the model you want the interaction between the fixed and random factors. VarCorr() gives you the variance components of the model. All of this is answered much better (and with examples) in Pinheiro and Bates 2000 (it's in the first chapter) and in Crawley 2002. I posted a question similar to yours some time ago and got an excellent reply from Prof. Bates; search the archives for it. If ALL your factors are random, try something like:

mod <- lme(Y ~ 1, random = ~ 1 | A/B, mydata)
VarCorr(mod)

but here I am guessing more than anything. Get Pinheiro and Bates 2000 for this. Cheers, Federico -- Federico C. F. Calboli. PLEASE NOTE NEW ADDRESS: Dipartimento di Biologia, Via Selmi 3, 40126 Bologna, Italy. tel (+39) 051 209 4187, fax (+39) 051 251 208, f.calboli at ucl.ac.uk
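As a sketch of how this looks end to end, here is a minimal, self-contained example with simulated data (the factor levels, effect sizes, and all numbers below are invented for illustration, not from the original poster's data):

```r
library(nlme)
set.seed(1)
# Simulated balanced design: fixed factor A, random factor B, 5 replicates
dat <- expand.grid(A   = factor(c("a1", "a2")),
                   B   = factor(paste("b", 1:6, sep = "")),
                   rep = 1:5)
b.eff <- rnorm(6, sd = 2)                     # random effect of B
dat$Y <- 10 + (dat$A == "a2") * 1.5 +         # fixed effect of A
         b.eff[as.integer(dat$B)] + rnorm(nrow(dat))

mod <- lme(Y ~ A, random = ~ 1 | B/A, data = dat)
VarCorr(mod)    # variance components: B, A within B, and residual
```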
[R] amap : hclust agglomeration
Hi, I'm trying to understand the complete linkage method in hclust. Can anyone provide a breakdown of the formula (p. 9 of the PDF documentation) or tell me what the "sup" operator does/means? Thanks in advance, Tom
Re: [R] lme: reproducing example
Thanks! I think the minor differences, when taking the values with rnorm, result from the homogeneous distribution without an effect. But the results of aov and lme should be similar for data with effects, too (at least for simple and balanced designs). Karl --- Pascal A. Niklaus [EMAIL PROTECTED] schrieb: Karl Knoblick wrote: Dear R-community! I still have the problem of reproducing the following example using lme.

id    <- factor(rep(rep(1:5, rep(3, 5)), 3))
factA <- factor(rep(c("a1", "a2", "a3"), rep(15, 3)))
factB <- factor(rep(c("B1", "B2", "B3"), 15))
Y <- numeric(length = 45)
Y[ 1: 9] <- c(56,52,48,57,54,46,55,51,51)
Y[10:18] <- c(58,51,50,54,53,46,54,50,49)
Y[19:27] <- c(53,49,48,56,48,52,52,52,50)
Y[28:36] <- c(55,51,46,57,49,50,55,51,47)
Y[37:45] <- c(56,48,51,58,50,48,58,46,52)
df <- data.frame(id, factA, factB, Y)
df.aov <- aov(Y ~ factA*factB + Error(factA:id), data = df)
summary(df.aov)

Is there a way to get the same results with lme as with aov with Error()? HOW??? One idea was the following:

df$factAid <- factor(paste(as.character(df$factA), ":", as.character(df$id), sep = ""))
df.lme <- lme(Y ~ factA*factB, df, random = ~ 1 | factAid, method = "REML")

The degrees of freedom look right, but the F values don't match aov. Hope somebody can help! Thanks!! Karl

Hmmm, strange, it works if I use factB:id as plot... it also works when I use factA:id as plot and replace your Y's by random numbers... is this a problem with convergence? Pascal

df$Y <- rnorm(45)
summary(aov(Y ~ factB*factA + Error(id:factA), data = df))

Error: id:factA
            Df  Sum Sq Mean Sq F value Pr(>F)
factA        2  2.9398  1.4699  0.9014 0.4318
Residuals   12 19.5675  1.6306

Error: Within
            Df  Sum Sq Mean Sq F value   Pr(>F)
factB        2  7.1431  3.5716  7.4964 0.002956 **
factB:factA  4  4.2411  1.0603  2.2254 0.096377 .
Residuals   24 11.4345  0.4764
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

anova(lme(Y ~ factB*factA, data = df, random = ~ 1 | plot))
            numDF denDF  F-value p-value
(Intercept)     1    24 0.014294  0.9058
factB           2    24 7.496097  0.0030
factA           2    12 0.901489  0.4318
factB:factA     4    24 2.225317  0.0964

Pascal

summary(aov(Y ~ factA*factB + Error(factB:id)))

Error: factB:id
          Df Sum Sq Mean Sq F value    Pr(>F)
factB      2 370.71  185.36  51.488 1.293e-06 ***
Residuals 12  43.20    3.60
---
Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Error: Within
            Df Sum Sq Mean Sq F value  Pr(>F)
factA        2  9.911   4.956  1.6248 0.21788
factA:factB  4 45.556  11.389  3.7341 0.01686 *
Residuals   24 73.200   3.050

df$plot <- factor(paste(df$factB, df$id))
anova(lme(Y ~ factB*factA, data = df, random = ~ 1 | plot))
            numDF denDF  F-value p-value
(Intercept)     1    24 33296.02  <.0001
factB           2    12    51.47  <.0001
factA           2    24     1.63  0.2178
factB:factA     4    24     3.73  0.0168
Re: [R] Simulating correlated distributions
On Wed, 3 Dec 2003 10:08:04 +0000 (GMT), you wrote: Hi, how can one simulate correlated distributions in R for Windows? I'm not sure exactly what you're asking, but maybe the MASS function mvrnorm() is what you want. Duncan Murdoch
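A minimal sketch of Duncan's mvrnorm() suggestion, assuming the goal is correlated normal variables (the sample size and target correlation below are arbitrary choices for illustration):

```r
library(MASS)
set.seed(42)
# Target covariance matrix: unit variances, correlation 0.8
Sigma <- matrix(c(1.0, 0.8,
                  0.8, 1.0), nrow = 2)
xy <- mvrnorm(n = 5000, mu = c(0, 0), Sigma = Sigma)
cor(xy)[1, 2]   # sample correlation, close to 0.8
```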
[R] Changing Colors
Hello, I've got a big problem. I'm using R for geostatistical analyses, especially the fields package. I try to generate plots after the kriging process with the help of image.plot(..., col = terrain.colors, ...). Everything works fine, but I want to reverse the colour palettes (heat.colors, topo.colors or gray()) to get the darkest colours at the highest data values instead of the other way round. Could anyone give me hints or some syntax to resolve that problem? Thanks and best regards, Lars Peters - University of Konstanz, Limnological Institute, D-78457 Konstanz, Germany. phone: +49 (0)7531 88-2930, fax: +49 (0)7531 88-3533, e-mail: [EMAIL PROTECTED], http://www.uni-konstanz.de/sfb454/tp_eng/A1/doc/peters/peters.html
Re: [R] Changing Colors
On Wed, 3 Dec 2003, Lars Peters wrote:
> I want to reverse the color-palettes (heat.colors, topo.colors or gray())
> to get darkest colors at highest data-values instead the other way round.
> Could anyone give me hints or some syntax to resolve that problem?
rev()?
-- Roger Bivand, Economic Geography Section, Department of Economics, Norwegian School of Economics and Business Administration, Breiviksveien 40, N-5045 Bergen, Norway. voice: +47 55 95 93 55; fax +47 55 95 93 93; e-mail: [EMAIL PROTECTED]
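To spell out the rev() suggestion: the reversal applies to the vector of colours that a palette function returns, not to the palette function itself. A small sketch (the volcano data set and the wrapper name heatRev are just for illustration):

```r
# Darkest colours now map to the highest values
image(volcano, col = rev(heat.colors(64)))

# If a palette *function* is expected where you pass the colours,
# wrap the reversal once and reuse it:
heatRev <- function(n) rev(heat.colors(n))
image(volcano, col = heatRev(64))
```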
Re: [R] Vector Assignments
Your recommendations have worked great. I have found both cut and ifelse to be useful. I have one more question: when should I use factors over a character vector? I know that they have different uses; however, I am still trying to figure out how I can best take advantage of factors. The following is what I am really trying to do:

colors <- c("red", "blue", "green", "black")
y.col <- colors[cut(y, c(-Inf, 250, 500, 700, Inf), right = FALSE, lab = FALSE)]
plot(x, y, col = y.col)

Would using factors make this any cleaner? I think a character vector is all I need, but I thought I would ask. Thanks for your help, Arend van der Veen

On Tue, 2003-12-02 at 00:32, Gabor Grothendieck wrote: And one other thing: are you sure you want character variables as the result of all this? A column whose entries are each one of four colors seems like a good job for a factor:

colours <- c("red", "blue", "green", "black")
cut(x, c(-Inf, 250, 500, 700, Inf), right = FALSE, lab = colours)

--- Date: Mon, 1 Dec 2003 23:47:39 -0500 (EST). From: Gabor Grothendieck. Subject: Re: [R] Vector Assignments. Just some small refinements/corrections:

colours <- c("red", "blue", "green", "black")
colours[cut(x, c(-Inf, 250, 500, 700, Inf), right = FALSE, lab = FALSE)]

--- Date: Tue, 02 Dec 2003 14:38:55 +1300. From: Hadley Wickham. Subject: Re: [R] Vector Assignments. One way would be to create a vector of colours and then use cut() to index the vector:

colours <- c("red", "blue", "green", "black")
colours[cut(x, c(min(x), 250, 500, 700, max(x)), lab = FALSE)]

Hadley. Arend P. van der Veen wrote: Hi, I have a simple R question. I have a vector x that contains real numbers. I would like to create another vector col that is the same length as x, such that:

if x[i] < 250 then col[i] = "red"
else if x[i] < 500 then col[i] = "blue"
else if x[i] < 750 then col[i] = "green"
else col[i] = "black"
for all i

I am convinced that there is probably a very efficient way to do this in R, but I am not able to figure it out. Any help would be greatly appreciated. Thanks in advance, Arend van der Veen
[R] multidimensional Fisher or Chi square test
Hello, is there a test for independence available based on a multidimensional contingency table? I have about 300 processes, and for each of them I get numbers of failures and successes. I have two or more conditions under which I test these processes. If I had just one process to test, I could just perform a Fisher or chi-square test on a 2x2 contingency table, like this:

for one process:
            conditionA   conditionB
ok                  20            6
failed             190          156

From the table I can figure out whether the outcome (ok/failed) is bound to one of the conditions for a process. However, I'd like to know how different the two conditions are from each other considering all 300 processes, and I consider the processes to be an additional dimension. My H0 is that both conditions are overall (considering all processes) the same. Could you give me a hint what kind of test or package I should look into? Kind regards and thanks for your help, Arne
[R] intraclass correlation
Hi, can R calculate an intraclass correlation coefficient for clustered data when the outcome variable is dichotomous? For now I calculate it by hand, estimating between- and intra-cluster variance by one-way ANOVA; however, I don't feel very comfortable about this, since the distributional assumptions are not really met. Maybe anyone can help me? Best regards and many many thanks, Veronique Verhoeven, University of Antwerp
Re: [R] reason for Factors -- was -- Vector Assignments
On Wed, 3 Dec 2003, Arend P. van der Veen wrote:
> When should I use factors over a character vector? ... The following is
> what I am really trying to do:
> colors <- c("red", "blue", "green", "black")
> y.col <- colors[cut(y, c(-Inf, 250, 500, 700, Inf), right = FALSE, lab = FALSE)]
> plot(x, y, col = y.col)
> Would using factors make this any cleaner?
Arend - When setting the colors of plotted points, you definitely want a vector of character strings as the color names. Factor was invented so that regression and analysis of variance functions would properly recognize a grouping variable and not simply fit a linear coefficient to the integer codes. In the context of a linear (or similar) model, each factor or interaction has to be expanded from a single column of integer codes into a matrix of 0/1 indicator variables, with a separate column for each possible level of the factor. (I oversimplify a bit here: some columns are omitted, to keep the design matrix from being over-specified.) - tom blackwell - u michigan medical school - ann arbor -
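The expansion Tom describes can be seen directly with model.matrix() (a small made-up factor, reusing the colour levels from the thread):

```r
f <- factor(c("red", "blue", "green", "blue"))
as.integer(f)      # the underlying integer codes (alphabetical level order)
model.matrix(~ f)  # expanded into 0/1 indicator columns;
                   # the baseline level is absorbed into the intercept
```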
[R] non-uniqueness in cluster analysis
Hi, I'm clustering objects defined by categorical variables with a hierarchical algorithm (average linkage). My distance matrix (general dissimilarity coefficient) includes several distances with exactly the same values. As I see it, a standard agglomerative procedure ignores this problem, simply selecting, among equal distances, the one that comes first. For this reason the analysis output depends strongly on the ordering of the objects within the raw data matrix. Is there a standard procedure to deal with this? Thanks, Bruno
Re: [R] multidimensional Fisher or Chi square test
On Wed, 2003-12-03 at 14:34, [EMAIL PROTECTED] wrote:
> Is there a test for independence available based on a multidimensional
> contingency table? I have about 300 processes, and for each of them I get
> numbers of failures and successes. I have two or more conditions under
> which I test these processes.
You may want to look at ?mantelhaen.test. Dennis
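A sketch of what mantelhaen.test() expects for this kind of problem: a 2 x 2 x K array of outcome by condition by process, with each process as a stratum. The counts below are randomly generated placeholders, not Arne's data:

```r
set.seed(1)
# Hypothetical counts: outcome (ok/failed) x condition (A/B) x 5 processes
tab <- array(rpois(2 * 2 * 5, lambda = 30), dim = c(2, 2, 5),
             dimnames = list(outcome   = c("ok", "failed"),
                             condition = c("A", "B"),
                             process   = paste("p", 1:5, sep = "")))
# Cochran-Mantel-Haenszel test of conditional independence across strata
mantelhaen.test(tab)
```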
Re: [R] non-uniqueness in cluster analysis
Bruno - Many people add a tiny random number to each of the distances, or deliberately randomize the input order. This means that any clustering is not reproducible unless you go back to the original random numbers, but it forces you not to pay attention to minor differences. Ah, I think you're asking about bootstrap confidence intervals for the set of descendants from each interior vertex. This is certainly a routine procedure when inferring evolutionary trees, but I'm not sure any of that code has been re-implemented in R or S-PLUS. - tom blackwell - u michigan medical school - ann arbor -
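The tie-breaking trick Tom describes can be sketched like this (the binary example data are invented, and the perturbation size 1e-8 is an arbitrary "tiny" choice, far below any real dissimilarity difference):

```r
set.seed(7)
# Binary data produce many exactly tied distances
m <- matrix(sample(0:1, 60, replace = TRUE), nrow = 12)
d <- dist(m)
any(duplicated(as.vector(d)))            # ties are present

# Add a tiny random perturbation: ties are broken, the tree is
# essentially unchanged, but the order dependence disappears
d.jit <- d + runif(length(d), 0, 1e-8)
hc <- hclust(d.jit, method = "average")
```

The price, as noted above, is that the result is only reproducible if you keep the random seed.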
Re: [R] non-uniqueness in cluster analysis
On Wed, 3 Dec 2003, Bruno Giordano wrote:
> Is there a standard procedure to deal with this?
Don't use average linkage! -- Brian D. Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK. Tel: +44 1865 272861 (self), +44 1865 272866 (PA); Fax: +44 1865 272595
Re: [R] non-uniqueness in cluster analysis
Hi, Brian Ripley already replied "don't use average linkage"... You may think about k-medoids (pam) in package cluster instead. However, often average linkage is not such a bad choice, and if you really want to use it for your data, you may try the following. Among the hierarchical methods, single linkage has the smallest problem with equal distances, because possible agglomerations based on equal distances between clusters are all carried out regardless of the order. If at some step the smallest between-cluster distance is d(a,b) = d(a,c) <= d(b,c), it may happen that a and b are merged first, or a and c are merged first, but before merging anything else with distance larger than d(a,b), a, b *and* c are merged. Thus, you have order dependence only between the steps where you merge clusters with the same distance, but not afterwards. If your problem occurs only at a low level of agglomeration (and you don't have situations where d(a,b) and d(a,c) are small and d(b,c) is very large; I do not know if the triangle inequality holds for your data), you may do some first steps with single linkage and then continue with average linkage (I haven't thought about whether this can be done in R without extra effort). But if you have already observed that the average linkage outcome depends critically (from the viewpoint of interpretation) on the order of points, then it seems that you are in an unstable situation, whether or not you are able to define a unique clustering. Christian *** Christian Hennig, Fachbereich Mathematik-SPST/ZMS, Universitaet Hamburg. [EMAIL PROTECTED], http://www.math.uni-hamburg.de/home/hennig/ ### I recommend www.boag-online.de
Re: [R] non-uniqueness in cluster analysis
What I did was, in the presence of equal-valued distances, to randomize the selection among them and compute the distortion of the solution using the cophenetic correlation. I computed 1 random trees for each of three methods: average, single and complete linkage. Among the randomly selected solutions, for the three methods, average linkage was able to give the highest cophenetic correlation, followed by complete and then by single linkage. Among the random trees, single linkage, for obvious reasons, gave a constant cophenetic correlation. My data set is rather small (25 objects). I'm seriously thinking of computing all the possible solutions (I guess about 3), picking the ones that give the highest cophenetic correlation, and analyzing the consistency among those solutions, after establishing a natural number of clusters. Bruno
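For reference, the cophenetic correlation Bruno uses as a distortion measure can be computed directly in R (the USArrests subset below is just a stand-in for his 25-object data set):

```r
# Cophenetic correlation: agreement between the original distances and
# the distances implied by the dendrogram; closer to 1 = less distortion
d  <- dist(USArrests[1:25, ])
hc <- hclust(d, method = "average")
cor(d, cophenetic(hc))
```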
Re: [R] setMethod(min, myclass, ...)
Thomas Stabla wrote: Hello, I have defined a new class

setClass("myclass", representation(min = "numeric", max = "numeric"))

and want to write accessor functions, so that for

foo <- new("myclass", min = 0, max = 1)
min(foo)  # prints 0
max(foo)  # prints 1

At first I created a generic function for min

setGeneric("min", function(..., na.rm = FALSE) standardGeneric("min"))

and then tried to use setMethod. And there's my problem: I don't know the name of the first argument which is to be passed to min. I can't just write:

setMethod("min", "myclass", function(..., na.rm = FALSE) [EMAIL PROTECTED])

The same problem occurs with max. Generally, it's not a good idea to take a well-known function name and make it into an accessor function. In a functional language, basic function calls such as min(x), sin(x), etc. should have a natural interpretation. Defining methods for these functions is meant to do what it says: provide a method to achieve the general purpose of the function for particular objects. If you want accessor functions, they should probably have names that make their purpose obvious. One convention, a la Java properties, would be getMin(x), etc. (the capitalizing is potentially an issue, since slot names are case sensitive). If you do really want a method for the min() function for myclass, that's a different problem. The argument ... is different from all other argument names, and it can't be used in a signature. To define methods for functions such as min(), the formal arguments of the basic function would have to be changed to, say, function(x, ..., na.rm). Then methods can be defined for argument x. The R versions of these functions don't currently allow methods. If methods are needed, a package could currently define its own version of the (non-generic) functions to include argument x. More than this is needed to handle multiple arguments generally. The problem is that defining a method for argument x does not cause that method to be called if the object appears as a later argument. If you had a method for min() for myclass and myX was an object from that class, min(myX, 1) would work, but min(1, myX) would not. To get all examples right would require changes to the basic code for these functions. There is a brief discussion of one approach in Programming with Data, pages 343 and 351. Thanks for your help. Greetings, Thomas Stabla -- John M. Chambers [EMAIL PROTECTED] Bell Labs, Lucent Technologies, 700 Mountain Avenue, Room 2C-282, Murray Hill, NJ 07974. office: (908) 582-2681, fax: (908) 582-3340, web: http://www.cs.bell-labs.com/~jmc
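A sketch of the getMin-style accessor convention John suggests, adapted from Thomas's class (getMin/getMax are illustrative names, not part of any existing API):

```r
library(methods)
setClass("myclass", representation(min = "numeric", max = "numeric"))

# Dedicated accessor generics: no clash with base min()/max()
setGeneric("getMin", function(x) standardGeneric("getMin"))
setGeneric("getMax", function(x) standardGeneric("getMax"))
setMethod("getMin", "myclass", function(x) x@min)
setMethod("getMax", "myclass", function(x) x@max)

foo <- new("myclass", min = 0, max = 1)
getMin(foo)  # 0
getMax(foo)  # 1
```

Because the accessor has its own generic with a named first argument, none of the `...`-signature problems discussed above arise.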
Re: [R] intraclass correlation
I have been using a little function I wrote myself; look at http://www.unc.edu/home/aperrin/tips/src/icc.R for the code. Not pretty, but it works. ap -- Andrew J Perrin - http://www.unc.edu/~aperrin Assistant Professor of Sociology, U of North Carolina, Chapel Hill. [EMAIL PROTECTED] * andrew_perrin (at) unc.edu On Wed, 3 Dec 2003, Veronique Verhoeven wrote:
> Can R calculate an intraclass correlation coefficient for clustered data,
> when the outcome variable is dichotomous?
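For completeness, the one-way-ANOVA ICC that Veronique describes computing by hand can be sketched as below. The function and data are illustrative (not Andrew's icc.R), it assumes balanced clusters, and, as she notes, for a dichotomous outcome this is only an approximation:

```r
# ICC(1) from a one-way ANOVA: between-cluster variance over total variance
icc.anova <- function(y, cluster) {
    cluster <- factor(cluster)
    fit <- aov(y ~ cluster)
    ms  <- summary(fit)[[1]][["Mean Sq"]]   # [1] between, [2] within
    n   <- length(y) / nlevels(cluster)     # assumes balanced clusters
    s2b <- (ms[1] - ms[2]) / n              # between-cluster variance component
    s2b / (s2b + ms[2])
}

set.seed(3)
g <- rep(1:20, each = 10)                       # 20 clusters of 10
y <- rbinom(200, 1, plogis(rnorm(20)[g]))       # clustered binary outcome
icc.anova(y, g)
```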
RE: [R] Error in randomForest.default(m, y, ...) : negative length vectors are not allowed
Christian -- You don't provide enough information (like a call) to answer this. I suspect, though, that you may be subsetting in a way that passes randomForest no data. I'm not aware offhand of an easy way to get this error from randomForest. I tried creating some data superficially similar to yours to see whether something would break if there were only a single value in the variable to be explained, but everything worked fine (though it does give a reasonable warning).

test.dat <- data.frame(a = rep(0, 1000),
                       b = runif(1000),
                       c = sample(0:1, 1000, replace = TRUE, prob = c(.8, .2)))
t8 <- randomForest(a ~ b + c, data = test.dat)
Warning message:
The response has five or fewer unique values. Are you sure you want to do regression? in: randomForest.default(m, y, ...)
test.dat[sample(1:1000, 100), "a"] <- runif(100, 1, 200)
t8 <- randomForest(a ~ b + c, data = test.dat)

Some other generated data might come up with the error, but I'd bet on the subsetting problem. Hope this helps, -Matt. Matthew Wiener, RY84-202, Applied Computer Science Mathematics Dept., Merck Research Labs, 126 E. Lincoln Ave., Rahway, NJ 07065. 732-594-5303

-----Original Message----- From: Christian Schulz. Sent: Wednesday, December 03, 2003 9:42 AM. Subject: [R] Error in randomForest.default(m, y, ...) : negative length vectors are not allowed. Hi, what am I doing wrong? I'm using a data.frame with ~90,000 instances and 7 attributes; 5 are binary recoded, 1 independent variable is a real one, and the target is a real one, too. The distributions are not very skewed in the dummy variables, but the real variables have ~60,000 zero-value instances; zero means no money is paid and is an important value! Many thanks for help and suggestions, regards, Christian
RE: [R] HMisc describe -- error with dates
Thank you Frank and Gabor for the fixes and checking and rechecking! Everything seems to work well with the Hmisc functions tried: upData, describe and summary. To summarize:

1. Add the testDateTime and formatDateTime functions (copied from Frank's messages) to the Hmisc file (or run prior to loading Hmisc):

testDateTime <- function(x, what = c('either', 'both', 'timeVaries'))
{
  what <- match.arg(what)
  cl <- class(x)  # was oldClass 22jun03
  if(!length(cl)) return(FALSE)
  dc <- if(.R.) c('POSIXt', 'POSIXct', 'dates', 'times', 'chron')
        else c('timeDate', 'date', 'dates', 'times', 'chron')
  dtc <- if(.R.) c('POSIXt', 'POSIXct', 'chron')
         else c('timeDate', 'chron')
  switch(what,
         either = any(cl %in% dc),
         both = any(cl %in% dtc),
         timeVaries = {
           if('chron' %in% cl || !.R.) {
             ## chron or S+ timeDate
             y <- as.numeric(x)
             length(unique(round(y - floor(y), 13))) > 1
           } else if(.R.) length(unique(format(x, '%H%M%S'))) > 1
           else FALSE
         })
}

formatDateTime <- function(x, at, roundDay = FALSE)
{
  cl <- at$class
  w <- if(any(cl %in% c('chron', 'dates', 'times'))) {
    attributes(x) <- at
    fmt <- at$format
    if(roundDay) {
      if(length(fmt) == 2 && is.character(fmt)) format.dates(x, fmt[1])
      else format.dates(x)
    } else x
  } else if(.R.) {
    attributes(x) <- at
    if(roundDay) as.POSIXct(round(x, 'days')) else x
  } else timeDate(julian = if(roundDay) round(x) else x)
  format(w)
}

2. Replace the describe function with the new one (available as an attachment in Frank's most recent message on the subject). Instead of editing the original Hmisc file, this could be run after the Hmisc library is loaded. Right? Tanya
Re: [R] Error in randomForest.default(m, y, ...) : negative length vectors are not allowed
Hmmm, thanks for your suggestions i'm in the same opinion with any subsetting problem, but curious is that my model i.e. with library(gbm) or simple lm works, because my task is to find out the weights/importance values for the attributes and i would like compare the results between the randomForest classifier and a linear approach. I check it with your suggestions and code snippets in detail and feedback you the problem, if i found the solution. regards,Christian -Ursprüngliche Nachricht- Von: Wiener, Matthew [mailto:[EMAIL PROTECTED] Gesendet: Mittwoch, 3. Dezember 2003 17:26 An: 'Christian Schulz'; [EMAIL PROTECTED] Betreff: RE: [R] Error in randomForest.default(m, y, ...) : negative lengt h vectors are not allowed Christian -- You don't provide enough information (like a call) to answer this. I suspect, though, that you may be subsetting in a way that passes randomForest no data. I'm not aware offhand of an easy way to get this error from randomForest. I tried creating some data superficially similar to yours to see whether something would break if there were only a single value in the variable to be explained, but everything worked fine (though it does give a reasonable warning). test.dat - data.frame(a = rep(0, 1000), b = runif(1000), c = sample(0:1, 1000, replace = TRUE, p = c(.8, .2)) t8 - randomForest(a ~ b + c, data = test.dat) Warning message: The response has five or fewer unique values. Are you sure you want to do regression? in: randomForest.default(m, y, ...) test.dat[sample(1:1000, 100),a] - runif(100, 1, 200) t8 - randomForest(a ~ b + c, data = test.dat) Some other generated data might come up with the error, but I'd bet on the subsetting problem. Hope this helps, -Matt Matthew Wiener RY84-202 Applied Computer Science Mathematics Dept. Merck Research Labs 126 E. Lincoln Ave. 
Rahway, NJ 07065 732-594-5303

-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Christian Schulz Sent: Wednesday, December 03, 2003 9:42 AM To: [EMAIL PROTECTED] Subject: [R] Error in randomForest.default(m, y, ...) : negative length vectors are not allowed

Hi, what am I doing wrong? I'm using a data.frame with ~90,000 instances and 7 attributes: 5 are binary recoded, 1 independent variable is real-valued, and the target is real-valued, too. The distributions are not very skewed in the dummy variables, but the real variable has ~60,000 zero-valued instances; zero means no money is paid, and that is an important value! Many thanks for help and suggestions, regards, Christian
[R] checking for identical columns in a mxn matrix
Hi, I have a rectangular matrix and I need to check whether any columns are identical. Currently I'm looping over the columns and checking each column against all the others with identical(). However, as experience has shown me, getting rid of loops is a good idea :) Would anybody have any suggestions as to how I could do this job more efficiently? (It would be nice to know which columns are identical, but that's not a necessity.) --- Rajarshi Guha [EMAIL PROTECTED] http://jijo.cjb.net GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE --- Entropy isn't what it used to be.
Re: [R] checking for identical columns in a mxn matrix
On Wed, 2003-12-03 at 12:06, Rajarshi Guha wrote: Hi, I have a rectangular matrix and I need to check whether any columns are identical. Currently I'm looping over the columns and checking each column against all the others with identical(). However, as experience has shown me, getting rid of loops is a good idea :) Would anybody have any suggestions as to how I could do this job more efficiently? (It would be nice to know which columns are identical, but that's not a necessity.)

If your matrix is 'x' and contains text and/or integer values (since float comparisons can be problematic) you can use:

any(duplicated(x, MARGIN = 2))

to find out if any of the columns are duplicated, and

which(duplicated(x, MARGIN = 2))

to get the column numbers that are duplicates in the matrix. If you want to extract the unique columns, you can use:

unique(x, MARGIN = 2)

See ?duplicated and ?unique for more information. Example:

x <- matrix(c(1:3, 4:6, 1:3, 7:9), ncol = 4)
x
     [,1] [,2] [,3] [,4]
[1,]    1    4    1    7
[2,]    2    5    2    8
[3,]    3    6    3    9
any(duplicated(x, MARGIN = 2))
[1] TRUE
which(duplicated(x, MARGIN = 2))
[1] 3
unique(x, MARGIN = 2)
     [,1] [,2] [,3]
[1,]    1    4    7
[2,]    2    5    8
[3,]    3    6    9

HTH, Marc Schwartz
Re: [R] nameless functions in R
On Wed, 3 Dec 2003, Rajarshi Guha wrote:

Hi, I have an apply statement that looks like:

check.cols <- function(v1, v2) {
+   return( identical(v1, v2) );
+ }
x
     [,1] [,2] [,3]
[1,]    1    3    3
[2,]    4    5    4
[3,]    2    7    6
apply(x, c(2), check.cols, v2=c(7,8,9))
[1] FALSE FALSE FALSE

Is it possible to make the function check.cols() inline to the apply statement? Something like this:

apply(x, c(2), funtion(v1,v2){ identical(v1,v2) }, v2=c(1,4,2))

The above gives me a syntax error. I also tried:

apply(x, c(2), fun <- funtion(v1,v2){ return(identical(v1,v2)) }, v2=c(1,4,2))

and I still get a syntax error. Is this type of syntax allowed in R?

Yes, anonymous functions are allowed. Anonymous funtions aren't -- you appear to have a typographical problem.

-thomas
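For the record, once the spelling is fixed, the inline form works exactly as intended (a minimal sketch reproducing the example above):

```r
## Anonymous (inline) function in apply; extra named arguments such as
## v2 are passed through apply's ... to the function.
x <- matrix(c(1, 4, 2, 3, 5, 7, 3, 4, 6), ncol = 3)
apply(x, 2, function(v1, v2) identical(v1, v2), v2 = c(1, 4, 2))
# [1]  TRUE FALSE FALSE
```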
[R] model of fish over exploitation
It looks like you are trying to fit a Schaefer model (a special case of the Pella-Tomlinson general production model) to the data. Such models can be solved in a completely general way using AD Model Builder, and an example of the general production model application can be found at http://otter-rsch.com/examples.htm#docs Bon courage, John

John Sibert, Manager Pelagic Fisheries Research Program University of Hawaii at Manoa 1000 Pope Road, MSB 313 Honolulu, HI 96822 United States Phone: (808) 956-4109 Fax: (808) 956-4104 Washington DC Phone: (202) 861 2363 Fax: (202) 861 4767 PFRP Web Site: http://www.soest.hawaii.edu/PFRP/ email: [EMAIL PROTECTED]
RE: [R] checking for identical columns in a mxn matrix
From: Rajarshi Guha

On Wed, 2003-12-03 at 13:18, J.R. Lockwood wrote: list will come up with something clever. The other issue is that you need to be careful when doing equality comparisons with floating point numbers. Unless your matrix consists of characters or integers, you'll need to think about some level of numerical tolerance in your comparison.

Yes, the matrix will always be integer.

Other than what J.R. and Marc suggested, you could try

dist(t(x), method = "manhattan")

and see which entries are 0 (or close enough to 0). HTH, Andy

--- Rajarshi Guha [EMAIL PROTECTED] http://jijo.cjb.net GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE --- All theoretical chemistry is really physics; and all theoretical chemists know it. -- Richard P. Feynman
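Andy's dist() idea can be spelled out as follows (a minimal sketch using the same small example matrix as Marc's duplicated() reply, with two identical columns):

```r
## Columns at zero Manhattan distance from one another are identical
## (exact for integer matrices; use a small tolerance for floating point).
x <- matrix(c(1:3, 4:6, 1:3, 7:9), ncol = 4)
d <- as.matrix(dist(t(x), method = "manhattan"))
which(d == 0 & upper.tri(d), arr.ind = TRUE)  # row/col index the matching pair of columns
```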
Re: [R] nameless functions in R
Rajarshi Guha [EMAIL PROTECTED] writes: apply(x, c(2), funtion(v1,v2){ identical(v1,v2) }, v2=c(1,4,2)) The above gives me a syntax error. I also tried: No wonder! Try with `function' instead of `funtion'. -- Bjørn-Helge Mevik
[R] volume of an irregular grid
I have a 3D irregular grid of a (closed) surface. I would like to calculate the volume enclosed inside this surface. Can this be done in R? Any help is very much appreciated. Best regards, Karim
Re: [R] add a point to regression line and cook's distance
What is the context? What do the outliers represent? If you think carefully about the context, you may find the answer. hope this helps. spencer graves

p.s. I know statisticians who worked for HP before the split and who still work for either HP or Agilent, I'm not certain which. If you want to contact me off-line, I can give you a couple of names if that might help.

[EMAIL PROTECTED] wrote: Hi, This is more a statistics question than an R question, but I thought people on this list may have some pointers. My question is the following: I would like to have a robust regression line. The data I have are mostly clustered around a small range, so the regression line tends to be influenced strongly by outlier points (with large Cook's distance). From the application's background, I know that the line should pass through (0,0), which is far away from the data cloud. I would like to add this point to have a more robust line. The question is: does it make sense to do this? What are the negative impacts, if any? thanks, jonathan
Re: [R] predict.gl1ce question
Richard Bonneau [EMAIL PROTECTED] writes:

Hi, I'm using gl1ce with family=binomial like so:

yy
      succ fail
 [1,]   76   23
 [2,]   32   67
 [3,]   56   43
 ...
[24,]   81   18

xx
         c1219      c643
X1  0.04545455 0.64274145
X2  0.17723669 0.90392792
...
X24 0.80629054 0.12239320

test.gl1ce <- gl1ce(yy ~ xx, family = binomial(link=logit), bound = 0.5)

or

omit <- c(2,3)
test.gl1ce <- gl1ce(yy[-omit,] ~ xx[-omit,], family = binomial(link=logit), bound = 1)

This seems to work fine, and as I change the shrinkage parameter everything behaves as expected. If I try to get the fitted values (y-hat) using predict, I have no problems:

predict.gl1ce(test.gl1ce)
 [1] 0.38129977 0.16513661 0.47666779 0.45348757 0.09916513 0.18167674
 [7] 0.11047684 0.15786664 0.14765670 0.40657031 0.19072570 0.80259477
[13] 0.36317090 0.35930557 0.23700520 0.17579282 0.18835043 0.52306049
[19] 0.28388953 0.41262864 0.29933710 0.43556139 0.15276727 0.73017401

*** I have problems, however, when I try to use predict.gl1ce() with newdata. So, the following tries all give errors:

predict.gl1ce(test.gl1ce, xx, family=binomial(link-logit))
predict.gl1ce(test.gl1ce, xx)
predict.gl1ce(test.gl1ce, xx[omit,], family=binomial(link-logit))
Error in predict.l1ce(test.gl1ce, xx, family = binomial(link - logit)) : Argument `newdata' is not a data frame, and cannot be coerced to an appropriate model matrix

The following weak try also bombs:

predict.gl1ce(test.gl1ce, data.frame(xx), family=binomial(link-logit))
Error in eval(expr, envir, enclos) : attempt to apply non-function

I've tried quite a few variations. It seems I'm missing something, but if glm or gl1ce take a certain data format, then the corresponding predict methods should too (what am I missing?).

1. Try link=logit, as opposed to what you typed.
2. xx is probably not a data frame (or yy ~ xx would be unlikely to work), so as.data.frame might do the trick.
3. What is gl1ce? If you're having trouble with an add-on package, you might have the courtesy to tell us which one.
Not everyone uses lasso2 on a daily basis, you know. -- O__ Peter Dalgaard Blegdamsvej 3 c/ /'_ --- Dept. of Biostatistics 2200 Cph. N (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 ~~ - ([EMAIL PROTECTED]) FAX: (+45) 35327907
RE: [R] add a point to regression line and cook's distance
If you know that the line should pass through (0,0), would it make sense to do a regression without an intercept? You can do that by putting -1 in the formula, like: lm(y ~ x - 1). Hope this helps, Matt

Matthew Wiener RY84-202 Applied Computer Science & Mathematics Dept. Merck Research Labs 126 E. Lincoln Ave. Rahway, NJ 07065 732-594-5303

-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Spencer Graves Sent: Wednesday, December 03, 2003 5:51 PM To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Subject: Re: [R] add a point to regression line and cook's distance

What is the context? What do the outliers represent? If you think carefully about the context, you may find the answer. hope this helps. spencer graves

p.s. I know statisticians who worked for HP before the split and who still work for either HP or Agilent, I'm not certain which. If you want to contact me off-line, I can give you a couple of names if that might help.

[EMAIL PROTECTED] wrote: Hi, This is more a statistics question than an R question, but I thought people on this list may have some pointers. My question is the following: I would like to have a robust regression line. The data I have are mostly clustered around a small range, so the regression line tends to be influenced strongly by outlier points (with large Cook's distance). From the application's background, I know that the line should pass through (0,0), which is far away from the data cloud. I would like to add this point to have a more robust line. The question is: does it make sense to do this? What are the negative impacts, if any? thanks, jonathan
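Matt's suggestion is one line in R. A minimal sketch with made-up data:

```r
## `- 1` (equivalently `+ 0`) in the formula removes the intercept,
## forcing the fitted line through the origin.
set.seed(1)
x <- runif(50, 10, 12)            # data clustered in a small range
y <- 2 * x + rnorm(50, sd = 0.5)
coef(lm(y ~ x))      # intercept estimated freely
coef(lm(y ~ x - 1))  # slope only; the line passes through (0, 0)
```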
Re: [R] add a point to regression line and cook's distance
Not a good idea, unless the regression function is *known* to be linear. More likely it is only approximately linear over small ranges. Murray Jorgensen

Wiener, Matthew wrote: If you know that the line should pass through (0,0), would it make sense to do a regression without an intercept? You can do that by putting -1 in the formula, like: lm(y ~ x - 1). Hope this helps, Matt

Matthew Wiener RY84-202 Applied Computer Science & Mathematics Dept. Merck Research Labs 126 E. Lincoln Ave. Rahway, NJ 07065 732-594-5303

-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Spencer Graves Sent: Wednesday, December 03, 2003 5:51 PM To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Subject: Re: [R] add a point to regression line and cook's distance

What is the context? What do the outliers represent? If you think carefully about the context, you may find the answer. hope this helps. spencer graves

p.s. I know statisticians who worked for HP before the split and who still work for either HP or Agilent, I'm not certain which. If you want to contact me off-line, I can give you a couple of names if that might help.

[EMAIL PROTECTED] wrote: Hi, This is more a statistics question than an R question, but I thought people on this list may have some pointers. My question is the following: I would like to have a robust regression line. The data I have are mostly clustered around a small range, so the regression line tends to be influenced strongly by outlier points (with large Cook's distance). From the application's background, I know that the line should pass through (0,0), which is far away from the data cloud. I would like to add this point to have a more robust line. The question is: does it make sense to do this? What are the negative impacts, if any?
thanks, jonathan

-- Dr Murray Jorgensen http://www.stats.waikato.ac.nz/Staff/maj.html Department of Statistics, University of Waikato, Hamilton, New Zealand Email: [EMAIL PROTECTED] Fax 7 838 4155 Phone +64 7 838 4773 wk +64 7 849 6486 home Mobile 021 1395 862
RE: [R] add a point to regression line and cook's distance
It is likely that the true relationship is nonlinear. There isn't a priori knowledge about linearity. In the small range where we do have enough data, the relationship looks linear. Outside that range, the data are very scarce and quite noisy as well. This is why adding (0,0) to the data can potentially improve the fit a great deal. But at the same time, I have never heard of people doing it this way. Jonathan

-Original Message- From: Murray Jorgensen [mailto:[EMAIL PROTECTED]] Sent: Wednesday, December 03, 2003 5:18 PM To: Wiener, Matthew Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED] Subject: Re: [R] add a point to regression line and cook's distance

Not a good idea, unless the regression function is *known* to be linear. More likely it is only approximately linear over small ranges. Murray Jorgensen

Wiener, Matthew wrote: If you know that the line should pass through (0,0), would it make sense to do a regression without an intercept? You can do that by putting -1 in the formula, like: lm(y ~ x - 1). Hope this helps, Matt

Matthew Wiener RY84-202 Applied Computer Science & Mathematics Dept. Merck Research Labs 126 E. Lincoln Ave. Rahway, NJ 07065 732-594-5303

-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Spencer Graves Sent: Wednesday, December 03, 2003 5:51 PM To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Subject: Re: [R] add a point to regression line and cook's distance

What is the context? What do the outliers represent? If you think carefully about the context, you may find the answer. hope this helps. spencer graves

p.s. I know statisticians who worked for HP before the split and who still work for either HP or Agilent, I'm not certain which. If you want to contact me off-line, I can give you a couple of names if that might help.

[EMAIL PROTECTED] wrote: Hi, This is more a statistics question than an R question, but I thought people on this list may have some pointers.
My question is the following: I would like to have a robust regression line. The data I have are mostly clustered around a small range, so the regression line tends to be influenced strongly by outlier points (with large Cook's distance). From the application's background, I know that the line should pass through (0,0), which is far away from the data cloud. I would like to add this point to have a more robust line. The question is: does it make sense to do this? What are the negative impacts, if any? thanks, jonathan

-- Dr Murray Jorgensen http://www.stats.waikato.ac.nz/Staff/maj.html Department of Statistics, University of Waikato, Hamilton, New Zealand Email: [EMAIL PROTECTED] Fax 7 838 4155 Phone +64 7 838 4773 wk +64 7 849 6486 home Mobile 021 1395 862
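Jonathan's idea of appending (0,0) as a pseudo-observation can be compared directly against the no-intercept fit (a minimal sketch with made-up data; note that a point so far from the data cloud has very high leverage, so even a single copy pulls the line strongly):

```r
set.seed(2)
x <- runif(50, 10, 12)
y <- 2 * x + rnorm(50, sd = 0.5)
## Option 1: force the line through the origin structurally
coef(lm(y ~ x - 1))
## Option 2: append (0, 0) as an ordinary, high-leverage data point
coef(lm(c(y, 0) ~ c(x, 0)))
```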
[R] RE: R performance questions
Hi-- While I agree that we cannot agree on the ideal algorithms, we should be taking practical steps to implement microarrays in the clinic. I think we can all agree that our algorithms have some degree of efficacy over and above conventional diagnostic techniques. If patients are dying from lack of diagnostic accuracy, I think we have to work hard to use this technology to help them, if we can. I think we can, even now. What if I offer, in my clinic, a service for cancer patients to compare their affy data to an existing set of data, to predict their prognosis or response to chemotherapy? I think people would line up out the door for such a service. Knowing what we as a group of array analyzers know, wouldn't we all want this kind of service available if we or a loved one got cancer? Can our programs deal with 1,000 .cel files? 10,000 files? I think our programs are pretty good, but what we need is DATA. We must be careful what we wish for--we might get it! So how do we measure whether analyzing 10,000 .cel files with library(affy) is feasible? I'm assuming that advanced hardware would be required for such a task. What are the critical components of such a platform? How much money would a feasible system for array analysis cost? I was just looking ahead two or three years--where is all this genomic array research headed? I guess I'm concerned about scalability. Is anyone really working on implementing affy on a cluster/Beowulf? That sounds like a real challenge. Regards, Michael Benjamin, MD

-Original Message- From: Liaw, Andy [mailto:[EMAIL PROTECTED]] Sent: Wednesday, December 03, 2003 9:47 PM To: 'Michael Benjamin' Subject: RE: [BioC] R performance questions

Another point about benchmarking: as has been discussed on R-help before, benchmarks can be misleading, like the one you mentioned. It measures linear algebra tasks, etc., which typically account for only a small portion of everyday work.
Doug Bates also pointed out that the eigen() example used in that benchmark is computing mostly meaningless results. In our experience, learning to use R more efficiently gives us the most mileage, but large and fast hardware wouldn't hurt... Cheers, Andy

-Original Message- From: Michael Benjamin [mailto:[EMAIL PROTECTED]] Sent: Wednesday, December 03, 2003 7:32 PM To: 'Liaw, Andy' Subject: RE: [BioC] R performance questions

Thanks. Mike

-Original Message- From: Liaw, Andy [mailto:[EMAIL PROTECTED]] Sent: Wednesday, December 03, 2003 8:17 AM To: 'Michael Benjamin' Subject: RE: [BioC] R performance questions

Hi Michael, Just one comment about SVM. If you use the svm() function in the e1071 package to train a linear SVM, it will be rather slow. That's a known limitation of libsvm, which the svm() function uses. If you are willing to go outside of R, the bsvm package by C.J. Lin (the same person who wrote libsvm) will train a linear SVM in a much more efficient manner. HTH, Andy

-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Michael Benjamin Sent: Tuesday, December 02, 2003 10:30 PM To: [EMAIL PROTECTED] Subject: [BioC] R performance questions

Hi, all-- I wanted to start a thread on R speed/benchmarking. There is a nice R benchmarking overview at http://www.sciviews.org/other/benchmark.htm, along with a free script so you can see how your machine stacks up. Looks like R is substantially faster than S-plus. My problem is this: with 512Mb and an overclocked AMD Athlon XP 1800+, running at 588 SPEC-FP 2000, it still takes FOREVER to analyze multiple .cel files using affy (expresso). Running svm takes a mighty long time with more than 500 genes, 150 samples. Questions: 1) Would adding RAM or processing speed improve performance the most? 2) Is it possible to run R on a cluster without rewriting my high-level code? In other words, 3) What are we going to do when we start collecting terabytes of array data to analyze?
There will come a breaking point at which desktop systems can't perform these analyses fast enough for large quantities of data. What then? Michael Benjamin, MD Winship Cancer Institute Emory University, Atlanta, GA ___ Bioconductor mailing list [EMAIL PROTECTED] https://www.stat.math.ethz.ch/mailman/listinfo/bioconductor
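For reference, the slow path Andy mentions looks like this (a minimal sketch with toy data; requires the e1071 package):

```r
library(e1071)
set.seed(3)
## 150 samples by 20 "genes"; libsvm's linear kernel is known to be
## slow once the number of features grows large.
x <- matrix(rnorm(150 * 20), nrow = 150)
y <- factor(rep(c("A", "B"), length.out = 150))
fit <- svm(x, y, kernel = "linear")
mean(predict(fit, x) == y)  # resubstitution accuracy
```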
Re: [R] RE: R performance questions
Michael Benjamin [EMAIL PROTECTED] writes: I was just looking ahead two or three years--where is all this genomic array research headed? I guess I'm concerned about scalability. Me too -- but at least in the near future, data will be growing more than the capacity to process it. Is anyone really working on implementing affy on a cluster/Beowulf? That sounds like a real challenge. Yes and no. Depends on which components you want to deal with, and how you want to work with the data. Everything (with respect to speed/capacity/etc) is especially contextual -- applications and approximations will be quite important. best, -tony -- [EMAIL PROTECTED] http://www.analytics.washington.edu/ Biomedical and Health Informatics University of Washington Biostatistics, SCHARP/HVTN Fred Hutchinson Cancer Research Center UW (Tu/Th/F): 206-616-7630 FAX=206-543-3461 | Voicemail is unreliable FHCRC (M/W): 206-667-7025 FAX=206-667-4812 | use Email
Re: [R] add a point to regression line and cook's distance
[EMAIL PROTECTED] wrote: Hi, My question is the following: I would like to have a robust regression line. The data I have are mostly clustered around a small range, so the regression line tends to be influenced strongly by outlier points (with large Cook's distance). From the application's background, I know that the line should pass through (0,0), which is far away from the data cloud. I would like to add this point to have a more robust line. The question is: does it make sense to do this? What are the negative impacts, if any?

Have you tried a more robust fit (ltsreg() in the package lqs springs to mind)? Using this, without forcing the intercept to zero, might give you some idea of whether your idea makes sense. Venables and Ripley (Modern Applied Statistics with S, Springer-Verlag, 2002) give a good introduction to robust linear models and how to estimate their error distribution. Julian Faraway also gives an overview of the same in his Practical Regression and ANOVA using R: http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf Hope that helps, Jason

-- Indigo Industrial Controls Ltd. http://www.indigoindustrial.co.nz 64-21-343-545 [EMAIL PROTECTED]
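Jason's suggestion can be tried directly (a minimal sketch with made-up contaminated data; ltsreg() lives in the lqs package here, which was later absorbed into MASS):

```r
library(lqs)  # provides ltsreg(); part of MASS in later R releases
set.seed(4)
x <- runif(50, 10, 12)
y <- 2 * x + rnorm(50, sd = 0.5)
y[1:3] <- y[1:3] + 20  # a few gross outliers
coef(lm(y ~ x))        # least squares: dragged toward the outliers
coef(ltsreg(y ~ x))    # least trimmed squares: resistant fit
```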