Re: [R] cointegration analysis
I got this error too. I don't remember exactly what caused it, but my workaround was to follow the examples in the manual pages of ca.po() etc. and put my data in the same format. Also check whether the functions accept that many columns as input; I have not verified this. Lastly, check whether the data you are using have any missing values or numbers stored as text. HTH

Dorina LAZAR [EMAIL PROTECTED] Sent by: [EMAIL PROTECTED] 08/08/2007 10:15 PM
To: r-help@stat.math.ethz.ch
Subject: [R] cointegration analysis

Hello, I tried to use the urca package (R) for cointegration analysis. The data matrix to be investigated for cointegration contains 8 columns (variables). Both procedures, the Phillips-Ouliaris test and Johansen's procedure, give errors ("error in evaluating the argument 'object' in selecting a method for function 'summary'" and "too many variables, critical values cannot be computed", respectively). What can I do?

With regards, Dorina LAZAR
Department of Statistics, Forecasting, Mathematics
Babes-Bolyai University, Faculty of Economic Science
Teodor Mihali 58-60, 400591 Cluj-Napoca, Romania

__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.

DISCLAIMER AND CONFIDENTIALITY CAUTION: This message and any attachments with it (the message) are confidential and intended solely for the addressees. Unauthorized reading, copying, dissemination, distribution or disclosure, either whole or partial, is prohibited. If you receive this message in error, please delete it and immediately notify the sender. Communicating through email is not secure and is capable of interception, corruption and delays. Anyone communicating with The Clearing Corporation of India Limited (CCIL) by email accepts the risks involved and their consequences.
The internet cannot guarantee the integrity of this message. CCIL shall therefore not be liable for the message if modified. The recipient should check this email and any attachments for the presence of viruses. CCIL accepts no liability for any damage caused by any virus transmitted by this email.
Re: [R] cointegration analysis
Apologies for the typo: please read 'date' as 'data'.

--- Regards, Gaurav Yadav (mobile: +919821286118)
Assistant Manager, CCIL, Mumbai (India)
mailto:[EMAIL PROTECTED]
Profile: http://www.linkedin.com/in/gydec25
Keep in touch and keep mailing :-) slow or fast, little or too much
Re: [R] Reading time/date string
Thanks Mark, that was very helpful. I'm now so close! Can anyone tell me how to extract the value from an instance of a difftime class? I can see the value, but how can I place it in a dataframe?

time_string1 <- "10:17:07 02 Aug 2007"
time_string2 <- "13:17:40 02 Aug 2007"
time1 <- strptime(time_string1, format = "%H:%M:%S %d %b %Y")
time2 <- strptime(time_string2, format = "%H:%M:%S %d %b %Y")
time_delta <- difftime(time2, time1, unit = "sec")
time_delta
Time difference of 10833 secs    # <--- I'd like this value just here!
data.frame(time1, time2, time_delta)
Error in as.data.frame.default(x[[i]], optional = TRUE) :
  cannot coerce class "difftime" into a data.frame

Thanks again, Matthew

Mark W Kimpel wrote: Look at some of these functions...

DateTimeClasses (base)   Date-Time Classes
as.POSIXct (base)        Date-time Conversion Functions
cut.POSIXt (base)        Convert a Date or Date-Time Object to a Factor
format.Date (base)       Date Conversion Functions to and from Character

Mark
--- Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry, Indiana University School of Medicine, 15032 Hunter Court, Westfield, IN 46074; (317) 490-5129 Work, Mobile & VoiceMail; (317) 663-0513 Home (no voice mail please)

Matthew Walker wrote: Hello everyone, Can anyone tell me what function I should use to read time/date strings and turn them into a form such that I can easily calculate the difference of two? The strings I've got look like "10:17:07 02 Aug 2007". If I could calculate the number of seconds between them I'd be very happy! Cheers, Matthew
Re: [R] Reading time/date string
On Thu, 9 Aug 2007, Matthew Walker wrote:

Thanks Mark, that was very helpful. I'm now so close! Can anyone tell me how to extract the value from an instance of a difftime class? I can see the value, but how can I place it in a dataframe?

as.numeric(time_delta)

Hint: you want the number, not the value (which is a classed object).
-- Brian D. Ripley, [EMAIL PROTECTED]
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self), +44 1865 272866 (PA)
1 South Parks Road, Oxford OX1 3TG, UK; Fax: +44 1865 272595
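A minimal sketch of the hint above, reusing the poster's own example strings: wrap the difftime in as.numeric() before putting it in the data frame (the POSIXlt times are formatted to character here, an illustrative choice, not part of the original reply).

```r
time1 <- strptime("10:17:07 02 Aug 2007", format = "%H:%M:%S %d %b %Y")
time2 <- strptime("13:17:40 02 Aug 2007", format = "%H:%M:%S %d %b %Y")
time_delta <- difftime(time2, time1, units = "secs")

# as.numeric() strips the "difftime" class and keeps the number of seconds
seconds <- as.numeric(time_delta)
seconds   # 10833

d <- data.frame(start = format(time1), end = format(time2), seconds = seconds)
d
```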
Re: [R] cointegration analysis
Hello Dorina,

if you apply ca.jo() to a system with more than five variables, a *warning* is issued that no critical values are provided. This is not an error, and it is documented in ?ca.jo. In Johansen's seminal paper, critical values are provided only for up to five variables. Hence, you need to refer to a different source of critical values in order to determine the cointegration rank.

Best, Bernhard
Re: [R] R memory usage
See ?gc and ?Memory-limits.

On Wed, 8 Aug 2007, Jun Ding wrote:

Hi All, I have two questions about memory usage in R (sorry if the questions are naive, I am not familiar with this at all).

1) I am running R on a Linux cluster. From reading the R help pages, it seems there are no default upper limits for vsize or nsize. Is this right? Is there an upper limit for overall memory usage? How can I find out the default in my specific Linux environment, and can I increase it?

See ?Memory-limits, but that is principally a Linux question.

2) I use R to read in several big files (~200Mb each), and then I run gc(). I get:

             used  (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells   23083130 616.4   51411332 1372.9  51411332 1372.9
Vcells  106644603 813.7  240815267 1837.3 227550003 1736.1

What do the columns used, gc trigger and max used mean? It seems to me I have used 616Mb of Ncells and 813.7Mb of Vcells. Comparing with the numbers of max used, I should still have enough memory. But when I try

object.size(area.results)   ## area.results is a big data.frame

I get an error message:

Error: cannot allocate vector of size 32768 Kb

Why is that? It looks like I am running out of memory. Is there a way to solve this problem? Thank you very much! Jun

-- Brian D. Ripley, [EMAIL PROTECTED]
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self), +44 1865 272866 (PA)
1 South Parks Road, Oxford OX1 3TG, UK; Fax: +44 1865 272595
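Per ?gc (the help page the reply points to): "used" is the number of cells currently in use, "gc trigger" is the level at which the next automatic garbage collection will be triggered, and "max used" is the maximum used since startup or since the last gc(reset = TRUE). A small sketch of inspecting that output programmatically:

```r
gc(reset = TRUE)   # reset the "max used" statistics
g <- gc()          # returns the same matrix that is printed

rownames(g)        # "Ncells" (cons cells) and "Vcells" (vector heap)
g[, "used"]        # cells currently in use, one entry per row
```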
Re: [R] tcltk error on Linux
On Thu, 9 Aug 2007, Mark W Kimpel wrote:

I am having trouble getting the tcltk package to load on openSUSE 10.2 running R-devel. I have specifically put my /usr/share/tcl directory in my PATH, but R doesn't seem to see it. I also have installed tk on my system. Any ideas on what the problem is?

Whether Tcl/Tk would be available was determined when you installed R. The relevant information was in the configure output and log, which we don't have.

You are not running a released version of R: please don't use the development version unless you are familiar with the build process and know how to debug such things yourself. The rule is that questions about development versions of R should not be asked here but on R-devel (and not to R-core, which I have deleted from the recipients). I suggest reinstalling R (preferably R-patched) and, if tcltk still is not available, sending the relevant configure information to the R-devel list.

Also, note that I have some warning messages on starting up R; I am not sure what they mean or if they are pertinent.

Those are coming from a Bioconductor package: again you must be using development versions with R-devel, and those are not stable (last time I looked even Biobase would not install, and the packages change daily). If you have all those packages in your startup, please don't -- there will be a considerable performance hit, so only load them when you need them.
Thanks, Mark

Warning messages:
1: In .updateMethodsInTable(fdef, where, attach) :
  Couldn't find methods table for "conditional", package Category may be out of date
2: In .updateMethodsInTable(fdef, where, attach) :
  Methods list for generic "conditional" not found

require(tcltk)
Loading required package: tcltk
Error in firstlib(which.lib.loc, package) :
  Tcl/Tk support is not available on this system

sessionInfo()
R version 2.6.0 Under development (unstable) (2007-08-01 r42387)
i686-pc-linux-gnu

locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] splines   tools     stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] affycoretools_1.9.3    annaffy_1.9.1          xtable_1.5-0
 [4] gcrma_2.9.1            matchprobes_1.9.10     biomaRt_1.11.4
 [7] RCurl_0.8-1            XML_1.9-0              GOstats_2.3.8
[10] Category_2.3.19        genefilter_1.15.9      survival_2.32
[13] KEGG_1.17.0            RBGL_1.13.3            annotate_1.15.3
[16] AnnotationDbi_0.0.88   RSQLite_0.6-0          DBI_0.2-3
[19] GO_1.17.0              limma_2.11.9           affy_1.15.7
[22] preprocessCore_0.99.12 affyio_1.5.6           Biobase_1.15.23
[25] graph_1.15.10

loaded via a namespace (and not attached):
[1] cluster_1.11.7  rcompgen_0.1-15

-- Brian D. Ripley, [EMAIL PROTECTED]
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self), +44 1865 272866 (PA)
1 South Parks Road, Oxford OX1 3TG, UK; Fax: +44 1865 272595
[R] Odp: Successively eliminating most frequent elements
Hi,

your construction is quite complicated, so instead of refining it I tried to do the task a different way. If I understand what you want to do, you can use:

set.seed(1)
T <- matrix(trunc(runif(20) * 10), nrow = 10, ncol = 2)
T
      [,1] [,2]
 [1,]    2    2
 [2,]    3    1
 [3,]    5    6
 [4,]    9    3
 [5,]    2    7
 [6,]    8    4
 [7,]    9    7
 [8,]    6    9
 [9,]    6    3
[10,]    0    7

m <- table(T)   # a matrix is a vector with dimensions
# rows that contain the most frequent element
todel <- rowSums(T == as.numeric(names(which.max(m)))) > 0
T[todel, ]
     [,1] [,2]
[1,]    2    2
[2,]    2    7
T[!todel, ]
     [,1] [,2]
[1,]    3    1
[2,]    5    6
[3,]    9    3
[4,]    8    4
[5,]    9    7
[6,]    6    9
[7,]    6    3
[8,]    0    7

You can put all of this in a loop, but you have to decide when the loop should end.

Regards, Petr

[EMAIL PROTECTED] napsal dne 08.08.2007 15:33:58:

Dear experts, I have a 10x2 matrix T containing random integers. I would like to delete pairs (rows) iteratively, which contain the most frequent element in either the first or second column:

T <- matrix(trunc(runif(20) * 10), nrow = 10, ncol = 2)
G <- matrix(0, nrow = 6, ncol = 2)
for (i in (1:6)) {
  print(paste("** Start iteration", i, "***"))
  print("Current matrix:")
  print(T)
  m <- append(T[, 1], T[, 2])
  print("Concatenated columns:")
  print(m)
  # build frequency table
  F <- data.matrix(as.data.frame(table(m)))
  dimnames(F) <- NULL
  # pick the most frequent element: sort decreasing and take it from the top
  F <- F[order(F[, 2], decreasing = TRUE), ]
  print("Freq. table:")
  print(F[1:5, ])
  todel <- F[1, 1]   # rows containing the most frequent element will be deleted
  G[i, 1] <- todel
  G[i, 2] <- F[1, 2]
  print(paste("todel=", todel))
  # eliminate rows containing the most frequent element
  # in either the first or the second column
  id <- which(T[, 1] == todel)
  print("Indexes of rows to be deleted:")
  print(id)
  if (length(id) > 0) {
    T <- T[-1 * id, ]
  }
  id <- which(T[, 2] == todel)
  print("Indexes of rows to be deleted:")
  print(id)
  if (length(id) > 0) {
    T <- T[-1 * id, ]
  }
  print(paste("nrow(T)=", nrow(T)))
}
print("Result matrix:")
print(G)

The output of the first two iterations looks as follows.
As one can see, the frequency table in the second iteration still contains the element deleted in the first iteration! Is this a bug or what am I doing wrong here? Any help greatly appreciated!

[1] "** Start iteration 1 ***"
[1] "Current matrix:"
      [,1] [,2]
 [1,]    2    2
 [2,]    6    7
 [3,]    9    9
 [4,]    3    5
 [5,]    4    0
 [6,]    7    9
 [7,]    5    7
 [8,]    1    7
 [9,]    9    6
[10,]    3    3
[1] "Concatenated columns:"
 [1] 2 6 9 3 4 7 5 1 9 3 2 7 9 5 0 9 7 7 6 3
[1] "Freq. table:"
     [,1] [,2]
[1,]    8    4
[2,]    9    4
[3,]    4    3
[4,]    3    2
[5,]    6    2
[1] "todel= 8"
[1] "Indexes of rows to be deleted:"
integer(0)
[1] "Indexes of rows to be deleted:"
integer(0)
[1] "nrow(T)= 10"
[1] "** Start iteration 2 ***"
[1] "Current matrix:"
      [,1] [,2]
 [1,]    2    2
 [2,]    6    7
 [3,]    9    9
 [4,]    3    5
 [5,]    4    0
 [6,]    7    9
 [7,]    5    7
 [8,]    1    7
 [9,]    9    6
[10,]    3    3
[1] "Concatenated columns:"
 [1] 2 6 9 3 4 7 5 1 9 3 2 7 9 5 0 9 7 7 6 3
[1] "Freq. table:"
     [,1] [,2]
[1,]    8    4
[2,]    9    4
[3,]    4    3
[4,]    3    2
[5,]    6    2
[1] "todel= 8"
[1] "Indexes of rows to be deleted:"
integer(0)
[1] "Indexes of rows to be deleted:"
integer(0)
[1] "nrow(T)= 10"
[1] "** Start iteration 3 ***"
[1] "Current matrix:"
...
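The stray 8 in the frequency table (a value that never occurs in the data) can be reproduced with a minimal sketch: data.matrix() converts the factor column returned by table() into its internal integer level codes, so the first column of F holds codes 1..9 rather than the observed values, and the elimination step then searches for a value that does not exist.

```r
m <- c(2, 6, 9, 3, 4, 7, 5, 1, 9, 3, 2, 7, 9, 5, 0, 9, 7, 7, 6, 3)
F <- data.matrix(as.data.frame(table(m)))
# table(m) has levels 0 1 2 3 4 5 6 7 9 (there is no 8 in the data);
# data.matrix() replaces the factor by its level codes 1..9, so the
# value 7 (the 8th level) shows up as "8" after sorting by frequency:
F[order(F[, 2], decreasing = TRUE), ][1, ]

# Keeping the actual values instead of codes avoids the problem:
as.numeric(names(which.max(table(m))))   # a value that really occurs in m
```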
[R] using data() and determining data types
Hi R Gurus: I'm using the data() function to get the list of data sets for a package. I would like to find the class for each data set; i.e., data.frame, etc. Using str(), I can find the name of the data set. However, when I try the class function on the str output, I get "character", since the name in str is a character. I've also tried this with just the plain results column. Still no luck. Any help would be much appreciated. Sincerely, Edna Bell
[R] data() problem solved
Problem solved:

sapply(data(package = "car")$results[, 3], function(x) class(get(x)))

Sorry for the silliness.
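The same idea can be tried without installing anything, using the base "datasets" package; note (an assumption worth checking for other packages) that some Item entries look like "beaver1 (beavers)", where the name is not directly an object name, so those are filtered out here.

```r
res <- data(package = "datasets")$results
items <- res[, "Item"]
items <- items[!grepl("\\(", items)]   # drop "object (file)" style entries

# class of each data set; get() finds them since "datasets" is attached
classes <- sapply(items, function(x) class(get(x))[1])
classes["mtcars"]   # "data.frame"
classes["euro"]     # "numeric"
```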
Re: [R] Change in R**2 for block entry regression
David Kaplan wrote: Hi all, I'm demonstrating a block entry regression using R for my regression class. For each block, I get the R**2 and the associated F. I do this with separate regressions, adding the next block in, and then get the results by writing separate summary() statements for each regression. Is there a more convenient way to do this, and also to get the change in R**2 and the associated F test for the change? Thanks in advance. David

I'm not sure this is the best approach, but you might start with the data frame returned by applying anova() to several models and extend it to include the squared multiple correlation and its increments:

mod.0 <- lm(breaks ~ 1, data = warpbreaks)
mod.1 <- lm(breaks ~ 1 + wool, data = warpbreaks)
mod.2 <- lm(breaks ~ 1 + wool + tension, data = warpbreaks)
mod.3 <- lm(breaks ~ 1 + wool * tension, data = warpbreaks)

BlockRegSum <- anova(mod.0, mod.1, mod.2, mod.3)
BlockRegSum$R2 <- 1 - (BlockRegSum$RSS / BlockRegSum$RSS[1])
BlockRegSum$IncR2 <- c(NA, diff(BlockRegSum$R2))
BlockRegSum$R2[1] <- NA
BlockRegSum

Analysis of Variance Table

Model 1: breaks ~ 1
Model 2: breaks ~ 1 + wool
Model 3: breaks ~ 1 + wool + tension
Model 4: breaks ~ 1 + wool * tension
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)     R2  IncR2
1     53 9232.8                                    NA     NA
2     52 8782.1  1     450.7 3.7653 0.058     0.0488 0.0488
3     50 6747.9  2    2034.3 8.4980 0.0006926 0.2691 0.2203
4     48 5745.1  2    1002.8 4.1891 0.0210442 0.3778 0.1086

BlockRegSum$R2
[1]         NA 0.04881141 0.26914067 0.37775086
BlockRegSum$IncR2
[1]         NA 0.04881141 0.22032926 0.10861019

summary(mod.1)$r.squared
[1] 0.04881141
summary(mod.2)$r.squared
[1] 0.2691407
summary(mod.3)$r.squared
[1] 0.3777509

-- Chuck Cleland, Ph.D., NDRI, Inc.
71 West 23rd Street, 8th floor, New York, NY 10010
tel: (212) 845-4495 (Tu, Th); tel: (732) 512-0171 (M, W, F); fax: (917) 438-0894
[R] substrings
Hello again! I have a set of character results. If one of the characters is a blank space, followed by other characters, I want to end at the blank space. I tried strsplit, but it picks up again after the blank. Any help would be much appreciated. TIA, Edna
[R] Countvariable for id by date
Best R-users,

Here's a newbie question. I have tried to find an answer to this via help and the ave(x, factor(), FUN = function(y) rank(z, ties.method = "first")) function, but without success.

I have a dataframe (~8000 observations, register data) with four columns of interest: id, dg1, dg2 and date (YYYY-MM-DD):

id;dg1;dg2;date;
1;F28;;1997-11-04;
1;F20;F702;1998-11-09;
1;F20;;1997-12-03;
1;F208;;2001-03-18;
2;F32;;1999-03-07;
2;F29;F32;2000-01-06;
2;F32;;2003-07-05;
2;F323;F2800;2000-02-05;
...

I would like to have two additional columns:

1. countF20: a count variable that shows which in order (by date) the observation is for the id, if it fulfils the logical expression dg1 == F20* OR dg2 == F20*, where * means F201, F202, ..., F2001, F2002, ..., F20001, F20002, ...

2. countF2129: another count variable that shows which in order (by date) the observation is for the id, if it fulfils the logical expression dg1 == F21*-F29* OR dg2 == F21*-F29*, where F21*-F29* means F21*, F22*, ..., F29* and * means F211, F212, ..., F2101, F2102, ..., F21001, F21002, ...

... so the dataframe would look like this, where 1 is the first observation for the id with the right condition, 2 is the second, etc.:

id;dg1;dg2;date;countF20;countF2129;
1;F28;;1997-11-04;;1;
1;F20;F702;1998-11-09;2;;
1;F20;;1997-12-03;1;;
1;F208;;2001-03-18;3;;
2;F32;;1999-03-07;;;
2;F29;F32;2000-01-06;;1;
2;F32;;2003-07-05;;;
2;F323;F2800;2000-02-05;;2;
...

Do you know a convenient way to create these kinds of count variables? Thank you in advance!

/ David (david.gyllenberg at yahoo.com)
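One possible sketch (an illustrative approach, not from the thread): order the rows by date within id, flag matching diagnoses with a regular expression on the code prefixes, and cumulate the flags per id with ave().

```r
# the poster's example data, read with the ";" separator
df <- read.csv2(text = "id;dg1;dg2;date
1;F28;;1997-11-04
1;F20;F702;1998-11-09
1;F20;;1997-12-03
1;F208;;2001-03-18
2;F32;;1999-03-07
2;F29;F32;2000-01-06
2;F32;;2003-07-05
2;F323;F2800;2000-02-05", stringsAsFactors = FALSE)

# order by date within id so the cumulative count follows the calendar
df <- df[order(df$id, as.Date(df$date)), ]

# F20* in either diagnosis column; F21*-F29* likewise
hitF20   <- grepl("^F20", df$dg1)     | grepl("^F20", df$dg2)
hitF2129 <- grepl("^F2[1-9]", df$dg1) | grepl("^F2[1-9]", df$dg2)

# running count per id, NA where the condition does not hold
df$countF20 <- ave(as.integer(hitF20), df$id, FUN = cumsum)
df$countF20[!hitF20] <- NA
df$countF2129 <- ave(as.integer(hitF2129), df$id, FUN = cumsum)
df$countF2129[!hitF2129] <- NA
df
```

On the example rows this reproduces the desired output, e.g. countF20 = 1, 2, 3 for id 1's three F20x visits in date order, and countF2129 = 2 for id 2's F2800 record.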
Re: [R] substrings
Is this what you want?

a <- c("a b c", "1 2 3", "q - 5")
a
[1] "a b c" "1 2 3" "q - 5"
sapply(strsplit(a, "[[:blank:]]"), function(x) x[1])
[1] "a" "1" "q"

-- View this message in context: http://www.nabble.com/substrings-tf4241506.html#a12069209 Sent from the R help mailing list archive at Nabble.com.
Re: [R] substrings
One more, shorter, solution:

a
[1] "a b c" "1 2 3" "q- 5"
gsub("\\s.+", "", a)
[1] "a"  "1"  "q-"
[R] Help on R performance using aov function
Hi,

I'm trying to replace some SAS statistical functions by R (batch calling). But I've seen that calling R in batch mode (under Unix) takes about 2 or 3 times longer than the SAS software, so performance is a big problem for me. Here is an extract of the calculation:

stoutput <- file("res_oneWayAnova.dat", "w")
cat("Param|F|Prob", file = stoutput, "\n")
for (i in 1:n) {
  p <- list_param[[i]]
  aov_ <- aov(A[, p] ~ A[, "wafer"], data = A)
  anova_ <- summary(aov_)
  if (!is.na(anova_[[1]][1, 5]) && anova_[[1]][1, 5] <= 0.0001)
    res_aov <- cbind(p, anova_[[1]][1, 4], 0.0001)
  else
    res_aov <- cbind(p, anova_[[1]][1, 4], anova_[[1]][1, 5])
  cat(res_aov, file = stoutput, append = TRUE, sep = "|", "\n")
}
close(stoutput)

A is a data.frame of about 400 lines and 1800 parameters. I'm a new user of R and I don't know if it's a problem in my code or if there are some tips I can use to optimise my treatment. Thanks a lot for your help.

Françoise Pfiffelmann
Engineering Data Analysis Group -- Crolles2 Alliance
860 rue Jean Monnet, 38920 Crolles, France
Tel: +33 438 92 29 84; Email: [EMAIL PROTECTED]
Re: [R] Help on R performance using aov function
aov() will handle multiple responses, and that would be considerably more efficient than running separate fits as you seem to be doing.

Your code is nigh unreadable: please use your spacebar and remove the redundant semicolons; `Writing R Extensions' shows you how to tidy up your code to make it presentable. But I think anova_[[1]] is really coef(summary(aov_)), which is a lot more intelligible.

-- Brian D.
Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UKFax: +44 1865 272595__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
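A minimal sketch of the multiple-response suggestion above, on made-up data (the column names `y1`..`y3` and the factor `wafer` are hypothetical stand-ins for the original 1800 parameters):

```r
## Hypothetical data: three response columns and one grouping factor.
A <- data.frame(y1 = rnorm(12), y2 = rnorm(12), y3 = rnorm(12),
                wafer = gl(3, 4))

## One aov() call fits all responses at once ...
fit <- aov(cbind(y1, y2, y3) ~ wafer, data = A)

## ... and summary() returns one ANOVA table per response.
s <- summary(fit)
length(s)  # one table per response column
```

Looping over 1800 columns would then reduce to building the cbind() formula once, rather than calling aov() 1800 times.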
[R] Discriminant scores plot
Hello,

How can I plot the discriminant scores resulting from prediction using a dataset and an lda model, and its decision boundaries, using GGobi and rggobi?

Best regards, Dani

Daniel Valverde Saubí
Grup d'Aplicacions Biomèdiques de la Ressonància Magnètica Nuclear (GABRMN)
Departament de Bioquímica i Biologia Molecular
Edifici C, Facultat de Biociències, Campus Universitat Autònoma de Barcelona
08193 Cerdanyola del Vallès, Spain
Tlf. (0034) 935814126
[EMAIL PROTECTED]
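GGobi aside, the discriminant scores themselves come from predict() on the lda fit; a small sketch with MASS and the built-in iris data, plotted with base graphics rather than rggobi:

```r
library(MASS)  # ships with R

## Fit an LDA model and extract the discriminant scores.
fit <- lda(Species ~ ., data = iris)
scores <- predict(fit)$x   # n x (g-1) matrix of scores (here LD1, LD2)

## Scores in discriminant space, coloured by class.
plot(scores, col = as.integer(iris$Species), xlab = "LD1", ylab = "LD2")
```

The `scores` matrix is what one would hand to rggobi for interactive exploration; decision boundaries would have to be computed separately (e.g. by predicting over a grid).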
[R] GLMM: MEEM error due to dichotomous variables
I am trying to run a GLMM on some binomial data. My fixed factors include 2 dichotomous variables, day and distance. When I run the model:

modelA <- glmmPQL(Leaving ~ Trial*Day*Dist, random = ~1|Indiv, family = binomial)

I get the error:

iteration 1
Error in MEEM(object, conLin, control$niterEM) :
  Singularity in backsolve at level 0, block 1

From looking at previous help topics (http://tolstoy.newcastle.edu.au/R/help/02a/4473.html) I gather this is because of the dichotomous predictor variables - what approach should I take to avoid this problem?

Thanks, Elva.
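One way to check for the aliasing that can trigger this error is to inspect the rank of the fixed-effects design matrix before fitting. A sketch on hypothetical data (`d`, with `Dist` deliberately confounded with `Day`):

```r
## Hypothetical data in which two dichotomous predictors are confounded.
d <- data.frame(Trial = factor(rep(0:1, each = 4)),
                Day   = factor(rep(0:1, times = 4)))
d$Dist <- d$Day  # Dist perfectly aliased with Day

## Build the fixed-effects design matrix for the full interaction model.
X <- model.matrix(~ Trial * Day * Dist, data = d)

## Rank deficiency means some columns are linear combinations of others,
## which is what produces a singularity in backsolve.
deficient <- qr(X)$rank < ncol(X)
deficient  # TRUE here: aliased columns in the design
```

If this comes out TRUE for the real data, dropping the aliased terms (or collapsing confounded factors) is the usual remedy.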
[R] Subject: Re: how to include bar values in a barplot?
Greg, I'm going to join issue with you here! Not that I'll go near advocating Excel-style graphics (abominable, and the Patrick Burns URL which you cite is remarkable in its restraint). Also, I'm aware that this is potential flame-war territory -- again, I want to avoid that too.

However, this is the second time you have intervened on this theme (previously Mon 6 August), along with John Kane on Wed 1 August and again today on similar lines, and I think it's time an alternative point of view was presented, to counteract (I hope usefully) what seems to be a draconianly prescriptive approach to the presentation of information.

On 07-Aug-07 21:37:50, Greg Snow wrote:
> Generally adding the numbers to a graph accomplishes 2 things:
> 1) it acts as an admission that your graph is a failure

Generally, I disagree. Different elements in a display serve different purposes, according to the psychological aspects of visual perception.

Sizes, proportions, colours etc. of shapes (bars in a histogram, the marks representing points in a scatterplot, ...) are interpreted, so to speak, intuitively -- the resulting perception is formed by processes which are hard to ascertain consciously, and the overall effect can only be ascertained by looking at it and noting what impression one has formed. They stimulate mental responses in the domain of perception of spatial relationships.

Numbers and text, on the other hand, while still shapes from the optical point of view up to the point of their impact on the retina, provoke different perceptions. They are interpreted analytically, stimulating mental responses in the domains of language and number. There is no Law whatever which requires that the two must be separated.

It may be that adding any annotation to a graph or diagram will interfere with the intuitive interpretation that the diagram is intended to stimulate, with no associated benefit.
It may be that presenting numerical/textual information within a graphical/diagrammatic context will interfere with the analytic interpretation which is desired, with no associated benefit. In such cases, it is clearly (and as a matter of fact to be decided in each case) better to separate the two aspects.

It may, however, be that both can be combined in such a way that each enhances the other; and also the simultaneous perception of both aspects induces a Cartesian-product richness of interpretation, where each element of the graphical presentation combines with each element of the textual/numerical presentation to generate a perception which could not possibly have been realised if they had been presented separately. This, too, is a matter to be decided in each case.

On that basis, if a graph without numbers fails to stimulate a desired impression which could have been stimulated by adding the numbers to the graph, then the graph without numbers is a failure.

> 2) it converts the graph into a poorly laid out table (with a colorful and distracting background). In general it is better to find an appropriate graph that does convey the information that is intended, or if a table is more appropriate, then replace it with a well laid out table (or both).

There is an implication here that the information conveyed by a graph, and the information conveyed by a table, are mutually exclusive. And that it then follows: Thou Shalt Not Allow The One To Corrupt The Other. While this has the appearance of a Law, it is (for reasons I have sketched above) a Law which is not *generally* applicable.

> Remember that the role of tables is to look up specific values and the role of graphs is to give a good overview.

I would agree with this only to the following extent: Tables allow *only* the look-up of values. Graphs (modulo the capacity of the eye/brain to more or less precisely judge relative magnitudes) only allow a good overview. I would not agree that these are their exclusive roles.
The role of Hamlet is to agonise over revenge for his father's death. The role of Ophelia is to embody the love interest in the play. This does not imply that there should be parallel performances of Hamlet on two different stages, with the audience trooping from one to the other according to which character is currently at the centre of the action. It actually works better when they're all up there at once, interacting!

> The books by William Cleveland and Tufte have a lot of good advice on these issues.

Since you mention Tufte, I commend the admiring discussion in his book The Visual Display of Quantitative Information, Chapter 1 (Graphical Excellence), section "Narrative Graphics of Space and Time" (pp. 40-41 in the edition which I have), of Minard's graphical representation of what happened to Napoleon's army in the course of its advance on, and retreat from, Moscow. An impression of the original can be formed from the rather small version displayed on Tufte's website at the top of http://www.edwardtufte.com/tufte/posters The version in the book
[R] Term Structure Estimation using Kalman Filter
Long time reader, first time poster. I'm working on a paper regarding term structure estimation using the Kalman filter algorithm. The model in question is the generalized Vasicek, and since there are coupon bonds being estimated, I'm supposed to make some changes to the Kalman filter.

Has anyone already used R for these purposes? Any tips? Does anyone have Kalman filter code I could use as a starting point for an extended Kalman filter approach?

Thanks a lot for the patience and time, Bernardo Ribeiro
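Not the generalized Vasicek, but as a starting point, here is a minimal local-level Kalman filter in base R; the variances and the initialisation are arbitrary illustrative choices:

```r
## Local-level model: state follows a random walk, observed with noise.
kalman_filter <- function(y, var_obs = 1, var_state = 0.1) {
  n <- length(y)
  a <- numeric(n); P <- numeric(n)   # filtered means and variances
  a_prev <- 0; P_prev <- 1e6         # diffuse-ish initialisation
  for (t in seq_len(n)) {
    P_pred <- P_prev + var_state             # predict step
    K <- P_pred / (P_pred + var_obs)         # Kalman gain
    a[t] <- a_prev + K * (y[t] - a_prev)     # update step
    P[t] <- (1 - K) * P_pred
    a_prev <- a[t]; P_prev <- P[t]
  }
  list(mean = a, var = P)
}

set.seed(1)
y <- cumsum(rnorm(100)) + rnorm(100)  # simulated noisy random walk
out <- kalman_filter(y)
```

For standard linear-Gaussian state-space models, stats::StructTS and KalmanRun may also be worth a look before rolling your own; an extended filter would replace the predict/update lines with the linearised model.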
Re: [R] Countvariable for id by date
This should do what you want:

x <- read.table(textConnection("id;dg1;dg2;date;
1;F28;;1997-11-04;
1;F20;F702;1998-11-09;
1;F20;;1997-12-03;
1;F208;;2001-03-18;
2;F32;;1999-03-07;
2;F29;F32;2000-01-06;
2;F32;;2003-07-05;
2;F323;F2800;2000-02-05;"), header = TRUE, sep = ";", as.is = TRUE)

# convert dates
x$dateP <- unclass(as.POSIXct(x$date))
# matches for F20
F20 <- grep("F20", paste(x$dg1, x$dg2))
# matches for F21 - F29
F21 <- grep("F2[1-9]", paste(x$dg1, x$dg2))
# grouping
x$F20 <- x$F21 <- NA
x$F20[F20] <- rank(x$dateP[F20])
x$F21[F21] <- rank(x$dateP[F21])
x

  id  dg1   dg2       date  X      dateP F21 F20
1  1  F28       1997-11-04 NA  878601600   1  NA
2  1  F20  F702 1998-11-09 NA  910569600  NA   2
3  1  F20       1997-12-03 NA  881107200  NA   1
4  1 F208       2001-03-18 NA  984873600  NA   3
5  2  F32       1999-03-07 NA  920764800  NA  NA
6  2  F29   F32 2000-01-06 NA  947116800   2  NA
7  2  F32       2003-07-05 NA 1057363200  NA  NA
8  2 F323 F2800 2000-02-05 NA  949708800   3  NA

On 8/9/07, David Gyllenberg [EMAIL PROTECTED] wrote:
> Best R-users,
>
> Here's a newbie question. I have tried to find an answer to this via help and the ave(x, factor(), FUN = function(y) rank(z, tie = 'first')) function, but without success.
>
> I have a dataframe (~8000 observations, register data) with four columns of interest: id, dg1, dg2 and date (YYYY-MM-DD):
>
> id;dg1;dg2;date;
> 1;F28;;1997-11-04;
> 1;F20;F702;1998-11-09;
> 1;F20;;1997-12-03;
> 1;F208;;2001-03-18;
> 2;F32;;1999-03-07;
> 2;F29;F32;2000-01-06;
> 2;F32;;2003-07-05;
> 2;F323;F2800;2000-02-05;
> ...
>
> I would like to have two additional columns:
>
> 1. countF20: a count variable that shows which in order (by date) the id has if it fulfils the following logical expression: dg1 = F20* OR dg2 = F20*, where * means F201, F202 ... F2001, F2002 ... F20001, F20002 ...
>
> 2. countF2129: another count variable that shows which in order (by date) the id has if it fulfils the following logical expression: dg1 = F21*-F29* OR dg2 = F21*-F29*, where F21*-F29* means F21*, F22* ... F29* and where * means F211, F212 ... F2101, F2102 ... F21001, F21002 ...
>
> so the dataframe would look like this, where 1 is the first observation for the id with the right condition, 2 is the second etc.:
>
> id;dg1;dg2;date;countF20;countF2129;
> 1;F28;;1997-11-04;;1;
> 1;F20;F702;1998-11-09;2;;
> 1;F20;;1997-12-03;1;;
> 1;F208;;2001-03-18;3;;
> 2;F32;;1999-03-07;;;
> 2;F29;F32;2000-01-06;;1;
> 2;F32;;2003-07-05;;;
> 2;F323;F2800;2000-02-05;;2;
> ...
>
> Do you know a convenient way to create these kind of count variables? Thank you in advance!
>
> / David (david.gyllenberg at yahoo.com)

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem you are trying to solve?
[R] ARIMA fitting
Hello,

I'm trying to fit an ARIMA process, using the stats package's arima function. Can I expect that the fitted model, with any parameters, is stationary, causal and invertible?

Thanks
[R] interior branch test
Dear R users,

Does anyone know which package provides an interior branch test for phylogenetic trees with distance-based methods? Any help is really appreciated. Thank you.

Nora.
[R] Lo and MacKinlay variance ratio test (Lo.Mac)
Hi all,

I am trying to calculate the variance ratio of a time series under heteroskedasticity. I know that the variance ratio should be calculated as a weighted average of autocorrelations, but I don't find the same results when I calculate the variance ratio manually and when I compute the M2 (M2 for heteroskedasticity) variance ratio using the Lo.Mac function in R. Does anybody know what formula R is using to calculate the M2 statistic?
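For comparison, a hand-rolled homoskedastic variance ratio. This is an illustrative sketch, not the package's Lo.Mac implementation, so differences with the heteroskedasticity-robust M2 weighting are expected:

```r
## VR(k) = Var(k-period return) / (k * Var(1-period return));
## approximately 1 for a pure random walk.
variance_ratio <- function(p, k) {
  r1 <- diff(p, lag = 1)
  rk <- diff(p, lag = k)
  (var(rk) / k) / var(r1)
}

set.seed(42)
rw <- cumsum(rnorm(1000))   # simulated random-walk "prices"
v <- variance_ratio(rw, 4)  # should be close to 1
```

Comparing this naive statistic against the package output for the same series should make it clear where the weighting schemes diverge.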
Re: [R] Countvariable for id by date
Try this:

Lines <- "id;dg1;dg2;date;
1;F28;;1997-11-04;
1;F20;F702;1998-11-09;
1;F20;;1997-12-03;
1;F208;;2001-03-18;
2;F32;;1999-03-07;
2;F29;F32;2000-01-06;
2;F32;;2003-07-05;
2;F323;F2800;2000-02-05;"

# replace textConnection(Lines) with actual file name
DF <- read.csv2(textConnection(Lines), as.is = TRUE,
    colClasses = list("numeric", "character", "character", "Date", "NULL"))

rk <- function(x, pat) {
    z <- regexpr(pat, x$dg1) > 0 | regexpr(pat, x$dg2) > 0
    rank(ifelse(z, x$date, NA), na.last = "keep")
}

DF$countF20 <- unlist(by(DF, DF$id, rk, pat = "^F20"))
DF$countF2129 <- unlist(by(DF, DF$id, rk, pat = "^F2[1-9]"))
DF

On 8/9/07, David Gyllenberg [EMAIL PROTECTED] wrote:
---snip---
Re: [R] Rcmdr window border lost
OK, I tried completely removing and reinstalling R, but this has not worked - I am still missing window borders for Rcmdr. I am certain that everything is installed correctly and that all dependencies are met - there must be something trivial I am missing?!

Thanks in advance, Andy

Andy Weller wrote:
> Dear all,
> I have recently lost my Rcmdr window borders (all my other programs have borders)! I am unsure of what I have done, although I have recently run update.packages() in R... How can I reclaim them?
> I am using: Ubuntu Linux (Feisty), R version 2.5.1, R Commander version 1.3-5.
> I have deleted the folder /usr/local/lib/R/site-library/Rcmdr and reinstalled Rcmdr with:
>   install.packages("Rcmdr", dep = TRUE)
> This has not solved my problem though. Maybe I need to reinstall something else as well?
> Thanks in advance, Andy
Re: [R] Help using gPath
Hi

Emilio Gagliardi wrote:
> Hi everyone, I'm trying to figure out how to use gPath and the documentation is not very helpful :( I have the following plot object:
>
> plot-surrounds:: background
> plot.gTree.378:: background
> guide.gTree.355:: (background.rect.345, minor-horizontal.segments.347, minor-vertical.segments.349, major-horizontal.segments.351, major-vertical.segments.353)
> guide.gTree.356:: (background.rect.345, minor-horizontal.segments.347, minor-vertical.segments.349, major-horizontal.segments.351, major-vertical.segments.353)
> yaxis.gTree.338:: ticks.segments.321 labels.gTree.335:: (label.text.324, label.text.326, label.text.328, label.text.330, label.text.332, label.text.334)
> xaxis.gTree.339:: ticks.segments.309 labels.gTree.315:: (label.text.312, label.text.314)
> xaxis.gTree.340:: ticks.segments.309 labels.gTree.315:: (label.text.312, label.text.314)
> strip.gTree.364:: (background.rect.361, label.text.363)
> strip.gTree.370:: (background.rect.367, label.text.369)
> guide.rect.357
> guide.rect.358
> boxplots.gTree.283:: geom_boxplot.gTree.273:: (GRID.segments.267, GRID.segments.268, geom_bar.rect.270, geom_bar.rect.272) geom_boxplot.gTree.281:: (GRID.segments.275, GRID.segments.276, geom_bar.rect.278, geom_bar.rect.280)
> boxplots.gTree.301:: geom_boxplot.gTree.291:: (GRID.segments.285, GRID.segments.286, geom_bar.rect.288, geom_bar.rect.290) geom_boxplot.gTree.299:: (GRID.segments.293, GRID.segments.294, geom_bar.rect.296, geom_bar.rect.298)
> geom_jitter.points.303
> geom_jitter.points.305
> guide.rect.357
> guide.rect.358
> ylabel.text.382
> xlabel.text.380
> title

It would be easier to help if we also had the code used to produce this plot, but in the meantime ...

> Could someone be so kind and create the proper call to grid.gedit() to access a couple of different aspects of this graph? I tried:
>
>   grid.gedit(gPath("ylabel.text.382", "labels"), gp = gpar(fontsize = 16))  # error

That is looking for a grob called "labels" that is the child of a grob called "ylabel.text.382". I can see a grob called "ylabel.text.382", but it has no children. Try just ...

  grid.gedit(gPath("ylabel.text.382"), gp = gpar(fontsize = 16))

> I'd like to change the margins on the label for the y axis (not the tick marks) to put more space between the label and the tick marks.

Margins may be tricky because it likely depends on a layout generated by ggplot; Hadley Wickham may have to help us out with a ggplot argument here ... (?)

> I'd also like to remove the left border on the first panel.

I'd guess you'd have to remove the grob "background.rect.345" and then draw in just the sides you want, which would require getting to the right viewport, for which you'll need to study the viewport tree (see current.vpTree()).

> I'd like to adjust the size of the font for the axis labels independently of the tick marks.

That's the one we've already done, right?

> I'd like to change the color of the lines that make up the boxplots.

Something like ...

  grid.gedit("geom_bar.rect", gp = gpar(col = "green"))

...?

> Plus, I'd like to change the margins of the strip labels. If you could show me a couple of examples I'm sure I could get the rest working. Thanks so much, emilio

Again, it would really help to have some code to run.

Paul

--
Dr Paul Murrell
Department of Statistics
The University of Auckland
Private Bag 92019
Auckland
New Zealand
64 9 3737599 x85392
[EMAIL PROTECTED]
http://www.stat.auckland.ac.nz/~paul/
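A self-contained illustration of the gPath/naming point above, on a toy grid scene (the grob name `ylabel.text` is made up here, standing in for ggplot's generated `ylabel.text.382`):

```r
library(grid)

grid.newpage()
## A named text grob standing in for the plot's y-axis label.
grid.text("y label", x = 0.1, rot = 90, name = "ylabel.text")

## Edit it by name: no parent/child gPath is needed when the grob
## itself (rather than one of its children) is the target.
grid.gedit("ylabel.text", gp = gpar(fontsize = 16))
```

grid.get("ylabel.text") afterwards confirms the gpar settings were applied to the grob on the display list.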
Re: [R] Subject: Re: how to include bar values in a barplot?
[EMAIL PROTECTED] wrote:
> Greg, I'm going to join issue with you here! Not that I'll go near advocating Excel-style graphics (abominable, and the Patrick Burns URL which you cite is remarkable in its restraint). Also, I'm aware that this is potential flame-war territory -- again, I want to avoid that too. However, this is the second time you have intervened on this theme (previously Mon 6 August), along with John Kane on Wed 1 August and again today on similar lines, and I think it's time an alternative point of view was presented, to counteract (I hope usefully) what seems to be a draconianly prescriptive approach to the presentation of information.
---snip---

Ted,

You make many excellent points and provide much food for thought. I still think that Greg's points are valid too, and in this particular case, bar plots are a bad choice and adding numbers at variable heights causes a perception error as I wrote previously. Thanks for your elaboration on this important subject.

Frank

> On 07-Aug-07 21:37:50, Greg Snow wrote:
>> Generally adding the numbers to a graph accomplishes 2 things:
>> 1) it acts as an admission that your graph is a failure
> Generally, I disagree. Different elements in a display serve different purposes, according to the psychological aspects of visual perception.
. . .

--
Frank E Harrell Jr
Professor and Chair
School of Medicine
Department of Biostatistics
Vanderbilt University
Re: [R] Regsubsets statistics
On Wed, 8 Aug 2007, Markus Brugger wrote:

> Dear R-help,
> I have used the regsubsets function from the leaps package to do subset selection of a logistic regression model with 6 independent variables and all possible 2-way interactions. As I want to get information about the statistics behind the selection output, I've intensively searched the mailing list to find answers to the following questions:
>
> 1. What should I do to get the statistics behind the selection (e.g. BIC)? summary.regsubsets(object) just returns "*" meaning in or " " meaning out. Since the plot function generates BICs, these values must obviously be computed and available somewhere, but where? Is it possible to directly get AIC values instead of BIC?

These statistics are in the object returned by summary(). Using the first example from the help page:

names(summary(a))
[1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"
summary(a)$bic
[1] -19.60287 -28.61139 -35.65643 -37.23388 -34.55301

> 2. As to the plot function, I've encountered a problem with setting the ylim argument. How can I nevertheless change this setting?

I fear that this (nice!) particular plot function ignores many of these additional arguments. You can't (without modifying the plot function). The ... argument is required for inheritance [ie, required for R CMD check] but it doesn't take graphical parameters.

> 3. Since it is not explicitly mentioned in the manual: can I really use regsubsets for logistic regression?

No. If your data set is large enough relative to the number of variables, you can fit a model with all variables and then apply regsubsets() to the weighted linear model arising from the IWLS algorithm. This will give an approximate ranking of models that you can then refit exactly. This is useful if you wanted to summarize the best few thousand models on 30 variables, but not if you want a single model. On the other hand, regsubsets() isn't useful if you want a single model anyway.

-thomas

Thomas Lumley
Assoc. Professor, Biostatistics
[EMAIL PROTECTED]
University of Washington, Seattle
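To make the summary() point concrete, a sketch on made-up data (requires the leaps package; the variable names are arbitrary):

```r
library(leaps)

set.seed(1)
d <- data.frame(y = rnorm(50), x1 = rnorm(50),
                x2 = rnorm(50), x3 = rnorm(50))

## Best-subsets search, then pull the selection statistics out of summary().
a <- regsubsets(y ~ ., data = d)
s <- summary(a)
names(s)  # includes "bic" among the per-size statistics
s$bic     # one BIC per model size
```

AIC is not among the stored components, but with n fixed it differs from BIC only in the penalty term, so a ranking by AIC can be derived from the same object.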
Re: [R] Subject: Re: how to include bar values in a barplot?
You could put the numbers inside the bars, in which case they would not add to the height of the bars:

x <- 1:5
names(x) <- letters[1:5]
bp <- barplot(x)
text(bp, x - .02 * diff(par("usr")[3:4]), x)

On 8/9/07, Frank E Harrell Jr [EMAIL PROTECTED] wrote:
---snip---
Re: [R] Subject: Re: how to include bar values in a barplot?
Gabor Grothendieck wrote:
> You could put the numbers inside the bars in which case it would not add to the height of the bar:

I think the Cleveland/Tufte prescription would be much different: horizontal dot charts with the numbers in the right margin. I do this frequently with great effect. The Hmisc dotchart2 function makes this easy.

Frank

> x <- 1:5
> names(x) <- letters[1:5]
> bp <- barplot(x)
> text(bp, x - .02 * diff(par("usr")[3:4]), x)
>
> On 8/9/07, Frank E Harrell Jr [EMAIL PROTECTED] wrote:
---snip---

--
Frank E Harrell Jr
Professor and Chair
School of Medicine
Department of Biostatistics
Vanderbilt University
Re: [R] ARIMA fitting
On Tue, 7 Aug 2007, [EMAIL PROTECTED] wrote:

> Hello, I'm trying to fit an ARIMA process, using the stats package's arima function. Can I expect that the fitted model, with any parameters, is stationary, causal and invertible?

Please read ?arima: it answers all your questions, and points out that the answer depends on the arguments passed to arima(). The posting guide did ask you to do this *before* posting: please study it more carefully.

--
Brian D. Ripley, [EMAIL PROTECTED]
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford
[R] Interpret impulse response functions from irf in MSBVAR library
Hello,

I am wondering if anyone knows how to interpret the values returned by the irf function in the MSBVAR library. Some of the literature I have read indicates that impulse responses in the dependent variables are often based on a 1-unit change in the independent variable, but other sources suggest that they are based on a change of 1 standard deviation. Any idea which of these irf uses to compute the impulse responses? The documentation is not very clear.

Thanks, Spencer
[R] AlgDesign expand.formula()
Can anyone explain why AlgDesign's expand.formula help and output differ?

# From help:
# quad(A,B,C) makes ~(A+B+C)^2 + I(A^2) + I(B^2) + I(C^2)
expand.formula(~quad(A+B+C))
# actually gives ~(A + B + C)^2 + I(A + B + C^2)

They don't _look_ the same...

Steve E
[R] Mac OSX fonts in R plots
I had been looking for information about including OSX fonts in R plots for a long time and never quite found the answer. I spent an hour or so gathering together the following solution which, as far as I have tested, works. I'm posting this for feedback and archiving. I'd be interested in any caveats about the brittleness of the technique. Thanks. F
1. Find the system font path: /Library/Fonts/
2. Extract ttf (if necessary) with fondu [http://fondu.sourceforge.net/], e.g., fondu -force Optima.dfont
3. Run ttf2afm on each ttf file, stripping the Copyright and warning lines
4. Copy the .afm files to RHOME/library/grDevices/afm (usually /Library/Frameworks/R.framework/Versions/2.5/Resources/library/grDevices/afm)
5. R code to use the font, e.g.:
newfont <- Type1Font("Optima", c("OptimaRegular.afm", "OptimaBold.afm", "OptimaItalic.afm", "OptimaBoldItalic.afm"))
pdf("newfont.pdf", version = 1.4, family = newfont)
plot(rnorm, col = "red")
dev.off()
[R] Memory problem
I got a long list of error messages repeating the following 3 lines when running the loop at the end of this mail:
R(580,0xa000ed88) malloc: *** vm_allocate(size=327680) failed (error code=3)
R(580,0xa000ed88) malloc: *** error: can't allocate region
R(580,0xa000ed88) malloc: *** set a breakpoint in szone_error to debug
There are 2 big arrays, IData (54x64x50x504) and Stat (4x64x50x9), in the code. They would only use about 0.8GB of memory. However, when I check the memory usage during the looping, the memory usage keeps growing and finally reaches the memory limit of my computer, 4GB, and spills the above error message. Is there something in the loop about lme that is causing a memory leak? How can I clean up the memory usage in the loop? Thank you very much for your help, Gang
tag <- 0; dimx <- 54; dimy <- 64; dimz <- 50; NoF <- 8; NoFile <- 504
IData <- array(data=NA, dim=c(dimx, dimy, dimz, NoFile))
Stat <- array(data=NA, dim=c(dimx, dimy, dimz, NoF))
for (i in 1:NoFile) {
  IData[,,,i] <- # fill in the data for array IData here
}
for (i in 1:dimx) { for (j in 1:dimy) { for (k in 1:dimz) {
  for (m in 1:NoFile) {
    Model$Beta[m] <- IData[i, j, k, m]
  }
  try(fit.lme <- lme(Beta ~ group*session*difficulty+FTND, random = ~1|Subj, Model), tag <- 1)
  if (tag != 1) {
    Stat[i, j, k,] <- anova(fit.lme)$F[-1]
  } else {
    Stat[i, j, k,] <- rep(0, NoF-1)
  }
  tag <- 0
} } }
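A hedged sketch of one way to keep the loop's footprint flat (the names fit.lme, Model, Stat, NoF are taken from the post above; whether this cures the growth depends on what lme retains internally): drop the fitted model as soon as its F statistics are extracted, and collect garbage periodically.

```r
# Inner-loop sketch: keep only the small anova vector, not the model.
fit.lme <- try(lme(Beta ~ group * session * difficulty + FTND,
                   random = ~ 1 | Subj, data = Model))
if (!inherits(fit.lme, "try-error")) {
  Stat[i, j, k, ] <- anova(fit.lme)$F[-1]
} else {
  Stat[i, j, k, ] <- rep(0, NoF - 1)
}
rm(fit.lme)               # release the large model object each iteration
if (k %% 10 == 0) gc()    # occasionally force memory back to the allocator
```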
Re: [R] Rcmdr window border lost
Andy Weller wrote: OK, I tried completely removing and reinstalling R, but this has not worked - I am still missing window borders for Rcmdr. I am certain that everything is installed correctly and that all dependencies are met - there must be something trivial I am missing?! Thanks in advance, Andy Andy Weller wrote: Dear all, I have recently lost my Rcmdr window borders (all my other programs have borders)! I am unsure of what I have done, although I have recently run update.packages() in R... How can I reclaim them? I am using: Ubuntu Linux (Feisty), R version 2.5.1, R Commander Version 1.3-5. This sort of behaviour is usually the fault of the window manager, not R/Rcmdr/tcltk. It's the WM's job to supply the various window decorations on a new window, so either it never got told that there was a window, or it somehow got into a confused state. Did you try restarting the WM (i.e., log out/in or reboot)? And which WM are we talking about? The same combination works fine on Fedora 7, except for a load of messages saying Warning: X11 protocol error: BadWindow (invalid Window parameter) I have deleted the folder /usr/local/lib/R/site-library/Rcmdr and reinstalled Rcmdr with install.packages("Rcmdr", dep=TRUE). This has not solved my problem though. Maybe I need to reinstall something else as well? Thanks in advance, Andy
Re: [R] tcltk error on Linux
Hi Mark, Prof Brian Ripley [EMAIL PROTECTED] writes: On Thu, 9 Aug 2007, Mark W Kimpel wrote: I am having trouble getting the tcltk package to load on openSuse 10.2 running R-devel. I have specifically put my /usr/share/tcl directory in my PATH, but R doesn't seem to see it. I also have installed tk on my system. Any ideas on what the problem is? Any chance you are running R on a remote server using an ssh session? If that is the case, you may have an ssh/X11 config issue that prevents using tcl/tk from such a session. Rerun the configure script for R and verify that tcl/tk support is listed in the summary. Also, note that I have some warning messages on starting up R, not sure what they mean or if they are pertinent. Those are coming from a Bioconductor package: again you must be using development versions with R-devel and those are not stable (last time I looked even Biobase would not install, and the packages change daily). BioC devel tracks R-devel, but not on a daily basis -- because R changes daily. The recent issues with Biobase are a result of changes to R and have already been fixed. If you have all those packages in your startup, please don't -- there will be a considerable performance hit so only load them when you need them. Presumably, that's why they are there in the first place. The warning messages are a problem and suggest some needed improvements to the methods packages. These are being worked on. + seth -- Seth Falcon | Computational Biology | Fred Hutchinson Cancer Research Center BioC: http://bioconductor.org/ Blog: http://userprimary.net/user/
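As a quick first check before rerunning configure (a sketch; this only tests compiled-in support, not the X11 setup):

```r
# TRUE means this R build was configured with tcl/tk support;
# FALSE means the configure summary mentioned above will show it missing.
capabilities("tcltk")
# If TRUE but loading still fails over ssh, check X11 forwarding
# (ssh -X or -Y) before suspecting the tcl installation itself.
```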
[R] Systematically biased count data regression model
Dear all, I am attempting to explain patterns of arthropod family richness (count data) using a regression model. It seems to be able to do a pretty good job as an explanatory model (i.e. demonstrating relationships between dependent and independent variables), but it has systematic problems as a predictive model: it is biased high at low observed values of family richness and biased low at high observed values of family richness (see attached pdf). I have tried diverse kinds of reasonable regression models, mostly as in Zeileis et al. (2007), as well as transforming my variables, both with only small improvements. Do you have suggestions for making a model that would perform better as a predictive model? Thank you for your time. Sincerely, Matthew Bowser STEP student USFWS Kenai National Wildlife Refuge Soldotna, Alaska, USA M.Sc. student University of Alaska Fairbanks Fairbanks, Alaska, USA Reference Zeileis, A., C. Kleiber, and S. Jackman, 2007. Regression models for count data in R. Technical Report 53, Department of Statistics and Mathematics, Wirtschaftsuniversität Wien, Wien, Austria. URL http://cran.r-project.org/doc/vignettes/pscl/countreg.pdf.
Code `data` - structure(list(D = c(4, 5, 12, 4, 9, 15, 4, 8, 3, 9, 6, 17, 4, 9, 6, 9, 3, 9, 7, 11, 17, 3, 10, 8, 9, 6, 7, 9, 7, 5, 15, 15, 12, 9, 10, 4, 4, 15, 7, 7, 12, 7, 12, 7, 7, 7, 5, 14, 7, 13, 1, 9, 2, 13, 6, 8, 2, 10, 5, 14, 4, 13, 5, 17, 12, 13, 7, 12, 5, 6, 10, 6, 6, 10, 4, 4, 12, 10, 3, 4, 4, 6, 7, 15, 1, 8, 8, 5, 12, 0, 5, 7, 4, 9, 6, 10, 5, 7, 7, 14, 3, 8, 15, 14, 7, 8, 7, 8, 8, 10, 9, 2, 7, 8, 2, 6, 7, 9, 3, 20, 10, 10, 4, 2, 8, 10, 10, 8, 8, 12, 8, 6, 16, 10, 5, 1, 1, 5, 3, 11, 4, 9, 16, 3, 1, 6, 5, 5, 7, 11, 11, 5, 7, 5, 3, 2, 3, 0, 3, 0, 4, 1, 12, 16, 9, 0, 7, 0, 11, 7, 9, 4, 16, 9, 10, 0, 1, 9, 15, 6, 8, 6, 4, 6, 7, 5, 7, 14, 16, 5, 8, 1, 8, 2, 10, 9, 6, 11, 3, 16, 3, 6, 8, 12, 5, 1, 1, 3, 3, 1, 5, 15, 4, 2, 2, 6, 5, 0, 0, 0, 3, 0, 16, 0, 9, 0, 0, 8, 1, 2, 2, 3, 4, 17, 4, 1, 4, 6, 4, 3, 15, 2, 2, 13, 1, 9, 7, 7, 13, 10, 11, 2, 15, 7), Day = c(159, 159, 159, 159, 166, 175, 161, 168, 161, 166, 161, 166, 161, 161, 161, 175, 161, 175, 161, 165, 176, 161, 163, 161, 168, 161, 161, 161, 161, 161, 165, 176, 175, 176, 163, 175, 163, 168, 163, 176, 176, 165, 176, 175, 161, 163, 163, 168, 163, 175, 167, 176, 167, 165, 165, 169, 165, 169, 165, 161, 165, 175, 165, 176, 175, 167, 167, 175, 167, 164, 167, 164, 181, 164, 167, 164, 176, 164, 167, 164, 167, 164, 167, 175, 167, 173, 176, 173, 178, 167, 173, 172, 173, 178, 178, 172, 181, 182, 173, 162, 162, 173, 178, 173, 172, 162, 173, 162, 173, 162, 173, 170, 178, 166, 166, 162, 166, 177, 166, 170, 166, 172, 172, 166, 172, 166, 174, 162, 164, 162, 170, 164, 170, 164, 170, 164, 177, 164, 164, 174, 174, 162, 170, 162, 172, 162, 165, 162, 165, 177, 172, 162, 170, 162, 170, 174, 165, 174, 166, 172, 174, 172, 174, 170, 170, 165, 170, 174, 174, 172, 174, 172, 174, 165, 170, 165, 170, 174, 172, 174, 172, 175, 175, 170, 171, 174, 174, 174, 172, 175, 171, 175, 174, 174, 174, 175, 172, 171, 171, 174, 160, 175, 160, 171, 170, 175, 170, 170, 160, 160, 160, 171, 171, 171, 171, 160, 160, 160, 171, 171, 176, 171, 176, 176, 171, 
176, 171, 176, 176, 176, 176, 159, 166, 159, 159, 166, 168, 169, 159, 168, 169, 166, 163, 180, 163, 165, 164, 180, 166, 166, 164, 164, 177, 166), NDVI = c(0.187, 0.2, 0.379, 0.253, 0.356, 0.341, 0.268, 0.431, 0.282, 0.181, 0.243, 0.327, 0.26, 0.232, 0.438, 0.275, 0.169, 0.288, 0.138, 0.404, 0.386, 0.194, 0.266, 0.23, 0.333, 0.234, 0.258, 0.333, 0.234, 0.096, 0.354, 0.394, 0.304, 0.162, 0.565, 0.348, 0.345, 0.226, 0.316, 0.312, 0.333, 0.28, 0.325, 0.243, 0.194, 0.29, 0.221, 0.217, 0.122, 0.289, 0.475, 0.048, 0.416, 0.481, 0.159, 0.238, 0.183, 0.28, 0.32, 0.288, 0.24, 0.287, 0.363, 0.367, 0.24, 0.55, 0.441, 0.34, 0.295, 0.23, 0.32, 0.184, 0.306, 0.232, 0.289, 0.341, 0.221, 0.333, 0.17, 0.139, 0.2, 0.204, 0.301, 0.253, -0.08, 0.309, 0.232, 0.23, 0.239, -0.12, 0.26, 0.285, 0.45, 0.348, 0.396, 0.311, 0.318, 0.31, 0.261, 0.441, 0.147, 0.283, 0.339, 0.224, 0.5, 0.265, 0.2, 0.287, 0.398, 0.116, 0.292, 0.045, 0.137, 0.542, 0.171, 0.38, 0.469, 0.325, 0.139, 0.166, 0.247, 0.253, 0.466, 0.26, 0.288, 0.34, 0.288, 0.26, 0.178, 0.274, 0.358, 0.285, 0.225, 0.162, 0.223, 0.301, -0.398, -0.2, 0.239, 0.228, 0.255, 0.166, 0.306, 0.28, 0.279, 0.208, 0.377, 0.413, 0.489, 0.417, 0.333, 0.208, 0.232, 0.431, 0.283, 0.241, 0.105, 0.18, -0.172, -0.374, 0.25, 0.043, 0.215, 0.204, 0.19, 0.177, -0.106, -0.143, 0.062, 0.462, 0.256, 0.229, 0.314, 0.415, 0.307, 0.238, -0.35, 0.34, 0.275, 0.097, 0.353, 0.214, 0.435, 0.055, -0.289, 0.239, 0.186, 0.135, 0.259, 0.268, 0.258, 0.032, 0.489, 0.389, 0.298, 0.164, 0.325, 0.254, -0.059, 0.524, 0.539, 0.25, 0.175, 0.326, 0.302, -0.047, -0.301, -0.149, 0.358, 0.495, 0.311, 0.235, 0.558, -0.156, 0, 0.146, 0.329, -0.069, -0.352, -0.356, -0.206, -0.179, 0.467, -0.325, 0.39, -0.399, -0.165, 0.267, -0.334, -0.17, 0.58, 0.228, 0.234, 0.351, 0.3, -0.018, 0.125, 0.176, 0.322, 0.246,
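For readers wanting to try the kinds of models referred to (Zeileis et al. 2007), here is a hedged sketch on simulated data, not the survey data above; the variable names D, Day, NDVI mirror the post, but the values and coefficients are invented.

```r
# Hedged sketch on made-up data: compare a Poisson GLM with a negative
# binomial fit, the usual first step when counts are overdispersed.
library(MASS)
set.seed(1)
n <- 250
NDVI <- runif(n, -0.4, 0.6)
Day <- sample(159:182, n, replace = TRUE)
D <- rnbinom(n, mu = exp(1.2 + 1.5 * NDVI), size = 5)
dat <- data.frame(D, Day, NDVI)
m_pois <- glm(D ~ NDVI + Day, family = poisson, data = dat)
m_nb <- glm.nb(D ~ NDVI + Day, data = dat)
# plotting fitted vs observed shows the shrinkage toward the mean that
# the post describes; overdispersion generally favors the NB fit:
c(AIC(m_pois), AIC(m_nb))
```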
Re: [R] Subject: Re: how to include bar values in a barplot?
Ted, Thanks for your thoughts. I don't take it as the start of a flame war (I don't want that either). My original intent was to get the original posters out of the mode of thinking they want to match what the spreadsheet does and into thinking about what message they are trying to get across. To get them (and possibly others) thinking, I made the statements a bit more bold than my actual position (I did include a couple of qualifiers). Now that there have been a couple of days to think about it, your post adds some good depth to the discussion. I think the most important point (which I think we agree on) is not to just add something to a graph because you can (or someone else did), but to think through whether it is beneficial or not (which will depend on the graph, data, questions, etc.). There are ways to combine graphs and tables; sparklines are an upcoming way of including the power of graphs in a table. Another approach for the bar graph example would be to first replace the bar graph with a dotplot, then put the numbers into the margin so that they are properly lined up and not distracting from the points. I still think that anytime anyone is tempted to add data values to a graph they should ask themselves if that is an admission that the graph is not appropriate and would be better replaced by either a table (if the goal really is to look up specific values) or a better graph. Sometimes the answer will be yes: the question of interest, or the obvious follow-up question, will be answered by adding some additional information. Then the next question should be: which information to include? And where to put it? Can you imagine what Minard's graph would have looked like if he had included the numbers every time the total changed by 100, and put the temperatures as numbers instead of a line graph in the main plot at every 1 degree change? Thanks for adding depth to the discussion, -- Gregory (Greg) L. Snow Ph.D.
Statistical Data Center Intermountain Healthcare [EMAIL PROTECTED] (801) 408-8111 -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of [EMAIL PROTECTED] Sent: Wednesday, August 08, 2007 3:53 PM To: r-help@stat.math.ethz.ch Subject: [R] Subject: Re: how to include bar values in a barplot? Greg, I'm going to join issue with you here! Not that I'll go near advocating Excel-style graphics (abominable, and the Patrick Burns URL which you cite is remarkable in its restraint). Also, I'm aware that this is potential flame-war territory -- again, I want to avoid that too. However, this is the second time you have intervened on this theme (previously Mon 6 August), along with John Kane on Wed 1 August and again today on similar lines, and I think it's time an alternative point of view was presented, to counteract (I hope usefully) what seems to be a draconianly prescriptive approach to the presentation of information. On 07-Aug-07 21:37:50, Greg Snow wrote: Generally adding the numbers to a graph accomplishes 2 things: 1) it acts as an admission that your graph is a failure Generally, I disagree. Different elements in a display serve different purposes, according to the psychological aspects of visual perception. Sizes, proportions, colours etc. of shapes (bars in a histogram, the marks representing points in a scatterplot, ...) are interpreted, so to speak, intuitively -- the resulting perception is formed by processes which are hard to ascertain consciously, and the overall effect can only be ascertained by looking at it, and noting what impression one has formed. They stimulate mental responses in the domain of perception of spatial relationships. Numbers, and text, on the other hand, while still shapes from the optical point of view, up to the point of their impact on the retina, provoke different perceptions. They are interpreted analytically, stimulating mental responses in the domains of language and number.
There is no Law whatever which requires that the two must be separated. It may be that adding any annotation to a graph or diagram will interfere with the intuitive interpretation that the diagram is intended to stimulate, with no associated benefit. It may be that presenting numerical/textual information within a graphical/diagrammatic context will interfere with the analytic interpretation which is desired, with no associated benefit. In such cases, it is clearly (and as a matter of fact to be decided in each case) better to separate the two aspects. It may, however, be that both can be combined in such a way that each enhances the other; and also the simultaneous perception of both aspects induces a Cartesian-product richness of interpretation where each element of the graphical presentation combines with each element of the textual/numerical presentation to generate a perception which could not possibly have been realised
Re: [R] Subject: Re: how to include bar values in a barplot?
Gabor, Putting the numbers in the bars is an improvement over putting them over the bars, but if the numbers are large relative to the bars, this could still create a fuzzy top to the bars, making them harder to compare. This also has the problem of the poorly laid out table: numbers are easiest to compare if they are aligned (and vertical comparisons are easier than horizontal). There is also the issue of scale. You can shrink a barplot quite a bit and still get a good overview of the relationships, but if you need the numbers inside the plot, then either the numbers become too small to easily read, or the numbers stay big and overwhelm the plot. The best approach is to switch to a dotplot with the numbers in the margin (Frank has suggested this as well). If you need to stay with the bar plot (some lay people are still more comfortable with them until we can educate them to prefer the dot plots) then I would suggest doing horizontal bars with the numbers in the margin (vertically aligned). If vertical bars are necessary, then putting the numbers below the bars (but separated enough that they don't interfere with a clear zero point) seems the safest approach. -- Gregory (Greg) L. Snow Ph.D. Statistical Data Center Intermountain Healthcare [EMAIL PROTECTED] (801) 408-8111 -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Gabor Grothendieck Sent: Thursday, August 09, 2007 6:55 AM To: Frank E Harrell Jr Cc: r-help@stat.math.ethz.ch; [EMAIL PROTECTED] Subject: Re: [R] Subject: Re: how to include bar values in a barplot? You could put the numbers inside the bars, in which case it would not add to the height of the bar:
x <- 1:5
names(x) <- letters[1:5]
bp <- barplot(x)
text(bp, x - .02 * diff(par("usr")[3:4]), x)
On 8/9/07, Frank E Harrell Jr [EMAIL PROTECTED] wrote: [EMAIL PROTECTED] wrote: Greg, I'm going to join issue with you here!
Not that I'll go near advocating Excel-style graphics (abominable, and the Patrick Burns URL which you cite is remarkable in its restraint). Also, I'm aware that this is potential flame-war territory -- again, I want to avoid that too. However, this is the second time you have intervened on this theme (previously Mon 6 August), along with John Kane on Wed 1 August and again today on similar lines, and I think it's time an alternative point of view was presented, to counteract (I hope usefully) what seems to be a draconianly prescriptive approach to the presentation of information. ---snip--- Ted, You make many excellent points and provide much food for thought. I still think that Greg's points are valid too, and in this particular case, bar plots are a bad choice and adding numbers at variable heights causes a perception error, as I wrote previously. Thanks for your elaboration on this important subject. Frank On 07-Aug-07 21:37:50, Greg Snow wrote: Generally adding the numbers to a graph accomplishes 2 things: 1) it acts as an admission that your graph is a failure Generally, I disagree. Different elements in a display serve different purposes, according to the psychological aspects of visual perception. . . . -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University
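A minimal sketch of the dotplot-with-margin-numbers alternative recommended in this thread, using base graphics (the data are invented):

```r
# Dotplot with the values vertically aligned in the right margin, so the
# numbers can be looked up without distorting comparison of the points.
x <- c(a = 3, b = 7, c = 5, d = 9, e = 4)
op <- par(mar = c(4, 4, 1, 4))
dotchart(x, xlim = c(0, 10), pch = 16)
mtext(x, side = 4, at = seq_along(x), las = 1, line = 1)
par(op)
```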
Re: [R] Systematically biased count data regression model
Matthew, it is possible that your results are suffering from heterogeneity. It may be that your model performs well at the aggregate level, and this would explain good aggregate fit levels and decent predictive performance, etc. You could perhaps look at a 'latent' approach to modelling your data; in other words, see if there is something unique in the cases/data/observations in the lower and upper levels of the model (where prediction is poor) and whether it is justified that you model these count areas as separate and unique from the generic aggregate-level model (in other words, there may be something unobserved/unmeasured or latent in your population of observations that could be causing some observations to behave uniquely overall). HTH, thanks, Paul - Original Message - From: Matthew and Kim Bowser [EMAIL PROTECTED] To: r-help@stat.math.ethz.ch Sent: Friday, August 10, 2007 1:43 AM Subject: [R] Systematically biased count data regression model Dear all, I am attempting to explain patterns of arthropod family richness (count data) using a regression model. It seems to be able to do a pretty good job as an explanatory model (i.e. demonstrating relationships between dependent and independent variables), but it has systematic problems as a predictive model: it is biased high at low observed values of family richness and biased low at high observed values of family richness (see attached pdf). I have tried diverse kinds of reasonable regression models, mostly as in Zeileis et al. (2007), as well as transforming my variables, both with only small improvements. Do you have suggestions for making a model that would perform better as a predictive model? Thank you for your time. Sincerely, Matthew Bowser STEP student USFWS Kenai National Wildlife Refuge Soldotna, Alaska, USA M.Sc. student University of Alaska Fairbanks Fairbanks, Alaska, USA Reference Zeileis, A., C. Kleiber, and S. Jackman, 2007. Regression models for count data in R.
Technical Report 53, Department of Statistics and Mathematics, Wirtschaftsuniversität Wien, Wien, Austria. URL http://cran.r-project.org/doc/vignettes/pscl/countreg.pdf.
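The 'latent' suggestion can be prototyped as a finite mixture of count regressions, e.g. with the flexmix package. This is a sketch under the assumptions that the data frame and variables are those posted and that two components suffice; k = 2 is arbitrary.

```r
# Hedged sketch of a latent-class (mixture) count model with flexmix:
# observations are softly assigned to components, which may separate the
# poorly predicted low/high-count cases from the bulk.
library(flexmix)
mix <- flexmix(D ~ NDVI + Day, data = data, k = 2,
               model = FLXMRglm(family = "poisson"))
summary(mix)
```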
Re: [R] small sample techniques
Thanks, that discussion was helpful. Well, I have another question. I am comparing two proportions for deviation from the hypothesized difference of zero. My manually calculated z ratio is 1.94. But when I calculate it using prop.test, it uses Pearson's chi-squared test, and the X-squared value that it gives is 0.74. Is there a function in R where I can calculate the z ratio? That is,
Z = (('p1 - 'p2) - (p1 - p2)) / S('p1 - 'p2)
where 'p1 and 'p2 are the sample proportions and S('p1 - 'p2) is the standard error estimate of the difference between two independent proportions. Dummy example: this is how I use it: prop.test(c(30,23),c(300,300)) Cheers../Murli -Original Message- From: Moshe Olshansky [mailto:[EMAIL PROTECTED] Sent: Thursday, August 09, 2007 12:01 AM To: Rolf Turner; r-help@stat.math.ethz.ch Cc: Nair, Murlidharan T; Moshe Olshansky Subject: Re: [R] small sample techniques Well, this is an explanation of what is done in the paired t-test (and why the number of df is as it is). I was too lazy to write all this. It is nice that some list members are less lazy! --- Rolf Turner [EMAIL PROTECTED] wrote: On 9/08/2007, at 2:57 PM, Moshe Olshansky wrote: As Thomas Lumley noted, there exist several versions of the t-test. snip If you use t3 <- t.test(x, y, paired=TRUE) then equal sample sizes are assumed and the number of degrees of freedom is 4 (5-1). This is seriously misleading. The assumption is not that the sample sizes are equal, but rather that there is ***just one sample***, namely the sample of differences. More explicitly, the assumptions are that x_i - y_i are i.i.d. Gaussian with mean mu and variance sigma^2. One is trying to conduct inference about mu, of course. It should also be noted that it is a crucial assumption for the ``non-paired'' t-test that the two samples be ***independent*** of each other, as well as being Gaussian. None of this is however germane to Nair's original question; it is clear that he is interested in a two-independent-sample t-test.
cheers, Rolf Turner
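For the z-ratio question earlier in this thread: a hedged sketch of the pooled two-proportion z statistic by hand, using the dummy counts from the post. Note that prop.test with correct=FALSE returns a chi-squared statistic that is exactly the square of this z; the continuity correction (on by default) is why the two numbers look unrelated.

```r
# Pooled two-proportion z statistic, computed directly.
x <- c(30, 23); n <- c(300, 300)
p <- x / n
p_pool <- sum(x) / sum(n)
se <- sqrt(p_pool * (1 - p_pool) * sum(1 / n))
z <- (p[1] - p[2]) / se
# Without the continuity correction, X-squared equals z^2:
chisq <- prop.test(x, n, correct = FALSE)$statistic
all.equal(unname(chisq), z^2)
```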
Re: [R] Help using gPath
Hi Paul, I'm sorry for not posting code; I wasn't sure if it would be helpful without the data... should I post the code and a sample of the data? I will remember to do that next time! grid.gedit(gPath("ylabel.text.382"), gp=gpar(fontsize=16)) OK, I think my confusion comes from the notation that current.grobTree() produces and what strings are required in order to make changes to the underlying grobs. But, from what you've provided, it looks like I can access each grob by its unique name, regardless of which parent it is nested in... that helps. I'd like to remove the left border on the first panel. I'd guess you'd have to remove the grob background.rect.345 and then draw in just the sides you want, which would require getting to the right viewport, for which you'll need to study the viewport tree (see current.vpTree()) I did some digging into this and it seems pretty complicated; is there an example anywhere that makes sense to the beginner? The whole viewport/grob relationship is not clear to me. So, accessing viewports and removing objects and drawing new ones is beyond me at this point. I can get my mind around your example below because I can see the object I want to modify in the viewer, and the code changes a property of that object, click enter, and bang, the object changes. When you start talking external pointers and finding viewports and pushing and popping grobs I just get lost. I found the viewports for the grobTree; it looks like this: viewport[ROOT]->(viewport[layout]->(viewport[axis_h_1_1]->(viewport[bottom_axis]->(viewport[labels], viewport[ticks])), viewport[axis_h_1_2]->(viewport[bottom_axis]->(viewport[labels], viewport[ticks])), viewport[axis_v_1_1]->(viewport[left_axis]->(viewport[labels], viewport[ticks])), viewport[panel_1_1], viewport[panel_1_2], viewport[strip_h_1_1], viewport[strip_h_1_2], viewport[strip_v_1_1])) at that point I was like, ok, I'm done. :S Something like ...
grid.gedit("geom_bar.rect", gp=gpar(col="green")) Again, it would really help to have some code to run. My apologies, I thought the grobTree was sufficient in this case. Thanks very much for your help. emilio
[R] How to apply functions over rows of multiple matrices
Dear ExpRts, I would like to perform a function with two arguments over the rows of two matrices. There are a couple of *applys (including mApply in Hmisc) but I haven't found a straightforward way to do it. Applying over row indices works, but looks like a poor hack to me:
sens <- function(test, gold) {
  if (any(gold==1)) {
    sum(test[which(gold==1)]/sum(which(gold==1)))
  } else NA
}
numtest <- 6
numsubj <- 20
newtest <- array(rbinom(numtest*numsubj, 1, .5), dim=c(numsubj, numtest))
goldstandard <- array(rbinom(numtest*numsubj, 1, .5), dim=c(numsubj, numtest))
t(sapply(1:nrow(newtest), function(i) { sens(newtest[i,], goldstandard[i,]) }))
Is there any shortcut to sapply over the indices? Best wishes Johannes
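One possible shortcut (a sketch, reusing sens, newtest and goldstandard as defined above): split both matrices into lists of rows and mapply the two-argument function across them.

```r
# row(newtest) labels each element with its row index, so split() yields
# a list of row vectors; mapply then pairs up rows of the two matrices.
res <- mapply(sens,
              split(newtest, row(newtest)),
              split(goldstandard, row(goldstandard)))
```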
[R] Seasonality
I have a time series x = f(t), where t is taken for each month. What is the best function to detect if _x_ has a seasonal variation? If there is such a seasonal effect, what is the best function to estimate it? Function arima has a seasonal parameter, but I guess this is too complex to be useful. Alberto Monteiro
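Two standard starting points, sketched on simulated monthly data (the series x here is invented):

```r
# Build a monthly ts object, then inspect the seasonal component.
set.seed(42)
x <- ts(rnorm(120) + rep(sin(2 * pi * (1:12) / 12), 10), frequency = 12)
dec <- stl(x, s.window = "periodic")  # seasonal/trend/remainder decomposition
monthplot(x)                          # month-by-month subseries plot
# the size of dec$time.series[, "seasonal"] relative to the remainder
# indicates how strong the seasonal variation is
```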
Re: [R] How to apply functions over rows of multiple matrices
Is sens really what you want? The denominator is the sum of the indexes, e.g. if a row in goldstandard were c(0, 0, 1, 1) then you would be dividing by 3+4. Also test[which(gold == 1)] is the same as test[gold == 1], which is the same as test * gold since gold has only 0's and 1's in it. Perhaps what you really intend is to take the average over those elements in each row of the first matrix which correspond to 1's in the corresponding row of the second. In that case it's just: rowSums(newtest * goldstandard) / rowSums(goldstandard) On 8/9/07, Johannes Hüsing [EMAIL PROTECTED] wrote: Dear ExpRts, I would like to perform a function with two arguments over the rows of two matrices. There are a couple of *applys (including mApply in Hmisc) but I haven't found a straightforward way to do it. Applying over row indices works, but looks like a poor hack to me: sens <- function(test, gold) { if (any(gold==1)) { sum(test[which(gold==1)]/sum(which(gold==1))) } else NA } numtest <- 6 numsubj <- 20 newtest <- array(rbinom(numtest*numsubj, 1, .5), dim=c(numsubj, numtest)) goldstandard <- array(rbinom(numtest*numsubj, 1, .5), dim=c(numsubj, numtest)) t(sapply(1:nrow(newtest), function(i) { sens(newtest[i,], goldstandard[i,]) })) Is there any shortcut to sapply over the indices? Best wishes Johannes
[R] Need Help: Installing/Using xtable package
Hi all, Let me know if I should instead ask this question of the Bioconductor group. I used the Bioconductor utility to install this package and also CRAN's install.packages function. My computer crashed a week ago; today I reinstalled all my Bioconductor/R packages. One of my scripts is giving me the following error. In my script I have:

library(xtable)
print.xtable(

and receive this error:

Error : could not find function "print.xtable"

This is a new error and I cannot find the cause. I reinstalled xtable with the messages below (which are the same whether I use CRAN or Bioconductor). Any help is appreciated! Thanks! Matt

> biocLite("xtable")
Running biocinstall version 2.0.8 with R version 2.5.1
Your version of R requires version 2.0 of Bioconductor.
Warning in install.packages(pkgs = pkgs, repos = repos, dependencies = dependencies, :
  argument 'lib' is missing: using '/home/mdj/R/i486-pc-linux-gnu-library/2.5'
trying URL 'http://cran.fhcrc.org/src/contrib/xtable_1.5-1.tar.gz'
Content type 'application/x-gzip' length 134758 bytes
opened URL
downloaded 131Kb
* Installing *source* package 'xtable' ...
** R
** data
** inst
** preparing package for lazy loading
** help
 Building/Updating help pages for package 'xtable'
** building package indices ...
* DONE (xtable)
The downloaded packages are in /tmp/RtmpGThCuI/downloaded_packages
Re: [R] Memory Experimentation: Rule of Thumb = 10-15 Times the Memory
Hi, I've been having similar experiences and haven't been able to substantially improve the efficiency using the guidance in the I/O Manual. Could anyone advise on how to improve the following scan()? It is not based on my real file; please assume that I do need to read in characters and can't do any pre-processing of the file, etc.

## Create Sample File
write.csv(matrix(as.character(1:1e6), ncol = 10, byrow = TRUE), "big.csv", row.names = FALSE)
q()

**New Session**
# R
system("ls -l big.csv")
system("free -m")
big1 <- matrix(scan("big.csv", sep = ",", what = character(0), skip = 1, n = 1e6), ncol = 10, byrow = TRUE)
system("free -m")

The file is approximately 9MB, but approximately 50-60MB is used to read it in. object.size(big1) is 56MB, or 56 bytes per string, which seems excessive.

Regards, Mike

Configuration info:
> sessionInfo()
R version 2.5.1 (2007-06-27)
x86_64-redhat-linux-gnu
locale: C
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
# uname -a
Linux ***.com 2.6.9-023stab044.4-smp #1 SMP Thu May 24 17:20:37 MSD 2007 x86_64 x86_64 x86_64 GNU/Linux

== Quoted Text
From: Prof Brian Ripley ripley_at_stats.ox.ac.uk
Date: Tue, 26 Jun 2007 17:53:28 +0100 (BST)

The R Data Import/Export Manual points out several ways in which you can use read.csv more efficiently.

On Tue, 26 Jun 2007, ivo welch wrote: dear R experts: I am of course no R expert, but use it regularly. I thought I would share some experimentation with memory use. I run a linux machine with about 4GB of memory, and R 2.5.0. Upon startup, gc() reports

         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 268755 14.4     407500 21.8       35 18.7
Vcells 139137  1.1     786432  6.0   444750  3.4

This is my baseline. linux 'top' reports 48MB as baseline. This includes some of my own routines that are always loaded. Good.

Next, I created an s.csv file with 22 variables and 500,000 observations, taking up an uncompressed disk space of 115MB. The resulting object.size() after a read.csv() is 84,002,712 bytes (80MB).

s <- read.csv("s.csv"); object.size(s)
[1] 84002712

Here is where things get more interesting. After the read.csv() is finished, gc() reports

          used (Mb) gc trigger  (Mb) max used  (Mb)
Ncells  270505 14.5    8349948 446.0 11268682 601.9
Vcells 10639515 81.2  34345544 262.1 42834692 326.9

I was a bit surprised by this---R had 928MB of memory in use at peak. More interestingly, this is also similar to what linux 'top' reports as memory use of the R process (919MB, probably 1024 vs. 1000 B/MB), even after the read.csv() is finished and gc() has been run. Nothing seems to have been released back to the OS. Now,

rm(s)
gc()
         used (Mb) gc trigger  (Mb) max used  (Mb)
Ncells 270541 14.5    6679958 356.8 11268755 601.9
Vcells 139481  1.1   27476536 209.7 42807620 326.6

linux 'top' now reports 650MB of memory use (though R itself uses only 15.6Mb). My guess is that it keeps the trigger memory of 567MB plus the base 48MB. There are two interesting observations for me here: first, to read a .csv file, I need at least 10-15 times as much memory as the file that I want to read---a lot more than the factor of 3-4 that I had expected. The moral is that IF R can read a .csv file, one need not worry too much about running into memory constraints later on. {R Developers---reducing read.csv's memory requirement a little would be nice. Of course, you have more than enough on your plate already.} Second, memory is not returned fully to the OS. This is not necessarily a bad thing, but good to know. Hope this helps... Sincerely, /iaw

-- Brian D. Ripley, ripley_at_stats.ox.ac.uk Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self), +44 1865 272866 (PA), Fax: +44 1865 272595, 1 South Parks Road, Oxford OX1 3TG, UK
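[Editorial note: a sketch of the standard remedy from the Data Import/Export manual, under the assumption that all ten columns of the sample file really are character. Declaring colClasses and nrows lets read.csv skip type-guessing and pre-allocate, which trims both time and peak memory.]

```r
# Create the same sample file as in the message above.
write.csv(matrix(as.character(1:1e6), ncol = 10, byrow = TRUE),
          "big.csv", row.names = FALSE)

# Hint the column types and row count up front so read.csv neither
# guesses classes nor repeatedly grows its internal buffers.
big <- read.csv("big.csv",
                colClasses = rep("character", 10),
                nrows = 1e5,
                comment.char = "")
stopifnot(dim(big) == c(1e5, 10))
```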
Re: [R] Memory Experimentation: Rule of Thumb = 10-15 Times the Memory
If we add quote = FALSE to the write.csv statement it's twice as fast reading it in.

On 8/9/07, Michael Cassin [EMAIL PROTECTED] wrote: [Michael's message of earlier today, quoted in full above, omitted]
Re: [R] Need Help: Installing/Using xtable package
M. Jankowski wrote: [question about the print.xtable error, quoted in full above] Looks like the current xtable is no longer exporting its print methods. Why were you calling print.xtable explicitly in the first place?
Re: [R] Memory Experimentation: Rule of Thumb = 10-15 Times the Memory
Thanks for looking, but my file has quotes. It's also 400MB, and I don't mind waiting, but I don't have 6x the memory to read it in.

On 8/9/07, Gabor Grothendieck [EMAIL PROTECTED] wrote: If we add quote = FALSE to the write.csv statement it's twice as fast reading it in. On 8/9/07, Michael Cassin [EMAIL PROTECTED] wrote: [earlier messages quoted in full above, omitted]
Re: [R] Need Help: Installing/Using xtable package
Peter Dalgaard [EMAIL PROTECTED] writes: M. Jankowski wrote: [question quoted in full above, omitted] Looks like the current xtable is no longer exporting its print methods. Why were you calling print.xtable explicitly in the first place?

Indeed, xtable now has a namespace. The S3 methods are not exported because they should not be called directly; rather, the generic function (in this case print) should be called. The addition of the namespace is really a good thing. Yes, it will cause some hiccups for folks who were calling the methods directly (tsk tsk), but the addition fixes breakage that was occurring due to internal xtable helper functions being masked. + seth -- Seth Falcon | Computational Biology | Fred Hutchinson Cancer Research Center BioC: http://bioconductor.org/ Blog: http://userprimary.net/user/
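[Editorial note: a small sketch of the dispatch Seth describes. Calling the generic lets S3 find the registered method; if you need the unexported method object itself, getS3method retrieves it without resorting to ::: .]

```r
library(xtable)
tab <- xtable(head(mtcars, 3))

# The generic dispatches to the registered (but unexported) method,
# so this does what print.xtable(tab) used to do.
print(tab, type = "latex")

# Retrieving the method object itself, without relying on exports:
f <- getS3method("print", "xtable")
stopifnot(is.function(f))
```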
Re: [R] Need Help: Installing/Using xtable package
Ok, I got it now. Just: print(xtable(...)) Thanks! Matt

On 8/9/07, Seth Falcon [EMAIL PROTECTED] wrote: [Seth's reply, quoted in full above, omitted]
Re: [R] small sample techniques
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Nair, Murlidharan T Sent: Thursday, August 09, 2007 9:19 AM To: Moshe Olshansky; Rolf Turner; r-help@stat.math.ethz.ch Subject: Re: [R] small sample techniques

Thanks, that discussion was helpful. Well, I have another question. I am comparing two proportions for deviation from the hypothesized difference of zero. My manually calculated z ratio is 1.94, but when I calculate it using prop.test, it uses Pearson's chi-squared test, and the X-squared value it gives is 0.74. Is there a function in R where I can calculate the z ratio? That is,

Z = ((p1hat - p2hat) - (p1 - p2)) / S(p1hat - p2hat)

where S(p1hat - p2hat) is the standard error estimate of the difference between two independent proportions. Dummy example; this is how I use it: prop.test(c(30,23), c(300,300)). Cheers../Murli

Murli, I think you need to recheck your computations. You can run a t-test on your data in a variety of ways. Here is one:

x <- c(rep(1,30), rep(0,270))
y <- c(rep(1,23), rep(0,277))
t.test(x,y)

Welch Two Sample t-test
data: x and y
t = 1.0062, df = 589.583, p-value = 0.3147
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -0.02221086 0.06887752
sample estimates: mean of x mean of y 0.1000 0.0767

Hope this is helpful, Dan

Daniel J. Nordlund Research and Data Analysis Washington State Department of Social and Health Services Olympia, WA 98504-5204
Re: [R] Memory Experimentation: Rule of Thumb = 10-15 Times the Memory
Another thing you could try would be reading it into a database and then from there into R. The devel version of sqldf has this capability; that is, it will use RSQLite to read the file directly into the database without going through R at all, and then read it from there into R, so it's a completely different process. The RSQLite software has no capability of dealing with quotes (they will be regarded as ordinary characters), but a single gsub can remove them afterwards. This won't work if there are commas within the quotes, but in that case you could read each row as a single record and then split it yourself in R. Try this:

library(sqldf)
# next statement grabs the devel version software that does this
source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R")
gc()
f <- file("big.csv")
DF <- sqldf("select * from f",
            file.format = list(header = TRUE, row.names = FALSE))
gc()

For more info see the man page from the devel version and the home page:
http://sqldf.googlecode.com/svn/trunk/man/sqldf.Rd
http://code.google.com/p/sqldf/

On 8/9/07, Michael Cassin [EMAIL PROTECTED] wrote: Thanks for looking, but my file has quotes. It's also 400MB, and I don't mind waiting, but don't have 6x the memory to read it in. On 8/9/07, Gabor Grothendieck [EMAIL PROTECTED] wrote: [earlier messages quoted in full above, omitted]
Re: [R] Memory Experimentation: Rule of Thumb = 10-15 Times the Memory
Just one other thing: the command in my prior post reads the data into an in-memory database. If you find that is a problem, then you can read it into a disk-based database by adding the dbname argument to the sqldf call, naming the database. The database need not exist; it will be created by sqldf and then deleted when it's through:

DF <- sqldf("select * from f", dbname = tempfile(),
            file.format = list(header = TRUE, row.names = FALSE))

On 8/9/07, Gabor Grothendieck [EMAIL PROTECTED] wrote: [prior messages quoted in full above, omitted]
Re: [R] Memory Experimentation: Rule of Thumb = 10-15 Times the Memory
I really appreciate the advice and this database solution will be useful to me for other problems, but in this case I need to address the specific problem of scan and read.* using so much memory. Is this expected behaviour? Can the memory usage be explained, and can it be made more efficient? For what it's worth, I'd be glad to try to help if the code for scan is considered to be worth reviewing. Regards, Mike On 8/9/07, Gabor Grothendieck [EMAIL PROTECTED] wrote: Just one other thing. The command in my prior post reads the data into an in-memory database. If you find that is a problem then you can read it into a disk-based database by adding the dbname argument to the sqldf call naming the database. The database need not exist. It will be created by sqldf and then deleted when its through: DF - sqldf(select * from f, dbname = tempfile(), file.format = list(header = TRUE, row.names = FALSE)) On 8/9/07, Gabor Grothendieck [EMAIL PROTECTED] wrote: Another thing you could try would be reading it into a data base and then from there into R. The devel version of sqldf has this capability. That is it will use RSQLite to read the file directly into the database without going through R at all and then read it from there into R so its a completely different process. The RSQLite software has no capability of dealing with quotes (they will be regarded as ordinary characters) but a single gsub can remove them afterwards. This won't work if there are commas within the quotes but in that case you could read each row as a single record and then split it yourself in R. 
Try this library(sqldf) # next statement grabs the devel version software that does this source(http://sqldf.googlecode.com/svn/trunk/R/sqldf.R;) gc() f - file(big.csv) DF - sqldf(select * from f, file.format = list(header = TRUE, row.names = FALSE)) gc() For more info see the man page from the devel version and the home page: http://sqldf.googlecode.com/svn/trunk/man/sqldf.Rd http://code.google.com/p/sqldf/ On 8/9/07, Michael Cassin [EMAIL PROTECTED] wrote: Thanks for looking, but my file has quotes. It's also 400MB, and I don't mind waiting, but don't have 6x the memory to read it in. On 8/9/07, Gabor Grothendieck [EMAIL PROTECTED] wrote: If we add quote = FALSE to the write.csv statement its twice as fast reading it in. On 8/9/07, Michael Cassin [EMAIL PROTECTED] wrote: Hi, I've been having similar experiences and haven't been able to substantially improve the efficiency using the guidance in the I/O Manual. Could anyone advise on how to improve the following scan()? It is not based on my real file, please assume that I do need to read in characters, and can't do any pre-processing of the file, etc. ## Create Sample File write.csv(matrix(as.character(1:1e6),ncol=10,byrow=TRUE),big.csv, row.names=FALSE) q() **New Session** #R system(ls -l big.csv) system(free -m) big1-matrix(scan(big.csv ,sep=,,what=character(0),skip=1,n=1e6),ncol=10,byrow=TRUE) system(free -m) The file is approximately 9MB, but approximately 50-60MB is used to read it in. object.size(big1) is 56MB, or 56 bytes per string, which seems excessive. 
Regards, Mike

Configuration info:
> sessionInfo()
R version 2.5.1 (2007-06-27)
x86_64-redhat-linux-gnu
locale: C
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
# uname -a
Linux ***.com 2.6.9-023stab044.4-smp #1 SMP Thu May 24 17:20:37 MSD 2007 x86_64 x86_64 x86_64 GNU/Linux

== Quoted Text ==
From: Prof Brian Ripley ripley_at_stats.ox.ac.uk
Date: Tue, 26 Jun 2007 17:53:28 +0100 (BST)

The R Data Import/Export Manual points out several ways in which you can use read.csv more efficiently.

On Tue, 26 Jun 2007, ivo welch wrote: dear R experts: I am of course no R expert, but use it regularly. I thought I would share some experimentation with memory use. I run a linux machine with about 4GB of memory, and R 2.5.0. Upon startup, gc() reports

         used (Mb) gc trigger (Mb) max used (Mb)
Ncells 268755 14.4     407500 21.8       35 18.7
Vcells 139137  1.1     786432  6.0   444750  3.4

This is my baseline. linux 'top' reports 48MB as baseline. This includes some of my own routines that are always loaded. Good. Next, I created an s.csv file with 22 variables and 500,000 observations, taking up an uncompressed disk space of 115MB. The resulting object.size() after a read.csv() is 84,002,712 bytes (80MB).

s = read.csv("s.csv"); object.size(s)
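[Editor's note: R's own accounting can confirm the peak usage without relying on free -m. The following is a sketch, not from the thread; it assumes the big.csv created by the write.csv() line above, and uses gc(reset = TRUE) to zero the "max used" columns so the subsequent peak reflects just the read.]

```r
gc(reset = TRUE)                      # reset the "max used" statistics
big1 <- matrix(scan("big.csv", sep = ",", what = character(0),
                    skip = 1, n = 1e6, quiet = TRUE),
               ncol = 10, byrow = TRUE)
gc()                                  # "max used" now shows the peak during the read
object.size(big1)                     # ~56 MB for 1e6 short strings on a 64-bit build
```

Comparing the "max used" row against object.size(big1) separates transient allocation inside scan() from the size of the final object.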
Re: [R] small sample techniques
n=300; 30% taking A report relief from pain; 23% taking B report relief from pain. Question: if there is no difference, are we likely to get a 7% difference?

Hypothesis: H0: p1-p2=0, H1: p1-p2!=0 (not equal to)

1. Weighted average of the two sample proportions:
   (300(0.30) + 300(0.23)) / (300 + 300) = 0.265
2. Std error estimate of the difference between two independent proportions:
   sqrt((0.265*0.735)*((1/300)+(1/300))) = 0.03603
3. Evaluation of the difference between sample proportions as a deviation from the hypothesized difference of zero:
   ((0.30-0.23) - 0) / 0.03603 = 1.94

z did not reach 1.96, hence H0 is not rejected. This is what I was trying to do using prop.test: prop.test(c(30,23),c(300,300)). What function should I use?

-----Original Message-----
From: [EMAIL PROTECTED] on behalf of Nordlund, Dan (DSHS/RDA)
Sent: Thu 8/9/2007 1:26 PM
To: r-help@stat.math.ethz.ch
Subject: Re: [R] small sample techniques

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Nair, Murlidharan T
Sent: Thursday, August 09, 2007 9:19 AM
To: Moshe Olshansky; Rolf Turner; r-help@stat.math.ethz.ch
Subject: Re: [R] small sample techniques

Thanks, that discussion was helpful. Well, I have another question. I am comparing two proportions for their deviation from the hypothesized difference of zero. My manually calculated z ratio is 1.94. But when I calculate it using prop.test, it uses Pearson's chi-squared test, and the X-squared value that it gives is 0.74. Is there a function in R where I can calculate the z ratio? Which is

Z = (('p1 - 'p2) - (p1 - p2)) / S('p1 - 'p2)

where S('p1 - 'p2) is the standard error estimate of the difference between two independent proportions. Dummy example; this is how I use it: prop.test(c(30,23),c(300,300)). Cheers../Murli

Murli, I think you need to recheck your computations. You can run a t-test on your data in a variety of ways.
Here is one:

x <- c(rep(1,30), rep(0,270))
y <- c(rep(1,23), rep(0,277))
t.test(x,y)

Welch Two Sample t-test
data: x and y
t = 1.0062, df = 589.583, p-value = 0.3147
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval: -0.02221086 0.06887752
sample estimates: mean of x mean of y 0.1000 0.0767

Hope this is helpful, Dan

Daniel J. Nordlund
Research and Data Analysis
Washington State Department of Social and Health Services
Olympia, WA 98504-5204

__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
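[Editor's note: Murli's three hand-computed steps translate line for line into R. A sketch with his numbers; note that, as Greg Snow points out later in the thread, the counts passed to prop.test should be 30% and 23% of 300, i.e. c(90, 69), not c(30, 23).]

```r
n1 <- 300; n2 <- 300
p1 <- 0.30; p2 <- 0.23
# 1. weighted (pooled) average of the two sample proportions
p_pool <- (n1 * p1 + n2 * p2) / (n1 + n2)          # 0.265
# 2. standard error of the difference under H0: p1 - p2 = 0
se <- sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2))  # 0.03603
# 3. z ratio for the observed 7% difference
z <- (p1 - p2) / se                                # 1.94 < 1.96, so H0 is not rejected
```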
Re: [R] Memory Experimentation: Rule of Thumb = 10-15 Times the Memory
One other idea. Don't use byrow = TRUE. Matrices are stored in column order so that might be more efficient. You can always transpose it later. Haven't tested it to see if it helps.

On 8/9/07, Michael Cassin [EMAIL PROTECTED] wrote: I really appreciate the advice and this database solution will be useful to me for other problems, but in this case I need to address the specific problem of scan and read.* using so much memory. Is this expected behaviour? Can the memory usage be explained, and can it be made more efficient? For what it's worth, I'd be glad to try to help if the code for scan is considered to be worth reviewing. Regards, Mike [...]
Re: [R] Memory Experimentation: Rule of Thumb = 10-15 Times the Memory
On Thu, 9 Aug 2007, Michael Cassin wrote: I really appreciate the advice and this database solution will be useful to me for other problems, but in this case I need to address the specific problem of scan and read.* using so much memory. Is this expected behaviour? Can the memory usage be explained, and can it be made more efficient? For what it's worth, I'd be glad to try to help if the code for scan is considered to be worth reviewing.

Mike, This does not seem to be an issue with scan() per se. Notice the difference in size of big2, big3, and bigThree here:

> big2 <- rep(letters, length = 1e6)
> object.size(big2)/1e6
[1] 4.000856
> big3 <- paste(big2, big2, sep = '')
> object.size(big3)/1e6
[1] 36.2
> cat(big2, file = 'lotsaletters.txt', sep = '\n')
> bigTwo <- scan('lotsaletters.txt', what = '')
Read 1000000 items
> object.size(bigTwo)/1e6
[1] 4.000856
> cat(big3, file = 'moreletters.txt', sep = '\n')
> bigThree <- scan('moreletters.txt', what = '')
Read 1000000 items
> object.size(bigThree)/1e6
[1] 4.000856
> all.equal(big3, bigThree)
[1] TRUE

Chuck

p.s.
> version
platform       i386-pc-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status
major          2
minor          5.1
year           2007
month          06
day            27
svn rev        42083
language       R
version.string R version 2.5.1 (2007-06-27)

[...]
[R] deprecation of $ for atomic vectors
Dear All, I would like to know why $ was deprecated for atomic vectors and what I can use instead. I got used to the following idiom for working with data frames:

df <- data.frame(start=1:5, end=10:6)
apply(df, 1, function(row){ return(row$start + row$end) })

I have a data.frame with named columns and use each row to do something. I would like the named index ($) because the column position in the data frame changes from time to time. The data frame is read from files. Thank you very much, ido

'$' returns 'NULL' (with a warning) except for recursive objects, and is only discussed in the section below on recursive objects. Its use on non-recursive objects was deprecated in R 2.5.0.
Re: [R] deprecation of $ for atomic vectors
Try this:

DF <- data.frame(start=1:5, end=10:6)
# apply(DF, 1, function(row){ return(row$start + row$end) })
DF$start + DF$end
apply(DF, 1, function(row) row[["start"]] + row[["end"]])
apply(DF, 1, function(row) row["start"] + row["end"])

On 8/9/07, Ido M. Tamir [EMAIL PROTECTED] wrote: [...]
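[Editor's note: the underlying reason row$start stops working is that apply() hands the function each row as a named atomic vector, not a one-row data frame, and $ is only defined for recursive objects. A small sketch of the distinction:]

```r
df <- data.frame(start = 1:5, end = 10:6)
r <- df[1, ]               # a one-row data frame: $ still works
r$start + r$end            # 11
v <- unlist(df[1, ])       # a named atomic vector, like what apply() passes
v[["start"]] + v[["end"]]  # 11; v$start warns and returns NULL
```

This also explains why DF$start + DF$end above is the preferred, vectorized replacement for the apply() idiom.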
Re: [R] Memory Experimentation: Rule of Thumb = 10-15 Times the Memory
Try it as a factor:

> big2 <- rep(letters, length = 1e6)
> object.size(big2)/1e6
[1] 4.000856
> object.size(as.factor(big2))/1e6
[1] 4.001184
> big3 <- paste(big2, big2, sep = '')
> object.size(big3)/1e6
[1] 36.2
> object.size(as.factor(big3))/1e6
[1] 4.001184

On 8/9/07, Charles C. Berry [EMAIL PROTECTED] wrote: [...]
Re: [R] Memory Experimentation: Rule of Thumb = 10-15 Times the Memory
I do not see how this helps Mike's case:

> res <- (as.character(1:1e6))
> object.size(res)
[1] 3624
> object.size(as.factor(res))
[1] 4224

Anyway, my point was that if two character vectors for which all.equal() yields TRUE can differ by almost an order of magnitude in object.size(), and the smaller of the two was read in by scan(), then Mike will have to dig deeper than scan() to see how to reduce the size of a character vector in R.

On Thu, 9 Aug 2007, Gabor Grothendieck wrote: Try it as a factor: [...]
Re: [R] tcltk error on Linux
Seth and Brian, Today I downloaded and installed the latest R-devel and tcltk now works. My suspicion is that Tcl was not on my path when R-devel was installed previously. BTW, I had thought that it was a courtesy to cc: the maintainers of a package when writing either R-devel or R-help about a specific package. For tcltk, I see: Maintainer: R Core Team [EMAIL PROTECTED]. If it is not appropriate to write R-core regarding packages they maintain, would it perhaps not be better to remove them as maintainers, or not suggest that people cc: maintainers of packages? Just an idea. As for the suggestion that I not use R-devel, I do that because I sometimes use BioC packages that have just been published and are only available in the devel versions of BioC. Are you suggesting that only people who can debug things themselves, and thus who do not need to write to R-devel, use R-devel? As an open-source user, I thought the philosophy was that it was useful to have users willing to test beta versions of software and have those users report problems to developers. If that is not the case, please put a stronger warning on R-devel and warn users not to use it unless they are willing to debug and take care of all problems themselves. Thanks, Mark

---
Mark W. Kimpel MD ** Neuroinformatics ** Dept. of Psychiatry
Indiana University School of Medicine
15032 Hunter Court, Westfield, IN 46074
(317) 490-5129 Work, Mobile & VoiceMail
(317) 663-0513 Home (no voice mail please)

Seth Falcon wrote: Hi Mark, Prof Brian Ripley [EMAIL PROTECTED] writes: On Thu, 9 Aug 2007, Mark W Kimpel wrote: I am having trouble getting the tcltk package to load on openSUSE 10.2 running R-devel. I have specifically put my /usr/share/tcl directory in my PATH, but R doesn't seem to see it. I also have installed tk on my system. Any ideas on what the problem is? Any chance you are running R on a remote server using an ssh session?
If that is the case, you may have an ssh/X11 config issue that prevents using tcl/tk from such a session. Rerun the configure script for R and verify that tcl/tk support is listed in the summary. Also, note that I have some warning messages on starting up R, not sure what they mean or if they are pertinent. Those are coming from a Bioconductor package: again you must be using development versions with R-devel and those are not stable (last time I looked even Biobase would not install, and the packages change daily). BioC devel tracks R-devel, but not on a daily basis -- because R changes daily. The recent issues with Biobase are a result of changes to R and have already been fixed. If you have all those packages in your startup, please don't -- there will be a considerable performance hit, so only load them when you need them. Presumably, that's why they are there in the first place. The warning messages are a problem and suggest some needed improvements to the methods package. These are being worked on. + seth
[R] RMySQL loading error
Hi, I am having problems loading RMySQL. I am using MySQL 5.0, R version 2.5.1, and RMySQL with Windows XP. When I try to load RMySQL I get the following error:

> require(RMySQL)
Loading required package: RMySQL
Error in dyn.load(x, as.logical(local), as.logical(now)) : unable to load shared library 'C:/PROGRA~1/R/R-25~1.1/library/RMySQL/libs/RMySQL.dll': LoadLibrary failure: Invalid access to memory location.

I did not get any errors while installing MySQL or RMySQL. It seems that there are other people with similar problems, although I could not find any hint on how to try to solve the problem. Any help, hint or advice would be greatly appreciated. Thanks, Clara Anton
Re: [R] Memory Experimentation: Rule of Thumb = 10-15 Times the Memory
On Thu, 9 Aug 2007, Charles C. Berry wrote: On Thu, 9 Aug 2007, Michael Cassin wrote: I really appreciate the advice and this database solution will be useful to me for other problems, but in this case I need to address the specific problem of scan and read.* using so much memory. Is this expected behaviour?

Yes, and documented in the 'R Internals' manual. That is basic reading for people wishing to comment on efficiency issues in R.

Can the memory usage be explained, and can it be made more efficient? For what it's worth, I'd be glad to try to help if the code for scan is considered to be worth reviewing.

Mike, This does not seem to be an issue with scan() per se. Notice the difference in size of big2, big3, and bigThree here:

big2 <- rep(letters, length = 1e6)
object.size(big2)/1e6
[1] 4.000856
big3 <- paste(big2, big2, sep = '')
object.size(big3)/1e6
[1] 36.2

On a 32-bit computer every R object has an overhead of 24 or 28 bytes. Character strings are R objects, but in some functions such as rep (and scan for up to 10,000 distinct strings) the objects can be shared. More string objects will be shared in 2.6.0 (but factors are designed to be efficient at storing character vectors with few values). On a 64-bit computer the overhead is usually double. So I would expect just over 56 bytes/string for distinct short strings (and that is what big3 gives). But 56Mb is really not very much (tiny on a 64-bit computer), and 1 million items is a lot.

[...]

--
Brian D. Ripley, [EMAIL PROTECTED]
Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel: +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK, Fax: +44 1865 272595
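[Editor's note: Ripley's per-string overhead estimate is easy to check empirically. A sketch; exact figures vary with R version and 32- vs 64-bit builds. With 1e6 distinct strings nothing can be shared, while rep() reuses the same 26 string objects.]

```r
distinct <- as.character(1:1e6)           # 1e6 distinct short strings
object.size(distinct) / 1e6               # MB total == bytes per string: tens of bytes
shared <- rep(letters, length.out = 1e6)  # only 26 distinct strings, shared by rep()
object.size(shared) / 1e6                 # a few bytes per element (pointer slots only)
```

The gap between the two numbers is the per-object header cost that cannot be avoided for distinct strings, which is why Mike's 56 bytes/string is expected rather than a scan() inefficiency.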
[R] S4 based package giving strange error at install time, but not at check time
Hi, I have an S4-based package that was loading fine on R 2.5.0 on both OS X and Linux. I was checking the package against 2.5.1, and doing R CMD check does not give any warnings. So I next built the package and installed it. Though the package installed fine, I noticed the following message:

Loading required package: methods
Error in loadNamespace(package, c(which.lib.loc, lib.loc), keep.source = keep.source) : in 'fingerprint' methods specified for export, but none defined: fold, euc.vector, distance, random.fingerprint, as.character, length, show
During startup - Warning message: package 'fingerprint' in options("defaultPackages") was not found

However, I can load the package in R with no errors being reported and it seems that the functions are working fine. Looking at the sources I see that my NAMESPACE file contains the following:

importFrom(methods)
exportClasses("fingerprint")
exportMethods("fold", "euc.vector", "distance", "random.fingerprint", "as.character", "length", "show")
export(fp.sim.matrix, fp.to.matrix, fp.factor.matrix, fp.read.to.matrix, fp.read, moe.lf, bci.lf, cdk.lf)

and all the exported methods are defined. As an example consider the 'fold' method. It's defined as

setGeneric("fold", function(fp) standardGeneric("fold"))
setMethod("fold", "fingerprint", function(fp) {
  ## code for the function snipped
})

Since the method has been defined, I can't see why I should see the error at install time, but nothing when the package is checked. Any pointers would be appreciated.

---
Rajarshi Guha [EMAIL PROTECTED]
GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE
---
Bus error -- driver executed.
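[Editor's note: for reference, a minimal, self-contained version of the pattern in question — class, generic, and method with the matching NAMESPACE directives. The class representation and method body here are invented for illustration; they are not the real fingerprint package internals.]

```r
## NAMESPACE would contain:
##   exportClasses("fingerprint")
##   exportMethods("fold")
setClass("fingerprint", representation(bits = "numeric"))
setGeneric("fold", function(fp) standardGeneric("fold"))
setMethod("fold", "fingerprint", function(fp) {
  # illustrative body: OR the two halves of the bit vector together
  n <- length(fp@bits)
  as.numeric(fp@bits[1:(n/2)] | fp@bits[(n/2 + 1):n])
})
fold(new("fingerprint", bits = c(1, 0, 0, 1)))  # 1 1
```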
Re: [R] small sample techniques
30 is not 30% of 300 (it is 10%), so your prop.test below is testing something different from your hand calculations. Try: prop.test(c(.30,.23)*300, c(300,300), correct=FALSE) 2-sample test for equality of proportions without continuity correction data: c(0.3, 0.23) * 300 out of c(300, 300) X-squared = 3.7736, df = 1, p-value = 0.05207 alternative hypothesis: two.sided 95 percent confidence interval: -0.000404278 0.140404278 sample estimates: prop 1 prop 2 0.30 0.23 sqrt(3.7736) [1] 1.942576 Notice that the square root of the X-squared value matches your hand calculation (up to rounding error). This is true only if the Yates continuity correction is not used (the correct=FALSE in the call to prop.test). Hope this helps, -- Gregory (Greg) L. Snow Ph.D. Statistical Data Center Intermountain Healthcare [EMAIL PROTECTED] (801) 408-8111 -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Nair, Murlidharan T Sent: Thursday, August 09, 2007 1:02 PM To: Nordlund, Dan (DSHS/RDA); r-help@stat.math.ethz.ch Subject: Re: [R] small sample techniques n=300 30% taking A report relief from pain 23% taking B report relief from pain Question: If there is no difference, are we likely to get a 7% difference? Hypothesis H0: p1-p2=0 H1: p1-p2!=0 (not equal to) 1. Weighted average of the two sample proportions: (300(0.30)+300(0.23))/(300+300) = 0.265 2. Standard error estimate of the difference between two independent proportions: sqrt((0.265*0.735)*((1/300)+(1/300))) = 0.03603 3. Evaluation of the difference between the sample proportions as a deviation from the hypothesized difference of zero: ((0.30-0.23)-(0))/0.03603 = 1.94 z did not reach 1.96, hence H0 is not rejected. This is what I was trying to do using prop.test. prop.test(c(30,23), c(300,300)) What function should I use?
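For completeness, the hand calculation above can be reproduced directly in R. This is a minimal sketch (the proportions 0.30 and 0.23 and n = 300 are taken from the post; the variable names are mine):

```r
p1 <- 0.30; p2 <- 0.23; n1 <- 300; n2 <- 300

pbar <- (n1*p1 + n2*p2) / (n1 + n2)              # pooled proportion, 0.265
se   <- sqrt(pbar * (1 - pbar) * (1/n1 + 1/n2))  # std error of the difference, ~0.036
z    <- (p1 - p2) / se                           # z ratio, ~1.94

# prop.test without continuity correction reports z^2 as X-squared:
pt <- prop.test(c(p1, p2) * c(n1, n2), c(n1, n2), correct = FALSE)
all.equal(z^2, unname(pt$statistic))             # should be TRUE
```

This makes explicit why the square root of X-squared (1.9426) agrees with the hand-computed z of 1.94.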
-Original Message- From: [EMAIL PROTECTED] on behalf of Nordlund, Dan (DSHS/RDA) Sent: Thu 8/9/2007 1:26 PM To: r-help@stat.math.ethz.ch Subject: Re: [R] small sample techniques -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Nair, Murlidharan T Sent: Thursday, August 09, 2007 9:19 AM To: Moshe Olshansky; Rolf Turner; r-help@stat.math.ethz.ch Subject: Re: [R] small sample techniques Thanks, that discussion was helpful. Well, I have another question. I am comparing two proportions for their deviation from the hypothesized difference of zero. My manually calculated z ratio is 1.94. But when I calculate it using prop.test, it uses Pearson's chi-squared test, and the X-squared value that it gives is 0.74. Is there a function in R where I can calculate the z ratio? That is, Z = (('p1 - 'p2) - (p1 - p2)) / S('p1 - 'p2), where S('p1 - 'p2) is the standard error estimate of the difference between two independent proportions. Dummy example; this is how I use it: prop.test(c(30,23), c(300,300)) Cheers../Murli Murli, I think you need to recheck your computations. You can run a t-test on your data in a variety of ways. Here is one: x <- c(rep(1,30), rep(0,270)) y <- c(rep(1,23), rep(0,277)) t.test(x,y) Welch Two Sample t-test data: x and y t = 1.0062, df = 589.583, p-value = 0.3147 alternative hypothesis: true difference in means is not equal to 0 95 percent confidence interval: -0.02221086 0.06887752 sample estimates: mean of x mean of y 0.1000 0.0767 Hope this is helpful, Dan Daniel J. Nordlund Research and Data Analysis Washington State Department of Social and Health Services Olympia, WA 98504-5204
Re: [R] Memory Experimentation: Rule of Thumb = 10-15 Times the Memory
The examples were just artificially created data. We don't know what the real case is, but if each entry is distinct then factors won't help; however, if they are not distinct then there is a huge potential savings. Also, if they are really numeric, as in your example, then storing them as numeric rather than character or factor could give substantial savings. So it all depends on the nature of the data, but the way it's stored does seem to make a potentially large difference. # distinct elements res <- as.character(1:1e6) object.size(res)/1e6 [1] 36.2 object.size(as.factor(res))/1e6 [1] 40.00022 object.size(as.numeric(res))/1e6 [1] 8.24 # non-distinct elements res2 <- as.character(rep(1:100, length = 1e6)) object.size(res2)/1e6 [1] 36.2 object.size(as.factor(res2))/1e6 [1] 4.003824 object.size(as.numeric(res2))/1e6 [1] 8.24 On 8/9/07, Charles C. Berry [EMAIL PROTECTED] wrote: I do not see how this helps Mike's case: res <- as.character(1:1e6) object.size(res) [1] 3624 object.size(as.factor(res)) [1] 4224 Anyway, my point was that if two character vectors for which all.equal() yields TRUE can differ by almost an order of magnitude in object.size(), and the smaller of the two was read in by scan(), then Mike will have to dig deeper than scan() to see how to reduce the size of a character vector in R. On Thu, 9 Aug 2007, Gabor Grothendieck wrote: Try it as a factor: big2 <- rep(letters, length=1e6) object.size(big2)/1e6 [1] 4.000856 object.size(as.factor(big2))/1e6 [1] 4.001184 big3 <- paste(big2, big2, sep='') object.size(big3)/1e6 [1] 36.2 object.size(as.factor(big3))/1e6 [1] 4.001184 On 8/9/07, Charles C. Berry [EMAIL PROTECTED] wrote: On Thu, 9 Aug 2007, Michael Cassin wrote: I really appreciate the advice and this database solution will be useful to me for other problems, but in this case I need to address the specific problem of scan and read.* using so much memory. Is this expected behaviour? Can the memory usage be explained, and can it be made more efficient?
For what it's worth, I'd be glad to try to help if the code for scan is considered to be worth reviewing. Mike, This does not seem to be an issue with scan() per se. Notice the difference in size of big2, big3, and bigThree here: big2 <- rep(letters, length=1e6) object.size(big2)/1e6 [1] 4.000856 big3 <- paste(big2, big2, sep='') object.size(big3)/1e6 [1] 36.2 cat(big2, file='lotsaletters.txt', sep='\n') bigTwo <- scan('lotsaletters.txt', what='') Read 1000000 items object.size(bigTwo)/1e6 [1] 4.000856 cat(big3, file='moreletters.txt', sep='\n') bigThree <- scan('moreletters.txt', what='') Read 1000000 items object.size(bigThree)/1e6 [1] 4.000856 all.equal(big3, bigThree) [1] TRUE Chuck p.s. version _ platform i386-pc-mingw32 arch i386 os mingw32 system i386, mingw32 status major 2 minor 5.1 year 2007 month 06 day 27 svn rev 42083 language R version.string R version 2.5.1 (2007-06-27) Regards, Mike On 8/9/07, Gabor Grothendieck [EMAIL PROTECTED] wrote: Just one other thing. The command in my prior post reads the data into an in-memory database. If you find that is a problem then you can read it into a disk-based database by adding the dbname argument to the sqldf call naming the database. The database need not exist. It will be created by sqldf and then deleted when it's through: DF <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = TRUE, row.names = FALSE)) On 8/9/07, Gabor Grothendieck [EMAIL PROTECTED] wrote: Another thing you could try would be reading it into a database and then from there into R. The devel version of sqldf has this capability. That is, it will use RSQLite to read the file directly into the database without going through R at all, and then read it from there into R, so it's a completely different process. The RSQLite software has no capability of dealing with quotes (they will be regarded as ordinary characters) but a single gsub can remove them afterwards.
This won't work if there are commas within the quotes, but in that case you could read each row as a single record and then split it yourself in R. Try this: library(sqldf) # next statement grabs the devel version software that does this source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R") gc() f <- file("big.csv") DF <- sqldf("select * from f", file.format = list(header = TRUE, row.names = FALSE)) gc() For more info see the man page from the devel version and the home page: http://sqldf.googlecode.com/svn/trunk/man/sqldf.Rd http://code.google.com/p/sqldf/ On 8/9/07, Michael Cassin [EMAIL PROTECTED] wrote: Thanks for looking,
Re: [R] RMySQL loading error
On Thu, 9 Aug 2007, Clara Anton wrote: Hi, I am having problems loading RMySQL. I am using MySQL 5.0, R version 2.5.1, and RMySQL with Windows XP. More exact versions would be helpful. When I try to load RMySQL I get the following error: require(RMySQL) Loading required package: RMySQL Error in dyn.load(x, as.logical(local), as.logical(now)) : unable to load shared library 'C:/PROGRA~1/R/R-25~1.1/library/RMySQL/libs/RMySQL.dll': LoadLibrary failure: Invalid access to memory location. I did not get any errors while installing MySQL or RMySQL. It seems that there are other people with similar problems, although I could not find any hint on how to try to solve the problem. It is there, unfortunately along with a lot of uninformed speculation. Any help, hint or advice would be greatly appreciated. The most likely solution is to update (or downdate) your MySQL. You possibly got RMySQL from the CRAN Extras site, and if so this is covered in the ReadMe there: The build of RMySQL_0.6-0 is known to work with MySQL 5.0.21 and 5.0.45, and known not to work (it crashes on startup) with 5.0.41. Usually the message is the one you show, but I have seen R crash. The issue is the MySQL client DLL: that from 5.0.21 or 5.0.45 works in 5.0.41. All the reports of problems I have seen are for MySQL versions strictly between 5.0.21 and 5.0.45. -- Brian D. Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595
Re: [R] RMySQL loading error
This was just discussed: https://www.stat.math.ethz.ch/pipermail/r-help/2007-August/138142.html On 8/9/07, Clara Anton [EMAIL PROTECTED] wrote: Hi, I am having problems loading RMySQL. I am using MySQL 5.0, R version 2.5.1, and RMySQL with Windows XP. When I try to load RMySQL I get the following error: require(RMySQL) Loading required package: RMySQL Error in dyn.load(x, as.logical(local), as.logical(now)) : unable to load shared library 'C:/PROGRA~1/R/R-25~1.1/library/RMySQL/libs/RMySQL.dll': LoadLibrary failure: Invalid access to memory location. I did not get any errors while installing MySQL or RMySQL. It seems that there are other people with similar problems, although I could not find any hint on how to try to solve the problem. Any help, hint or advice would be greatly appreciated. Thanks, Clara Anton
[R] Tukey HSD
Hi, I was wondering if you could help me: The following are the first few lines of my data set: subject group condition depvar s1 c ver 114.87 s1 c feet 114.87 s1 c body 114.87 s2 c ver 73.54 s2 c feet 64.32 s2 c body 61.39 s3 a ver 114.87 s3 a feet 97.21 s3 a body 103.31 etc. I entered the following ANOVA command: dat <- read.table("mydata.txt", header=TRUE) summary(aov(depvar ~ group * condition + Error(subject), data=dat)) Error: subject Df Sum Sq Mean Sq F value Pr(>F) group 1 443.3 443.3 1.0314 0.3185 Residuals 28 12035.3 429.8 Error: Within Df Sum Sq Mean Sq F value Pr(>F) condition 2 615.82 307.91 6.6802 0.002501 ** group:condition 2 61.51 30.75 0.6672 0.517168 Residuals 56 2581.18 46.09 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 I cannot find a way to perform a Tukey HSD on the main effect of condition, since the ANOVA formula contains the Error() term. Could you help me please? Kurt _ [[trailing spam removed]]
Re: [R] S4 based package giving strange error at install time, but not at check time
On Thu, 9 Aug 2007, Rajarshi Guha wrote: Hi, I have an S4-based package that was loading fine on R 2.5.0 on both OS X and Linux. I was checking the package against 2.5.1 and doing R CMD check does not give any warnings. So I next built the package and installed it. Though the package installed fine I noticed the following message: Loading required package: methods Error in loadNamespace(package, c(which.lib.loc, lib.loc), keep.source = keep.source) : in 'fingerprint' methods specified for export, but none defined: fold, euc.vector, distance, random.fingerprint, as.character, length, show During startup - Warning message: package 'fingerprint' in options("defaultPackages") was not found ^^^ Do you have this package in your startup files or the environment variable R_DEFAULT_PACKAGES? R CMD check should not look there: whatever you are quoting above seems to. However, I can load the package in R with no errors being reported and it seems that the functions are working fine. Looking at the sources I see that my NAMESPACE file contains the following: importFrom(methods) That should specify what to import, or be import(methods). See 'Writing R Extensions'. exportClasses(fingerprint) exportMethods(fold, euc.vector, distance, random.fingerprint, as.character, length, show) export(fp.sim.matrix, fp.to.matrix, fp.factor.matrix, fp.read.to.matrix, fp.read, moe.lf, bci.lf, cdk.lf) and all the exported methods are defined. As an example consider the 'fold' method. It's defined as setGeneric("fold", function(fp) standardGeneric("fold")) setMethod("fold", "fingerprint", function(fp) { ## code for the function snipped }) Since the method has been defined I can't see why I should see the error at install time, but nothing when the package is checked. Any pointers would be appreciated. --- Rajarshi Guha [EMAIL PROTECTED] GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE --- Bus error -- driver executed.
-- Brian D. Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self) 1 South Parks Road, +44 1865 272866 (PA) Oxford OX1 3TG, UK Fax: +44 1865 272595
Re: [R] Memory problem
It seems the problem lies in this line: try(fit.lme <- lme(Beta ~ group*session*difficulty+FTND, random = ~1|Subj, Model), tag <- 1); As lme fails for most iterations in the loop, the 'try' function catches one error message for each failed iteration. But the puzzling part is, why does the memory usage keep accumulating? Does each error message stay stored in a buffer somewhere, or is something else going on? Or is something wrong with the way I'm using 'try'? Thanks, Gang On Aug 9, 2007, at 10:36 AM, Gang Chen wrote: I got a long list of error messages repeating the following 3 lines when running the loop at the end of this mail: R(580,0xa000ed88) malloc: *** vm_allocate(size=327680) failed (error code=3) R(580,0xa000ed88) malloc: *** error: can't allocate region R(580,0xa000ed88) malloc: *** set a breakpoint in szone_error to debug There are 2 big arrays, IData (54x64x50x504) and Stat (54x64x50x8), in the code. They would only use about 0.8GB of memory. However, when I check the memory usage during the looping, the memory usage keeps growing and finally reaches the memory limit of my computer, 4GB, and spills the above error message. Is there something in the loop involving lme that is causing a memory leak? How can I clean up the memory usage in the loop?
Thank you very much for your help, Gang tag <- 0; dimx <- 54; dimy <- 64; dimz <- 50; NoF <- 8; NoFile <- 504; IData <- array(data=NA, dim=c(dimx, dimy, dimz, NoFile)); Stat <- array(data=NA, dim=c(dimx, dimy, dimz, NoF)); for (i in 1:NoFile) { IData[,,,i] <- ## fill in the data for array IData here } for (i in 1:dimx) { for (j in 1:dimy) { for (k in 1:dimz) { for (m in 1:NoFile) { Model$Beta[m] <- IData[i, j, k, m]; } try(fit.lme <- lme(Beta ~ group*session*difficulty+FTND, random = ~1|Subj, Model), tag <- 1); if (tag != 1) { Stat[i, j, k,] <- anova(fit.lme)$F[-1]; } else { Stat[i, j, k,] <- rep(0, NoF-1); } tag <- 0; } } }
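One way to avoid the try()/tag bookkeeping in the inner loop is tryCatch, which discards the error object instead of leaving it around. This is a sketch only, assuming the nlme package and the poster's Model, Stat, NoF and loop indices; it is not a diagnosis of the leak itself:

```r
library(nlme)

## replacement for the try()/tag idiom inside the i/j/k loop:
fit <- tryCatch(
  lme(Beta ~ group * session * difficulty + FTND,
      random = ~ 1 | Subj, data = Model),
  error = function(e) NULL)              # on failure, return NULL instead of erroring
if (!is.null(fit)) {
  Stat[i, j, k, ] <- anova(fit)$F[-1]
  rm(fit)                                # drop the fitted object before the next voxel
} else {
  Stat[i, j, k, ] <- rep(0, NoF - 1)
}
```

Calling gc() occasionally inside the loop can also help confirm whether memory is genuinely unreachable (a leak) or merely not yet collected.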
Re: [R] Systematically biased count data regression model
Dear Paul, Thank you very much for your comment. I will apply the 'latent' approach you suggested. Sincerely, Matthew Bowser On 8/9/07, paulandpen [EMAIL PROTECTED] wrote: Matthew, it is possible that your results are suffering from heterogeneity. It may be that your model performs well at the aggregate level, and this would explain good aggregate fit levels and decent predictive performance etc. You could perhaps look at a 'latent' approach to modelling your data; in other words, see if there is something unique in the cases/data/observations in the lower and upper levels of the model (where prediction is poor) and whether it is justified that you model these count areas as separate and unique from the generic aggregate-level model (in other words, there is something unobserved/unmeasured or latent in your population of observations that could be causing some observations to behave uniquely overall). hth thanks Paul - Original Message - From: Matthew and Kim Bowser [EMAIL PROTECTED] To: r-help@stat.math.ethz.ch Sent: Friday, August 10, 2007 1:43 AM Subject: [R] Systematically biased count data regression model Dear all, I am attempting to explain patterns of arthropod family richness (count data) using a regression model. It seems to be able to do a pretty good job as an explanatory model (i.e. demonstrating relationships between dependent and independent variables), but it has systematic problems as a predictive model: It is biased high at low observed values of family richness and biased low at high observed values of family richness (see attached pdf). I have tried diverse kinds of reasonable regression models mostly as in Zeileis, et al. (2007), as well as transforming my variables, both with only small improvements. Do you have suggestions for making a model that would perform better as a predictive model? Thank you for your time. Sincerely, Matthew Bowser STEP student USFWS Kenai National Wildlife Refuge Soldotna, Alaska, USA M.Sc.
student University of Alaska Fairbanks Fairbankse, Alaska, USA Reference Zeileis, A., C. Kleiber, and S. Jackman, 2007. Regression models for count data in R. Technical Report 53, Department of Statistics and Mathematics, Wirtschaftsuniversität Wien, Wien, Austria. URL http://cran.r-project.org/doc/vignettes/pscl/countreg.pdf. Code `data` - structure(list(D = c(4, 5, 12, 4, 9, 15, 4, 8, 3, 9, 6, 17, 4, 9, 6, 9, 3, 9, 7, 11, 17, 3, 10, 8, 9, 6, 7, 9, 7, 5, 15, 15, 12, 9, 10, 4, 4, 15, 7, 7, 12, 7, 12, 7, 7, 7, 5, 14, 7, 13, 1, 9, 2, 13, 6, 8, 2, 10, 5, 14, 4, 13, 5, 17, 12, 13, 7, 12, 5, 6, 10, 6, 6, 10, 4, 4, 12, 10, 3, 4, 4, 6, 7, 15, 1, 8, 8, 5, 12, 0, 5, 7, 4, 9, 6, 10, 5, 7, 7, 14, 3, 8, 15, 14, 7, 8, 7, 8, 8, 10, 9, 2, 7, 8, 2, 6, 7, 9, 3, 20, 10, 10, 4, 2, 8, 10, 10, 8, 8, 12, 8, 6, 16, 10, 5, 1, 1, 5, 3, 11, 4, 9, 16, 3, 1, 6, 5, 5, 7, 11, 11, 5, 7, 5, 3, 2, 3, 0, 3, 0, 4, 1, 12, 16, 9, 0, 7, 0, 11, 7, 9, 4, 16, 9, 10, 0, 1, 9, 15, 6, 8, 6, 4, 6, 7, 5, 7, 14, 16, 5, 8, 1, 8, 2, 10, 9, 6, 11, 3, 16, 3, 6, 8, 12, 5, 1, 1, 3, 3, 1, 5, 15, 4, 2, 2, 6, 5, 0, 0, 0, 3, 0, 16, 0, 9, 0, 0, 8, 1, 2, 2, 3, 4, 17, 4, 1, 4, 6, 4, 3, 15, 2, 2, 13, 1, 9, 7, 7, 13, 10, 11, 2, 15, 7), Day = c(159, 159, 159, 159, 166, 175, 161, 168, 161, 166, 161, 166, 161, 161, 161, 175, 161, 175, 161, 165, 176, 161, 163, 161, 168, 161, 161, 161, 161, 161, 165, 176, 175, 176, 163, 175, 163, 168, 163, 176, 176, 165, 176, 175, 161, 163, 163, 168, 163, 175, 167, 176, 167, 165, 165, 169, 165, 169, 165, 161, 165, 175, 165, 176, 175, 167, 167, 175, 167, 164, 167, 164, 181, 164, 167, 164, 176, 164, 167, 164, 167, 164, 167, 175, 167, 173, 176, 173, 178, 167, 173, 172, 173, 178, 178, 172, 181, 182, 173, 162, 162, 173, 178, 173, 172, 162, 173, 162, 173, 162, 173, 170, 178, 166, 166, 162, 166, 177, 166, 170, 166, 172, 172, 166, 172, 166, 174, 162, 164, 162, 170, 164, 170, 164, 170, 164, 177, 164, 164, 174, 174, 162, 170, 162, 172, 162, 165, 162, 165, 177, 172, 162, 170, 162, 170, 174, 165, 174, 166, 
172, 174, 172, 174, 170, 170, 165, 170, 174, 174, 172, 174, 172, 174, 165, 170, 165, 170, 174, 172, 174, 172, 175, 175, 170, 171, 174, 174, 174, 172, 175, 171, 175, 174, 174, 174, 175, 172, 171, 171, 174, 160, 175, 160, 171, 170, 175, 170, 170, 160, 160, 160, 171, 171, 171, 171, 160, 160, 160, 171, 171, 176, 171, 176, 176, 171, 176, 171, 176, 176, 176, 176, 159, 166, 159, 159, 166, 168, 169, 159, 168, 169, 166, 163, 180, 163, 165, 164, 180, 166, 166, 164, 164, 177, 166), NDVI = c(0.187, 0.2, 0.379, 0.253, 0.356, 0.341, 0.268, 0.431, 0.282, 0.181, 0.243, 0.327, 0.26, 0.232, 0.438, 0.275, 0.169, 0.288, 0.138, 0.404, 0.386, 0.194, 0.266, 0.23, 0.333, 0.234, 0.258, 0.333, 0.234, 0.096, 0.354, 0.394, 0.304, 0.162, 0.565, 0.348, 0.345, 0.226, 0.316, 0.312, 0.333, 0.28, 0.325, 0.243, 0.194, 0.29, 0.221, 0.217, 0.122, 0.289, 0.475,
[R] a question on lda{MASS}
hi, assume val is the test data while m is the lda model fitted with CV=F x = predict(m, val) val2 = val[, 1:(ncol(val)-1)] # the last column is the class label # col is sample, row is variable Then I am wondering if x$x == apply(val2 * m$scaling, 2, sum) i.e., the scaling (is it the coefficient vector?) times the val data, summed, gives the discriminant result $x? Thanks. -- Weiwei Shi, Ph.D Research Scientist GeneGO, Inc. Did you always know? No, I did not. But I believed... ---Matrix III
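For what it's worth, my understanding is that predict.lda projects the *centred* data with a matrix product, not a plain elementwise sum: it subtracts the prior-weighted centroid of the group means and then multiplies by m$scaling. A sketch with the built-in iris data (verify against your own m and val before relying on it):

```r
library(MASS)

m <- lda(Species ~ ., data = iris)
p <- predict(m, iris)

ctr    <- colSums(m$prior * m$means)    # prior-weighted centroid used for centring
manual <- scale(as.matrix(iris[, 1:4]),
                center = ctr, scale = FALSE) %*% m$scaling

all.equal(unname(p$x), unname(manual))  # should be TRUE up to numerical noise
```

So `x$x` corresponds to `(val2 - centroid) %*% m$scaling` rather than `apply(val2 * m$scaling, 2, sum)`.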
Re: [R] plot table with sapply - labeling problems
Here is a modified script that should work. In many cases where you want the names of the elements of the list you are processing, you should work with the names: test <- as.data.frame(cbind(round(runif(50,0,5)), round(runif(50,0,3)), round(runif(50,0,4)))) sapply(test, table) -> vardist sapply(test, function(x) round(table(x)/sum(table(x))*100,1)) -> vardist1 par(mfrow=c(1,3)) # you need to use the 'names' and then index into the variable # your original 'x' did not have any names associated with it sapply(names(vardist1), function(x) barplot(vardist1[[x]], ylim=c(0,100), main="Varset1", xlab=x)) par(mfrow=c(1,1)) On 8/9/07, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Hi List, I am trying to label a barplot group with variable names when using sapply, unsuccessfully. I can't seem to extract the names for the individual plots: test <- as.data.frame(cbind(round(runif(50,0,5)), round(runif(50,0,3)), round(runif(50,0,4)))) sapply(test, table) -> vardist sapply(test, function(x) round(table(x)/sum(table(x))*100,1)) -> vardist1 par(mfrow=c(1,3)) sapply(vardist1, function(x) barplot(x, ylim=c(0,100), main="Varset1", xlab=names(x))) par(mfrow=c(1,1)) Names don't show up although names(vardist) works. Also I would like to put a single title on this plot instead of repeating "Varset1" three times. Any hints appreciated. Thanx Herry -- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem you are trying to solve?
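On the single-title question (not addressed above): one possibility is to reserve an outer margin with par(oma=...) and write the title there once with mtext. A sketch, assuming vardist1 as built in the script above:

```r
# one shared title instead of main="Varset1" on each panel
par(mfrow = c(1, 3), oma = c(0, 0, 2, 0))   # 2 lines of outer top margin
sapply(names(vardist1), function(x)
  barplot(vardist1[[x]], ylim = c(0, 100), xlab = x))
mtext("Varset1", outer = TRUE, cex = 1.5)   # single title across all three panels
par(mfrow = c(1, 1))
```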
Re: [R] Systematically biased count data regression model
Dear all, I received a very helpful response from someone who requested anonymity, but to whom I am grateful. PLEASE do not quote my name or email (I am trying to stay off spam lists) Matthew: I think this is just a reflection of the fact that the model does not fit perfectly. The example below is a simple linear regression that is highly significant but has an R-square of 16%. This model as well is biased high at low observed values of y and biased low at high observed values of y set.seed(1) n <- 200 m <- data.frame(x=rnorm(n,mean=10,sd=2)) m$y <- m$x + rnorm(n,sd=4) # simulate using intercept 0, slope 1 f <- lm(y ~ x, data=m) print(summary(f)) # # Call: # lm(formula = y ~ x, data = m) # # Residuals: # Min 1Q Median 3Q Max # -11.7310 -2.1709 -0.1009 2.6733 10.3446 # # Coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 0.6274 1.5830 0.396 0.692 # x 0.9538 0.1546 6.170 3.77e-09 *** # --- # Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 # # Residual standard error: 4.052 on 198 degrees of freedom # Multiple R-Squared: 0.1613, Adjusted R-squared: 0.157 # F-statistic: 38.07 on 1 and 198 DF, p-value: 3.773e-09 # plot(m$y, f$fitted.values, xlab="Observed", ylab="Predicted") lines(lowess(m$y, f$fitted.values), col="red", lty=2) abline(c(0,1)) legend("topleft", lty=c(2,1), col=c("red","black"), legend=c("Loess","45-degree")) At 2007-08-09 08:43, Matthew and Kim Bowser wrote: Dear all, I am attempting to explain patterns of arthropod family richness (count data) using a regression model. It seems to be able to do a pretty good job as an explanatory model (i.e. demonstrating relationships between dependent and independent variables), but it has systematic problems as a predictive model: It is biased high at low observed values of family richness and biased low at high observed values of family richness (see attached pdf). I have tried diverse kinds of reasonable regression models mostly as in Zeileis, et al.
(2007), as well as transforming my variables, both with only small improvements. Do you have suggestions for making a model that would perform better as a predictive model? Thank you for your time. Sincerely, Matthew Bowser STEP student USFWS Kenai National Wildlife Refuge Soldotna, Alaska, USA M.Sc. student University of Alaska Fairbanks Fairbanks, Alaska, USA Reference Zeileis, A., C. Kleiber, and S. Jackman, 2007. Regression models for count data in R. Technical Report 53, Department of Statistics and Mathematics, Wirtschaftsuniversität Wien, Wien, Austria. URL http://cran.r-project.org/doc/vignettes/pscl/countreg.pdf. [snip] #This appears to be a decent explanatory model, but as a predictive model it is systematically biased. It is biased high at low observed values of D and biased low at high observed values of D.
[R] odfWeave processing error, file specific
Hello, I hope there is a simple explanation for this. I have been using odfWeave with great satisfaction in R 2.5.0. Unfortunately, I cannot get beyond the following error message with a particular file. I have copied and pasted into new files and the same error pops up. It looks like the error is occurring before any of the R code is run (?). Any suggestions on how to track this down and fix it? odfWeave('balf.odt', 'balfout.odt') Copying balf.odt Setting wd to /tmp/Rtmpz0aWPf/odfWeave09155238949 Unzipping ODF file using unzip -o balf.odt Archive: balf.odt extracting: mimetype creating: Configurations2/statusbar/ inflating: Configurations2/accelerator/current.xml creating: Configurations2/floater/ creating: Configurations2/popupmenu/ creating: Configurations2/progressbar/ creating: Configurations2/menubar/ creating: Configurations2/toolbar/ creating: Configurations2/images/Bitmaps/ inflating: layout-cache inflating: content.xml inflating: styles.xml inflating: meta.xml inflating: Thumbnails/thumbnail.png inflating: settings.xml inflating: META-INF/manifest.xml Removing balf.odt Creating a Pictures directory Pre-processing the contents Error: cc$parentId == parentId is not TRUE Thanks, aric
[R] Subsetting by number of observations in a factor
Hi, I generally do my data preparation externally to R, so this is a bit unfamiliar to me, but a colleague has asked me how to do certain data manipulations within R. Basically, I can get his large file into a dataframe. One of the columns is a management group code (mg). There may be varying numbers of observations per management group, and he would like to subset the dataframe so that there are always at least n observations per management group. I presume I can get there using table or tapply, then (and I'm not sure how on this bit) creating a column nmg containing the number of observations that corresponds to mg for that row, then simply subsetting. So, am I on the right track? If so, how do I actually do it, and is there an easier method than the one I am considering? Thanks for your help, Ron
Re: [R] Systematically biased count data regression model
Perhaps you don't really need to predict the precise count. Maybe it's good enough to predict whether the count is above or below average. In that case the model is 74% correct on a holdout sample of the last 54 points, based on a model of the first 200 points.

# create model on first 200 and predict on rest
DD <- data$D > mean(data$D)
mod <- glm(DD ~ ., data[-1], family = binomial, subset = 1:200)
tab <- table(predict(mod, data[201:254, -1], type = "resp") > .5, DD[201:254])
sum(tab * diag(2)) / sum(tab)
[1] 0.7407407

On 8/9/07, Matthew and Kim Bowser [EMAIL PROTECTED] wrote: Dear all, I am attempting to explain patterns of arthropod family richness (count data) using a regression model. It seems to be able to do a pretty good job as an explanatory model (i.e. demonstrating relationships between dependent and independent variables), but it has systematic problems as a predictive model: it is biased high at low observed values of family richness and biased low at high observed values of family richness (see attached pdf). I have tried diverse kinds of reasonable regression models, mostly as in Zeileis et al. (2007), as well as transforming my variables, both with only small improvements. Do you have suggestions for making a model that would perform better as a predictive model? Thank you for your time. Sincerely, Matthew Bowser STEP student USFWS Kenai National Wildlife Refuge Soldotna, Alaska, USA M.Sc. student University of Alaska Fairbanks Fairbanks, Alaska, USA Reference Zeileis, A., C. Kleiber, and S. Jackman, 2007. Regression models for count data in R. Technical Report 53, Department of Statistics and Mathematics, Wirtschaftsuniversität Wien, Wien, Austria. URL http://cran.r-project.org/doc/vignettes/pscl/countreg.pdf.
Code `data` - structure(list(D = c(4, 5, 12, 4, 9, 15, 4, 8, 3, 9, 6, 17, 4, 9, 6, 9, 3, 9, 7, 11, 17, 3, 10, 8, 9, 6, 7, 9, 7, 5, 15, 15, 12, 9, 10, 4, 4, 15, 7, 7, 12, 7, 12, 7, 7, 7, 5, 14, 7, 13, 1, 9, 2, 13, 6, 8, 2, 10, 5, 14, 4, 13, 5, 17, 12, 13, 7, 12, 5, 6, 10, 6, 6, 10, 4, 4, 12, 10, 3, 4, 4, 6, 7, 15, 1, 8, 8, 5, 12, 0, 5, 7, 4, 9, 6, 10, 5, 7, 7, 14, 3, 8, 15, 14, 7, 8, 7, 8, 8, 10, 9, 2, 7, 8, 2, 6, 7, 9, 3, 20, 10, 10, 4, 2, 8, 10, 10, 8, 8, 12, 8, 6, 16, 10, 5, 1, 1, 5, 3, 11, 4, 9, 16, 3, 1, 6, 5, 5, 7, 11, 11, 5, 7, 5, 3, 2, 3, 0, 3, 0, 4, 1, 12, 16, 9, 0, 7, 0, 11, 7, 9, 4, 16, 9, 10, 0, 1, 9, 15, 6, 8, 6, 4, 6, 7, 5, 7, 14, 16, 5, 8, 1, 8, 2, 10, 9, 6, 11, 3, 16, 3, 6, 8, 12, 5, 1, 1, 3, 3, 1, 5, 15, 4, 2, 2, 6, 5, 0, 0, 0, 3, 0, 16, 0, 9, 0, 0, 8, 1, 2, 2, 3, 4, 17, 4, 1, 4, 6, 4, 3, 15, 2, 2, 13, 1, 9, 7, 7, 13, 10, 11, 2, 15, 7), Day = c(159, 159, 159, 159, 166, 175, 161, 168, 161, 166, 161, 166, 161, 161, 161, 175, 161, 175, 161, 165, 176, 161, 163, 161, 168, 161, 161, 161, 161, 161, 165, 176, 175, 176, 163, 175, 163, 168, 163, 176, 176, 165, 176, 175, 161, 163, 163, 168, 163, 175, 167, 176, 167, 165, 165, 169, 165, 169, 165, 161, 165, 175, 165, 176, 175, 167, 167, 175, 167, 164, 167, 164, 181, 164, 167, 164, 176, 164, 167, 164, 167, 164, 167, 175, 167, 173, 176, 173, 178, 167, 173, 172, 173, 178, 178, 172, 181, 182, 173, 162, 162, 173, 178, 173, 172, 162, 173, 162, 173, 162, 173, 170, 178, 166, 166, 162, 166, 177, 166, 170, 166, 172, 172, 166, 172, 166, 174, 162, 164, 162, 170, 164, 170, 164, 170, 164, 177, 164, 164, 174, 174, 162, 170, 162, 172, 162, 165, 162, 165, 177, 172, 162, 170, 162, 170, 174, 165, 174, 166, 172, 174, 172, 174, 170, 170, 165, 170, 174, 174, 172, 174, 172, 174, 165, 170, 165, 170, 174, 172, 174, 172, 175, 175, 170, 171, 174, 174, 174, 172, 175, 171, 175, 174, 174, 174, 175, 172, 171, 171, 174, 160, 175, 160, 171, 170, 175, 170, 170, 160, 160, 160, 171, 171, 171, 171, 160, 160, 160, 171, 171, 176, 171, 176, 176, 171, 
176, 171, 176, 176, 176, 176, 159, 166, 159, 159, 166, 168, 169, 159, 168, 169, 166, 163, 180, 163, 165, 164, 180, 166, 166, 164, 164, 177, 166), NDVI = c(0.187, 0.2, 0.379, 0.253, 0.356, 0.341, 0.268, 0.431, 0.282, 0.181, 0.243, 0.327, 0.26, 0.232, 0.438, 0.275, 0.169, 0.288, 0.138, 0.404, 0.386, 0.194, 0.266, 0.23, 0.333, 0.234, 0.258, 0.333, 0.234, 0.096, 0.354, 0.394, 0.304, 0.162, 0.565, 0.348, 0.345, 0.226, 0.316, 0.312, 0.333, 0.28, 0.325, 0.243, 0.194, 0.29, 0.221, 0.217, 0.122, 0.289, 0.475, 0.048, 0.416, 0.481, 0.159, 0.238, 0.183, 0.28, 0.32, 0.288, 0.24, 0.287, 0.363, 0.367, 0.24, 0.55, 0.441, 0.34, 0.295, 0.23, 0.32, 0.184, 0.306, 0.232, 0.289, 0.341, 0.221, 0.333, 0.17, 0.139, 0.2, 0.204, 0.301, 0.253, -0.08, 0.309, 0.232, 0.23, 0.239, -0.12, 0.26, 0.285, 0.45, 0.348, 0.396, 0.311, 0.318, 0.31, 0.261, 0.441, 0.147, 0.283, 0.339, 0.224, 0.5, 0.265, 0.2, 0.287, 0.398, 0.116, 0.292, 0.045, 0.137, 0.542, 0.171, 0.38, 0.469, 0.325, 0.139, 0.166, 0.247, 0.253, 0.466, 0.26, 0.288, 0.34, 0.288, 0.26, 0.178, 0.274, 0.358, 0.285, 0.225, 0.162, 0.223, 0.301, -0.398, -0.2, 0.239, 0.228, 0.255, 0.166, 0.306, 0.28, 0.279, 0.208,
Re: [R] Systematically biased count data regression model
Matthew, In response to that post, I am afraid I have to disagree. I think a poor model fit (e.g. an R-squared of 16%) reflects a lot of unmeasured factors and therefore random error in the model, which would explain why overall predictive performance is poor (i.e. a lot of error in the model). Your situation is different: you are having trouble predicting the extreme values, so there is something systematic (your model works well in the middle and worse at the tails), not poor overall. As the post does reflect, error in prediction is a fact of life, as others have stated, and most of us who suffer from prediction error experience it at the more extreme values. Thanks, Paul

Matthew and Kim Bowser [EMAIL PROTECTED] wrote: Dear all, I received a very helpful response from someone who requested anonymity, but to whom I am grateful. PLEASE do not quote my name or email (I am trying to stay off spam lists) Matthew: I think this is just a reflection of the fact that the model does not fit perfectly. The example below is a simple linear regression that is highly significant but has an R-squared of 16%. This model as well is biased high at low observed values of y and biased low at high observed values of y.

set.seed(1)
n <- 200
m <- data.frame(x = rnorm(n, mean = 10, sd = 2))
m$y <- m$x + rnorm(n, sd = 4)  # simulate using intercept 0, slope 1
f <- lm(y ~ x, data = m)
print(summary(f))
#
# Call:
# lm(formula = y ~ x, data = m)
#
# Residuals:
#      Min       1Q   Median       3Q      Max
# -11.7310  -2.1709  -0.1009   2.6733  10.3446
#
# Coefficients:
#             Estimate Std. Error t value Pr(>|t|)
# (Intercept)   0.6274     1.5830   0.396    0.692
# x             0.9538     0.1546   6.170 3.77e-09 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#
# Residual standard error: 4.052 on 198 degrees of freedom
# Multiple R-Squared: 0.1613, Adjusted R-squared: 0.157
# F-statistic: 38.07 on 1 and 198 DF, p-value: 3.773e-09

plot(m$y, f$fitted.values, xlab = "Observed", ylab = "Predicted")
lines(lowess(m$y, f$fitted.values), col = "red", lty = 2)
abline(c(0, 1))
legend("topleft", lty = c(2, 1), col = c("red", "black"), legend = c("Loess", "45-degree"))

At 2007-08-09 08:43, Matthew and Kim Bowser wrote: [snip] #This appears to be a decent explanatory model, but as a predictive model it is systematically biased. It is biased high at low observed values of D and biased low at high observed values of D.
Re: [R] Systematically biased count data regression model
I guess I should not have been so quick to make that conclusion, since it seems that 74% of the values in the holdout set are FALSE, so simply guessing FALSE for each one would give us 74% accuracy:

table(DD[201:254])

FALSE  TRUE
   40    14

40/54
[1] 0.7407407

On 8/9/07, Gabor Grothendieck [EMAIL PROTECTED] wrote: [snip]
Re: [R] small sample techniques
Hi Murli, First of all, regarding prop.test, you made a typo: you should have used prop.test(c(69,90),c(300,300)), which gives you the squared value of 3.4228, and its square root is 1.85, which is not too far from 1.94. I would use the Fisher exact test (fisher.test). The two-sided test has a p-value of 0.06411, so you do not reject H0; the one-sided test (i.e. H1 is that the first probability of success is smaller than the second) has a p-value of 0.03206, so you reject H0 (at the 95% confidence level). You get similar results with two-sided and one-sided t-tests. Moshe. P.S. If you use a paired t-test you get nonsense, since it uses pairwise differences, and in your case only 21 of 300 differences are non-zero!

--- Nair, Murlidharan T [EMAIL PROTECTED] wrote: n=300; 30% taking A report relief from pain; 23% taking B report relief from pain. Question: if there is no difference, are we likely to get a 7% difference? Hypothesis H0: p1-p2=0; H1: p1-p2!=0 (not equal to).

1. Weighted average of the two sample proportions: (300(0.30)+300(0.23))/(300+300) = 0.265
2. Std error estimate of the difference between two independent proportions: sqrt((0.265*0.735)*((1/300)+(1/300))) = 0.03603
3. Evaluation of the difference between sample proportions as a deviation from the hypothesized difference of zero: ((0.30-0.23)-(0))/0.03603 = 1.94

z did not approach 1.96, hence H0 is not rejected. This is what I was trying to do using prop.test: prop.test(c(30,23),c(300,300)). What function should I use?

-Original Message- From: [EMAIL PROTECTED] on behalf of Nordlund, Dan (DSHS/RDA) Sent: Thu 8/9/2007 1:26 PM To: r-help@stat.math.ethz.ch Subject: Re: [R] small sample techniques

-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Nair, Murlidharan T Sent: Thursday, August 09, 2007 9:19 AM To: Moshe Olshansky; Rolf Turner; r-help@stat.math.ethz.ch Subject: Re: [R] small sample techniques

Thanks, that discussion was helpful. Well, I have another question. I am comparing two proportions for deviation from the hypothesized difference of zero. My manually calculated z ratio is 1.94, but when I calculate it using prop.test, it uses Pearson's chi-squared test, and the X-squared value that it gives is 0.74. Is there a function in R where I can calculate the z ratio? Which is

Z = ((p'1 - p'2) - (p1 - p2)) / S(p'1 - p'2)

where p'1, p'2 are the sample proportions and S(p'1 - p'2) is the standard error estimate of the difference between two independent proportions. Dummy example; this is how I use it: prop.test(c(30,23),c(300,300)). Cheers../Murli

Murli, I think you need to recheck your computations. You can run a t-test on your data in a variety of ways. Here is one:

x <- c(rep(1,30), rep(0,270))
y <- c(rep(1,23), rep(0,277))
t.test(x,y)

        Welch Two Sample t-test

data:  x and y
t = 1.0062, df = 589.583, p-value = 0.3147
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.02221086  0.06887752
sample estimates:
 mean of x  mean of y
0.10000000 0.07666667

Hope this is helpful, Dan

Daniel J. Nordlund Research and Data Analysis Washington State Department of Social and Health Services Olympia, WA 98504-5204
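[Editor's note] Murli's three manual steps can be reproduced directly in R, and the connection to prop.test becomes clear once the continuity correction is switched off. A sketch, assuming the corrected counts of 90/300 vs 69/300 (i.e. 30% vs 23%):

```r
# Pooled (weighted average) proportion and z statistic for the
# difference of two independent proportions
p1 <- 90/300
p2 <- 69/300
p  <- (90 + 69) / (300 + 300)              # pooled proportion: 0.265
se <- sqrt(p * (1 - p) * (1/300 + 1/300))  # std error of the difference
z  <- (p1 - p2) / se                       # about 1.94

# With the continuity correction disabled, prop.test's Pearson
# X-squared statistic is exactly z^2
pt <- prop.test(c(90, 69), c(300, 300), correct = FALSE)
all.equal(unname(pt$statistic), z^2)
```

With the default continuity correction left on, prop.test instead reports the X-squared of 3.4228 that Moshe mentions, whose square root is 1.85.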
Re: [R] Subsetting by number of observations in a factor
Does this do what you want? It creates a new dataframe with those 'mg' that have at least a certain number of observations.

set.seed(2)
# create some test data
x <- data.frame(mg = sample(LETTERS[1:4], 20, TRUE), data = 1:20)
# split the data into subsets based on 'mg'
x.split <- split(x, x$mg)
str(x.split)
List of 4
 $ A:'data.frame': 7 obs. of 2 variables:
  ..$ mg  : Factor w/ 4 levels "A","B","C","D": 1 1 1 1 1 1 1
  ..$ data: int [1:7] 1 4 7 12 14 18 20
 $ B:'data.frame': 3 obs. of 2 variables:
  ..$ mg  : Factor w/ 4 levels "A","B","C","D": 2 2 2
  ..$ data: int [1:3] 9 15 19
 $ C:'data.frame': 4 obs. of 2 variables:
  ..$ mg  : Factor w/ 4 levels "A","B","C","D": 3 3 3 3
  ..$ data: int [1:4] 2 3 10 11
 $ D:'data.frame': 6 obs. of 2 variables:
  ..$ mg  : Factor w/ 4 levels "A","B","C","D": 4 4 4 4 4 4
  ..$ data: int [1:6] 5 6 8 13 16 17

# only choose subsets with at least 5 observations
x.5 <- lapply(x.split, function(a) {
    if (nrow(a) >= 5) return(a)
    else return(NULL)
})
# create new dataframe with these observations
x.new <- do.call('rbind', x.5)
x.new
     mg data
A.1   A    1
A.4   A    4
A.7   A    7
A.12  A   12
A.14  A   14
A.18  A   18
A.20  A   20
D.5   D    5
D.6   D    6
D.8   D    8
D.13  D   13
D.16  D   16
D.17  D   17

On 8/9/07, Ron Crump [EMAIL PROTECTED] wrote: [snip]

-- Jim Holtman Cincinnati, OH +1 513 646 9390 What is the problem you are trying to solve?
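[Editor's note] A more compact variant of the same idea is sketched below: ave() attaches the per-group observation count to every row (the nmg column Ron describes), so no split/rbind round-trip is needed. Test data are built the same way as in Jim's example; the exact group sizes depend on the R version's RNG.

```r
set.seed(2)
x <- data.frame(mg = sample(LETTERS[1:4], 20, TRUE), data = 1:20)

# attach to each row the number of observations in its management group
n.mg <- ave(seq_along(x$mg), x$mg, FUN = length)

# keep only rows belonging to groups with at least 5 observations
x.new <- x[n.mg >= 5, ]
table(as.character(x.new$mg))
```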
Re: [R] Seasonality
?monthplot ?stl On 8/10/07, Alberto Monteiro [EMAIL PROTECTED] wrote: I have a time series x = f(t), where t is taken for each month. What is the best function to detect if _x_ has a seasonal variation? If there is such seasonal effect, what is the best function to estimate it? Function arima has a seasonal parameter, but I guess this is too complex to be useful. Alberto Monteiro -- Felix Andrews / 安福立 PhD candidate Integrated Catchment Assessment and Management Centre The Fenner School of Environment and Society The Australian National University (Building 48A), ACT 0200 Beijing Bag, Locked Bag 40, Kingston ACT 2604 http://www.neurofractal.org/felix/ voice:+86_1051404394 (in China) mobile:+86_13522529265 (in China) mobile:+61_410400963 (in Australia) xmpp:[EMAIL PROTECTED] 3358 543D AAC6 22C2 D336 80D9 360B 72DD 3E4C F5D8
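[Editor's note] A minimal illustration of the two functions Felix points to, using the built-in monthly co2 series (any monthly ts object would do): stl decomposes the series, and a strong, regular seasonal component in the decomposition indicates seasonality.

```r
# Decompose a monthly series into seasonal, trend and remainder components
fit <- stl(co2, s.window = "periodic")
summary(fit$time.series[, "seasonal"])

# month-by-month view of the seasonal pattern in the decomposition
monthplot(fit)
```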
Re: [R] Systematically biased count data regression model
Here is one other idea. Since we are not doing that well with the entire data set, let's look at a portion and see if we can do better there. This line of code seems to show that D is related to T:

plot(data)

so let's try conditioning D ~ T on all combos of the factor levels:

library(lattice)
xyplot(D ~ T | Hemlock * Snow * Alpine, data, layout = c(2, 4))

from which it appears there is a much clearer association between D and T when Alpine = 1. Thus let's condition on Alpine = 1, run it over again and eliminate the non-significant variables:

library(MASS)  # for glm.nb
mod <- glm.nb(D ~ Day + NDVI + T, data = data, subset = Alpine == 1)
summary(mod)
plot(data$D[data$Alpine == 1], mod$fitted.values)
lines(lowess(data$D[data$Alpine == 1], mod$fitted.values), lty = 2)
abline(a = 0, b = 1)

This time it's still slightly biased at the low end but not elsewhere, although we have paid a price for this by only looking at the 40 Alpine points (out of 254 points).

On 8/9/07, Gabor Grothendieck [EMAIL PROTECTED] wrote: [snip]
Re: [R] odfWeave processing error, file specific
Aric, Can you send me a reproducible example (code and odt file) plus the results of sessionInfo()? Thanks, Max

-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Aric Gregson Sent: Thursday, August 09, 2007 6:56 PM To: r-help@stat.math.ethz.ch Subject: [R] odfWeave processing error, file specific [snip]
Re: [R] Systematically biased count data regression model
Hi Matthew, You may be experiencing the classic 'regression towards the mean' phenomenon, in which case shrinkage estimation may help with prediction (extremely low and high values need to be shrunk back towards the mean). Here's a reference that discusses the issue in a manner somewhat related to your situation, and it has plenty of good references:

Application of Shrinkage Techniques in Logistic Regression Analysis: A Case Study. E. W. Steyerberg. Statistica Neerlandica, 2001, vol. 55, issue 1, pages 76-88.

Steven McKinney Statistician Molecular Oncology and Breast Cancer Program British Columbia Cancer Research Centre email: smckinney +at+ bccrc +dot+ ca tel: 604-675-8000 x7561 BCCRC Molecular Oncology 675 West 10th Ave, Floor 4 Vancouver B.C. V5Z 1L3 Canada

-Original Message- From: [EMAIL PROTECTED] on behalf of Matthew and Kim Bowser Sent: Thu 8/9/2007 8:43 AM To: r-help@stat.math.ethz.ch Subject: [R] Systematically biased count data regression model [snip]
[R] compute ROC curve?
Hello,

I have continuous test results for diseased and nondiseased subjects, say X and Y. Both are vectors of numbers. Is there any R function which can generate the step function of the ROC curve automatically?

Thanks!
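Packages such as ROCR provide this directly, but the empirical ROC step function is also a few lines of base R. A minimal sketch, assuming X holds scores for diseased subjects, Y for nondiseased, and that higher scores indicate disease:

```r
# Empirical ROC curve from two score vectors (base-R sketch).
# X = scores for diseased subjects, Y = scores for nondiseased subjects.
roc_points <- function(X, Y) {
  cutoffs <- sort(unique(c(X, Y, Inf)), decreasing = TRUE)
  tpr <- sapply(cutoffs, function(k) mean(X >= k))  # sensitivity
  fpr <- sapply(cutoffs, function(k) mean(Y >= k))  # 1 - specificity
  data.frame(fpr = fpr, tpr = tpr)
}

set.seed(1)
X <- rnorm(50, mean = 1)   # diseased (illustrative data)
Y <- rnorm(50, mean = 0)   # nondiseased
r <- roc_points(X, Y)

# Draw the step function:
# plot(r$fpr, r$tpr, type = "s", xlab = "1 - specificity", ylab = "sensitivity")
```

Sweeping the cutoff from +Inf downward traces the curve from (0, 0) to (1, 1).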
Re: [R] Subsetting by number of observations in a factor
Jim,

> Does this do what you want? It creates a new dataframe with those 'mg'
> that have at least a certain number of observations.

Looks good. I also have an alternative solution which appears to work, so I'll see which is quicker on the big data set in question. My solution:

mgsize <- as.data.frame(table(dat$mg))   # 'dat' rather than 'in'; 'in' is a reserved word in R
in2 <- merge(dat, mgsize, by.x = "mg", by.y = "Var1")
out <- subset(in2, Freq > 1, select = -Freq)

Thanks for your help.

Ron.
[R] error message: image not found
I have R version 2.4 and I installed R version 2.5 (the current version) on Mac OS X 10.4.10. I tried dyn.load to load an object code file compiled from C source. I got the following error message:

Error in dyn.load(x, as.logical(local), as.logical(now)) :
  unable to load shared library '/Users/jusong/Desktop/BPM/R/group.so':
  dlopen(/Users/jusong/Desktop/BPM/R/group.so, 6): Library not loaded:
  /Library/Frameworks/R.framework/Versions/2.4/Resources/lib/libR.dylib
  Referenced from: /Users/jusong/Desktop/BPM/R/group.so
  Reason: image not found

How can I fix this problem?

Best,
Jungeun Song
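The error message says group.so was linked against the 2.4 R framework, which no longer resolves under 2.5. The usual remedy is to recompile the shared object under the newly installed R. A sketch of the diagnosis and fix on macOS (paths are taken from the error message; the commands assume Apple's developer tools and a C source file group.c are present):

```shell
# Show which R framework the shared object was linked against;
# expect a line referencing .../Versions/2.4/Resources/lib/libR.dylib
otool -L /Users/jusong/Desktop/BPM/R/group.so

# Rebuild the object under the currently installed R (2.5) so it
# links against the 2.5 framework instead
cd /Users/jusong/Desktop/BPM/R
R CMD SHLIB group.c
```

After rebuilding, dyn.load("group.so") should pick up the library shipped with the running R.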
Re: [R] Subsetting by number of observations in a factor
Here is an even faster way:

# faster way
x.mg.size <- table(x$mg)                     # count occurrences
x.mg.5 <- names(x.mg.size)[x.mg.size > 5]    # select levels occurring more than 5 times
x.new1 <- subset(x, x$mg %in% x.mg.5)        # use in the subset
x.new1
   mg data
1   A    1
4   A    4
5   D    5
6   D    6
7   A    7
8   D    8
12  A   12
13  D   13
14  A   14
16  D   16
17  D   17
18  A   18
20  A   20

On 8/9/07, Ron Crump [EMAIL PROTECTED] wrote:
> Jim,
> Does this do what you want? It creates a new dataframe with those 'mg'
> that have at least a certain number of observations.
> Looks good. I also have an alternative solution which appears to work,
> so I'll see which is quicker on the big data set in question. My solution:
> mgsize <- as.data.frame(table(dat$mg))
> in2 <- merge(dat, mgsize, by.x = "mg", by.y = "Var1")
> out <- subset(in2, Freq > 1, select = -Freq)
> Thanks for your help.
> Ron.

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?
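The table-based approach can be checked end to end on a small reproducible frame (the data frame, grouping column name, and threshold below are illustrative, not from the thread):

```r
# Keep only rows whose 'mg' level occurs more than a threshold number of times.
dat <- data.frame(mg  = c("A", "A", "A", "B", "C", "C"),
                  val = 1:6)

size <- table(dat$mg)            # occurrences per level: A=3, B=1, C=2
keep <- names(size)[size > 1]    # levels with more than one row
out  <- dat[dat$mg %in% keep, ]

out
#   mg val
# 1  A   1
# 2  A   2
# 3  A   3
# 5  C   5
# 6  C   6
```

This avoids the merge entirely, which is why it tends to be faster on large data sets: `table` and `%in%` are single vectorized passes, while `merge` has to build and join an intermediate frame.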
Re: [R] Tukey HSD
Please see the R-help message http://finzi.psych.upenn.edu/R/Rhelp02a/archive/105165.html
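For reference, Tukey's HSD is available in base R via TukeyHSD() applied to an aov fit. A minimal sketch with simulated three-group data (the linked post covers the details):

```r
# Tukey HSD on a one-way ANOVA (simulated data, three groups)
set.seed(42)
dat <- data.frame(
  group = factor(rep(c("a", "b", "c"), each = 10)),
  y     = c(rnorm(10, 0), rnorm(10, 1), rnorm(10, 3))
)

fit <- aov(y ~ group, data = dat)
TukeyHSD(fit)   # all pairwise differences with family-wise adjusted intervals
```

Each row of the result gives the estimated difference for one pair of groups, its confidence interval, and an adjusted p-value.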
Re: [R] small sample techniques
-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Nair, Murlidharan T
Sent: Thursday, August 09, 2007 12:02 PM
To: Nordlund, Dan (DSHS/RDA); r-help@stat.math.ethz.ch
Subject: Re: [R] small sample techniques

> n=300
> 30% taking A report relief from pain
> 23% taking B report relief from pain
> Question: if there is no difference, are we likely to get a 7% difference?
>
> Hypothesis
> H0: p1-p2=0
> H1: p1-p2!=0 (not equal to)
>
> 1. Weighted average of the two sample proportions:
>    (300(0.30)+300(0.23)) / (300+300) = 0.265
> 2. Standard error estimate of the difference between two independent
>    proportions:
>    sqrt((0.265*0.735)*((1/300)+(1/300))) = 0.03603
> 3. Evaluation of the difference between sample proportions as a deviation
>    from the hypothesized difference of zero:
>    ((0.30-0.23)-(0))/0.03603 = 1.94
>
> z did not reach 1.96, hence H0 is not rejected. This is what I was trying
> to do using prop.test:
> prop.test(c(30,23),c(300,300))
> What function should I use?

I sent this from work but it seems to have disappeared into the luminiferous ether. The proportion test above indicates that p1=0.1 and p2=0.0767. But in your z test you specify p1=0.3 and p2=0.23. Which is correct? If p1=0.3 and p2=0.23, then use

prop.test(c(.30*300,.23*300),c(300,300))

Hope this is helpful,

Dan

Daniel J. Nordlund
Research and Data Analysis
Washington State Department of Social and Health Services
Olympia, WA 98504-5204
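The hand calculation above can be reproduced directly in R, and with the continuity correction turned off, prop.test's chi-squared statistic is exactly the square of the z value (the counts 90 and 69 are 30% and 23% of 300):

```r
# Two-sample z test for proportions, by hand and via prop.test
n1 <- 300; n2 <- 300
x1 <- 0.30 * n1   # 90 successes in group A
x2 <- 0.23 * n2   # 69 successes in group B

p_pool <- (x1 + x2) / (n1 + n2)                       # pooled proportion, 0.265
se     <- sqrt(p_pool * (1 - p_pool) * (1/n1 + 1/n2)) # about 0.03603
z      <- (x1/n1 - x2/n2) / se                        # about 1.94

pt <- prop.test(c(x1, x2), c(n1, n2), correct = FALSE)
c(z = z, z_squared = z^2, chisq = unname(pt$statistic))
```

Note that prop.test defaults to a Yates continuity correction, which is why its output does not match the uncorrected z computation unless correct = FALSE is given.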