Re: [R] Tools For Preparing Data For Analysis
An additional option for Windows users is Micro Osiris: http://www.microsiris.com/

best
robert

On 6/7/07, Robert Wilkins [EMAIL PROTECTED] wrote:

As noted on the R-project web site itself ( www.r-project.org - Manuals - R Data Import/Export ), it can be cumbersome to prepare messy and dirty data for analysis with the R tool itself. I've also seen at least one S programming book (one of the yellow Springer ones) that says, more briefly, the same thing. The R Data Import/Export manual suggests using SAS, Perl, Python, and Java for this, with examples. It takes a bit of courage to say that (when you go to a corporate software web site, you'll never see a page saying "This is the type of problem our product is not the best at; here's what we suggest instead").

I'd like to provide a few more suggestions, especially for volunteers who are willing to evaluate new candidates.

SAS is fine if you're not paying for the license out of your own pocket. But maybe one reason you're using R is that you don't have thousands of spare dollars. Using Java for data cleaning is an exercise in sado-masochism: Java has a learning curve (almost) as steep as C++'s.

There are different types of data transformation, and for some data preparation problems an all-purpose programming language is a good choice (e.g. Perl, or maybe Python/Ruby). Perl, for example, has excellent regular expression facilities. However, for some types of complex, demanding data preparation problems, an all-purpose programming language is a poor choice. For example: cleaning up and preparing clinical lab data and adverse event data. You could do it in Perl, but it would take far too much time. A specialized programming language is needed. And since data transformation is quite different from data query, SQL is not the ideal solution either.

There are only three statistical programming languages that are well known, all dating from the 1970s: SPSS, SAS, and S. SAS is more popular than S for data cleaning.

If you're an R user with difficult data preparation problems, frankly you are out of luck, because the products I'm about to mention are new, unknown, and therefore regarded as immature. And while the founders of these products would be very happy if you kicked the tires, most people don't like to look at brand new products. Most innovators and inventors don't realize this; I've learned it the hard way. But if you are a volunteer who likes to help out by evaluating, comparing, and reporting on new candidates, you could certainly help both R users and the developers of these products by kicking their tires. There is a huge need for such volunteers.

1. DAP
An open source implementation of SAS.
The founder: Susan Bassein
Find it at: directory.fsf.org/math/stats (GNU GPL)

2. PSPP
An open source implementation of SPSS. The relatively early version number might not give a good idea of how mature the data transformation features are; it reflects the fact that the author has only recently started on the statistical tests.
The founder: Ben Pfaff, a grad student or professor in the Stanford CS department.
Also at: directory.fsf.org/math/stats (GNU GPL)

3. Vilno
This uses a programming language similar to SPSS and SAS, but quite unlike S. Essentially, it's a substitute for the SAS data step, and it also transposes data and calculates averages and the like. (No t-tests or regressions in this version.) I created this, mainly during the years 2001-2006. It's at version 0.85 and has a fairly low bug rate, in my opinion. The tarball includes about 100 test cases used for debugging - for logical calculation errors, but not for extremely high volumes of data. Maintenance of Vilno has slowed down because I am currently (desperately) looking for employment, but once I've found new employment and living quarters and settled in, I will continue to enhance Vilno in my spare time.
The founder: that would be me, Robert Wilkins
Find it at: code.google.com/p/vilno (GNU GPL) (In particular, the tarball at code.google.com/p/vilno/downloads/list, since I have yet to figure out how to use Subversion.)

4. Who knows?
It was not easy to find out about the existence of DAP and PSPP, so who knows what else is out there. However, I think you'll find a lot more statistics software (regression, etc.) out there, and not so much data transformation software. Not many people work on data preparation software. In fact, the category is so obscure that there isn't even one agreed term for it: data cleaning, data munging, data crunching, or just getting the data ready for analysis.
[R] understanding round() behavior
Dear all,

I am a little bit puzzled by the way round() works. Consider the following code:

    a <- 123456.3678
    round(a, digits = 10)
    [1] 123456.4

I would expect the outcome to be something like 123456.3678 or 123456.368; instead the computer gives me 123456.4 no matter how large digits is. Can anybody help me understand what I'm missing here?

Thanks again for your help.
Robert
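For what it is worth, the value does not appear to be lost; a minimal sketch, assuming the culprit is R's default print precision (7 significant digits) rather than round() itself:

    a <- 123456.3678
    r <- round(a, digits = 10)   # 10 exceeds the decimal places, so essentially no rounding
    r                            # auto-printed with getOption("digits") = 7 -> 123456.4
    print(r, digits = 12)        # 123456.3678
    sprintf("%.4f", r)           # "123456.3678"
    # options(digits = 12) would change the default display precision for the session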
Re: [R] How to write a function?
Hi Keti,

Before reinventing the wheel from scratch you might want to take a look at the survey package: http://faculty.washington.edu/tlumley/survey/

best
robert

On 4/24/07, Keti Cuko [EMAIL PROTECTED] wrote:

Hi, my name is Katie and I was wondering if you could help me with my problem. I am trying to write a function in R that computes the statistics (mean, standard error, confidence intervals) for stratified samples. I am not that familiar with R and I am having difficulties setting this function up. Any help or tips on where/how to do this?

Best,
Katie
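For example, a minimal sketch along those lines (the data frame dat and its columns stratum, wt and y are hypothetical placeholders for the actual survey data):

    library(survey)                  # install.packages("survey") if needed
    des <- svydesign(ids = ~1, strata = ~stratum, weights = ~wt, data = dat)
    m <- svymean(~y, des)            # stratified estimate of the mean and its standard error
    m
    confint(m)                       # confidence interval

If the sampling fractions are non-negligible, svydesign() also takes an fpc argument for a finite population correction.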
Re: [R] summary and min max
Has anyone created an alternative summary method where the rounding is applied only to digits to the right of the decimal point? I personally don't like the way summary() behaves on this particular issue, but I'm not sure how to modify it generically... (Of course one can always set digits=something_big, but this is inelegant and impractical when one doesn't know the magnitude of a number in advance.)

robert

On 4/23/07, Mike Prager [EMAIL PROTECTED] wrote:

Sebastian P. Luque [EMAIL PROTECTED] wrote:

I came across a case where there's a discrepancy between minimum and maximum values reported by 'summary' and the 'min' and 'max' functions:

summary() rounds by default. Thus its reporting oddball values is considered a feature, not a bug.

-- Mike Prager, NOAA, Beaufort, NC
* Opinions expressed are personal and not represented otherwise.
* Any use of tradenames does not constitute a NOAA endorsement.
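One possibility (a rough sketch, only lightly thought through; the wrapper name and the choice of 15 digits are arbitrary) is to suppress summary()'s own significant-digit rounding first and then round decimal places only:

    summary_dec <- function(x, dec = 2, ...) {
      s <- summary(x, digits = 15, ...)  # keep (nearly) full precision at this stage
      round(s, dec)                      # then round only to 'dec' decimal places
    }
    x <- c(0.00123456, 42.5, 123456.789)
    summary(x)          # default behaviour: values reduced to ~4 significant digits
    summary_dec(x, 3)   # min and max now match round(min(x), 3) and round(max(x), 3)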
Re: [R] Reasons to Use R
So I guess my question is... Is there any hope of R's core being modified to handle large datasets more gracefully? (You've mentioned SAS and SPSS; I'd add Stata to the list.) Or should we (the users of large datasets) expect to keep working with the present tools for the time to come?

robert

On 4/11/07, Marc Schwartz [EMAIL PROTECTED] wrote:

On Wed, 2007-04-11 at 11:26 -0500, Marc Schwartz wrote:

On Wed, 2007-04-11 at 17:56 +0200, Bi-Info (http://members.home.nl/bi-info) wrote:

I certainly have that idea too. SPSS functions in much the same way, although it specialises in PC applications. Memory for a PC is not a very expensive thing these days. On my first AT some extra memory cost 300 dollars or more; these days you get extra memory with a package of marshmallows or chocolate bars if you need it. All computations on a computer are discrete steps in a way, but I've heard that SAS computations are split up into strictly divided steps. That also makes procedures attachable, I've been told, and interchangeable. Different procedures can use the same code, which in turn is cheaper in memory usage or disk usage (the old days...). That makes SAS, by the way, a complicated machine to build, because procedures that are split up into numerous fragments make for complicated bookkeeping. If you do it that way, I've been told, you can do a lot of computation with very little memory. One guy actually computed quite complicated models with only 32MB or less, which wasn't very much for his type of calculations. Which means that SAS is efficient in memory handling, I think. It's not very efficient in dollar handling, I estimate.

Wilfred

snip

Oh... SAS is quite efficient in dollar handling, at least when it comes to the annual commercial licenses... along the same lines as the purported efficiency of the U.S. income tax system: How much money do you have? Send it in...

There is a reason why SAS is the largest privately held software company in the world, and it is not due to the academic licensing structure, which constitutes only about 12% of their revenue, based upon their public figures.

Hmmm... here is a classic example of the problems of reading pie charts. The figure I quoted above, which comes from reading the 2005 SAS Annual Report on their web site (such as it is for a private company), is taken from a 3D exploded pie chart (ick...). The pie chart uses 3 shades of grey and 5 shades of blue to differentiate 8 market segments and their percentages of total worldwide revenue. I mis-read the shade of grey allocated to Education as being 12% (actually 11.7%). A re-read of the chart, zooming in close on the pie in a PDF reader, appears to show that Education is in fact only 1.8% of their annual worldwide revenue. Government installations, which are presumably the other notable market segment with substantially discounted licenses, account for 14.6%.

The report is available here for anyone else who is curious:
http://www.sas.com/corporate/report05/annualreport05.pdf

Somebody needs to send SAS a copy of Tufte or Cleveland. I have to go and rest my eyes now... ;-)

Regards,
Marc
Re: [R] Keep R packages in the R installation
You don't say which OS you're using, but I infer from your other posting that it is MS Windows. In that case you can follow the instructions in the R for Windows FAQ:

http://cran.r-project.org/bin/windows/base/rw-FAQ.html#What_0027s-the-best-way-to-upgrade_003f

... and also read the mailing list posting guide.

best
robert

On 4/1/07, Tong Wang [EMAIL PROTECTED] wrote:

Hi, I just have a quick question here: when I install a new version of R, is there an easy way to keep the installed R packages?

Thanks a lot for any help.
tong
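In outline, the idea in that FAQ entry is roughly the following (a sketch only; see the FAQ for the exact steps, and note that the library path shown is just an illustration of a typical Windows install):

    # After installing the new R, copy your add-on package folders from the old
    # library directory (e.g. C:\Program Files\R\R-2.4.1\library) into the new
    # library directory, then in the new R run:
    update.packages(checkBuilt = TRUE, ask = FALSE)

so that the carried-over packages are updated or rebuilt for the new version.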
Re: [R] R code for Statistical Models in S ?
You might want to start by looking at the FAQs, http://cran.r-project.org/faqs.html, in particular http://cran.r-project.org/doc/FAQ/R-FAQ.html#R-and-S

robert

On 3/1/07, Charilaos Skiadas [EMAIL PROTECTED] wrote:

I just acquired a copy of Statistical Models in S, I guess most commonly known as the white book, and realized to my dismay that most of the code is not directly executable in R. I was wondering if there is a source discussing the things that are different and what the new ways of calling things are.

For instance, the first obstacle was the solder.balance data set. I found a solder data set in rpart which is very close to it, except that the Panel variable is not a factor, but that's easily fixed. The first problem is the next two calls, on pages 2 and 3. One is plot(solder.balance), which is supposed to produce a very different plot than it does in R (I actually don't know the name of the plot, which is part of the problem I guess). Then one is supposed to call plot.factor(skips ~ Opening + Mask), which I took to mean plot(skips ~ Opening + Mask, data=solder), and that worked, though I still haven't been able to make a direct call to plot.factor work (I keep getting a "could not find function plot.factor" error).

Anyway, I just wondered whether there is some page somewhere that discusses these little differences here and there, as I am sure there will be a number of other problems such as these along the way.

Haris Skiadas
Department of Mathematics and Computer Science
Hanover College
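For what it is worth, a minimal sketch of how the page 2-3 calls translate, assuming rpart's solder data is used as a stand-in for solder.balance:

    library(rpart)
    data(solder)
    solder$Panel <- factor(solder$Panel)         # Panel is not a factor in rpart's copy
    plot(skips ~ Opening + Mask, data = solder)  # one boxplot-style plot per factor term
    # plot.factor() is a plot() method registered in R's graphics package but not
    # exported, so it is reached through plot() dispatch rather than called by name;
    # hence the "could not find function" error when calling it directly.

This does not reproduce the white book output exactly; it only shows why the plot(formula) call works while a direct plot.factor() call appears to be missing.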
Re: [R] Robust standard errors in logistic regression
This discussion leads to another point which is more subtle, but more important...

You can always get Huber-White (a.k.a. robust) estimators of the standard errors, even in non-linear models like logistic regression. However, if you believe your errors do not satisfy the standard assumptions of the model, then you should not be running that model, as this might lead to biased parameter estimates. For instance, in the linear regression model you have consistent parameter estimates whether or not the errors are heteroskedastic. In non-linear models, however, heteroskedasticity will usually lead to biased parameter estimates (unless you fix it explicitly somehow).

Stata is famous for providing Huber-White std. errors in most of its regression estimates, whether linear or non-linear. But this is nonsensical in the non-linear models, since in those cases you would be consistently estimating the standard errors of inconsistent parameters. This point, and potential solutions to the problem, are nicely discussed in Wooldridge's Econometric Analysis of Cross Section and Panel Data.

On 7/5/06, Thomas Lumley [EMAIL PROTECTED] wrote:

On Wed, 5 Jul 2006, Martin Maechler wrote:

Celso == Celso Barros [EMAIL PROTECTED] on Wed, 5 Jul 2006 04:50:29 -0300 writes:

[...]

Celso: By the way, I was wondering if there is a way to use rlm (from MASS) to estimate robust standard errors for logistic regression?

rlm stands for 'robust lm'. What you need here is 'robust glm'. I've already replied to a similar message by you, mentioning the (relatively) new package robustbase. After installing it, you can use robustbase::glmrob().

We have a clash of terminology here. The robust standard errors that sandwich and robcov give are almost completely unrelated to glmrob(). My guess is that Celso wants glmrob(), but I don't know for sure.

The Huber/White sandwich variance estimator for parameters in an ordinary generalized linear model gives an estimate of the variance that is consistent if the systematic part of the model is correctly specified, and conservative otherwise. It is a computationally cheap linear approximation to the bootstrap. These variance estimators seem to usually be called model-robust, though I prefer Nils Hjort's suggestion of model-agnostic, which avoids confusion with robust statistics. This is what sandwich and robcov() do.

glmrob() and rlm() give robust estimation of regression parameters. That is, if the data come from a model that is close to the exponential family model underlying glm, the estimates will be close to the parameters from that exponential family model. This is the more common statistical sense of the term robust.

I think the confusion has been increased by the fact that earlier S implementations of robust regression didn't provide standard errors, whereas rlm() and glmrob() do. This was partly a quality-of-implementation issue and partly because of theoretical difficulties with, e.g., lms().

-thomas

Thomas Lumley
Assoc. Professor, Biostatistics
[EMAIL PROTECTED]
University of Washington, Seattle
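To make the terminological distinction concrete, a minimal sketch (the data frame dat, with a binary outcome y and covariate x, is a hypothetical placeholder):

    library(sandwich)    # Huber-White / "model-agnostic" variance estimators
    library(lmtest)
    fit <- glm(y ~ x, family = binomial, data = dat)
    coeftest(fit, vcov = vcovHC(fit))    # same coefficients, sandwich standard errors

    library(robustbase)  # robust *estimation* of the coefficients themselves
    glmrob(y ~ x, family = binomial, data = dat)

The first block changes only the standard errors attached to the ordinary glm fit; the second changes the fitting itself.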
Re: [R] is there a formatted output in R?
Dear Andy and Michael,

Please stick to the posting guide of this list, i.e.:

"Be tolerant. Rudeness is never warranted, but sometimes 'read the manual' is the appropriate response. Don't waste time discussing such matters on the list."

Robert Duval

On 3/11/06, Liaw, Andy [EMAIL PROTECTED] wrote:

From: Michael
Thank you for your reminder! I think you don't have to tell me to read the document.

I beg to differ. I'd bet many here do feel you need to read the documentation more carefully.

I have done that many times already. My feeling after reading the creating package manual is that my god, this job needs a Computer Science degree to do it. It is way too complicated.

If an amateur like me can do it, I'm quite sure a CS degree is not needed. How many packages available on CRAN do you think were created by those with CS degrees? How many in R Core do you think have CS degrees?

There should be simpler ways.

There are, and this question has been asked and answered many times over on this very list. Please do learn to search the archive, as the posting guide asks you to.

Andy
Re: [R] heckit with a probit
I don't know if I understand your problem very well, but the first reference that comes to mind is James Heckman, "Dummy Endogenous Variables in a Simultaneous Equation System", Econometrica (July 1978).

You might also find a good survey of the literature in Francis Vella, "Estimating models with sample selection bias: A survey", Journal of Human Resources, 1998, Vol. 33, pp. 127-169.

I don't know how many of the methods proposed there are already implemented in R, but in principle many of them are likelihood models that you could program yourself.

best
robert

On 2/27/06, David Hugh-Jones [EMAIL PROTECTED] wrote:

Hi,

I have data for voting behaviour on two (related) binary votes. I want to examine the second vote, running separate regressions for groups who voted different ways on the first vote. As the votes are not independent, I guess that there is an issue with selection bias. So I think I would like to fit a heckit-style model but with a binary dependent variable - in effect, two successive probits. Is there a way to do it in R? (Alternatively: am I thinking about this the right way?)

Cheers
David
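As an illustration of "programming the likelihood yourself", here is a rough sketch of a probit with probit selection (a censored bivariate probit), not the specific estimators of the papers cited above: s = 1{z'g + u > 0} is the first vote, y = 1{x'b + e > 0} is observed only when s = 1, and (u, e) are bivariate normal with correlation rho. The matrices Z and X and the 0/1 vectors s and y are hypothetical placeholders for the voting data.

    library(mvtnorm)

    negloglik <- function(par, y, X, s, Z) {
      kz <- ncol(Z); kx <- ncol(X)
      g   <- par[1:kz]
      b   <- par[(kz + 1):(kz + kx)]
      rho <- tanh(par[kz + kx + 1])          # keep rho inside (-1, 1)
      zi  <- drop(Z %*% g)
      xi  <- drop(X %*% b)
      ll  <- numeric(length(s))
      ll[s == 0] <- pnorm(-zi[s == 0], log.p = TRUE)   # not selected: P(u <= -z'g)
      for (i in which(s == 1)) {
        sgn  <- if (y[i] == 1) 1 else -1     # flip the outcome index when y = 0
        corr <- matrix(c(1, sgn * rho, sgn * rho, 1), 2)
        ll[i] <- log(pmvnorm(upper = c(zi[i], sgn * xi[i]), corr = corr))
      }
      -sum(ll)                               # optim() minimises
    }

    # start: initial values, e.g. separate probit estimates for g and b, and 0 for rho
    # fit <- optim(start, negloglik, y = y, X = X, s = s, Z = Z,
    #              method = "BFGS", hessian = TRUE)

The tanh() reparameterisation of rho and the brute-force loop over pmvnorm() are just the simplest things that work; this is meant as a starting point, not a polished estimator.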
Re: [R] draft of Comment on UCLA tech report
Being a Stata user in transition to R, I have to say that it would be fair to mention that handling large amounts of data might take an extra step in R. I understand that there are good reasons for R consuming more memory than Stata when handling large datasets, but it is worth warning potential newcomers that they might need to use MySQL or another database manager if they work with large datasets.

greetings
robert

On 1/27/06, Patrick Burns [EMAIL PROTECTED] wrote:

You may recall that there was a discussion of a technical report from the statistical consulting group at UCLA. I have a draft of a comment on that report, which you can get from

http://www.burns-stat.com/pages/Flotsam/uclaRcomment_draft1.pdf

I'm interested in comments: corrections, additions, deletions.

Patrick Burns
[EMAIL PROTECTED]
+44 (0)20 8525 0696
http://www.burns-stat.com
(home of S Poetry and A Guide for the Unwilling S User)
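That extra step can be fairly small; a minimal sketch of pulling a subset from MySQL into R, assuming the DBI and RMySQL packages and a running MySQL server (the database, table, and column names are hypothetical):

    library(DBI)
    library(RMySQL)
    con <- dbConnect(MySQL(), dbname = "surveys", user = "analyst", password = "...")
    dat <- dbGetQuery(con, "SELECT id, income, region FROM households WHERE year = 2005")
    dbDisconnect(con)

The point is that only the rows and columns actually needed for the analysis ever have to fit in R's memory.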