Re: [R] Tools For Preparing Data For Analysis

2007-06-07 Thread Robert Duval
An additional option for Windows users is Micro Osiris

http://www.microsiris.com/

best
robert

On 6/7/07, Robert Wilkins [EMAIL PROTECTED] wrote:
 As noted on the R-project web site itself ( www.r-project.org -
 Manuals - R Data Import/Export ), it can be cumbersome to prepare
 messy and dirty data for analysis with the R tool itself. I've also
 seen at least one S programming book (one of the yellow Springer ones)
 that says, more briefly, the same thing.
 The R Data Import/Export page gives examples using SAS, Perl,
 Python, and Java. It takes a bit of courage to say that (when you go
 to a corporate software web site, you'll never see a page saying "This
 is the type of problem that our product is not the best at; here's
 what we suggest instead"). I'd like to provide a few more
 suggestions, especially for volunteers who are willing to evaluate new
 candidates.

 SAS is fine if you're not paying for the license out of your own
 pocket. But maybe one reason you're using R is you don't have
 thousands of spare dollars.
 Using Java for data cleaning is an exercise in sado-masochism; Java
 has a learning curve (almost) as steep as C++'s.

 There are different types of data transformation, and for some data
 preparation problems an all-purpose programming language is a good
 choice (e.g., Perl, or maybe Python/Ruby). Perl, for example, has
 excellent regular expression facilities.
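For what it's worth, R's own regular expression functions (gsub, grepl, and friends) cover many of the same simple cleaning tasks; a minimal sketch, with a hypothetical messy column:

```r
# Hypothetical character column with units and an ad-hoc missing marker
x <- c("12.5 kg", " 7kg", "N/A", "9.75 kg ")

x <- trimws(gsub("kg", "", x))   # strip the unit, trim whitespace
x[x == "N/A"] <- NA              # recode the missing-value marker
as.numeric(x)                    # numeric vector: 12.5, 7, NA, 9.75
```

This is only to illustrate that simple pattern-based cleanup is feasible in base R; the thread's point about complex, record-structured clinical data still stands.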

 However, for some types of complex demanding data preparation
 problems, an all-purpose programming language is a poor choice. For
 example: cleaning up and preparing clinical lab data and adverse event
 data - you could do it in Perl, but it would take way, way too much
 time. A specialized programming language is needed. And since data
 transformation is quite different from data query, SQL is not the
 ideal solution either.

 There are only three statistical programming languages that are
 well-known, all dating from around the 1970s: SPSS, SAS, and S. SAS
 is more popular than S for data cleaning.

 If you're an R user with difficult data preparation problems, frankly
 you are out of luck, because the products I'm about to mention are
 new, unknown, and therefore regarded as immature. And while the
 founders of these products would be very happy if you kicked the
 tires, most people don't like to look at brand new products. Most
 innovators and inventors don't realize this; I've learned it the hard
 way.

 But if you are a volunteer who likes to help out by evaluating,
 comparing, and reporting upon new candidates, well you could certainly
 help out R users and the developers of the products by kicking the
 tires of these products. And there is a huge need for such volunteers.

 1. DAP
 This is an open source implementation of SAS.
 The founder: Susan Bassein
 Find it at: directory.fsf.org/math/stats (GNU GPL)

 2. PSPP
 This is an open source implementation of SPSS.
 The relatively early version number might not give a good idea of how
 mature the data transformation features are; it reflects the fact that
 the author has only recently started on the statistical tests.
 The founder: Ben Pfaff, either a grad student or professor at Stanford's
 CS dept.
 Also at : directory.fsf.org/math/stats (GNU GPL)

 3. Vilno
 This uses a programming language similar to SPSS and SAS, but quite unlike S.
 Essentially, it's a substitute for the SAS datastep, and also
 transposes data and calculates averages and such. (No t-tests or
 regressions in this version). I created this, during the years
 2001-2006 mainly. It's version 0.85, and has a fairly low bug rate, in
 my opinion. The tarball includes about 100 or so test cases used for
 debugging - for logical calculation errors, but not for extremely high
 volumes of data.
 The maintenance of Vilno has slowed down, because I am currently
 (desperately) looking for employment. But once I've found new
 employment and living quarters and settled in, I will continue to
 enhance Vilno in my spare time.
 The founder: that would be me, Robert Wilkins
 Find it at: code.google.com/p/vilno ( GNU GPL )
 ( In particular, the tarball at code.google.com/p/vilno/downloads/list
 , since I have yet to figure out how to use Subversion ).


 4. Who knows?
 It was not easy to find out about the existence of DAP and PSPP. So
 who knows what else is out there. However, I think you'll find a lot
 more statistics software (regression, etc.) out there, and not so
 much data transformation software. Not many people work on data
 preparation software. In fact, the category is so obscure that there
 isn't even one agreed term: "data cleaning", "data munging", "data
 crunching", or just "getting the data ready for analysis".

 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.



[R] understanding round() behavior

2007-04-24 Thread Robert Duval
Dear all,

I am a little bit puzzled by the way round() works.
Consider the following code

> a <- 123456.3678
> round(a, digits=10)
[1] 123456.4


I would expect the outcome to be something like 123456.3678 or
123456.368, instead the computer gives me 123456.4 no matter how large
the digits are.

Can anybody help me understand what I'm missing here?
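For the archive, a minimal check suggests the culprit is print(), not round(): the value keeps its full precision, but the default of options(digits = 7) truncates the printed representation to 7 significant digits.

```r
a <- 123456.3678
b <- round(a, digits = 10)   # a has only 4 decimal places, so b equals a

b                            # displays 123456.4 (7 significant digits)
print(b, digits = 12)        # displays 123456.3678
sprintf("%.4f", b)           # "123456.3678"
b == a                       # TRUE: round() did not change the value
```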

Thanks again for your help.
Robert



Re: [R] How to write a function?

2007-04-24 Thread Robert Duval
Hi Keti

Before reinventing the wheel from scratch you might want to take a
look at the survey package

http://faculty.washington.edu/tlumley/survey/
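A minimal sketch of the kind of call involved (the data frame and variable names here are hypothetical; see the package documentation for the details of specifying a design):

```r
library(survey)

# Hypothetical stratified sample: outcome y, stratum id, sampling weights w
d <- data.frame(
  y       = c(2.1, 3.4, 2.8, 5.0, 4.2, 4.7),
  stratum = rep(c("A", "B"), each = 3),
  w       = c(10, 10, 10, 20, 20, 20)
)

des <- svydesign(ids = ~1, strata = ~stratum, weights = ~w, data = d)
svymean(~y, des)           # stratified mean with its standard error
confint(svymean(~y, des))  # confidence interval
```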

best
robert

On 4/24/07, Keti Cuko [EMAIL PROTECTED] wrote:
 Hi,
 My name is Katie and I was wondering if you could help me with my problem. I
 am trying to write a function in R that computes the statistics (mean,
 standard error, confidence intervals) for stratified samples. I am not that
 familiar with R and I am having difficulties setting this function up. Any
 help or tips on where/how to do this?

 Best,
 Katie

 [[alternative HTML version deleted]]



Re: [R] summary and min max

2007-04-23 Thread Robert Duval
Has anyone created an alternative summary method where the rounding is
applied only to digits to the right of the decimal point?

I personally don't like the way summary() works on this particular
issue, but I'm not sure how to modify it generically...

(of course one can always set digits=something_big, but this is
inelegant and impractical when one doesn't know in advance the
magnitude of a number)
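One workaround (a sketch, not a generic fix) is to compute a digits argument from the data itself, so the integer part is always shown in full plus a fixed number of decimals:

```r
x <- c(123456.3678, 123457.1, 123460.02)

summary(x)   # default significant-digit rounding loses the decimals here

# Show every integer digit plus 3 decimal places:
summary(x, digits = max(nchar(trunc(x))) + 3)
```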

robert

On 4/23/07, Mike Prager [EMAIL PROTECTED] wrote:
 Sebastian P. Luque [EMAIL PROTECTED] wrote:

  I came across a case where there's a discrepancy between minimum and
  maximum values reported by 'summary' and the 'min' and 'max' functions:

 summary() rounds by default. Thus its reporting oddball values
 is considered a feature, not a bug.

 --
 Mike Prager, NOAA, Beaufort, NC
 * Opinions expressed are personal and not represented otherwise.
 * Any use of tradenames does not constitute a NOAA endorsement.



Re: [R] Reasons to Use R

2007-04-11 Thread Robert Duval
So I guess my question is...

Is there any hope of R being modified at its core to handle
large datasets more gracefully? (You've mentioned SAS and SPSS; I'd
add Stata to the list.)

Or should we (the users of large datasets) expect to keep on working
with the present tools for the time to come?

robert

On 4/11/07, Marc Schwartz [EMAIL PROTECTED] wrote:
 On Wed, 2007-04-11 at 11:26 -0500, Marc Schwartz wrote:
  On Wed, 2007-04-11 at 17:56 +0200, Bi-Info
  (http://members.home.nl/bi-info) wrote:
   I certainly have that idea too. SPSS functions in a way the same,
   although it specialises in PC applications. Memory addition to a PC is
   not a very expensive thing these days. On my first AT some extra memory
   cost 300 dollars or more. These days you get extra memory with a package
   of marshmallows or chocolate bars if you need it.
   All computations on a computer are discrete steps in a way, but I've
   heard that SAS computations are split up in strictly divided steps. That
   also makes procedures attachable, I've been told, and interchangeable.
   Different procedures can use the same code, which alternatively is
   cheaper in memory usage or disk usage (the old days...). That makes SAS,
   by the way, a complicated machine to build, because procedures are
   split up into numerous fragments, which makes for complicated
   bookkeeping. If you do it that way, I've been told, you can do a lot of
   computations with very little memory. One guy actually computed quite
   complicated models with only 32MB or less, which wasn't very much for
   his type of calculations. Which means that SAS is efficient in memory
   handling, I think. It's not very efficient in dollar handling... I estimate.
  
   Wilfred
 
  snip
 
  Oh... SAS is quite efficient in dollar handling, at least when it comes
  to the annual commercial licenses... along the same lines as the
  purported efficiency of the U.S. income tax system:
 
How much money do you have?  Send it in...
 
  There is a reason why SAS is the largest privately held software company
  in the world and it is not due to the academic licensing structure,
  which constitutes only about 12% of their revenue, based upon their
  public figures.

 Hmmm... here is a classic example of the problems of reading pie
 charts.

 The figure I quoted above, which is from reading the 2005 SAS Annual
 Report on their web site (such as it is for a private company) comes
 from a 3D exploded pie chart (ick...).

 The pie chart uses 3 shades of grey and 5 shades of blue to
 differentiate 8 market segments and their percentages of total worldwide
 revenue.

 I mis-read the 'shade of grey' allocated to Education as being 12%
 (actually 11.7%).

 A re-read of the chart, zooming in close on the pie in a PDF reader,
 appears to actually show that Education is but 1.8% of their annual
 worldwide revenue.

 Government based installations, which are presumably the other notable
 market segment in which substantially discounted licenses are provided,
 is 14.6%.

 The report is available here for anyone else curious:

   http://www.sas.com/corporate/report05/annualreport05.pdf

 Somebody needs to send SAS a copy of Tufte or Cleveland.

 I have to go and rest my eyes now...  ;-)

 Regards,

 Marc



Re: [R] Keep R packages in the R installation

2007-04-01 Thread Robert Duval
You don't say which OS you're using, but I infer from your other
posting that it is MS Windows.

In which case you can follow the instructions in the FAQ for Windows:

http://cran.r-project.org/bin/windows/base/rw-FAQ.html#What_0027s-the-best-way-to-upgrade_003f

... and also read the mailing list posting guide
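In short, the recipe in that FAQ entry amounts to recording the installed packages before the upgrade and reinstalling them afterwards; a sketch:

```r
# Before upgrading: record what is installed
pkgs <- installed.packages()[, "Package"]
save(pkgs, file = "installed_pkgs.RData")

# After installing the new R version:
load("installed_pkgs.RData")
install.packages(pkgs)   # re-fetch the packages from CRAN
update.packages()        # bring everything up to date
```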

best
robert

On 4/1/07, Tong Wang [EMAIL PROTECTED] wrote:
 Hi,
 I just got a quick question here: when I install a new version of R, is
 there an easy way to keep the installed R packages?


 Thanks a lot for any help.

 tong



Re: [R] R code for Statistical Models in S ?

2007-03-01 Thread Robert Duval
You might want to start looking at the FAQs

http://cran.r-project.org/faqs.html

in particular

http://cran.r-project.org/doc/FAQ/R-FAQ.html#R-and-S

robert

On 3/1/07, Charilaos Skiadas [EMAIL PROTECTED] wrote:
 I just acquired a copy of Statistical Models in S, I guess most
 commonly known as the white book, and realized to my dismay that
 most of the code is not directly executable in R, and I was wondering
 if there was a source discussing the things that are different and
 what the new ways of calling things are.

 For instance, the first obstacle was the solder.balance data set. I
 found a solder data set in rpart, which is very close to it except
 for the fact that the Panel variable is not a factor, but that's
 easily fixed.
 The first problem is the next two calls, on pages 2 and 3. One is
 plot(solder.balance), which is supposed to produce a very different
 plot than it does in R (I actually don't know the name of the plot,
 which is part of the problem I guess). Then one is supposed to call
 plot.factor(skips ~ Opening + Mask), which I took to mean:
 plot(skips ~ Opening + Mask, data=solder), and that worked, though
 I still haven't been able to make a direct call to plot.factor work
 (I keep getting a "could not find function plot.factor" error).
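A sketch of what seems to work in R for this example, assuming the rpart copy of the data set (plot.factor was an S-PLUS method; in R the dispatch happens through plot() on a formula, which produces boxplots for factor predictors):

```r
library(rpart)
data(solder)                          # rpart's version of the data set
solder$Panel <- factor(solder$Panel)  # Panel is not a factor in this copy

# Formula interface: one boxplot panel per factor on the right-hand side
plot(skips ~ Opening + Mask, data = solder)
```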

 Anyway, just wondered whether there is some page somewhere that
 discusses these little differences here and there, as I am sure there
 will be a number of other problems such as these along the way.

 Haris Skiadas
 Department of Mathematics and Computer Science
 Hanover College



Re: [R] Robust standard errors in logistic regression

2006-07-05 Thread Robert Duval
This discussion leads to another point which is more subtle, but more
important...

You can always get Huber-White (a.k.a. robust) estimators of the
standard errors, even in non-linear models like logistic
regression. However, if you believe your errors do not satisfy the
standard assumptions of the model, then you should not be running that
model, as this might lead to biased parameter estimates.

For instance, in the linear regression model you have consistent
parameter estimates independently of whether the errors are
heteroskedastic or not. However, in the case of non-linear models it
is usually the case that heteroskedasticity will lead to biased
parameter estimates (unless you fix it explicitly somehow).

Stata is famous for providing Huber-White std. errors in most of their
regression estimates, whether linear or non-linear. But this is
nonsensical in the non-linear models since in these cases you would be
consistently estimating the standard errors of inconsistent
parameters.

This point, and potential solutions to the problem, is nicely discussed
in Wooldridge's Econometric Analysis of Cross Section and Panel Data.
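For the linear case, where the point estimates stay consistent under heteroskedasticity, the usual R route is the sandwich package; a sketch with simulated data:

```r
library(sandwich)
library(lmtest)

set.seed(1)
x <- rnorm(200)
y <- 1 + 2 * x + rnorm(200, sd = abs(x))   # heteroskedastic errors
fit <- lm(y ~ x)

coeftest(fit)                                    # classical standard errors
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))  # Huber-White standard errors
```

The point estimates are identical in both tables; only the standard errors differ, which is exactly why the same trick is suspect for non-linear models where heteroskedasticity biases the estimates themselves.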






On 7/5/06, Thomas Lumley [EMAIL PROTECTED] wrote:
 On Wed, 5 Jul 2006, Martin Maechler wrote:
  Celso == Celso Barros [EMAIL PROTECTED]
  on Wed, 5 Jul 2006 04:50:29 -0300 writes:
 
  [...]
 Celso By the way, I was wondering if there is a way to use rlm (from 
  MASS)
 Celso to estimate robust standard errors for logistic regression?
 
  rlm stands for 'robust lm'.  What you need here is  'robust glm'.
 
  I've already replied to a similar message by you,
  mentioning the (relatively) new package robustbase.
  After installing it, you can
  use
robustbase::glmrob()

 We have a clash of terminology here.  The robust standard errors that
 sandwich and robcov give are almost completely unrelated to glmrob().
 My guess is that Celso wants glmrob(), but I don't know for sure.

 The Huber/White sandwich variance estimator for parameters in an ordinary
 generalized linear model gives an estimate of the variance that is
 consistent if the systematic part of the model is correctly specified and
 conservative otherwise.  It is a computationally cheap linear
 approximation to the bootstrap.  These variance estimators seem to usually
 be called "model-robust", though I prefer Nils Hjort's suggestion of
 "model-agnostic", which avoids confusion with robust statistics. This is
 what sandwich and robcov() do.

 glmrob() and rlm() give robust estimation of regression parameters. That
 is, if the data come from a model that is close to the exponential family
 model underlying glm, the estimates will be close to the parameters from
 that exponential family model.  This is a more common statistical sense of
 the term robust.


 I think the confusion has been increased by the fact that earlier S
 implementations of robust regression didn't provide standard errors,
 whereas rlm() and glmrob() do. This was partly a quality-of-implementation
 issue and partly because of theoretical difficulties with, e.g., lms().


 -thomas

 Thomas Lumley   Assoc. Professor, Biostatistics
 [EMAIL PROTECTED]University of Washington, Seattle



Re: [R] is there a formatted output in R?

2006-03-12 Thread Robert Duval
Dear Andy and Michael,

Please stick to the posting guide of this list... i.e.

Be tolerant. Rudeness is never warranted, but sometimes `read the
manual' is the appropriate response. Don't waste time discussing such
matters on the list.

Robert Duval

On 3/11/06, Liaw, Andy [EMAIL PROTECTED] wrote:
 From: Michael
 
  Thank you for your reminder!
 
  I think you don't have to tell me to read the document.

 I beg to differ.  I'd bet many here do feel you need to read the
 documentation more carefully.

  I have done that many times already.
 
  My feeling after reading the manual on creating packages is
  that, my god, this job needs a Computer Science degree to do
  it. It is way too complicated.

 If an amateur like me can do it, I'm quite sure a CS degree is not needed.
 How many packages available on CRAN do you think were created by those with
 CS degrees?  How many in R Core do you think have CS degrees?

  There should be simpler ways.

 There are, and this question has been asked and answered many times over on
 this very list.  Please do learn to search the archive, as the posting guide
 asks you to.

 Andy




Re: [R] heckit with a probit

2006-02-27 Thread Robert Duval
I don't know if I understand your problem very well, but the first
reference that comes to mind is

James Heckman, "Dummy Endogenous Variables in a Simultaneous Equation
System", Econometrica (July 1978).

also, you might find a good survey of the literature in

Francis Vella, "Estimating models with sample selection bias: A
survey", Journal of Human Resources, 1998, Vol. 33, pp. 127-169.

I don't know how many of the methods proposed there are already
implemented in R, but in principle many of them are likelihood models
that you could program.
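As a rough illustration of the "program it yourself" route, here is a two-step sketch in the spirit of Heckman's procedure with a probit in both stages. All variable names are hypothetical, and the two-step variant is only a heuristic for a binary outcome; full maximum likelihood (a bivariate probit with selection) is the better-behaved estimator.

```r
# d is a hypothetical data frame with both votes and covariates.

# Stage 1: selection equation (first vote) as a probit
sel  <- glm(voted1 ~ z1 + z2, family = binomial(link = "probit"), data = d)
zhat <- predict(sel, type = "link")
d$imr <- dnorm(zhat) / pnorm(zhat)   # inverse Mills ratio

# Stage 2: outcome probit on the selected subsample, IMR as a control
out <- glm(voted2 ~ x1 + x2 + imr,
           family = binomial(link = "probit"),
           data = subset(d, voted1 == 1))
summary(out)
```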


best
robert


On 2/27/06, David Hugh-Jones [EMAIL PROTECTED] wrote:
 Hi

 I have data for voting behaviour on two (related) binary votes. I want
 to examine the second vote, running separate regressions for groups
 who voted different ways on the first vote. As the votes are not
 independent, I guess that there is an issue with selection bias.

 So, I think I would like to fit a heckit style model but with a binary
 dependent variable - so, in effect, two successive probits. Is there a
 way to do it in R? (Alternatively: am I thinking about this the right
 way?)

 Cheers
 David



Re: [R] draft of Comment on UCLA tech report

2006-01-27 Thread Robert Duval
Being a Stata user in transition to R, I have to say that it would be
fair to mention that handling large amounts of data might take an
extra step in R.

I understand that there are good reasons for R consuming more memory
(than Stata) when handling large datasets, but it is necessary to warn
potential newcomers that they might need to use MySQL or another
database manager if they work with large datasets.
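The usual pattern is to keep the full data in the database and pull only the analysis subset into R's memory; a sketch using the DBI interface, with an SQLite file standing in for MySQL (table and column names are hypothetical):

```r
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "bigdata.sqlite")

# Let the database do the filtering; only the subset enters R's memory
d <- dbGetQuery(con, "SELECT id, income, state
                      FROM households
                      WHERE year = 2005")

dbDisconnect(con)
```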

greetings
robert

On 1/27/06, Patrick Burns [EMAIL PROTECTED] wrote:
 You may  recall that there was a discussion of a technical
 report from the statistical consulting group at UCLA.

 I have a draft of a comment on that report, which you
 can get from
 http://www.burns-stat.com/pages/Flotsam/uclaRcomment_draft1.pdf

 I'm interested in comments: corrections, additions, deletions.

 Patrick Burns
 [EMAIL PROTECTED]
 +44 (0)20 8525 0696
 http://www.burns-stat.com
 (home of S Poetry and A Guide for the Unwilling S User)
