[R] comparing two vectors

2007-06-10 Thread gallon li
Suppose I have a vector A=c(1,2,3)

now I want to compare each element of A to another vector L=c(0.5, 1.2)

and then compute sum(A > 0.5) and sum(A > 1.2)

to get a result of (3,2)

how can I get this without writing a loop of sums?


__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] penalized cox regression

2007-06-10 Thread carol white
Hi,
What is the function to calculate penalized cox regression? frailtyPenal in 
frailtypack R package imposes max 2 strata. I want to use a function that 
reduces all my variables without stratifying them in advance.

Look forward to your reply

carol
   



Re: [R] comparing two vectors

2007-06-10 Thread hadley wickham
On 6/10/07, gallon li [EMAIL PROTECTED] wrote:
 Suppose I have a vector A=c(1,2,3)

 now I want to compare each element of A to another vector L=c(0.5, 1.2)

 and then compute sum(A > 0.5) and sum(A > 1.2)

 to get a result of (3,2)

 how can I get this without writing a loop of sums?

How about colSums(outer(A, L, ">"))

Hadley
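For completeness, a self-contained sketch of this approach (plus an equivalent, more explicit sapply() form):

```r
A <- c(1, 2, 3)
L <- c(0.5, 1.2)

# outer(A, L, ">") builds a 3 x 2 logical matrix with entry [i, j] = A[i] > L[j];
# colSums() then counts the TRUE values for each threshold in L
res1 <- colSums(outer(A, L, ">"))   # c(3, 2)

# the same counts, one threshold at a time
res2 <- sapply(L, function(x) sum(A > x))
```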



Re: [R] Tools For Preparing Data For Analysis

2007-06-10 Thread Ted Harding
On 10-Jun-07 02:16:46, Gabor Grothendieck wrote:
 That can be elegantly handled in R through R's object
 oriented programming by defining a class for the fancy input.
 See this post:
   https://stat.ethz.ch/pipermail/r-help/2007-April/130912.html
 for a simple example of that style.
 
 On 6/9/07, Robert Wilkins [EMAIL PROTECTED] wrote:
 Here are some examples of the type of data crunching you might
 have to do.

 In response to the requests by Christophe Pallier and Martin Stevens.

 Before I started developing Vilno, some six years ago, I had
 been working in pharmaceuticals for eight years (it's not
 easy to show you actual data, though, because it's all confidential,
 of course).

I hadn't heard of Vilno before (except as a variant of Vilnius).
And it seems remarkably hard to find info about it from a Google
search. The best I've come up with, searching on

  vilno  data

is at
  http://www.xanga.com/datahelper

This is a blog site, apparently with postings by Robert Wilkins.

At the end of the Sunday, September 17, 2006 posting "Tedious
coding at the Pharmas" is a link:

  I have created a new data crunching programming language.
   http://www.my.opera.com/datahelper

which appears to be totally empty. In another blog article:

  go to the www.my.opera.com/datahelper site, go to the August 31
   blog article, and there you will find a tarball-file to download,
   called vilnoAUG2006package.tgz

so again inaccessible; and a Google search on vilnoAUG2006package.tgz
gives a single hit, which is simply the same article.

In the Xanga blog there are a few examples of tasks which are
no big deal in any programming language (and, relative to their
simplicity, appear a bit cumbersome in Vilno). 

I've not seen in the blog any instance of data transformation
which could not be quite easily done in any straightforward
language (even awk).

 Lab data can be especially messy, especially if one clinical
 trial allows the physicians to use different labs. So let's
 consider lab data.
 [...]

That's a fairly daunting description, though indeed not at all
extreme for the sort of data that can arise in practice (and
not just in pharmaceutical investigations). But the complexity
is in the situation, and, whatever language you use, the writing
of the program will involve the writer getting to grips with
the complexity, and the complexity will be present in the code
simply because of the need to accommodate all the special cases,
exceptions and faults that have to be anticipated in feral data.

Once these have been anticipated and incorporated in the code,
the actual transformations are again no big deal.

Frankly, I haven't yet seen anything in Vilno that couldn't be
accommodated in an 'awk' program. Not that I'm advocating awk for
universal use (I'm not that monolithic about it). But I'm using
it as my favourite example of a flexible, capable, transparent
and efficient data filtering language, as far as it goes.


SO: where can one find out more about Vilno, to see what it may
really be capable of that can not be done so easily in other ways?


(As is implicit in many comments in Robert's blog, and indeed also
from many postings to this list over time and undoubtedly well
known to many of us in practice, a lot of the problems with data
files arise at the data gathering and entry stages, where people
can behave as if stuffing unpaired socks and unattributed underwear
randomly into a drawer, and then banging it shut).

Best wishes to all,
Ted.


E-Mail: (Ted Harding) [EMAIL PROTECTED]
Fax-to-email: +44 (0)870 094 0861
Date: 10-Jun-07   Time: 09:28:10
-- XFMail --



[R] Coding categorical variables in mixed environment

2007-06-10 Thread spime

Hi R users,

Suppose we have following data for a regression model:

AGE:numerical
SEX: male/female categorical
COLOR: {blue, green, pink} categorical
RESPONSE: yes/no categorical

AGE  SEX  COLOR  RESPONSE
10   M    BLUE   Y
12   M    GREEN  N
13   F    PINK   Y
11   M    BLUE   Y
13   M    GREEN  N
09   F    GREEN  N
15   F    BLUE   Y
11   F    PINK   Y
12   M    PINK   N
14   M    GREEN  N

I want to code the categorical data as {male = 1, female = 2}, {blue = 1,
green = 2, pink = 3}, {yes = 1, no = 0} and finally get the new table.

How can I do this?

waiting for reply. Thanks in advance.

bye

 
-- 
View this message in context: 
http://www.nabble.com/Coding-categorical-variables-in-mixed-environment-tf3896721.html#a11046822
Sent from the R help mailing list archive at Nabble.com.



[R] find position

2007-06-10 Thread gallon li
Find the position of the first value that equals a certain number in a vector:

Say a = c(0, 0, 0, 0, 0.2, 0.2, 0.4, 0.4, 0.5)

I wish to return the first index in a at which the value in the vector is
equal to 0.4. In this case, it is 7.




Re: [R] find position

2007-06-10 Thread Benilton Carvalho
which(a == .4)[1]

b

On Jun 10, 2007, at 4:45 AM, gallon li wrote:

 find the position of the first value who equals certain number in a  
 vector:

 Say a=c(0,0,0,0,0.2, 0.2, 0.4,0.4,0.5)

 i wish to return the index value in a for which the value in the  
 vector is
 equal to 0.4 for the first time. in this case, it is 7.





Re: [R] find position

2007-06-10 Thread Dimitris Rizopoulos
try this:

which(a == 0.4)[1]


I hope it helps.

Best,
Dimitris


Dimitris Rizopoulos
Ph.D. Student
Biostatistical Centre
School of Public Health
Catholic University of Leuven

Address: Kapucijnenvoer 35, Leuven, Belgium
Tel: +32/(0)16/336899
Fax: +32/(0)16/337015
Web: http://med.kuleuven.be/biostat/
  http://www.student.kuleuven.be/~m0390867/dimitris.htm


Quoting gallon li [EMAIL PROTECTED]:

 find the position of the first value who equals certain number in a vector:

 Say a=c(0,0,0,0,0.2, 0.2, 0.4,0.4,0.5)

 i wish to return the index value in a for which the value in the vector is
 equal to 0.4 for the first time. in this case, it is 7.







Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm



Re: [R] Tools For Preparing Data For Analysis

2007-06-10 Thread Peter Dalgaard
Douglas Bates wrote:
 Frank Harrell indicated that it is possible to do a lot of difficult
 data transformation within R itself if you try hard enough but that
 sometimes means working against the S language and its whole object
 view to accomplish what you want and it can require knowledge of
 subtle aspects of the S language.
   
Actually, I think Frank's point was subtly different: It is *because* of 
the differences in view that it sometimes seems difficult to find the 
way to do something in R that  is apparently straightforward in SAS. 
I.e. the solutions exist and are often elegant, but may require some 
lateral thinking.

Case in point: Finding the first or the last observation for each 
subject when there are multiple records for each subject. The SAS way 
would be a datastep with IF-THEN-DELETE, and a RETAIN statement so that 
you can compare the subject ID with the one from the previous record, 
working with data that are sorted appropriately.

You can do the same thing in R with a for loop, but there are better 
ways, e.g.
subset(df, !duplicated(ID)) and subset(df, rev(!duplicated(rev(ID)))), or 
maybe
do.call(rbind,lapply(split(df,df$ID), head, 1)), resp. tail. Or 
something involving aggregate(). (The latter approaches generalize 
better to other within-subject functionals like cumulative doses, etc.).
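A tiny worked example of the duplicated() idiom (the df and ID names follow the text above):

```r
df <- data.frame(ID    = c(1, 1, 2, 2, 2, 3),
                 value = c(10, 11, 20, 21, 22, 30))

# first record per subject: keep rows whose ID has not occurred before
first <- subset(df, !duplicated(ID))            # values 10, 20, 30

# last record per subject: the same idea applied to the reversed vector
last  <- subset(df, rev(!duplicated(rev(ID))))  # values 11, 22, 30
```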

The hardest cases that I know of are the ones where you need to turn one 
record into many, such as occurs in survival analysis with 
time-dependent, piecewise constant covariates. This may require 
transposing the problem, i.e. for each  interval you find out which 
subjects contribute and with what, whereas the SAS way would be a 
within-subject loop over intervals containing an OUTPUT statement.

Also, there are some really weird data formats, where e.g. the input 
format is different in different records. Back in the 80's where 
punched-card input was still common, it was quite popular to have one 
card with background information on a patient plus several cards 
detailing visits, and you'd get a stack of cards containing both kinds. 
In R you would most likely split on the card type using grep() and then 
read the two kinds separately and merge() them later.



Re: [R] Coding categorical variables in mixed environment

2007-06-10 Thread John Kane
try ?recode in package:car
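For reference, the recoding can also be sketched in base R without extra packages (column names are taken from the question; overwriting the columns in place is just one choice):

```r
# toy subset of the data from the question
df <- data.frame(AGE      = c(10, 12, 13),
                 SEX      = c("M", "M", "F"),
                 COLOR    = c("BLUE", "GREEN", "PINK"),
                 RESPONSE = c("Y", "N", "Y"),
                 stringsAsFactors = FALSE)

df$SEX      <- ifelse(df$SEX == "M", 1, 2)                  # male = 1, female = 2
df$COLOR    <- match(df$COLOR, c("BLUE", "GREEN", "PINK"))  # blue = 1, green = 2, pink = 3
df$RESPONSE <- ifelse(df$RESPONSE == "Y", 1, 0)             # yes = 1, no = 0
```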

--- spime [EMAIL PROTECTED] wrote:

 
 Hi R users,
 
 Suppose we have following data for a regression
 model:
 
 AGE:numerical
 SEX: male/female categorical
 COLOR: {blue, green, pink} categorical
 RESPONSE: yes/no categorical
 
 AGE  SEX  COLOR  RESPONSE
 10   M    BLUE   Y
 12   M    GREEN  N
 13   F    PINK   Y
 11   M    BLUE   Y
 13   M    GREEN  N
 09   F    GREEN  N
 15   F    BLUE   Y
 11   F    PINK   Y
 12   M    PINK   N
 14   M    GREEN  N
 
 I want to code the categorical data as {male =1,
 female =2}, {blue =1, green
 =2, pink = 3} {yes =1, no =0} and finally get the
 new table.
 
 how can i do this?
 
 waiting for reply. Thanks in advance.
 
 bye
 
  
 -- 
 View this message in context:

http://www.nabble.com/Coding-categorical-variables-in-mixed-environment-tf3896721.html#a11046822
 Sent from the R help mailing list archive at
 Nabble.com.
 




Re: [R] find position

2007-06-10 Thread Gabor Grothendieck
Try

match(0.4, a)

Also see ?match and the nomatch= argument, in particular. If your
numbers are only equal to within an absolute tolerance, tol, as
discussed in the R FAQ
   
http://cran.r-project.org/doc/FAQ/R-FAQ.html#Why-doesn_0027t-R-think-these-numbers-are-equal_003f
you may need:

   tol <- 1e-6
   match(TRUE, abs(a - 0.4) < tol)

or

   which(abs(a - 0.4) < tol)[1]  # tol from above

and analogously if a relative tolerance is required.
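Putting the suggestions from this thread together on the example vector:

```r
a <- c(0, 0, 0, 0, 0.2, 0.2, 0.4, 0.4, 0.5)

p1 <- match(0.4, a)       # first exact match
p2 <- which(a == 0.4)[1]  # same result via which()

# tolerance version, for values produced by floating-point arithmetic
tol <- 1e-6
p3 <- which(abs(a - 0.4) < tol)[1]  # all three give 7 here
```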

On 6/10/07, gallon li [EMAIL PROTECTED] wrote:
 find the position of the first value who equals certain number in a vector:

 Say a=c(0,0,0,0,0.2, 0.2, 0.4,0.4,0.5)

 i wish to return the index value in a for which the value in the vector is
 equal to 0.4 for the first time. in this case, it is 7.






Re: [R] Tools For Preparing Data For Analysis

2007-06-10 Thread Sarah Goslee
On 6/10/07, Ted Harding [EMAIL PROTECTED] wrote:

 ... a lot of the problems with data
 files arise at the data gathering and entry stages, where people
 can behave as if stuffing unpaired socks and unattributed underwear
 randomly into a drawer, and then banging it shut.

Not specifically R-related, but this would make a great fortune.

Sarah
-- 
Sarah Goslee
http://www.functionaldiversity.org



[R] {nlme} Multilevel estimation heteroscedasticity

2007-06-10 Thread Rense Nieuwenhuis
Dear All,

I'm trying to model heteroscedasticity using a multilevel model. To  
do so, I make use of the nlme package and the weights parameter.  
Let's say that I hypothesize that the exam score of students  
(normexam) is influenced by their score on a standardized LR test  
(standLRT). Students are of course nested in schools. These  
variables are contained in the Exam-data in the mlmRev package.

library(nlme)
library(mlmRev)
lme(fixed = normexam ~ standLRT,
data = Exam,
random = ~ 1 | school)


If I want to model only a few categories of variance, all works fine.  
For instance, should I (for whatever reason) hypothesize that the  
variance on the normexam-scores is larger in mixed schools than in  
boys-schools, I'd use weights = varIdent(form = ~ 1 | type), leading to:

heteroscedastic <- lme(fixed = normexam ~ standLRT,
data = Exam,
weights = varIdent(form = ~ 1 | type),
random = ~ 1 | school)

This gives me nice and clear output, part of which is shown below:
Variance function:
Structure: Different standard deviations per stratum
Formula: ~normexam | type
Parameter estimates:
  Mxd Sngl
1.00 1.034607
Number of Observations: 4059
Number of Groups: 65


Though, should I hypothesize that the variance on the normexam- 
variable is larger on schools that have a higher average score on  
intake-exams (schavg), I run into troubles. I'd use weights = varIdent 
(form = ~ 1 | schavg), leading to:

heteroscedastic <- lme(fixed = normexam ~ standLRT,
data = Exam,
weights = varIdent(form = ~ 1 | schavg),
random = ~ 1 | school)

This leads to estimation problems. R tells me:
Error in lme.formula(fixed = normexam ~ standLRT, data = Exam,  
weights = varIdent(form = ~1 |  :
nlminb problem, convergence error code = 1; message = iteration  
limit reached without convergence (9)

Fiddling with maxiter and setting an unreasonable tolerance doesn't  
help. I think the origin of this problem lies within the large number  
of categories on schavg (65), that may make estimation troublesome.

This leads to my two questions:
- How to solve this estimation-problem?
- Is it possible that the varIdent (or more generally: varFunc) of lme  
returns a single value, representing a coefficient along which  
variance is increasing / decreasing?

- In general: how can a variance-component / heteroscedasticity be  
made dependent on some level-2 variable (school level in my examples) ?
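As an aside on the last two questions: for a continuous level-2 covariate, nlme's varExp (or varPower) variance functions return exactly one such coefficient, whereas varIdent fits one parameter per stratum. A minimal sketch, using simulated data as a stand-in for the Exam data (the 0.3 values are arbitrary simulation choices):

```r
library(nlme)

# simulated stand-in for Exam: residual SD grows with schavg
set.seed(42)
d <- data.frame(school = rep(1:20, each = 30))
d$schavg   <- rep(rnorm(20), each = 30)
d$standLRT <- rnorm(600)
d$normexam <- rep(rnorm(20, sd = 0.3), each = 30) +  # school effect
  0.5 * d$standLRT +
  rnorm(600, sd = exp(0.3 * d$schavg))               # heteroscedastic error

# varExp models Var(e) = sigma^2 * exp(2 * delta * schavg):
# a single coefficient delta instead of one parameter per stratum
m <- lme(fixed = normexam ~ standLRT, data = d,
         random = ~ 1 | school,
         weights = varExp(form = ~ schavg))
delta <- coef(m$modelStruct$varStruct, unconstrained = FALSE)
```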

Many thanks in advance,

Rense Nieuwenhuis



[R] format.dates, chron and Hmisc

2007-06-10 Thread R.H. Koning
Hello, I have some problems in using chron, Hmisc, and lattice. First, 
using both chron and Hmisc, I get an error message when describing data:


df$Date <- chron(df$Date, format = c("d/m/y"))
 ll <- latex(describe(df), file = "..//text//df.tex")
Error in formatDateTime(dd, atx, !timeUsed) :
   could not find function format.dates

Then, using a chron object and lattice, I get

 plot.a <- xyplot(theta ~ Date | team, data = op.df.long,
+  strip = function(bg, ...) strip.default(bg = 'transparent', ...),
+  panel = function(x, y, ...){
+   panel.xyplot(x, y, cex = 0.4, col = "black", ...)
+   panel.loess(x, y, span = 0.3, col = "black", ...)
+   panel.abline(h = 0)
+  })
 print(plot.a)
Error in pretty(rng, ...) : unused argument(s) (format.posixt = NULL)

In both cases, the chron objects have been created using the function 
chron(). Are lattice and Hmisc functions incompatible with chron, or am 
I doing something else that causes these problems? Thanks, Ruud


 sessionInfo()
R version 2.5.0 (2007-04-23)
i386-pc-mingw32

locale:
LC_COLLATE=English_United States.1252;LC_CTYPE=English_United 
States.1252;LC_MONETARY=English_United 
States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252


attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base


other attached packages:
     lattice         MASS        chron xlsReadWrite        Hmisc
      0.15-4       7.2-33       2.3-11        1.3.2        3.3-2



[R] R logistic regression - comparison with SPSS

2007-06-10 Thread Alain Reymond
Dear R-list members,

I have been a user of SPSS for a few years and quite new to R. I read
the documentation and tried samples but I have some problems to obtain
results for a logistic regression under R.

The following SPSS script

LOGISTIC REGRESSION  vir
/METHOD = FSTEP(LR) d007 d008 d009 d010 d011 d012 d013 d014 d015
d016 d017 d018 d069 d072 d073
/SAVE = PRED COOK SRESID
/CLASSPLOT
/PRINT = GOODFIT CI(95)
/CRITERIA = PIN(.10) POUT(.10) ITERATE(40) CUT(.5) .

predicts vir (value 0 or 1) according to my parameters d007 to d073. It
gives me the parameters to retain in the logistic equation and the
intercept.
The calculation is made from a set of values of about 1,000 cases.

I have been unable to translate it with success under R. I would like to
check if I can obtain the same results as with SPSS. Can someone help
me translate it under R? I would be most grateful.

I thank you.

Best regards.

-- 
Alain Reymond
CEIA
Bd Saint-Michel 119
1040 Bruxelles
Tel: +32 2 736 04 58
Fax: +32 2 736 58 02
PGPId :  0xEFB06E2E



[R] How to specify the start position using plot

2007-06-10 Thread Patrick Wang
Hi,

How do I specify the start position of Y in the plot command? Hopefully I can
specify the range of the X and Y axes. I checked ?plot; it did not mention
that I can set up the range.


Thanks
Pat



Re: [R] How to specify the start position using plot

2007-06-10 Thread Charles Annis, P.E.
plot(x = rnorm(25, 0.5, 0.3), y = rnorm(25, 4, 1), xlim = c(0, 1), ylim = c(2, 7))
#   xlim and ylim set the X and Y axis ranges
for example

Charles Annis, P.E.

[EMAIL PROTECTED]
phone: 561-352-9699
eFax:  614-455-3265
http://www.StatisticalEngineering.com
 

-Original Message-
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Patrick Wang
Sent: Sunday, June 10, 2007 12:25 PM
To: r-help@stat.math.ethz.ch
Subject: [R] How to specify the start position using plot

Hi,

How do I specify the start position of Y in the plot command? Hopefully I can
specify the range of the X and Y axes. I checked ?plot; it did not mention
that I can set up the range.


Thanks
Pat




[R] Feature selection for Clustering

2007-06-10 Thread Ranga Chandra Gudivada
Hi,

I was wondering whether there are any feature selection methods for clustering.

  Thanks chandra

   



Re: [R] Rdonlp2 - an extension library for constrained optimization

2007-06-10 Thread Diethelm Wuertz
Ryuichi Tamura wrote:

Please can you put your package on the CRAN server?

Many thanks
Diethelm Wuertz

 Hello R-list,

 I have released an update version (0.3-1) of Rdonlp2.
 Some (fatal) bugs which may kill interpreter should be fixed.

 In addition, user-visible changes are:
 * *.mes, *.pro files are not created if name=NULL(this is default) in 
 donlp2().
 * use machine-epsilons defined in R for internal
 calculations(step-size, etc.).
 * numeric hessian is now evaluated at the optimum and calculated with
   the algorithm specified in 'difftype' in donlp2.control(). Setting
 difftype=2 will
   produce (roughly) same value as optim() does.

 I sincerely appreciate users who sent me useful comments.

 Windows Binary, OSX Universal Binary, Source file are available at:

 http://arumat.net/Rdonlp2/

 Regards,

 TAMURA Ryuichi,
 mailto: [EMAIL PROTECTED]






[R] PCA for Binary data

2007-06-10 Thread Ranga Chandra Gudivada
Hi,

I was wondering whether there is any package implementing Principal 
Component Analysis for binary data.

  Thanks chandra

   



Re: [R] How to specify the start position using plot

2007-06-10 Thread Stephen Tucker

plot(x=1:10,y=1:10,xlim=c(0,5),ylim=c(6,10))

A lot of the argument descriptions for plot() are contained in ?par.

--- Patrick Wang [EMAIL PROTECTED] wrote:

 Hi,
 
 How do I specify the start position of Y in the plot command? Hopefully I can
 specify the range of the X and Y axes. I checked ?plot; it did not mention
 that I can set up the range.
 
 
 Thanks
 Pat
 
 



 




Re: [R] R logistic regression - comparison with SPSS

2007-06-10 Thread Tobias Verbeke
Alain Reymond wrote:

 Dear R-list members,
 
 I have been a user of SPSS for a few years and quite new to R. I read
 the documentation and tried samples but I have some problems to obtain
 results for a logistic regression under R.
 
 The following SPSS script
 
 LOGISTIC REGRESSION  vir
 /METHOD = FSTEP(LR) d007 d008 d009 d010 d011 d012 d013 d014 d015
 d016 d017 d018 d069 d072 d073
 /SAVE = PRED COOK SRESID
 /CLASSPLOT
 /PRINT = GOODFIT CI(95)
 /CRITERIA = PIN(.10) POUT(.10) ITERATE(40) CUT(.5) .
 
 predicts vir (value 0 or 1) according to my parameters d007 to d073. It
 gives me the parameters to retain in the logistic equation and the
 intercept.
 The calculation is made from a set of values of about 1,000 cases.
 
 I have been unable to translate it with success under R. I would like to
 check if I can obtain the same results as with SPSS. Can someone help
 me translate it under R ? I would be most grateful.

If all the variables you mention are available in a data frame, e.g. 
virdf, then you can fit a logistic regression model by

mymodel <- glm(vir ~ d007 + d008 + d009 + d010 + d011 + d012 + d013 + 
d014 + d015 + d016 + d017 + d018 + d069 + d072 + d073, data = virdf, 
family = binomial)

or

mymodel <- glm(vir ~ ., data = virdf, family = binomial)

if there are no variables other than those mentioned above in the
virdf data frame.

Contrary to SPSS you need not specify in advance what you would like
as output. Everything useful is stored in the model object (here: 
mymodel) which can then be used to further investigate the model in
many ways:

summary(mymodel)
anova(mymodel, test = "Chisq")
plot(mymodel)

See ?summary.glm, ?anova.glm etc.

For stepwise variable selection (not necessarily corresponding to
STEP(LR)), see ?step or ?add1 to do it `by hand'.
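A minimal sketch of that stepwise route on simulated data (the real virdf is not available here, so the variables and their effect are made up; note step() uses AIC, not SPSS's LR-based entry/removal tests):

```r
set.seed(1)
virdf <- data.frame(d007 = rnorm(200), d008 = rnorm(200), d009 = rnorm(200))
virdf$vir <- rbinom(200, 1, plogis(1.5 * virdf$d007))  # only d007 matters here

full    <- glm(vir ~ ., data = virdf, family = binomial)
reduced <- step(full, direction = "both", trace = 0)   # AIC-based selection
summary(reduced)
```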

HTH,
Tobias

P.S. You can find an introduction to R specifically targeted at (SAS 
and) SPSS users here:

http://oit.utk.edu/scc/RforSASSPSSusers.pdf

-- 

Tobias Verbeke - Consultant
Business  Decision Benelux
Rue de la révolution 8
1000 Brussels - BELGIUM

+32 499 36 33 15
[EMAIL PROTECTED]



[R] Windows vista's early terminate Rgui execution

2007-06-10 Thread adschai
Hi, I have a frustrating problem from Vista that I wonder if anyone has come 
across. I wrote a script that involves a long computation time (although, 
during the calculation, it spits out text on the GUI periodically to notify 
me of the progress). Windows Vista always stopped my calculation and claimed 
that 'Rgui has stopped working. Windows is checking for a solution.' And when 
I looked into Task Manager, Windows had already stopped my Rgui process. I am 
quite disappointed with this. I would really appreciate it if anyone has found 
a way around this Windows Vista problem. In particular, how do I turn off this 
feature in Vista? Any help would be really appreciated. Thank you!

- adschai




Re: [R] Tools For Preparing Data For Analysis

2007-06-10 Thread Stephen Tucker

Since R is supposed to be a complete programming language, I wonder
why these tools couldn't be implemented in R (unless speed is the
issue). Of course, it's a naive desire to have a single language that
does everything, but it seems that R currently has most of the
functions necessary to do the type of data cleaning described.

For instance, Gabor and Peter showed some snippets of ways to do this
elegantly; my [physical science] data is often not as horrendously
structured so usually I can get away with a program containing this
type of code

txtin <- scan(filename, what = "", sep = "\n")
filteredList <- lapply(strsplit(txtin, delimiter), FUN = filterfunction)
   # filterfunction() returns selected (and possibly transformed)
   # elements if present, and NULL otherwise;
   # may include calls to grep(), regexpr(), gsub(), substring(),
   # nchar(), type.convert(), paste(), etc.
mydataframe <- do.call(rbind, filteredList)
   # then match(), subset(), aggregate(), etc.

In the case that the file is large, I open a file connection and scan
a single line + apply filterfunction() successively in a FOR-LOOP
instead of using lapply(). Of course, the devil is in the details of
the filtering function, but I believe most of the required text
processing facilities are already provided by R.

I often have tasks that involve a combination of shell-scripting and
text processing to construct the data frame for analysis; I started
out using Python+NumPy to do the front-end work but have been using R
progressively more (frankly, all of it) to take over that portion
since I generally prefer the data structures and methods in R.
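A concrete toy instance of the pattern above (the input format and the filter rule are invented for illustration):

```r
txt <- c("id=1;val=10", "# a comment line to skip", "id=2;val=20")

filterfunction <- function(fields) {
  # keep records that split into exactly two key=value fields,
  # returning just the numeric parts; drop everything else
  if (length(fields) == 2) as.numeric(sub(".*=", "", fields)) else NULL
}

parts  <- lapply(strsplit(txt, ";"), filterfunction)
mydata <- do.call(rbind, parts)   # rbind() silently drops the NULLs
colnames(mydata) <- c("id", "val")
```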


--- Peter Dalgaard [EMAIL PROTECTED] wrote:

 Douglas Bates wrote:
  Frank Harrell indicated that it is possible to do a lot of difficult
  data transformation within R itself if you try hard enough but that
  sometimes means working against the S language and its whole object
  view to accomplish what you want and it can require knowledge of
  subtle aspects of the S language.

 Actually, I think Frank's point was subtly different: It is *because* of 
 the differences in view that it sometimes seems difficult to find the 
 way to do something in R that  is apparently straightforward in SAS. 
 I.e. the solutions exist and are often elegant, but may require some 
 lateral thinking.
 
 Case in point: Finding the first or the last observation for each 
 subject when there are multiple records for each subject. The SAS way 
 would be a datastep with IF-THEN-DELETE, and a RETAIN statement so that 
 you can compare the subject ID with the one from the previous record, 
 working with data that are sorted appropriately.
 
 You can do the same thing in R with a for loop, but there are better 
 ways e.g.
 subset(df, !duplicated(ID)), and subset(df, rev(!duplicated(rev(ID)))), or 
 maybe
 do.call(rbind,lapply(split(df,df$ID), head, 1)), resp. tail. Or 
 something involving aggregate(). (The latter approaches generalize 
 better to other within-subject functionals like cumulative doses, etc.).
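Peter's suggestions made concrete on a toy data frame (a sketch; `df` and `ID` are illustrative names):

```r
df <- data.frame(ID = c(1, 1, 1, 2, 2, 3), y = 1:6)
subset(df, !duplicated(ID))                        # first record per subject
subset(df, rev(!duplicated(rev(ID))))              # last record per subject
do.call(rbind, lapply(split(df, df$ID), head, 1))  # first record, via split()
```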
 
 The hardest cases that I know of are the ones where you need to turn one 
 record into many, such as occurs in survival analysis with 
 time-dependent, piecewise constant covariates. This may require 
 transposing the problem, i.e. for each  interval you find out which 
 subjects contribute and with what, whereas the SAS way would be a 
 within-subject loop over intervals containing an OUTPUT statement.
 
 Also, there are some really weird data formats, where e.g. the input 
 format is different in different records. Back in the 80's where 
 punched-card input was still common, it was quite popular to have one 
 card with background information on a patient plus several cards 
 detailing visits, and you'd get a stack of cards containing both kinds. 
 In R you would most likely split on the card type using grep() and then 
 read the two kinds separately and merge() them later.
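In code, the split-and-merge Peter describes might look like this (a sketch with invented card data; the "B" and "V" record-type prefixes and column names are assumptions):

```r
# invented example: "B" cards carry background data, "V" cards carry visits
cards <- c("B 1 M 1950",
           "V 1 2007-01-01 120",
           "V 1 2007-02-01 118",
           "B 2 F 1948",
           "V 2 2007-01-15 130")
bg  <- read.table(textConnection(cards[grep("^B", cards)]),
                  col.names = c("type", "id", "sex", "born"))
vis <- read.table(textConnection(cards[grep("^V", cards)]),
                  col.names = c("type", "id", "date", "bp"))
merged <- merge(bg[-1], vis[-1], by = "id")   # drop the type column, join on id
```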
 
 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 



  




Re: [R] Tools For Preparing Data For Analysis

2007-06-10 Thread Ted Harding
On 10-Jun-07 14:04:44, Sarah Goslee wrote:
 On 6/10/07, Ted Harding [EMAIL PROTECTED] wrote:
 
 ... a lot of the problems with data
 files arise at the data gathering and entry stages, where people
 can behave as if stuffing unpaired socks and unattributed underwear
 randomly into a drawer, and then banging it shut.
 
 Not specifically R-related, but this would make a great fortune.
 
 Sarah
 -- 
 Sarah Goslee
 http://www.functionaldiversity.org

I'm not going to object to that!
Ted.


E-Mail: (Ted Harding) [EMAIL PROTECTED]
Fax-to-email: +44 (0)870 094 0861
Date: 10-Jun-07   Time: 21:18:45
-- XFMail --



[R] Question on weighted Kaplan-Meier analysis of case-cohort design

2007-06-10 Thread xiao-jun ma
I have a study best described as a retrospective case-cohort design:  
the cases were all the events in a given time span surveyed, and the  
controls (event-free during the follow-up period) were selected in  
2:1 ratio (2 controls per case).  The sampling frequency for the  
controls was about 0.27, so I used a weight vector consisting of 1  
for cases and 1/0.27 for controls for coxph to adjust for sampling  
bias. Using the same weights in Kaplan-Meier analysis (survfit) gave  
very inaccurate survival curves (much lower event rate than expected  
from the population). Is weighting handled differently between coxph and  
survfit? How should I conduct a weighted Kaplan-Meier analysis (given  
that survfit doesn't accept a weighted cox model) for such a design?

Any explanations or suggestions are highly appreciated,

xiaojun



Re: [R] Windows vista's early terminate Rgui execution

2007-06-10 Thread Robert A LaBudde
At 03:28 PM 6/10/2007, [EMAIL PROTECTED] wrote:
Hi,I have a frustrating problem from vista that I wonder if anyone 
has come across the same problem. I wrote a script that involves 
long computational time (although, during the calculation, it spits 
out text on the gui to notify me the progress of the calculation 
periodically). Windows vista always stopped my calculation and 
claimed that 'Rgui is stop-working. Windows is checking for 
solution.' And when I looked into task manager, windows already 
stopped my Rgui process. I am quite disappointed with this. I would 
really appreciate if anyone finds a solution to go around this 
windows vista problem? Particularly, how to turn off this feature in 
vista? Any help would be really appreciated. Thank you!- adschai

You probably need to contact Vista periodically so it knows you are awake.

Just include a line that does a call to Vista that doesn't do output, such as

useless <- dir()

placed in some outer loop that satisfies the drop dead time between calls.

Alternatively, you can attempt to find out how to change the registry 
entry corresponding to the wait time and increase it to a value you 
can live with.
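A sketch of the suggestion in loop form (do_one_step() and n are hypothetical stand-ins for the real computation):

```r
n <- 10000
for (i in 1:n) {
  do_one_step(i)                # hypothetical unit of heavy work
  if (i %% 500 == 0) {
    useless <- dir()            # touch the OS so Vista sees activity
    cat("progress:", i, "of", n, "\n")
  }
}
```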


Robert A. LaBudde, PhD, PAS, Dpl. ACAFS  e-mail: [EMAIL PROTECTED]
Least Cost Formulations, Ltd.URL: http://lcfltd.com/
824 Timberlake Drive Tel: 757-467-0954
Virginia Beach, VA 23464-3239Fax: 757-467-2947

Vere scire est per causas scire



[R] initial value for optim in polr question

2007-06-10 Thread adschai
Hi,

I have a problem with initial value for optim in polr that R report. After a 
call to polr, it complains that:

Error in optim(start, fmin, gmin, method = "BFGS", hessian = Hess, ...) :
  initial value in 'vmin' is not finite.

Would you please suggest a way around this problem? Thank you so much in 
advance.

Rgds,

- adschai



Re: [R] Windows vista's early terminate Rgui execution

2007-06-10 Thread adschai
That's really helpful, Robert! I was thinking of writing my output to a file 
periodically, but that would make my runtime longer. I think this way is 
better: running dir(), which contacts Windows periodically, takes much less 
time than writing to a file. Thank you.

- adschai

- Original Message -
From: Robert A LaBudde
Date: Sunday, June 10, 2007 3:32 pm
Subject: Re: [R] Windows vista's early terminate Rgui execution
To: R-help@stat.math.ethz.ch

 [...]



Re: [R] {nlme} Multilevel estimation heteroscedasticity

2007-06-10 Thread Andrew Robinson
Rense,

how about 

weights = varPower(form = ~ schavg)

or 

weights = varConstPower(form = ~ schavg)

or even 

weights = varPower(form = ~ schavg | type)

You might find Pinheiro and Bates (2000) to be a valuable investment.
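Plugged into the model from the original post, the first suggestion would look like this (a sketch; as with any variance model, convergence is not guaranteed):

```r
library(nlme)
library(mlmRev)   # provides the Exam data used in the original post
het <- lme(fixed = normexam ~ standLRT,
           data = Exam,
           random = ~ 1 | school,
           weights = varPower(form = ~ schavg))
summary(het)      # the power coefficient appears under "Variance function"
```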

I hope that this helps,

Andrew


On Sun, Jun 10, 2007 at 04:35:58PM +0200, Rense Nieuwenhuis wrote:
 Dear All,
 
 I'm trying to model heteroscedasticity using a multilevel model. To  
 do so, I make use of the nlme package and the weigths-parameter.  
 Let's say that I hypothesize that the exam score of students  
 (normexam) is influenced by their score on a standardized LR test  
 (standLRT). Students are of course nested in schools. These  
 variables are contained in the Exam-data in the mlmRev package.
 
 library(nlme)
 library(mlmRev)
 lme(fixed = normexam ~ standLRT,
   data = Exam,
   random = ~ 1 | school)
 
 
 If I want to model only a few categories of variance, all works fine.  
 For instance, should I (for whatever reason) hypothesize that the  
 variance on the normexam-scores is larger in mixed schools than in  
 boys-schools, I'd use weights = varIdent(form = ~ 1 | type), leading to:
 
 heteroscedastic - lme(fixed = normexam ~ standLRT,
   data = Exam,
   weights = varIdent(form = ~ 1 | type),
   random = ~ 1 | school)
 
 This gives me nice and clear output, part of which is shown below:
 Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~normexam | type
 Parameter estimates:
   Mxd Sngl
 1.00 1.034607
 Number of Observations: 4059
 Number of Groups: 65
 
 
 Though, should I hypothesize that the variance on the normexam- 
 variable is larger on schools that have a higher average score on  
 intake-exams (schavg), I run into troubles. I'd use weights = varIdent 
 (form = ~ 1 | schavg), leading to:
 
 heteroscedastic - lme(fixed = normexam ~ standLRT,
   data = Exam,
   weights = varIdent(form = ~ 1 | schavg),
   random = ~ 1 | school)
 
 This leads to estimation problems. R tells me:
 Error in lme.formula(fixed = normexam ~ standLRT, data = Exam,  
 weights = varIdent(form = ~1 |  :
   nlminb problem, convergence error code = 1; message = iteration  
 limit reached without convergence (9)
 
 Fiddling with maxiter and setting an unreasonable tolerance doesn't  
 help. I think the origin of this problem lies within the large number  
 of categories on schavg (65), that may make estimation troublesome.
 
 This leads to my two questions:
 - How to solve this estimation-problem?
 - Is it possible that the varIdent (or more generally: varFunc) of lme  
 returns a single value, representing a coefficient along which  
 variance is increasing / decreasing?
 
 - In general: how can a variance-component / heteroscedasticity be  
 made dependent on some level-2 variable (school level in my examples) ?
 
 Many thanks in advance,
 
 Rense Nieuwenhuis


-- 
Andrew Robinson  
Department of Mathematics and StatisticsTel: +61-3-8344-9763
University of Melbourne, VIC 3010 Australia Fax: +61-3-8344-4599
http://www.ms.unimelb.edu.au/~andrewpr
http://blogs.mbs.edu/fishing-in-the-bay/



Re: [R] Tools For Preparing Data For Analysis

2007-06-10 Thread Ted Harding
On 10-Jun-07 19:27:50, Stephen Tucker wrote:
 
 Since R is supposed to be a complete programming language,
 I wonder why these tools couldn't be implemented in R
 (unless speed is the issue). Of course, it's a naive desire
 to have a single language that does everything, but it seems
 that R currently has most of the functions necessary to do
 the type of data cleaning described.

In principle that is certainly true. A couple of comments,
though.

1. R's rich data structures are likely to be superfluous.
   Mostly, at the sanitisation stage, one is working with
   flat files (row  column). This straightforward format
   is often easier to handle using simple programs for the
   kind of basic filtering needed, rather then getting into
   the heavier programming constructs of R.

2. As follow-on and contrast at the same time, very often
   what should be a nice flat file with no rough edges is not.
   If there are variable numbers of fields per line, R will
   not handle it straightforwardly (you can force it in,
   but it's more elaborate). There are related issues as well.

a) If someone entering data into an Excel table lets their
   cursor wander outside the row/col range of the table,
   this can cause invisible entities to be planted in the
   extraneous cells. When saved as a CSV, this file then
   has variable numbers of fields per line, and possibly
   also extra lines with arbitrary blank fields.

   cat datafile.csv | awk 'BEGIN{FS=","}{n=NF; print n}'

   will give you the numbers of fields in each line.

   If you further pipe it into "sort -nu" you will get
   the distinct field-numbers. If you know (by now) how many
   fields there should be (e.g. 10), then

   cat datafile.csv | awk 'BEGIN{FS=","} (NF != 10){print NR "," NF}'

   will tell you which lines have the wrong number of fields,
   and how many fields they have. You can similarly count how
   many lines there are (e.g. pipe into wc -l).

b) People sometimes randomly use a blank space or a "." in a
   cell to denote a missing value. Consistent use of either
   is OK: ",," in a CSV will be treated as NA by R. The use
   of "." can be more problematic. If for instance you try to
   read the following CSV into R as a dataframe:

   1,2,.,4
   2,.,4,5
   3,4,.,6

   the "." in cols 2 and 3 is treated as the character ".",
   with the result that something complicated happens to
   the typing of the items.

   typeof(D[i,j]) is always "integer". sum(D[1,1]) = 1, but
   sum(D[1,2]) gives a type error, even though the entry
   is in fact 2. And so on, in various combinations.

   And as.matrix(D) is of course a matrix of characters.

   In fact, columns 2 and 3 of D are treated as factors!

   for(i in (1:3)){ for(j in (1:4)){ print( (D[i,j]))}}
   [1] 1
   [1] 2
   Levels: . 2 4
   [1] .
   Levels: . 4
   [1] 4
   [1] 2
   [1] .
   Levels: . 2 4
   [1] 4
   Levels: . 4
   [1] 5
   [1] 3
   [1] 4
   Levels: . 2 4
   [1] .
   Levels: . 4
   [1] 6

   This is getting altogether too complicated for the job
   one wants to do!

   And it gets worse when people mix ",," and ",.,"!

   On the other hand, a simple brush with awk (or sed in
   this case) can sort it once and for all, without waking
   the sleeping dogs in R.

I could go on. R undoubtedly has the power, but it can very
quickly get over-complicated for simple jobs.

Best wishes to all,
Ted.


E-Mail: (Ted Harding) [EMAIL PROTECTED]
Fax-to-email: +44 (0)870 094 0861
Date: 10-Jun-07   Time: 22:14:35
-- XFMail --



Re: [R] Tools For Preparing Data For Analysis

2007-06-10 Thread roger koenker
An important potential benefit of R solutions shared by awk, sed, ...
is that they provide a reproducible way to document exactly how one got
from one version of the data to the next. This seems to be the main
problem with handicraft methods like editing Excel files: it is too
easy to introduce new errors that can't be tracked down at later
stages of the analysis.


url:www.econ.uiuc.edu/~rogerRoger Koenker
email   [EMAIL PROTECTED]   Department of Economics
vox:217-333-4558University of Illinois
fax:217-244-6678Champaign, IL 61820


On Jun 10, 2007, at 4:14 PM, (Ted Harding) wrote:

 On 10-Jun-07 19:27:50, Stephen Tucker wrote:

 Since R is supposed to be a complete programming language,
 I wonder why these tools couldn't be implemented in R
 (unless speed is the issue). Of course, it's a naive desire
 to have a single language that does everything, but it seems
 that R currently has most of the functions necessary to do
 the type of data cleaning described.

 In principle that is certainly true. A couple of comments,
 though.

 [...]



Re: [R] Nonlinear Regression

2007-06-10 Thread Spencer Graves
  Have you worked through the examples in the 'nls' help file, 
especially the following: 

 DNase1 <- subset(DNase, Run == 1)
 fm3DNase1 <- nls(density ~ Asym/(1 + exp((xmid - log(conc))/scal)),
  data = DNase1,
  start = list(Asym = 3, xmid = 0, scal = 1),
  trace = TRUE)

 Treated <- Puromycin[Puromycin$state == "treated", ]
 weighted.MM <- function(resp, conc, Vm, K)
 {
 ## Purpose: exactly as white book p. 451 -- RHS for nls()
 ##  Weighted version of Michaelis-Menten model
 ## --
 ## Arguments: 'y', 'x' and the two parameters (see book)
 ## --
 ## Author: Martin Maechler, Date: 23 Mar 2001

 pred <- (Vm * conc)/(K + conc)
 (resp - pred) / sqrt(pred)
 }

 Pur.wt <- nls( ~ weighted.MM(rate, conc, Vm, K), data = Treated,
   start = list(Vm = 200, K = 0.1),
   trace = TRUE)
112.5978 :  200.0   0.1
17.33824 :  205.67588840   0.04692873
14.6097 :  206.33087396   0.05387279
14.59694 :  206.79883508   0.05457132
14.59690 :  206.83291286   0.05460917
14.59690 :  206.83468191   0.05461109

# In the call to 'nls' here, 'Vm' and 'K' are in 'start' and must 
therefore be parameters to be estimated. 
# The other names passed to the global 'weighted.MM' must be columns of 
'data = Treated'. 

# To get the residual sum of squares, first note that it is printed as 
the first column in the trace output. 

# To get that from Pur.wt, I first tried 'class(Pur.wt)'. 
# This told me it was of class 'nls'. 
# I then tried methods(class='nls'). 
# One of the functions listed was 'residuals.nls'.  That gave me the 
residuals. 
# I then tried 'sum(residuals(Pur.wt)^2)', which returned 14.59690. 

  Hope this helps. 
  Spencer Graves
p.s.  Did this answer your question?  Your example did not seem to me to 
be self contained, which makes it more difficult for me to know if I'm 
misinterpreting your question.  If the example had been self contained, 
I might have replied a couple of days ago. 
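For the original question, a named function with five parameters can be passed to nls() directly. A minimal self-contained sketch (the model, data, and starting values here are invented for illustration; fits of this kind can be sensitive to the choice of start):

```r
# hypothetical 5-parameter model: sum of two exponentials plus an offset
myfunction <- function(x, p1, p2, p3, p4, p5)
  p1 * exp(-p2 * x) + p3 * exp(-p4 * x) + p5

set.seed(42)
x <- seq(0, 10, length = 50)
y <- myfunction(x, 5, 1.2, 2, 0.3, 1) + rnorm(length(x), sd = 0.1)

fit <- nls(y ~ myfunction(x, p1, p2, p3, p4, p5),
           data = data.frame(x = x, y = y),
           start = list(p1 = 4, p2 = 1, p3 = 1.5, p4 = 0.4, p5 = 0.8))
sum(residuals(fit)^2)   # residual sum of squares; deviance(fit) is equivalent
```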

tronter wrote:
 Hello

 I followed the example on page 59, chapter 11 of the 'Introduction to R'
 manual. I entered my own x,y data. I used the least squares. My function has
 5 parameters: p[1], p[2], p[3], p[4], p[5]. I plotted the x-y data. Then I
 used lines(spline(xfit,yfit)) to overlay best curves on the data while
 changing the parameters. My question is how do I calculate the residual sum
 of squares. In the example they have the following:

 df <- data.frame(x = x, y = y)

 fit <- nls(y ~ SSmicmen(s, Vm, K), df)

 fit


 In the second line how would I input my function? Would it be:

 fit <- nls(y ~ myfunction(p[1], p[2], p[3], p[4], p[5]), df) where
 myfunction is the actual function? My function doesn't have a name, so should
 I just enter it?

 Thanks







[R] How do I obtain standard error of each estimated coefficients in polr

2007-06-10 Thread adschai
Hi,

I obtained all the coefficients that I need from polr. However, I'm wondering 
how I can obtain the standard error of each estimated coefficient. I saved the 
Hessian and did something like summary(polrObj), but I don't see any standard 
errors like those shown when doing regression with lm. Any help would be really 
appreciated. Thank you!

- adschai



Re: [R] Tools For Preparing Data For Analysis

2007-06-10 Thread Stephen Tucker
Embarrassingly, I don't know awk or sed, but R's code seems to be
shorter for most tasks than Python, which is my basis for comparison.

It's true that R's more powerful data structures usually aren't
necessary for the data cleaning, but sometimes in the filtering
process I will pick out lines that contain certain data, in which case
I have to convert text to numbers and perform operations like
which.min(), order(), etc., so in that sense I like to have R's
vectorized notation and the objects/functions that support it.

As far as some of the tasks you described, I've tried transcribing
them to R. I know you provided only the simplest examples, but even in
these cases I think R's functions for handling these situations
exemplify their usefulness in this step of the analysis. But perhaps
you would argue that this code is too long... In any event it will
still save the trouble of keeping track of an extra (intermediate)
file passed between awk and R.

(1) The numbers of fields in each line, equivalent to
cat datafile.csv | awk 'BEGIN{FS=","}{n=NF; print n}'
in awk:

# R equivalent:
nFields <- count.fields("datafile.csv", sep = ",")
# or
nFields <- sapply(strsplit(readLines("datafile.csv"), ","), length)

(2) which lines have the wrong number of fields, and how many fields
they have. You can similarly count how many lines there are (e.g. pipe
into wc -l).

# number of lines with wrong number of fields
nWrongFields <- length(nFields[nFields > 10])

# select only first ten fields from each line
# and return a matrix
firstTenFields <-
  do.call(rbind,
  lapply(strsplit(readLines("datafile.csv"), ","),
 function(x) x[1:10]))

# select only those lines which contain ten fields
# and return a matrix
onlyTenFields <-
  do.call(rbind,
  lapply(strsplit(readLines("datafile.csv"), ","),
 function(x) if (length(x) >= 10) x else NULL))

(3)
If for instance you try to
read the following CSV into R as a dataframe:
 
1,2,.,4
2,.,4,5
3,4,.,6
 

txtC <- textConnection(
"1,2,.,4
2,.,4,5
3,4,.,6")
# using read.csv(), specifying the na.strings argument:
> read.csv(txtC, header = FALSE, na.strings = ".")
  V1 V2 V3 V4
1  1  2 NA  4
2  2 NA  4  5
3  3  4 NA  6

# Of course, read.csv will work only if data is formatted correctly.
# More generally, using readLines(), strsplit(), etc., which are more
# flexible :

> do.call(rbind,
+ lapply(strsplit(readLines(txtC), ","),
+type.convert, na.strings = "."))
     [,1] [,2] [,3] [,4]
[1,]    1    2   NA    4
[2,]    2   NA    4    5
[3,]    3    4   NA    6

(4) Situations where people mix ",," and ",.,"!

# type.convert (and read.csv) will still work when missing values are ",,"
# and ",.," (automatically recognizes "" as NA and, through
# specification of 'na.strings', can recognize "." as NA)

# If it is desired to convert "." to "" first, this is simple as
# well:

m <- do.call(rbind,
lapply(strsplit(readLines(txtC), ","),
   function(x) gsub("^\\.$", "", x)))
> m
     [,1] [,2] [,3] [,4]
[1,] "1"  "2"  ""   "4" 
[2,] "2"  ""   "4"  "5" 
[3,] "3"  "4"  ""   "6" 

# then
mode(m) <- "numeric"
# or
m <- apply(m, 2, type.convert)
# will give
> m
     [,1] [,2] [,3] [,4]
[1,]    1    2   NA    4
[2,]    2   NA    4    5
[3,]    3    4   NA    6


--- [EMAIL PROTECTED] wrote:

 On 10-Jun-07 19:27:50, Stephen Tucker wrote:
  
  Since R is supposed to be a complete programming language,
  I wonder why these tools couldn't be implemented in R
  (unless speed is the issue). Of course, it's a naive desire
  to have a single language that does everything, but it seems
  that R currently has most of the functions necessary to do
  the type of data cleaning described.
 
 In principle that is certainly true. A couple of comments,
 though.
 
 [...]

[R] Determination of % of misclassification

2007-06-10 Thread spime

Hi R-users,

Suppose I have a two-class discrimination problem and I am using logistic
regression for the classification.

 model.logit <-
 glm(formula = RES ~ NUM01+NUM02+NUM03+NUM04, family = binomial(link = "logit"), data = train.data)
 predict.logit <- predict.glm(model.logit, newdata = test.data, type = 'response', se.fit = FALSE)
 predict.logit

I have two questions:

1. Suppose our training data consist of 700 observations and the testing set
of 300. How can I determine the number of misclassifications from the
predicted and fitted values?

2. How to determine AUC from ROC curve and also threshold value?
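A common approach would look like this (a sketch, assuming RES is coded 0/1, taking 0.5 as the cutoff, and reusing the object names from the code above; the ROCR package, if installed, gives the AUC):

```r
# confusion matrix and misclassification rate at a 0.5 cutoff
pred.class <- ifelse(predict.logit > 0.5, 1, 0)
confusion  <- table(observed = test.data$RES, predicted = pred.class)
misclass   <- 1 - sum(diag(confusion)) / sum(confusion)

# AUC via the ROCR package (if installed):
# library(ROCR)
# pred <- prediction(predict.logit, test.data$RES)
# unlist(performance(pred, "auc")@y.values)
```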

Waiting for reply,

Thanks in advance,

bye
-- 
View this message in context: 
http://www.nabble.com/Determination-of---of-misclassification-tf3899598.html#a11055026
Sent from the R help mailing list archive at Nabble.com.



[R] lm for matrix of response...

2007-06-10 Thread vinod gullu
Dear All,
1) Can I use lm() to fit more than one response in a
single expression? e.g. data is a matrix of these
variables
R1 R2 R3  X Y Z
1 2 1 1 2 3 

Now I want to fit R1:R3 ~ X+Y+Z.
2) How can I use singular value decomposition (SVD) as
an alternative to least squares?
Regards,



[R] [R-pkgs] Updated ggplot2 package (beta version)

2007-06-10 Thread hadley wickham
ggplot2
===

ggplot2 is a plotting system for R, based on the grammar of graphics,
which tries to take the good parts of base and lattice graphics and
none of the bad parts. It takes care of many of the fiddly details
that make plotting a hassle (like drawing legends) as well as
providing a powerful model of graphics that makes it easy to produce
complex multi-layered graphics.

Find out more at http://had.co.nz/ggplot2

Changes in version 0.5.1 --

 * new chapter in book and changes to package to make it possible to
customise every aspect of ggplot display using grid

 * a new economic data set to help demonstrate line, path and area plots

 * many bug fixes reported by beta testers

Hadley

___
R-packages mailing list
[EMAIL PROTECTED]
https://stat.ethz.ch/mailman/listinfo/r-packages



Re: [R] lm for matrix of response...

2007-06-10 Thread Prof Brian Ripley
On Sun, 10 Jun 2007, vinod gullu wrote:

 Dear All,
 1) Can I use lm() to fit more than one response in a
 single expression? e.g. data is a matrix of these
 variables
 R1 R2 R3  X Y Z
 1 2 1 1 2 3

 Now I want to fit R1:R3 ~ X+Y+Z.

?lm says

  If 'response' is a matrix a linear model is fitted separately by
  least-squares to each column of the matrix.

so cbind(R1,R2,R3) ~ X+Y+Z
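A minimal runnable illustration of the multi-response form, on made-up data (the variable names follow the question, the values do not):

```r
## Simulated data with three responses and three predictors
set.seed(42)
n <- 20
X <- rnorm(n); Y <- rnorm(n); Z <- rnorm(n)
R1 <- 1 + X + rnorm(n)
R2 <- 2 - Y + rnorm(n)
R3 <-     Z + rnorm(n)

fit <- lm(cbind(R1, R2, R3) ~ X + Y + Z)
coef(fit)   # one column of coefficients per response
```

summary(fit) prints a separate summary for each response column.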

 2) How can I use singular value decomposition (SVD) as
 an alternative to least squares?

See ?svd.

Note that SVD is not a model-fitting criterion; it can, however, be used to
compute a least-squares fit.  If you mean something else, please study the
posting guide and tell us precisely what you mean, with references.
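For completeness, a sketch of obtaining the least-squares coefficients from the SVD and checking them against lm, on made-up data:

```r
## Least-squares via the SVD: X = U D V', so beta = V D^{-1} U' y
set.seed(1)
X <- cbind(1, matrix(rnorm(30), 10, 3))  # design matrix with intercept
y <- rnorm(10)

s <- svd(X)
beta.svd <- s$v %*% (crossprod(s$u, y) / s$d)

beta.lm <- unname(coef(lm(y ~ X - 1)))   # same fit by lm's QR route
all.equal(c(beta.svd), beta.lm)          # should be TRUE
```

The SVD route is numerically stable for ill-conditioned X, and small singular values can be truncated to get a ridge-like regularised fit.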

-- 
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK              Fax:  +44 1865 272595
