[R] opinions please: text editors and reporting/Sweave?

2007-05-30 Thread Tim Howard
dear all - 

I currently use Tinn-R as my text editor to work with code that I submit to R, 
with some output dumped to text files, some images dumped to pdf. (system: 
Windows 2K and XP, R 2.4.1 and R 2.5). We are using R for overnight runs to 
create large output data files for GIS, but then I need simple output reports 
for analysis results for each separate data set. Thus, I create many reports of 
the same style, but just based on different input data.

I am recognizing that I need a better reporting system, so that I can create 
clean reports for each separate R run. This obviously means using Sweave and 
some implementation of LaTeX, both of which are new to me. I've installed 
MiKTeX and successfully completed a demo or two for creating PDFs from raw 
LaTeX.

It appears that if I want to ease my entry into the world of LaTeX, I might 
need to switch editors to something like Emacs (I read somewhere that Emacs 
helps with the TeX markup?). After quite a while wallowing at the Emacs site, I 
am finding that ESS is well integrated with R and might be the way to go. 
Aaaagh... I'm in way over my head!

My questions:

What, in your opinion, is the simplest way to integrate text and graphics 
output into a single report such as a PDF file?

If the answer to this is LaTeX and Sweave, is it difficult to use a text editor 
such as Tinn-R or would you strongly recommend I leave behind Tinn and move 
over to an editor that has more LaTeX help?  

In reading over Friedrich Leisch's Sweave User Manual (v 1.6.0) I am 
beginning to think I can do everything I need with my simple editor. Before 
spending many hours going down that path, I thought it prudent to ask the R 
community.
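
For concreteness, here is what I gather a minimal Sweave file looks like (a sketch only; the file name report.Rnw and the chunk contents are invented). It is ordinary LaTeX with R code chunks delimited by <<>>= and @:

```latex
\documentclass{article}
\begin{document}
\section{Results for dataset X}

<<echo=TRUE>>=
dat <- rnorm(100)   # any R code; it runs when the file is processed
summary(dat)
@

<<fig=TRUE, echo=FALSE>>=
hist(dat)           # figure chunks become graphics included in the PDF
@

\end{document}
```

As I understand it, Sweave("report.Rnw") in R produces report.tex, which MiKTeX then turns into a PDF; since the .Rnw file is plain text, any editor (including Tinn-R) can write it.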

It is likely I am misunderstanding some of the process here and any 
clarifications are welcome. 

Thank you in advance for any thoughts. 
Tim Howard

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] ROC optimal threshold

2006-03-31 Thread Tim Howard
Jose - 

I've struggled a bit with the same question, said another way: how do you find 
the value in a ROC curve that minimizes false positives while maximizing true 
positives?

Here's something I've come up with. I'd be curious to hear from the list 
whether anyone thinks this code might get stuck in local minima, or if it does 
find the global minimum each time. (I think it's ok).

From your ROC object you need to grab the sensitivity (= true positive rate) 
and specificity (= 1 - false positive rate) and the cutoff levels.  Then find 
the value that minimizes abs(sensitivity - specificity), or 
sqrt((1-sens)^2 + (1-spec)^2), as follows:

absMin <- extract[which.min(abs(extract$sens - extract$spec)), ]
sqrtMin <- extract[which.min(sqrt((1 - extract$sens)^2 + (1 - extract$spec)^2)), ]

In this example, 'extract' is a dataframe containing three columns: 
extract$sens = sensitivity values, extract$spec = specificity values, 
extract$votes = cutoff values. The command subsets the dataframe to a single 
row containing the desired cutoff and the sens and spec values that are 
associated with it.
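
For the archives, here is a self-contained sketch of both criteria on a made-up data frame (the column names sens, spec, and votes are just the ones from my description above; your ROC object will differ):

```r
# Made-up sensitivity/specificity pairs at a few cutoffs ("votes")
extract <- data.frame(
  votes = c(0.1, 0.3, 0.5, 0.7, 0.9),
  sens  = c(0.98, 0.90, 0.75, 0.50, 0.20),
  spec  = c(0.30, 0.55, 0.78, 0.90, 0.99)
)

# Criterion 1: sensitivity and specificity as close together as possible
absMin  <- extract[which.min(abs(extract$sens - extract$spec)), ]

# Criterion 2: point closest to the top-left corner of the ROC plot
sqrtMin <- extract[which.min(sqrt((1 - extract$sens)^2 + (1 - extract$spec)^2)), ]

absMin$votes   # cutoff chosen by criterion 1
sqrtMin$votes  # cutoff chosen by criterion 2
```

With these made-up numbers both criteria pick the same row, but as I said, on real data they sometimes differ quite a bit.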

Most of the time these two answers (abs or sqrt) are the same, sometimes they 
differ quite a bit. 

I do not see this application of ROC curves very often. A question for those 
much more knowledgeable than I: is there a problem with using ROC curves in 
this manner?

Tim Howard




Date: Fri, 31 Mar 2006 11:58:14 +0200
From: Anadon Herrera, Jose Daniel [EMAIL PROTECTED]
Subject: [R] ROC optimal threshold
To: 'r-help@stat.math.ethz.ch' r-help@stat.math.ethz.ch
Message-ID:
[EMAIL PROTECTED]
Content-Type: text/plain;   charset=iso-8859-1

hello,

I am using the ROC package to evaluate predictive models.
I have successfully plotted the ROC curve; however,

is there any way to obtain the value of the operating point (= optimal
threshold value, i.e. the point of the curve nearest the top-left corner
of the axes)?

thank you very much,


jose daniel anadon
area de ecologia
universidad miguel hernandez

España



Re: [R] ROC optimal threshold

2006-03-31 Thread Tim Howard
Dr. Harrell, 
Thank you for your response. I had noted, and appreciate, your perspective on 
ROC in past listserv entries and am glad to have an opportunity to delve a 
little deeper.

I (and, I think, Jose Daniel Anadon, the original poster of this question) have 
a predictive model for the presence of, say, animal_X. This is a spatial model 
that can be represented on maps and is based on known locations where  animal_X 
is present and (usually) known locations where animal_X is absent. Output of 
the analysis (using any number of analytic routines, including logit, 
randomForest, maximum entropy, mahalanobis distance...) is a full map where 
every spot on the map has a probability that that particular location has the 
appropriate habitat for animal_x.

This output can be visualized by just using a color scale (perhaps blue for low 
probability to red for high probability), BUT, there are times when we want to 
apply a cutoff to this probability output and create a product where we can say 
either "yes, animal_X habitat is predicted here" or "no, animal_X habitat is 
not predicted here".
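
In code terms that last step is just applying the cutoff (a tiny sketch with made-up probabilities; the cutoff value itself would come from the ROC analysis):

```r
prob   <- c(0.12, 0.45, 0.78, 0.91, 0.33)  # predicted habitat probabilities
cutoff <- 0.5                              # e.g. chosen from the ROC curve

# Binary yes/no habitat surface
habitat <- ifelse(prob >= cutoff, "yes", "no")
habitat
# [1] "no"  "no"  "yes" "yes" "no"
```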

Note this is the final analytic step. There are no later analysis steps, so 
(possibly) adjustments for multiple comparisons do not come into play.

Indeed, it seems that using a standard process to find a threshold reduces the 
arbitrariness of the probability color scale (at what probability do we set 
'red'? at what probability do we set 'blue'?).

Are there alternative approaches that reduce the drawbacks you allude to? 

How would you turn a surface of probabilities into a binary surface of yes-no?

Thank you for your time.
Sincerely,
Tim Howard

Ecologist
New York Natural Heritage Program

 Frank E Harrell Jr [EMAIL PROTECTED] 03/31/06 11:20 AM 

Choosing cutoffs is fraught with difficulties, arbitrariness, 
inefficiency, and the necessity to use a complex adjustment for multiple 
comparisons in later analysis steps unless the dataset used to generate 
the cutoff was so large as could be considered infinite.

-- 
Frank E Harrell Jr   Professor and Chair   School of Medicine
  Department of Biostatistics   Vanderbilt University



[R] how to use the randomForest and rpart function?

2006-03-09 Thread Tim Howard
Michael - 
I recall reading something Breiman wrote that said essentially "don't skimp on 
the number of trees - they are cheap to build and it makes for a better model." 
Also, look at your error rates (using plot), and make sure you run enough 
trees so that the error settles down. You'll likely be building 1000 or so 
trees. 
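
A sketch of what I mean (this assumes the randomForest package is installed; the iris data, seed, and ntree value are mine, purely for illustration):

```r
library(randomForest)

set.seed(42)
fit <- randomForest(Species ~ ., data = iris, ntree = 1000)

# OOB error rate by number of trees; look for where the curves flatten out
plot(fit)

# The underlying error-rate matrix, one row per tree
head(fit$err.rate)
```

If the error is still drifting at the right edge of the plot, increase ntree and refit.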

Tim



Hi Andy,

Does the randomForest have a Cross Validation built-in to decide what is the
best number of trees or I have to find the best number manually by myself?

Thanks a lot!

Michael.

On 3/7/06, Liaw, Andy [EMAIL PROTECTED] wrote:

 Yes, I do know.  That's why I pointed you to the reference linked from the
 help page.

 BTW, there's also an R News article describing the initial version of the
 package.  Have you perused that?

 Andy



RE: [R] Assign factor and levels inside function

2005-04-22 Thread Tim Howard
Aha!
   You've just opened the door to another level for this blundering R
user.  I even went back to my well-used copy of An Introduction to R
to see where I missed this standard approach for processing new data. 
Nothing clear but certainly alluded to in many of the function examples.
 I don't know why I was stuck in that rut.

I'm sure 99.9% of you on this list know this, but... To be clear for
anyone searching these archives later:  Don't bother to ask your
function to make assignments to pos=1 (the global environment), just do
the assignment yourself when calling the function. For example, instead
of coding a function call like this:

processData(dat)

to assign the processed data to pos=1, simply make the assignment when
calling the function:

dat <- processData(dat)

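
Putting the pattern together in one runnable piece (the column name 'one' and the levels are just from my earlier example):

```r
# Define the cleaning function once; it returns the modified copy.
processData <- function(dat) {
  dat$one <- factor(dat$one, levels = c(1, 3, 5, 7, 9))
  dat  # the function's value is the processed data frame
}

# Each day, assign the result back at the call site:
dat <- data.frame(one = c(1, 1, 3, 3, 5, 7))
dat <- processData(dat)

is.factor(dat$one)  # TRUE
levels(dat$one)     # "1" "3" "5" "7" "9"
```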

Thanks for being gentle on me, Andy.

Tim

 Liaw, Andy [EMAIL PROTECTED] 4/21/2005 9:57:22 PM 
Tim,

 From: Tim Howard 
 
 Andy, 
   Thank you for the help. Yes, my question really did seem like I
was
 going through a lot of unnecessary steps just to define levels of a
 variable. But that was just for the example. In my 
 application, I bring
 new datasets into R on a daily basis. While the data differs, the
 variables are the same, and the categorical variables have the same
 levels. So I find myself daily applying the same factor and level
 definitions (by cutting and pasting the large chunk of commands from
a
 text file). It really would be simpler to have it wrapped up in a
 function.  That's why I asked the question about putting this into a
 function.
   Upon reading your answer, I thought maybe I could use your example
 and use the super-assignment '<<-' in the function. But, your method
 assigns levels, but does not define the var as a factor 
 (interesting!).
 
   levels(y$one) <- seq(1, 9, by=2)
  y$one
 [1] 1 1 3 3 5 7
 attr(,"levels")
 [1] 1 3 5 7 9
  is.factor(y$one)
 [1] FALSE

Ouch!  'levels<-' is generic, and the default method simply attaches the
levels attribute to the object.  You need to coerce the object into a
factor
explicitly.

 Unfortunately, whenever I try to use '<<-' with the dataframe as the
 variable, I get an error message: 
 
  fncFact <- function(datfra){
 + datfra$one <<- factor(datfra$one, levels=c(1,3,5,7,9))
 + }
  fncFact(y)
 Error in fncFact(y) : Object "datfra" not found

I believe the canonical ways of doing something like this in R is
something
along the line of:

processData <- function(dat) {
dat$f1 <- factor(dat$f1, levels=...)
...  ## any other manipulations you want to do
dat
}

Then when you get new data, you just do:

newData <- processData(newData)

HTH,
Andy

 
 Tim
 
  Liaw, Andy [EMAIL PROTECTED] 4/20/2005 4:03:24 PM 
 Wouldn't it be easier to do this?
 
  levels(y$one) <- seq(1, 9, by=2)
  y$one
 [1] 1 1 3 3 5 7
 attr(,"levels")
 [1] 1 3 5 7 9
 
 Andy
 
  From: Tim Howard
  
  R-help,
After cogitating for a while, I finally figured out how to
define
 a
  data.frame column as factor and assign the levels within a
 function...
  BUT I still need to pass the data.frame and its name 
  separately. I can't
  seem to find any other way to pass the name of the data.frame,
 rather
  than the data.frame itself.  Any suggestions on how to go 
  about it?  Is
  there something like value(object) or name(object) that I can't
 find?
  
  #sample dataframe for this example
  y <- data.frame(
   one=c(1,1,3,3,5,7),
   two=c(2,2,6,6,8,8))
  
   levels(y$one)   # check out levels
  NULL
  
  # the function I've come up with
  fncFact <- function(datfra, datfraNm){
  datfra$one <- factor(datfra$one, levels=c(1,3,5,7,9))
  assign(datfraNm, datfra, pos=1)
  }
  
  fncFact(y, "y")
   levels(y$one)
  [1] 1 3 5 7 9
  
  I suppose only for aesthetics and simplicity, I'd like to have
only
  pass the data.frame and get the same result.
  Thanks in advance,
  Tim Howard
  
  
   version
   _  
  platform i386-pc-mingw32
  arch i386   
  os   mingw32
  system   i386, mingw32  
  status  
  major2  
  minor0.1
  year 2004   
  month11 
  day  15 
  language R
  
  __
  R-help@stat.math.ethz.ch mailing list
  https://stat.ethz.ch/mailman/listinfo/r-help 
  PLEASE do read the posting guide! 
  http://www.R-project.org/posting-guide.html 
  
  
  
 
 
 
 --
 
 Notice:  This e-mail message, together with any attachments,
contains
 information of Merck & Co., Inc. (One Merck Drive, Whitehouse
Station,
 New Jersey, USA 08889), and/or its affiliates (which may be known
 outside the United States as Merck Frosst, Merck Sharp & Dohme or
MSD
 and in Japan, as Banyu) that may be confidential, proprietary
 copyrighted and/or legally privileged. It is intended solely 
 for the use
 of the individual or entity named on this message.  If you are not
the
 intended recipient, and have

RE: [R] Assign factor and levels inside function

2005-04-21 Thread Tim Howard
Andy, 
  Thank you for the help. Yes, my question really did seem like I was
going through a lot of unnecessary steps just to define levels of a
variable. But that was just for the example. In my application, I bring
new datasets into R on a daily basis. While the data differs, the
variables are the same, and the categorical variables have the same
levels. So I find myself daily applying the same factor and level
definitions (by cutting and pasting the large chunk of commands from a
text file). It really would be simpler to have it wrapped up in a
function.  That's why I asked the question about putting this into a
function.
  Upon reading your answer, I thought maybe I could use your example
and use the super-assignment '<<-' in the function. But, your method
assigns levels, but does not define the var as a factor (interesting!).

  levels(y$one) <- seq(1, 9, by=2)
 y$one
[1] 1 1 3 3 5 7
attr(,"levels")
[1] 1 3 5 7 9
 is.factor(y$one)
[1] FALSE
 

Unfortunately, whenever I try to use '<<-' with the dataframe as the
variable, I get an error message: 

 fncFact <- function(datfra){
+ datfra$one <<- factor(datfra$one, levels=c(1,3,5,7,9))
+ }
 fncFact(y)
Error in fncFact(y) : Object "datfra" not found
 


Tim

 Liaw, Andy [EMAIL PROTECTED] 4/20/2005 4:03:24 PM 
Wouldn't it be easier to do this?

 levels(y$one) <- seq(1, 9, by=2)
 y$one
[1] 1 1 3 3 5 7
attr(,"levels")
[1] 1 3 5 7 9

Andy

 From: Tim Howard
 
 R-help,
   After cogitating for a while, I finally figured out how to define
a
 data.frame column as factor and assign the levels within a
function...
 BUT I still need to pass the data.frame and its name 
 separately. I can't
 seem to find any other way to pass the name of the data.frame,
rather
 than the data.frame itself.  Any suggestions on how to go 
 about it?  Is
 there something like value(object) or name(object) that I can't
find?
 
 #sample dataframe for this example
 y <- data.frame(
  one=c(1,1,3,3,5,7),
  two=c(2,2,6,6,8,8))
 
  levels(y$one)   # check out levels
 NULL
 
 # the function I've come up with
 fncFact <- function(datfra, datfraNm){
 datfra$one <- factor(datfra$one, levels=c(1,3,5,7,9))
 assign(datfraNm, datfra, pos=1)
 }
 
 fncFact(y, "y")
  levels(y$one)
 [1] 1 3 5 7 9
 
 I suppose only for aesthetics and simplicity, I'd like to have only
 pass the data.frame and get the same result.
 Thanks in advance,
 Tim Howard
 
 
  version
  _  
 platform i386-pc-mingw32
 arch i386   
 os   mingw32
 system   i386, mingw32  
 status  
 major2  
 minor0.1
 year 2004   
 month11 
 day  15 
 language R
 
 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help 
 PLEASE do read the posting guide! 
 http://www.R-project.org/posting-guide.html 
 
 
 






[R] Assign factor and levels inside function

2005-04-20 Thread Tim Howard
R-help,
  After cogitating for a while, I finally figured out how to define a
data.frame column as factor and assign the levels within a function...
BUT I still need to pass the data.frame and its name separately. I can't
seem to find any other way to pass the name of the data.frame, rather
than the data.frame itself.  Any suggestions on how to go about it?  Is
there something like value(object) or name(object) that I can't find?

#sample dataframe for this example
y <- data.frame(
 one=c(1,1,3,3,5,7),
 two=c(2,2,6,6,8,8))

> levels(y$one)   # check out levels
NULL

# the function I've come up with
fncFact <- function(datfra, datfraNm){
datfra$one <- factor(datfra$one, levels=c(1,3,5,7,9))
assign(datfraNm, datfra, pos=1)
}

fncFact(y, "y")
> levels(y$one)
[1] 1 3 5 7 9

I suppose only for aesthetics and simplicity, I'd like to have only
pass the data.frame and get the same result.
Thanks in advance,
Tim Howard


 version
 _  
platform i386-pc-mingw32
arch i386   
os   mingw32
system   i386, mingw32  
status  
major2  
minor0.1
year 2004   
month11 
day  15 
language R



[R] Encodebuf? yet another memory question

2005-03-09 Thread Tim Howard
Hi all,
   I was surprised to see this memory error:

Error in scan(Cn.minex13, nlines = 2, quiet = TRUE) : 
Could not allocate memory for Encodebuf
> memory.size(max=TRUE)
[1] 256843776
> memory.size(FALSE)
[1] 180144528
> memory.limit()
[1] 2147483648


I don't have any objects named 'Encodebuf' and help and the R site
search turn up no matches for this word.  

As memory.size and memory.limit indicate, I'm way below my limit (but,
I grant that maybe windows won't give R any more memory...).   In my
next run, I'll ask to scan fewer lines, but I thought it worth asking
the group if this 'Encodebuf' error meant anything different than the
standard "can't allocate x bytes" message. (btw, if you are confused
that scanning only 2 lines would max out my memory... I'm scanning two
long lines from 36 different connections so it does add up).
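
For anyone searching later, the chunked-reading pattern looks something like this (the file and sizes here are invented; reading whole lines with nlines sidesteps the partial-line issue discussed elsewhere on this list):

```r
# Build a small two-line example file, then read it back one line at a time.
tf <- tempfile()
write(1:10, file = tf, ncolumns = 5)  # writes "1 2 3 4 5" then "6 7 8 9 10"

cn <- file(tf, open = "r")
chunk1 <- scan(cn, nlines = 1, quiet = TRUE)  # first line only
chunk2 <- scan(cn, nlines = 1, quiet = TRUE)  # picks up where we left off
close(cn)

chunk1  # 1 2 3 4 5
chunk2  # 6 7 8 9 10
```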

 version
 _  
platform i386-pc-mingw32
arch i386   
os   mingw32
system   i386, mingw32  
status  
major2  
minor0.1
year 2004   
month11 
day  15 
language R  

Thanks.

Tim Howard



Re: [R] subset data.frame with value != in all columns

2005-02-07 Thread Tim Howard
Petr,
   Thank you!  Yes, rowSums appears to be even a little bit faster than
unique(which()), and it also maintains the original order. I do want
original order maintained, but I first apply a function to one of my
data.frames (the one without any "-"s ... yes, these do represent nulls,
as someone asked earlier) and rbind these two dataframes back together,
so I need to sort (by rownames) after the rbind (there doesn't seem to
be a sort-by option in rbind). 
   I apologize for not jumping on rowSums earlier; I hadn't caught on
that it was summing counts of occurrences of the search value, not
summing the search value itself.
   Thanks again, this is very instructive and *very* helpful.
humbly,
Tim
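
P.S. For later readers of the archives, a tiny illustration of the counting behaviour (made-up data):

```r
df <- data.frame(a = c("-", "1", "2"),
                 b = c("3", "-", "4"),
                 stringsAsFactors = FALSE)

df == "-"                      # logical matrix, TRUE where the null marker sits
rowSums(df == "-")             # per-row counts of "-": 1 1 0
df[rowSums(df == "-") == 0, ]  # only the rows with no "-" at all
```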

 Petr Pikal [EMAIL PROTECTED] 02/07/05 02:12AM 
Hi Tim

I cannot say much about apply, but the code with unique(which()) 
gives you reordered rows in the case of "-" selection

try

set.seed(1)
in.df <- data.frame(
c1=rnorm(4),
c2=rnorm(4),
c3=rnorm(4),
c4=rnorm(4),
c5=rnorm(4))
in.df[in.df > 3] <- "-"

system.time(e <- in.df[unique(which(in.df == "-", arr.ind = 
TRUE)[,1]), ])
system.time(e1 <- in.df[(rowSums(in.df == "-")) != 0,])

all.equal(e,e1)

So if you mind you need to do reordering.

ooo <- order(as.numeric(rownames(e)))
all.equal(e[ooo,],e1)

Cheers
Petr

On 4 Feb 2005 at 11:17, Tim Howard wrote:

 Because I'll be doing this on big datasets and time is important, I
 thought I'd time all the different approaches that were suggested on
a
 small dataframe. The results were very instructive so I thought I'd
 pass them on. I also discovered that my numeric columns (e.g.
 -.000) weren't found by apply() but were found by which() and
the
 simple replace. Was it apply's fault or something else?
 
 Note how much faster unique(which()) is; wow! Thanks to Marc
Schwartz
 for this blazing solution.
 
  nrow(in.df)
 [1] 40000
 #extract rows with no "-"
  system.time(x <- subset(in.df, apply(in.df, 1,
 function(in.df){all(in.df != "-")})))
 [1] 3.25 0.00 3.25   NA   NA
  system.time(y <- in.df[-unique(which(in.df == "-", arr.ind =
  TRUE)[,
 1]), ])
 [1] 0.17 0.00 0.17   NA   NA
  system.time({is.na(in.df) <- in.df == "-"; z <- na.omit(in.df)})
 [1] 0.25 0.02 0.26   NA   NA
 
  nrow(x);nrow(y);nrow(z)
 [1] 39990
 [1] 39626
 [1] 39626
 
 #extract rows with "-"
  system.time(d <- subset(in.df, apply(in.df, 1,
 function(in.df){any(in.df == "-")})))
 [1] 3.40 0.00 3.45   NA   NA
  system.time(e <- in.df[unique(which(in.df == "-", arr.ind =
 TRUE)[,
 1]), ])
 [1] 0.11 0.00 0.11   NA   NA
 
  nrow(d); nrow(e)
 [1] 10
 [1] 374
 
 Tim Howard
 
 
  Marc Schwartz [EMAIL PROTECTED] 02/03/05 03:24PM 
 On Thu, 2005-02-03 at 14:57 -0500, Tim Howard wrote: 
   ... snip...
  My questions: 
  Is there a cleaner way to extract all rows containing a specified
  value? How can I extract all rows that don't have this value in
any
  col?
  
  #create dummy dataset
  x <- data.frame(
  c1=c(-99,-99,-99,4:10),
  c2=1:10,
  c3=c(1:3,-99,5:10),
  c4=c(10:1),
  c5=c(1:9,-99))
  
 ..snip...
 
 How about this, presuming that your data frame is all numeric:
 
 For rows containing -99:
 
  x[unique(which(x == -99, arr.ind = TRUE)[, 1]), ]
 c1 c2  c3 c4  c5
 1  -99  1   1 10   1
 2  -99  2   2  9   2
 3  -99  3   3  8   3
 44  4 -99  7   4
 10  10 10  10  1 -99
 
 
 For rows not containing -99:
 
  x[-unique(which(x == -99, arr.ind = TRUE)[, 1]), ]
   c1 c2 c3 c4 c5
 5  5  5  5  6  5
 6  6  6  6  5  6
 7  7  7  7  4  7
 8  8  8  8  3  8
 9  9  9  9  2  9
 
 
 What I have done here is to use which(), setting arr.ind = TRUE.
This
 returns the row, column indices for the matches to the boolean
 statement. The first column returned by which() in this case contains the
 row numbers matching the statement, so I take the first column only.
 
 Since it is possible that more than one element in a row can match
the
 boolean, I then use unique() to get the singular row values.
 
 Thus, I can use the returned row indices above to subset the data
 frame.
 
 HTH,
 
 Marc Schwartz
 
 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help 
 PLEASE do read the posting guide!
 http://www.R-project.org/posting-guide.html 

Petr Pikal
[EMAIL PROTECTED]



[R] subset data.frame with value != in all columns

2005-02-03 Thread Tim Howard
I am trying to extract rows from a data.frame based on the
presence/absence of a single value in any column.  I've figured out how
to do get the positive matches, but the remainder (rows without this
value) eludes me.  Mining the help pages and archives brought me,
frustratingly,  very close, as you'll see below. 

My goal: two data frames, one with -99 in at least one column in each
row, one with no occurrences of -99. I want to preserve rownames in
each.

My questions: 
Is there a cleaner way to extract all rows containing a specified
value?
How can I extract all rows that don't have this value in any col?

#create dummy dataset
x <- data.frame(
c1=c(-99,-99,-99,4:10),
c2=1:10,
c3=c(1:3,-99,5:10),
c4=c(10:1),
c5=c(1:9,-99))

#extract data.frame of rows with -99 in them
for(i in 1:ncol(x))
{
y <- subset(x, x[,i]==-99, drop=FALSE);
ifelse(i==1, z <- y, z <- rbind(z,y));
}

#various attempts to get rows not containing -99:

# this attempt was to create, in list, the exclusion formula for each
column.
# Here, I couldn't get subset to recognize list as the correct type.
# e.g. it works if I paste the value of list in the subset command
{
for(i in 1:ncol(x)){
if(i==1)
list <- paste("x[,", i, ",]!=-99", sep="")
else
list <- paste(list, " & ", "x[,", i, ",]!=-99", sep="")
}
y <- subset(x, list, drop=FALSE);
}

# this will do it for one col, but if I index more
# it returns all rows
y <- x[!(x[,3] %in% -99),]

# this also works for one col
y <- x[x[,1]!=-99,]

# but if I index more, I get extra rows of NAs
y <- x[x[,1:5]!=-99,]

Thanks in advance.
Tim Howard

platform i386-pc-mingw32
arch i386   
os   mingw32
system   i386, mingw32  
status  
major2  
minor0.1
year 2004   
month11 
day  15 
language R



Re: [R] subset data.frame with value != in all columns

2005-02-03 Thread Tim Howard
apply, of course, does the trick exceptionally well. Thank you,
everyone, for the help.

tim

 Chuck Cleland [EMAIL PROTECTED] 02/03/05 03:10PM 
How about this?

#extract data.frame of rows with -99 in them

subset(x, apply(x, 1, function(x){any(x == -99)}))

#extract data.frame of rows not containing -99 in them

subset(x, apply(x, 1, function(x){all(x != -99)}))

hope this helps,

Chuck Cleland

Tim Howard wrote:
 I am trying to extract rows from a data.frame based on the
 presence/absence of a single value in any column.  I've figured out
how
 to get the positive matches, but the remainder (rows without this
 value) eludes me.  Mining the help pages and archives brought me,
 frustratingly,  very close, as you'll see below. 
 
 My goal: two data frames, one with -99 in at least one column in
each
 row, one with no occurrences of -99. I want to preserve rownames in
 each.
 
 My questions: 
 Is there a cleaner way to extract all rows containing a specified
 value?
 How can I extract all rows that don't have this value in any col?
 
 #create dummy dataset
 x <- data.frame(
 c1=c(-99,-99,-99,4:10),
 c2=1:10,
 c3=c(1:3,-99,5:10),
 c4=c(10:1),
 c5=c(1:9,-99))
 
 #extract data.frame of rows with -99 in them
 for(i in 1:ncol(x))
 {
 y <- subset(x, x[,i]==-99, drop=FALSE);
 ifelse(i==1, z <- y, z <- rbind(z,y));
 }
 
 #various attempts to get rows not containing -99:
 
 # this attempt was to create, in list, the exclusion formula for
each
 column.
 # Here, I couldn't get subset to recognize list as the correct
type.
 # e.g. it works if I paste the value of list in the subset command
 {
 for(i in 1:ncol(x)){
 if(i==1)
 list <- paste("x[,", i, ",]!=-99", sep="")
 else
 list <- paste(list, " & ", "x[,", i, ",]!=-99", sep="")
 }
 y <- subset(x, list, drop=FALSE);
 }
 
 # this will do it for one col, but if I index more
 # it returns all rows
 y <- x[!(x[,3] %in% -99),]
 
 # this also works for one col
 y <- x[x[,1]!=-99,]
 
 # but if I index more, I get extra rows of NAs
 y <- x[x[,1:5]!=-99,]
 
 Thanks in advance.
 Tim Howard
 
 platform i386-pc-mingw32
 arch i386   
 os   mingw32
 system   i386, mingw32  
 status  
 major2  
 minor0.1
 year 2004   
 month11 
 day  15 
 language R
 
 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help 
 PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html 
 

-- 
Chuck Cleland, Ph.D.
NDRI, Inc.
71 West 23rd Street, 8th floor
New York, NY 10010
tel: (212) 845-4495 (Tu, Th)
tel: (732) 452-1424 (M, W, F)
fax: (917) 438-0894



[R] assign connections automatically

2005-02-01 Thread Tim Howard
Hi all, 
   I am trying to create a function that will open connections to all
files of 
one type within the working directory.
   I've got the function to open the connections, but I am having a
bugger of a 
time trying to get these connections named as objects in the workspace.
 I am at the point where I can do it outside of the function, but not
inside, using assign.  I'm sure I'm missing something obvious about the
inherent properties of functions.

#first six lines just setup for this example
> x <- 1:20
> y <- 20:40
> z <- 40:60
> write(x, file="testx.txt")
> write(y, file="testy.txt")
> write(z, file="testz.txt")

> inConnect <- function(){
+ fn <- dir(pattern="*.txt") # grab only *.txt files
+ fn2 <- gsub('.txt', '', fn) # removes the '.txt' from each string
+ for(i in 1:length(fn))
+ assign((fn2[[i]]), file(fn[i], open="r"))
+ }

> showConnections()  #currently, no connections
  description class mode text isopen can read can write
> inConnect()  # run function
> showConnections()  #the connections are now there
  description class  mode text   isopen   can read can write
3 "testx.txt" "file" "r"  "text" "opened" "yes"    "no" 
4 "testy.txt" "file" "r"  "text" "opened" "yes"    "no" 
5 "testz.txt" "file" "r"  "text" "opened" "yes"    "no" 
> ls()  #but NOT there as objects
 [1] "fn"           "fn2"          "inConnect"    "last.warning"
 [5] "x"            "y"            "z"

> fn <- dir(pattern="*.txt")  #but if I do it manually
> fn2 <- gsub('.txt', '', fn)
> assign((fn2[[3]]), file(fn[3], open="r"))
> ls()  #the connection, testz, appears
 [1] "fn"           "fn2"          "inConnect"    "last.warning"
 [5] "testz"        "x"            "y"            "z"

What am I missing? or is there a better way? 

I am using R 2.0.1 on a Windows2K box.

Thanks so much!
Tim Howard

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] assign connections automatically

2005-02-01 Thread Tim Howard
Thank you for the help!  Both env=.GlobalEnv and pos=1 do the trick.  
I'm embarrassed I didn't glean this from the assign help pages earlier.

?assign suggests that 'env' is there for "back compatibility" so I'm
going with 'pos'.

Tim Howard

 James Holtman 
try:

> inConnect <- function(){
+ fn <- dir(pattern="*.txt") # grab only *.txt files
+ fn2 <- gsub('.txt', '', fn) # removes the '.txt' from each string
+ for(i in 1:length(fn))
+ assign((fn2[[i]]), file(fn[i], open="r"), env = .GlobalEnv)
+ }
__
James HoltmanWhat is the problem you are trying to solve?
Executive Technical Consultant  --  Office of Technology, Convergys
[EMAIL PROTECTED] 
+1 (513) 723-2929

 Prof Brian Ripley [EMAIL PROTECTED] 02/01/05 08:34AM 
You are assigning in the frame of the function, not in the user
workspace.
See ?assign and try pos=1 (if that is what you intended) but it might
well 
be better to return a list of objects.
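
Following up for the archives: Prof. Ripley's suggestion of returning a list might look like this (a sketch; the pattern argument and names are mine):

```r
inConnect <- function(pattern = "\\.txt$") {
  fn <- dir(pattern = pattern)            # files matching the pattern
  cons <- lapply(fn, file, open = "r")    # open a read connection to each
  names(cons) <- gsub("\\.txt$", "", fn)  # name each connection by file stem
  cons  # return the named list instead of assigning into the workspace
}

# usage:
# cons <- inConnect()
# scan(cons$testx, n = 5)
# invisible(lapply(cons, close))
```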



Re: [R] Dropping a digit with scan() on a connection

2005-01-19 Thread Tim Howard
Thank you Dr. Ripley and Christoph Buser for your explanations and
help.

Using sep = " " within scan worked within lines of my file, but then I
gained an NA record when wrapping from one line to the next (because the
linebreak character is no longer recognized as a separator?).  So, I'll
continue by ensuring each group I read ends at the end of a line (as
scan was designed), and by using scan without the sep option.
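
For anyone who, like me, had readLines working: one way to turn a line of text into numbers is to scan a textConnection (a small sketch with a made-up line):

```r
line <- "235 335 535 735"          # one line as returned by readLines()
tc <- textConnection(line)
vals <- scan(tc, quiet = TRUE)     # parse the whitespace-separated numbers
close(tc)

vals  # 235 335 535 735
```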

FYI, Here's how the NA showed up, each line is 800 numbers long:

> test4 <- scan(cn.test, n=1600, sep = " ")
> test5 <- scan(cn.test, n=1600)
> test4[797:803]
[1]  81.00000  81.08746  81.89484  82.00000        NA 580.09030
576.90300
> test5[797:803]
[1]  81.01944  81.62060  81.96495  82.00000  82.00000 567.91840
563.10470

Thanks again.
Tim


 Prof Brian Ripley [EMAIL PROTECTED] 01/19/05 03:42AM 
This is because scan() has a private pushback.
Either:

1) Read the file a whole line at a time: I cannot see why you need to
do 
so here nor in your sketched application.

or

2) Use an explicit separator, e.g. " " in your example.

scan() is not designed to read parts of lines of a file,


On Tue, 18 Jan 2005, Tim Howard wrote:

 R gurus,

 My use of scan() seems to be dropping the first digit of sequential
 scans on a connection. It looks like it happens only within a line:

 cat("TITLE extra line", "235 335 535 735", "115 135 175",
 file="ex.data", sep="\n")
 cn.x <- file("ex.data", open="r")
 a <- scan(cn.x, skip=1, n=2)
 Read 2 items
 a
 [1] 235 335
 b <- scan(cn.x, n=2)
 Read 2 items
 b
 [1]  35 735
 c <- scan(cn.x, n=2)
 Read 2 items
 c
 [1] 115 135
 d <- scan(cn.x, n=1)
 Read 1 items
 d
 [1] 75


 Note in b, I should get 535, not 35, as the first value. In d, I should
 get 175.  Does anyone know how to get these digits?

 The reason I'm not scanning the entire file at once is that my real
 dataset is much larger than a gigabyte and I'll need to pull only
 portions of the file in at once. I got readLines to work, but then I
 have to figure out how to convert each entire line into a data.frame.
 Scan seems a lot cleaner, with the exception of the funny
 character-dropping issue.

 Thanks so much!
 Tim Howard

 __
 R-help@stat.math.ethz.ch mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help 
 PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html 


-- 
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



[R] Dropping a digit with scan() on a connection

2005-01-18 Thread Tim Howard
R gurus,

My use of scan() seems to be dropping the first digit of sequential
scans on a connection. It looks like it happens only within a line:

> cat("TITLE extra line", "235 335 535 735", "115 135 175",
+ file="ex.data", sep="\n")
> cn.x <- file("ex.data", open="r")
> a <- scan(cn.x, skip=1, n=2)
Read 2 items
> a
[1] 235 335
> b <- scan(cn.x, n=2)
Read 2 items
> b
[1]  35 735
> c <- scan(cn.x, n=2)
Read 2 items
> c
[1] 115 135
> d <- scan(cn.x, n=1)
Read 1 items
> d
[1] 75

Note in b, I should get 535, not 35, as the first value. In d, I should
get 175.  Does anyone know how to get these digits?

The reason I'm not scanning the entire file at once is that my real
dataset is much larger than a gigabyte and I'll need to pull only
portions of the file in at once. I got readLines to work, but then I
have to figure out how to convert each entire line into a data.frame.
Scan seems a lot cleaner, with the exception of the funny
character-dropping issue.

Thanks so much!
Tim Howard



[R] predict.randomForest

2004-12-10 Thread Tim Howard
I have a data.frame with a series of variables tagged to a binary
response ('present'/'absent').  I am trying to use randomForest to
predict present/absent in a second dataset. After a lot of fiddling
(using two data frames, making sure data types are the same, lots of
testing with data that works, such as data(iris)), I've settled on
combining all my data into one data.frame, then subset()'ing the known
present/absent portion of the data.frame for the randomForest run, and
then using the other subset for the predict.  This worked with test
data, but when I try it on a larger dataset (63,000 rows to predict), I
get this error:

Error in predict.randomForest(stsw.rf, stsw.out, type = "prob") : 
        Type of predictors in new data do not match that of the
        training data.

This is the error I was getting earlier, but I thought I had solved it
by joining into one data.frame and subsetting.  The values for each
variable in the 'unknown' data (that which I want to predict) fall
within (are bound by) the values in the 'known' data.  
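As a concrete sketch of the combine-then-subset approach, here is roughly what I'm doing, using iris as a stand-in for my real data (all names here are illustrative only):

```r
library(randomForest)

# Toy stand-in for the real GIS data: use iris, with a made-up
# present/absent label and some rows treated as unlabelled.
alldata <- iris
alldata$status  <- ifelse(alldata$Species == "setosa", "present", "absent")
alldata$Species <- NULL
alldata$status[101:150] <- NA      # pretend these rows are the unknowns

known   <- alldata[!is.na(alldata$status), ]
unknown <- alldata[ is.na(alldata$status), ]

# Fitting and predicting on subsets of one data.frame keeps column
# types and factor levels identical, which predict.randomForest needs.
rf.fit  <- randomForest(factor(status) ~ ., data = known)
rf.pred <- predict(rf.fit,
                   unknown[, names(unknown) != "status"],
                   type = "prob")
```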

Does this error message have more than one meaning?

Any suggestions on how to work through this?

I am using R 2.0.1 and randomForest 4.4-2 (2004-11-02). I'm a new user
to R, but doing my best to learn as much as I can... if I'm obviously
clueless, please forgive me!

Any help would be greatly appreciated,

Thanks in advance!
Tim Howard


More background for anyone interested:
  CART (as well as many other statistical techniques) has been used for
a while to predict plant and animal distributions across a landscape.
You feed it data about places where you know the plant to occur and not
occur, and CART provides you with a tree with which you can then model
the potential distribution across your region (state, country, etc.)
using GIS.
  I've heard good things about randomForest and would like to try to do
the same thing. My biggest stumbling block is that I can't (obviously,
once I realized it) get a single 'best tree' from randomForest with
which to apply my GIS models.  Or, is there any way to extract a formula
from randomForest, similar to a CART or rpart tree, and apply it to a
dataset outside of R?  The only solution I've been able to come up with
is to bring ALL of the environmental variables into R, have randomForest
do the prediction, and then get that prediction back into GIS. Thus my
problem as I stated it above. I'm worried because my datasets are going
to be huge (100s of millions of records) when we really get going.
Should I be worried?
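One idea I'm considering for those huge datasets is to predict in chunks and append each block of results to a file that GIS can join back in. A rough sketch (the file names, tab delimiter, and chunk size are all invented, and rf.fit stands for an already-fitted forest):

```r
library(randomForest)

# Sketch: predict over a huge attribute table in chunks, appending the
# results to a text file. 'rf.fit' is an existing randomForest object;
# file names and chunk size are placeholders.
chunk.size <- 10000
cn  <- file("env_layers.txt", open = "r")
hdr <- strsplit(readLines(cn, n = 1), "\t")[[1]]    # column names
repeat {
  lns <- readLines(cn, n = chunk.size)
  if (length(lns) == 0) break
  blk <- read.table(textConnection(lns), col.names = hdr, sep = "\t")
  p   <- predict(rf.fit, blk, type = "prob")
  write.table(p, "rf_predictions.txt", append = TRUE,
              col.names = FALSE, row.names = FALSE, sep = "\t")
}
close(cn)
```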

thanks,  Tim
