[R] opinions please: text editors and reporting/Sweave?
dear all - I currently use Tinn-R as my text editor to work with code that I submit to R, with some output dumped to text files and some images dumped to pdf (system: Windows 2K and XP, R 2.4.1 and R 2.5). We are using R for overnight runs to create large output data files for GIS, but I then need simple output reports of the analysis results for each separate data set. Thus, I create many reports of the same style, just based on different input data.

I am recognizing that I need a better reporting system, so that I can create clean reports for each separate R run. This obviously means using Sweave and some implementation of LaTeX, both of which are new to me. I've installed MiKTeX and successfully completed a demo or two for creating pdfs from raw LaTeX. It appears that if I want to ease my entry into the world of LaTeX, I might need to switch editors to something like Emacs (I read somewhere that Emacs helps with the TeX markup?). After quite a while wallowing at the Emacs site, I am finding that ESS is well integrated with R and might be the way to go. Aaaagh... I'm in way over my head!

My questions: What, in your opinion, is the simplest way to integrate text and graphics into a single report such as a pdf file? If the answer is LaTeX and Sweave, is it practical to keep using a text editor such as Tinn-R, or would you strongly recommend I leave Tinn-R behind and move to an editor that has more LaTeX support? In reading over Friedrich Leisch's Sweave User Manual (v 1.6.0) I am beginning to think I can do everything I need with my simple editor. Before spending many hours going down that path, I thought it prudent to ask the R community. It is likely I am misunderstanding some of the process here, and any clarifications are welcome. Thank you in advance for any thoughts.
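For anyone finding this in the archives, a minimal .Rnw file makes the idea concrete. This is only a sketch; the file names and chunk contents are invented for illustration. Saved as, say, report.Rnw, running Sweave("report.Rnw") inside R produces report.tex, which pdflatex turns into a pdf:

```latex
\documentclass{article}
\begin{document}
\section*{Results for one input data set}

% R code chunks sit between <<>>= and @; Sweave replaces them with their output
<<summary, echo=FALSE>>=
dat <- read.csv("input_one.csv")   # invented input file name
summary(dat)
@

% fig=TRUE makes Sweave capture the plot and insert it as a figure
<<map, fig=TRUE, echo=FALSE>>=
plot(dat)
@

This run used \Sexpr{nrow(dat)} records.
\end{document}
```

Each new data set then needs only the read.csv() line changed (or a small driver script), and the whole report regenerates with the new numbers and figures.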
Tim Howard __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] ROC optimal threshold
Jose - I've struggled a bit with the same question, said another way: how do you find the value on a ROC curve that minimizes false positives while maximizing true positives? Here's something I've come up with. I'd be curious to hear from the list whether anyone thinks this code might get stuck in local minima, or if it does find the global minimum each time (I think it's ok).

From your ROC object you need to grab the sensitivity (= true positive rate), the specificity (= 1 - false positive rate), and the cutoff levels. Then find the value that minimizes abs(sensitivity - specificity), or sqrt((1-sens)^2 + (1-spec)^2), as follows:

absMin <- extract[which.min(abs(extract$sens - extract$spec)), ]
sqrtMin <- extract[which.min(sqrt((1 - extract$sens)^2 + (1 - extract$spec)^2)), ]

In this example, 'extract' is a dataframe containing three columns: extract$sens = sensitivity values, extract$spec = specificity values, extract$votes = cutoff values. Each command subsets the dataframe to the single row containing the desired cutoff along with its associated sens and spec values. Most of the time these two answers (abs or sqrt) are the same; sometimes they differ quite a bit.

I do not see this application of ROC curves very often. A question for those much more knowledgeable than I: is there a problem with using ROC curves in this manner?

Tim Howard

Date: Fri, 31 Mar 2006 11:58:14 +0200 From: Anadon Herrera, Jose Daniel [EMAIL PROTECTED] Subject: [R] ROC optimal threshold To: 'r-help@stat.math.ethz.ch' Message-ID: [EMAIL PROTECTED] Content-Type: text/plain; charset=iso-8859-1

hello, I am using the ROC package to evaluate predictive models. I have successfully plotted the ROC curve; however, is there any way to obtain the value of the operating point = optimal threshold value (i.e. the nearest point of the curve to the top-left corner of the axes)?
thank you very much, jose daniel anadon area de ecologia universidad miguel hernandez España
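Tim's two criteria can be reproduced on a self-contained toy table. The column names follow his post; the numbers are invented for illustration only:

```r
# Toy ROC table; column names follow the post, the numbers are invented
extract <- data.frame(
  votes = c(0.2, 0.4, 0.5, 0.6, 0.8),    # cutoff values
  sens  = c(0.95, 0.90, 0.80, 0.60, 0.40),
  spec  = c(0.40, 0.70, 0.85, 0.90, 0.95)
)

# Row where |sensitivity - specificity| is smallest
absMin  <- extract[which.min(abs(extract$sens - extract$spec)), ]

# Row closest to the top-left corner (sens = 1, spec = 1) of the ROC plot
sqrtMin <- extract[which.min(sqrt((1 - extract$sens)^2 +
                                  (1 - extract$spec)^2)), ]

absMin$votes    # 0.5 in this toy table
sqrtMin$votes   # also 0.5 here; on real data the two criteria can disagree
```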
Re: [R] ROC optimal threshold
Dr. Harrell, Thank you for your response. I had noted, and appreciate, your perspective on ROC in past listserv entries and am glad to have an opportunity to delve a little deeper. I (and, I think, Jose Daniel Anadon, the original poster of this question) have a predictive model for the presence of, say, animal_X. This is a spatial model that can be represented on maps and is based on known locations where animal_X is present and (usually) known locations where animal_X is absent. Output of the analysis (using any number of analytic routines, including logit, randomForest, maximum entropy, Mahalanobis distance...) is a full map where every spot on the map has a probability that that particular location has the appropriate habitat for animal_X. This output can be visualized by just using a color scale (perhaps blue for low probability to red for high probability), BUT there are times when we want to apply a cutoff to this probability output and create a product where we can say either "yes, animal_X habitat is predicted here" or "no, animal_X habitat is not predicted here". Note this is the final analytic step. There are no later analysis steps, and so (possibly) adjustments for multiple comparisons do not come into play. Indeed, it seems that using a standard process to find a threshold reduces the arbitrariness of the probability color scale (at what probability do we set 'red'? at what probability do we set 'blue'?). Are there alternative approaches that reduce the drawbacks you allude to? How would you turn a surface of probabilities into a binary surface of yes-no? Thank you for your time.
Sincerely, Tim Howard Ecologist New York Natural Heritage Program

Frank E Harrell Jr [EMAIL PROTECTED] 03/31/06 11:20 AM Choosing cutoffs is fraught with difficulties, arbitrariness, inefficiency, and the necessity to use a complex adjustment for multiple comparisons in later analysis steps, unless the dataset used to generate the cutoff was so large as to be considered infinite. -- Frank E Harrell Jr, Professor and Chair, School of Medicine Department of Biostatistics, Vanderbilt University
[R] how to use the randomForest and rpart function?
Michael - I recall reading something Breiman wrote that said, essentially, don't skimp on the number of trees: they are cheap to build and more of them makes for a better model. Also, look at your error rates (using plot), and make sure you run enough trees that the error settles down. You'll likely be building 1000 or so trees. Tim

Hi Andy, Does randomForest have cross-validation built in to decide the best number of trees, or do I have to find the best number manually by myself? Thanks a lot! Michael.

On 3/7/06, Liaw, Andy [EMAIL PROTECTED] wrote: Yes, I do know. That's why I pointed you to the reference linked from the help page. BTW, there's also an R News article describing the initial version of the package. Have you perused that? Andy
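The "watch the error settle down" check above can be sketched as follows, using invented data standing in for a real training set (the randomForest package's plot method draws OOB error against the number of trees):

```r
# Sketch only: 'dat', 'resp', 'x1', 'x2' are invented stand-ins
library(randomForest)
set.seed(1)
dat <- data.frame(resp = factor(rep(c("present", "absent"), 50)),
                  x1 = rnorm(100), x2 = rnorm(100))

rf <- randomForest(resp ~ ., data = dat, ntree = 1000)

plot(rf)            # OOB error versus number of trees; look for where it flattens
tail(rf$err.rate)   # overall and per-class OOB error, one row per tree
```

If the curves are still drifting at the right edge of the plot, increase ntree and rerun.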
RE: [R] Assign factor and levels inside function
Aha! You've just opened the door to another level for this blundering R user. I even went back to my well-used copy of An Introduction to R to see where I missed this standard approach for processing new data. Nothing explicit, but it is certainly alluded to in many of the function examples. I don't know why I was stuck in that rut. I'm sure 99.9% of you on this list know this, but to be clear for anyone searching these archives later: don't bother to ask your function to make assignments to pos=1 (the global environment); just do the assignment yourself when calling the function. For example, instead of coding a function call like this:

processData(dat)

to assign the processed data to pos=1, simply make the assignment when calling the function:

dat <- processData(dat)

Thanks for being gentle on me, Andy. Tim

Liaw, Andy [EMAIL PROTECTED] 4/21/2005 9:57:22 PM Tim,

From: Tim Howard
Andy, Thank you for the help. Yes, my question really did seem like I was going through a lot of unnecessary steps just to define levels of a variable. But that was just for the example. In my application, I bring new datasets into R on a daily basis. While the data differ, the variables are the same, and the categorical variables have the same levels. So I find myself daily applying the same factor and level definitions (by cutting and pasting a large chunk of commands from a text file). It really would be simpler to have it wrapped up in a function. That's why I asked the question about putting this into a function. Upon reading your answer, I thought maybe I could use your example with the super-assignment '<<-' in the function. But your method assigns levels without defining the variable as a factor (interesting!):

levels(y$one) <- seq(1, 9, by=2)
y$one
[1] 1 1 3 3 5 7
attr(,"levels")
[1] 1 3 5 7 9
is.factor(y$one)
[1] FALSE

Ouch! `levels<-` is generic, and the default method simply attaches the levels attribute to the object. You need to coerce the object into a factor explicitly.
Unfortunately, whenever I try to use <<- with the dataframe as the variable, I get an error message:

fncFact <- function(datfra){
+   datfra$one <<- factor(datfra$one, levels=c(1,3,5,7,9))
+ }
fncFact(y)
Error in fncFact(y) : Object "datfra" not found

I believe the canonical way of doing something like this in R is along the lines of:

processData <- function(dat) {
  dat$f1 <- factor(dat$f1, levels=...)
  ...  ## any other manipulations you want to do
  dat
}

Then when you get new data, you just do:

newData <- processData(newData)

HTH, Andy

Tim Liaw, Andy [EMAIL PROTECTED] 4/20/2005 4:03:24 PM Wouldn't it be easier to do this?

levels(y$one) <- seq(1, 9, by=2)
y$one
[1] 1 1 3 3 5 7
attr(,"levels")
[1] 1 3 5 7 9

Andy

From: Tim Howard
R-help, After cogitating for a while, I finally figured out how to define a data.frame column as factor and assign the levels within a function... BUT I still need to pass the data.frame and its name separately. I can't seem to find any other way to pass the name of the data.frame, rather than the data.frame itself. Any suggestions on how to go about it? Is there something like value(object) or name(object) that I can't find?

#sample dataframe for this example
y <- data.frame(one=c(1,1,3,3,5,7), two=c(2,2,6,6,8,8))
levels(y$one)  # check out levels
NULL
# the function I've come up with
fncFact <- function(datfra, datfraNm){
  datfra$one <- factor(datfra$one, levels=c(1,3,5,7,9))
  assign(datfraNm, datfra, pos=1)
}
fncFact(y, "y")
levels(y$one)
[1] 1 3 5 7 9

I suppose only for aesthetics and simplicity, I'd like to pass only the data.frame and get the same result. Thanks in advance, Tim Howard

version
         _
platform i386-pc-mingw32
arch     i386
os       mingw32
system   i386, mingw32
status
major    2
minor    0.1
year     2004
month    11
day      15
language R
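Andy's processData pattern, written out as a complete runnable sketch using the thread's example column:

```r
# Recode inside a function, then assign the return value at the call site;
# 'one' is the example column from the thread
processData <- function(dat) {
  dat$one <- factor(dat$one, levels = c(1, 3, 5, 7, 9))
  ## ...any other daily manipulations...
  dat
}

y <- data.frame(one = c(1, 1, 3, 3, 5, 7), two = c(2, 2, 6, 6, 8, 8))
y <- processData(y)   # no assign(..., pos=1) needed

is.factor(y$one)   # TRUE
levels(y$one)      # "1" "3" "5" "7" "9"
```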
RE: [R] Assign factor and levels inside function
Andy, Thank you for the help. Yes, my question really did seem like I was going through a lot of unnecessary steps just to define levels of a variable. But that was just for the example. In my application, I bring new datasets into R on a daily basis. While the data differ, the variables are the same, and the categorical variables have the same levels. So I find myself daily applying the same factor and level definitions (by cutting and pasting a large chunk of commands from a text file). It really would be simpler to have it wrapped up in a function. That's why I asked the question about putting this into a function. Upon reading your answer, I thought maybe I could use your example with the super-assignment '<<-' in the function. But your method assigns levels without defining the variable as a factor (interesting!):

levels(y$one) <- seq(1, 9, by=2)
y$one
[1] 1 1 3 3 5 7
attr(,"levels")
[1] 1 3 5 7 9
is.factor(y$one)
[1] FALSE

Unfortunately, whenever I try to use <<- with the dataframe as the variable, I get an error message:

fncFact <- function(datfra){
+   datfra$one <<- factor(datfra$one, levels=c(1,3,5,7,9))
+ }
fncFact(y)
Error in fncFact(y) : Object "datfra" not found

Tim

Liaw, Andy [EMAIL PROTECTED] 4/20/2005 4:03:24 PM Wouldn't it be easier to do this?

levels(y$one) <- seq(1, 9, by=2)
y$one
[1] 1 1 3 3 5 7
attr(,"levels")
[1] 1 3 5 7 9

Andy

From: Tim Howard
R-help, After cogitating for a while, I finally figured out how to define a data.frame column as factor and assign the levels within a function... BUT I still need to pass the data.frame and its name separately. I can't seem to find any other way to pass the name of the data.frame, rather than the data.frame itself. Any suggestions on how to go about it? Is there something like value(object) or name(object) that I can't find?

#sample dataframe for this example
y <- data.frame(one=c(1,1,3,3,5,7), two=c(2,2,6,6,8,8))
levels(y$one)  # check out levels
NULL
# the function I've come up with
fncFact <- function(datfra, datfraNm){
  datfra$one <- factor(datfra$one, levels=c(1,3,5,7,9))
  assign(datfraNm, datfra, pos=1)
}
fncFact(y, "y")
levels(y$one)
[1] 1 3 5 7 9

I suppose only for aesthetics and simplicity, I'd like to pass only the data.frame and get the same result. Thanks in advance, Tim Howard

version
         _
platform i386-pc-mingw32
arch     i386
os       mingw32
system   i386, mingw32
status
major    2
minor    0.1
year     2004
month    11
day      15
language R
[R] Assign factor and levels inside function
R-help, After cogitating for a while, I finally figured out how to define a data.frame column as factor and assign the levels within a function... BUT I still need to pass the data.frame and its name separately. I can't seem to find any other way to pass the name of the data.frame, rather than the data.frame itself. Any suggestions on how to go about it? Is there something like value(object) or name(object) that I can't find?

#sample dataframe for this example
y <- data.frame(one=c(1,1,3,3,5,7), two=c(2,2,6,6,8,8))
levels(y$one)  # check out levels
NULL
# the function I've come up with
fncFact <- function(datfra, datfraNm){
  datfra$one <- factor(datfra$one, levels=c(1,3,5,7,9))
  assign(datfraNm, datfra, pos=1)
}
fncFact(y, "y")
levels(y$one)
[1] 1 3 5 7 9

I suppose only for aesthetics and simplicity, I'd like to pass only the data.frame and get the same result. Thanks in advance, Tim Howard

version
         _
platform i386-pc-mingw32
arch     i386
os       mingw32
system   i386, mingw32
status
major    2
minor    0.1
year     2004
month    11
day      15
language R
[R] Encodebuf? yet another memory question
Hi all, I was surprised to see this memory error:

Error in scan(Cn.minex13, nlines = 2, quiet = TRUE) :
        Could not allocate memory for Encodebuf
memory.size(max=TRUE)
[1] 256843776
memory.size(FALSE)
[1] 180144528
memory.limit()
[1] 2147483648

I don't have any objects named 'Encodebuf', and help and the R site search turn up no matches for this word. As memory.size and memory.limit indicate, I'm way below my limit (but I grant that maybe Windows won't give R any more memory...). In my next run I'll ask scan to read fewer lines, but I thought it worth asking the group whether this 'Encodebuf' error means anything different from the standard "cannot allocate x bytes" message. (btw, if you are confused that scanning only 2 lines would max out my memory: I'm scanning two long lines from each of 36 different connections, so it does add up.)

version
         _
platform i386-pc-mingw32
arch     i386
os       mingw32
system   i386, mingw32
status
major    2
minor    0.1
year     2004
month    11
day      15
language R

Thanks. Tim Howard
Re: [R] subset data.frame with value != in all columns
Petr, Thank you! Yes, rowSums appears to be even a little bit faster than unique(which()), and it also maintains the original order. I do want the original order maintained, but I first apply a function to one of my data.frames (the one without any "-"s ... yes, these do represent nulls, as someone asked earlier) and rbind the two dataframes back together, so I need to sort (by rownames) after the rbind (there doesn't seem to be a sort-by option in rbind). I apologize for not jumping on rowSums earlier; I hadn't caught on that it was summing counts of occurrences of the search value, not summing the search value itself. Thanks again, this is very instructive and *very* helpful. humbly, Tim

Petr Pikal [EMAIL PROTECTED] 02/07/05 02:12AM Hi Tim. I can not say much about apply, but the code with unique(which()) gives you reordered rows in case of "-" selection. Try:

set.seed(1)
in.df <- data.frame(c1=rnorm(4), c2=rnorm(4), c3=rnorm(4), c4=rnorm(4), c5=rnorm(4))
in.df[in.df > 3] <- "-"
system.time(e <- in.df[unique(which(in.df == "-", arr.ind = TRUE)[,1]), ])
system.time(e1 <- in.df[rowSums(in.df == "-") != 0, ])
all.equal(e, e1)

So if you mind, you need to do reordering:

ooo <- order(as.numeric(rownames(e)))
all.equal(e[ooo,], e1)

Cheers Petr

On 4 Feb 2005 at 11:17, Tim Howard wrote: Because I'll be doing this on big datasets and time is important, I thought I'd time all the different approaches that were suggested, on a small dataframe. The results were very instructive, so I thought I'd pass them on. I also discovered that my numeric columns (e.g. -.000) weren't found by apply() but were found by which() and the simple replace. Was it apply's fault or something else? Note how much faster unique(which()) is; wow! Thanks to Marc Schwartz for this blazing solution.
nrow(in.df)
[1] 40000

#extract rows with no "-"
system.time(x <- subset(in.df, apply(in.df, 1, function(in.df){all(in.df != "-")})))
[1] 3.25 0.00 3.25 NA NA
system.time(y <- in.df[-unique(which(in.df == "-", arr.ind = TRUE)[, 1]), ])
[1] 0.17 0.00 0.17 NA NA
system.time({is.na(in.df) <- in.df == "-"; z <- na.omit(in.df)})
[1] 0.25 0.02 0.26 NA NA
nrow(x); nrow(y); nrow(z)
[1] 39990
[1] 39626
[1] 39626

#extract rows with "-"
system.time(d <- subset(in.df, apply(in.df, 1, function(in.df){any(in.df == "-")})))
[1] 3.40 0.00 3.45 NA NA
system.time(e <- in.df[unique(which(in.df == "-", arr.ind = TRUE)[, 1]), ])
[1] 0.11 0.00 0.11 NA NA
nrow(d); nrow(e)
[1] 10
[1] 374

Tim Howard

Marc Schwartz [EMAIL PROTECTED] 02/03/05 03:24PM On Thu, 2005-02-03 at 14:57 -0500, Tim Howard wrote: ... snip... My questions: Is there a cleaner way to extract all rows containing a specified value? How can I extract all rows that don't have this value in any col?

#create dummy dataset
x <- data.frame(c1=c(-99,-99,-99,4:10), c2=1:10, c3=c(1:3,-99,5:10), c4=c(10:1), c5=c(1:9,-99))

..snip... How about this, presuming that your data frame is all numeric? For rows containing -99:

x[unique(which(x == -99, arr.ind = TRUE)[, 1]), ]
    c1 c2  c3 c4  c5
1  -99  1   1 10   1
2  -99  2   2  9   2
3  -99  3   3  8   3
4    4  4 -99  7   4
10  10 10  10  1 -99

For rows not containing -99:

x[-unique(which(x == -99, arr.ind = TRUE)[, 1]), ]
  c1 c2 c3 c4 c5
5  5  5  5  6  5
6  6  6  6  5  6
7  7  7  7  4  7
8  8  8  8  3  8
9  9  9  9  2  9

What I have done here is to use which(), setting arr.ind = TRUE. This returns the row and column indices for the matches to the boolean statement. The first column returned by which() in this case holds the row numbers matching the statement, so I take the first column only. Since more than one element in a row can match the boolean, I then use unique() to get the singular row values. Thus, I can use the returned row indices to subset the data frame.
HTH, Marc Schwartz

Petr Pikal [EMAIL PROTECTED]
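The two idioms from this thread can be compared side by side on a tiny invented frame ("-" marks a null cell as in the posts above):

```r
# Tiny invented frame with some "-" cells planted in it
set.seed(1)
in.df <- data.frame(c1 = rnorm(4), c2 = rnorm(4), c3 = rnorm(4))
in.df[in.df > 1] <- "-"   # coerces the affected columns to character

# unique(which()): rows containing "-", in first-hit order
e  <- in.df[unique(which(in.df == "-", arr.ind = TRUE)[, 1]), ]

# rowSums(): counts "-" cells per row; keeps the original row order
e1 <- in.df[rowSums(in.df == "-") != 0, ]

# same content once e is put back into row order
all.equal(e[order(as.numeric(rownames(e))), ], e1)   # TRUE
```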
[R] subset data.frame with value != in all columns
I am trying to extract rows from a data.frame based on the presence/absence of a single value in any column. I've figured out how to get the positive matches, but the remainder (rows without this value) eludes me. Mining the help pages and archives brought me, frustratingly, very close, as you'll see below. My goal: two data frames, one with -99 in at least one column of each row, one with no occurrences of -99. I want to preserve rownames in each. My questions: Is there a cleaner way to extract all rows containing a specified value? How can I extract all rows that don't have this value in any col?

#create dummy dataset
x <- data.frame(c1=c(-99,-99,-99,4:10), c2=1:10, c3=c(1:3,-99,5:10), c4=c(10:1), c5=c(1:9,-99))

#extract data.frame of rows with -99 in them
for(i in 1:ncol(x)) {
  y <- subset(x, x[,i]==-99, drop=FALSE)
  ifelse(i==1, z <- y, z <- rbind(z, y))
}

#various attempts to get rows not containing -99:
# this attempt was to create, in 'list', the exclusion formula for each column.
# Here, I couldn't get subset to recognize 'list' as the correct type,
# e.g. it works if I paste the value of 'list' into the subset command
{
for(i in 1:ncol(x)){
  if(i==1) list <- paste("x[,", i, "]!=-99", sep="") else
           list <- paste(list, " & ", "x[,", i, "]!=-99", sep="")
}
y <- subset(x, list, drop=FALSE)
}

# this will do it for one col, but if I index more
# it returns all rows
y <- x[!(x[,3] %in% -99),]
# this also works for one col
y <- x[x[,1]!=-99,]
# but if I index more, I get extra rows of NAs
y <- x[x[,1:5]!=-99,]

Thanks in advance. Tim Howard

version
         _
platform i386-pc-mingw32
arch     i386
os       mingw32
system   i386, mingw32
status
major    2
minor    0.1
year     2004
month    11
day      15
language R
Re: [R] subset data.frame with value != in all columns
apply, of course, does the trick exceptionally well. Thank you, everyone, for the help. tim

Chuck Cleland [EMAIL PROTECTED] 02/03/05 03:10PM How about this?

#extract data.frame of rows with -99 in them
subset(x, apply(x, 1, function(x){any(x == -99)}))

#extract data.frame of rows not containing -99 in them
subset(x, apply(x, 1, function(x){all(x != -99)}))

hope this helps, Chuck Cleland

Tim Howard wrote: I am trying to extract rows from a data.frame based on the presence/absence of a single value in any column. I've figured out how to get the positive matches, but the remainder (rows without this value) eludes me. Mining the help pages and archives brought me, frustratingly, very close, as you'll see below. My goal: two data frames, one with -99 in at least one column of each row, one with no occurrences of -99. I want to preserve rownames in each. My questions: Is there a cleaner way to extract all rows containing a specified value? How can I extract all rows that don't have this value in any col?

#create dummy dataset
x <- data.frame(c1=c(-99,-99,-99,4:10), c2=1:10, c3=c(1:3,-99,5:10), c4=c(10:1), c5=c(1:9,-99))

#extract data.frame of rows with -99 in them
for(i in 1:ncol(x)) {
  y <- subset(x, x[,i]==-99, drop=FALSE)
  ifelse(i==1, z <- y, z <- rbind(z, y))
}

#various attempts to get rows not containing -99:
# this attempt was to create, in 'list', the exclusion formula for each column.
# Here, I couldn't get subset to recognize 'list' as the correct type,
# e.g. it works if I paste the value of 'list' into the subset command
{
for(i in 1:ncol(x)){
  if(i==1) list <- paste("x[,", i, "]!=-99", sep="") else
           list <- paste(list, " & ", "x[,", i, "]!=-99", sep="")
}
y <- subset(x, list, drop=FALSE)
}

# this will do it for one col, but if I index more
# it returns all rows
y <- x[!(x[,3] %in% -99),]
# this also works for one col
y <- x[x[,1]!=-99,]
# but if I index more, I get extra rows of NAs
y <- x[x[,1:5]!=-99,]

Thanks in advance. Tim Howard

version
         _
platform i386-pc-mingw32
arch     i386
os       mingw32
system   i386, mingw32
status
major    2
minor    0.1
year     2004
month    11
day      15
language R

-- Chuck Cleland, Ph.D. NDRI, Inc. 71 West 23rd Street, 8th floor, New York, NY 10010 tel: (212) 845-4495 (Tu, Th) tel: (732) 452-1424 (M, W, F) fax: (917) 438-0894
[R] assign connections automatically
Hi all, I am trying to create a function that will open connections to all files of one type within the working directory. I've got the function to open the connections, but I am having a bugger of a time trying to get these connections named as objects in the workspace. I am at the point where I can do it outside of the function, but not inside, using assign. I'm sure I'm missing something obvious about the inherent properties of functions.

#first six lines are just setup for this example
x <- 1:20
y <- 20:40
z <- 40:60
write(x, file="testx.txt")
write(y, file="testy.txt")
write(z, file="testz.txt")

inConnect <- function(){
+   fn <- dir(pattern="*.txt")    # grab only *.txt files
+   fn2 <- gsub('.txt', '', fn)   # removes the '.txt' from each string
+   for(i in 1:length(fn))
+     assign(fn2[[i]], file(fn[i], open="r"))
+ }

showConnections()  # currently, no connections
     description class mode text isopen can read can write

inConnect()  # run function
showConnections()  # the connections are now there
  description class  mode text   isopen   can read can write
3 "testx.txt" "file" "r"  "text" "opened" "yes"    "no"
4 "testy.txt" "file" "r"  "text" "opened" "yes"    "no"
5 "testz.txt" "file" "r"  "text" "opened" "yes"    "no"

ls()  # but NOT there as objects
[1] "fn"           "fn2"          "inConnect"    "last.warning"
[5] "x"            "y"            "z"

fn <- dir(pattern="*.txt")   # but if I do it manually
fn2 <- gsub('.txt', '', fn)
assign(fn2[[3]], file(fn[3], open="r"))
ls()  # the connection, testz, appears
[1] "fn"           "fn2"          "inConnect"    "last.warning"
[5] "testz"        "x"            "y"            "z"

What am I missing? or is there a better way? I am using R 2.0.1 on a Windows2K box. Thanks so much! Tim Howard
Re: [R] assign connections automatically
Thank you for the help! Both env=.GlobalEnv and pos=1 do the trick. I'm embarrassed I didn't glean this from the assign help pages earlier. ?assign suggests that env is there for back compatibility, so I'm going with pos. Tim Howard

James Holtman: try:

inConnect <- function(){
+   fn <- dir(pattern="*.txt")    # grab only *.txt files
+   fn2 <- gsub('.txt', '', fn)   # removes the '.txt' from each string
+   for(i in 1:length(fn))
+     assign(fn2[[i]], file(fn[i], open="r"), env = .GlobalEnv)
+ }

__ James Holtman "What is the problem you are trying to solve?" Executive Technical Consultant -- Office of Technology, Convergys [EMAIL PROTECTED] +1 (513) 723-2929

Prof Brian Ripley [EMAIL PROTECTED] 02/01/05 08:34AM You are assigning in the frame of the function, not in the user workspace. See ?assign and try pos=1 (if that is what you intended), but it might well be better to return a list of objects.
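Prof. Ripley's closing suggestion (return a list of objects rather than assigning into the workspace) can be sketched like this; the file names follow the example above:

```r
# Files as in the example above
write(1:20, file = "testx.txt")
write(20:40, file = "testy.txt")

inConnect <- function() {
  fn <- dir(pattern = "\\.txt$")           # regex form of the glob above
  cons <- lapply(fn, file, open = "r")     # open one connection per file
  names(cons) <- gsub("\\.txt$", "", fn)   # strip the extension for the names
  cons                                     # return the named list
}

cons <- inConnect()
x5 <- scan(cons$testx, n = 5, quiet = TRUE)
x5                   # 1 2 3 4 5
lapply(cons, close)  # tidy up when done
```

Keeping the connections in one list also makes it easy to close them all in a single call, as the last line shows.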
Re: [R] Dropping a digit with scan() on a connection
Thank you Dr. Ripley and Christoph Buser for your explanations and help. Using sep = " " within scan worked within lines of my file, but then I gained an NA record when wrapping from one line to the next (because the linebreak character is no longer recognized as a sep?). So, I'll continue by ensuring each group I read ends at the end of a line (as scan was designed), and by using scan without the sep option. FYI, here's how the NA showed up; each line is 800 numbers long:

test4 <- scan(cn.test, n=1600, sep = " ")
test5 <- scan(cn.test, n=1600)
test4[797:803]
[1] 81.0 81.08746 81.89484 82.0 NA 580.09030 576.90300
test5[797:803]
[1] 81.01944 81.62060 81.96495 82.0 82.0 567.91840 563.10470

Thanks again. Tim

Prof Brian Ripley [EMAIL PROTECTED] 01/19/05 03:42AM This is because scan() has a private pushback. Either: 1) Read the file a whole line at a time: I cannot see why you need to do so here nor in your sketched application. or 2) Use an explicit separator, e.g. " " in your example. scan() is not designed to read parts of lines of a file.

On Tue, 18 Jan 2005, Tim Howard wrote: R gurus, My use of scan() seems to be dropping the first digit of sequential scans on a connection. It looks like it happens only within a line:

cat("TITLE extra line", "235 335 535 735", "115 135 175", file="ex.data", sep="\n")
cn.x <- file("ex.data", open="r")
a <- scan(cn.x, skip=1, n=2)
Read 2 items
a
[1] 235 335
b <- scan(cn.x, n=2)
Read 2 items
b
[1] 35 735
c <- scan(cn.x, n=2)
Read 2 items
c
[1] 115 135
d <- scan(cn.x, n=1)
Read 1 items
d
[1] 75

Note that in b I should get 535, not 35, as the first value, and in d I should get 175. Does anyone know how to get these digits? The reason I'm not scanning the entire file at once is that my real dataset is much larger than a Gig and I'll need to pull only portions of the file in at once. I got readLines to work, but then I have to figure out how to convert each entire line into a data.frame. Scan seems a lot cleaner, with the exception of the funny digit-dropping issue.
Thanks so much! Tim Howard

-- Brian D. Ripley, [EMAIL PROTECTED] Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/ University of Oxford, Tel: +44 1865 272861 (self), +44 1865 272866 (PA); 1 South Parks Road, Oxford OX1 3TG, UK; Fax: +44 1865 272595
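The resolution above (read in chunks that always end on a line boundary, with no sep argument) can be sketched on a small invented file standing in for the real 800-numbers-per-line data:

```r
# Two short lines of numbers standing in for the real 800-number lines
writeLines(c(paste(1:10, collapse = " "),
             paste(11:20, collapse = " ")), "big.txt")

cn <- file("big.txt", open = "r")
chunk1 <- scan(cn, n = 10, quiet = TRUE)  # exactly one full line
chunk2 <- scan(cn, n = 10, quiet = TRUE)  # picks up cleanly at the next line
close(cn)

chunk1[1]  # 1
chunk2[1]  # 11: no digits dropped at the line break
```

Because each scan() call consumes whole lines, the private pushback never splits a number across calls.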
[R] Dropping a digit with scan() on a connection
R gurus, My use of scan() seems to be dropping the first digit of sequential scans on a connection. It looks like it happens only within a line:

cat("TITLE extra line", "235 335 535 735", "115 135 175", file="ex.data", sep="\n")
cn.x <- file("ex.data", open="r")
a <- scan(cn.x, skip=1, n=2)
Read 2 items
a
[1] 235 335
b <- scan(cn.x, n=2)
Read 2 items
b
[1] 35 735
c <- scan(cn.x, n=2)
Read 2 items
c
[1] 115 135
d <- scan(cn.x, n=1)
Read 1 items
d
[1] 75

Note that in b I should get 535, not 35, as the first value, and in d I should get 175. Does anyone know how to get these digits? The reason I'm not scanning the entire file at once is that my real dataset is much larger than a Gig and I'll need to pull only portions of the file in at once. I got readLines to work, but then I have to figure out how to convert each entire line into a data.frame. Scan seems a lot cleaner, with the exception of the funny digit-dropping issue. Thanks so much! Tim Howard
[R] predict.randomForest
I have a data.frame with a series of variables tagged to a binary response ('present'/'absent'). I am trying to use randomForest to predict present/absent in a second dataset. After a lot of fiddling (using two data frames, making sure data types are the same, lots of testing with data that works such as data(iris)), I've settled on combining all my data into one data.frame, subset()'ing the known present/absent portion of the data.frame for the randomForest run, and then using the other subset for the predict. This worked with test data, but when I try it on a larger dataset (63,000 rows to predict), I get this error:

Error in predict.randomForest(stsw.rf, stsw.out, type = "prob") :
        Type of predictors in new data do not match that of the training data.

This is the error I was getting earlier, but I thought I had solved it by joining into one data.frame and subsetting. The values for each variable in the 'unknown' data (that which I want to predict) fall within (are bounded by) the values in the 'known' data. Does this error message have more than one meaning? Any suggestions on how to work through this? I am using R 2.0.1 and randomForest 4.4-2 (2004-11-02). I'm a new user to R, but doing my best to learn as much as I can... if I'm obviously clueless, please forgive me! Any help would be greatly appreciated. Thanks in advance! Tim Howard

More background for anyone interested: CART (as well as many other statistical techniques) has been used for a while to predict plant and animal distributions across a landscape. You feed it data about places where you know the plant to occur and not occur, and CART provides you with a tree with which you can then model the potential distribution across your region (state, country, etc.) using GIS. I've heard good things about random forests and would like to try to do the same thing.
My biggest stumbling block is that I can't (obviously, once I realized it) get a single 'best tree' from randomForest with which to apply my GIS models. Or, is there any way to extract a formula from randomForest, similar to a CART or rpart tree, and apply it to a dataset outside of R? The only solution I've been able to come up with is to bring ALL of the environmental variables into R, have randomForest do the prediction, and then get that prediction back into GIS. Thus my problem as I stated it above. I'm worried because my datasets are going to be huge (100's of millions of records) when we really get going. Should I be worried? thanks, Tim
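One way to keep memory in check with very large prediction sets is to call predict on blocks of rows and stack the results. This is only a sketch with invented data; 'resp', 'x1', 'x2', and the block size are not the poster's actual variables:

```r
# Sketch with invented stand-ins for the real habitat data
library(randomForest)
set.seed(1)
train <- data.frame(resp = factor(rep(c("present", "absent"), 100)),
                    x1 = rnorm(200), x2 = rnorm(200))
new   <- data.frame(x1 = rnorm(1000), x2 = rnorm(1000))

rf <- randomForest(resp ~ x1 + x2, data = train, ntree = 500)

# Predict in blocks so only one block of the big grid is in memory at a time
block <- 250
out <- vector("list", ceiling(nrow(new) / block))
for (i in seq_along(out)) {
  rows <- ((i - 1) * block + 1):min(i * block, nrow(new))
  out[[i]] <- predict(rf, new[rows, , drop = FALSE], type = "prob")
}
probs <- do.call(rbind, out)  # one row of class probabilities per grid cell
```

For the GIS round trip, each block's probabilities could equally be appended to a file as it is computed, so the full result never has to sit in R at once.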