Re: [R] Using R for Production - Discussion
I worked on a project where we used a random forest classifier to predict a binary response. We trained the model in the EC2 cloud on 3 million observations with 44 features and stored the model generated by R using save(mymodel, file="model.Rdata"). Now we use model.Rdata locally to predict new observations. On our local system, we built a parser in Perl to generate the CSV representation of the observation we want to predict, and we use RSPerl to communicate between Perl and R. There is a catch, though: instead of loading the random forest model (model.Rdata) every time we want to predict a new observation, we keep an R console running as a daemon with model.Rdata already loaded, and we send each observation to be predicted from Perl to R. If anyone has better solutions or ideas, please feel free to share.

Thanks,
Saeed

On Mon, Nov 1, 2010 at 9:04 PM, Santosh Srinivas santosh.srini...@gmail.com wrote:

Hello Group,

This is an open-ended question. I am quite fascinated by the things I can do, and the control I have over my activities, since I started using R. I have basically been using it for analytical work off my desktop. My experience has been quite good, and most issues I need to investigate and solve are typical items related to data errors, format corruption, etc., not necessarily R related. Complementing R with Python gives enough firepower to do lots of production (analytical) work on the cloud (from my research, every innovative technology provider seems to support Python: Google, Amazon, etc.).

Questions on using R for production activities:

Q1) Does anyone have experience of using R scripts etc. for production-related activities? E.g. serving a computational / analytical / simulation environment from a web portal, with the analytical processing done in R. I've seen that most useful things for normal (not rocket science) business (80-20 rule) can be done just as well in R as in tools like SAS, Matlab, etc.

Q2) I haven't tried the processing routines on much larger data sets, assuming size is not a constraint nowadays. I know that I should try it out, but any forewarnings would help. Is something that works for my desktop dataset just as likely to work when scaled up to a cloud dataset, assuming that I clear out unused objects, avoid infinite loops, etc.? I.e., is there any problem with the fundamental architecture of R itself (as press articles often say)?

Q3) There are big fans of the SAS, Matlab, Mathworks environments out there. Does anyone have a comparison of how R fares? From my experience R is quite neat and low level, so overheads should be quite low. Most slowness comes from lack of knowledge (see my code: using the wrong structures, functions, loops, etc.) rather than from something wrong with R itself. Perhaps there is no commercial focus on performance-related issues, but my guess is that it is just a matter of time until the community evolves the language to score higher on that too, and perhaps develops documentation to help users with performance tips (the "ten commandments" type).

Q4) You must have heard the latest comment from James Goodnight of SAS: "We haven't noticed that a lot. Most of our companies need industrial strength software that has been tested, put through every possible scenario or failure to make sure everything works correctly." My gut feeling is that random passionate geeks (playing part-time) do better testing than an army of professionals (but I have no empirical evidence here).

I am not taking a side here (although I appreciate those who do!), but I am looking for objective reasoning.

Thanks,
S

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
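The load-once pattern described in the reply above can be sketched in a few lines of R. This is only a minimal sketch under assumptions: a small glm fitted on the built-in mtcars data stands in for the random forest, and predict_csv_line(), model_env and the temporary file path are hypothetical names that do not come from the original setup.

```r
# Stand-in for the expensive training step (assumption: a logistic
# regression on mtcars instead of the random forest from the thread).
mymodel <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Store the fitted model once, as in the post.
model_file <- file.path(tempdir(), "model.Rdata")
save(mymodel, file = model_file)
rm(mymodel)

# In the long-running R process: load the model a single time.
model_env <- new.env()
load(model_file, envir = model_env)

# Hypothetical helper: score one observation given as a CSV line
# in the column order "wt,hp".
predict_csv_line <- function(line, env = model_env) {
  obs <- read.csv(text = line, header = FALSE, col.names = c("wt", "hp"))
  as.numeric(predict(env$mymodel, newdata = obs, type = "response"))
}

p <- predict_csv_line("2.62,110")
print(p)
```

Every later call to predict_csv_line() reuses the model already held in memory, which is the point of keeping the R process alive between requests.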
Re: [R] Help with time in R
You can use strptime to specify the format of the date and time you want, e.g.

    x1 <- strptime(x, "%Y-%m-%d %H:%M:%S")
    x1
    [1] "2010-04-02 12:00:05"
    str(x1)
     POSIXlt[1:1], format: "2010-04-02 12:00:05"

On Wed, Jul 21, 2010 at 8:02 AM, Aaditya Nanduri aaditya.nand...@gmail.com wrote:

Ms. Chisholm,

If you could tell us how you plan to use the variables, we will have a better understanding of what you are looking for and will be able to help you. Are you looking for the time in seconds? In that case, do as Mr. Holfman says. He just skipped the part about converting the factors to characters. You can do that by:

    y <- as.character(x)

where x is the vector of factors. Are you looking to have a list of hours, minutes and seconds? That can be done too, although it would be much easier to just have hours and min.sec.

On Tue, Jul 20, 2010 at 7:33 AM, Sarah Chisholm sarah.chisholm...@ucl.ac.uk wrote:

Hi,

I have a problem with time formatting in R. I have entered time in the format MM:SS.xyz and R has automatically classified this as a factor, but I need it numerically. However, when I use as.numeric() it gives me totally different numbers. Is there any way I can tell R to read this input as a number?

Thank you very much

--
Aaditya Nanduri
aaditya.nand...@gmail.com
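The factor-to-number conversion discussed in this thread can be illustrated with a short base R sketch. The input values below are hypothetical; the only assumption taken from the thread is that the times were typed as MM:SS.xyz and ended up as a factor.

```r
# Hypothetical input: times typed as "MM:SS.xyz" and read in as a factor.
x <- factor(c("01:23.456", "12:00.050", "00:59.900"))

# as.numeric(x) would return the internal level codes, not the times,
# so convert to character first and split on the colon.
parts <- strsplit(as.character(x), ":", fixed = TRUE)
minutes <- as.numeric(sapply(parts, `[`, 1))
seconds <- as.numeric(sapply(parts, `[`, 2))

# Total time in seconds, fractional part included.
total_seconds <- 60 * minutes + seconds
print(total_seconds)
```

This is why as.numeric() on the factor gave "totally different numbers": it returned the level codes rather than the values the labels spell out.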
Re: [R] Figures in Latex
http://nixtricks.wordpress.com/2009/11/09/latex-multiple-figures-under-the-same-caption-using-subfigure/

It will create two rows of subfigures with two subfigures on each row.

On Fri, Jul 23, 2010 at 6:43 AM, li li hannah@gmail.com wrote:

Hi all,

I want to add 6 plots in the format of 2 columns and 3 rows as one figure in LaTeX. The plots are in .eps files. I know how to add 2 plots side by side, but could not figure out how to do multiple rows. I know this may not be the right place to ask such a question, but I do not know who to ask, so I am just trying my luck here.

Thank you in advance.
Hannah
Re: [R] transforming dates into years
    myFrame$year <- years(strptime(x))

On Fri, Aug 13, 2010 at 12:36 PM, Dimitri Liakhovitski dimitri.liakhovit...@gmail.com wrote:

Hello!

If I have in my data frame myFrame a variable saved as a Date and want to translate it into years, I currently do it like this using zoo:

    library(zoo)
    as.year <- function(x) as.numeric(floor(as.yearmon(x)))
    myFrame$year <- as.year(myFrame$date)

Is there a function that would do it directly, like as.yearmon, but for years?

Thank you!

--
Dimitri Liakhovitski
Ninah Consulting
www.ninah.com
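For a column that is already of class Date, base R can extract the year without any extra package. A minimal sketch, with a hypothetical three-row myFrame standing in for the real data:

```r
# Hypothetical data frame with a Date column, as described in the question.
myFrame <- data.frame(
  date = as.Date(c("2008-03-15", "2009-12-31", "2010-08-13"))
)

# format() renders the four-digit year as text; as.numeric() converts it.
myFrame$year <- as.numeric(format(myFrame$date, "%Y"))
print(myFrame)
```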
[R] Importance of levels in a factor variable
I have a dataset of multiple variables and a response. For example,

    str(x)
    'data.frame': 3557238 obs. of 44 variables:
     $ response : Factor w/ 2 levels
     $ var2: Factor w/5000 levels

If var2, for example, is a factor with 5000 levels, what is the best approach to determine which of these levels are the most important to include in building the model, and which ones to discard? Assuming there is a way to do that, is it sound to include only the important levels and discard the rest for that variable when building the model?

Thanks,
Saeed

---
sessionInfo()
R version 2.10.1 (2009-12-14)
x86_64-pc-linux-gnu
32 GB RAM
Re: [R] Looking for an image (R 64-bit on Linux 64-bit) on Amazon EC2
No need to do that. They have some instances that run 64-bit Ubuntu. If I remember correctly, we had to install 64-bit R from the Debian packages on the Ubuntu instance.

On Wed, Aug 25, 2010 at 6:12 PM, noclue_ tim@netzero.net wrote:

You have a 64 bit Linux? If so... Download the sources

Do you mean download the Linux kernel source code and then compile it on Amazon EC2?

--
View this message in context: http://r.789695.n4.nabble.com/Looking-for-an-image-R-64-bit-on-Linux-64-bit-on-Amazon-EC2-tp2338938p2339072.html
Re: [R] Importance of levels in a factor variable
Thanks Greg. Actually, we do have 5000 levels and it is not an import problem. I looked into combine.levels in the Hmisc package. The problem with this approach is that it takes the frequency of levels, then combines infrequent levels into one level called Others. If you apply this to the complete dataset (positive and negative samples), and the number of negative samples is much greater than the number of positive ones, then most of the levels of the positive samples will go into the Others level in the final result. That's why I was wondering if there is a more accurate way to remove the unimportant levels.

On Thu, Aug 26, 2010 at 3:47 PM, Greg Snow greg.s...@imail.org wrote:

A factor with 5000 levels looks like it may be a numeric variable that was accidentally coded as a factor (functions like read.table will do this if there is a non-numeric character in with the numbers). If you really have a 5000-level factor, which levels can be discarded or combined is a question for the subject-specific scientist, not the statistician.

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.s...@imail.org
801.408.8111
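The concern raised in this reply (frequency-based lumping swallows the levels that matter for the rare positive class) suggests counting positives per level rather than overall frequency. The following is only an illustrative sketch, not a recommendation from the thread: the simulated data, the threshold of 3 positives, and all variable names are assumptions.

```r
set.seed(1)

# Hypothetical data: a 50-level factor and a rare positive class.
n <- 2000
var2 <- factor(sample(paste0("lev", 1:50), n, replace = TRUE))
response <- factor(rbinom(n, 1, 0.05), levels = c(0, 1))

# Count positive samples per level instead of overall level frequency.
pos_count <- table(var2[response == "1"])
keep <- names(pos_count)[pos_count >= 3]  # threshold is an arbitrary assumption

# Lump every level with too few positives into "Others".
var2_lumped <- as.character(var2)
var2_lumped[!(var2_lumped %in% keep)] <- "Others"
var2_lumped <- factor(var2_lumped)

print(nlevels(var2_lumped))
```

Whether discarding levels this way is statistically sound still depends on the subject matter, as Greg Snow points out in the quoted message.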
[R] Collapsing levels of categorical variables
In this paper [1] the author mentions a procedure by M. Greenacre that can be used to collapse the levels of a categorical variable: set up a table with the frequency of each level and the proportion of the target value in each level, then collapse the table level by level, looking at the change in chi-square as the table is collapsed. Does anyone know if such a procedure is available in R?

[1] http://www2.sas.com/proceedings/sugi31/079-31.pdf
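The collapsing idea described above can be sketched as a greedy loop in base R. This is not Greenacre's published algorithm verbatim; it is a simplified sketch in its spirit, assuming that at each step the two levels whose merge loses the least Pearson chi-square are combined. The functions chisq_stat() and collapse_levels() and the example counts are hypothetical.

```r
# Pearson chi-square statistic of a level-by-target count table.
chisq_stat <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  sum((tab - expected)^2 / expected)
}

# Greedy collapse: repeatedly merge the pair of rows (levels) whose
# merge reduces the chi-square statistic the least.
collapse_levels <- function(tab, n_groups) {
  while (nrow(tab) > n_groups) {
    pairs <- combn(nrow(tab), 2)
    loss <- apply(pairs, 2, function(p) {
      merged <- rbind(tab[-p, , drop = FALSE],
                      colSums(tab[p, , drop = FALSE]))
      chisq_stat(tab) - chisq_stat(merged)
    })
    best <- pairs[, which.min(loss)]
    new_row <- colSums(tab[best, , drop = FALSE])
    new_name <- paste(rownames(tab)[best], collapse = "+")
    tab <- rbind(tab[-best, , drop = FALSE], new_row)
    rownames(tab)[nrow(tab)] <- new_name
  }
  tab
}

# Hypothetical counts: five levels, negative and positive target counts.
tab <- matrix(c(90, 10,
                88, 12,
                50, 50,
                55, 45,
                20, 80),
              ncol = 2, byrow = TRUE,
              dimnames = list(c("a", "b", "c", "d", "e"), c("neg", "pos")))

result <- collapse_levels(tab, n_groups = 3)
print(result)
```

Levels with similar target proportions (a with b, c with d) end up merged first. The pairwise search is quadratic in the number of levels, so a 5000-level factor would need a cheaper strategy, for example sorting levels by target proportion and considering only adjacent pairs.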
[R] Communicating with an R daemon from perl
Is there a way to communicate with a running R daemon from Perl? I tried RSPerl, but the functions there initiate an R instance first. I would like to keep an R instance running in the background and communicate with it using Perl. The problem is due to a large object that we need, which has to be loaded every time the R instance is initialized:

    load(file="model.Rdata")

Thanks,
Saeed

---
R version 2.11.1 (2010-05-31)
x86_64-pc-linux-gnu
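One way to keep an R instance alive and talk to it from another language is a plain TCP socket, which base R supports through socketConnection(). The sketch below is an assumption-laden illustration, not the solution adopted in the thread: a small lm on mtcars stands in for the large model, and handle_request(), serve() and port 6011 are hypothetical. A Perl client could connect with any TCP socket library and exchange one line per observation.

```r
# Stand-in for the large object loaded once at startup (assumption:
# a small linear model instead of load(file = "model.Rdata")).
model <- lm(mpg ~ wt + hp, data = mtcars)

# Turn one request line ("wt,hp") into one response line.
handle_request <- function(line) {
  values <- as.numeric(strsplit(line, ",", fixed = TRUE)[[1]])
  obs <- data.frame(wt = values[1], hp = values[2])
  format(unname(predict(model, newdata = obs)), digits = 6)
}

# Hypothetical serving loop: accepts one client at a time on a local
# port and answers line by line. When the client sends "quit" or
# disconnects, it goes back to waiting for the next client.
serve <- function(port = 6011) {
  repeat {
    con <- socketConnection(host = "localhost", port = port,
                            server = TRUE, blocking = TRUE, open = "r+")
    repeat {
      line <- readLines(con, n = 1)
      if (length(line) == 0 || identical(line, "quit")) break
      writeLines(handle_request(line), con)
    }
    close(con)
  }
}

# serve() is not called here because it blocks forever;
# handle_request() can be checked directly.
reply <- handle_request("2.62,110")
print(reply)
```

The model is created once when the script starts, so each request only pays for parsing and prediction.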
Re: [R] can not print probabilities in svm of e1071
    library(e1071)
    svm.model <- svm(y ~ ., data = dataset, probability = TRUE)
    svm.pred <- predict(svm.model, test.set, decision.values = TRUE, probability = TRUE)
    library(ROCR)
    # the labels argument must be the true classes of the test set
    svm.roc <- prediction(attributes(svm.pred)$decision.values, test.set$y)
    svm.auc <- performance(svm.roc, "tpr", "fpr")
    plot(svm.auc)

On Thu, Apr 29, 2010 at 4:17 PM, Changbin Du changb...@gmail.com wrote:

    x <- train[,c( 2:18, 20:21, 24, 27:31)]
    y <- train$out
    svm.pr <- svm(x, y, probability = TRUE, method="C-classification", kernel="radial", cost=bestc, gamma=bestg, cross=10)
    pred <- predict(svm.pr, valid[,c( 2:18, 20:21, 24, 27:31)], decision.values = TRUE, probability = TRUE)
    attr(pred, "decision.values")[1:4,]
            16         23         43         52
    1.08157648 0.51241842 0.06234319 1.20656580
    attr(pred, "probabilities")[1:4,]
    NULL

Hi, dear David and R community,

I am trying to print out the probabilities and set a threshold to make a ROC curve. I don't know why it shows NULL for the probabilities. y <- train$out consists of binary 0 and 1 values. Can you help me with this? Thanks so much!

--
Sincerely,
Changbin
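A likely reason for the NULL probabilities in the quoted question (an assumption, since the data are not shown) is that y is numeric, so svm() fits a regression model; in e1071 the model type is selected with the type argument, and a factor response selects classification. The following sketch uses a two-class subset of iris as stand-in data and is guarded so that it only runs when e1071 is installed.

```r
if (requireNamespace("e1071", quietly = TRUE)) {
  set.seed(42)

  # Two-class subset of iris as stand-in data (assumption).
  dat <- droplevels(iris[iris$Species != "setosa", ])
  idx <- sample(nrow(dat), 70)
  x_train <- as.matrix(dat[idx, 1:4])
  y_train <- dat$Species[idx]        # a factor, so svm() does classification
  x_valid <- as.matrix(dat[-idx, 1:4])

  fit <- e1071::svm(x_train, y_train, type = "C-classification",
                    kernel = "radial", probability = TRUE)
  pred <- predict(fit, x_valid, decision.values = TRUE, probability = TRUE)

  # One column of class probabilities per class level.
  probs <- attr(pred, "probabilities")
  print(head(probs, 4))
}
```

With a 0/1 numeric response, wrapping it in as.factor() before fitting should have the same effect.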
Re: [R] ROC curve in R
Try the ROCR package. http://rocr.bioinf.mpi-sb.mpg.de/ROCR.pdf

Saeed

On Thu, Jul 1, 2010 at 9:50 AM, ashu6886 ashu.infy.m...@gmail.com wrote:

Hi,

I have a fairly large amount of genomic data. I have created a data frame which has Reference as one column and Variation as another. I want to plot a ROC curve based on these 2 columns. I have searched the R manual but I could not understand it. Can anybody help me with the R code for plotting a ROC curve?

Thanks,
ashu6886

--
View this message in context: http://r.789695.n4.nabble.com/ROC-curve-in-R-tp2275431p2275431.html
[R] Function similar to combine.levels in Hmisc package
Is there a function similar to combine.levels (in the Hmisc package) that combines the levels of factors, but not based on their frequency? Alternatively, I am looking into using the significance of the dummy variables of the factors, based on their Pr(>|t|) values in a linear model, and then deleting the non-significant levels. Any other suggestions?

Thanks,
Saeed
Re: [R] how to make R plot under Linux
Try installing Xming on your Windows box: http://www.straightrunning.com/XmingNotes/. Make sure to run Xming before plotting.

Saeed

On Mon, Feb 22, 2010 at 12:46 PM, xin wei xin...@stat.psu.edu wrote:

Hi guys,

Thank you so much for all the suggestions. Now I seem to be able to set up X11 forwarding in PuTTY. However, I still could not get a plot, and I get the following error message:

    Error in function (display = "", width, height, pointsize, gamma, bg, :
      X11 I/O error while opening X11 connection to 'localhost:20.0'

Does this error message indicate that an appropriate plotting package is missing on the server, or that the server is not properly set up for X11 forwarding?

Thanks

--
View this message in context: http://n4.nabble.com/how-to-make-R-plot-under-Linux-tp1562060p1565113.html
Re: [R] R Graphics into Latex
Use \usepackage{epsfig} after your \documentclass. Then make sure to run LaTeX, not pdfLaTeX.

On Wed, Feb 24, 2010 at 3:29 PM, Lars Bishop lars...@gmail.com wrote:

Hi,

I'm new to LaTeX and I'm trying to include an R chart in a LaTeX document. This is what I'm doing:

1) In R: save the chart as PostScript in a folder, C:/xxx/Density.eps

2) In LaTeX (using TeXworks on Windows XP), in the preamble:

    \documentclass[11pt]{article}
    \usepackage{graphicx}
    \begin{document}
    blah..blah…blah
    \begin{figure}
    \centering
    \includegraphics{C:/xxx/Density.eps}
    \label{fig:Density}
    \end{figure}

This is the error message I'm getting:

    LaTeX Warning: File `R:/MarsTH/Studies/Misc/LIA QA/R/Density.eps' not found on input line 26.
    ! LaTeX Error: Unknown graphics extension: .eps.
    See the LaTeX manual or LaTeX Companion for explanation.
    Type H return for immediate help.

I'll appreciate your help.

Thanks in advance,
Lars.
[R] snow package on multi core unix box
Is the rmpi package (or rpvm) needed to exploit multiple cores on a single Unix box using the snow package? The documentation of the package does not provide information about setting up a single machine with multiple cores. Also, how effective is it to run a Bayesian simulation on parallel (or distributed) processors using the snow package?

Thanks,
Saeed
[R] R on a multi core unix box
Hi,

I installed the snow package on a Unix box that has multiple cores. To be able to exploit the multiple cores (on one PC), do I still need to install the rmpi package (or rpvm)? Another question: if I run a Bayesian simulation on the multiple cores after setting them up correctly (using snow), do you think there will be a noticeable speedup?

Thanks,
Saeed

---
Linux CentOS 4
dual core processors
32 GB RAM
R (2.6.0)
snow 0.29
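On a single multi-core machine, a socket cluster needs neither MPI nor PVM. The sketch below uses the parallel package, which ships with current R and provides snow-style socket clusters; it post-dates this thread, so treating it as the answer is an assumption (with snow itself, makeCluster(2, type = "SOCK") plays the same role). The toy Metropolis sampler run_chain() is hypothetical and only illustrates running independent chains in parallel.

```r
library(parallel)  # ships with R; provides snow-style socket clusters

# One hypothetical chain: a random-walk Metropolis sampler for the mean
# of a normal sample with known unit variance and a flat prior.
run_chain <- function(chain_id, y, n_iter = 2000) {
  mu <- 0
  draws <- numeric(n_iter)
  loglik <- function(m) sum(dnorm(y, mean = m, sd = 1, log = TRUE))
  for (i in seq_len(n_iter)) {
    proposal <- mu + rnorm(1, sd = 0.5)
    if (log(runif(1)) < loglik(proposal) - loglik(mu)) mu <- proposal
    draws[i] <- mu
  }
  mean(draws[-(1:500)])  # posterior mean after discarding burn-in
}

set.seed(123)
y <- rnorm(50, mean = 3)

cl <- makeCluster(2)                    # two local worker processes
clusterSetRNGStream(cl, iseed = 2024)   # independent random streams
chain_means <- parLapply(cl, 1:2, run_chain, y = y)
stopCluster(cl)

print(unlist(chain_means))
```

Independent chains parallelize well because they share nothing; the iterations within one chain are sequential, so a single chain does not get faster with more cores.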
Re: [R] Dual Core vs Quad Core
I ran a Bayesian simulation some time ago and it took 1 week to finish on a Debian box (Dell PE 2850 Dual Intel [EMAIL PROTECTED] 6GB). I think it depends on the setting of the experiment and whether the code can be parallelized.

Simon Blomberg wrote:

I've been running R on a quad-core using Debian GNU/Linux since March this year, and I am very pleased with the performance.

Simon.

On Mon, 2007-12-17 at 20:13 -0500, Andrew Perrin wrote:

On Mon, 17 Dec 2007, Kitty Lee wrote:

Dear R-users,

I use R to run spatial stuff and it takes up a lot of RAM. Runs can take hours or days. I am thinking of getting a new desktop. Can R take advantage of a dual-core system? I have a dual-core computer at work, but it seems that right now R is using only one processor. The new computers feature quad core with 3GB of RAM. Can R take advantage of the 4 cores? Or am I better off getting a dual core with faster processing speed per core?

Thanks! Any advice would be really appreciated!

K.

If I have my information right, R will use dual or quad cores if it's doing two (or four) things at once. The second core will help a little bit insofar as whatever else your machine is doing won't interfere with the one core on which R is running, but generally things that take a single thread will remain on a single core. As for RAM, if you're doing memory-bound work you should certainly be using a 64-bit machine and OS so you can utilize the larger memory space.

--
Andrew J Perrin - andrew_perrin (at) unc.edu - http://perrin.socsci.unc.edu
Associate Professor of Sociology; Book Review Editor, _Social Forces_
University of North Carolina - CB#3210, Chapel Hill, NC 27599-3210 USA
Re: [R] Installing R on BSD
    pkg_add -r R

Kitty Lee wrote:

Dear users,

I am trying to follow the instructions on this page to install R on a 4.4BSD network: http://cran.r-project.org/doc/manuals/R-admin.html#Using-make

I can unpack the file, but the system can't recognize the commands:

    ./configure
    make

Any ideas what the right command should be?

Thanks!!

K.
Re: [R] Invoking R on BSD
When you do pkg_add -r R it should install R, and you will not need to run make. To invoke R, you just need to type R at your prompt. Here is what I have on my FreeBSD:

    FreeBSD 7.0-PRERELEASE (GENERIC2) #0: Sat Jan 5 21:27:47 CST 2008
    Welcome to
    %R

    R version 2.6.0 (2007-10-03)
    Copyright (C) 2007 The R Foundation for Statistical Computing
    ISBN 3-900051-07-0

    R is free software and comes with ABSOLUTELY NO WARRANTY.
    You are welcome to redistribute it under certain conditions.
    Type 'license()' or 'licence()' for distribution details.

    R is a collaborative project with many contributors.
    Type 'contributors()' for more information and
    'citation()' on how to cite R or R packages in publications.

    Type 'demo()' for some demos, 'help()' for on-line help, or
    'help.start()' for an HTML browser interface to help.
    Type 'q()' to quit R.

Kitty Lee wrote:

Thanks to Saeed Abu Nimeh. I used pkg_add to install the R package on 4.4BSD. My directory now has the following:

    BUILDDIR      Makefrag.cc_lo  config.log     m4     tests
    Makeconf      Makefrag.cxx    config.status  po     tools
    Makefile      R-2.6.1         doc            roots
    Makefile.bak  R-2.6.1.tar.gz  etc            share
    Makefrag.cc   SVN-REVISION    libtool        src

But make check shows errors:

    [2:32pm][~] make check
    `Makedeps' is up to date.
    make: don't know how to make ../../bin/exec/R. Stop
    *** Error code 2
    Stop in /usr/home/xxx/tests/Examples.
    *** Error code 1
    Stop in /usr/home/xxx/tests.
    *** Error code 1
    Stop in /usr/home/xxx/tests.
    *** Error code 1

How do I fix this error? And then what are the steps involved to invoke R? I know eventually I need to use commands like R CMD. But what are the steps before this? (Sorry, I have not done anything on Unix before and am trying to figure things out from bits and pieces off the internet. I would truly appreciate any help or hint!)

K.
[R] ROCR package finding maximum accuracy and optimal cutoff point
If we use the ROCR package to find the accuracy of a classifier

    pred <- prediction(svm.pred, testset[,2])
    perf.acc <- performance(pred, "acc")

do we find the maximum accuracy as follows (is there a simpler way?):

    max(perf.acc@x.values[[1]])

Then, to find the cutoff point that maximizes the accuracy, do we do the following (is there a simpler way?):

    cutoff.list <- unlist(perf.acc@x.values[[1]])
    cutoff.list[which.max(perf.acc@y.values[[1]])]

If the above is correct, how is it possible to find the average false positive and false negative rates from the following?

    perf.fpr <- performance(pred, "fpr")
    perf.fnr <- performance(pred, "fnr")

The dataset consists of two columns, a score and a binary response, similar to this:

    2.5, 0
    -1, 0
    2, 1
    6.3, 1
    4.1, 0
    3.3, 1

Thanks,
Saeed

---
R 2.8.1
Win XP Pro SP2
ROCR package v1.0-2
e1071 v1.5-19
Re: [R] ROCR package finding maximum accuracy and optimal cutoff point
Found the solution to my own question. To find the false positive rate and the false negative rate that correspond to a certain cutoff point using the ROCR package, one can do the following (there are surely simpler ways, but this works):

    library(ElemStatLearn)
    library(rpart)
    data(spam)

    ## create train and test sets
    index <- 1:nrow(spam)
    testindex <- sample(index, trunc(length(index)/3))
    testset <- spam[testindex, ]
    trainset <- spam[-testindex, ]
    rpart.model <- rpart(spam ~ ., data = trainset)   # training model

    ## use ROCR to calculate accuracy and fp, fn, tp, tn rates
    library(ROCR)
    rpart.pred2 <- predict(rpart.model, testset)[,2]  # testing model
    pred <- prediction(rpart.pred2, testset[,58])     # prediction using ROCR
    perf.acc <- performance(pred, "acc")              # list of accuracies
    perf.fpr <- performance(pred, "fpr")              # list of fp rates
    perf.fnr <- performance(pred, "fnr")              # list of fn rates
    acc.rocr <- max(perf.acc@y.values[[1]])           # accuracy using ROCR

    # cutoff list for accuracies
    cutoff.list.acc <- unlist(perf.acc@x.values[[1]])
    # optimal cutoff point for accuracy
    optimal.cutoff.acc <- cutoff.list.acc[which.max(perf.acc@y.values[[1]])]

    # index of the optimal cutoff in the fpr list (as.numeric because a list is returned)
    optimal.cutoff.fpr <- which(perf.fpr@x.values[[1]] == as.numeric(optimal.cutoff.acc))
    # list of fpr values
    cutoff.list.fpr <- unlist(perf.fpr@y.values[[1]])
    # fpr using ROCR
    fpr.rocr <- cutoff.list.fpr[as.numeric(optimal.cutoff.fpr)]

    # index of the optimal cutoff in the fnr list
    optimal.cutoff.fnr <- which(perf.fnr@x.values[[1]] == as.numeric(optimal.cutoff.acc))
    # list of fnr values
    cutoff.list.fnr <- unlist(perf.fnr@y.values[[1]])
    # fnr using ROCR
    fnr.rocr <- cutoff.list.fnr[as.numeric(optimal.cutoff.fnr)]

Now acc.rocr, fpr.rocr and fnr.rocr give you the accuracy, fpr and fnr at the optimal cutoff.
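The same quantities can be cross-checked without ROCR on the six-row toy dataset from the question. This is a base R sketch under one stated assumption: an observation is predicted positive when its score is greater than or equal to the cutoff, which may differ from ROCR's own cutoff convention. The names rates and best are hypothetical.

```r
# Toy data from the question: score and binary response.
score <- c(2.5, -1, 2, 6.3, 4.1, 3.3)
label <- c(0, 0, 1, 1, 0, 1)

# Evaluate every observed score as a cutoff
# (assumption: predict 1 when score >= cutoff).
cutoffs <- sort(unique(score))
rates <- t(sapply(cutoffs, function(k) {
  pred <- as.numeric(score >= k)
  c(cutoff   = k,
    accuracy = mean(pred == label),
    fpr      = sum(pred == 1 & label == 0) / sum(label == 0),
    fnr      = sum(pred == 0 & label == 1) / sum(label == 1))
}))
rates <- as.data.frame(rates)

# Row with the highest accuracy (first one in case of ties).
best <- rates[which.max(rates$accuracy), ]
print(best)
```

On this tiny dataset several cutoffs tie for the best accuracy, and which.max() returns the first of them, so the reported fpr and fnr belong to that first cutoff only.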
[R] ROCR package partial false positive and accuracy
Hi,

In the ROCR package, is there a way to find the accuracy that corresponds to a given false positive rate? In version 1.0-2, the authors of the package added an option to find the partial area under the ROC curve up to a given false positive rate by passing an optional parameter fpr.stop:

    perf.auc <- performance(pred, "auc", fpr.stop = 0.15)

Is there a way to find the accuracy up to a given false positive rate? We use a classification tree (rpart) and a binary response.

Thanks,
Saeed

---
R 2.8.1
Win XP Pro
rpart 3.1-43
ROCR 1.0-2
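One possible reading of "accuracy up to a given false positive rate" is the best accuracy among all cutoffs whose false positive rate stays at or below the limit. The base R sketch below implements that reading; the interpretation, the function best_accuracy_under_fpr() and the ten-row example data are all assumptions, not something ROCR provides under this name.

```r
# Best accuracy among cutoffs whose false positive rate is <= fpr_max
# (assumption: predict positive when score >= cutoff).
best_accuracy_under_fpr <- function(score, label, fpr_max) {
  cutoffs <- sort(unique(score))
  stats <- sapply(cutoffs, function(k) {
    pred <- score >= k
    c(accuracy = mean(pred == (label == 1)),
      fpr      = sum(pred & label == 0) / sum(label == 0))
  })
  ok <- stats["fpr", ] <= fpr_max
  if (!any(ok)) return(NULL)
  i <- which(ok)[which.max(stats["accuracy", ok])]
  list(cutoff   = cutoffs[i],
       accuracy = unname(stats["accuracy", i]),
       fpr      = unname(stats["fpr", i]))
}

# Hypothetical scores and binary response.
score <- c(0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9, 0.95)
label <- c(0,   0,   0,   1,    0,   0,   1,   1,   1,   1)

res <- best_accuracy_under_fpr(score, label, fpr_max = 0.2)
print(res)
```

With rpart, the score vector would be the predicted class probability, as in predict(rpart.model, testset)[,2].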
Re: [R] how to compute a roc curve
Try library(ROCR)

Pau Marc Munoz Torres wrote:

Hi,

I'm trying to set up prediction software. I am now testing the performance of my method, so I need to calculate a ROC curve, especially AUC, cut-off, sensitivity and specificity. I was just looking at the ROCH package, but it's a mess for me; I'm not a math guy and I'm getting lost. Could any of you recommend an easy-to-use package for this task? I just have a list of positive/negative samples and their scores from my program. Can I compute a ROC curve with this?

Thanks,
pau
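A list of scores with positive/negative labels is indeed enough for a ROC curve. As a package-free illustration, the base R sketch below computes the ROC points and the AUC (through the rank-sum, or Mann-Whitney, identity) for a small hypothetical set of scores; the data and variable names are assumptions.

```r
# Hypothetical scores from the prediction program and known classes.
score <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
label <- c(1,   1,   0,   1,   0,    1,   0,   0)

# ROC points: lower the cutoff one observation at a time.
ord <- order(score, decreasing = TRUE)
tpr <- cumsum(label[ord] == 1) / sum(label == 1)  # sensitivity
fpr <- cumsum(label[ord] == 0) / sum(label == 0)  # 1 - specificity
roc <- data.frame(cutoff = score[ord], tpr = tpr, fpr = fpr)

# AUC via the rank-sum identity: the fraction of positive/negative
# pairs in which the positive sample has the higher score.
r <- rank(score)
n_pos <- sum(label == 1)
n_neg <- sum(label == 0)
auc <- (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

print(roc)
print(auc)
```

Plotting roc$fpr against roc$tpr gives the curve; ROCR wraps the same computation in prediction() and performance().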
Re: [R] Security Data extraction
Subba Rao wrote:

Hi,

Today I came across the R application, and I will admit I am not a statistician. However, I think this application will be useful for me at work. I am a Network/System Security Engineer trying to make sense of the huge amount of security data I collect. I am trying to visualize the traffic on our network. The data in the packet header (captured by tcpdump) has all the information about the systems on the network. There are lots of visual tools that can present the data in a meaningful way. Each tool seems to have a different data format, while most tools seem to understand CSV format. How do I select a subset of the network data or syslog data and create a CSV file?

Sniff is a good tool: http://www.thedumbterminal.co.uk/software/sniff.shtml

How else can the R application help me present the security data in a meaningful way to the management?

Depends on what you want to present.

Please excuse my ignorance. Thank you.

Subba Rao
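As a rough illustration of the "select a subset and create a CSV file" step in R itself, here is a base-R sketch. Everything specific in it is an assumption: the sample lines only imitate the style of tcpdump -n text output, the regular expression covers just that simple IPv4 shape, and the column names are made up. Real captures will need a more careful pattern.

```r
# Hypothetical lines in the style of `tcpdump -n` text output
lines <- c(
  "12:00:01.000001 IP 192.168.1.10.51000 > 10.0.0.5.80: Flags [S], seq 1, win 65535, length 0",
  "12:00:01.200000 IP 192.168.1.11.51001 > 10.0.0.5.443: Flags [P.], seq 1:101, ack 1, win 502, length 100",
  "12:00:02.500000 IP 192.168.1.10.51000 > 10.0.0.5.80: Flags [P.], seq 1:201, ack 1, win 502, length 200",
  "12:00:03.000000 ARP, Request who-has 10.0.0.1 tell 10.0.0.5, length 28"
)

# time, source IP.port, destination IP.port, payload length
ip      <- "([0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+)"
pattern <- paste0("^([^ ]+) IP ", ip, "\\.([0-9]+) > ", ip,
                  "\\.([0-9]+): .*length ([0-9]+)$")

m <- regmatches(lines, regexec(pattern, lines))
m <- m[lengths(m) > 0]  # drop lines that did not match (e.g. ARP)

# First column of each match is the whole line, so drop it
packets <- as.data.frame(do.call(rbind, m)[, -1, drop = FALSE],
                         stringsAsFactors = FALSE)
names(packets) <- c("time", "src_ip", "src_port",
                    "dst_ip", "dst_port", "length")
packets$length <- as.integer(packets$length)

# Select a subset (traffic to port 80) and write it out as CSV
web <- subset(packets, dst_port == "80")
out <- tempfile(fileext = ".csv")
write.csv(web, out, row.names = FALSE)
print(web)
```

The resulting CSV can then be read by whichever visualization tool is preferred, or back into R with read.csv().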
Re: [R] SVM
Read "Support Vector Machines in R": http://www.jstatsoft.org/v15/i09/paper

On Thu, Sep 17, 2009 at 4:39 AM, Samuel Okoye samu...@yahoo.com wrote:

Hello,

I have 12 samples; each sample has 1000 observations, i.e. I have a matrix X with 1000 rows and 12 columns!

    m <- svm(t(X))
    p <- predict(m)

Can anyone tell me how to use svmtrain() in R?

Many thanks,
Samuel
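A short sketch that may help here, assuming the e1071 package is installed. As far as I know, svmtrain() is not an R function; in e1071 the training function is svm(). When svm() is called without a response, as in the post, it fits a one-class (novelty detection) model, so a classifier needs class labels as well. The random matrix X and the labels y below are hypothetical stand-ins for the real data.

```r
library(e1071)  # assumes the e1071 package is installed

set.seed(1)
# Hypothetical stand-in for the data: 1000 rows (observations) x 12 columns (samples)
X <- matrix(rnorm(1000 * 12), nrow = 1000, ncol = 12)

# Hypothetical class labels, one per sample (the post does not give any)
y <- factor(rep(c("a", "b"), each = 6))

# svm() expects one row per case, hence the transpose: 12 cases x 1000 features
m <- svm(t(X), y, kernel = "linear")

# Predict the class of each of the 12 samples
p <- predict(m, t(X))
print(table(predicted = p, actual = y))
```

Predicting on the training data, as done here and in the post, says little about how the model generalizes; held-out samples or cross-validation would be needed for that.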
Re: [R] two questions for R beginners
On Thu, Feb 25, 2010 at 9:31 AM, Patrick Burns pbu...@pburns.seanet.com wrote:

* What were your biggest misconceptions or stumbling blocks to getting up and running with R?

1- Compared to other programming languages, it is hard to learn R by example, because it is hard to find code on the web that does exactly what you are looking for, though sometimes you might get lucky. By contrast, Perl is an easy language to learn by example.

2- The R mailing list. Beginners get frustrated after they struggle for a long time to solve a problem, and the easiest thing then is to send an email to the R mailing list. I did this in the past. The best thing that happened was that my request was neglected, and I had to spend more time on the problem and eventually find a solution by myself. Do not get me wrong, I am not saying that the mailing list is bad, but it should be more organized. Maybe it could be broken down into a couple of other mailing lists. This might bring up a good discussion thread.

* What documents helped you the most in this initial phase?

An Introduction to R by Venables
simpleR – Using R for Introductory Statistics by Verzani
Re: [R] two questions for R beginners
Pat,

Off the bat: beginners and advanced. In addition, splitting by domain would be very helpful, something along the lines of http://cran.r-project.org/web/views/. But we should be careful; we do not want to create 20 other mailing lists :) We have to group things. This would help split the volume of the list and would help in targeting lists by expertise.

Thanks,
Saeed

On Fri, Feb 26, 2010 at 2:08 AM, Patrick Burns pbu...@pburns.seanet.com wrote:

Saeed,

If the R-help list were split, what do you see as the pieces?

Pat

--
Patrick Burns
pbu...@pburns.seanet.com
http://www.burns-stat.com
(home of 'The R Inferno' and 'A Guide for the Unwilling S User')
Re: [R] two questions for R beginners
Hi Ivan,

On 2/26/10 6:30 AM, Ivan Calandra wrote:

You are definitely right... What to do with bad beginner's questions is not a simple issue. If a beginner's mailing list is created, who will answer such questions?

If I subscribe to the beginners mailing list, then I have to expect novice questions and I should be willing to help. Otherwise, I should not be there.

And moreover, the beginners won't take advantage of the other questions (I've personally learned a lot trying to understand the questions and answers to others' problems).

They can still subscribe to the advanced list, but they will know that they are there to observe and learn, not to ask novice questions. You want to ask basic stuff, go to the beginners list :) Not sure if you guys have been on some of the Linux mailing lists out there, but man, let me tell you, some of these lists have an RTFM attitude and they will fry you if you ask novice questions. Frankly, that is understandable, as most of the members are geeks and they have higher expectations. This mailing list is different; I have seen posts from different disciplines: biology, biostats, stats, computer science, oceanography, etc. So, IMO, there should be a beginners list to cope with such a broad committee.

Thanks,
Saeed

And also, as you said, the problems might persist. The beginner's mailing list might be good in one aspect though: the experts who subscribe to it would be willing to help the beginners get started with R, knowing that the questions might not be clearly stated.

As you pointed out, the mailing list is not the best for basic stuff (the question is of course: what is basic?). Not everybody knows colleagues who work with R (I'm personally the first one to use R in my lab). I think, somehow and I have no idea how, documentation and guidance on searching for help should be more accessible as soon as you start with R.
Maybe a clear section on the R homepage or in the Introduction to R manual, something like "where to find help", including all of the most common and useful resources available (from ? and RSiteSearch() to R Wiki and Crantastic).

I hope that this whole discussion might help to make the R world better. Thank you Patrick for initiating it!

Regards,
Ivan

On 2/26/2010 15:09, Paul Hiemstra wrote:

Ivan Calandra wrote:

Since you want input from beginners, here are some thoughts. I had and still have two big problems with R:

- This vectorization thing. I've read many manuals (including The R Inferno), but I'm still not completely clear about it. In simple examples, it's fine. But when it gets a bit more complex, then... Related to it, the *apply functions are still a bit difficult to understand. When I have to use them, I just try one and see what happens. I don't understand them well enough to know which one I need.
- The second problem is where to find the functions/packages I need. There are many options, and that's actually the problem: R Wiki, Rseek, RSiteSearch, Crantastic, etc. When you start with R, you discover that the capabilities of R are almost unlimited, and you don't really know where to start or where to find what you need.

As noted in earlier posts, the mailing list is really great, but some people are really hard on beginners. It was noted in a discussion a few days ago, but it looks like some don't realize how difficult it is at the beginning to formulate a good question, clear, with a self-contained example and so on. Moreover, not everybody speaks English natively. I don't mean that you must help, even when the question is really vague and unclear. I'm just saying that if you don't want to help (whatever the reason), you don't have to say it badly. But in any case, the mailing list is still really helpful. As someone noted (sorry, I erased the email so I don't remember who), it might be a good idea to split it.
Hi everyone,

My 2ct about the mailing list :). I understand that beginners have a hard time formulating a good question. But the problem is that we can't answer the question when it is unclear. So either I:

- Don't bother answering
- Try to discuss with the author of the question, taking lots of time to find out what exactly the question is
- Send a "read the posting guide" answer

I mostly do the first, as I have to get things done during my PhD :). So this leaves us with kind of a problem: the person mailing the list doesn't have the knowledge to ask the right question, the list can't answer properly, and consequently the person mailing the list still doesn't get the information he/she needs. We could start an R-beginner mailing list, but this would also suffer from this problem. What do you guys think? Maybe the mailing list is not the right medium for really basic stuff. For that I would recommend a good R book or (better) a course in R or (even better) some colleagues who work with R that you can ask questions.

cheers,
Paul

Hope that's what you wanted
Ivan

Le 2/26/2010 08:39, Dieter Menne a
Re: [R] two questions for R beginners
Sorry, I meant "community", not "committee".
Re: [R] svm of e1071 package
I think the problem is that you have R configured as 32-bit. If that is the case, then you will only have access to 4 GB of RAM (see http://www.brianmadden.com/blogs/brianmadden/archive/2004/02/19/the-4gb-windows-memory-limit-what-does-it-really-mean.aspx). Try booting up an Ubuntu instance in the cloud and then install R using the 64-bit configuration. I am interested to know if this solves the problem. Let me know.

Thanks,
Saeed

On Tue, Apr 6, 2010 at 5:07 AM, Shyamasree Saha [shs] s...@aber.ac.uk wrote:

Hello List,

I am having great trouble using the svm function in the e1071 package. I have 4 GB of data that I want to use to train an SVM. I am using the Amazon cloud; my Amazon Machine Image (AMI) has 34.2 GB of memory. My R process was killed several times when I tried to use the 4 GB of data for svm. Now I am using a subset of that data, and it is only 1.4 GB. I remove all unnecessary objects before calling svm(). I have monitored the memory consumption and found that before I call svm() my AMI has 25 GB of free memory. After calling svm(), the free memory starts going down, and at the end I have only 1.7 GB of memory and R gives me an error that it cannot create a vector of size 3.4 GB.

It's true that if I do not have enough memory, R cannot create the vector. But my question is: how is the svm function eating up that 25 GB of memory? Is there anything I can do to solve this problem, or is it a problem in the e1071 package? By a problem in the e1071 package, I mean: does svm() in e1071 normally consume that much memory? If svm() really consumes this much memory, then I have to think of some other way to train the SVM. If 34 GB of RAM is not enough for 1.4 GB of data, then I am in trouble. Amazon has a maximum of 68.4 GB of RAM.

Please help. Thanks in advance.

Regards,
Shyama
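To check the 32-bit hypothesis from inside R, something like the following base-R sketch could be run on the instance. It only reports the pointer size of the running R build and the size of one example object; the vector x is an arbitrary illustration, not the poster's data.

```r
# Pointer size in bytes: 4 on a 32-bit build of R, 8 on a 64-bit build
bits <- 8 * .Machine$sizeof.pointer
cat("R build:", R.version$arch, "(", bits, "bit )\n")

# Size of an example object: one million doubles take about 8 MB
x <- numeric(1e6)
print(object.size(x), units = "MB")

# gc() reports current memory use and frees unreferenced objects
rm(x)
invisible(gc())
```

If this reports 64 bit, the address-space limit is not the cause and the memory use of svm() itself would need a closer look.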
Re: [R] In svm(), how to connect quantitative prediction result to categorical result?
I trained a linear SVM and did classification. Looking at the model I have, with a binary response 0/1, the decision values look like this:

    head(svm.model$decision.values)
    2.5 3.1 -1.0

Looking at the fitted values:

    head(svm.model$fitted)
    1 1 0

So it looks like anything less than or equal to 0 is mapped to the negative class (i.e. 0); otherwise it is mapped to the positive class (i.e. 1).

On Fri, Apr 8, 2011 at 8:35 PM, Li, Yunfei yunfei...@wsu.edu wrote:

Hi,

I am studying the SVM functions of the e1071 package for prediction. I found that if the responses in the training data are of factor type, then svm.predict() predicts categories directly; but if the response variables are numerical, the predicted values from svm are continuous quantitative numbers. How can I connect these quantitative numbers to categories? (For example: in an example data set, the response variables are numerical and have two categories, 0 and 1, and the predicted values are continuous quantitative numbers from 0 to 1.3. How can I know which of them represent category 0 and which represent 1?)

Best,
Yunfei Li

--
Research Assistant
Department of Statistics
School of Molecular Biosciences
Biotechnology Life Sciences Building 427
Washington State University
Pullman, WA 99164-7520
Phone: 509-339-5096
http://www.wsu.edu/~ye_lab/people.html
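A small sketch of the difference discussed here, assuming the e1071 package is installed; the simulated x and y are hypothetical. With a numeric response svm() fits a regression model and returns continuous predictions, while wrapping the same response in factor() gives a classifier that predicts the categories directly. The sketch also tabulates the sign of the decision value against the predicted class; which class sits on the positive side depends on how the labels are ordered, which the column name of the decision values shows, so the "<= 0 means class 0" reading is best checked per model rather than assumed.

```r
library(e1071)  # assumes the e1071 package is installed

# Hypothetical data: two features and a 0/1 response stored as numbers
set.seed(7)
x <- matrix(rnorm(200), ncol = 2)
y <- as.integer(x[, 1] + x[, 2] > 0)

# Numeric response: svm() does regression, predictions are continuous
m.reg <- svm(x, y, kernel = "linear")
print(head(predict(m.reg, x)))

# Factor response: svm() does classification, predictions are categories
m.cls <- svm(x, factor(y), kernel = "linear")
pr <- predict(m.cls, x, decision.values = TRUE)
dv <- attr(pr, "decision.values")

# The column name (e.g. "1/0" or "0/1") tells which class is the positive side
print(colnames(dv))
tab <- table(positive.decision = dv[, 1] > 0, predicted = pr)
print(tab)
```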
Re: [R] prediction error in ROCR package when sampled y consists of only one class
Try performing stratified sampling when doing CV. cran.r-project.org/web/packages/ipred

On Fri, Apr 15, 2011 at 11:00 AM, Soyeon Kim yunni0...@gmail.com wrote:

Dear R users,

Hi. I am using the prediction function in the ROCR package. y consists of two classes, 0 and 1. However, since I am using cross-validation, a small sample of y may consist of only one class:

    y
    [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

In this case, the prediction function gives an error:

    Error in prediction(predic, y) : Number of classes is not equal to 2. ROCR currently supports only evaluation of binary classification tasks.

How can I solve this problem?

Thank you,
Soyeon Kim
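The stratified sampling idea can be sketched in base R as below. This is a hand-rolled illustration rather than the ipred interface; the helper stratified_folds() and the toy response y are hypothetical. Fold labels are dealt out separately within each class, so every fold receives both classes as long as each class has at least k members.

```r
# Hypothetical helper: assign each observation to one of k folds,
# dealing the fold labels out separately within each class of y
stratified_folds <- function(y, k) {
  folds <- integer(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    idx <- idx[sample.int(length(idx))]              # shuffle within the class
    folds[idx] <- rep_len(seq_len(k), length(idx))   # deal folds round-robin
  }
  folds
}

# Toy unbalanced response: 90 zeros and 10 ones
set.seed(1)
y <- c(rep(0, 90), rep(1, 10))
folds <- stratified_folds(y, k = 5)

# Every fold now contains both classes
print(table(fold = folds, class = y))
```

Each test set folds == i can then be passed to prediction() without hitting the "Number of classes is not equal to 2" error.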
Re: [R] How to reference a package in academical paper
http://www.iiap.res.in/astrostat/School07/R/html/utils/html/citation.html

On Mon, Mar 7, 2011 at 4:12 PM, Jan Hornych jh.horn...@gmail.com wrote:

Dear,

I am now writing a more formal academic paper and would like to reference an R package. Do you have any recommendation on how to do it? Taking the RODBC package as an example, what would the reference look like?

http://cran.r-project.org/web/packages/RODBC/index.html

Thank you,
Jan
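The linked help page describes citation() from the utils package. A minimal sketch of how it might be used follows; the RODBC call is guarded because it only works when that package is installed.

```r
# Citation for R itself
cit <- citation()
print(cit)

# The same entry as BibTeX, ready to paste into a .bib file
bib <- toBibtex(cit)
print(bib)

# Citation for a specific package; this needs the package to be installed
if (requireNamespace("RODBC", quietly = TRUE)) {
  print(citation("RODBC"))
}
```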