Re: [R] Reasons to Use R (no memory limitations :-))
This thread discussed R memory limitations and compared R's handling of large data with that of S and SAS. Since I routinely use R to process multi-gigabyte data sets on computers with sometimes as little as 256 MB of memory, here are some comments on that. Most memory limitations vanish if R is used with any relational database. [My personal preference is SQLite (the RSQLite package) because of its speed and its no-admin, embedded mode of operation.] The comments below apply to any relational database, unless otherwise stated.

Most people appear to think of database tables as data frames - that is, they store and load the _whole_ data frame in one go - probably because the appropriate function names suggest this approach, and it is a natural mapping. This is convenient if the data set fits fully in memory, but it limits the size of the data set in exactly the same way as not using a database at all. However, by using the SQL language directly, one can expand the size of the data set R is capable of operating on - we just have to stop treating database tables as 'atomic'.

For example, assume we have a table of several million patients and want to analyze some specific subset. The following SQL statement

SELECT * FROM patients WHERE gender='M' AND age BETWEEN 30 AND 35

will bring into R a much smaller data frame than selecting the whole table. [Such a subset selection may even take _less_ time than subsetting the full data frame in R - assuming the table is properly indexed.] Direct SQL statements can also be used to pre-compute summary statistics inside the database and bring only the summaries into R:

SELECT gender, AVG(age) FROM patients GROUP BY gender

will return a data frame of only two rows. Admittedly, if the data set is really large and we cannot operate on subsets of it, the above does not help - though I do not believe that is the majority of situations. Naturally, going for a 64-bit system with enough memory will solve some problems without using a database - but not all of them.
Relational databases can be very efficient at selecting subsets, since they do not have to do linear scans [when the tables are indexed] - while R has to do a linear scan every time (I did not look up the R source code - please correct me if I am wrong). Two other areas where a database is better than R, especially for large data sets:

- verification of data correctness for individual points [a frequent problem with large data sets]
- combining data from several different types of tables into one data frame

In summary: using SQL from R allows one to process extremely large data sets in limited memory, sometimes even faster than if we had a large amount of memory and kept the data set fully in it. A relational database perfectly complements R's capabilities. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
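To make the patients example above concrete: here is the same pair of queries run end to end, sketched in Python's built-in sqlite3 module purely so the snippet is self-contained and runnable (in R the analogous call would be dbGetQuery() from the DBI/RSQLite packages; the SQL itself carries over unchanged, and the table contents are invented for illustration).

```python
import sqlite3

# An in-memory database stands in for a large on-disk file; the table and
# values are invented for illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE patients (name TEXT, gender TEXT, age INTEGER)")
con.executemany("INSERT INTO patients VALUES (?, ?, ?)",
                [("a", "M", 32), ("b", "F", 33), ("c", "M", 60), ("d", "F", 41)])

# Subset selection: only the matching rows ever reach the client.
subset = con.execute(
    "SELECT * FROM patients WHERE gender='M' AND age BETWEEN 30 AND 35"
).fetchall()

# Aggregation inside the database: one row per group comes back.
means = con.execute(
    "SELECT gender, AVG(age) FROM patients GROUP BY gender ORDER BY gender"
).fetchall()

print(subset)  # [('a', 'M', 32)]
print(means)   # [('F', 37.0), ('M', 46.0)]
```

Whatever the size of the patients table, the memory the client needs is proportional only to the result set.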
Re: [R] Reasons to Use R
(Ted Harding) wrote: > On 12-Apr-07 10:14:21, Jim Lemon wrote: > >>Charilaos Skiadas wrote: >> >>>A new fortune candidate perhaps? >>> >>>On Apr 10, 2007, at 6:27 PM, Greg Snow wrote: >>> >>> >>> Remember, everything is better than everything else given the right comparison. >> >>Only if we remove the grammatical blip that turns it into an infinite >>regress, i.e. >> >>"Remember, anything is better than everything else given the right >>comparison" >> >>Jim > > > Oh dear, I would be disappointed with that, Jim. > > I was rather enjoying the vision of a "topological sort tree" > (ordered by "better according to some comparison") in which every > single thing had everything else hanging off it, and in turn was > hanging off everything else! > Sorry, Ted, I think Benoit Mandelbrot beat you to it. Jim __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Reasons to Use R
On 12-Apr-07 10:14:21, Jim Lemon wrote: > Charilaos Skiadas wrote: >> A new fortune candidate perhaps? >> >> On Apr 10, 2007, at 6:27 PM, Greg Snow wrote: >> >> >>>Remember, everything is better than everything else given the >>>right comparison. >>> > Only if we remove the grammatical blip that turns it into an infinite > regress, i.e. > > "Remember, anything is better than everything else given the right > comparison" > > Jim Oh dear, I would be disappointed with that, Jim. I was rather enjoying the vision of a "topological sort tree" (ordered by "better according to some comparison") in which every single thing had everything else hanging off it, and in turn was hanging off everything else! Ted. E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 12-Apr-07 Time: 11:45:05 -- XFMail -- __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Reasons to Use R
Lucke, Joseph F writes: > A re-interpretation of Zorn's lemma? > > -Original Message- > From: [EMAIL PROTECTED] > [mailto:[EMAIL PROTECTED] On Behalf Of Jim Lemon > Sent: Thursday, April 12, 2007 5:14 AM > To: [EMAIL PROTECTED] > Subject: Re: [R] Reasons to Use R > > Charilaos Skiadas wrote: > > A new fortune candidate perhaps? > > > > On Apr 10, 2007, at 6:27 PM, Greg Snow wrote: > > > > > >>Remember, everything is better than everything else given the right > >>comparison. > >> > Only if we remove the grammatical blip that turns it into an infinite > regress, i.e. > > "Remember, anything is better than everything else given the right > comparison" > > Jim Anything is potentially better than any other thing given the right comparison. Joel -- Joel J. Adamson Biostatistician Pediatric Psychopharmacology Research Unit Massachusetts General Hospital Boston, MA 02114 (617) 643-1432 (303) 880-3109 The information transmitted in this electronic communication...{{dropped}} __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Reasons to Use R
A re-interpretation of Zorn's lemma? -Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Jim Lemon Sent: Thursday, April 12, 2007 5:14 AM To: [EMAIL PROTECTED] Subject: Re: [R] Reasons to Use R Charilaos Skiadas wrote: > A new fortune candidate perhaps? > > On Apr 10, 2007, at 6:27 PM, Greg Snow wrote: > > >>Remember, everything is better than everything else given the right >>comparison. >> Only if we remove the grammatical blip that turns it into an infinite regress, i.e. "Remember, anything is better than everything else given the right comparison" Jim __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Reasons to Use R
Douglas Bates writes: > One > can do data analysis by using the computer as a blunt instrument with > which to bludgeon the problem to death but one can't do elegant data > analysis like that. One nice thing about a "blunt instrument" like Stata is the ability to hold an entire dataset in memory and interactively play with the model and generate new variables all in one session. I figure out what I want interactively and then separate the data management and analysis in .do-files, then run them in batch mode. However, when I first read of the approach of using Perl, sed or awk to manage data and then only doing the analysis in R, I immediately thought "Wow, that is a really great idea, I never thought of it like that before." It would really get me to think about the modelling and the data management clearly. A little voice said "Dude, you're not using a PDP-11...(oh wait, that might be kinda cool)" but the logic of it immediately made sense. I consider it a big part of my re-Unix-ization. Joel -- Joel J. Adamson Biostatistician Pediatric Psychopharmacology Research Unit Massachusetts General Hospital Boston, MA 02114 (617) 643-1432 (303) 880-3109 The information transmitted in this electronic communication...{{dropped}} __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
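The pre-filtering idea Joel describes - let a streaming tool reduce the data before the statistics package ever sees it - can be sketched as follows (in Python rather than awk/sed/Perl, so the example is self-contained; the file contents are invented):

```python
import csv
import io

# A StringIO stands in for a large file on disk; in practice this would be
# open("big.csv") with a file far too large to load whole.
big_csv = io.StringIO("id,gender,age\n1,M,32\n2,F,33\n3,M,60\n")

# Stream one record at a time (as awk or Perl would) and keep only what the
# analysis needs; memory use stays flat no matter how large the input is.
kept = [row for row in csv.DictReader(big_csv)
        if row["gender"] == "M" and 30 <= int(row["age"]) <= 35]

print(kept)  # [{'id': '1', 'gender': 'M', 'age': '32'}]
```

Only the filtered rows would then be handed to R for modelling.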
Re: [R] Reasons to Use R
Charilaos Skiadas wrote: > A new fortune candidate perhaps? > > On Apr 10, 2007, at 6:27 PM, Greg Snow wrote: > > >>Remember, everything is better than everything else given the right >>comparison. >> Only if we remove the grammatical blip that turns it into an infinite regress, i.e. "Remember, anything is better than everything else given the right comparison" Jim __ [EMAIL PROTECTED] mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Reasons to Use R
On Wed, 11 Apr 2007, Alan Zaslavsky wrote: > I have thought for a long time that a facility for efficient rowwise > calculations might be a valuable enhancement to S/R. The storage of the > object would be handled by a database and there would have to be an > efficient interface for pulling a row (or small chunk of rows) out of the > database repeatedly; alternatively the operations could be conducted inside > the database. Basic operations of rowwise calculation and cumulation > (such as forming a column sum or a sum of outer-products) would be > written in an R-like syntax and translated into an efficient set of > operations that work through the database. (Would be happy to share > some jejune notes on this.) However the main answer to this problem > in the R world seems to have been Moore's Law. Perhaps somebody could > tell us more about the S-Plus large objects library, or the work that > Doug Bates is doing on efficient calculations with large datasets. > I have been surprised to find how much you can get done in SQL, only transferring summaries of the data into R. There is soon going to be an experimental "surveyNG" package that works with survey data stored in a SQLite database without transferring the whole thing into R for most operations (and I could get further if SQLite had the log() and exp() functions that most other SQL implementations for large databases provide). I'll be submitting a paper on this to useR2007. The approach of transferring blocks of data into R and using a database just as backing store will allow more general computation, but it will be less efficient than performing the computation in the database, so a mixture of both is likely to be helpful. Moore's Law will settle some issues, but there are problems where it is working to increase the size of datasets just as fast as it increases computational power. -thomas Thomas Lumley, Assoc. Professor, Biostatistics [EMAIL PROTECTED] University of Washington, Seattle
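The log() and exp() Thomas misses are indeed absent from stock SQLite, but the host application can register them as user-defined SQL functions that run inside queries. A minimal sketch of the mechanism using Python's sqlite3 and its create_function API (an illustration of what the embedding application can do, not an R recipe - RSQLite did not expose an equivalent hook at the time):

```python
import math
import sqlite3

con = sqlite3.connect(":memory:")

# SQLite ships without log()/exp(); the host application can supply them
# as user-defined SQL functions callable from any query.
con.create_function("log", 1, math.log)
con.create_function("exp", 1, math.exp)

con.execute("CREATE TABLE w (x REAL)")
con.executemany("INSERT INTO w VALUES (?)", [(1.0,), (math.e,)])

# The log-sum is computed entirely inside the database engine;
# only a single number is transferred back.
total = con.execute("SELECT SUM(log(x)) FROM w").fetchone()[0]
print(total)  # 1.0, i.e. log(1) + log(e)
```

With such functions in place, log-likelihood-style accumulations can stay on the database side, exactly as the surveyNG approach wants.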
Re: [R] Reasons to Use R
On 4/11/07, Robert Duval <[EMAIL PROTECTED]> wrote: > So I guess my question is... > > Is there any hope of R being modified on its core in order to handle > more graciously large datasets? (You've mentioned SAS and SPSS, I'd > add Stata to the list). > > Or should we (the users of large datasets) expect to keep on working > with the present tools for the time to come? We're certainly aware of the desire of many users to be able to handle large data sets. I have just spent a couple of days working with a student from another department who wanted to work with a very large data set that was poorly structured. Most of my time was spent trying to convince her about the limitations in the structure of her data and what could realistically be expected to be computed with it. If your purpose is to perform data manipulation and extraction on large data sets then I think that it is not unreasonable to be expected to learn to use SQL. I find it convenient to use R to do data manipulation because I know the language and the support tools well but I don't expect to do data cleaning on millions of records with it. I am probably too conservative in what I will ask R to handle for me because I started using S on a Vax-11/750 that had 2 megabytes of memory and it's hard to break old habits. I think the trend in working with large data sets in R will be toward a hybrid approach of using a database for data storage and retrieval plus R for the model definition and computation. Miguel Manese's SQLiteDF package and some of the work in Bioconductor are steps in this direction. However, as was mentioned earlier in this thread, there is an underlying assumption with R that the user is thinking about the analysis as he/she is doing it. We sometimes see questions about "I have a data set with (some large number) of records on several hundred or thousands of variables" and I want to fit a generalized linear model to it. 
I would be hard pressed to think of a situation where I wanted hundreds of variables in a statistical model unless they are generated from one or more factors that have many levels. And, in that case, I would want to use random effects rather than fixed effects in a model. So just saying that the big challenge is to fit some kind of model with lots of coefficients to a very large number of observations may be missing the point. Defining the model better may be the point. Let me conclude by saying that these are general observations and not directed to you personally, Robert. I don't know what you want R to do graciously to large data sets so my response is more to the general point that there should always be a balance between thinking about the structure of the data and the model and brute force computation. One can do data analysis by using the computer as a blunt instrument with which to bludgeon the problem to death but one can't do elegant data analysis like that. > > robert > > On 4/11/07, Marc Schwartz <[EMAIL PROTECTED]> wrote: > > On Wed, 2007-04-11 at 11:26 -0500, Marc Schwartz wrote: > > > On Wed, 2007-04-11 at 17:56 +0200, Bi-Info > > > (http://members.home.nl/bi-info) wrote: > > > > I certainly have that idea too. SPSS functions in a way the same, > > > > although it specialises in PC applications. Memory addition to a PC is > > > > not a very expensive thing these days. On my first AT some extra memory > > > > cost 300 dollars or more. These days you get extra memory with a package > > > > of marshmellows or chocolate bars if you need it. > > > > All computations on a computer are discrete steps in a way, but I've > > > > heard that SAS computations are split up in strictly divided steps. That > > > > also makes procedures "attachable" I've been told, and interchangable. > > > > Different procedures can use the same code which alternatively is > > > > cheaper in memory usages or disk usage (the old days...). 
That makes SAS > > > > by the way a complicated machine to build because procedures who are > > > > split up into numerous fragments which make complicated bookkeeping. If > > > > you do it that way, I've been told, you can do a lot of computations > > > > with very little memory. One guy actually computed quite complicated > > > > models with "only 32MB or less", which wasn't very much for "his type of > > > > calculations". Which means that SAS is efficient in memory handling I > > > > think. It's not very efficient in dollar handling... I estimate. > > > > > > > > Wilfred > > > > > > > > > > > > OhSAS is quite efficient in dollar handling, at least when it comes > > > to the annual commercial licenses...along the same lines as the > > > purported efficiency of the U.S. income tax system: > > > > > > "How much money do you have? Send it in..." > > > > > > There is a reason why SAS is the largest privately held software company > > > in the world and it is not due to the academic licensing structure, > > > which constitutes only about 12% of their revenue, based upon th
Re: [R] Reasons to Use R
I think the reason Stata is fast is that it only keeps one working table in RAM. If you just keep one data frame in R, it will run fast too. But ... On 4/11/07, Robert Duval <[EMAIL PROTECTED]> wrote: > So I guess my question is... > > Is there any hope of R being modified on its core in order to handle > more graciously large datasets? (You've mentioned SAS and SPSS, I'd > add Stata to the list). > > Or should we (the users of large datasets) expect to keep on working > with the present tools for the time to come? > > robert > > On 4/11/07, Marc Schwartz <[EMAIL PROTECTED]> wrote: > > On Wed, 2007-04-11 at 11:26 -0500, Marc Schwartz wrote: > > > On Wed, 2007-04-11 at 17:56 +0200, Bi-Info > > > (http://members.home.nl/bi-info) wrote: > > > > I certainly have that idea too. SPSS functions in a way the same, > > > > although it specialises in PC applications. Memory addition to a PC is > > > > not a very expensive thing these days. On my first AT some extra memory > > > > cost 300 dollars or more. These days you get extra memory with a package > > > > of marshmellows or chocolate bars if you need it. > > > > All computations on a computer are discrete steps in a way, but I've > > > > heard that SAS computations are split up in strictly divided steps. That > > > > also makes procedures "attachable" I've been told, and interchangable. > > > > Different procedures can use the same code which alternatively is > > > > cheaper in memory usages or disk usage (the old days...). That makes SAS > > > > by the way a complicated machine to build because procedures who are > > > > split up into numerous fragments which make complicated bookkeeping. If > > > > you do it that way, I've been told, you can do a lot of computations > > > > with very little memory. One guy actually computed quite complicated > > > > models with "only 32MB or less", which wasn't very much for "his type of > > > > calculations". Which means that SAS is efficient in memory handling I > > > > think. 
It's not very efficient in dollar handling... I estimate. > > > > > > > > Wilfred > > > > > > > > > > > > OhSAS is quite efficient in dollar handling, at least when it comes > > > to the annual commercial licenses...along the same lines as the > > > purported efficiency of the U.S. income tax system: > > > > > > "How much money do you have? Send it in..." > > > > > > There is a reason why SAS is the largest privately held software company > > > in the world and it is not due to the academic licensing structure, > > > which constitutes only about 12% of their revenue, based upon their > > > public figures. > > > > Hmmm..here is a classic example of the problems of reading pie > > charts. > > > > The figure I quoted above, which is from reading the 2005 SAS Annual > > Report on their web site (such as it is for a private company) comes > > from a 3D exploded pie chart (ick...). > > > > The pie chart uses 3 shades of grey and 5 shades of blue to > > differentiate 8 market segments and their percentages of total worldwide > > revenue. > > > > I mis-read the 'shade of grey' allocated to Education as being 12% > > (actually 11.7%). > > > > A re-read of the chart, zooming in close on the pie in a PDF reader, > > appears to actually show that Education is but 1.8% of their annual > > worldwide revenue. > > > > Government based installations, which are presumably the other notable > > market segment in which substantially discounted licenses are provided, > > is 14.6%. > > > > The report is available here for anyone else curious: > > > > http://www.sas.com/corporate/report05/annualreport05.pdf > > > > Somebody needs to send SAS a copy of Tufte or Cleveland. > > > > I have to go and rest my eyes now... 
;-) > > > > Regards, > > > > Marc > > > > __ > > R-help@stat.math.ethz.ch mailing list > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > > and provide commented, minimal, self-contained, reproducible code. > > > > __ > R-help@stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > -- WenSui Liu A lousy statistician who happens to know a little programming (http://spaces.msn.com/statcompute/blog) __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Reasons to Use R
So I guess my question is... Is there any hope of R being modified on its core in order to handle more graciously large datasets? (You've mentioned SAS and SPSS, I'd add Stata to the list). Or should we (the users of large datasets) expect to keep on working with the present tools for the time to come? robert On 4/11/07, Marc Schwartz <[EMAIL PROTECTED]> wrote: > On Wed, 2007-04-11 at 11:26 -0500, Marc Schwartz wrote: > > On Wed, 2007-04-11 at 17:56 +0200, Bi-Info > > (http://members.home.nl/bi-info) wrote: > > > I certainly have that idea too. SPSS functions in a way the same, > > > although it specialises in PC applications. Memory addition to a PC is > > > not a very expensive thing these days. On my first AT some extra memory > > > cost 300 dollars or more. These days you get extra memory with a package > > > of marshmellows or chocolate bars if you need it. > > > All computations on a computer are discrete steps in a way, but I've > > > heard that SAS computations are split up in strictly divided steps. That > > > also makes procedures "attachable" I've been told, and interchangable. > > > Different procedures can use the same code which alternatively is > > > cheaper in memory usages or disk usage (the old days...). That makes SAS > > > by the way a complicated machine to build because procedures who are > > > split up into numerous fragments which make complicated bookkeeping. If > > > you do it that way, I've been told, you can do a lot of computations > > > with very little memory. One guy actually computed quite complicated > > > models with "only 32MB or less", which wasn't very much for "his type of > > > calculations". Which means that SAS is efficient in memory handling I > > > think. It's not very efficient in dollar handling... I estimate. > > > > > > Wilfred > > > > > > > > OhSAS is quite efficient in dollar handling, at least when it comes > > to the annual commercial licenses...along the same lines as the > > purported efficiency of the U.S. 
income tax system: > > > > "How much money do you have? Send it in..." > > > > There is a reason why SAS is the largest privately held software company > > in the world and it is not due to the academic licensing structure, > > which constitutes only about 12% of their revenue, based upon their > > public figures. > > Hmmm..here is a classic example of the problems of reading pie > charts. > > The figure I quoted above, which is from reading the 2005 SAS Annual > Report on their web site (such as it is for a private company) comes > from a 3D exploded pie chart (ick...). > > The pie chart uses 3 shades of grey and 5 shades of blue to > differentiate 8 market segments and their percentages of total worldwide > revenue. > > I mis-read the 'shade of grey' allocated to Education as being 12% > (actually 11.7%). > > A re-read of the chart, zooming in close on the pie in a PDF reader, > appears to actually show that Education is but 1.8% of their annual > worldwide revenue. > > Government based installations, which are presumably the other notable > market segment in which substantially discounted licenses are provided, > is 14.6%. > > The report is available here for anyone else curious: > > http://www.sas.com/corporate/report05/annualreport05.pdf > > Somebody needs to send SAS a copy of Tufte or Cleveland. > > I have to go and rest my eyes now... ;-) > > Regards, > > Marc > > __ > R-help@stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Reasons to Use R
On Wed, 2007-04-11 at 11:26 -0500, Marc Schwartz wrote: > On Wed, 2007-04-11 at 17:56 +0200, Bi-Info > (http://members.home.nl/bi-info) wrote: > > I certainly have that idea too. SPSS functions in a way the same, > > although it specialises in PC applications. Memory addition to a PC is > > not a very expensive thing these days. On my first AT some extra memory > > cost 300 dollars or more. These days you get extra memory with a package > > of marshmellows or chocolate bars if you need it. > > All computations on a computer are discrete steps in a way, but I've > > heard that SAS computations are split up in strictly divided steps. That > > also makes procedures "attachable" I've been told, and interchangable. > > Different procedures can use the same code which alternatively is > > cheaper in memory usages or disk usage (the old days...). That makes SAS > > by the way a complicated machine to build because procedures who are > > split up into numerous fragments which make complicated bookkeeping. If > > you do it that way, I've been told, you can do a lot of computations > > with very little memory. One guy actually computed quite complicated > > models with "only 32MB or less", which wasn't very much for "his type of > > calculations". Which means that SAS is efficient in memory handling I > > think. It's not very efficient in dollar handling... I estimate. > > > > Wilfred > > > > OhSAS is quite efficient in dollar handling, at least when it comes > to the annual commercial licenses...along the same lines as the > purported efficiency of the U.S. income tax system: > > "How much money do you have? Send it in..." > > There is a reason why SAS is the largest privately held software company > in the world and it is not due to the academic licensing structure, > which constitutes only about 12% of their revenue, based upon their > public figures. Hmmm..here is a classic example of the problems of reading pie charts. 
The figure I quoted above, which is from reading the 2005 SAS Annual Report on their web site (such as it is for a private company) comes from a 3D exploded pie chart (ick...). The pie chart uses 3 shades of grey and 5 shades of blue to differentiate 8 market segments and their percentages of total worldwide revenue. I mis-read the 'shade of grey' allocated to Education as being 12% (actually 11.7%). A re-read of the chart, zooming in close on the pie in a PDF reader, appears to actually show that Education is but 1.8% of their annual worldwide revenue. Government based installations, which are presumably the other notable market segment in which substantially discounted licenses are provided, is 14.6%. The report is available here for anyone else curious: http://www.sas.com/corporate/report05/annualreport05.pdf Somebody needs to send SAS a copy of Tufte or Cleveland. I have to go and rest my eyes now... ;-) Regards, Marc __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Reasons to Use R [Broadcast]
From: Douglas Bates > > On 4/10/07, Wensui Liu <[EMAIL PROTECTED]> wrote: > > Greg, > > As far as I understand, SAS is more efficient handling large data > > probably than S+/R. Do you have any idea why? > > SAS originated at a time when large data sets were stored on > magnetic tape and the only reasonable way to process them was > sequentially. > Thus most statistics procedures in SAS act as filters, > processing one record at a time and accumulating summary > information. In the past SAS performed a least squares fit > by accumulating the crossproduct of [X:y] and then using the > sweep operator to reduce that matrix. For such an > approach the number of observations does not affect the > amount of storage required. Adding observations just > requires more time. > > This works fine (although there are numerical disadvantages > to this approach - try mentioning the sweep operator to an > expert in numerical linear algebra - you get a blank stare) For those who stared blankly at the above: The sweep operator is just a fancier version of good old Gaussian elimination... Andy > as long as the operations that you wish to perform fit into > this model. Making the desired operations fit into the model > is the primary reason for the awkwardness in many SAS analyses. > > The emphasis in R is on flexibility and the use of good > numerical techniques - not on processing large data sets > sequentially. The algorithms used in R for most least > squares fits generate and analyze the complete model matrix > instead of summary quantities. (The algorithms in the biglm > package are a compromise that work on horizontal sections of > the model matrix.) > > If your only criterion for comparison is the ability to work > with very large data sets performing operations that can fit > into the filter model used by SAS then SAS will be a better > choice. 
However you do lock yourself into a certain set of > operations and you are doing it to save memory, which is a > commodity that decreases in price very rapidly. > > As mentioned in other replies, for many years the majority of > SAS uses are for data manipulation rather than for > statistical analysis so the filter model has been modified in > later versions. > > > > > > > On 4/10/07, Greg Snow <[EMAIL PROTECTED]> wrote: > > > > -Original Message- > > > > From: [EMAIL PROTECTED] > > > > [mailto:[EMAIL PROTECTED] On Behalf Of Bi-Info > > > > (http://members.home.nl/bi-info) > > > > Sent: Monday, April 09, 2007 4:23 PM > > > > To: Gabor Grothendieck > > > > Cc: Lorenzo Isella; r-help@stat.math.ethz.ch > > > > Subject: Re: [R] Reasons to Use R > > > > > > [snip] > > > > > > > So what's the big deal about S using files instead of > memory like > > > > R. I don't get the point. Isn't there enough swap space for S? > > > > (Who cares > > > > anyway: it works, isn't it?) Or are there any problems > with S and > > > > large datasets? I don't get it. You use them, Greg. So > you might > > > > discuss that issue. > > > > > > > > Wilfred > > > > > > > > > > > > > > This is my understanding of the issue (not anything official). > > > > > > If you use up all the memory while in R, then the OS will start > > > swapping memory to disk, but the OS does not know what parts of > > > memory correspond to which objects, so it is entirely > possible that > > > the chunk swapped to disk contains parts of different > data objects, > > > so when you need one of those objects again, everything > needs to be > > > swapped back in. This is very inefficient. > > > > > > S-PLUS occasionally runs into the same problem, but since it does > > > some of its own swapping to disk it can be more efficient by > > > swapping single data objects (data frames, etc.). 
Also, since > > > S-PLUS is already saving everything to disk, it does not actually > > > need to do a full swap, it can just look and see that a > particular > > > data frame has not been used for a while, know that it is already > > > saved on the disk, and unload it from memory without > having to write it to disk first. > > > > > > The g.data package for R has some of this functionality > of keeping > > > data on the disk until needed. > > > > > > The better approach for large data sets is to only have some of the data in memory at a time and to automatically read just the parts that you need.
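Andy's aside about the sweep operator can be made concrete. Below is a minimal sketch (in Python, purely for illustration; the toy data and three-variable crossproduct matrix are made up) of the reduction Doug describes: sweep the crossproduct matrix of [1, x, y] on the two predictor rows, and the regression coefficients appear in the last column, with the residual sum of squares in the corner.

```python
# Illustrative sweep operator (a fancier Gaussian elimination, as Andy
# says). Sweeping the crossproduct matrix of [1, x, y] on the predictor
# rows yields the least-squares coefficients; the data below are made up.

def sweep(A, k):
    """Sweep the symmetric matrix A (list of lists) on pivot index k."""
    d = A[k][k]
    n = len(A)
    B = [[A[i][j] - A[i][k] * A[k][j] / d for j in range(n)] for i in range(n)]
    for i in range(n):
        B[i][k] = A[i][k] / d
        B[k][i] = A[k][i] / d
    B[k][k] = -1.0 / d
    return B

# Crossproduct matrix Z'Z for Z = [1, x, y], with x = (1, 2, 3), y = 1 + 2x
A = [[3, 6, 15], [6, 14, 34], [15, 34, 83]]
A = sweep(sweep(A, 0), 1)            # sweep on the intercept and slope rows
intercept, slope = A[0][2], A[1][2]  # coefficients appear in the last column
rss = A[2][2]                        # residual sum of squares in the corner
```

However many observations went into the crossproduct matrix, the matrix itself stays 3x3, which is why adding observations costs only time, not storage.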
Re: [R] Reasons to Use R
thanks, I will take a look. __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Reasons to Use R
On Wed, 2007-04-11 at 17:56 +0200, Bi-Info (http://members.home.nl/bi-info) wrote: > I certainly have that idea too. SPSS functions in much the same way, > although it specialises in PC applications. Memory addition to a PC is > not a very expensive thing these days. On my first AT some extra memory > cost 300 dollars or more. These days you get extra memory with a package > of marshmallows or chocolate bars if you need it. > All computations on a computer are discrete steps in a way, but I've > heard that SAS computations are split up in strictly divided steps. That > also makes procedures "attachable" I've been told, and interchangeable. > Different procedures can use the same code, which alternatively is > cheaper in memory usage or disk usage (the old days...). That makes SAS > by the way a complicated machine to build, because procedures that are > split up into numerous fragments make for complicated bookkeeping. If > you do it that way, I've been told, you can do a lot of computations > with very little memory. One guy actually computed quite complicated > models with "only 32MB or less", which wasn't very much for "his type of > calculations". Which means that SAS is efficient in memory handling, I > think. It's not very efficient in dollar handling... I estimate. > > Wilfred Oh, SAS is quite efficient in dollar handling, at least when it comes to the annual commercial licenses...along the same lines as the purported efficiency of the U.S. income tax system: "How much money do you have? Send it in..." There is a reason why SAS is the largest privately held software company in the world, and it is not due to the academic licensing structure, which constitutes only about 12% of their revenue, based upon their public figures. Since SPSS is mentioned, it also functions using similar economic models...
:-) Regards, Marc Schwartz
Re: [R] Reasons to Use R
> -Original Message- > From: [EMAIL PROTECTED] > [mailto:[EMAIL PROTECTED] On Behalf Of Alan Zaslavsky > Sent: Wednesday, April 11, 2007 9:07 AM > To: R-help@stat.math.ethz.ch > Subject: [R] Reasons to Use R [snip] > I have thought for a long time that a facility for efficient > rowwise calculations might be a valuable enhancement to S/R. > The storage of the object would be handled by a database and > there would have to be an efficient interface for pulling a > row (or small chunk of rows) out of the database repeatedly; > alternatively the operations could be conducted inside the > database. Basic operations of rowwise calculation and > cumulation (such as forming a column sum or a sum of > outer-products) would be written in an R-like syntax and > translated into an efficient set of operations that work > through the database. (Would be happy to share some jejune > notes on this.) The biglm and SQLiteDF packages have made a start in this direction (unless I am misunderstanding you); adding functionality to either of those seems the best use of effort. > However the main answer to this problem in > the R world seems to have been Moore's Law. Perhaps somebody > could tell us more about the S-Plus large objects library, or > the work that Doug Bates is doing on efficient calculations > with large datasets. This link gives an overview and some detail of the S-PLUS big data library: http://www.insightful.com/support/splus70win/eduguide.pdf > Alan Zaslavsky > [EMAIL PROTECTED] -- Gregory (Greg) L. Snow Ph.D. Statistical Data Center Intermountain Healthcare [EMAIL PROTECTED] (801) 408-8111
Re: [R] Reasons to Use R
Rajarshi Guha wrote: > On Wed, 2007-04-11 at 11:06 -0400, Alan Zaslavsky wrote: > > > I have thought for a long time that a facility for efficient rowwise > > calculations might be a valuable enhancement to S/R. The storage of the > > object would be handled by a database and there would have to be an > > efficient interface for pulling a row (or small chunk of rows) out of the > > database repeatedly; alternatively the operations could be conducted inside > > the database. > > You can embed R inside postgres, though I don't know how efficient this > would be. But it does allow one to operate on a per-row basis. > > http://www.omegahat.org/RSPostgres/ I still like this idea a lot and a more recent implementation of it was created by Joe Conway and can be found at http://www.joeconway.com/plr/ D. > > --- > Rajarshi Guha <[EMAIL PROTECTED]> > GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE > --- > Finally I am becoming stupider no more > - Paul Erdos' epitaph -- Duncan Temple Lang [EMAIL PROTECTED] Department of Statistics work: (530) 752-4782 4210 Mathematical Sciences Bldg. fax: (530) 752-7099 One Shields Ave. University of California at Davis Davis, CA 95616, USA
Re: [R] Reasons to Use R
On Wed, 2007-04-11 at 11:06 -0400, Alan Zaslavsky wrote: > I have thought for a long time that a facility for efficient rowwise > calculations might be a valuable enhancement to S/R. The storage of the > object would be handled by a database and there would have to be an > efficient interface for pulling a row (or small chunk of rows) out of the > database repeatedly; alternatively the operations could be conducted inside > the database. You can embed R inside postgres, though I don't know how efficient this would be. But it does allow one to operate on a per-row basis. http://www.omegahat.org/RSPostgres/ --- Rajarshi Guha <[EMAIL PROTECTED]> GPG Fingerprint: 0CCA 8EE2 2EEB 25E2 AB04 06F7 1BB9 E634 9B87 56EE --- Finally I am becoming stupider no more - Paul Erdos' epitaph
Re: [R] Reasons to Use R
I certainly have that idea too. SPSS functions in much the same way, although it specialises in PC applications. Memory addition to a PC is not a very expensive thing these days. On my first AT some extra memory cost 300 dollars or more. These days you get extra memory with a package of marshmallows or chocolate bars if you need it. All computations on a computer are discrete steps in a way, but I've heard that SAS computations are split up in strictly divided steps. That also makes procedures "attachable" I've been told, and interchangeable. Different procedures can use the same code, which alternatively is cheaper in memory usage or disk usage (the old days...). That makes SAS by the way a complicated machine to build, because procedures that are split up into numerous fragments make for complicated bookkeeping. If you do it that way, I've been told, you can do a lot of computations with very little memory. One guy actually computed quite complicated models with "only 32MB or less", which wasn't very much for "his type of calculations". Which means that SAS is efficient in memory handling, I think. It's not very efficient in dollar handling... I estimate. Wilfred -- Certainly true. In particular, SAS was designed from the start to store data items on disk, and to read into core memory the minimum needed for a particular calculation. The kind of data SAS handles is (for the most part) limited to rectangular arrays, similar to R data frames. In many procedures they can be read from disk sequentially (row by row), which undoubtedly simplifies memory handling. It seems logical to suppose that in developing SAS, algorithms were chosen to support that style of memory management. Finally, the style of writing programs in SAS consists of discrete steps of computation, between which nothing but the program need be held in core memory.
"Gabor Grothendieck" <[EMAIL PROTECTED]> wrote: > I think SAS was developed at a time when computer memory was > much smaller than it is now and the legacy of that is its better > usage of computer resources. > > On 4/10/07, Wensui Liu <[EMAIL PROTECTED]> wrote: > > Greg, > > As far as I understand, SAS is more efficient handling large data > > probably than S+/R. Do you have any idea why? -- Mike Prager, NOAA, Beaufort, NC * Opinions expressed are personal and not represented otherwise. * Any use of tradenames does not constitute a NOAA endorsement.
[R] Reasons to Use R
Right: SAS objects (at least in the base and statistics components of the system -- there are dozens of add-ons for particular markets) are simple databases. The predominant model for data manipulation and statistical calculation is a row by row operation that creates modified rows and/or accumulates totals. This was pretty much the only way things could be done in the days when real (and typically virtual) memory was much smaller than it now is. It can be a pretty efficient model for calculations that fit that pattern. One downside of course is that a line of R code can easily turn into 30 lines of SAS with data steps, sort steps, steps to accumulate totals, etc. As noted by a couple of previous writers, S-Plus might be regarded as somewhat intermediate in its model in that objects constitute files but rows do not correspond to chunks of adjacent bytes in memory or filespace. I have thought for a long time that a facility for efficient rowwise calculations might be a valuable enhancement to S/R. The storage of the object would be handled by a database and there would have to be an efficient interface for pulling a row (or small chunk of rows) out of the database repeatedly; alternatively the operations could be conducted inside the database. Basic operations of rowwise calculation and cumulation (such as forming a column sum or a sum of outer-products) would be written in an R-like syntax and translated into an efficient set of operations that work through the database. (Would be happy to share some jejune notes on this.) However the main answer to this problem in the R world seems to have been Moore's Law. Perhaps somebody could tell us more about the S-Plus large objects library, or the work that Doug Bates is doing on efficient calculations with large datasets.
Alan Zaslavsky [EMAIL PROTECTED] > Date: Tue, 10 Apr 2007 16:27:50 -0600 > From: "Greg Snow" <[EMAIL PROTECTED]> > Subject: Re: [R] Reasons to Use R > To: "Wensui Liu" <[EMAIL PROTECTED]> > > I think SAS has the database part built into it. I have heard 2nd hand > of new statisticians going to work for a company and asking if they have > SAS, the reply is "Yes we use SAS for our database, does it do > statistics also?" Also I heard something about SAS no longer being > considered an acronym; they like having it be just a name and don't want > the fact that one of the S's used to stand for statistics to scare away > companies that use it as a database. > > Maybe someone more up on SAS can confirm or deny this.
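Alan's "pull a small chunk of rows out of the database repeatedly" idea can be sketched with any SQL backend. Below is a rough illustration in Python using the standard-library sqlite3 module (the table and column names are invented for the example): it cumulates a column sum one small chunk at a time, so only one chunk is ever in memory, and then checks the result against the same operation conducted inside the database.

```python
# Rough sketch of rowwise cumulation through a database: fetch small
# chunks of rows repeatedly and accumulate a column sum, so only one
# chunk is in memory at a time. Table and column names are invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE patients (age INTEGER)")
con.executemany("INSERT INTO patients VALUES (?)",
                [(a,) for a in range(100)])

total = 0
cur = con.execute("SELECT age FROM patients")
while True:
    chunk = cur.fetchmany(8)       # a small chunk of rows at a time
    if not chunk:
        break
    total += sum(age for (age,) in chunk)

# Alternatively, conduct the operation inside the database itself
(db_total,) = con.execute("SELECT SUM(age) FROM patients").fetchone()
```

The chunk size trades memory for round trips; the same loop shape works for any cumulation (sums of outer-products, counts, and so on).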
Re: [R] Reasons to Use R
Certainly true. In particular, SAS was designed from the start to store data items on disk, and to read into core memory the minimum needed for a particular calculation. The kind of data SAS handles is (for the most part) limited to rectangular arrays, similar to R data frames. In many procedures they can be read from disk sequentially (row by row), which undoubtedly simplifies memory handling. It seems logical to suppose that in developing SAS, algorithms were chosen to support that style of memory management. Finally, the style of writing programs in SAS consists of discrete steps of computation, between which nothing but the program need be held in core memory. "Gabor Grothendieck" <[EMAIL PROTECTED]> wrote: > I think SAS was developed at a time when computer memory was > much smaller than it is now and the legacy of that is its better > usage of computer resources. > > On 4/10/07, Wensui Liu <[EMAIL PROTECTED]> wrote: > > Greg, > > As far as I understand, SAS is more efficient handling large data > > probably than S+/R. Do you have any idea why? -- Mike Prager, NOAA, Beaufort, NC * Opinions expressed are personal and not represented otherwise. * Any use of tradenames does not constitute a NOAA endorsement.
Re: [R] Reasons to Use R
On 4/10/07, Wensui Liu <[EMAIL PROTECTED]> wrote: > Greg, > As far as I understand, SAS is more efficient handling large data > probably than S+/R. Do you have any idea why? SAS originated at a time when large data sets were stored on magnetic tape and the only reasonable way to process them was sequentially. Thus most statistics procedures in SAS act as filters, processing one record at a time and accumulating summary information. In the past SAS performed a least squares fit by accumulating the crossproduct of [X:y] and then using the sweep operator to reduce that matrix. For such an approach the number of observations does not affect the amount of storage required. Adding observations just requires more time. This works fine (although there are numerical disadvantages to this approach - try mentioning the sweep operator to an expert in numerical linear algebra - you get a blank stare) as long as the operations that you wish to perform fit into this model. Making the desired operations fit into the model is the primary reason for the awkwardness in many SAS analyses. The emphasis in R is on flexibility and the use of good numerical techniques - not on processing large data sets sequentially. The algorithms used in R for most least squares fits generate and analyze the complete model matrix instead of summary quantities. (The algorithms in the biglm package are a compromise that work on horizontal sections of the model matrix.) If your only criterion for comparison is the ability to work with very large data sets performing operations that can fit into the filter model used by SAS then SAS will be a better choice. However you do lock yourself into a certain set of operations and you are doing it to save memory, which is a commodity that decreases in price very rapidly.
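The filter model described above - one record at a time, accumulating only summary sums, so storage does not grow with the number of observations - looks roughly like this (a Python sketch with made-up data, not SAS code):

```python
# Sketch of the SAS-style filter model: a simple least-squares fit that
# processes one (x, y) record at a time and keeps only five running
# sums, so memory use is independent of the number of observations.
# The data are made up (generated from y = 2 + 3x, so the fit is exact).

def streaming_ols(records):
    n = sx = sy = sxx = sxy = 0.0
    for x, y in records:            # one record at a time, as from tape
        n += 1
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    # Solve the normal equations for intercept a and slope b
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n
    return a, b

a, b = streaming_ols((x, 2.0 + 3.0 * x) for x in range(1000))
```

Accumulating raw sums like this is exactly the approach with the numerical disadvantages Doug mentions: the subtraction in the normal equations can lose precision badly when the data are far from zero.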
As mentioned in other replies, for many years the majority of SAS use has been for data manipulation rather than for statistical analysis, so the filter model has been modified in later versions. > On 4/10/07, Greg Snow <[EMAIL PROTECTED]> wrote: > > > -Original Message- > > > From: [EMAIL PROTECTED] > > > [mailto:[EMAIL PROTECTED] On Behalf Of > > > Bi-Info (http://members.home.nl/bi-info) > > > Sent: Monday, April 09, 2007 4:23 PM > > > To: Gabor Grothendieck > > > Cc: Lorenzo Isella; r-help@stat.math.ethz.ch > > > Subject: Re: [R] Reasons to Use R > > > > [snip] > > > > > So what's the big deal about S using files instead of memory > > > like R. I don't get the point. Isn't there enough swap space > > > for S? (Who cares > > > anyway: it works, isn't it?) Or are there any problems with S > > > and large datasets? I don't get it. You use them, Greg. So > > > you might discuss that issue. > > > > > > Wilfred > > > > > > > > > > This is my understanding of the issue (not anything official). > > > > If you use up all the memory while in R, then the OS will start swapping > > memory to disk, but the OS does not know what parts of memory correspond > > to which objects, so it is entirely possible that the chunk swapped to > > disk contains parts of different data objects, so when you need one of > > those objects again, everything needs to be swapped back in. This is > > very inefficient. > > > > S-PLUS occasionally runs into the same problem, but since it does some > > of its own swapping to disk it can be more efficient by swapping single > > data objects (data frames, etc.). Also, since S-PLUS is already saving > > everything to disk, it does not actually need to do a full swap, it can > > just look and see that a particular data frame has not been used for a > > while, know that it is already saved on the disk, and unload it from > > memory without having to write it to disk first.
> > > > The g.data package for R has some of this functionality of keeping data > > on the disk until needed. > > > > The better approach for large data sets is to only have some of the data > > in memory at a time and to automatically read just the parts that you > > need. So for big datasets it is recommended to have the actual data > > stored in a database and use one of the database connection packages to > > only read in the subset that you need. The SQLiteDF package for R is > > working on automating this process for R. The bigdata > > module for S-PLUS and the biglm package for R also have ways of doing some of > > the common analyses using chunks of data at a time. This idea is not > > new. There was a program in the late 1970s and 80s called Rummage by > > Del Scott (I guess technically it still exists, I have a copy on a 5.25" > > floppy somewhere) that used the approach of specifying the model you wanted > > to fit first, then specifying the data file.
Re: [R] Reasons to Use R
A new fortune candidate perhaps? On Apr 10, 2007, at 6:27 PM, Greg Snow wrote: > Remember, everything is better than everything else given the right > comparison. > > -- > Gregory (Greg) L. Snow Ph.D. Haris Skiadas Department of Mathematics and Computer Science Hanover College
Re: [R] Reasons to Use R
I think SAS has the database part built into it. I have heard 2nd hand of new statisticians going to work for a company and asking if they have SAS, the reply is "Yes we use SAS for our database, does it do statistics also?" Also I heard something about SAS no longer being considered an acronym; they like having it be just a name and don't want the fact that one of the S's used to stand for statistics to scare away companies that use it as a database. Maybe someone more up on SAS can confirm or deny this. Also one issue to always look at is central control versus ease of extendability. If you have a program that is completely under your control and does one set of things, then extending it to a new model (big data) is fairly straightforward. R is the opposite end of the spectrum with many contributors and many techniques. Extending some basic pieces to be very efficient with big data could be done easily, but would break many other pieces. Getting all the different packages to conform to a single standard in a short amount of time would be near impossible. With R's flexibility, there are probably some problems that can be done quicker with a proper use of biglm than with SAS, and I expect that with some more work and maturity the SQLiteDF package may start to rival SAS as well on certain problems. While SAS is a useful program and great at certain things, there are some techniques that I would not even attempt using SAS that are fairly straightforward in R (I remember seeing some SAS code to do a bootstrap that included a datastep to read in and extract information from a SAS output file; SAS/ODS has improved this, but I would much rather bootstrap in R/S-PLUS than anything else). Remember, everything is better than everything else given the right comparison. -- Gregory (Greg) L. Snow Ph.D.
Statistical Data Center Intermountain Healthcare [EMAIL PROTECTED] (801) 408-8111 > -Original Message- > From: Wensui Liu [mailto:[EMAIL PROTECTED] > Sent: Tuesday, April 10, 2007 3:26 PM > To: Greg Snow > Cc: Bi-Info (http://members.home.nl/bi-info); Gabor > Grothendieck; Lorenzo Isella; r-help@stat.math.ethz.ch > Subject: Re: [R] Reasons to Use R > > Greg, > As far as I understand, SAS is more efficient handling large > data probably than S+/R. Do you have any idea why? > > On 4/10/07, Greg Snow <[EMAIL PROTECTED]> wrote: > > > -Original Message- > > > From: [EMAIL PROTECTED] > > > [mailto:[EMAIL PROTECTED] On Behalf Of Bi-Info > > > (http://members.home.nl/bi-info) > > > Sent: Monday, April 09, 2007 4:23 PM > > > To: Gabor Grothendieck > > > Cc: Lorenzo Isella; r-help@stat.math.ethz.ch > > > Subject: Re: [R] Reasons to Use R > > > > [snip] > > > > > So what's the big deal about S using files instead of > memory like R. > > > I don't get the point. Isn't there enough swap space for S? (Who > > > cares > > > anyway: it works, isn't it?) Or are there any problems with S and > > > large datasets? I don't get it. You use them, Greg. So you might > > > discuss that issue. > > > > > > Wilfred > > > > > > > > > > This is my understanding of the issue (not anything official). > > > > If you use up all the memory while in R, then the OS will start > > swapping memory to disk, but the OS does not know what > parts of memory > > correspond to which objects, so it is entirely possible > that the chunk > > swapped to disk contains parts of different data objects, > so when you > > need one of those objects again, everything needs to be > swapped back > > in. This is very inefficient. > > > > S-PLUS occasionally runs into the same problem, but since > it does some > > of its own swapping to disk it can be more efficient by swapping > > single data objects (data frames, etc.). 
Also, since S-PLUS is > > already saving everything to disk, it does not actually > need to do a > > full swap, it can just look and see that a particular data > frame has > > not been used for a while, know that it is already saved on > the disk, > > and unload it from memory without having to write it to disk first. > > > > The g.data package for R has some of this functionality of keeping > > data on the disk until needed. > > > > The better approach for large data sets is to only have some of the > > data in memory at a time and to automatically read just the > parts that > > you need. So for big datasets it is recommended to have the actual > > data stored in a database and use one of the database connection > > packages to only read in the subset that you need.
Re: [R] Reasons to Use R
Greg, As far as I understand, SAS is more efficient handling large data probably than S+/R. Do you have any idea why? On 4/10/07, Greg Snow <[EMAIL PROTECTED]> wrote: > > -Original Message- > > From: [EMAIL PROTECTED] > > [mailto:[EMAIL PROTECTED] On Behalf Of > > Bi-Info (http://members.home.nl/bi-info) > > Sent: Monday, April 09, 2007 4:23 PM > > To: Gabor Grothendieck > > Cc: Lorenzo Isella; r-help@stat.math.ethz.ch > > Subject: Re: [R] Reasons to Use R > > [snip] > > > So what's the big deal about S using files instead of memory > > like R. I don't get the point. Isn't there enough swap space > > for S? (Who cares > > anyway: it works, isn't it?) Or are there any problems with S > > and large datasets? I don't get it. You use them, Greg. So > > you might discuss that issue. > > > > Wilfred > > > > > > This is my understanding of the issue (not anything official). > > If you use up all the memory while in R, then the OS will start swapping > memory to disk, but the OS does not know what parts of memory correspond > to which objects, so it is entirely possible that the chunk swapped to > disk contains parts of different data objects, so when you need one of > those objects again, everything needs to be swapped back in. This is > very inefficient. > > S-PLUS occasionally runs into the same problem, but since it does some > of its own swapping to disk it can be more efficient by swapping single > data objects (data frames, etc.). Also, since S-PLUS is already saving > everything to disk, it does not actually need to do a full swap, it can > just look and see that a particular data frame has not been used for a > while, know that it is already saved on the disk, and unload it from > memory without having to write it to disk first. > > The g.data package for R has some of this functionality of keeping data > on the disk until needed. 
> > The better approach for large data sets is to only have some of the data > in memory at a time and to automatically read just the parts that you > need. So for big datasets it is recommended to have the actual data > stored in a database and use one of the database connection packages to > only read in the subset that you need. The SQLiteDF package for R is > working on automating this process for R. The bigdata > module for S-PLUS and the biglm package for R also have ways of doing some of > the common analyses using chunks of data at a time. This idea is not > new. There was a program in the late 1970s and 80s called Rummage by > Del Scott (I guess technically it still exists, I have a copy on a 5.25" > floppy somewhere) that used the approach of specifying the model you wanted > to fit first, then specifying the data file. Rummage would then figure out > which sufficient statistics were needed and read the data in chunks, > compute the sufficient statistics on the fly, and not keep more than a > couple of lines of the data in memory at once. Unfortunately it did not > have much of a user interface, so when memory was cheap and datasets > only medium sized it did not compete well, I guess it was just a bit too > ahead of its time. > > Hope this helps, > > > > -- > Gregory (Greg) L. Snow Ph.D. > Statistical Data Center > Intermountain Healthcare > [EMAIL PROTECTED] > (801) 408-8111
> -- WenSui Liu A lousy statistician who happens to know a little programming (http://spaces.msn.com/statcompute/blog)
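The Rummage-style approach Greg describes - work out which sufficient statistics the model needs, then read the data in chunks and update them on the fly - might be sketched like this (Python, with a stand-in chunk reader and invented data):

```python
# Sketch of the Rummage idea: the sufficient statistics for a mean and
# variance are (n, sum, sum of squares), so they can be updated from
# chunks of data without ever holding more than one chunk in memory.
# read_chunks is a stand-in for reading successive blocks from a file
# or database; the data are made up.

def read_chunks(data, size):
    for i in range(0, len(data), size):
        yield data[i:i + size]

n = s = ss = 0.0                    # running sufficient statistics
for chunk in read_chunks(list(range(1, 11)), 3):
    n += len(chunk)
    s += sum(chunk)
    ss += sum(v * v for v in chunk)

mean = s / n
var = (ss - s * s / n) / (n - 1)    # sample variance from the sums
```

(The one-pass variance formula here is the numerically naive version, echoing the "numerical disadvantages" mentioned earlier in the thread; a production version would use a stabler update.)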
Re: [R] Reasons to Use R
Taylor, Z Todd wrote: > On Monday, April 09, 2007 3:23 PM, someone named Wilfred wrote: > >> So what's the big deal about S using files instead of memory >> like R. I don't get the point. Isn't there enough swap space >> for S? (Who cares anyway: it works, isn't it?) Or are there >> any problems with S and large datasets? I don't get it. You >> use them, Greg. So you might discuss that issue. > > S's one-to-one correspondence between S objects and filesystem > objects is the single remaining reason I haven't completely > converted over to R. With S I can manage my objects via > makefiles. Corrections to raw data or changes to analysis > scripts get applied to all objects in the project (and there > are often thousands of them) by simply typing 'make'. That > includes everything right down to the graphics that will go > in the report. > > How do people live without that? Personally I'd rather have R's save( ) and load( ). Frank > > --Todd -- Frank E Harrell Jr Professor and Chair School of Medicine Department of Biostatistics Vanderbilt University
Re: [R] Reasons to Use R
Hi Todd, I guess I don't see the difference between that strategy and using make to look after scripts, raw data, Sweave files, and (if necessary) images. I find that I can get pretty fine-grained control over what parts of a project need to be rerun by breaking the analysis into chapters. I suppose it depends on whether one takes a script-centric or an object-centric view of a data analysis project. A script-centric view is nicer for version control. I think that make is centric-neutral :). Cheers, Andrew On Tue, Apr 10, 2007 at 04:23:54PM -0700, Taylor, Z Todd wrote: > On Monday, April 09, 2007 3:23 PM, someone named Wilfred wrote: > > > So what's the big deal about S using files instead of memory > > like R. I don't get the point. Isn't there enough swap space > > for S? (Who cares anyway: it works, isn't it?) Or are there > > any problems with S and large datasets? I don't get it. You > > use them, Greg. So you might discuss that issue. > > S's one-to-one correspondence between S objects and filesystem > objects is the single remaining reason I haven't completely > converted over to R. With S I can manage my objects via > makefiles. Corrections to raw data or changes to analysis > scripts get applied to all objects in the project (and there > are often thousands of them) by simply typing 'make'. That > includes everything right down to the graphics that will go > in the report. > > How do people live without that? > > --Todd > -- > Why is 'abbreviation' such a long word?
-- Andrew Robinson Department of Mathematics and Statistics Tel: +61-3-8344-9763 University of Melbourne, VIC 3010 Australia Fax: +61-3-8344-4599 http://www.ms.unimelb.edu.au/~andrewpr http://blogs.mbs.edu/fishing-in-the-bay/
Re: [R] Reasons to Use R
On Monday, April 09, 2007 3:23 PM, someone named Wilfred wrote: > So what's the big deal about S using files instead of memory > like R. I don't get the point. Isn't there enough swap space > for S? (Who cares anyway: it works, isn't it?) Or are there > any problems with S and large datasets? I don't get it. You > use them, Greg. So you might discuss that issue. S's one-to-one correspondence between S objects and filesystem objects is the single remaining reason I haven't completely converted over to R. With S I can manage my objects via makefiles. Corrections to raw data or changes to analysis scripts get applied to all objects in the project (and there are often thousands of them) by simply typing 'make'. That includes everything right down to the graphics that will go in the report. How do people live without that? --Todd -- Why is 'abbreviation' such a long word?
Re: [R] Reasons to Use R
I think SAS was developed at a time when computer memory was much smaller than it is now and the legacy of that is its better usage of computer resources. On 4/10/07, Wensui Liu <[EMAIL PROTECTED]> wrote: > Greg, > As far as I understand, SAS is more efficient handling large data > probably than S+/R. Do you have any idea why? > > On 4/10/07, Greg Snow <[EMAIL PROTECTED]> wrote: > > > -Original Message- > > > From: [EMAIL PROTECTED] > > > [mailto:[EMAIL PROTECTED] On Behalf Of > > > Bi-Info (http://members.home.nl/bi-info) > > > Sent: Monday, April 09, 2007 4:23 PM > > > To: Gabor Grothendieck > > > Cc: Lorenzo Isella; r-help@stat.math.ethz.ch > > > Subject: Re: [R] Reasons to Use R > > > > [snip] > > > > > So what's the big deal about S using files instead of memory > > > like R. I don't get the point. Isn't there enough swap space > > > for S? (Who cares > > > anyway: it works, isn't it?) Or are there any problems with S > > > and large datasets? I don't get it. You use them, Greg. So > > > you might discuss that issue. > > > > > > Wilfred > > > > > > > > > > This is my understanding of the issue (not anything official). > > > > If you use up all the memory while in R, then the OS will start swapping > > memory to disk, but the OS does not know what parts of memory correspond > > to which objects, so it is entirely possible that the chunk swapped to > > disk contains parts of different data objects, so when you need one of > > those objects again, everything needs to be swapped back in. This is > > very inefficient. > > > > S-PLUS occasionally runs into the same problem, but since it does some > > of its own swapping to disk it can be more efficient by swapping single > > data objects (data frames, etc.). 
Also, since S-PLUS is already saving > > everything to disk, it does not actually need to do a full swap, it can > > just look and see that a particular data frame has not been used for a > > while, know that it is already saved on the disk, and unload it from > > memory without having to write it to disk first. > > > > The g.data package for R has some of this functionality of keeping data > > on the disk until needed. > > > > The better approach for large data sets is to only have some of the data > > in memory at a time and to automatically read just the parts that you > > need. So for big datasets it is recommended to have the actual data > > stored in a database and use one of the database connection packages to > > only read in the subset that you need. The SQLiteDF package for R is > > working on automating this process for R. There are also the bigdata > > module for S-PLUS and the biglm package for R have ways of doing some of > > the common analyses using chunks of data at a time. This idea is not > > new. There was a program in the late 1970s and 80s called Rummage by > > Del Scott (I guess technically it still exists, I have a copy on a 5.25" > > floppy somewhere) that used the approach of specify the model you wanted > > to fit first, then specify the data file. Rummage would then figure out > > which sufficient statistics were needed and read the data in chunks, > > compute the sufficient statistics on the fly, and not keep more than a > > couple of lines of the data in memory at once. Unfortunately it did not > > have much of a user interface, so when memory was cheap and datasets > > only medium sized it did not compete well, I guess it was just a bit too > > ahead of its time. > > > > Hope this helps, > > > > > > > > -- > > Gregory (Greg) L. Snow Ph.D. 
> > Statistical Data Center
> > Intermountain Healthcare
> > [EMAIL PROTECTED]
> > (801) 408-8111

> --
> WenSui Liu
> A lousy statistician who happens to know a little programming
> (http://spaces.msn.com/statcompute/blog)
Re: [R] Reasons to Use R
> -Original Message-
> From: [EMAIL PROTECTED]
> [mailto:[EMAIL PROTECTED] On Behalf Of
> Bi-Info (http://members.home.nl/bi-info)
> Sent: Monday, April 09, 2007 4:23 PM
> To: Gabor Grothendieck
> Cc: Lorenzo Isella; r-help@stat.math.ethz.ch
> Subject: Re: [R] Reasons to Use R

[snip]

> So what's the big deal about S using files instead of memory
> like R. I don't get the point. Isn't there enough swap space
> for S? (Who cares anyway: it works, isn't it?) Or are there
> any problems with S and large datasets? I don't get it. You
> use them, Greg. So you might discuss that issue.
>
> Wilfred

This is my understanding of the issue (not anything official).

If you use up all the memory while in R, then the OS will start swapping memory to disk, but the OS does not know which parts of memory correspond to which objects, so it is entirely possible that the chunk swapped to disk contains parts of several different data objects; when you need one of those objects again, everything needs to be swapped back in. This is very inefficient.

S-PLUS occasionally runs into the same problem, but since it does some of its own swapping to disk it can be more efficient by swapping single data objects (data frames, etc.). Also, since S-PLUS is already saving everything to disk, it does not actually need to do a full swap: it can just see that a particular data frame has not been used for a while, know that it is already saved on the disk, and unload it from memory without having to write it to disk first. The g.data package for R has some of this functionality of keeping data on the disk until needed.

The better approach for large data sets is to have only some of the data in memory at a time and to automatically read just the parts that you need. So for big datasets it is recommended to have the actual data stored in a database and use one of the database connection packages to read in only the subset that you need. The SQLiteDF package is working on automating this process for R. The bigdata module for S-PLUS and the biglm package for R also have ways of doing some of the common analyses using chunks of data at a time.

This idea is not new. There was a program in the late 1970s and 80s called Rummage by Del Scott (I guess technically it still exists, I have a copy on a 5.25" floppy somewhere) that used the approach of specifying the model you wanted to fit first, then specifying the data file. Rummage would then figure out which sufficient statistics were needed, read the data in chunks, compute the sufficient statistics on the fly, and never keep more than a couple of lines of the data in memory at once. Unfortunately it did not have much of a user interface, so when memory was cheap and datasets only medium sized it did not compete well; I guess it was just a bit too ahead of its time.

Hope this helps,

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[EMAIL PROTECTED]
(801) 408-8111
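[Editor's sketch of the chunked approach Greg describes, using RSQLite together with biglm; the database file, table, and column names are made up for illustration. Each pass holds only one chunk of rows in memory while biglm accumulates the sufficient statistics:]

```r
library(DBI)      # generic database interface
library(RSQLite)  # embedded SQLite driver (no server to administer)
library(biglm)    # fits linear models from chunks via sufficient statistics

con <- dbConnect(dbDriver("SQLite"), "patients.db")

## Stream the table in chunks of 10,000 rows instead of loading it whole.
## Note: factor columns (here, gender) must have consistent levels in
## every chunk for biglm to combine them correctly.
res <- dbSendQuery(con, "SELECT age, weight, gender FROM patients")
chunk <- fetch(res, n = 10000)
fit <- biglm(weight ~ age + gender, data = chunk)

while (!dbHasCompleted(res)) {
  chunk <- fetch(res, n = 10000)
  fit <- update(fit, moredata = chunk)   # fold in the next chunk
}

dbClearResult(res)
dbDisconnect(con)
summary(fit)
```

The peak memory use is one chunk plus the model's sufficient statistics, regardless of how many rows the table holds, which is exactly the Rummage idea Greg mentions.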
Re: [R] Reasons to Use R
For a previous version of SAS we had parts installed on each computer where it was used, but there were key pieces located on a network drive (not internet, but local network) such that if you tried to start SAS while someone else was using it you would get an error message. We had troubles with the network, so now we have a full version installed on each computer, but the person in the company who is the contact between us and SAS (my group has 1 licence, but the company as a whole has several) checks up on us from time to time to make sure that we stick within the 1-at-a-time guidelines (not hard, we mostly use other things) or pay for additional licences.

S-PLUS has also had similar types of licences: I was teaching in a computer lab where all the computers could run S-PLUS, but once 5 people had started S-PLUS, no one else could start it until someone quit out of it (so we used R for that class). For S-PLUS 7, when I upgraded my computer and installed my licenced copy on the new computer, it disabled the copy on my old computer. This may have changed somewhat, because I remember there being some complaints from people who legitimately installed it on their laptop, but it would not work when the laptop was not connected to the internet.

There are a lot of different ways to try to enforce licence conditions on software (and doing so is important for companies that want to make a profit these days); unfortunately the current pendulum swing is making things more inconvenient for the common user (at home I have some software that we use to program my wife's sewing machines that can be installed on any computer, but only works if a hardware key is plugged into a usb port).

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[EMAIL PROTECTED]
(801) 408-8111

> -Original Message-
> From: Charilaos Skiadas [mailto:[EMAIL PROTECTED]
> Sent: Monday, April 09, 2007 3:24 PM
> To: Greg Snow
> Cc: Gabor Grothendieck; Lorenzo Isella; R-Help list
> Subject: Re: [R] Reasons to Use R
>
> On Apr 9, 2007, at 1:45 PM, Greg Snow wrote:
>
> > The licences keep changing, some have in the past but don't now, some
> > you can get an additional licence for home at a discounted price. Some
> > it depends on the type of licence you have at work (currently our SAS
> > licence is such that the 3 people in my group can all have it
> > installed, but at most 1 can be using it at any 1 time, how does that
> > affect installing/using it at home).
>
> Hm, this intrigues me, it would seem to me that the only way
> for SAS to check that only one of your colleagues uses it at
> any given time would be to contact some sort of online
> server. Does that mean that SAS can only be run when you have
> internet access?
>
> Or is it simply a clause on the license, without any "runtime checks"?
>
> Haris Skiadas
> Department of Mathematics and Computer Science Hanover College
Re: [R] Reasons to Use R
"halldor bjornsson" <[EMAIL PROTECTED]> writes:
> ...
> Now, R does not have everything we want. One thing missing is a decent
> R-DB2 connection, for windows the excellent RODBC works fine, but ODBC
> support on Linux is a hassle.

A hassle? I use RODBC on Linux to read data from a mainframe DB2 database. I had to create the file .odbc.ini in my home directory with lines like this:

[m1db2p]
Driver = DB2
Servername = NameOfOurMainframe
Database = fdrp
UserName = "NachoBizness"
TraceFile = /home/NachoBizness/.odbc.log

and then to connect I do this:

Sys.putenv(DB2INSTANCE = "db2inst")
myConnection <- odbcConnect(dsn = "m1db2p", uid = uid, pwd = pwd, case = "toupper")

with 'uid' and 'pwd' set to my mainframe uid and password. Now, I am not the sysadmin for our Linux machines, but I don't think they had to do much beyond the standard rpm installation to get this working.

-- Jeff
Re: [R] Reasons to Use R [Broadcast]
Andy,

I totally agree with you. Money should be spent on the people working hard instead of on the fancy software. But in real life, it is the opposite. ^_^.

On 4/10/07, Liaw, Andy <[EMAIL PROTECTED]> wrote:
> I've probably been away from SAS for too long... we've recently tried to
> get SAS on our 64-bit Linux boxes (because SAS on PC is not sufficient
> for some of my colleagues who need it). I was shocked by the quote for
> our 28-core Scyld cluster--- the annual fee was a few times the total
> cost of our hardware. We ended up buying a new quad 3GHz Opteron box
> with 32GB ram just so that the fee for SAS on such a box would be more
> tolerable. It just boggles my mind that the right to use SAS for a year
> is about the price of a nice four-bedroom house (near SAS Institute!).
> I don't understand people who would rather pay that kind of price for the
> software, instead of spending the money on state-of-the-art hardware and
> saving more than a bundle.
>
> Just my $0.02...
> Andy
>
> From: Jorge Cornejo-Donoso
> >
> > I have a Dell with 2 Intel XEON 3.0 processors and 2GB of ram
> > The problem is the DB size.
> >
> > -Mensaje original-
> > De: Gabor Grothendieck [mailto:[EMAIL PROTECTED]
> > Enviado el: Lunes, 09 de Abril de 2007 11:28
> > Para: Jorge Cornejo-Donoso
> > CC: r-help@stat.math.ethz.ch
> > Asunto: Re: [R] Reasons to Use R
> >
> > Have you tried 64 bit machines with larger memory or do you
> > mean that you can't use R on your current machines?
> >
> > Also have you tried S-Plus? Will that work for you? The
> > transition from that to R would be less than from SAS to R.
> >
> > On 4/9/07, Jorge Cornejo-Donoso <[EMAIL PROTECTED]> wrote:
> > > The size of the db is an issue with R. We are still using SAS because R
> > > can't handle our db, and of course we don't want to sacrifice resolution,
> > > because the data collection is expensive (at least in fisheries and
> > > oceanography), so I think that R needs to improve the use of big DBs.
> > > Now I only can use R for graph preparation and some data analysis, but
> > > we can't do the main work in R, and that is really sad.
>
> --
> Notice: This e-mail message, together with any attachments,...{{dropped}}

--
WenSui Liu
A lousy statistician who happens to know a little programming
(http://spaces.msn.com/statcompute/blog)
Re: [R] Reasons to Use R [Broadcast]
I've probably been away from SAS for too long... we've recently tried to get SAS on our 64-bit Linux boxes (because SAS on PC is not sufficient for some of my colleagues who need it). I was shocked by the quote for our 28-core Scyld cluster--- the annual fee was a few times the total cost of our hardware. We ended up buying a new quad 3GHz Opteron box with 32GB ram just so that the fee for SAS on such a box would be more tolerable. It just boggles my mind that the right to use SAS for a year is about the price of a nice four-bedroom house (near SAS Institute!). I don't understand people who would rather pay that kind of price for the software, instead of spending the money on state-of-the-art hardware and saving more than a bundle.

Just my $0.02...
Andy

From: Jorge Cornejo-Donoso
>
> I have a Dell with 2 Intel XEON 3.0 processors and 2GB of ram
> The problem is the DB size.
>
> -Mensaje original-
> De: Gabor Grothendieck [mailto:[EMAIL PROTECTED]
> Enviado el: Lunes, 09 de Abril de 2007 11:28
> Para: Jorge Cornejo-Donoso
> CC: r-help@stat.math.ethz.ch
> Asunto: Re: [R] Reasons to Use R
>
> Have you tried 64 bit machines with larger memory or do you
> mean that you can't use R on your current machines?
>
> Also have you tried S-Plus? Will that work for you? The
> transition from that to R would be less than from SAS to R.
>
> On 4/9/07, Jorge Cornejo-Donoso <[EMAIL PROTECTED]> wrote:
> > The size of the db is an issue with R. We are still using SAS because R
> > can't handle our db, and of course we don't want to sacrifice resolution,
> > because the data collection is expensive (at least in fisheries and
> > oceanography), so I think that R needs to improve the use of big DBs.
> > Now I only can use R for graph preparation and some data analysis, but
> > we can't do the main work in R, and that is really sad.

--
Notice: This e-mail message, together with any attachments,...{{dropped}}
Re: [R] Reasons to Use R
Licensing is a big issue in software. The way I prefer it is an easy license, a license which makes it possible for me to work on another PC without paying a lot of money. R produces quite good results and is widely used. That makes it a statistical package that I want.

The other thing is that working with large datasets requires "some" effort by software makers to get it working. I doubt whether R has the capability of working consistently with large datasets. That is an issue, I think. I have done some comparisons between SPSS and R, and R seems to be performing all right, so I can do computations with it. Nonetheless, the data handling is not quite as good, I think, in comparison with SAS.

When I started doing statistics there were about three packages: SPSS, SAS and BMDP (at least, these were available). On a PC you were required to use SPSS. Nowadays there are hundreds, some with excellent database facilities, or you can compute the newest statistical tests, or an exotic one. I haven't got a clue how to work with new database facilities. dBase was my only database education and everything has changed. So I cannot answer whether R is capable of working with large datasets in relation to databases. I really don't know. The only thing I know is that if I compute a ChiSq, it works on a relatively large dataset (not Fisher tests, by the way). The same with a likelihood procedure, or tabulations including non-parametrics, or factor analysis. But databases are an issue, I've been told by a guy who works with R. SAS was a better option, he told me.

So what's the big deal about S using files instead of memory like R. I don't get the point. Isn't there enough swap space for S? (Who cares anyway: it works, isn't it?) Or are there any problems with S and large datasets? I don't get it. You use them, Greg. So you might discuss that issue.

Wilfred

The licences keep changing: some have in the past but don't now, some you can get an additional licence for home at a discounted price.
For some it depends on the type of licence you have at work (currently our SAS licence is such that the 3 people in my group can all have it installed, but at most 1 can be using it at any 1 time; how does that affect installing/using it at home?). I may be able to install some of the software at home also, but for most of them I have given up trying to figure out the legality of it, and so I have not installed them at home, to be on the safe side.

Some of the doctors I work with who are also affiliated with the local university have mentioned that they can get a discounted academic version of SAS and could use that, but my interpretation of the academic licence that one showed me (probably not the most recent) said (in my interpretation, I am not a lawyer) that if they published the results without paying a licence upgrade fee, they would be violating the licence (the academic version was intended for teaching only). The R licence on the other hand is pretty clear that I can install it and use it pretty much anywhere I want.

You are right in correcting me: R is not the only package that can be used on multiple computers. I do think it is the most straightforward of the good ones.

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[EMAIL PROTECTED]
(801) 408-8111

> -Original Message-
> From: Gabor Grothendieck [mailto:[EMAIL PROTECTED]
> Sent: Monday, April 09, 2007 10:44 AM
> To: Greg Snow
> Cc: Lorenzo Isella; r-help@stat.math.ethz.ch
> Subject: Re: [R] Reasons to Use R
>
> I might be wrong about this but I thought that the licenses
> for at least some of the commercial packages do let you make
> a copy of the one you have at work for home use.
>
> On 4/9/07, Greg Snow <[EMAIL PROTECTED]> wrote:
> > Here are a couple more thoughts to add to what you have already received:
> >
> > You mentioned that price is not at issue, but there are other costs
> > than money that you may want to look at. On my work machine I have R,
> > S-PLUS, SAS, SPSS, and a couple of other stats programs; on my laptop
> > and home computers I have R installed. So, if a deadline is looming
> > and I am working on a project mainly in R, it is easy to work on it on
> > the bus or at home (or in a boring meeting); the same does not work
> > for a SAS or SPSS project (Hmm, thinking about this now, maybe I need
> > to do less in R :-).
> >
> > R and S-PLUS are very flexible/customizable: if you have a certain
> > plot that you make often you can write your own function/script to do
> > it automatically; most other programs will give you their standard,
> > then you have to modify it to meet your specifications. With sweave
> > (and the odf and html extensions) you can automate whole reports, very
> > useful for things that you do month after month.
Re: [R] Reasons to Use R
On Apr 9, 2007, at 1:45 PM, Greg Snow wrote:

> The licences keep changing, some have in the past but don't now, some
> you can get an additional licence for home at a discounted price. Some
> it depends on the type of licence you have at work (currently our SAS
> licence is such that the 3 people in my group can all have it installed,
> but at most 1 can be using it at any 1 time, how does that affect
> installing/using it at home).

Hm, this intrigues me. It would seem to me that the only way for SAS to check that only one of your colleagues uses it at any given time would be to contact some sort of online server. Does that mean that SAS can only be run when you have internet access?

Or is it simply a clause on the license, without any "runtime checks"?

Haris Skiadas
Department of Mathematics and Computer Science
Hanover College
Re: [R] Reasons to Use R
Dear Lorenzo,

Thanks for starting a great thread here. Like others, I would like to hear a summary if you make one. My institute uses R for internal data processing and analysis. Below are some of our reasons, and yes, cost (or lack thereof) is not the only one.

First, prior to the rise of R we already had a number of people using Splus, and our main compute server had licenses for Splus. As the institution moved from Sun Unix servers to Linux workstations and servers, the licensing issue became important. Having to service many licenses (one per workstation, and several on the servers) is time consuming for overworked IT staff. Furthermore, our Splus programs that ran routinely on the servers could all easily be made to run on R. Hence, this was really a no-brainer.

Second, R runs on both Windows and Linux (and Solaris and Macs, although the last one is not really an issue for us). We have made some user programs that are tailor-made for the work we do; these we bundle into R packages that can then be used on both Windows and Linux. This was a very important consideration for us.

Third, the user community. Even with commercial solutions (such as Matlab) the quality of the user community is very important; if we had felt that R did not have an active and responsive community we probably would have been more hesitant. Needless to say, R has an incredibly active community, which makes it an attractive environment. Furthermore, other institutions in our field are also adopting R, at least in the research departments.

Fourth, R is a good choice for many of the things that we do (data analysis of varying complexity, good graphics, maptools [working with shapefiles], etc.). It was therefore an obvious candidate for us from the start.

Now, R does not have everything we want. One thing missing is a decent R-DB2 connection: for Windows the excellent RODBC works fine, but ODBC support on Linux is a hassle.
The big file issue is there, but many of our files are GRIB, which is a format that is generally not supported by anyone. Furthermore, object graphics, à la Python's matplotlib (and of course Matlab), are not there, but would be very handy. However, that being said, it is easy to make publication (print and web) quality graphics with R. And of course, as always with Open Source, if you miss something badly enough, why not do it (or have it done) yourself and add it to the package.

We have not used R much for large NetCDF datasets; there are other tools (such as the CDO package, which also supports GRIB) that are better oriented for this. We have used R on Solaris, Linux (several different flavours) and Windows (since W98). We currently use it on our primary production servers (RedHat Enterprise Edition), but we have not used it in a parallel setting. We have not used R for making on-the-fly calculations and graphics for the web, although this is clearly possible.

I hope this helps. I have found this thread to be a good one.

Sincerely,
Halldór

On 4/5/07, Lorenzo Isella <[EMAIL PROTECTED]> wrote:
> Dear All,
> The institute I work for is organizing an internal workshop for High
> Performance Computing (HPC).
> I am planning to attend it and talk a bit about fluid dynamics, but
> there is also quite a lot of interest devoted to data post-processing
> and management of huge data sets.
> A lot of people are interested in image processing/pattern recognition
> and statistics applied to geography/ecology, but I would like not to
> post this on too many lists.
> The final aim of the workshop is understanding hardware requirements
> and drafting a list of the equipment we would like to buy. I think
> this could be the venue to talk about R as well.
> Therefore, even if it is not exactly a typical mailing list question,
> I would like to have suggestions about where to collect info about:
> (1) Institutions (not only academia) using R
> (2) Hardware requirements, possibly benchmarks
> (3) R & clusters, R & multiple CPU machines, R performance on
> different hardware.
> (4) Finally, a list of the advantages of using R over commercial
> statistical packages. The money-saving in itself is not a reason good
> enough, and some people are scared by the lack of professional support,
> though this mailing list is simply wonderful.
>
> Kind Regards
>
> Lorenzo Isella

--
Halldór Björnsson
Deildarstj. Ranns. & Þróun
Veðursvið Veðurstofu Íslands

Halldór Bjornsson
Weatherservice R & D
Icelandic Met. Office
Re: [R] Reasons to Use R
The licences keep changing: some have in the past but don't now, some you can get an additional licence for home at a discounted price. For some it depends on the type of licence you have at work (currently our SAS licence is such that the 3 people in my group can all have it installed, but at most 1 can be using it at any 1 time; how does that affect installing/using it at home?). I may be able to install some of the software at home also, but for most of them I have given up trying to figure out the legality of it, and so I have not installed them at home, to be on the safe side.

Some of the doctors I work with who are also affiliated with the local university have mentioned that they can get a discounted academic version of SAS and could use that, but my interpretation of the academic licence that one showed me (probably not the most recent) said (in my interpretation, I am not a lawyer) that if they published the results without paying a licence upgrade fee, they would be violating the licence (the academic version was intended for teaching only). The R licence on the other hand is pretty clear that I can install it and use it pretty much anywhere I want.

You are right in correcting me: R is not the only package that can be used on multiple computers. I do think it is the most straightforward of the good ones.

--
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[EMAIL PROTECTED]
(801) 408-8111

> -Original Message-
> From: Gabor Grothendieck [mailto:[EMAIL PROTECTED]
> Sent: Monday, April 09, 2007 10:44 AM
> To: Greg Snow
> Cc: Lorenzo Isella; r-help@stat.math.ethz.ch
> Subject: Re: [R] Reasons to Use R
>
> I might be wrong about this but I thought that the licenses
> for at least some of the commercial packages do let you make
> a copy of the one you have at work for home use.
>
> On 4/9/07, Greg Snow <[EMAIL PROTECTED]> wrote:
> > Here are a couple more thoughts to add to what you have already received:
> >
> > You mentioned that price is not at issue, but there are other costs
> > than money that you may want to look at. On my work machine I have R,
> > S-PLUS, SAS, SPSS, and a couple of other stats programs; on my laptop
> > and home computers I have R installed. So, if a deadline is looming
> > and I am working on a project mainly in R, it is easy to work on it on
> > the bus or at home (or in a boring meeting); the same does not work
> > for a SAS or SPSS project (Hmm, thinking about this now, maybe I need
> > to do less in R :-).
> >
> > R and S-PLUS are very flexible/customizable: if you have a certain
> > plot that you make often you can write your own function/script to do
> > it automatically; most other programs will give you their standard,
> > then you have to modify it to meet your specifications. With sweave
> > (and the odf and html extensions) you can automate whole reports, very
> > useful for things that you do month after month.
> >
> > And what I think is the biggest advantage of R and S-PLUS is that they
> > strongly encourage you to think about your data. Other programs (at
> > least that I am familiar with) tend to have 1 specific way of treating
> > your data, and expect you to modify your data to fit that program's
> > model. These models can be overrestrictive (force you to restructure
> > your data to fit their model) or underrestrictive (allow things that
> > should really be separate data objects to be combined into a single
> > "dataset"), and sometimes both. S on the other hand allows many
> > different ways to store and work with your data, and as you analyze
> > the data, different branches of new analysis open up depending on
> > early results rather than just getting stock output for a procedure.
> > If all you want is a black box where data goes in one end and a
> > specific answer comes out the other, then most programs will work; but
> > if you want to really understand what your data has to tell you, then
> > R/S-PLUS makes this easy and natural.
> >
> > Hope this helps,
> >
> > --
> > Gregory (Greg) L. Snow Ph.D.
> > Statistical Data Center
> > Intermountain Healthcare
> > [EMAIL PROTECTED]
> > (801) 408-8111
> >
> > > -Original Message-
> > > From: [EMAIL PROTECTED]
> > > [mailto:[EMAIL PROTECTED] On Behalf Of Lorenzo Isella
> > > Sent: Thursday, April 05, 2007 9:02 AM
> > > To: r-help@stat.math.ethz.ch
> > > Subject: [R] Reasons to Use R
Re: [R] Reasons to Use R
I might be wrong about this, but I thought that the licenses for at least some of the commercial packages do let you make a copy of the one you have at work for home use.

On 4/9/07, Greg Snow <[EMAIL PROTECTED]> wrote: > [...]
Re: [R] Reasons to Use R
Here are a couple more thoughts to add to what you have already received: You mentioned that price is not at issue, but there are other costs than money that you may want to look at. On my work machine I have R, S-PLUS, SAS, SPSS, and a couple of other stats programs; on my laptop and home computers I have R installed. So, if a deadline is looming and I am working on a project mainly in R, it is easy to work on it on the bus or at home (or in a boring meeting); the same does not work for a SAS or SPSS project (Hmm, thinking about this now, maybe I need to do less in R :-). R and S-PLUS are very flexible/customizable: if you have a certain plot that you make often, you can write your own function/script to do it automatically; most other programs will give you their standard output, which you then have to modify to meet your specifications. With Sweave (and the odf and html extensions) you can automate whole reports, very useful for things that you do month after month. And what I think is the biggest advantage of R and S-PLUS is that they strongly encourage you to think about your data. Other programs (at least those that I am familiar with) tend to have one specific way of treating your data, and expect you to modify your data to fit that program's model. These models can be overrestrictive (forcing you to restructure your data to fit their model) or underrestrictive (allowing things that should really be separate data objects to be combined into a single "dataset"), and sometimes both. S, on the other hand, allows many different ways to store and work with your data, and as you analyze the data, different branches of new analysis open up depending on early results, rather than just getting stock output for a procedure. If all you want is a black box where data goes in one end and a specific answer comes out the other, then most programs will work; but if you want to really understand what your data has to tell you, then R/S-PLUS makes this easy and natural.
Hope this helps, -- Gregory (Greg) L. Snow Ph.D. Statistical Data Center Intermountain Healthcare [EMAIL PROTECTED] (801) 408-8111 > -Original Message- > From: [EMAIL PROTECTED] > [mailto:[EMAIL PROTECTED] On Behalf Of Lorenzo Isella > Sent: Thursday, April 05, 2007 9:02 AM > To: r-help@stat.math.ethz.ch > Subject: [R] Reasons to Use R > > Dear All, > The institute I work for is organizing an internal workshop > for High Performance Computing (HPC). > I am planning to attend it and talk a bit about fluid > dynamics, but there is also quite a lot of interest devoted > to data post-processing and management of huge data sets. > A lot of people are interested in image processing/pattern > recognition and statistic applied to geography/ecology, but I > would like not to post this on too many lists. > The final aim of the workshop is understanding hardware > requirements and drafting a list of the equipment we would > like to buy. I think this could be the venue to talk about R as well. > Therefore, even if it is not exactly a typical mailing list > question, I would like to have suggestions about where to > collect info about: > (1)Institutions (not only academia) using R (2)Hardware > requirements, possibly benchmarks (3)R & clusters, R & > multiple CPU machines, R performance on different hardware. > (4)finally, a list of the advantages for using R over > commercial statistical packages. The money-saving in itself > is not a reason good enough and some people are scared by the > lack of professional support, though this mailing list is > simply wonderful. > > Kind Regards > > Lorenzo Isella > > __ > R-help@stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. 
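Greg's point about writing your own reusable plot function can be illustrated with a small sketch (the function name and arguments here are made up for illustration, not taken from any package):

```r
## A reusable plotting function: your standard layout, callable in one line
my_report_plot <- function(x, y, main = "Monthly report") {
  plot(x, y, type = "b", pch = 19, col = "steelblue",
       xlab = "Month", ylab = "Value", main = main)
  grid()
}

## Every month's report then needs only:
my_report_plot(1:12, rnorm(12))
```

Once such a function exists, it can also be dropped into an Sweave document so the whole report regenerates automatically.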
__ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Reasons to Use R
What about the S-Plus question? S-Plus stores objects in files whereas R stores them in memory.

On 4/9/07, Jorge Cornejo-Donoso <[EMAIL PROTECTED]> wrote: > I have a Dell with 2 Intel Xeon 3.0 processors and 2GB of RAM. > The problem is the DB size. > [...]
Re: [R] Reasons to Use R
I have a Dell with 2 Intel Xeon 3.0 processors and 2GB of RAM. The problem is the DB size.

-Original Message- From: Gabor Grothendieck [mailto:[EMAIL PROTECTED]] Sent: Monday, 09 April 2007 11:28 To: Jorge Cornejo-Donoso CC: r-help@stat.math.ethz.ch Subject: Re: [R] Reasons to Use R

Have you tried 64-bit machines with larger memory, or do you mean that you can't use R on your current machines? Also, have you tried S-Plus? Will that work for you? The transition from that to R would be smaller than from SAS to R.

On 4/9/07, Jorge Cornejo-Donoso <[EMAIL PROTECTED]> wrote: > [...]
Re: [R] Reasons to Use R
Have you tried 64-bit machines with larger memory, or do you mean that you can't use R on your current machines? Also, have you tried S-Plus? Will that work for you? The transition from that to R would be smaller than from SAS to R.

On 4/9/07, Jorge Cornejo-Donoso <[EMAIL PROTECTED]> wrote: > The size of the DB is an issue with R. We are still using SAS because R > can't handle our DB, and of course we don't want to sacrifice resolution, > because the data collection is expensive (at least in fisheries and > oceanography), so... I think that R needs to improve its handling of big DBs. Now > I can only use R for graph preparation and some data analysis, but we > can't do the main work in R, and that is really sad.
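As a sketch of the database route suggested in this thread: with the RSQLite package, only the subset or summary you ask for ever enters R's memory, so the table itself can be far larger than RAM. The file name and the `patients` table are hypothetical, and note that the function names follow current RSQLite/DBI usage (older versions connected via `dbDriver("SQLite")`):

```r
library(DBI)
library(RSQLite)

## Open (or create) an SQLite database file; no server administration needed
con <- dbConnect(SQLite(), "patients.db")

## Bring back only the rows of interest, not the whole table
subset_df <- dbGetQuery(con,
  "SELECT * FROM patients WHERE gender = 'M' AND age BETWEEN 30 AND 35")

## Or let the database pre-compute summaries: one row per gender comes back
avg_df <- dbGetQuery(con,
  "SELECT gender, AVG(age) AS mean_age FROM patients GROUP BY gender")

dbDisconnect(con)
```

With an index on the queried columns, such a subset selection can even be faster than scanning a whole in-memory data frame.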
Re: [R] Reasons to Use R
Dear Johann and Gabor, It comes down to what counts as a large dataset. There are hundreds of datasets R can't handle, probably thousands or more. I noticed on my computer (which is nothing more than an average PC) that R breaks down after about 250 MB of memory. I also note that SPSS breaks down, Matlab, etc. I'm not a SAS user, but I have worked in the past with SAS. It's very good as I remember, but that was ten years ago. And it's a "dollar machine", I've been told: you add dollars to SAS as you add dollars to a Porsche. I haven't got it, and for most statistical applications it isn't necessary, I've been told. R is sufficient for that. The datasets I use are often not that big (the way I like it). About three years ago I spoke to somebody who had worked with it and said "its database system is excellent and statistically profound". Someone with a PhD, so probably he is right. Monte Carlo simulations are computationally time-consuming, but probably these can be done in R. I haven't seen any libraries for it (they might be there). It has been done with S (the commercial counterpart of R), so probably with R too. If you tie Monte Carlo simulation to large datasets, you probably run into problems with a conventional R system. What I've been told in those instances is "buy a new computer" / "add memory and buy a new processor"... and don't smoke hashish. That wasn't good advice, because the guy who told me smoked hashish like hell and drank Pastis (blue liquor) like water. I kicked him out. But that's another story. Cheers, Wilfred (I drink wine and tailor-made beer, and only on occasions. That's why. His simulations were good, I've been told.)
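For what it's worth, basic Monte Carlo simulation needs no special library in R; the built-in random number generators and vectorized arithmetic are enough for a simple sketch like this:

```r
## Monte Carlo estimate of P(X + Y > 1) for independent X, Y ~ Uniform(0, 1)
set.seed(42)
n <- 1e5
x <- runif(n)
y <- runif(n)
mean(x + y > 1)  # roughly 0.5 (the exact probability is 1/2)
```

Whether this scales to large problems is a separate question of memory and speed, as discussed elsewhere in the thread.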
Re: [R] Reasons to Use R
On 4/8/07, Johann Hibschman <[EMAIL PROTECTED]> wrote: > R's pass-by-value semantics also make it harder than it should be to > deal with situations where it's crucial that you not make a copy of the data > frame, for fear of running out of memory. Pass-by-reference would > make implementing data transformations so much easier that I don't > really understand how pass-by-value became the standard. (If there's > a trick to doing in-place transformations, I've not found it.)

Because R processes objects in memory, I also would not rate it as strong as some other packages on very large data sets. But you can use databases, which may make this less important in some cases; you can get a certain amount of mileage out of R environments; and as 64-bit computers become commonplace and memory sizes grow, larger and larger data sets will become easy to handle. Regarding environments, also available are proto objects from the proto package, which are environments with slightly different semantics. Even if you don't intend to use the proto package, it has quite a bit of documentation and supporting information that might be helpful: - home page: http://code.google.com/p/r-proto/ - overview (click on the Wiki tab at the home page), which includes article links that discuss OO and environments - tutorial, reference card, reference manual, vignette (see the Links box)
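A minimal sketch of the environment trick Gabor mentions (the function names here are invented for illustration): environments, unlike data frames, are passed by reference, so a function can modify a large object without the caller-side copy that pass-by-value would imply:

```r
## Wrap a data frame in an environment; environments pass by reference
make_store <- function(x) {
  e <- new.env()
  e$data <- x
  e
}

scale_in_place <- function(store, col, factor) {
  ## Modifies store$data in the caller's environment; nothing is returned
  store$data[[col]] <- store$data[[col]] * factor
  invisible(store)
}

s <- make_store(data.frame(age = c(30, 40, 50)))
scale_in_place(s, "age", 2)
s$data$age  # now 60 80 100
```

This does not eliminate every internal copy R makes during the assignment, but it does give the reference-style calling convention Johann was asking about.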
Re: [R] Reasons to Use R
On 4/6/07, Wilfred Zegwaard <[EMAIL PROTECTED]> wrote: > I'm not a programmer, but I have the experience that R is good for > processing large datasets, especially in combination with specialised > statistics.

This I find a little surprising, but maybe it's just a sign that I'm not experienced enough with R yet. I can't use R for big datasets. At all. Big datasets take forever to load with read.table, R frequently runs out of memory, and nlm or gnlm never seem to actually converge to answers. By comparison, I can point SAS and NLIN at this data without problem. (Of course, SAS is running on a pretty powerful dedicated machine with a big RAM disk, so that may be part of the problem.) R's pass-by-value semantics also make it harder than it should be to deal with situations where it's crucial that you not make a copy of the data frame, for fear of running out of memory. Pass-by-reference would make implementing data transformations so much easier that I don't really understand how pass-by-value became the standard. (If there's a trick to doing in-place transformations, I've not found it.) Right now, I'm considering starting on a project involving some big Monte Carlo integrations over the complicated posterior parameter distributions of a nonlinear regression model, and I have the strong feeling that R will just choke. R's great for small projects, but as soon as you have even a few hundred megs of data, it seems to break down. If I'm doing things wrong, please tell me. :-) SAS is a beast to work with.
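As an aside on the read.table slowness mentioned above, a common mitigation (documented in the R Data Import/Export manual) is to declare the column types up front so R does not have to re-guess them while reading. The file name and the assumption of a three-column file are hypothetical:

```r
## Declaring colClasses avoids type-guessing passes; nrows (even a slight
## over-estimate) lets R pre-allocate storage; comment.char = "" skips
## comment scanning. All three can speed up reading large files noticeably.
df <- read.table("big_file.txt", header = TRUE,
                 colClasses = c("integer", "numeric", "character"),
                 nrows = 500000,
                 comment.char = "")
```

This does not change R's in-memory model, but it often turns "forever" into something workable for files of a few hundred megabytes.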
Re: [R] Reasons to Use R
Regarding (2), I wonder if this information is too outdated or not relevant when scaled up to larger problems... http://www.sciviews.org/benchmark/index.html

--- Ramon Diaz-Uriarte <[EMAIL PROTECTED]> wrote: > [...]
Re: [R] Reasons to Use R
Dear Lorenzo and Steven, I'm not a programmer, but I have the experience that R is good for processing large datasets, especially in combination with specialised statistics. There are some limits to that, but R handles large datasets / complicated computation a lot better than SPSS, for example. I cannot speak for Fortran, but I have experience of Pascal. I prefer R, because in Pascal you easily get lost in an endless programming effort which has nothing to do with the problem. I do like Pascal, it's the only programming language I actually learned, but it isn't an adequate replacement for R. The experience I have is that the SPSS language, and menu-driven package, is far easier to handle than R, but when it comes to specific computations, SPSS loses out, by far. Non-parametrics is good in R, for example. Dataset handling is adequate (my SPSS ports can be read), and I noticed that R has good numerical routines like optimisation (even mixed integer programming) and good procedures for regression (GLM, which is not an SPSS standard). Try to compute a Kendall W statistic in SPSS. It's relatively easy in R. The only thing that I DON'T like about R is dataset computations and its syntax. When I have a dataset with only non-parametric content which is also "dirty" (the dataset is incomplete / has wrong values), I almost have to call in a technician to handle it. To be honest: I use a spreadsheet for these dataset computations, and then export to R. But I noted that in R there are several solutions for that. With SciViews I could get a basic feeling for it. Pascal is basically the only programming language that I understood syntactically. It has a kind of logical mathematical structure to it. The logic of Fortran (and to some extent R) I completely miss. Statistically, R is my choice, and luckily most procedures in R are easily accessible. And my experience with computations in R is... good.
In the past I have done simulations, especially with time series, but I cannot recommend R for them (arima.sim is not sufficient for these types of simulations). I would still prefer Pascal for that. There is also an excellent open source package for Pascal, Free Pascal, but I hardly use it. I do have some good experiences with computations in C, but little experience. Instead of C I would prefer R, I believe. Cheers, Wilfred
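For readers unfamiliar with the base tool Wilfred refers to, arima.sim (in the standard stats package) simulates ARIMA processes; a minimal example of what it does (and, by implication, of the limits he ran into for more specialised simulations):

```r
## Simulate 200 observations from a stationary ARMA(1, 1) process
set.seed(1)
x <- arima.sim(model = list(ar = 0.6, ma = 0.3), n = 200)
plot.ts(x)
```

More elaborate simulation schemes have to be coded by hand on top of R's random number generators.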
Re: [R] Reasons to Use R
Dear Lorenzo, I'll try not to repeat what others have answered before. On 4/5/07, Lorenzo Isella <[EMAIL PROTECTED]> wrote: > The institute I work for is organizing an internal workshop for High > Performance Computing (HPC). (...) > (1)Institutions (not only academia) using R

You can count my institution too. Several groups. (I can provide more details off-list if you want.)

> (2)Hardware requirements, possibly benchmarks > (3)R & clusters, R & multiple CPU machines, R performance on different > hardware.

We do use R in commodity off-the-shelf clusters; our two clusters are running Debian GNU/Linux, with both 32-bit machines ---Xeons--- and 64-bit machines ---dual-core AMD Opterons. We use parallelization quite a bit, with MPI (via the Rmpi and papply packages mainly). One convenient feature is that (once the LAM universe is up and running) whether we are using the 4 cores in a single box, or the maximum available 120, is completely transparent. Using R and MPI is, really, a piece of cake. That said, there are things that I miss; in particular, oftentimes I wish R were Erlang or Oz because of the straightforward fault-tolerant distributed computing and the built-in abstractions for distribution and concurrency. The issue of multithreading has come up several times on this list and is something that some people miss. I am not sure how much R is used in the usual HPC realms. It is my understanding that "traditional HPC" is still dominated by things such as HPF, and C with MPI, OpenMP, or UPC or Cilk. The usual answer to "but R is too slow" is "but you can write Fortran or C code for the bottlenecks and call it from R". I guess you could use, say, UPC in that C that is linked to R, but I have no experience. And I think this code can become a pain to write and maintain (especially if you want to play around with what you try to parallelize, etc.).
My feeling (based on no information or documentation whatsoever) is that how far R can be stretched or extended into HPC is still an open question.

> (4)finally, a list of the advantages for using R over commercial > statistical packages. The money-saving in itself is not a reason good > enough and some people are scared by the lack of professional support, > though this mailing list is simply wonderful.

(In addition to all the already mentioned answers) Complete source code availability. Being able to look at the C source code for a few things has been invaluable for me. And, of course, an extremely active, responsive, and vibrant community that, among other things, has contributed packages and code for an incredible range of problems. Best, R. P.S. I'd be interested in hearing about the responses you get to your presentation. > Kind Regards > > Lorenzo Isella -- Ramon Diaz-Uriarte Statistical Computing Team Structural Biology and Biocomputing Programme Spanish National Cancer Centre (CNIO) http://ligarto.org/rdiaz
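The Rmpi workflow Ramon describes can be sketched roughly as follows. This is only an outline, assuming a working MPI installation (LAM at the time, Open MPI later) underneath; the exact setup depends on the cluster:

```r
library(Rmpi)

## Spawn worker R processes across the MPI universe; whether they land on
## one box or many is transparent, as Ramon notes
mpi.spawn.Rslaves(nslaves = 4)

## Distribute a (trivial, illustrative) computation over the slaves
result <- mpi.parSapply(1:100, function(i) i^2)

## Tear down the workers and exit MPI cleanly
mpi.close.Rslaves()
mpi.quit()
```

For coarse-grained, embarrassingly parallel jobs this is most of what one needs, which is why "R and MPI is a piece of cake" is a fair summary.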
Re: [R] Reasons to Use R
Hi Lorenzo, On 4/5/07, Lorenzo Isella <[EMAIL PROTECTED]> wrote: > > I would like to have suggestions about where to collect info about: > (1)Institutions (not only academia) using R

A starting point might be to look at the R-project homepage and the members and donors list. This is, of course, not a comprehensive list, but at least it can give an overview of the diverse backgrounds of people using R --- even if it is only the tip of the iceberg.

> (2)Hardware requirements, possibly benchmarks

Maybe you should also mention that you can run R just from a USB stick if you want (see R for Windows FAQ 2.6).

> (3)R & clusters, R & multiple CPU machines, R performance on different > hardware.

Have a look at the 'R Installation and Administration' manual; it gives a nice overview of how many platforms R runs on. Best, Roland
Re: [R] Reasons to Use R
> (1)Institutions (not only academia) using R http://www.r-project.org/useR-2006/participants.html > (2)Hardware requirements, possibly benchmarks Since you mention huge data sets, GNU/Linux running on 64-bit machines with as much RAM as your budget allows. > (3)R & clusters, R & multiple CPU machines, > R performance on different hardware. OpenMosix, Quantian for clusters; the archive for multiple CPUs (this was asked quite a few times). It may be best to measure R performance on different hardware by yourself, using your own data and code. > (4)finally, a list of the advantages for using R over > commercial statistical packages. I'd say it's not R vs. commercial packages, but S vs. the rest of the world. Check http://www.insightful.com/ , much of what they say is applicable to R. Make the case that S is vastly superior directly, not just through a list of reasons: take a few data sets and show how they can be analyzed with S compared to other choices. Both R and S-Plus are likely to significantly outperform most other software, depending on the kind of work that needs to be done. > -Original Message- > From: [EMAIL PROTECTED] > [mailto:[EMAIL PROTECTED] On Behalf Of Lorenzo Isella > Sent: Thursday, April 05, 2007 11:02 AM > To: r-help@stat.math.ethz.ch > Subject: [R] Reasons to Use R > > Dear All, > The institute I work for is organizing an internal workshop for High > Performance Computing (HPC). > I am planning to attend it and talk a bit about fluid dynamics, but > there is also quite a lot of interest devoted to data post-processing > and management of huge data sets. > A lot of people are interested in image processing/pattern recognition > and statistic applied to geography/ecology, but I would like not to > post this on too many lists. > The final aim of the workshop is understanding hardware requirements > and drafting a list of the equipment we would like to buy. I think > this could be the venue to talk about R as well. 
> Therefore, even if it is not exactly a typical mailing list question, > I would like to have suggestions about where to collect info about: > (1)Institutions (not only academia) using R > (2)Hardware requirements, possibly benchmarks > (3)R & clusters, R & multiple CPU machines, R performance on > different hardware. > (4)finally, a list of the advantages for using R over commercial > statistical packages. The money-saving in itself is not a reason good > enough and some people are scared by the lack of professional support, > though this mailing list is simply wonderful. > > Kind Regards > > Lorenzo Isella
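The advice above to measure R performance on your own hardware can be done with nothing more than system.time and a task representative of your workload; the linear-algebra example here is only a stand-in:

```r
## Time a representative numerical task on the candidate machine
set.seed(1)
m <- matrix(rnorm(1e6), ncol = 1000)  # a 1000 x 1000 matrix
system.time(svd(m))                   # elapsed time is the figure to compare
```

Running the same snippet (or, better, a slice of your real analysis) on each machine under consideration gives a far more relevant comparison than published benchmarks.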
Re: [R] Reasons to Use R
Hi Lorenzo,

I don't think I'm qualified to provide solid information on the first three questions, but I'd like to drop a few thoughts on (4). While there is no shortage of language advocates out there, I'd like to join in for this once.

My background is in chemical engineering and atmospheric science; I've done simulation on a smaller scale but spend much of my time analyzing large sets of experimental data. I am comfortable programming in Matlab, R, Python, C, Fortran, and Igor Pro, and I also know a little IDL but have not programmed in it extensively. As you are probably aware, I would count Matlab, R, Python, and IDL among the good candidates for processing large data sets, as they are high-level languages and can communicate with netCDF files (which I imagine will be used to transfer data).

Each language boasts an impressive array of libraries, but what I think gives R the advantage for analyzing data is the level of abstraction in the language. I am extremely impressed with the objects available for representing data sets, and the functions support them very well: I carry around fewer objects to hold information about my data, and I don't have to "unpack" them to feed them into functions. The language is also very "expressive", in that it lets you write a procedure in many different ways, some shorter, some more readable, depending on what your situation requires. System commands and text processing are integrated into the language, and the input/output facilities, for both data and graphics, are excellent. Once I have my data object, I am only a few keystrokes away from splitting, sorting, and visualizing multivariate data; even after several years I keep discovering new functions for basic things like manipulating data objects, descriptive statistics, and plotting. Truly, an analyst's needs have been well anticipated.
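To illustrate the "few keystrokes" claim above, here is a toy sketch with invented data (`site` and `value` are hypothetical variables, not from any real data set) of the split/summarize/sort/plot cycle:

```r
# Toy data: two hypothetical measurement sites with 50 observations each
d <- data.frame(
  site  = rep(c("A", "B"), each = 50),
  value = c(rnorm(50, mean = 10), rnorm(50, mean = 12))
)

# Per-group summaries in a single call
means <- aggregate(value ~ site, data = d, FUN = mean)

# Sort by the summary statistic
means <- means[order(means$value), ]

# A quick visual comparison of the two groups
boxplot(value ~ site, data = d)
```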
And this is a recent obsession of mine, which I was introduced to through Python: the functional programming support in R is amazing. By using higher-order functions like lapply(), I rarely rely on for-loops, which have often caused me trouble in the past because I had forgotten to re-initialize a variable, or incremented the wrong variable, etc. Though I'm definitely not militant about functional programming, in general I try to write functions (if they don't exist in R already) and then apply them to the data, often through higher-order functions such as lapply(). This approach keeps most variables out of the global namespace, so I am less likely to reassign a value to a variable that I had intended to keep. It also makes my code more modular, so that I can re-use bits of it as my analysis inevitably grows much larger than I had originally intended.

Furthermore, my code in R ends up being much, much shorter than the code I imagine writing in other languages to accomplish the same task; I believe this leaves fewer places for errors to occur, and the nature of the code is immediately comprehensible (though a series of nested functions can get pretty hard to read at times), not to mention that it takes less effort to write. It also makes it easier to interact with the data, I think, because after making a plot I can set up the next plot with only a few function calls instead of setting out to write a block of code with loops, etc.

I have actually recommended R to colleagues who needed to analyze the information from large-scale air quality/global climate simulations, and they are extremely pleased. I think the capability for statistics and graphics is well-established enough that I don't need to do a hard sell on that; R's language itself is something I get very excited about. I do appreciate all the contributors who have made this available.
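The lapply() pattern described above can be sketched as follows (a hypothetical example with made-up data sets):

```r
# Three made-up data sets, just for illustration
datasets <- list(
  a = rnorm(100),
  b = runif(100),
  c = rpois(100, lambda = 3)
)

summarize <- function(x) c(mean = mean(x), sd = sd(x))

# One higher-order call applies the summary to every data set;
# there is no loop index to re-initialize or increment.
results <- lapply(datasets, summarize)

# The for-loop equivalent needs a pre-allocated container and an index:
# results <- vector("list", length(datasets))
# for (i in seq_along(datasets)) results[[i]] <- summarize(datasets[[i]])
```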
Best regards,

ST
Re: [R] Reasons to Use R
To my knowledge, the core of R is considered "adequate" and "good" by statisticians. That's sufficient, isn't it? Last year I read some documentation about R in which most routines were considered "good", but "some very bad". That is a benchmark of sorts. R is widely used, and there must be benchmarks of the kind you want, and people around who can provide you with the adequate material. CRAN is one way to that, or the project page. The core is free, by the way, and you can participate in the development; people there can provide you with the information you want. R is quite well documented (not everybody thinks it's well documented, but... you know... opinions do vary).

There is one simple reason to use R: it's free, for one. If you have the money, commercial software is sufficient. That doesn't mean that R is the poor man's software; it works quite well, actually (but... you know... opinions vary, especially about statistical software). I think that's the usual reason to use it: it works quite well, and its documentation is widely available. A LOT of statistical procedures are available. R crashed about two times on my computer last year, which is better than SPSS, and there are a lot of user interfaces available which make working with R easier. Personally I don't like SPSS, but I do know that the R core is used in commercial applications, so at least one person has done some benchmarks.

Wilfred
Re: [R] Reasons to Use R
John Kane wrote:

> Given that I can do as much if not more with R (in most cases) than
> with commercial software, as an independent consultant, 'cost' is a
> very significant factor.
>
> A very major advantage of R is the money-saving. Have a look at
> http://www.spss.com/stores/1/Software_Full_Version_C2.cfm
> and convince me that cost (for an independent contractor) is not a
> good reason.

Hello,

No doubt that for an independent contractor money is a significant issue, but we are talking about the case of a large organization for which spending a few thousand euros on software is routine. To avoid misunderstandings: I am myself an R user and I have no intention of paying a cent for statistical software, but in order to speak up for R versus any commercial software for data analysis and postprocessing, I need technical details (benchmarks, etc.) rather than the fact that it saves money.

Kind Regards

Lorenzo
Re: [R] Reasons to Use R
--- Lorenzo Isella <[EMAIL PROTECTED]> wrote:

> (4)finally, a list of the advantages for using R over commercial
> statistical packages. The money-saving in itself is not a reason good
> enough and some people are scared by the lack of professional support,
> though this mailing list is simply wonderful.

Given that I can do as much if not more with R (in most cases) than with commercial software, as an independent consultant, 'cost' is a very significant factor.

A very major advantage of R is the money-saving. Have a look at http://www.spss.com/stores/1/Software_Full_Version_C2.cfm and convince me that cost (for an independent contractor) is not a good reason.
[R] Reasons to Use R
Lorenzo Isella writes:

> (4)finally, a list of the advantages for using R over commercial
> statistical packages.

Here's my entry for the list, as this was a topic of conversation over lunch: it's better than the proprietary statistical software I use most of the time. By "better" I mean that the language is consistent, the features are all well documented, and none of it appears to have been rushed onto the market. The proprietary software I use most of the time at work seems hurriedly cobbled together; R (like LaTeX, Emacs, Linux, ...) never gives me that feeling.

> The money-saving in itself is not a reason good enough

Interesting ;) I know what you mean; it may even make them suspicious.

Joel

--
Joel J. Adamson
Biostatistician
Pediatric Psychopharmacology Research Unit
Massachusetts General Hospital
Boston, MA 02114
(617) 643-1432
(303) 880-3109
Re: [R] Reasons to Use R
Dear Mr. Isella,

I have just started my PhD thesis, for which I need to work with R. A good resource is Bioconductor (www.bioconductor.org), a bioinformatics project built on R. Another institute with good experience using R is the HKI in Jena, Germany; perhaps you can contact Mrs. Radke there to get more information or speakers for your workshop. Both parties work mainly on bioinformatics methods, but perhaps they can help you. A good reason to use R is that computations are much quicker and that you can import and export files from many other programs and languages.

Happy Easter,
C. Schmitt

**
Corinna Schmitt, Dipl.Inf. (Bioinformatik)
Fraunhofer Institut für Grenzflächen- & Bioverfahrenstechnik
Nobelstrasse 12, B 3.24
70569 Stuttgart
Germany
phone: +49 711 9704044
fax: +49 711 9704200
e-mail: [EMAIL PROTECTED]
http://www.igb.fraunhofer.de
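As a minimal sketch of the import/export facilities mentioned above, R's base distribution can round-trip data through CSV, a format most other programs read and write (the file name `results.csv` here is hypothetical):

```r
# A small data frame to export
d <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.7))

# Export to a portable text format...
write.csv(d, "results.csv", row.names = FALSE)

# ...and read it back in
d2 <- read.csv("results.csv")

stopifnot(identical(dim(d), dim(d2)))
```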
[R] Reasons to Use R
Dear All,

The institute I work for is organizing an internal workshop on High Performance Computing (HPC). I am planning to attend it and talk a bit about fluid dynamics, but there is also quite a lot of interest devoted to data post-processing and management of huge data sets. A lot of people are interested in image processing/pattern recognition and statistics applied to geography/ecology, but I would rather not post this on too many lists. The final aim of the workshop is to understand hardware requirements and draft a list of the equipment we would like to buy. I think this could be the venue to talk about R as well.

Therefore, even if it is not exactly a typical mailing list question, I would like to have suggestions about where to collect info about:
(1) Institutions (not only academia) using R
(2) Hardware requirements, possibly benchmarks
(3) R & clusters, R & multiple CPU machines, R performance on different hardware
(4) finally, a list of the advantages of using R over commercial statistical packages. The money-saving is not in itself a good enough reason, and some people are scared by the lack of professional support, though this mailing list is simply wonderful.

Kind Regards

Lorenzo Isella