RE: [R] Statistical analysis of a large database
I thought that maybe authors of books on R should be allowed (encouraged?) to announce the availability and revisions of their books via the R-packages list. For example, I'd be very interested to have another look at Dr. Torgo's book when it becomes more complete, and I'd appreciate a revision notice via the list.

Just a suggestion.

Thanks,
Vadim

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Luis Torgo
Sent: Wednesday, October 13, 2004 12:03 PM
To: Prof Brian Ripley
Cc: Vito Ricci; [EMAIL PROTECTED]
Subject: Re: [R] Statistical analysis of a large database

[...]

______________________________________________
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
Re: [R] Statistical analysis of a large database
On Tue, 2004-10-12 at 08:36, Prof Brian Ripley wrote:
> > Luís Torgo, Data Mining with R. Learning by case studies, May 2003
> > http://www.liacc.up.pt/~ltorgo/DataMiningWithR/
>
> Please note that that reference is not about large datasets, nor about
> `data mining' in the generally used sense. It has two studies, one
> incomplete, on linear regression (with 200 samples) and on time series.

I would like to add some information to these incomplete comments on the book I am writing.

The book is unfinished, as mentioned on its Web page. It currently has two reasonably finished chapters: an introduction to R and MySQL, and a case study.

As mentioned in the book, the first case study is small by data mining standards (200 observations). Its goal is to illustrate techniques that are shared by data mining and other disciplines, and to introduce the reader smoothly to R and its power. It addresses data pre-processing techniques, data visualization, model construction (yes, linear regression, but also regression trees), and model evaluation, selection and combination. So I think it is a bit incorrect to say that it is about linear regression, which accounts for 5 of the 50 pages of that chapter.

The third (unfinished) chapter (the 2nd case study) is about financial trading. It includes topics like connections to databases, as well as many other components of a knowledge discovery process. Among those components is model construction, which obviously involves time series models given the nature of the data. The chapter will include other steps, such as moving from predictions to actions, creating variables from the original time series, etc. It is currently being rewritten and I expect to upload a revised version of this chapter soon.

The book will include at least two further case studies that will be larger. Still, I would note that the financial trading case study is potentially very large, as it is a problem where the data is constantly growing. The final version of that chapter will address the issue of having a system that is online, in the sense that it receives new data in real time (known in the data mining field as mining data streams).

I'm sorry for being so long, but I think it is dangerous to try to summarize around 200 pages of an unfinished work in two lines of text. Still, all comments on this ongoing project are very welcome, and I would like to take this opportunity to thank all the people who have been sending me encouraging comments and emails.

Luis Torgo

--
Luis Torgo
    FEP/LIACC, University of Porto   Phone : (+351) 22 607 88 30
    Machine Learning Group           Fax   : (+351) 22 600 36 54
    R. Campo Alegre, 823             email : [EMAIL PROTECTED]
    4150 PORTO - PORTUGAL            WWW   : http://www.liacc.up.pt/~ltorgo
[R] Statistical analysis of a large database
Dear all,

We need to perform a statistical analysis of a large database (40,000 entries with approximately 500 fields in each entry) currently handled in Oracle. The data contains categorical variables only. At the current stage we suggest classification and clustering analysis.

We are planning to perform the analysis in R and would be very grateful for any recommendations, suggestions or references regarding the packages and tools appropriate for this task.

Thank you in advance for your attention,
Vicky Landsman.
[R] Statistical analysis of a large database
Hi,

For your analysis, use the package:

ROracle: Oracle database interface for R
http://microarrays.unife.it/CRAN/src/contrib/Descriptions/ROracle.html

See also:

- Diego Kuonen, Introduction au data mining avec R : vers la reconquête du `knowledge discovery in databases' par les statisticiens. Bulletin of the Swiss Statistical Society, 40:3-7, 2001.
  http://www.statoo.com/en/publications/2001.R.SSS.40/
- Diego Kuonen and Reinhard Furrer, Data mining avec R dans un monde libre. Flash Informatique Spécial Été, pp. 45-50, Sep. 2001.
  http://sawww.epfl.ch/SIC/SA/publications/FI01/fi-sp-1/sp-1-page45.html
- R Development Core Team, R Data Import/Export, version 1.9.0, April 2004, pp. 11-18.
  http://cran.r-project.org/doc/manuals/R-data.pdf
- Brian D. Ripley, Datamining: Large Databases and Methods, in Proceedings of useR! 2004 - The R User Conference, May 2004.
  http://www.ci.tuwien.ac.at/Conferences/useR-2004/Keynotes/Ripley.pdf
- Brian D. Ripley, Using Databases with R, R News, January 2001, pp. 18-20.
  http://cran.r-project.org/doc/Rnews/Rnews_2001-1.pdf
- B. D. Ripley and R. M. Ripley, Applications of R Clients and Servers, in Proceedings of the Distributed Statistical Computing 2001 Workshop, 2001, Vienna University of Technology.
  http://www.ci.tuwien.ac.at/Conferences/DSC-2001/Proceedings/Ripley.pdf
- Torsten Hothorn, David A. James and Brian D. Ripley, R/S Interfaces to Databases, in Proceedings of the Distributed Statistical Computing 2001 Workshop, 2001, Vienna University of Technology.
  http://www.ci.tuwien.ac.at/Conferences/DSC-2001/Proceedings/HothornJamesRipley.pdf
- Luís Torgo, Data Mining with R. Learning by case studies, May 2003.
  http://www.liacc.up.pt/~ltorgo/DataMiningWithR/

I hope this is of some help.

Best,
Vito

You wrote:
> We need to perform a statistical analysis of a large database (40,000
> entries with approximately 500 fields in each entry) currently handled
> in Oracle. [...]

=====
Become builders of solutions

The business of the statistician is to catalyze the scientific learning process.
George E. P. Box

Visit the portal http://www.modugno.it/ and in particular the section on Palese http://www.modugno.it/archivio/cat_palese.shtml
Re: [R] Statistical analysis of a large database
I believe this is a reply to a posting -- since this is by no means the first time this has happened, please

- use the Reply function of your mailer, or at least use Re: in the subject line and include the relevant part of the original posting, and

- send the reply to the questioner, as well as possibly to the list.

On Tue, 12 Oct 2004, Vito Ricci wrote:

[...]

> Luís Torgo, Data Mining with R. Learning by case studies, May 2003
> http://www.liacc.up.pt/~ltorgo/DataMiningWithR/

Please note that that reference is not about large datasets, nor about `data mining' in the generally used sense. It has two studies, one incomplete, on linear regression (with 200 samples) and on time series.

--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595