RE: [R] Statistical analysis of a large database

2004-10-14 Thread Vadim Ogranovich
I thought that maybe authors of books on R should be allowed (encouraged?) to 
announce the availability and revisions of their books via the R-packages list.
For example, I'd be very interested to have another look at Dr. Torgo's book when it 
becomes more complete, and I'd appreciate a revision notice via the list.

Just a suggestion. Thanks, Vadim




__
[EMAIL PROTECTED] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html


Re: [R] Statistical analysis of a large database

2004-10-13 Thread Luis Torgo
On Tue, 2004-10-12 at 08:36, Prof Brian Ripley wrote:
  Luís Torgo, Data Mining with R. Learning by case
  studies, May 2003
  http://www.liacc.up.pt/~ltorgo/DataMiningWithR/
 
 Please note that that reference is not about large datasets, nor about 
 `data mining' in the generally used sense.  It has two studies, one 
 incomplete, on linear regression (with 200 samples) and on time series.

I would like to add some information to these incomplete comments on
the book I'm writing. The book is unfinished, as mentioned on its Web
page. It currently has two reasonably finished chapters: an introduction
to R and MySQL, and a case study. As mentioned in the book, the first
case study is small by data mining standards (200 observations) and has
the goal of illustrating techniques that are shared by data mining and
other disciplines, as well as smoothly introducing the reader to R and
its power. It addresses data pre-processing techniques, data
visualization, model construction (yes, linear regression, but also
regression trees), and model evaluation, selection and combination, so I
think it is a bit incorrect to say that it is about linear regression,
which corresponds to 5 of the 50 pages of that chapter.
 
The third (unfinished) chapter (the 2nd case study) is about financial
trading. It includes topics like connections to databases as well as
many other components of a knowledge discovery process. Among those
components is model construction, which obviously involves time series
models given the nature of the data. The chapter will include other
steps, such as moving from predictions into actions and creating
variables from the original time series. It is currently being
rewritten and I expect to upload a revised version of this chapter
soon.

The book will include at least two further case studies that will be
larger. Still, I would note that the financial trading case study is
potentially very large, as it is a problem where data is constantly
growing. The final version of that chapter will address this issue of
having a system that is online, in the sense that it receives new
data in real time (also known as mining data streams in the data mining
field).

I'm sorry for being so long, but I think it is dangerous to try to
summarize around 200 pages of an unfinished work in two lines of text.

Still, all comments on this ongoing project are very welcome, and I
would like to take this opportunity to thank all the people who have
been sending me encouraging comments/emails.

Luis Torgo

-- 
Luis Torgo
  FEP/LIACC, University of Porto   Phone : (+351) 22 607 88 30
  Machine Learning Group           Fax   : (+351) 22 600 36 54
  R. Campo Alegre, 823             email : [EMAIL PROTECTED]
  4150 PORTO   -  PORTUGAL         WWW   : http://www.liacc.up.pt/~ltorgo



[R] Statistical analysis of a large database

2004-10-12 Thread Victoria Landsman
Dear all, 
We need to perform a statistical analysis of a large database (40,000 entries with 
approximately 500 fields in each entry) currently handled in Oracle. The data contains 
categorical variables only. 
At the current stage we suggest classification and clustering analysis. 
We are planning to perform the analysis in R  and would be very grateful for any 
recommendations/suggestions/references regarding the packages/tools appropriate for 
this task. 
Thank you in advance for your attention, 
Vicky Landsman.  



[R] Statistical analysis of a large database

2004-10-12 Thread Vito Ricci
Hi,

For your analysis, use the package:

ROracle: Oracle database interface for R

http://microarrays.unife.it/CRAN/src/contrib/Descriptions/ROracle.html

see also:

Diego Kuonen, Introduction au data mining avec R :
vers la reconquête du `knowledge discovery in
databases' par les statisticiens. Bulletin of the
Swiss Statistical Society, 40:3-7, 2001.
http://www.statoo.com/en/publications/2001.R.SSS.40/

Diego Kuonen and Reinhard Furrer, Data mining avec R
dans un monde libre. Flash Informatique Spécial Été,
pages 45-50, sep 2001.
http://sawww.epfl.ch/SIC/SA/publications/FI01/fi-sp-1/sp-1-page45.html


R Development Core Team, R Data Import/Export,
version 1.9.0, April 2004, pp. 11-18
http://cran.r-project.org/doc/manuals/R-data.pdf

Brian D. Ripley, Datamining: Large Databases and
Methods, in Proceedings of “useR! 2004 - The R User
Conference”, May 2004
http://www.ci.tuwien.ac.at/Conferences/useR-2004/Keynotes/Ripley.pdf

Brian D. Ripley, Using Databases with R, R News,
January 2001, pp. 18-20
http://cran.r-project.org/doc/Rnews/Rnews_2001-1.pdf

B. D. Ripley, R. M. Ripley, Applications of R Clients
and Servers, in Proceedings of the Distributed
Statistical Computing 2001 Workshop, 2001, Vienna
University of Technology.
http://www.ci.tuwien.ac.at/Conferences/DSC-2001/Proceedings/Ripley.pdf


Torsten Hothorn, David A. James, Brian D. Ripley, R/S
Interfaces to Databases, in Proceedings of the
Distributed Statistical Computing 2001 Workshop,
2001, Vienna University of Technology.
http://www.ci.tuwien.ac.at/Conferences/DSC-2001/Proceedings/HothornJamesRipley.pdf

Luís Torgo, Data Mining with R. Learning by case
studies, May 2003
http://www.liacc.up.pt/~ltorgo/DataMiningWithR/

I hope this is of some help.
Best
Vito




You wrote:

Dear all, 
We need to perform a statistical analysis of a large
database (40,000 entries with approximately 500 fields
in each entry) currently handled in Oracle. The data
contains categorical variables only. 
At the current stage we suggest classification and
clustering analysis. 
We are planning to perform the analysis in R  and
would be very grateful for any
recommendations/suggestions/references regarding the
packages/tools appropriate for this task. 
Thank you in advance for your attention, 
Vicky Landsman


=
Becoming builders of solutions

The business of the statistician is to catalyze 
the scientific learning process.  
George E. P. Box


Visit the portal http://www.modugno.it/
and in particular the section on Palese http://www.modugno.it/archivio/cat_palese.shtml



Re: [R] Statistical analysis of a large database

2004-10-12 Thread Prof Brian Ripley
I believe this is a reply to a posting -- since this is by no means the 
first time this has happened, please

- use the Reply function of your mailer, or at least use Re: in the 
subject line and include the relevant part of the original posting, and
- send the reply to the questioner, as well as possibly to the list.

On Tue, 12 Oct 2004, Vito Ricci wrote:

[...]

 Luís Torgo, Data Mining with R. Learning by case
 studies, May 2003
 http://www.liacc.up.pt/~ltorgo/DataMiningWithR/

Please note that that reference is not about large datasets, nor about 
`data mining' in the generally used sense.  It has two studies, one 
incomplete, on linear regression (with 200 samples) and on time series.

-- 
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
