[R] SQL-select using native R methods ?

2010-03-22 Thread Jan . Sunde
Hi
I have a problem in R that I have been trying to solve but without 
success.
I am trying to join two tables on two variables : an ID and a date 
(optional) that will be common between the two tables

In SQL (and SAS PROC SQL) I am a frequent user of the select command and 
I am used to the following nomenclature : 

select a.*, b.c, b.y, b.z from table1 a, table2 b where a.date=b.date and 
a.id=b.id

I tried this in R (using sqldf) but it takes wy too long to get the 
result. My data sets are 1 gb each.
Is there any way this can be done by merge() or any other more R-like 
way ? 
I have tried making a compound variable with paste(id, as.character(date), 
sep=) and merge() on that but since the date more often than not is NA
this doesnt seem to work. I'm a bit stuck and the documentation is of 
little help to me in this case. 
If anybody would be kind to point me in the right direction I would be 
grateful.

Jan


[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] SQL-select using native R methods ?

2010-03-22 Thread Gabor Grothendieck
On Mon, Mar 22, 2010 at 6:37 AM,  jan.su...@biokapital.no wrote:
 Hi
 I have a problem in R that I have been trying to solve but without
 success.
 I am trying to join two tables on two variables : an ID and a date
 (optional) that will be common between the two tables

 In SQL (and SAS PROC SQL) I am a frequent user of the select command and
 I am used to the following nomenclature :

 select a.*, b.c, b.y, b.z from table1 a, table2 b where a.date=b.date and
 a.id=b.id

 I tried this in R (using sqldf) but it takes wy too long to get the
 result. My data sets are 1 gb each.

Try adding indexes to the join columns.  I was able to do a join
between two 1 million row tables in 13 seconds (under a minute
including the time to add the indexes) on a laptop (not a particularly
fast machine).

 set.seed(1)
 n - 100
 DF1 - data.frame(a = sample(n, n), b1 = runif(n))
 DF2 - data.frame(a = sample(n, n), b2 = runif(n))
 library(sqldf)

 sqldf()
SQLiteConnection:(2708,1)
 system.time(sqldf(create index ai1 on DF1(a)))
   user  system elapsed
  16.630.09   16.85
 system.time(sqldf(create index ai2 on DF2(a)))
   user  system elapsed
  16.060.04   16.30
 system.time(sqldf(select * from main.DF1 natural join main.DF2))
   user  system elapsed
  12.720.21   12.93
 sqldf()


 Is there any way this can be done by merge() or any other more R-like
 way ?
 I have tried making a compound variable with paste(id, as.character(date),
 sep=) and merge() on that but since the date more often than not is NA
 this doesnt seem to work. I'm a bit stuck and the documentation is of
 little help to me in this case.
 If anybody would be kind to point me in the right direction I would be
 grateful.

 Jan


        [[alternative HTML version deleted]]

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.