[R] Lining up x-y datasets based on values of x

2007-02-01 Thread Christos Hatzis
Hi,

I was wondering if there is a direct approach for lining up 2-column
matrices according to the values of the first column.  An example and a
brute-force approach are given below:

x <- cbind(1:10, runif(10))
y <- cbind(5:14, runif(10))
z <- cbind((-4):5, runif(10))

xx <- seq( min(c(x[,1],y[,1],z[,1])), max(c(x[,1],y[,1],z[,1])), 1)
w <- cbind(xx, matrix(rep(0, 3*length(xx)), ncol=3))

w[ xx >= x[1,1] & xx <= x[10,1], 2 ] <- x[,2]
w[ xx >= y[1,1] & xx <= y[10,1], 3 ] <- y[,2]
w[ xx >= z[1,1] & xx <= z[10,1], 4 ] <- z[,2]

w 

I appreciate any pointers.

Thanks.
 
Christos Hatzis, Ph.D.
Nuvera Biosciences, Inc.
400 West Cummings Park
Suite 5350
Woburn, MA 01801
Tel: 781-938-3830
www.nuverabio.com

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Lining up x-y datasets based on values of x

2007-02-01 Thread Marc Schwartz
How about this:

x <- cbind(1:10, runif(10))
y <- cbind(5:14, runif(10))
z <- cbind((-4):5, runif(10))

colnames(x) <- c("X", "Y")
colnames(y) <- c("X", "Y")
colnames(z) <- c("X", "Y")

xy <- merge(x, y, by = "X", all = TRUE)
xyz <- merge(xy, z, by = "X", all = TRUE)

xyz[is.na(xyz)] <- 0

> xyz
    X       Y.x       Y.y         Y
1  -4 0.0000000 0.0000000 0.3969099
2  -3 0.0000000 0.0000000 0.8943127
3  -2 0.0000000 0.0000000 0.4882819
4  -1 0.0000000 0.0000000 0.0275787
5   0 0.0000000 0.0000000 0.7562341
6   1 0.6873130 0.0000000 0.6185218
7   2 0.1930880 0.0000000 0.2318025
8   3 0.1164783 0.0000000 0.7336057
9   4 0.7408532 0.0000000 0.3006347
10  5 0.7112887 0.6383823 0.8515126
11  6 0.2719079 0.5952721 0.0000000
12  7 0.2067017 0.8178048 0.0000000
13  8 0.2085043 0.5714917 0.0000000
14  9 0.2251435 0.4032660 0.0000000
15 10 0.3471888 0.5247478 0.0000000
16 11 0.0000000 0.6899197 0.0000000
17 12 0.0000000 0.7188912 0.0000000
18 13 0.0000000 0.9133252 0.0000000
19 14 0.0000000 0.9186001 0.0000000

Note that 'xyz' will be a data frame, so just use as.matrix(xyz) to
coerce back to a numeric matrix if needed.

See ?merge
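
With more than a handful of inputs, the pairwise merge() calls above can be folded over a list. The following is only a sketch under assumptions: the list name mats, the per-dataset column names and the zero fill are illustrative choices, not part of the original post.

```r
# Hypothetical example data: three 2-column matrices, as in the original post
set.seed(1)
mats <- list(x = cbind(1:10, runif(10)),
             y = cbind(5:14, runif(10)),
             z = cbind((-4):5, runif(10)))

# Convert each matrix to a data frame whose value column is named after
# the dataset, so merge() never has to invent .x/.y suffixes
dfs <- lapply(names(mats), function(nm) {
  df <- as.data.frame(mats[[nm]])
  names(df) <- c("X", nm)
  df
})

# Fold the list with a full outer join on the key column X
merged <- Reduce(function(a, b) merge(a, b, by = "X", all = TRUE), dfs)

# Replace the NAs introduced by the outer join with zeros
merged[is.na(merged)] <- 0
```

Reduce() still performs one merge per dataset, so it tidies the code rather than changing the cost.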

HTH,

Marc Schwartz


Re: [R] Lining up x-y datasets based on values of x

2007-02-01 Thread Christos Hatzis
Thanks Marc and Phil.

My dataset actually consists of 50+ individual files, so I will have to do
this one column at a time in a loop...
I might look into SQL and outer joins as an alternative to avoid looping.

Thanks again.
-Christos 



Re: [R] Lining up x-y datasets based on values of x

2007-02-01 Thread Marc Schwartz
If the files conform to some naming convention and/or are all located in
a common sub-directory, you can use list.files() to get the file names
into a vector.  If not, you could use file.choose() interactively.

Then use either a for() loop or sapply() to loop over the filenames,
read them in to data frames using read.table() and merge them together
in the same loop.

When it comes to basic data manipulation like this, loops are not a bad
thing. The overhead of a loop is typically outweighed by the file I/O
and related considerations.
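
A minimal sketch of such a loop follows. The temporary directory, the file names (spec01.txt and so on) and the two-column, header-less file layout are assumptions made only so that the example is self-contained.

```r
# Hypothetical setup: write three small two-column files to a temp directory
dir <- tempfile("spectra")
dir.create(dir)
set.seed(1)
starts <- c(1, 5, -4)
for (i in seq_along(starts)) {
  df <- data.frame(X = starts[i] + 0:9, Y = runif(10))
  write.table(df, file.path(dir, sprintf("spec%02d.txt", i)),
              row.names = FALSE, col.names = FALSE)
}

# Collect the file names, then read and merge one file at a time
files <- list.files(dir, pattern = "\\.txt$", full.names = TRUE)

merged <- NULL
for (f in files) {
  # Name the value column after the file (without its extension)
  df <- read.table(f, col.names = c("X", sub("\\.txt$", "", basename(f))))
  merged <- if (is.null(merged)) df else merge(merged, df, by = "X", all = TRUE)
}
```

Each pass adds one value column, and rows missing from a given file are left as NA.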

HTH,

Marc



Re: [R] Lining up x-y datasets based on values of x

2007-02-01 Thread Marc Schwartz
Christos,

According to the Value section in ?merge:

A data frame. The rows are by default lexicographically sorted on the
common columns, but for sort=FALSE are in an unspecified order.


Looking at the code, while there is a lot of time spent on matching
things, the key sort() code seems to be near the end of the function:

  if (sort)
      res <- res[if (all.x || all.y)
          do.call("order", x[, 1:l.b, drop = FALSE])
      else sort.list(bx[m$xi]), , drop = FALSE]

I wonder if you could create a local version of merge(), say my.merge(),
without that code and without breaking things. A quick glance suggests
that as long as you are not merging on the rownames, I think that you
might be OK. You would want to test that hypothesis however.

HTH,

Marc



Re: [R] Lining up x-y datasets based on values of x

2007-02-01 Thread Christos Hatzis
[Sorry I meant to reply to the list]

Thanks, Marc.

That's what I have done.
However, there seems to be a penalty from using merge repeatedly as it
appears to internally re-sort the datasets.  In my case the datasets are
long (~35K rows) and already sorted so this step adds considerable and
unnecessary overhead.  There doesn't seem to be an option for disabling
sorting. Setting 'sort=F' only affects sorting of the final data.frame.

> system.time(merge(nmr.spectra.serum[[1]], nmr.spectra.serum[[2]],
+ by="V1", all=T, sort=T))
[1] 6.96 0.00 7.24   NA   NA
> system.time(merge(nmr.spectra.serum[[1]], nmr.spectra.serum[[2]],
+ by="V1", all=T, sort=F))
[1] 6.82 0.00 7.14   NA   NA
 

I was wondering if perhaps there is a parallel between this problem and
methods for lining up time-series data, since such data are also usually
sorted on the time dimension. 

-Christos  



Re: [R] Lining up x-y datasets based on values of x

2007-02-01 Thread Prof Brian Ripley
On Thu, 1 Feb 2007, Marc Schwartz wrote:

> Christos,
>
> According to the Value section in ?merge:
>
> A data frame. The rows are by default lexicographically sorted on the
> common columns, but for sort=FALSE are in an unspecified order.

There is also a sort in the .Internal code.  But I am not buying 
that this is a major part of the time without detailed evidence from 
profiling.  Sorting 35k numbers should take a few milliseconds, and 
less if they are already sorted.

> x <- rnorm(35000)
> system.time(y <- sort(x, method="quick"))
[1] 0.003 0.001 0.004 0.000 0.000
> system.time(sort(y, method="quick"))
[1] 0.002 0.000 0.001 0.000 0.000




-- 
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



Re: [R] Lining up x-y datasets based on values of x

2007-02-01 Thread Gabor Grothendieck
The zoo package has a multiway merge with optional zero fill.
Here are two ways:

library(zoo)
merge(x = zoo(x[,2], x[,1]),
  y = zoo(y[,2], y[,1]),
  z = zoo(z[,2], z[,1]),
  fill = 0)

# or

library(zoo)
X <- list(x = x, y = y, z = z)
merge0 <- function(..., fill = 0) merge(..., fill = fill)
do.call(merge0, lapply(X, function(x) zoo(x[,2], x[,1])))

To get more info on zoo try:

vignette("zoo")



Re: [R] Lining up x-y datasets based on values of x

2007-02-01 Thread Marc Schwartz
On Thu, 2007-02-01 at 23:34 +0000, Prof Brian Ripley wrote:
> On Thu, 1 Feb 2007, Marc Schwartz wrote:
>
> > Christos,
> >
> > According to the Value section in ?merge:
> >
> > A data frame. The rows are by default lexicographically sorted on the
> > common columns, but for sort=FALSE are in an unspecified order.
>
> There is also a sort in the .Internal code.  But I am not buying
> that this is a major part of the time without detailed evidence from
> profiling.  Sorting 35k numbers should take a few milliseconds, and
> less if they are already sorted.
>
> > x <- rnorm(35000)
> > system.time(y <- sort(x, method="quick"))
> [1] 0.003 0.001 0.004 0.000 0.000
> > system.time(sort(y, method="quick"))
> [1] 0.002 0.000 0.001 0.000 0.000

Having had a chance to mock up some examples, I would have to agree with
Prof. Ripley on this point.

Presuming that we are not missing something about the nature of
Christos' data sets, here are 4 examples, with rows sorted in ascending
order, descending order, reversed sort order and random order. In
theory, the descending order example should, I believe, represent a
worst case scenario, since reverse sorting a sorted list is typically
slowest. However, note that there is not much time variation below and
running each of the examples several times resulted in material
differences across runs.


1. Ascending order

DF.X <- data.frame(X = 1:35000, Y = runif(35000))
DF.Y <- data.frame(X = 1:35000, Y = runif(35000))

> system.time(DF.XY <- merge(DF.X, DF.Y, by = "X", all = TRUE))
[1] 0.249 0.004 0.264 0.000 0.000


2. Descending order

DF.X <- data.frame(X = 35000:1, Y = runif(35000))
DF.Y <- data.frame(X = 35000:1, Y = runif(35000))

> system.time(DF.XY <- merge(DF.X, DF.Y, by = "X", all = TRUE))
[1] 0.300 0.007 0.309 0.000 0.000


3. Reversed sort order

DF.X <- data.frame(X = 35000:1, Y = runif(35000))
DF.Y <- data.frame(X = 1:35000, Y = runif(35000))

> system.time(DF.XY <- merge(DF.X, DF.Y, by = "X", all = TRUE))
[1] 0.236 0.008 0.245 0.000 0.000


4. Random order

DF.X <- data.frame(X = sample(35000), Y = runif(35000))
DF.Y <- data.frame(X = sample(35000), Y = runif(35000))

> system.time(DF.XY <- merge(DF.X, DF.Y, by = "X", all = TRUE))
[1] 0.339 0.016 0.357 0.000 0.000



Spending some time looking at profiling the descending order example, we
get:

> summaryRprof()
$by.self
                        self.time self.pct total.time total.pct
duplicated.default           0.16     38.1       0.16      38.1
match                        0.08     19.0       0.08      19.0
sort.list                    0.08     19.0       0.08      19.0
[.data.frame                 0.04      9.5       0.24      57.1
merge.data.frame             0.02      4.8       0.42     100.0
names.default                0.02      4.8       0.02       4.8
seq_len                      0.02      4.8       0.02       4.8
merge                        0.00      0.0       0.42     100.0
[                            0.00      0.0       0.24      57.1
any                          0.00      0.0       0.18      42.9
duplicated                   0.00      0.0       0.18      42.9
cbind                        0.00      0.0       0.04       9.5
data.frame                   0.00      0.0       0.04       9.5
data.row.names               0.00      0.0       0.02       4.8
names                        0.00      0.0       0.02       4.8
row.names<-                  0.00      0.0       0.02       4.8
row.names<-.data.frame       0.00      0.0       0.02       4.8

$by.total
                        total.time total.pct self.time self.pct
merge.data.frame              0.42     100.0      0.02      4.8
merge                         0.42     100.0      0.00      0.0
[.data.frame                  0.24      57.1      0.04      9.5
[                             0.24      57.1      0.00      0.0
any                           0.18      42.9      0.00      0.0
duplicated                    0.18      42.9      0.00      0.0
duplicated.default            0.16      38.1      0.16     38.1
match                         0.08      19.0      0.08     19.0
sort.list                     0.08      19.0      0.08     19.0
cbind                         0.04       9.5      0.00      0.0
data.frame                    0.04       9.5      0.00      0.0
names.default                 0.02       4.8      0.02      4.8
seq_len                       0.02       4.8      0.02      4.8
data.row.names                0.02       4.8      0.00      0.0
names                         0.02       4.8      0.00      0.0
row.names<-                   0.02       4.8      0.00      0.0
row.names<-.data.frame        0.02       4.8      0.00      0.0

$sampling.time
[1] 0.42



The above suggests that a meaningful amount of time is spent in checking
for and dealing with duplicates in the common ('by') columns. To that
end:

DF.X <- data.frame(X = sample(1, 35000, replace = TRUE), Y = runif(35000))
DF.Y <- data.frame(X = sample(1, 35000, replace = TRUE), Y = runif(35000))

> system.time(DF.XY <- merge(DF.X, DF.Y, by = "X", all = TRUE))
[1] 3.316 0.148 

Re: [R] Lining up x-y datasets based on values of x

2007-02-01 Thread Christos Hatzis
Marc,

I don't think the issue is duplicates in the matching columns.  The data
were generated by an instrument (NMR spectrometer), processed by the
instrument's software through an FFT transform and other transformations and
finally reported as a sequence of chemical shift (x) vs intensity (y) pairs.
So all x values are unique.  For the example that I reported earlier:

> length(nmr.spectra.serum[[1]]$V1)
[1] 32768
> length(unique(nmr.spectra.serum[[1]]$V1))
[1] 32768
> length(nmr.spectra.serum[[2]]$V1)
[1] 32768
> length(unique(nmr.spectra.serum[[2]]$V1))
[1] 32768

And most of the x-values are common:
> sum(nmr.spectra.serum[[1]]$V1 %in% nmr.spectra.serum[[2]]$V1)
[1] 32625
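
The uniqueness and overlap checks above can be bundled into one helper and applied to any pair of key vectors. This is only a sketch: key_report() is a hypothetical name, and the integer keys stand in for the real chemical-shift values.

```r
# Hypothetical helper: summarise the properties of two key vectors that
# matter before aligning or merging on them
key_report <- function(a, b) {
  list(unique_a = anyDuplicated(a) == 0,  # no repeated keys in a
       unique_b = anyDuplicated(b) == 0,  # no repeated keys in b
       sorted_a = !is.unsorted(a),        # a already in ascending order
       sorted_b = !is.unsorted(b),        # b already in ascending order
       n_common = sum(a %in% b))          # keys of a also present in b
}

# Stand-in keys taken from the toy example at the start of the thread
rep1 <- key_report(1:10, 5:14)
str(rep1)
```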

For this reason, merge is probably overkill for this problem and my
initial thought was to align the datasets through some simple index-shifting
operation.
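
One way to sketch such an index-based alignment in base R is with match() against the union of keys. It assumes that keys are unique within each dataset and compare exactly equal across datasets (which is fragile for floating-point x values); align_by_key() and its fill argument are hypothetical names, not an existing function.

```r
# Hypothetical helper: align a named list of 2-column matrices (or data
# frames) on the union of their first columns, without calling merge()
align_by_key <- function(lst, fill = 0) {
  # Sorted union of all keys across the datasets
  keys <- sort(unique(unlist(lapply(lst, function(m) m[, 1]))))
  out <- matrix(fill, nrow = length(keys), ncol = length(lst),
                dimnames = list(NULL, names(lst)))
  for (j in seq_along(lst)) {
    # Row positions of this dataset's keys within the union
    out[match(lst[[j]][, 1], keys), j] <- lst[[j]][, 2]
  }
  cbind(X = keys, out)
}

# Toy data from the original post
set.seed(1)
mats <- list(x = cbind(1:10, runif(10)),
             y = cbind(5:14, runif(10)),
             z = cbind((-4):5, runif(10)))
w <- align_by_key(mats)
```

Passing fill = NA leaves gaps as missing values instead of zeros.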

Profiling of the merge code in my case shows that most of the time is spent
on data frame subsetting operations and on internal merge and rbind calls
secondarily (if I read the summary output correctly).  So even if most of
the time in the internal merge function is spent on sorting (haven't checked
the source code), this is in the worst case a rather minor effect, as
suggested by Prof. Ripley.
  
> Rprof("merge.out")
> zz <- merge(nmr.spectra.serum[[1]], nmr.spectra.serum[[2]], by="V1",
+ all=T, sort=T)
> Rprof(NULL)
> summaryRprof("merge.out")

$by.self
                        self.time self.pct total.time total.pct
merge.data.frame             6.56     50.0      11.84      90.2
[.data.frame                 2.42     18.4       3.68      28.0
merge                        1.28      9.8      13.12     100.0
rbind                        1.24      9.5       1.36      10.4
names<-.default              1.16      8.8       1.16       8.8
row.names<-.data.frame       0.12      0.9       0.18       1.4
duplicated.default           0.12      0.9       0.12       0.9
make.unique                  0.10      0.8       0.10       0.8
data.frame                   0.02      0.2       0.04       0.3
*                            0.02      0.2       0.02       0.2
is.na                        0.02      0.2       0.02       0.2
match                        0.02      0.2       0.02       0.2
order                        0.02      0.2       0.02       0.2
unclass                      0.02      0.2       0.02       0.2
[                            0.00      0.0       3.68      28.0
do.call                      0.00      0.0       1.18       9.0
names<-                      0.00      0.0       1.16       8.8
row.names<-                  0.00      0.0       0.18       1.4
any                          0.00      0.0       0.14       1.1
duplicated                   0.00      0.0       0.12       0.9
cbind                        0.00      0.0       0.04       0.3
as.vector                    0.00      0.0       0.02       0.2
seq                          0.00      0.0       0.02       0.2
seq.default                  0.00      0.0       0.02       0.2

$by.total
                        total.time total.pct self.time self.pct
merge                        13.12     100.0      1.28      9.8
merge.data.frame             11.84      90.2      6.56     50.0
[.data.frame                  3.68      28.0      2.42     18.4
[                             3.68      28.0      0.00      0.0
rbind                         1.36      10.4      1.24      9.5
do.call                       1.18       9.0      0.00      0.0
names<-.default               1.16       8.8      1.16      8.8
names<-                       1.16       8.8      0.00      0.0
row.names<-.data.frame        0.18       1.4      0.12      0.9
row.names<-                   0.18       1.4      0.00      0.0
any                           0.14       1.1      0.00      0.0
duplicated.default            0.12       0.9      0.12      0.9
duplicated                    0.12       0.9      0.00      0.0
make.unique                   0.10       0.8      0.10      0.8
data.frame                    0.04       0.3      0.02      0.2
cbind                         0.04       0.3      0.00      0.0
*                             0.02       0.2      0.02      0.2
is.na                         0.02       0.2      0.02      0.2
match                         0.02       0.2      0.02      0.2
order                         0.02       0.2      0.02      0.2
unclass                       0.02       0.2      0.02      0.2
as.vector                     0.02       0.2      0.00      0.0
seq                           0.02       0.2      0.00      0.0
seq.default                   0.02       0.2      0.00      0.0

$sampling.time
[1] 13.12


Thanks again for your time in looking into this.
-Christos


Re: [R] Lining up x-y datasets based on values of x

2007-02-01 Thread Christos Hatzis
Thanks Gabor.

This is along the lines of what I was looking for.  In fact the merge
function for zoo objects (ordered) turns out to be almost an order of
magnitude faster than the generic merge function for my problem:

> system.time(
+ zz <- merge( spec.1 = zoo(nmr.spectra.serum[[1]]$V2, nmr.spectra.serum[[1]]$V1),
+              spec.2 = zoo(nmr.spectra.serum[[2]]$V2, nmr.spectra.serum[[2]]$V1),
+              fill=NA )
+ )
[1] 0.74 0.07 0.82   NA   NA
> system.time(
+ ww <- merge(nmr.spectra.serum[[1]], nmr.spectra.serum[[2]], by="V1",
+             all=T, sort=T)
+ )
[1] 6.85 0.05 6.94   NA   NA
> head(zz)
        spec.1 spec.2
-1322.2 -0.651     NA
-1321.9 -0.266     NA
-1321.7 -0.962     NA
-1321.4 -0.602     NA
-1321.2  0.753     NA
-1320.9  1.212     NA
> head(ww)
       V1   V2.x V2.y
1 -1322.2 -0.651   NA
2 -1321.9 -0.266   NA
3 -1321.7 -0.962   NA
4 -1321.4 -0.602   NA
5 -1321.2  0.753   NA
6 -1320.9  1.212   NA
 

Thanks again.
-Christos 



Re: [R] Lining up x-y datasets based on values of x

2007-02-01 Thread Marc Schwartz
Christos,

Thanks for the follow up.  Thought I had something, but apparently not.

Question: What is the actual structure of the nmr.spectra.serum objects?
The indexing approach that you have suggests they are 

Re: [R] Lining up x-y datasets based on values of x

2007-02-01 Thread Christos Hatzis
Marc,

The data structure is a list of data frames generated from read.table:

> class(nmr.spectra.serum)
[1] "list"
> class(nmr.spectra.serum[[1]])
[1] "data.frame"
> dim(nmr.spectra.serum[[1]])
[1] 32768     2

Converting the data.frames to matrices does not have much of an effect on
timing.
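
One possible reason, offered as a sketch rather than a diagnosis: merge() on a matrix dispatches to merge.default(), which converts both arguments with as.data.frame() and then calls merge() again, so the work ends up in merge.data.frame() either way. The data below are made-up stand-ins (names and sizes are assumptions, not the NMR spectra):

```r
# Hypothetical stand-ins for two spectra: unique x values, most of them shared.
set.seed(1)
a <- data.frame(V1 = 1:2000,  V2 = runif(2000))
b <- data.frame(V1 = 11:2010, V2 = runif(2000))

# Merge as data frames.
t.df <- system.time(m.df <- merge(a, b, by = "V1", all = TRUE))

# Merge as matrices: merge.default() coerces both back to data frames first.
t.mat <- system.time(
  m.mat <- merge(as.matrix(a), as.matrix(b), by = "V1", all = TRUE)
)

# Compare user, system and elapsed times for the two calls.
print(rbind(data.frame = t.df, matrix = t.mat)[, 1:3])
```

Both calls return a data frame with the same columns, so any timing difference should come only from the extra coercion step.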

-Christos

-Original Message-
From: Marc Schwartz [mailto:[EMAIL PROTECTED] 
Sent: Thursday, February 01, 2007 11:06 PM
To: [EMAIL PROTECTED]
Cc: 'Prof Brian Ripley'; r-help@stat.math.ethz.ch
Subject: Re: [R] Lining up x-y datasets based on values of x

On Thu, 2007-02-01 at 22:46 -0500, Christos Hatzis wrote:
 Marc,
 
 I don't think the issue is duplicates in the matching columns.  The 
 data were generated by an instrument (NMR spectrometer), processed by 
 the instrument's software through an FFT transform and other 
 transformations and finally reported as a sequence of chemical shift (x)
 vs intensity (y) pairs.
 So all x values are unique.  For the example that I reported earlier:
 
 > length(nmr.spectra.serum[[1]]$V1)
 [1] 32768
 > length(unique(nmr.spectra.serum[[1]]$V1))
 [1] 32768
 > length(nmr.spectra.serum[[2]]$V1)
 [1] 32768
 > length(unique(nmr.spectra.serum[[2]]$V1))
 [1] 32768
 
 And most of the x-values are common
 > sum(nmr.spectra.serum[[1]]$V1 %in% nmr.spectra.serum[[2]]$V1)
 [1] 32625
 
 For this reason, merge is probably an overkill for this problem and my 
 initial thought was to align the datasets through some simple 
 index-shifting operation.
 
 Profiling of the merge code in my case shows that most of the time is 
 spent on data frame subsetting operations and on internal merge and 
 rbind calls secondarily (if I read the summary output correctly).  So 
 even if most of the time in the internal merge function is spent on 
 sorting (haven't checked the source code), this is in the worst case a 
 rather minor effect, as suggested by Prof. Ripley.
   
 > Rprof("merge.out")
 > zz <- merge(nmr.spectra.serum[[1]], nmr.spectra.serum[[2]], by="V1", all=T, sort=T)
 > Rprof(NULL)
 > summaryRprof("merge.out")
 

Re: [R] Lining up x-y datasets based on values of x

2007-02-01 Thread Marc Schwartz
Christos,

At least on my system, storing the data frames in a list does not appear to increase the merge time:

DF.X <- data.frame(X = 35000:1, Y = runif(35000))
DF.Y <- data.frame(X = 35000:1, Y = runif(35000))

> system.time(DF.XY <- merge(DF.X, DF.Y, by = "X", all = TRUE))
[1] 0.238 0.012 0.256 0.000 0.000


compared to:

DF.list <- list(DF.X, DF.Y)

> str(DF.list)
List of 2
 $ :'data.frame':   35000 obs. of  2 variables:
  ..$ X: int [1:35000] 35000 34999 34998 34997 34996 34995 34994 34993 34992 34991 ...
  ..$ Y: num [1:35000] 0.720 0.855 0.216 0.817 0.534 ...
 $ :'data.frame':   35000 obs. of  2 variables:
  ..$ X: int [1:35000] 35000 34999 34998 34997 34996 34995 34994 34993 34992 34991 ...
  ..$ Y: num [1:35000] 0.68090 0.00694 0.64235 0.15728 0.27436 ...


> system.time(DF.XY.L <- merge(DF.list[[1]], DF.list[[2]], by = "X", all = TRUE))
[1] 0.251 0.005 0.262 0.000 0.000


So I am still confuzzled as to why it is taking 13 seconds on your
system.  I am missing something here.

However, I did note that using merge.zoo() appears to be helpful.
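
Along the lines of the index-based alignment you mentioned, here is a minimal base-R sketch (this is not the merge.zoo() code). It assumes each data set is a data frame with x in the first column and y in the second, and that x values are unique within a data set, as you reported. align_xy() is a hypothetical helper name, not an existing function:

```r
# align_xy() is a hypothetical helper, not part of base R or zoo.
# Assumes x is column 1 and y is column 2 of every data frame,
# and that x values are unique within each data frame.
align_xy <- function(dfs, fill = NA) {
  xs   <- lapply(dfs, function(d) d[[1]])
  grid <- sort(unique(unlist(xs)))             # union of all x values
  out  <- matrix(fill, nrow = length(grid), ncol = length(dfs),
                 dimnames = list(NULL, paste("y", seq_along(dfs), sep = "")))
  for (i in seq_along(dfs)) {
    rows <- match(xs[[i]], grid)               # grid row for each x in set i
    out[rows, i] <- dfs[[i]][[2]]              # drop the y values into place
  }
  cbind(x = grid, out)
}

# The small example from the start of the thread, as data frames.
set.seed(42)
x <- data.frame(X = 1:10, Y = runif(10))
y <- data.frame(X = 5:14, Y = runif(10))
z <- data.frame(X = -4:5, Y = runif(10))

w <- align_xy(list(x, y, z), fill = 0)
dim(w)   # 19 rows (x from -4 to 14) and 4 columns
```

Note that match() compares values exactly, so x values that differ only by rounding will not line up and will each get their own row.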

Regards,

Marc

