[R] Lining up x-y datasets based on values of x
Hi,

I was wondering if there is a direct approach for lining up 2-column matrices according to the values of the first column. An example and a brute-force approach are given below:

    x <- cbind(1:10, runif(10))
    y <- cbind(5:14, runif(10))
    z <- cbind((-4):5, runif(10))

    xx <- seq(min(c(x[,1], y[,1], z[,1])), max(c(x[,1], y[,1], z[,1])), 1)
    w <- cbind(xx, matrix(rep(0, 3*length(xx)), ncol=3))

    w[ xx >= x[1,1] & xx <= x[10,1], 2 ] <- x[,2]
    w[ xx >= y[1,1] & xx <= y[10,1], 3 ] <- y[,2]
    w[ xx >= z[1,1] & xx <= z[10,1], 4 ] <- z[,2]

    w

I appreciate any pointers.

Thanks.

Christos Hatzis, Ph.D.
Nuvera Biosciences, Inc.
400 West Cummings Park
Suite 5350
Woburn, MA 01801
Tel: 781-938-3830
www.nuverabio.com

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] Lining up x-y datasets based on values of x
On Thu, 2007-02-01 at 15:05 -0500, Christos Hatzis wrote:
> I was wondering if there is a direct approach for lining up 2-column
> matrices according to the values of the first column.

How about this:

    x <- cbind(1:10, runif(10))
    y <- cbind(5:14, runif(10))
    z <- cbind((-4):5, runif(10))

    colnames(x) <- c("X", "Y")
    colnames(y) <- c("X", "Y")
    colnames(z) <- c("X", "Y")

    xy <- merge(x, y, by = "X", all = TRUE)
    xyz <- merge(xy, z, by = "X", all = TRUE)

    xyz[is.na(xyz)] <- 0

    xyz
        X       Y.x       Y.y         Y
    1  -4 0.0000000 0.0000000 0.3969099
    2  -3 0.0000000 0.0000000 0.8943127
    3  -2 0.0000000 0.0000000 0.4882819
    4  -1 0.0000000 0.0000000 0.0275787
    5   0 0.0000000 0.0000000 0.7562341
    6   1 0.6873130 0.0000000 0.6185218
    7   2 0.1930880 0.0000000 0.2318025
    8   3 0.1164783 0.0000000 0.7336057
    9   4 0.7408532 0.0000000 0.3006347
    10  5 0.7112887 0.6383823 0.8515126
    11  6 0.2719079 0.5952721 0.0000000
    12  7 0.2067017 0.8178048 0.0000000
    13  8 0.2085043 0.5714917 0.0000000
    14  9 0.2251435 0.4032660 0.0000000
    15 10 0.3471888 0.5247478 0.0000000
    16 11 0.0000000 0.6899197 0.0000000
    17 12 0.0000000 0.7188912 0.0000000
    18 13 0.0000000 0.9133252 0.0000000
    19 14 0.0000000 0.9186001 0.0000000

Note that 'xyz' will be a data frame, so just use as.matrix(xyz) to coerce back to a numeric matrix if needed.

See ?merge

HTH,

Marc Schwartz
Re: [R] Lining up x-y datasets based on values of x
Thanks Marc and Phil.

My dataset actually consists of 50+ individual files, so I will have to do this one column at a time in a loop... I might look into SQL and outer joins as an alternative to avoid looping.

Thanks again.

-Christos
Re: [R] Lining up x-y datasets based on values of x
On Thu, 2007-02-01 at 15:45 -0500, Christos Hatzis wrote:
> My dataset actually consists of 50+ individual files, so I will have to do
> this one column at a time in a loop... I might look into SQL and outer
> joins as an alternative to avoid looping.

If the files conform to some naming convention and/or are all located in a common sub-directory, you can use list.files() to get the file names into a vector. If not, you could use file.choose() interactively.

Then use either a for() loop or sapply() to loop over the filenames, read them into data frames using read.table() and merge them together in the same loop.

When it comes to basic data manipulation like this, loops are not a bad thing. The overhead of a loop is typically outweighed by the file I/O and related considerations.

HTH,

Marc
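The list.files() / read.table() / merge() loop described above could look roughly like the sketch below. It is only an illustration, not code from the thread: the temporary directory, the spec_*.txt file names and the read_one() helper are all made up so that the example runs on its own, and Reduce() stands in for an explicit for() loop. Each value column is renamed after its file before merging, so that repeated merges do not produce clashing Y.x / Y.y names.

```r
# Hypothetical setup: write three small two-column files to a temporary
# directory so that the sketch is self-contained.
dir <- file.path(tempdir(), "spectra_demo")
dir.create(dir, showWarnings = FALSE)
set.seed(1)
starts <- c(a = 1, b = 5, c = -4)   # first x value of each toy dataset
for (nm in names(starts)) {
  d <- data.frame(seq(starts[[nm]], length.out = 10), runif(10))
  write.table(d, file.path(dir, paste0("spec_", nm, ".txt")),
              row.names = FALSE, col.names = FALSE)
}

# Collect the file names that follow the (assumed) naming convention.
files <- list.files(dir, pattern = "^spec_.*\\.txt$", full.names = TRUE)

# Read one file and name its value column after the file.
read_one <- function(f) {
  d <- read.table(f, col.names = c("X", "Y"))
  names(d)[2] <- sub("\\.txt$", "", basename(f))
  d
}
dfs <- lapply(files, read_one)

# Full outer join of all data frames on X, two at a time.
merged <- Reduce(function(a, b) merge(a, b, by = "X", all = TRUE), dfs)

# Zero-fill the x values that a given file did not cover.
merged[is.na(merged)] <- 0
```

With the toy files above, 'merged' has one row per distinct X and one value column per file. For real data the pattern and the read.table() arguments would need to match the actual files.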
Re: [R] Lining up x-y datasets based on values of x
Christos,

According to the Value section in ?merge:

  A data frame. The rows are by default lexicographically sorted on the common columns, but for sort=FALSE are in an unspecified order.

Looking at the code, while there is a lot of time spent on matching things, the key sort() code seems to be near the end of the function:

    if (sort)
        res <- res[if (all.x || all.y)
            do.call("order", x[, 1:l.b, drop = FALSE])
        else sort.list(bx[m$xi]), , drop = FALSE]

I wonder if you could create a local version of merge(), say my.merge(), without that code and without breaking things. A quick glance suggests that, as long as you are not merging on the rownames, you might be OK. You would want to test that hypothesis, however.

HTH,

Marc

On Thu, 2007-02-01 at 16:48 -0500, Christos Hatzis wrote:
> However, there seems to be a penalty from using merge repeatedly as it
> appears to internally re-sort the datasets.
Re: [R] Lining up x-y datasets based on values of x
[Sorry I meant to reply to the list]

Thanks, Marc. That's what I have done.

However, there seems to be a penalty from using merge repeatedly as it appears to internally re-sort the datasets. In my case the datasets are long (~35K rows) and already sorted, so this step adds considerable and unnecessary overhead. There doesn't seem to be an option for disabling sorting. Setting 'sort=F' only affects sorting of the final data.frame.

    system.time(merge(nmr.spectra.serum[[1]], nmr.spectra.serum[[2]], by="V1", all=T, sort=T))
    [1] 6.96 0.00 7.24   NA   NA

    system.time(merge(nmr.spectra.serum[[1]], nmr.spectra.serum[[2]], by="V1", all=T, sort=F))
    [1] 6.82 0.00 7.14   NA   NA

I was wondering if perhaps there is a parallel between this problem and methods for lining up time-series data, since such data are also usually sorted on the time dimension.

-Christos
Re: [R] Lining up x-y datasets based on values of x
On Thu, 1 Feb 2007, Marc Schwartz wrote:
> According to the Value section in ?merge:
>
>   A data frame. The rows are by default lexicographically sorted on the
>   common columns, but for sort=FALSE are in an unspecified order.

There is also a sort in the .Internal code. But I am not buying that this is a major part of the time without detailed evidence from profiling. Sorting 35k numbers should take a few milliseconds, and less if they are already sorted.

    x <- rnorm(35000)
    system.time(y <- sort(x, method="quick"))
    [1] 0.003 0.001 0.004 0.000 0.000
    system.time(sort(y, method="quick"))
    [1] 0.002 0.000 0.001 0.000 0.000

--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
Re: [R] Lining up x-y datasets based on values of x
The zoo package has a multiway merge with optional zero fill. Here are two ways:

    library(zoo)
    merge(x = zoo(x[,2], x[,1]), y = zoo(y[,2], y[,1]), z = zoo(z[,2], z[,1]), fill = 0)

    # or

    library(zoo)
    X <- list(x = x, y = y, z = z)
    merge0 <- function(..., fill = 0) merge(..., fill = fill)
    do.call(merge0, lapply(X, function(x) zoo(x[,2], x[,1])))

To get more info on zoo try:

    vignette("zoo")

On 2/1/07, Christos Hatzis wrote:
> I was wondering if there is a direct approach for lining up 2-column
> matrices according to the values of the first column.
Re: [R] Lining up x-y datasets based on values of x
On Thu, 2007-02-01 at 23:34 +, Prof Brian Ripley wrote:
> There is also a sort in the .Internal code. But I am not buying that this
> is a major part of the time without detailed evidence from profiling.
> Sorting 35k numbers should take a few milliseconds, and less if they are
> already sorted.

Having had a chance to mock up some examples, I would have to agree with Prof. Ripley on this point.

Presuming that we are not missing something about the nature of Christos' data sets, here are 4 examples, with rows sorted in ascending order, descending order, reversed sort order and random order.

In theory, the descending order example should, I believe, represent a worst case scenario, since reverse sorting a sorted list is typically slowest. However, note that there is not much time variation below and running each of the examples several times resulted in material differences across runs.

1. Ascending order

    DF.X <- data.frame(X = 1:35000, Y = runif(35000))
    DF.Y <- data.frame(X = 1:35000, Y = runif(35000))
    system.time(DF.XY <- merge(DF.X, DF.Y, by = "X", all = TRUE))
    [1] 0.249 0.004 0.264 0.000 0.000

2. Descending order

    DF.X <- data.frame(X = 35000:1, Y = runif(35000))
    DF.Y <- data.frame(X = 35000:1, Y = runif(35000))
    system.time(DF.XY <- merge(DF.X, DF.Y, by = "X", all = TRUE))
    [1] 0.300 0.007 0.309 0.000 0.000

3. Reversed sort order

    DF.X <- data.frame(X = 35000:1, Y = runif(35000))
    DF.Y <- data.frame(X = 1:35000, Y = runif(35000))
    system.time(DF.XY <- merge(DF.X, DF.Y, by = "X", all = TRUE))
    [1] 0.236 0.008 0.245 0.000 0.000

4. Random order

    DF.X <- data.frame(X = sample(35000), Y = runif(35000))
    DF.Y <- data.frame(X = sample(35000), Y = runif(35000))
    system.time(DF.XY <- merge(DF.X, DF.Y, by = "X", all = TRUE))
    [1] 0.339 0.016 0.357 0.000 0.000

Spending some time looking at profiling the descending order example, we get:

    summaryRprof()
    $by.self
                           self.time self.pct total.time total.pct
    duplicated.default          0.16     38.1       0.16      38.1
    match                       0.08     19.0       0.08      19.0
    sort.list                   0.08     19.0       0.08      19.0
    [.data.frame                0.04      9.5       0.24      57.1
    merge.data.frame            0.02      4.8       0.42     100.0
    names.default               0.02      4.8       0.02       4.8
    seq_len                     0.02      4.8       0.02       4.8
    merge                       0.00      0.0       0.42     100.0
    [                           0.00      0.0       0.24      57.1
    any                         0.00      0.0       0.18      42.9
    duplicated                  0.00      0.0       0.18      42.9
    cbind                       0.00      0.0       0.04       9.5
    data.frame                  0.00      0.0       0.04       9.5
    data.row.names              0.00      0.0       0.02       4.8
    names                       0.00      0.0       0.02       4.8
    row.names<-                 0.00      0.0       0.02       4.8
    row.names<-.data.frame      0.00      0.0       0.02       4.8

    $by.total
                           total.time total.pct self.time self.pct
    merge.data.frame             0.42     100.0      0.02      4.8
    merge                        0.42     100.0      0.00      0.0
    [.data.frame                 0.24      57.1      0.04      9.5
    [                            0.24      57.1      0.00      0.0
    any                          0.18      42.9      0.00      0.0
    duplicated                   0.18      42.9      0.00      0.0
    duplicated.default           0.16      38.1      0.16     38.1
    match                        0.08      19.0      0.08     19.0
    sort.list                    0.08      19.0      0.08     19.0
    cbind                        0.04       9.5      0.00      0.0
    data.frame                   0.04       9.5      0.00      0.0
    names.default                0.02       4.8      0.02      4.8
    seq_len                      0.02       4.8      0.02      4.8
    data.row.names               0.02       4.8      0.00      0.0
    names                        0.02       4.8      0.00      0.0
    row.names<-                  0.02       4.8      0.00      0.0
    row.names<-.data.frame       0.02       4.8      0.00      0.0

    $sampling.time
    [1] 0.42

The above suggests that a meaningful amount of time is spent in checking for and dealing with duplicates in the common ('by') columns. To that end:

    DF.X <- data.frame(X = sample(1, 35000, replace = TRUE), Y = runif(35000))
    DF.Y <- data.frame(X = sample(1, 35000, replace = TRUE), Y = runif(35000))
    system.time(DF.XY <- merge(DF.X, DF.Y, by = "X", all = TRUE))
    [1] 3.316 0.148
Re: [R] Lining up x-y datasets based on values of x
Marc,

I don't think the issue is duplicates in the matching columns. The data were generated by an instrument (NMR spectrometer), processed by the instrument's software through an FFT transform and other transformations and finally reported as a sequence of chemical shift (x) vs intensity (y) pairs. So all x values are unique. For the example that I reported earlier:

    length(nmr.spectra.serum[[1]]$V1)
    [1] 32768
    length(unique(nmr.spectra.serum[[1]]$V1))
    [1] 32768
    length(nmr.spectra.serum[[2]]$V1)
    [1] 32768
    length(unique(nmr.spectra.serum[[2]]$V1))
    [1] 32768

And most of the x-values are common:

    sum(nmr.spectra.serum[[1]]$V1 %in% nmr.spectra.serum[[2]]$V1)
    [1] 32625

For this reason, merge is probably overkill for this problem, and my initial thought was to align the datasets through some simple index-shifting operation.

Profiling of the merge code in my case shows that most of the time is spent on data frame subsetting operations and on internal merge and rbind calls secondarily (if I read the summary output correctly). So even if most of the time in the internal merge function is spent on sorting (haven't checked the source code), this is in the worst case a rather minor effect, as suggested by Prof. Ripley.

    Rprof("merge.out")
    zz <- merge(nmr.spectra.serum[[1]], nmr.spectra.serum[[2]], by="V1", all=T, sort=T)
    Rprof(NULL)
    summaryRprof("merge.out")
    $by.self
                           self.time self.pct total.time total.pct
    merge.data.frame            6.56     50.0      11.84      90.2
    [.data.frame                2.42     18.4       3.68      28.0
    merge                       1.28      9.8      13.12     100.0
    rbind                       1.24      9.5       1.36      10.4
    names<-.default             1.16      8.8       1.16       8.8
    row.names<-.data.frame      0.12      0.9       0.18       1.4
    duplicated.default          0.12      0.9       0.12       0.9
    make.unique                 0.10      0.8       0.10       0.8
    data.frame                  0.02      0.2       0.04       0.3
    *                           0.02      0.2       0.02       0.2
    is.na                       0.02      0.2       0.02       0.2
    match                       0.02      0.2       0.02       0.2
    order                       0.02      0.2       0.02       0.2
    unclass                     0.02      0.2       0.02       0.2
    [                           0.00      0.0       3.68      28.0
    do.call                     0.00      0.0       1.18       9.0
    names<-                     0.00      0.0       1.16       8.8
    row.names<-                 0.00      0.0       0.18       1.4
    any                         0.00      0.0       0.14       1.1
    duplicated                  0.00      0.0       0.12       0.9
    cbind                       0.00      0.0       0.04       0.3
    as.vector                   0.00      0.0       0.02       0.2
    seq                         0.00      0.0       0.02       0.2
    seq.default                 0.00      0.0       0.02       0.2

    $by.total
                           total.time total.pct self.time self.pct
    merge                       13.12     100.0      1.28      9.8
    merge.data.frame            11.84      90.2      6.56     50.0
    [.data.frame                 3.68      28.0      2.42     18.4
    [                            3.68      28.0      0.00      0.0
    rbind                        1.36      10.4      1.24      9.5
    do.call                      1.18       9.0      0.00      0.0
    names<-.default              1.16       8.8      1.16      8.8
    names<-                      1.16       8.8      0.00      0.0
    row.names<-.data.frame       0.18       1.4      0.12      0.9
    row.names<-                  0.18       1.4      0.00      0.0
    any                          0.14       1.1      0.00      0.0
    duplicated.default           0.12       0.9      0.12      0.9
    duplicated                   0.12       0.9      0.00      0.0
    make.unique                  0.10       0.8      0.10      0.8
    data.frame                   0.04       0.3      0.02      0.2
    cbind                        0.04       0.3      0.00      0.0
    *                            0.02       0.2      0.02      0.2
    is.na                        0.02       0.2      0.02      0.2
    match                        0.02       0.2      0.02      0.2
    order                        0.02       0.2      0.02      0.2
    unclass                      0.02       0.2      0.02      0.2
    as.vector                    0.02       0.2      0.00      0.0
    seq                          0.02       0.2      0.00      0.0
    seq.default                  0.02       0.2      0.00      0.0

    $sampling.time
    [1] 13.12

Thanks again for your time in looking into this.

-Christos
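For datasets whose x values are unique, the index-based alignment mentioned above might be sketched with match() instead of merge(). This is a hedged illustration on the toy matrices from the original post, not something posted in the thread; the names 'mats', 'grid' and 'aligned' are invented here, and the approach assumes that x values compare exactly equal across datasets (the same assumption the %in% check above makes).

```r
set.seed(42)
x <- cbind(1:10, runif(10))
y <- cbind(5:14, runif(10))
z <- cbind((-4):5, runif(10))
mats <- list(x = x, y = y, z = z)

# Common grid: the sorted union of all x values.
grid <- sort(unique(unlist(lapply(mats, function(m) m[, 1]))))

# For each dataset, look up the row holding each grid value; match()
# returns NA where the dataset has no such x, and indexing with NA
# yields NA in the aligned column.
aligned <- sapply(mats, function(m) m[match(grid, m[, 1]), 2])

# One row per grid value, one column per dataset.
w <- cbind(X = grid, aligned)
```

Here 'w' is a plain numeric matrix with NA where a dataset does not cover a grid value; w[is.na(w)] <- 0 would give the zero fill of the original brute-force version. Whether this is faster than merge() on the 32768-row spectra would need to be timed on the real data.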
Re: [R] Lining up x-y datasets based on values of x
Thanks Gabor. This is along the lines of what I was looking for.

In fact the merge function for zoo objects (ordered) turns out to be almost an order of magnitude faster than the generic merge function for my problem:

    system.time(
      zz <- merge(spec.1 = zoo(nmr.spectra.serum[[1]]$V2, nmr.spectra.serum[[1]]$V1),
                  spec.2 = zoo(nmr.spectra.serum[[2]]$V2, nmr.spectra.serum[[2]]$V1),
                  fill = NA)
    )
    [1] 0.74 0.07 0.82   NA   NA

    system.time(
      ww <- merge(nmr.spectra.serum[[1]], nmr.spectra.serum[[2]], by="V1", all=T, sort=T)
    )
    [1] 6.85 0.05 6.94   NA   NA

    head(zz)
            spec.1 spec.2
    -1322.2 -0.651     NA
    -1321.9 -0.266     NA
    -1321.7 -0.962     NA
    -1321.4 -0.602     NA
    -1321.2  0.753     NA
    -1320.9  1.212     NA

    head(ww)
           V1   V2.x V2.y
    1 -1322.2 -0.651   NA
    2 -1321.9 -0.266   NA
    3 -1321.7 -0.962   NA
    4 -1321.4 -0.602   NA
    5 -1321.2  0.753   NA
    6 -1320.9  1.212   NA

Thanks again.

-Christos
Re: [R] Lining up x-y datasets based on values of x
On Thu, 2007-02-01 at 22:46 -0500, Christos Hatzis wrote:
> Marc,
> I don't think the issue is duplicates in the matching columns.

Christos,

Thanks for the follow up. Thought I had something, but apparently not.

Question: What is the actual structure of the nmr.spectra.serum objects? The indexing approach that you have suggests they are
Re: [R] Lining up x-y datasets based on values of x
Marc,

The data structure is a list of data frames generated from read.table:

    class(nmr.spectra.serum)
    [1] "list"
    class(nmr.spectra.serum[[1]])
    [1] "data.frame"
    dim(nmr.spectra.serum[[1]])
    [1] 32768     2

Converting the data.frames to matrices does not have much of an effect on timing.

-Christos
Re: [R] Lining up x-y datasets based on values of x
Christos,

At least on my system, this does not appear to increase timing:

    DF.X <- data.frame(X = 35000:1, Y = runif(35000))
    DF.Y <- data.frame(X = 35000:1, Y = runif(35000))

    system.time(DF.XY <- merge(DF.X, DF.Y, by = "X", all = TRUE))
    [1] 0.238 0.012 0.256 0.000 0.000

compared to:

    DF.list <- list(DF.X, DF.Y)
    str(DF.list)
    List of 2
     $ :'data.frame': 35000 obs. of 2 variables:
      ..$ X: int [1:35000] 35000 34999 34998 34997 34996 34995 34994 34993 34992 34991 ...
      ..$ Y: num [1:35000] 0.720 0.855 0.216 0.817 0.534 ...
     $ :'data.frame': 35000 obs. of 2 variables:
      ..$ X: int [1:35000] 35000 34999 34998 34997 34996 34995 34994 34993 34992 34991 ...
      ..$ Y: num [1:35000] 0.68090 0.00694 0.64235 0.15728 0.27436 ...

    system.time(DF.XY.L <- merge(DF.list[[1]], DF.list[[2]], by = "X", all = TRUE))
    [1] 0.251 0.005 0.262 0.000 0.000

So I am still confuzzled as to why it is taking 13 seconds on your system. I am missing something here.

However, I did note that using merge.zoo() appears to be helpful.

Regards,

Marc
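
The timing test above uses identical integer keys in both data frames. A variant closer to the spectra described earlier in the thread could be sketched as follows. The data here are synthetic stand-ins, not the NMR files: the names s1, s2, grid and shift are made up for illustration, and the double-valued x column is an assumption about how chemical shifts are stored. Only the sizes (32768 rows, 32625 shared x values) come from the thread.

```r
# Synthetic stand-ins for two spectra (hypothetical data, not the NMR files):
# 32768 unique x values each, 32625 of them shared, as reported in the thread.
n <- 32768
shift <- 143                       # number of x values not shared on each side
grid <- seq_len(n + shift) / 1000  # common grid of double-valued x

set.seed(42)
s1 <- data.frame(V1 = grid[1:n],                   V2 = runif(n))
s2 <- data.frame(V1 = grid[(shift + 1):(n + shift)], V2 = runif(n))

# Number of x values of s1 that also occur in s2
sum(s1$V1 %in% s2$V1)

# Time the full outer join on the shared column
timing <- system.time(
  zz <- merge(s1, s2, by = "V1", all = TRUE, sort = TRUE)
)
timing
dim(zz)
```

If this runs quickly while the merge on the real data does not, the difference presumably lies in something specific to the real data frames rather than in the partial overlap of the keys.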