Re: [R] memory-efficient column aggregation of a sparse matrix
On 1/31/07, Jon Stearley [EMAIL PROTECTED] wrote: I need to sum the columns of a sparse matrix according to a factor - ie given a sparse matrix X and a factor fac of length ncol(X), sum the elements by column factors and return the sparse matrix Y of size nrow(X) by nlevels(f). The appended code does the job, but is unacceptably memory-bound because tapply() uses a non-sparse representation. Can anyone suggest a more memory and cpu efficient approach? Eg, a sparse matrix tapply method? Thanks. This is the sort of operation that is much more easily performed in the triplet representation of a sparse matrix where each nonzero element is represented by its row index, column index and value. Using that representation you could map the column indices according to the factor then convert back to one of the other representations. The only question would be what to do about nonzeros in different columns of the original matrix that get mapped to the same element in the result. It turns out that in the sparse matrix code used by the Matrix package the triplet representation allows for duplicate index positions with the convention that the resulting value at a position is the sum of the values of any triplets with that index pair. If you decide to use this approach please be aware that the indices for the triplet representation in the Matrix package are 0-based (as in C code) not 1-based (as in R code). (I imagine that Martin is thinking we really should change that as he reads this part.) -- +--+ | Jon Stearley (505) 845-7571 (FAX 844-9297) | | Sandia National Laboratories Scalable Systems Integration | +--+ # x and y are of SparseM class matrix.csr aggregate.csr - function(x, fac) { # make a vector indicating the row of each nonzero rows - integer(length=length([EMAIL PROTECTED])) [EMAIL PROTECTED]:nrow(x)]] - 1 # put a 1 at start of each row rows - as.integer(cumsum(rows)) # and finish with a cumsum # make a vector indicating the column factor of each nonzero f - [EMAIL PROTECTED] # aggregate by row,f y - tapply([EMAIL PROTECTED], list(rows,f), sum) # sparsify it y[is.na(y)] - 0 # change tapply NAs to as.matrix.csr 0s y - as.matrix.csr(y) y } __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] memory-efficient column aggregation of a sparse matrix
Doug is right, I think, that this would be easier with full indexing using the matrix.coo classe, if you want to use SparseM. But then the tapply seems to be the way to go. url:www.econ.uiuc.edu/~rogerRoger Koenker email[EMAIL PROTECTED]Department of Economics vox: 217-333-4558University of Illinois fax: 217-244-6678Champaign, IL 61820 On Feb 1, 2007, at 7:22 AM, Douglas Bates wrote: On 1/31/07, Jon Stearley [EMAIL PROTECTED] wrote: I need to sum the columns of a sparse matrix according to a factor - ie given a sparse matrix X and a factor fac of length ncol(X), sum the elements by column factors and return the sparse matrix Y of size nrow(X) by nlevels(f). The appended code does the job, but is unacceptably memory-bound because tapply() uses a non-sparse representation. Can anyone suggest a more memory and cpu efficient approach? Eg, a sparse matrix tapply method? Thanks. This is the sort of operation that is much more easily performed in the triplet representation of a sparse matrix where each nonzero element is represented by its row index, column index and value. Using that representation you could map the column indices according to the factor then convert back to one of the other representations. The only question would be what to do about nonzeros in different columns of the original matrix that get mapped to the same element in the result. It turns out that in the sparse matrix code used by the Matrix package the triplet representation allows for duplicate index positions with the convention that the resulting value at a position is the sum of the values of any triplets with that index pair. If you decide to use this approach please be aware that the indices for the triplet representation in the Matrix package are 0-based (as in C code) not 1-based (as in R code). (I imagine that Martin is thinking we really should change that as he reads this part.) -- +--+ | Jon Stearley (505) 845-7571 (FAX 844-9297) | | Sandia National Laboratories Scalable Systems Integration | +--+ # x and y are of SparseM class matrix.csr aggregate.csr - function(x, fac) { # make a vector indicating the row of each nonzero rows - integer(length=length([EMAIL PROTECTED])) [EMAIL PROTECTED]:nrow(x)]] - 1 # put a 1 at start of each row rows - as.integer(cumsum(rows)) # and finish with a cumsum # make a vector indicating the column factor of each nonzero f - [EMAIL PROTECTED] # aggregate by row,f y - tapply([EMAIL PROTECTED], list(rows,f), sum) # sparsify it y[is.na(y)] - 0 # change tapply NAs to as.matrix.csr 0s y - as.matrix.csr(y) y } __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] memory-efficient column aggregation of a sparse matrix
On Feb 1, 2007, at 6:22 AM, Douglas Bates wrote: It turns out that in the sparse matrix code used by the Matrix package the triplet representation allows for duplicate index positions with the convention that the resulting value at a position is the sum of the values of any triplets with that index pair. Very handy! I suggest adding this nugget near the (possibly redundant) triplets phrase in Matrix.pdf. If you decide to use this approach please be aware that the indices for the triplet representation in the Matrix package are 0-based (as in C code) not 1-based (as in R code). (I imagine that Martin is thinking we really should change that as he reads this part.) The Value of the appended function is equivalent to my previous version, but it runs in 1/10'th the time, uses vastly less memory, and is fewer lines of code to boot! Sure it's tricky, but it does the trick. THANK YOU SO MUCH! -jon NEWaggregate.csr - function(x,fac) { # cast into handy Matrix sparse Triplet form x.T - as(as(x, dgRMatrix), dgTMatrix) # factor column indexes (compensating for 0 vs 1 indexing) [EMAIL PROTECTED] - as.integer(as.integer([EMAIL PROTECTED])-1) # cast back, magically computing factor sums along the way :) y - as(x.T, matrix.csr) # and fix the dimension (doing this on x.T bus errors!) [EMAIL PROTECTED] - as.integer(c(nrow(y),nlevels(fac))) y } __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] memory-efficient column aggregation of a sparse matrix
I need to sum the columns of a sparse matrix according to a factor - ie given a sparse matrix X and a factor fac of length ncol(X), sum the elements by column factors and return the sparse matrix Y of size nrow(X) by nlevels(f). The appended code does the job, but is unacceptably memory-bound because tapply() uses a non-sparse representation. Can anyone suggest a more memory and cpu efficient approach? Eg, a sparse matrix tapply method? Thanks. -- +--+ | Jon Stearley (505) 845-7571 (FAX 844-9297) | | Sandia National Laboratories Scalable Systems Integration | +--+ # x and y are of SparseM class matrix.csr aggregate.csr - function(x, fac) { # make a vector indicating the row of each nonzero rows - integer(length=length([EMAIL PROTECTED])) [EMAIL PROTECTED]:nrow(x)]] - 1 # put a 1 at start of each row rows - as.integer(cumsum(rows)) # and finish with a cumsum # make a vector indicating the column factor of each nonzero f - [EMAIL PROTECTED] # aggregate by row,f y - tapply([EMAIL PROTECTED], list(rows,f), sum) # sparsify it y[is.na(y)] - 0 # change tapply NAs to as.matrix.csr 0s y - as.matrix.csr(y) y } __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.