Re: [R] memory-efficient column aggregation of a sparse matrix

2007-02-01 Thread Douglas Bates
On 1/31/07, Jon Stearley [EMAIL PROTECTED] wrote:
 I need to sum the columns of a sparse matrix according to a factor -
 ie given a sparse matrix X and a factor fac of length ncol(X), sum
 the elements by column factors and return the sparse matrix Y of size
 nrow(X) by nlevels(f).  The appended code does the job, but is
 unacceptably memory-bound because tapply() uses a non-sparse
 representation.  Can anyone suggest a more memory and cpu efficient
 approach?  Eg, a sparse matrix tapply method?  Thanks.

This is the sort of operation that is much more easily performed in
the triplet representation of a sparse matrix where each nonzero
element is represented by its row index, column index and value.
Using that representation you could map the column indices according
to the factor then convert back to one of the other representations.
The only question would be what to do about nonzeros in different
columns of the original matrix that get mapped to the same element in
the result.  It turns out that in the sparse matrix code used by the
Matrix package the triplet representation allows for duplicate index
positions with the convention that the resulting value at a position
is the sum of the values of any triplets with that index pair.

If you decide to use this approach please be aware that the indices
for the triplet representation in the Matrix package are 0-based (as
in C code) not 1-based (as in R code).  (I imagine that Martin is
thinking we really should change that as he reads this part.)


 --
 +--+
 | Jon Stearley  (505) 845-7571  (FAX 844-9297) |
 | Sandia National Laboratories  Scalable Systems Integration   |
 +--+


 # x and y are of SparseM class matrix.csr
 aggregate.csr -
 function(x, fac) {
  # make a vector indicating the row of each nonzero
  rows - integer(length=length([EMAIL PROTECTED]))
  [EMAIL PROTECTED]:nrow(x)]] - 1 # put a 1 at start of each row
  rows - as.integer(cumsum(rows)) # and finish with a cumsum

  # make a vector indicating the column factor of each nonzero
  f - [EMAIL PROTECTED]

  # aggregate by row,f
  y - tapply([EMAIL PROTECTED], list(rows,f), sum)

  # sparsify it
  y[is.na(y)] - 0  # change tapply NAs to as.matrix.csr 0s
  y - as.matrix.csr(y)

  y
 }


__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] memory-efficient column aggregation of a sparse matrix

2007-02-01 Thread roger koenker
Doug is right, I think, that this would be easier with full indexing
using the  matrix.coo classe, if you want to use SparseM.  But
then the tapply seems to be the way to go.

url:www.econ.uiuc.edu/~rogerRoger Koenker
email[EMAIL PROTECTED]Department of Economics
vox: 217-333-4558University of Illinois
fax:   217-244-6678Champaign, IL 61820


On Feb 1, 2007, at 7:22 AM, Douglas Bates wrote:

 On 1/31/07, Jon Stearley [EMAIL PROTECTED] wrote:
 I need to sum the columns of a sparse matrix according to a factor -
 ie given a sparse matrix X and a factor fac of length ncol(X), sum
 the elements by column factors and return the sparse matrix Y of size
 nrow(X) by nlevels(f).  The appended code does the job, but is
 unacceptably memory-bound because tapply() uses a non-sparse
 representation.  Can anyone suggest a more memory and cpu efficient
 approach?  Eg, a sparse matrix tapply method?  Thanks.

 This is the sort of operation that is much more easily performed in
 the triplet representation of a sparse matrix where each nonzero
 element is represented by its row index, column index and value.
 Using that representation you could map the column indices according
 to the factor then convert back to one of the other representations.
 The only question would be what to do about nonzeros in different
 columns of the original matrix that get mapped to the same element in
 the result.  It turns out that in the sparse matrix code used by the
 Matrix package the triplet representation allows for duplicate index
 positions with the convention that the resulting value at a position
 is the sum of the values of any triplets with that index pair.

 If you decide to use this approach please be aware that the indices
 for the triplet representation in the Matrix package are 0-based (as
 in C code) not 1-based (as in R code).  (I imagine that Martin is
 thinking we really should change that as he reads this part.)


 --
 +--+
 | Jon Stearley  (505) 845-7571  (FAX 844-9297) |
 | Sandia National Laboratories  Scalable Systems Integration   |
 +--+


 # x and y are of SparseM class matrix.csr
 aggregate.csr -
 function(x, fac) {
  # make a vector indicating the row of each nonzero
  rows - integer(length=length([EMAIL PROTECTED]))
  [EMAIL PROTECTED]:nrow(x)]] - 1 # put a 1 at start of each row
  rows - as.integer(cumsum(rows)) # and finish with a cumsum

  # make a vector indicating the column factor of each nonzero
  f - [EMAIL PROTECTED]

  # aggregate by row,f
  y - tapply([EMAIL PROTECTED], list(rows,f), sum)

  # sparsify it
  y[is.na(y)] - 0  # change tapply NAs to as.matrix.csr 0s
  y - as.matrix.csr(y)

  y
 }


__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] memory-efficient column aggregation of a sparse matrix

2007-02-01 Thread Jon Stearley

On Feb 1, 2007, at 6:22 AM, Douglas Bates wrote:

 It turns out that in the sparse matrix code used by the
 Matrix package the triplet representation allows for duplicate index
 positions with the convention that the resulting value at a position
 is the sum of the values of any triplets with that index pair.

Very handy!  I suggest adding this nugget near the (possibly  
redundant) triplets phrase in Matrix.pdf.

 If you decide to use this approach please be aware that the indices
 for the triplet representation in the Matrix package are 0-based (as
 in C code) not 1-based (as in R code).  (I imagine that Martin is
 thinking we really should change that as he reads this part.)

The Value of the appended function is equivalent to my previous  
version, but it runs in 1/10'th the time, uses vastly less memory,  
and is fewer lines of code to boot!  Sure it's tricky, but it does  
the trick.

THANK YOU SO MUCH!

-jon

NEWaggregate.csr - function(x,fac) {
 # cast into handy Matrix sparse Triplet form
 x.T - as(as(x, dgRMatrix), dgTMatrix)

 # factor column indexes (compensating for 0 vs 1 indexing)
 [EMAIL PROTECTED] - as.integer(as.integer([EMAIL PROTECTED])-1)

 # cast back, magically computing factor sums along the way :)
 y - as(x.T, matrix.csr)

 # and fix the dimension (doing this on x.T bus errors!)
 [EMAIL PROTECTED] - as.integer(c(nrow(y),nlevels(fac)))
 y
}

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] memory-efficient column aggregation of a sparse matrix

2007-01-31 Thread Jon Stearley
I need to sum the columns of a sparse matrix according to a factor -  
ie given a sparse matrix X and a factor fac of length ncol(X), sum  
the elements by column factors and return the sparse matrix Y of size  
nrow(X) by nlevels(f).  The appended code does the job, but is  
unacceptably memory-bound because tapply() uses a non-sparse  
representation.  Can anyone suggest a more memory and cpu efficient  
approach?  Eg, a sparse matrix tapply method?  Thanks.

-- 
+--+
| Jon Stearley  (505) 845-7571  (FAX 844-9297) |
| Sandia National Laboratories  Scalable Systems Integration   |
+--+


# x and y are of SparseM class matrix.csr
aggregate.csr -
function(x, fac) {
 # make a vector indicating the row of each nonzero
 rows - integer(length=length([EMAIL PROTECTED]))
 [EMAIL PROTECTED]:nrow(x)]] - 1 # put a 1 at start of each row
 rows - as.integer(cumsum(rows)) # and finish with a cumsum

 # make a vector indicating the column factor of each nonzero
 f - [EMAIL PROTECTED]

 # aggregate by row,f
 y - tapply([EMAIL PROTECTED], list(rows,f), sum)

 # sparsify it
 y[is.na(y)] - 0  # change tapply NAs to as.matrix.csr 0s
 y - as.matrix.csr(y)

 y
}

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.