Re: [Rd] Performance of .C and .Call functions vs. native R code

2011-07-19 Thread Douglas Bates
On Thu, Jul 14, 2011 at 10:21 AM, Alireza Mahani
alireza.s.mah...@gmail.com wrote:
 (I am using a LINUX machine)

 Jeff,

 In creating reproducible results, I 'partially' answered my question. I have
 attached two scripts, 'mvMultiply.r' and 'mvMultiply.cc'. Please copy both
 files into your chosen directory, then run 'Rscript mvMultiply.r' in that
 directory while changing the two boolean parameters 'INCLUDE_DATAPREP' and
 'ROWMAJOR' to all four permutations. (The variable 'diffVec' is there to
 verify that the two methods produce the same output vector.)

 Below are the results that I get, along with discussion (tR and tCall are in
 sec):

 INCLUDE_DATAPREP,ROWMAJOR,tR,tCall
 F,F,13.536,13.875
 F,T,13.824,14.299
 T,F,13.688,18.167
 T,T,13.982,30.730

 Interpretation: The execution time for the .Call line is nearly identical to
 the call to R operator '%*%'. Two data preparation lines for the .Call
 method create the overhead:

 A - t(A) (~12sec, or 12usec per call)
 dim(A) - dim(A)[1] * dim(A)[2] (~4sec, or 4usec per call)

 While the first line can be avoided by providing options in c++ function (as
 is done in the BLAS API), I wonder if the second line can be avoided, aside
 from the obvious option of rewriting the R scripts to use vectors instead of
 matrices. But this defies one of the primary advantages of using R, which is
 succinct, high-level coding. In particular, if one has several matrices as
 input into a .Call function, then the overhead from matrix-to-vector
 transformations can add up. To summarize, my questions are:

 1- Do the above results seem reasonable to you? Is there a similar penalty
 in R's '%*%' operator for transforming matrices to vectors before calling
 BLAS functions?
 2- Are there techniques for reducing the above overhead for developers
 looking to augment their R code with compiled code?

 Regards,
 Alireza

 ---
 # mvMultiply.r
 # comparing performance of matrix multiplication in R (using '%*%' operator)
 vs. calling compiled code (using .Call function)
 # y [m x 1] = A [m x n] %*% x [n x 1]

 rm(list = ls())
 system(R CMD SHLIB mvMultiply.cc)
 dyn.load(mvMultiply.so)

 INCLUDE_DATAPREP - F
 ROWMAJOR - F #indicates whether the c++ function treats A in a row-major or
 column-major fashion

 m - 100
 n - 10
 N - 100

 diffVec - array(0, dim = N)
 tR - 0.0
 tCall - 0.0
 for (i in 1:N) {

        A - runif(m*n); dim(A) - c(m,n)
        x - runif(n)

        t1 - proc.time()[3]
        y1 - A %*% x
        tR - tR + proc.time()[3] - t1

        if (INCLUDE_DATAPREP) {
                t1 - proc.time()[3]
        }
        if (ROWMAJOR) { #since R will convert matrix to vector in a 
 column-major
 fashion, if the c++ function expects a row-major format, we need to
 transpose A before converting it to a vector
                A - t(A)
        }
        dim(A) - dim(A)[1] * dim(A)[2]
        if (!INCLUDE_DATAPREP) {
                t1 - proc.time()[3]
        }
        y2 - .Call(matvecMultiply, as.double(A), as.double(x),
 as.logical(c(ROWMAJOR)))
        tCall - tCall + proc.time()[3] - t1

        diffVec[i] - max(abs(y2 - y1))
 }
 cat(Data prep time for '.Call' included: , INCLUDE_DATAPREP, \n)
 cat(C++ function expects row-major matrix: , ROWMAJOR, \n)
 cat(Time - Using '%*%' operator in R: , tR, sec\n)
 cat(Time - Using '.Call' function: , tCall, sec\n)
 cat(Maximum difference between methods: , max(diffVec), \n)

 dyn.unload(mvMultiply.so)
 ---
 # mvMultiply.cc
 #include Rinternals.h
 #include R.h

 extern C {

 SEXP matvecMultiply(SEXP A, SEXP x, SEXP rowmajor) {
        double *rA = REAL(A), *rx = REAL(x), *ry;
        int *rrm = LOGICAL(rowmajor);
        int n = length(x);
        int m = length(A) / n;
        SEXP y;
        PROTECT(y = allocVector(REALSXP, m));
        ry = REAL(y);
        for (int i = 0; i  m; i++) {
                ry[i] = 0.0;
                for (int j = 0; j  n; j++) {
                        if (rrm[0] == 1) {
                                ry[i] += rA[i * n + j] * rx[j];
                        } else {
                                ry[i] += rA[j * m + i] * rx[j];
                        }
                }
        }
        UNPROTECT(1);
        return(y);
 }

 }


I realize that you are just beginning to use the .C and .Call
interfaces and your example is therefore a simple one.  However, if
you plan to continue with such development it is worthwhile learning
of some of the tools available.  I think one of the most important is
the inline package that can take a C or C++ code segment as a text
string and go through all the steps of creating and loading a
.Call'able compiled function.

Second, if you are going to use C++ (the code you show could be C code
as it doesn't use any C++ extensions) then you should look at the Rcpp
package written by Dirk Eddelbuettel and Romain Francois which allows
for comparatively painless interfacing of R objects and C++ objects.

Re: [Rd] Performance of .C and .Call functions vs. native R code

2011-07-19 Thread Douglas Bates
I just saw that I left a syntax error in the .R and the first
_Rout.txt files.  Notice that in the second _Rout.txt file the order
of the arguments in the constructors for the MMatrixXd and the
MVectorXd are in a different order than in the .R and the first
_Rout.txt files.  The correct order has the pointer first, then the
dimensions.  For the first _Rout.txt file this part of the code is not
used.

On Tue, Jul 19, 2011 at 10:00 AM, Douglas Bates ba...@stat.wisc.edu wrote:
 On Thu, Jul 14, 2011 at 10:21 AM, Alireza Mahani
 alireza.s.mah...@gmail.com wrote:
 (I am using a LINUX machine)

 Jeff,

 In creating reproducible results, I 'partially' answered my question. I have
 attached two scripts, 'mvMultiply.r' and 'mvMultiply.cc'. Please copy both
 files into your chosen directory, then run 'Rscript mvMultiply.r' in that
 directory while changing the two boolean parameters 'INCLUDE_DATAPREP' and
 'ROWMAJOR' to all four permutations. (The variable 'diffVec' is there to
 verify that the two methods produce the same output vector.)

 Below are the results that I get, along with discussion (tR and tCall are in
 sec):

 INCLUDE_DATAPREP,ROWMAJOR,tR,tCall
 F,F,13.536,13.875
 F,T,13.824,14.299
 T,F,13.688,18.167
 T,T,13.982,30.730

 Interpretation: The execution time for the .Call line is nearly identical to
 the call to R operator '%*%'. Two data preparation lines for the .Call
 method create the overhead:

 A - t(A) (~12sec, or 12usec per call)
 dim(A) - dim(A)[1] * dim(A)[2] (~4sec, or 4usec per call)

 While the first line can be avoided by providing options in c++ function (as
 is done in the BLAS API), I wonder if the second line can be avoided, aside
 from the obvious option of rewriting the R scripts to use vectors instead of
 matrices. But this defies one of the primary advantages of using R, which is
 succinct, high-level coding. In particular, if one has several matrices as
 input into a .Call function, then the overhead from matrix-to-vector
 transformations can add up. To summarize, my questions are:

 1- Do the above results seem reasonable to you? Is there a similar penalty
 in R's '%*%' operator for transforming matrices to vectors before calling
 BLAS functions?
 2- Are there techniques for reducing the above overhead for developers
 looking to augment their R code with compiled code?

 Regards,
 Alireza

 ---
 # mvMultiply.r
 # comparing performance of matrix multiplication in R (using '%*%' operator)
 vs. calling compiled code (using .Call function)
 # y [m x 1] = A [m x n] %*% x [n x 1]

 rm(list = ls())
 system(R CMD SHLIB mvMultiply.cc)
 dyn.load(mvMultiply.so)

 INCLUDE_DATAPREP - F
 ROWMAJOR - F #indicates whether the c++ function treats A in a row-major or
 column-major fashion

 m - 100
 n - 10
 N - 100

 diffVec - array(0, dim = N)
 tR - 0.0
 tCall - 0.0
 for (i in 1:N) {

        A - runif(m*n); dim(A) - c(m,n)
        x - runif(n)

        t1 - proc.time()[3]
        y1 - A %*% x
        tR - tR + proc.time()[3] - t1

        if (INCLUDE_DATAPREP) {
                t1 - proc.time()[3]
        }
        if (ROWMAJOR) { #since R will convert matrix to vector in a 
 column-major
 fashion, if the c++ function expects a row-major format, we need to
 transpose A before converting it to a vector
                A - t(A)
        }
        dim(A) - dim(A)[1] * dim(A)[2]
        if (!INCLUDE_DATAPREP) {
                t1 - proc.time()[3]
        }
        y2 - .Call(matvecMultiply, as.double(A), as.double(x),
 as.logical(c(ROWMAJOR)))
        tCall - tCall + proc.time()[3] - t1

        diffVec[i] - max(abs(y2 - y1))
 }
 cat(Data prep time for '.Call' included: , INCLUDE_DATAPREP, \n)
 cat(C++ function expects row-major matrix: , ROWMAJOR, \n)
 cat(Time - Using '%*%' operator in R: , tR, sec\n)
 cat(Time - Using '.Call' function: , tCall, sec\n)
 cat(Maximum difference between methods: , max(diffVec), \n)

 dyn.unload(mvMultiply.so)
 ---
 # mvMultiply.cc
 #include Rinternals.h
 #include R.h

 extern C {

 SEXP matvecMultiply(SEXP A, SEXP x, SEXP rowmajor) {
        double *rA = REAL(A), *rx = REAL(x), *ry;
        int *rrm = LOGICAL(rowmajor);
        int n = length(x);
        int m = length(A) / n;
        SEXP y;
        PROTECT(y = allocVector(REALSXP, m));
        ry = REAL(y);
        for (int i = 0; i  m; i++) {
                ry[i] = 0.0;
                for (int j = 0; j  n; j++) {
                        if (rrm[0] == 1) {
                                ry[i] += rA[i * n + j] * rx[j];
                        } else {
                                ry[i] += rA[j * m + i] * rx[j];
                        }
                }
        }
        UNPROTECT(1);
        return(y);
 }

 }


 I realize that you are just beginning to use the .C and .Call
 interfaces and your example is therefore a simple one.  However, if
 you plan to continue with such development it is worthwhile learning
 of some of the tools 

Re: [Rd] Performance of .C and .Call functions vs. native R code

2011-07-19 Thread Alireza Mahani
Prof. Bates,

It looks like you read my mind! I am working on writing an R package for
high-performance MCMC estimation of a class of Hierarchical Bayesian models
most often used in the field of quantitative marketing. This would
essentially be a parallelized version of Peter Rossi's bayesm package. While
I've made great progress in parallelizing the most mathematically difficult
part of the algorithm, namely slice sampling of low-level coefficients, yet
I've realized that putting the entire code together while minimizing bugs is
a big challenge in C/C++/CUDA environments. I have therefore decided to
follow a more logical path of first developing the code logic in R, and then
exporting it function by function to compiled code. 

The tools that you mentioned seem to be exactly the kind of stuff I need in
order to be able to do go through this incremental, test-oriented
development process with relatively little pain.

I'm not sure if this is what you had in mind while suggesting the tools to
me, so please let me know if I'm misinterpreting your comments, or if I need
to be aware of other tools beyond what you mentioned.

Many thanks,
Alireza


--
View this message in context: 
http://r.789695.n4.nabble.com/Performance-of-C-and-Call-functions-vs-native-R-code-tp3665017p3679056.html
Sent from the R devel mailing list archive at Nabble.com.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Performance of .C and .Call functions vs. native R code

2011-07-14 Thread Jeff Ryan
The .Call overhead isn't the issue. If you'd like some insight into what you 
are doing wrong (and right), you need to provide code for the list to reproduce 
your timings with.

This is outlined in the posting guide as well.

Best,
Jeff



On Jul 13, 2011, at 8:28 AM, asmahani alireza.s.mah...@gmail.com wrote:

 Hello,
 
 I am in the process of writing an R extension for parallelized MCMC, with
 heavy use of compiled code (C++). I have been getting my feet wet by
 implementing a simple matrix-vector multiplication function in C++ (which
 calls a BLAS level 2 function dgemv), and comparing it to the '%*%' operator
 in R (which apparently calls a BLAS level 3 function dgemm).
 
 Interestingly, I cannot replicate the performance of the R native operator,
 using either '.C' or '.Call'. The relative times are 17 (R), 30 (.C), and 26
 (.Call). In other words, R native operator is 1.5x faster than my compiled
 code. Can you explain to me why this is? Through testing I strongly suspect
 that the BLAS function itself isn't what takes the bulk part of the time,
 but perhaps data transfer and other overhead associated with the calls (.C
 and .Call) are the main issues. Are there any ways to reach the performance
 level of native R code in this case?
 
 Thank you,
 Alireza Mahani
 
 --
 View this message in context: 
 http://r.789695.n4.nabble.com/Performance-of-C-and-Call-functions-vs-native-R-code-tp3665017p3665017.html
 Sent from the R devel mailing list archive at Nabble.com.
 
 __
 R-devel@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-devel

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Performance of .C and .Call functions vs. native R code

2011-07-14 Thread Alireza Mahani
(I am using a LINUX machine)

Jeff,

In creating reproducible results, I 'partially' answered my question. I have
attached two scripts, 'mvMultiply.r' and 'mvMultiply.cc'. Please copy both
files into your chosen directory, then run 'Rscript mvMultiply.r' in that
directory while changing the two boolean parameters 'INCLUDE_DATAPREP' and
'ROWMAJOR' to all four permutations. (The variable 'diffVec' is there to
verify that the two methods produce the same output vector.)

Below are the results that I get, along with discussion (tR and tCall are in
sec):

INCLUDE_DATAPREP,ROWMAJOR,tR,tCall
F,F,13.536,13.875
F,T,13.824,14.299
T,F,13.688,18.167
T,T,13.982,30.730

Interpretation: The execution time for the .Call line is nearly identical to
the call to R operator '%*%'. Two data preparation lines for the .Call
method create the overhead: 

A - t(A) (~12sec, or 12usec per call)
dim(A) - dim(A)[1] * dim(A)[2] (~4sec, or 4usec per call)

While the first line can be avoided by providing options in c++ function (as
is done in the BLAS API), I wonder if the second line can be avoided, aside
from the obvious option of rewriting the R scripts to use vectors instead of
matrices. But this defies one of the primary advantages of using R, which is
succinct, high-level coding. In particular, if one has several matrices as
input into a .Call function, then the overhead from matrix-to-vector
transformations can add up. To summarize, my questions are:

1- Do the above results seem reasonable to you? Is there a similar penalty
in R's '%*%' operator for transforming matrices to vectors before calling
BLAS functions?
2- Are there techniques for reducing the above overhead for developers
looking to augment their R code with compiled code?

Regards,
Alireza

---
# mvMultiply.r
# comparing performance of matrix multiplication in R (using '%*%' operator)
vs. calling compiled code (using .Call function)
# y [m x 1] = A [m x n] %*% x [n x 1]

rm(list = ls())
system(R CMD SHLIB mvMultiply.cc)
dyn.load(mvMultiply.so)

INCLUDE_DATAPREP - F
ROWMAJOR - F #indicates whether the c++ function treats A in a row-major or
column-major fashion

m - 100
n - 10
N - 100

diffVec - array(0, dim = N)
tR - 0.0
tCall - 0.0
for (i in 1:N) {

A - runif(m*n); dim(A) - c(m,n)
x - runif(n)

t1 - proc.time()[3]
y1 - A %*% x
tR - tR + proc.time()[3] - t1

if (INCLUDE_DATAPREP) {
t1 - proc.time()[3]
}
if (ROWMAJOR) { #since R will convert matrix to vector in a column-major
fashion, if the c++ function expects a row-major format, we need to
transpose A before converting it to a vector
A - t(A)
}
dim(A) - dim(A)[1] * dim(A)[2]
if (!INCLUDE_DATAPREP) {
t1 - proc.time()[3]
}
y2 - .Call(matvecMultiply, as.double(A), as.double(x),
as.logical(c(ROWMAJOR)))
tCall - tCall + proc.time()[3] - t1

diffVec[i] - max(abs(y2 - y1))
}
cat(Data prep time for '.Call' included: , INCLUDE_DATAPREP, \n)
cat(C++ function expects row-major matrix: , ROWMAJOR, \n)
cat(Time - Using '%*%' operator in R: , tR, sec\n)
cat(Time - Using '.Call' function: , tCall, sec\n)
cat(Maximum difference between methods: , max(diffVec), \n)

dyn.unload(mvMultiply.so)
---
# mvMultiply.cc
#include Rinternals.h
#include R.h

extern C {

SEXP matvecMultiply(SEXP A, SEXP x, SEXP rowmajor) {
double *rA = REAL(A), *rx = REAL(x), *ry;
int *rrm = LOGICAL(rowmajor);
int n = length(x);
int m = length(A) / n;
SEXP y;
PROTECT(y = allocVector(REALSXP, m));
ry = REAL(y);
for (int i = 0; i  m; i++) {
ry[i] = 0.0;
for (int j = 0; j  n; j++) {
if (rrm[0] == 1) {
ry[i] += rA[i * n + j] * rx[j];
} else {
ry[i] += rA[j * m + i] * rx[j];
}
}
}
UNPROTECT(1);
return(y);
}

}


--
View this message in context: 
http://r.789695.n4.nabble.com/Performance-of-C-and-Call-functions-vs-native-R-code-tp3665017p3667896.html
Sent from the R devel mailing list archive at Nabble.com.

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Performance of .C and .Call functions vs. native R code

2011-07-14 Thread Gabriel Becker
On Thu, Jul 14, 2011 at 8:21 AM, Alireza Mahani
alireza.s.mah...@gmail.comwrote:

 (I am using a LINUX machine)

 Jeff,

 In creating reproducible results, I 'partially' answered my question. I
 have
 attached two scripts, 'mvMultiply.r' and 'mvMultiply.cc'. Please copy both
 files into your chosen directory, then run 'Rscript mvMultiply.r' in that
 directory while changing the two boolean parameters 'INCLUDE_DATAPREP' and
 'ROWMAJOR' to all four permutations. (The variable 'diffVec' is there to
 verify that the two methods produce the same output vector.)

 Below are the results that I get, along with discussion (tR and tCall are
 in
 sec):

 INCLUDE_DATAPREP,ROWMAJOR,tR,tCall
 F,F,13.536,13.875
 F,T,13.824,14.299
 T,F,13.688,18.167
 T,T,13.982,30.730

 Interpretation: The execution time for the .Call line is nearly identical
 to
 the call to R operator '%*%'. Two data preparation lines for the .Call
 method create the overhead:

 A - t(A) (~12sec, or 12usec per call)
 dim(A) - dim(A)[1] * dim(A)[2] (~4sec, or 4usec per call)



AFAIK R stores matrices as vectors internally anyway and the dims just tell
it the position of the various elements, so I'm not sure that second line is
needed at all.

I have attached a tiny piece of c code which verifies this.

The output I get from that is:

 dyn.load(/home/gmbecker/gabe/matvectest.so)
 vec = 1.1:8.1
 mat = matrix(vec, ncol = 4)
 .Call(R_MatVecTest, vec, mat, 8L)
[1] TRUE


Note if you create the matrix with byrow=TRUE they may not be the same.

Hope that helps,
Gabe


 While the first line can be avoided by providing options in c++ function
 (as
 is done in the BLAS API), I wonder if the second line can be avoided, aside
 from the obvious option of rewriting the R scripts to use evectors instead
 of
 matrices. But this defies one of the primary advantages of using R, which
 is
 succinct, high-level coding. In particular, if one has several matrices as
 input into a .Call function, then the overhead from matrix-to-vector
 transformations can add up. To summarize, my questions are:

 1- Do the above results seem reasonable to you? Is there a similar penalty
 in R's '%*%' operator for transforming matrices to vectors before calling
 BLAS functions?
 2- Are there techniques for reducing the above overhead for developers
 looking to augment their R code with compiled code?

 Regards,
 Alireza

 ---
 # mvMultiply.r
 # comparing performance of matrix multiplication in R (using '%*%'
 operator)
 vs. calling compiled code (using .Call function)
 # y [m x 1] = A [m x n] %*% x [n x 1]

 rm(list = ls())
 system(R CMD SHLIB mvMultiply.cc)
 dyn.load(mvMultiply.so)

 INCLUDE_DATAPREP - F
 ROWMAJOR - F #indicates whether the c++ function treats A in a row-major
 or
 column-major fashion

 m - 100
 n - 10
 N - 100

 diffVec - array(0, dim = N)
 tR - 0.0
 tCall - 0.0
 for (i in 1:N) {

A - runif(m*n); dim(A) - c(m,n)
x - runif(n)

t1 - proc.time()[3]
y1 - A %*% x
tR - tR + proc.time()[3] - t1

if (INCLUDE_DATAPREP) {
t1 - proc.time()[3]
}
if (ROWMAJOR) { #since R will convert matrix to vector in a
 column-major
 fashion, if the c++ function expects a row-major format, we need to
 transpose A before converting it to a vector
A - t(A)
}
dim(A) - dim(A)[1] * dim(A)[2]
if (!INCLUDE_DATAPREP) {
t1 - proc.time()[3]
}
y2 - .Call(matvecMultiply, as.double(A), as.double(x),
 as.logical(c(ROWMAJOR)))
tCall - tCall + proc.time()[3] - t1

diffVec[i] - max(abs(y2 - y1))
 }
 cat(Data prep time for '.Call' included: , INCLUDE_DATAPREP, \n)
 cat(C++ function expects row-major matrix: , ROWMAJOR, \n)
 cat(Time - Using '%*%' operator in R: , tR, sec\n)
 cat(Time - Using '.Call' function: , tCall, sec\n)
 cat(Maximum difference between methods: , max(diffVec), \n)

 dyn.unload(mvMultiply.so)
 ---
 # mvMultiply.cc
 #include Rinternals.h
 #include R.h

 extern C {

 SEXP matvecMultiply(SEXP A, SEXP x, SEXP rowmajor) {
double *rA = REAL(A), *rx = REAL(x), *ry;
int *rrm = LOGICAL(rowmajor);
int n = length(x);
int m = length(A) / n;
SEXP y;
PROTECT(y = allocVector(REALSXP, m));
ry = REAL(y);
for (int i = 0; i  m; i++) {
ry[i] = 0.0;
for (int j = 0; j  n; j++) {
if (rrm[0] == 1) {
ry[i] += rA[i * n + j] * rx[j];
} else {
ry[i] += rA[j * m + i] * rx[j];
}
}
}
UNPROTECT(1);
return(y);
 }

 }


 --
 View this message in context:
 http://r.789695.n4.nabble.com/Performance-of-C-and-Call-functions-vs-native-R-code-tp3665017p3667896.html
 Sent from the R devel mailing list archive at Nabble.com.