[Rd] sample.int() algorithms

2015-09-09 Thread Nathan Kurz
I was experiencing a strange pattern of slowdowns when using
sample.int(), where sampling from one population would sometimes
take 1000x longer than taking the same number of samples from a
slightly larger population.   For my application, this resulted in a
runtime of several hours rather than a few seconds.  Looking into it,
I saw that sample.int() is hardcoded to switch algorithms when the
population is larger than 1e+7, and I was straddling this boundary:

sample.int <- function(n, size = n, replace = FALSE, prob = NULL)
{
    if (!replace && is.null(prob) && n > 1e7 && size <= n/2)
        .Internal(sample2(n, size))
    else .Internal(sample(n, size, replace, prob))
}

do_sample2() takes the approach of taking a sample, and then checking
if this sample is a duplicate.  As long as the population size is much
larger than the number of samples, this will be efficient.  This explains
the check for "size <= n/2".   But I'm not sure why the "n > 1e7"
check is needed.
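
For illustration, the duplicate-rejection idea described above can be
sketched in plain R. This is only a minimal sketch, assuming candidates
are drawn uniformly and duplicates are discarded and redrawn; the name
sample_by_rejection is hypothetical, and this is not the C code of
do_sample2().

```r
# Hypothetical illustration of sampling without replacement by rejection.
# Not R's internal implementation; the function name is an assumption.
sample_by_rejection <- function(n, size) {
    stopifnot(size <= n)
    result <- numeric(0)
    while (length(result) < size) {
        # Draw as many candidates as are still missing, uniformly on 1..n.
        candidates <- floor(runif(size - length(result)) * n) + 1
        # Discard anything already drawn (or repeated within this batch).
        result <- unique(c(result, candidates))
    }
    result
}

set.seed(1)
x <- sample_by_rejection(1e8, 1000)
```

When n is much larger than size, the loop nearly always finishes in one
or two passes, and no vector of length n is ever allocated.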

I put together some sample code to show the difference in timing
between letting sample.int() choose the cutoff point and manually
specifying the use of do_sample2():

###  compare times for sample.int() vs internal function sample2()
compareSampleTimes = function(popSizeList=c(1e5, 1e6, 1e7, 1e8, 1e9),
                              sampleSizeList=c(10, 100, 1000, 1),
                              numReplications=1000) {
    for (sampleSize in sampleSizeList) {
        for (popSize in popSizeList)  {
            # let sample.int() pick the algorithm
            elapsed1 = system.time(replicate(numReplications,
                sample.int(popSize, sampleSize)))[["elapsed"]]
            # force the rejection-based sample2() regardless of popSize
            elapsed2 = system.time(replicate(numReplications,
                .Internal(sample2(popSize, sampleSize))))[["elapsed"]]
            cat(sprintf("Sample %d from %.0e: %f vs %f seconds\n",
                        sampleSize, popSize, elapsed1, elapsed2))
        }
        cat("\n")
    }
}

compareSampleTimes()

https://gist.github.com/nkurz/8fa6ff3772a054294531

And here's the output showing the 1000x slowdowns at a population size
of 1e7 under R-3.2.2:

$ Rscript compareSampleTimes.R
Sample 10 from 1e+05: 0.133000 vs 0.003000 seconds
Sample 10 from 1e+06: 0.931000 vs 0.003000 seconds
Sample 10 from 1e+07: 13.190000 vs 0.003000 seconds
Sample 10 from 1e+08: 0.004000 vs 0.003000 seconds
Sample 10 from 1e+09: 0.004000 vs 0.002000 seconds

Sample 100 from 1e+05: 0.180000 vs 0.007000 seconds
Sample 100 from 1e+06: 0.908000 vs 0.006000 seconds
Sample 100 from 1e+07: 13.161000 vs 0.007000 seconds
Sample 100 from 1e+08: 0.007000 vs 0.006000 seconds
Sample 100 from 1e+09: 0.007000 vs 0.006000 seconds

Sample 1000 from 1e+05: 0.194000 vs 0.057000 seconds
Sample 1000 from 1e+06: 1.084000 vs 0.049000 seconds
Sample 1000 from 1e+07: 13.226000 vs 0.049000 seconds
Sample 1000 from 1e+08: 0.047000 vs 0.046000 seconds
Sample 1000 from 1e+09: 0.048000 vs 0.047000 seconds

Sample 1 from 1e+05: 0.414000 vs 0.712000 seconds
Sample 1 from 1e+06: 1.100000 vs 0.453000 seconds
Sample 1 from 1e+07: 14.882000 vs 0.443000 seconds
Sample 1 from 1e+08: 0.448000 vs 0.443000 seconds
Sample 1 from 1e+09: 0.445000 vs 0.443000 seconds

Since my usage involves taking samples of 1K from populations of about
10M, the do_sample2() approach is the clear winner: .05 seconds vs 13
seconds.  I tested on a couple machines, and got similar results on
both. Would there be a downside to having sample.int() use the faster
do_sample2() approach for populations of all sizes whenever the ratio
of population size to sample size is large?
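
As a rough way to quantify "large ratio": if candidates are drawn one at
a time and duplicates are redrawn, the expected number of draws needed
for size distinct values out of n is the sum of n/(n - i) for i from 0
to size - 1. The sketch below computes that; the name expected_draws is
hypothetical, and the formula assumes the one-at-a-time rejection scheme
described above rather than any detail of the C code.

```r
# Expected number of uniform draws needed to collect `size` distinct
# values from 1..n when duplicates are rejected and redrawn.
# Hypothetical helper for illustration only.
expected_draws <- function(n, size) {
    stopifnot(size <= n)
    i <- seq_len(size) - 1
    sum(n / (n - i))
}

small_ratio <- expected_draws(1e7, 1000) / 1000   # overhead per sample
half_ratio  <- expected_draws(1e6, 5e5) / 5e5     # at the size = n/2 limit
cat(sprintf("draws per sample: %.6f (1K of 10M), %.3f (n/2 of 1M)\n",
            small_ratio, half_ratio))
```

For 1K samples from 10M the overhead is a tiny fraction of one percent,
while at size = n/2 it approaches 2 * log(2), about 1.39 draws per
sample, which is consistent with the "size <= n/2" guard.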

--nate

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Build R with MKL and ICC

2015-09-09 Thread Nathan Kurz
As a short and simple approach, I just compiled the current R release
on Ubuntu with ICC and MKL using just this:

$ tar -xzf R-3.2.2.tar.gz
$ cd R-3.2.2
$ CC=icc CXX=icpc AR=xiar LD=xild CFLAGS="-g -O3 -xHost" \
    CXXFLAGS="-g -O3 -xHost" ./configure --with-blas="-lmkl_rt -lpthread" \
    --with-lapack --enable-memory-profiling --enable-R-shlib
$ make
$ sudo make install
$ R --version
R version 3.2.2 (2015-08-14) -- "Fire Safety"

If you have 'ifort' available, you would probably want to add it to
the list of environment variables.
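
After installing, a quick sanity check from within R can confirm that
dense linear algebra works and give a rough feel for its speed. This is
only a sketch: the matrix size is arbitrary, the timing depends entirely
on the machine, and it does not prove which BLAS library is actually
linked.

```r
# Exercise the BLAS through a dense matrix product and time it.
set.seed(42)
n <- 500
a <- matrix(rnorm(n * n), n, n)
elapsed <- system.time(b <- a %*% a)[["elapsed"]]

# Cross-check one entry against an explicit dot product.
check <- sum(a[1, ] * a[, 1])
stopifnot(isTRUE(all.equal(b[1, 1], check)))

cat(sprintf("%d x %d matrix product: %.3f seconds\n", n, n, elapsed))
```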

--nate



[Rd] Profiling function that contains both C++ and Fortran Code

2015-09-09 Thread Julian Karch

Hello,

I am trying to profile a function of OpenMx 
(http://openmx.psyc.virginia.edu) for CPU time. My operating system is 
OS X 10.10. OpenMx contains C++ and Fortran code. I have read the 
section regarding profiling compiled code in the manual 
(https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Profiling-compiled-code). 
This section and this post (http://blog.fellstat.com/?p=337) led me to 
try Instruments. Here is what I did:


- Opened Instruments
- Chose the Time Profiler template
- Pressed Record
- Started my script using RStudio

The output of Instruments looks like this: 
http://i.stack.imgur.com/aKIQm.jpg. The command line tool "sample" 
returns the same output.


The problem is that it looks as if "omxunsafedgemm_", the function that 
consumes the vast majority of the time, is called directly from 
the main thread. However, this is a low-level Fortran function. It is 
always called by a C++ function called "omxDGEMM". In this example 
"omxDGEMM" is first called by "omxCallRamExpection" (so almost at the 
bottom of the call tree). The total time of "omxDGEMM" is 0. Thus, the 
profiling information is currently useless.


In the original version of the package "omxDGEMM" is defined as inline. 
I changed this in the hope that it would resolve the issue. This was not 
the case. "omxunsafedgemm" is called by "omxDGEMM" like this:


F77_CALL(omxunsafedgemm)(&transa, &transb,
                         &(nrow), &(ncol), &(nmid),
                         &alpha, a->data, &(a->leading),
                         b->data, &(b->leading), &beta,
                         result->data, &(result->leading));


Any ideas how to obtain a sensible profiler output?


Best,

Julian Karch



Re: [Rd] Build R from source - manuals

2015-09-09 Thread arnaud gaboury
On Wed, Sep 9, 2015 at 9:11 AM, Prof Brian Ripley wrote:
> On 09/09/2015 07:40, arnaud gaboury wrote:
>>
>> I built R from source successfully on my Fedora 22 box. No errors.
>
>
> Which version of R?
3.2.1

>>
>>
>> I read that there is an issue with some manuals at build time when
>> running makeinfo, especially these two:
>> doc/manual/R-exts.texi
>> doc/manual/R-intro.texi
>> Some distros have hacks for makeinfo 5 in their build scripts.
>>
>> I wonder whether some manuals are broken without my noticing it when running make.
>>
>>
>> Could someone tell me more about this issue and what I can do to make
>> sure these manuals are correctly built?
>
>
> You are the one claiming there is an issue, so the onus is on you to tell
> us.

Please find below excerpts from two build scripts for R 3.2.1:

* Archlinux:
..
  # fix for texinfo 5.X
  sed -i 's|test ${makeinfo_version_min} -lt 7|test ${makeinfo_version_min} -lt 0|' configure

* Fedora 22:
.
%if 0%{?fedora} >= 19
# What a hack.
# Current texinfo doesn't like @eqn. Use @math instead where stuff breaks.
cp doc/manual/R-exts.texi doc/manual/R-exts.texi.spot
cp doc/manual/R-intro.texi doc/manual/R-intro.texi.spot
sed -i 's|@eqn|@math|g' doc/manual/R-exts.texi
sed -i 's|@eqn|@math|g' doc/manual/R-intro.texi
%endif
%if %{texi2any}
make MAKEINFO=texi2any info
%else
make MAKEINFO=makeinfo info
%endif

# Convert to UTF-8
for i in doc/manual/R-intro.info doc/manual/R-FAQ.info doc/FAQ \
    doc/manual/R-admin.info doc/manual/R-exts.info-1; do
  iconv -f iso-8859-1 -t utf-8 -o $i{.utf8,}
  mv $i{.utf8,}
done

%install
make DESTDIR=${RPM_BUILD_ROOT} install install-info
# And now, undo the hack. :P
%if 0%{?fedora} >= 19
mv doc/manual/R-exts.texi.spot doc/manual/R-exts.texi
mv doc/manual/R-intro.texi.spot doc/manual/R-intro.texi
%endif
make DESTDIR=${RPM_BUILD_ROOT} install-pdf



As seen above, both scripts contain hacks.

Again, building R from the CRAN source on my Linux box works. My worry
is being left with broken manuals, hence the wish to verify that
everything is correctly built.
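
One rough first check is to confirm from within R that the manual files
exist and are non-empty. The sketch below assumes HTML manuals are
installed under R.home("doc")/manual, which may not hold for every
distribution's layout; the function name check_manuals is hypothetical,
and a non-empty file does not prove that its content rendered correctly.

```r
# Report which manual files exist and are non-empty.
# Hypothetical helper; directory layout and file names are assumptions.
check_manuals <- function(manual_dir = file.path(R.home("doc"), "manual"),
                          manuals = c("R-exts.html", "R-intro.html")) {
    paths <- file.path(manual_dir, manuals)
    sizes <- file.size(paths)        # NA for files that do not exist
    ok <- !is.na(sizes) & sizes > 0
    names(ok) <- manuals
    ok
}

print(check_manuals())
```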

Thank you

>
> Recent versions of R work with makeinfo 5.1/2 (5.0 is broken) or report that
> makeinfo is not available.  And versions released after 6.0 work with 6.0.
>
>
> --
> Brian D. Ripley,  rip...@stats.ox.ac.uk
> Emeritus Professor of Applied Statistics, University of Oxford
> 1 South Parks Road, Oxford OX1 3TG, UK



-- 

google.com/+arnaudgabourygabx



Re: [Rd] Build R from source - manuals

2015-09-09 Thread Prof Brian Ripley

On 09/09/2015 07:40, arnaud gaboury wrote:
>
> I built R from source successfully on my Fedora 22 box. No errors.

Which version of R?

> I read that there is an issue with some manuals at build time when
> running makeinfo, especially these two:
> doc/manual/R-exts.texi
> doc/manual/R-intro.texi
> Some distros have hacks for makeinfo 5 in their build scripts.
>
> I wonder whether some manuals are broken without my noticing it when running make.
>
> Could someone tell me more about this issue and what I can do to make
> sure these manuals are correctly built?

You are the one claiming there is an issue, so the onus is on you to 
tell us.

Recent versions of R work with makeinfo 5.1/2 (5.0 is broken) or report 
that makeinfo is not available.  And versions released after 6.0 work 
with 6.0.


--
Brian D. Ripley,  rip...@stats.ox.ac.uk
Emeritus Professor of Applied Statistics, University of Oxford
1 South Parks Road, Oxford OX1 3TG, UK
