[Rd] sample.int() algorithms
I was experiencing a strange pattern of slowdowns when using sample.int(), where sampling from one population would sometimes take 1000x longer than taking the same number of samples from a slightly larger population. For my application, this resulted in a runtime of several hours rather than a few seconds.

Looking into it, I saw that sample.int() is hardcoded to switch algorithms when the population is larger than 1e+7, and I was straddling this boundary:

sample.int <- function(n, size = n, replace = FALSE, prob = NULL)
{
    if (!replace && is.null(prob) && n > 1e7 && size <= n/2)
        .Internal(sample2(n, size))
    else .Internal(sample(n, size, replace, prob))
}

do_sample2() takes the approach of taking a sample, and then checking if this sample is a duplicate. As long as the population size is much larger than the number of samples, this will be efficient. This explains the check for "size <= n/2". But I'm not sure why the "n > 1e7" check is needed.

I put together some sample code to show the difference in timing between letting sample.int() choose the cutoff point and manually specifying the use of do_sample2():

### compare times for sample.int() vs internal function sample2()
compareSampleTimes = function(popSizeList=c(1e5, 1e6, 1e7, 1e8, 1e9),
                              sampleSizeList=c(10, 100, 1000, 1),
                              numReplications=1000) {
    for (sampleSize in sampleSizeList) {
        for (popSize in popSizeList) {
            elapsed1 = system.time(replicate(numReplications,
                sample.int(popSize, sampleSize)))[["elapsed"]]
            elapsed2 = system.time(replicate(numReplications,
                .Internal(sample2(popSize, sampleSize))))[["elapsed"]]
            cat(sprintf("Sample %d from %.0e: %f vs %f seconds\n",
                        sampleSize, popSize, elapsed1, elapsed2))
        }
        cat("\n")
    }
}

compareSampleTimes()

https://gist.github.com/nkurz/8fa6ff3772a054294531

And here's the output showing the 1000x slowdowns at population sizes of 1e7 under R-3.2.2:

$ Rscript compareSampleTimes.R
Sample 10 from 1e+05: 0.133000 vs 0.003000 seconds
Sample 10 from 1e+06: 0.931000 vs 0.003000 seconds
Sample 10 from 1e+07: 13.19 vs 0.003000 seconds
Sample 10 from 1e+08: 0.004000 vs 0.003000 seconds
Sample 10 from 1e+09: 0.004000 vs 0.002000 seconds

Sample 100 from 1e+05: 0.18 vs 0.007000 seconds
Sample 100 from 1e+06: 0.908000 vs 0.006000 seconds
Sample 100 from 1e+07: 13.161000 vs 0.007000 seconds
Sample 100 from 1e+08: 0.007000 vs 0.006000 seconds
Sample 100 from 1e+09: 0.007000 vs 0.006000 seconds

Sample 1000 from 1e+05: 0.194000 vs 0.057000 seconds
Sample 1000 from 1e+06: 1.084000 vs 0.049000 seconds
Sample 1000 from 1e+07: 13.226000 vs 0.049000 seconds
Sample 1000 from 1e+08: 0.047000 vs 0.046000 seconds
Sample 1000 from 1e+09: 0.048000 vs 0.047000 seconds

Sample 1 from 1e+05: 0.414000 vs 0.712000 seconds
Sample 1 from 1e+06: 1.10 vs 0.453000 seconds
Sample 1 from 1e+07: 14.882000 vs 0.443000 seconds
Sample 1 from 1e+08: 0.448000 vs 0.443000 seconds
Sample 1 from 1e+09: 0.445000 vs 0.443000 seconds

Since my usage involves taking samples of 1K from populations of about 10M, the do_sample2() approach is the clear winner: 0.05 seconds vs 13 seconds. I tested on a couple of machines, and got similar results on both.

Would there be a downside to having sample.int() use the faster do_sample2() approach for populations of all sizes whenever the ratio of population size to sample size is large?

--nate

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
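For readers following the thread, the "take a sample, then check for duplicates" idea described above can be sketched in plain R. This is only an illustration under stated assumptions: sample_by_rejection() is a hypothetical name, it uses unique() for the duplicate check, and it is not the C code behind do_sample2().

```r
# Hypothetical pure-R sketch of sampling without replacement by rejection.
# Not R's internal implementation; names and structure are illustrative only.
sample_by_rejection <- function(n, size) {
  # Mirror the "size <= n/2" guard: rejection is only cheap when
  # duplicates are unlikely, i.e. the sample is small relative to n.
  stopifnot(n >= 1, size >= 0, size <= n / 2)
  chosen <- integer(0)
  while (length(chosen) < size) {
    # Draw exactly as many candidates as are still missing, WITH
    # replacement, so the cost depends on size rather than on n.
    candidates <- sample.int(n, size - length(chosen), replace = TRUE)
    # Discard any candidate that repeats an earlier value; unique()
    # keeps first occurrences, so draw order is preserved.
    chosen <- unique(c(chosen, candidates))
  }
  chosen
}

set.seed(42)
x <- sample_by_rejection(1e7, 1000)
length(x)
```

Because each pass draws only the values still missing, the amount of work tracks the sample size when duplicates are rare, which is consistent with the second timing column above staying flat as the population grows.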
Re: [Rd] Build R with MKL and ICC
As a short and simple approach, I just compiled the current R release on Ubuntu with ICC and MKL using just this:

$ tar -xzf R-3.2.2.tar.gz
$ cd R-3.2.2
$ CC=icc CXX=icpc AR=xiar LD=xild \
    CFLAGS="-g -O3 -xHost" CXXFLAGS="-g -O3 -xHost" \
    ./configure --with-blas="-lmkl_rt -lpthread" --with-lapack \
    --enable-memory-profiling --enable-R-shlib
$ make
$ sudo make install
$ R --version
R version 3.2.2 (2015-08-14) -- "Fire Safety"

If you have 'ifort' available, you would probably want to add it to the list of environment variables.

--nate
[Rd] Profiling function that contains both C++ and Fortran Code
Hello,

I am trying to profile a function of OpenMx (http://openmx.psyc.virginia.edu) for CPU time. My operating system is OS X 10.10. OpenMx contains C++ and Fortran code. I have read the section regarding profiling compiled code in the manual (https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Profiling-compiled-code). This section and this post (http://blog.fellstat.com/?p=337) led me to try Instruments. Here is what I did:

-Opened Instruments
-Chose the Time Profiler Template
-Pressed Record
-Started my script using RStudio

The output of Instruments looks like this: http://i.stack.imgur.com/aKIQm.jpg. The command line tool "sample" returns the same output.

The problem is that "omxunsafedgemm_", the function that consumes the vast majority of the time, appears to be called directly from the Main Thread. However, this is a low level Fortran function. It is always called by a C++ function called "omxDGEMM". In this example "omxDGEMM" is first called by "omxCallRamExpection" (so almost at the bottom of the call tree). The total time reported for "omxDGEMM" is 0. Thus, the profiling information is currently useless.

In the original version of the package "omxDGEMM" is defined as inline. I changed this in the hope that it would resolve the issue. This was not the case.

"omxunsafedgemm" is called by "omxDGEMM" like this:

F77_CALL(omxunsafedgemm)(&transa, &transb, &(nrow), &(ncol), &(nmid),
                         &alpha, a->data, &(a->leading),
                         b->data, &(b->leading), &beta,
                         result->data, &(result->leading));

Any ideas how to obtain a sensible profiler output?

Best,
Julian Karch
Re: [Rd] Build R from source - manuals
On Wed, Sep 9, 2015 at 9:11 AM, Prof Brian Ripley wrote:
> On 09/09/2015 07:40, arnaud gaboury wrote:
>>
>> I built R from source successfully on my Fedora 22 box. No errors.
>
> Which version of R?

3-2-1

>> I can read there is an issue with some manuals at build time when
>> running makeinfo, especially these two:
>> doc/manual/R-exts.texi
>> doc/manual/R-intro.texi
>> Some distros have hacks about makeinfo 5 in their build script.
>>
>> I wonder if some manuals are broken but couldn't see it when running make.
>>
>> Could someone tell me more about this issue and what I can do to make
>> sure these manuals are correctly built?
>
> You are the one claiming there is an issue, so the onus is on you to tell
> us.

Please find below excerpts from two build scripts for R 3-2-1.

* Archlinux:

..
# fix for texinfo 5.X
sed -i 's|test ${makeinfo_version_min} -lt 7|test ${makeinfo_version_min} -lt 0|' configure

* FEDORA 22

.
%if 0%{?fedora} >= 19
# What a hack.
# Current texinfo doesn't like @eqn. Use @math instead where stuff breaks.
cp doc/manual/R-exts.texi doc/manual/R-exts.texi.spot
cp doc/manual/R-intro.texi doc/manual/R-intro.texi.spot
sed -i 's|@eqn|@math|g' doc/manual/R-exts.texi
sed -i 's|@eqn|@math|g' doc/manual/R-intro.texi
%endif
%if %{texi2any}
make MAKEINFO=texi2any info
%else
make MAKEINFO=makeinfo info
%endif

# Convert to UTF-8
for i in doc/manual/R-intro.info doc/manual/R-FAQ.info doc/FAQ doc/manual/R-admin.info doc/manual/R-exts.info-1; do
  iconv -f iso-8859-1 -t utf-8 -o $i{.utf8,}
  mv $i{.utf8,}
done

%install
make DESTDIR=${RPM_BUILD_ROOT} install install-info
# And now, undo the hack. :P
%if 0%{?fedora} >= 19
mv doc/manual/R-exts.texi.spot doc/manual/R-exts.texi
mv doc/manual/R-intro.texi.spot doc/manual/R-intro.texi
%endif
make DESTDIR=${RPM_BUILD_ROOT} install-pdf

As seen above, these two scripts contain hacks. Again, building R from the CRAN source on my Linux box works fine. My worry is being left with broken manuals without noticing, hence the wish to verify that everything is built correctly.

Thank you

> Recent versions of R work with makeinfo 5.1/2 (5.0 is broken) or report that
> makeinfo is not available. And versions released after 6.0 work with 6.0.
>
> --
> Brian D. Ripley, rip...@stats.ox.ac.uk
> Emeritus Professor of Applied Statistics, University of Oxford
> 1 South Parks Road, Oxford OX1 3TG, UK

--
google.com/+arnaudgabourygabx
Re: [Rd] Build R from source - manuals
On 09/09/2015 07:40, arnaud gaboury wrote:
> I built R from source successfully on my Fedora 22 box. No errors.

Which version of R?

> I can read there is an issue with some manuals at build time when
> running makeinfo, especially these two:
> doc/manual/R-exts.texi
> doc/manual/R-intro.texi
> Some distros have hacks about makeinfo 5 in their build script.
>
> I wonder if some manuals are broken but couldn't see it when running make.
>
> Could someone tell me more about this issue and what I can do to make
> sure these manuals are correctly built?

You are the one claiming there is an issue, so the onus is on you to tell us.

Recent versions of R work with makeinfo 5.1/2 (5.0 is broken) or report that makeinfo is not available. And versions released after 6.0 work with 6.0.

--
Brian D. Ripley, rip...@stats.ox.ac.uk
Emeritus Professor of Applied Statistics, University of Oxford
1 South Parks Road, Oxford OX1 3TG, UK