Re: [Rd] R with Array Hashes
Jeff,

Nice writeup and promising idea. From the "gimme numbers" department:

- do you pass the R regression tests?
- what sort of speedups do you see on which type of benchmarks?

When you asked about benchmark code on Twitter, I shared the somewhat well-known (but no R ...) http://benchmarksgame.alioth.debian.org/

Did you write new benchmarks? Did you try the ones once assembled by Simon?

Dirk

--
http://dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] R with Array Hashes
Hi,

I wanted to share with the mailing list members here details about the project I've been working on:

https://github.com/jeffreyhorner/R-Array-Hash

This is a re-implementation of R's hashed environments, the global variable cache, the global string cache and symbol table with cache-conscious array hash tables. The results are quite encouraging. However, the implementation is a big departure from R's API:

An array hash is a cache-conscious data structure that takes advantage of hardware prefetchers for improved performance on large hash tables, those large enough to fit in main memory and larger than fast fixed-size CPU caches.

However, their implementation is a radical departure from standard chained hash tables. Rather than using chains of hash buckets for collision resolution, array hashes use segments of contiguous memory called dynamic arrays to store keys and values. Adding and deleting items from the hash involves copying the entire segment to new areas in memory. While this may seem wasteful and slow, it's surprisingly efficient in both time and space.

In R, hashed environments are implemented using lists with each list element (a CONS cell) acting as the hash bucket. The CONS cell is the binding agent for a symbol and value. Hashed environments are searched using the pointer address of the symbol rather than the symbol's printed name.

R-Array-Hash takes advantage of this by implementing an integer array hash to store addresses of symbols and their associated values. Care is also taken to account for whether or not a binding is locked, active, etc.

Similarly, R-Array-Hash re-implements R's string cache using a string array hash. This introduces the most radical change to R's API: CHAR() no longer returns an address that points to the area at the end of the SEXP (containing the string value). Rather, it returns an address located in one of the contiguous dynamic arrays of the string hash table. Therefore, care must be taken in C code to use the address immediately, since additions and deletions to the string hash could render the result of CHAR() useless. There are many areas of the code that sidestep this by calling translateChar(), which has been changed to always copy the string pointed to by CHAR().

Comments, constructive or otherwise, are welcome.

Best,

Jeff
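To make the description above concrete, here is a minimal sketch of a string array hash. It is written in Python purely for illustration and is not the R-Array-Hash code: the class name `ArrayHash`, the record layout (2-byte key length, key bytes, 8-byte integer value) and the use of `zlib.crc32` as the hash function are all assumptions made for this example. Each slot holds one contiguous `bytes` segment that is scanned linearly, and every insertion or deletion builds a new segment rather than editing a chain of buckets.

```python
import struct
import zlib


class ArrayHash:
    """Toy array hash: each slot is one contiguous bytes segment.

    Record layout (an assumption for this sketch):
    2-byte key length, key bytes, 8-byte signed integer value.
    """

    def __init__(self, n_slots=8):
        self.slots = [b""] * n_slots

    def _slot(self, key):
        # crc32 stands in for whatever hash function a real table would use.
        return zlib.crc32(key) % len(self.slots)

    @staticmethod
    def _entries(segment):
        """Scan a segment front to back, yielding (start, key, value, end)."""
        pos = 0
        while pos < len(segment):
            (klen,) = struct.unpack_from("<H", segment, pos)
            key = segment[pos + 2:pos + 2 + klen]
            (value,) = struct.unpack_from("<q", segment, pos + 2 + klen)
            end = pos + 2 + klen + 8
            yield pos, key, value, end
            pos = end

    def get(self, key, default=None):
        for _, k, v, _ in self._entries(self.slots[self._slot(key)]):
            if k == key:
                return v
        return default

    def put(self, key, value):
        i = self._slot(key)
        old = self.slots[i]
        record = struct.pack("<H", len(key)) + key + struct.pack("<q", value)
        for start, k, _, end in self._entries(old):
            if k == key:
                # Drop the stale record before appending the new one.
                old = old[:start] + old[end:]
                break
        # The slot is rebound to a freshly built segment (a full copy).
        self.slots[i] = old + record

    def delete(self, key):
        i = self._slot(key)
        old = self.slots[i]
        for start, k, _, end in self._entries(old):
            if k == key:
                # Deletion also copies the surviving records to a new segment.
                self.slots[i] = old[:start] + old[end:]
                return True
        return False
```

Because `put` and `delete` always rebind a slot to a newly built segment, any position remembered from the previous segment no longer refers to live data. That is a rough analogue of the CHAR() caveat described above: a pointer into a dynamic array is only safe to use until the next addition or deletion.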
Re: [Bioc-devel] requirement for named assays in SummarizedExperiment
Hi Aaron,

Thanks for catching this. I favor enforcing names in 'assays'. Combining by position alone is too dangerous. I'm thinking of the VCF class where the genome information is stored in 'assays' and the fields are rarely in the same order.

Looks like we also need a more informative error message when names don't match.

    > assays(se1)
    List of length 1
    names(1): counts1

    > assays(se2)
    List of length 1
    names(1): counts2

    > cbind(se1, se2)
    Error in sQuote(accessorName) :
      argument "accessorName" is missing, with no default

Valerie

On 03/05/2015 11:09 PM, Aaron Lun wrote:
> Dear all,
>
> I stumbled upon some unexpected behaviour with cbind'ing SummarizedExperiment objects with unnamed assays:
>
> > require(GenomicRanges)
> > nrows <- 5; ncols <- 4
> > counts <- matrix(runif(nrows * ncols, 1, 1e4), nrows)
> > rowData <- GRanges("chr1", IRanges(1:nrows, 1:nrows))
> > colData <- DataFrame(Treatment=1:ncols, row.names=LETTERS[1:ncols])
> > sset <- SummarizedExperiment(counts, rowData=rowData, colData=colData)
> > sset
> class: SummarizedExperiment
> dim: 5 4
> exptData(0):
> assays(1): ''
> rownames: NULL
> rowData metadata column names(0):
> colnames(4): A B C D
> colData names(1): Treatment
> > cbind(sset, sset)
> dim: 5 8
> exptData(0):
> assays(0):
> rownames: NULL
> rowData metadata column names(0):
> colnames(8): A B ... C1 D1
> colData names(1): Treatment
>
> Upon cbind'ing, the assays in the SE object are lost. I think this is due to the fact that the cbind code matches up assays by their names. Thus, if there are no names, the code assumes that there are no assays.
>
> I guess this could be prevented by enforcing naming of assays in the SummarizedExperiment constructor. Or, the binding code could be modified to work positionally when there are no assay names, e.g., by cbind'ing the first assays across all SE objects, then the second assays, etc.
>
> Any thoughts?
>
> Regards,
>
> Aaron
>
> > sessionInfo()
> R Under development (unstable) (2014-12-14 r67167)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats4    parallel  stats     graphics  grDevices utils     datasets
> [8] methods   base
>
> other attached packages:
> [1] GenomicRanges_1.19.42 GenomeInfoDb_1.3.13   IRanges_2.1.41
> [4] S4Vectors_0.5.21      BiocGenerics_0.13.6
>
> loaded via a namespace (and not attached):
> [1] XVector_0.7.4

___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
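The option favored in this thread (require names, match assays by name rather than position, and fail with an informative message when names disagree) can be sketched as follows. This is a language-neutral illustration in Python, not the GenomicRanges code: the function `cbind_assays` and the representation of an experiment as a list of `(name, matrix)` pairs, with each matrix a list of rows, are hypothetical.

```python
def cbind_assays(*experiments):
    """Column-bind assays across experiments, matching them by name.

    Each experiment is a list of (name, matrix) pairs; a matrix is a list
    of rows. This representation is an assumption made for the sketch.
    """
    if not experiments:
        raise ValueError("nothing to combine")

    # Enforce the naming rule up front instead of silently dropping assays.
    for idx, exp in enumerate(experiments, start=1):
        names = [name for name, _ in exp]
        if any(not name for name in names):
            raise ValueError(f"experiment {idx} has unnamed assays")
        if len(set(names)) != len(names):
            raise ValueError(f"experiment {idx} has duplicated assay names")

    ref_names = [name for name, _ in experiments[0]]
    for idx, exp in enumerate(experiments[1:], start=2):
        names = [name for name, _ in exp]
        if sorted(names) != sorted(ref_names):
            raise ValueError(
                f"assay names differ: experiment 1 has {ref_names}, "
                f"experiment {idx} has {names}"
            )

    combined = []
    for name in ref_names:
        # Look each assay up by name, so the order within an experiment
        # does not matter.
        mats = [dict(exp)[name] for exp in experiments]
        n_rows = len(mats[0])
        if any(len(m) != n_rows for m in mats):
            raise ValueError(f"assay '{name}' has differing row counts")
        rows = [[x for m in mats for x in m[r]] for r in range(n_rows)]
        combined.append((name, rows))
    return combined
```

With this rule, two experiments whose assays are stored in different orders still combine correctly, while unnamed or mismatched assays raise an error that names the offending experiment instead of producing an object with zero assays.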
Re: [Rd] R with Array Hashes
On Fri, Mar 6, 2015 at 9:36 AM, Dirk Eddelbuettel e...@debian.org wrote:
> Jeff,
>
> Nice writeup and promising idea. From the gimme numbers department:
>
> - do you pass the R regression tests?

I made sure that the implementation passed 99% of the tests; however, there were two that gave differing results, which I think are related to traversing hashed environments.

> - what sort of speedups do you see on which type of benchmarks?

I wrote up some notes on the benchmark I conducted here:

https://github.com/jeffreyhorner/R-Array-Hash/tree/master/benchmarks

> When you asked about benchmark code on Twitter, I shared the somewhat well-known (but no R ...) http://benchmarksgame.alioth.debian.org/
>
> Did you write new benchmarks? Did you try the ones once assembled by Simon?

I decided to design the benchmark very close to the one I found in:

Askitis, Nikolas, and Justin Zobel. "Redesigning the string hash table, burst trie, and BST to exploit cache." Journal of Experimental Algorithmics (JEA) 15 (2010): 1-7.

It's a synthetic benchmark that just measures aspects of constructing and searching an R environment.
[Rd] Hyper-dual numbers in R
Hi,

Has anyone in R core thought about providing hyper-dual numbers in R? Hyper-dual (HD) numbers, invented by Jeffrey Fike at Stanford, are useful for computing exact second-order derivatives (e.g., the Hessian). HD numbers extend dual numbers in much the way quaternions extend complex numbers: each has 4 parts (one real and 3 non-real). They seem to be available in Julia. Obviously, HD numbers involve a lot more bookkeeping.

http://adl.stanford.edu/hyperdual/

Thanks,
Ravi
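For readers unfamiliar with the idea, here is a minimal sketch of hyper-dual arithmetic, written in Python for illustration only. The class `HyperDual` and the helper `second_derivative` are hypothetical names, and only addition and multiplication are implemented. A hyper-dual number is a + b*e1 + c*e2 + d*e1e2 with e1^2 = e2^2 = 0 and e1e2 nonzero, so evaluating f at x + e1 + e2 leaves f(x) in the real part, f'(x) in the e1 and e2 parts, and f''(x) in the e1e2 part.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HyperDual:
    """a + b*e1 + c*e2 + d*e1e2, with e1**2 == e2**2 == 0."""
    re: float
    e1: float = 0.0
    e2: float = 0.0
    e12: float = 0.0

    @staticmethod
    def _lift(x):
        # Promote plain numbers so mixed arithmetic works.
        return x if isinstance(x, HyperDual) else HyperDual(x)

    def __add__(self, other):
        o = HyperDual._lift(other)
        return HyperDual(self.re + o.re, self.e1 + o.e1,
                         self.e2 + o.e2, self.e12 + o.e12)

    __radd__ = __add__

    def __mul__(self, other):
        o = HyperDual._lift(other)
        # Terms containing e1*e1 or e2*e2 vanish; e1*e2 survives.
        return HyperDual(
            self.re * o.re,
            self.re * o.e1 + self.e1 * o.re,
            self.re * o.e2 + self.e2 * o.re,
            self.re * o.e12 + self.e1 * o.e2
            + self.e2 * o.e1 + self.e12 * o.re,
        )

    __rmul__ = __mul__


def second_derivative(f, x):
    """Return (f(x), f'(x), f''(x)) for f built from + and * only."""
    out = f(HyperDual(x, 1.0, 1.0, 0.0))
    return out.re, out.e1, out.e12
```

For example, `second_derivative(lambda x: x * x * x + 2 * x, 2.0)` evaluates the cubic and its first two derivatives in a single pass, with no step size to tune. Seeding two different variables, one with e1 and the other with e2, puts a mixed partial derivative in the e1e2 part, which is how the entries of a Hessian would be obtained.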
Re: [Bioc-devel] New(ish!) Seattle Bioconductor team member
I realized I sent this response the first time from the wrong email, so I don't believe it made it to the mailing list. Apologies if you receive this twice.

With regard to using covr with RUnit tests: covr does not depend on any particular testing framework; it simply runs any commands found in tests/. So assuming `BiocGenerics:::testPackage()` (or whatever you have in tests/) can properly find the tests to run from the package source root directory, it should work.

The bigger issue seems to be supporting S4 classes as used by Bioconductor. I have some support for tracking coverage of S4 classes, but Bioconductor packages leverage far more S4 features than I have tested so far, so you are likely to run into cases I have not encountered that break things. Currently, S4 coverage is experimental at best, I would say.

Thank you all for the warm welcome as well!

Jim

> I second the welcomings. And I am quite interested in covr, but I wonder what we have to do to get it to work with the RUnit-based conventions that we've followed so far, with [basefolder]/tests and [basefolder]/inst/unitTests?
>
> On Wed, Mar 4, 2015 at 6:03 PM, Henrik Bengtsson h...@biostat.ucsf.edu wrote:
> > On Wed, Mar 4, 2015 at 2:29 PM, Michael Lawrence lawrence.mich...@gene.com wrote:
> > > Welcome. For those who don't know, Jim is also the author of the neat lintr package, which checks your R code as you type, across multiple editors.
> > >
> > > https://github.com/jimhester/lintr
> >
> > Not to mention https://github.com/jimhester/covr - It only took me one round of 'covr' to become a test-coverage-oholic.
> >
> > Jim, great to have you on board.
> >
> > /Henrik
> >
> > > Michael
> > >
> > > On Wed, Mar 4, 2015 at 2:20 PM, Martin Morgan mtmor...@fredhutch.org wrote:
> > > > Let me take this belated opportunity to introduce Jim Hester jhes...@fredhutch.org to the Bioconductor developer community.
> > > >
> > > > Jim is working in the short term on SummarizedExperiment, including the refactoring efforts he introduced yesterday as well as coercion methods to and from ExpressionSet (an initial version from ExpressionSet to SummarizedExperiment is available in the development version of GenomicRanges; iterations will include coercion in the reverse direction as well as perhaps more 'clever' mapping between the probeset or gene names of ExpressionSet and relevant range-based notation). Jim will also contribute to ongoing project activities like new package reviews, package maintenance, and upcoming release activities.
> > > >
> > > > Jim brings a lot of interesting biological and software development experience to the project. Say hi when you have a chance!
> > > >
> > > > Martin
> > > > --
> > > > Computational Biology / Fred Hutchinson Cancer Research Center
> > > > 1100 Fairview Ave. N.
> > > > PO Box 19024 Seattle, WA 98109
> > > >
> > > > Location: Arnold Building M1 B861
> > > > Phone: (206) 667-2793