Re: [Rd] R with Array Hashes

2015-03-06 Thread Dirk Eddelbuettel

Jeff,

Nice writeup and promising idea.  From the gimme numbers department:

 - do you pass the R regression tests?

 - what sort of speedups do you see on which type of benchmarks?

When you asked about benchmark code on Twitter, I shared the somewhat
well-known (but no R ...) http://benchmarksgame.alioth.debian.org/
Did you write new benchmarks?  Did you try the ones once assembled by Simon?  

Dirk

-- 
http://dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] R with Array Hashes

2015-03-06 Thread Jeffrey Horner
Hi,

I wanted to share with the mailing list members here details about the
project I've been working on:

https://github.com/jeffreyhorner/R-Array-Hash

This is a re-implementation of R's hashed environments, the global
variable cache, the global string cache and symbol table with
cache-conscious array hash tables. The results are quite encouraging.
However, the implementation is a big departure from R's API:

An array hash is a cache-conscious data structure that takes
advantage of hardware prefetchers for improved performance on large
hash tables: those that fit in main memory but are larger than the
fast, fixed-size CPU caches.

However, their implementation is a radical departure from standard
chained hash tables. Rather than using chains of hash buckets for
collision resolution, array hashes use segments of contiguous memory
called dynamic arrays to store keys and values. Adding and deleting
items from the hash involves copying the entire segment to new areas in
memory. While this may seem wasteful and slow, it's surprisingly
efficient in both time and space.
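
To make the copy-on-insert idea concrete, here is a toy sketch in plain R. It is not the C implementation in R-Array-Hash: the function names (`ah_new`, `ah_insert`, `ah_lookup`) and the modulo hash are made up for illustration. Each slot holds one contiguous integer vector of interleaved key/value pairs, and adding an item builds a new, longer vector instead of linking in a new node.

```r
# Toy integer array hash: each slot is one contiguous vector laid out as
# key1, value1, key2, value2, ... (illustrative only; names are hypothetical).
ah_new <- function(nslots = 8L) {
  replicate(nslots, integer(0), simplify = FALSE)
}

# Hypothetical hash function: key modulo the number of slots (1-based).
ah_slot <- function(tbl, key) (key %% length(tbl)) + 1L

# Keys sit at the odd positions of a segment.
ah_keys <- function(seg) {
  seg[seq(1L, by = 2L, length.out = length(seg) %/% 2L)]
}

ah_insert <- function(tbl, key, value) {
  i <- ah_slot(tbl, key)
  seg <- tbl[[i]]
  hit <- match(key, ah_keys(seg))
  if (is.na(hit)) {
    # New key: copy the whole segment into a longer one.
    tbl[[i]] <- c(seg, key, value)
  } else {
    # Key already present: overwrite its value in place.
    seg[2L * hit] <- value
    tbl[[i]] <- seg
  }
  tbl
}

ah_lookup <- function(tbl, key) {
  seg <- tbl[[ah_slot(tbl, key)]]
  hit <- match(key, ah_keys(seg))
  if (is.na(hit)) NA_integer_ else seg[2L * hit]
}

tbl <- ah_new(4L)
tbl <- ah_insert(tbl, 1L, 10L)
tbl <- ah_insert(tbl, 5L, 50L)  # 5 %% 4 == 1 %% 4, so both keys share a slot
ah_lookup(tbl, 5L)
```

A lookup scans one contiguous vector instead of chasing pointers from node to node, which is the property the prefetcher can exploit.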

In R, hashed environments are implemented using lists with each list
element (a CONS cell) acting as the hash bucket. The CONS cell is the
binding agent for a symbol and value. Hashed environments are searched
using the pointer address of the symbol rather than the symbol's
printed name.
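
For readers less familiar with the user-level side, the following sketch shows a hashed environment in stock R. The binding names are arbitrary. `env.profile()` reports how many bindings sit in each chain of the current chained implementation; its output would presumably look different under a different table layout.

```r
# A hashed environment as seen from R code (stock R, chained buckets).
e <- new.env(hash = TRUE, size = 29L)
assign("alpha", 1, envir = e)
assign("beta", 2, envir = e)

get("alpha", envir = e)                       # look up a binding by name
exists("gamma", envir = e, inherits = FALSE)  # no such binding in e
sort(ls(e))

# Number of bindings per chain; the counts sum to the number of bindings.
prof <- env.profile(e)
sum(prof$counts)
```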

R-Array-Hash takes advantage of this by implementing an integer array
hash to store addresses of symbols and their associated values. Care
is also taken to account for whether or not a binding is locked,
active, etc.

Similarly, R-Array-Hash re-implements R's string cache using a string
array hash. This introduces the most radical change to R's API: CHAR()
no longer returns an address that points to the area at the end of the
SEXP (containing the string value). Rather it returns an address
located in one of the contiguous dynamic arrays of the string hash
table. Therefore, care must be taken in C code to use the address
immediately since additions and deletions to the string hash could
render the result of CHAR() useless. There are many areas of the code
that sidestep this by calling translateChar(), which has been changed
to always copy the string pointed to by CHAR().

Comments, constructive or otherwise, are welcome.

Best,

Jeff



Re: [Bioc-devel] requirement for named assays in SummarizedExperiment

2015-03-06 Thread Valerie Obenchain

Hi Aaron,

Thanks for catching this.

I favor enforcing names in 'assays'. Combining by position alone is too 
dangerous. I'm thinking of the VCF class where the genome information is 
stored in 'assays' and the fields are rarely in the same order.


Looks like we also need a more informative error message when names 
don't match.


> assays(se1)
List of length 1
names(1): counts1

> assays(se2)
List of length 1
names(1): counts2

> cbind(se1, se2)
Error in sQuote(accessorName) :
  argument "accessorName" is missing, with no default


Valerie


On 03/05/2015 11:09 PM, Aaron Lun wrote:

Dear all,

I stumbled upon some unexpected behaviour with cbind'ing
SummarizedExperiment objects with unnamed assays:


require(GenomicRanges)
nrows <- 5; ncols <- 4
counts <- matrix(runif(nrows * ncols, 1, 1e4), nrows)
rowData <- GRanges("chr1", IRanges(1:nrows, 1:nrows))
colData <- DataFrame(Treatment=1:ncols, row.names=LETTERS[1:ncols])
# counts is passed as a bare matrix, so the single assay has no name
sset <- SummarizedExperiment(counts, rowData=rowData, colData=colData)
sset

class: SummarizedExperiment
dim: 5 4
exptData(0):
assays(1): ''
rownames: NULL
rowData metadata column names(0):
colnames(4): A B C D
colData names(1): Treatment


cbind(sset, sset)

dim: 5 8
exptData(0):
assays(0):
rownames: NULL
rowData metadata column names(0):
colnames(8): A B ... C1 D1
colData names(1): Treatment

Upon cbind'ing, the assays in the SE object are lost. I think this is
due to the fact that the cbind code matches up assays by their names.
Thus, if there are no names, the code assumes that there are no assays.

I guess this could be prevented by enforcing naming of assays in the
SummarizedExperiment constructor. Or, the binding code could be modified
to work positionally when there are no assay names, e.g., by cbind'ing
the first assays across all SE objects, then the second assays, etc.
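
The two options can be sketched in base R on plain lists of matrices. `bind_assays` is a hypothetical helper, not part of GenomicRanges, and it ignores edge cases such as duplicated or partially missing names. It binds by position only when neither list has names, and otherwise matches by name and reports any names that do not line up.

```r
# Base-R sketch: combine two lists of assay matrices column-wise.
# `bind_assays` is a hypothetical helper, not the GenomicRanges code.
bind_assays <- function(a, b) {
  na <- names(a)
  nb <- names(b)
  unnamed <- function(n) is.null(n) || all(!nzchar(n))

  if (unnamed(na) && unnamed(nb)) {
    if (length(a) != length(b)) stop("unnamed assays differ in number")
    return(Map(cbind, a, b))  # positional fallback
  }

  if (!setequal(na, nb)) {
    bad <- setdiff(union(na, nb), intersect(na, nb))
    stop("assay names do not match: ", paste(sQuote(bad), collapse = ", "))
  }

  Map(cbind, a, b[na])  # match by name, keeping the order of `a`
}

m <- matrix(1:4, 2)
by_pos <- bind_assays(list(m), list(m))
by_name <- bind_assays(list(x = m, y = m), list(y = m, x = m))
dim(by_pos[[1]])
dim(by_name$x)
```

Calling it with `list(counts1 = m)` and `list(counts2 = m)` stops with a message that names both offending assays.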

Any thoughts?

Regards,

Aaron


sessionInfo()

R Under development (unstable) (2014-12-14 r67167)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
[1] GenomicRanges_1.19.42 GenomeInfoDb_1.3.13   IRanges_2.1.41
[4] S4Vectors_0.5.21  BiocGenerics_0.13.6

loaded via a namespace (and not attached):
[1] XVector_0.7.4




___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Rd] R with Array Hashes

2015-03-06 Thread Jeffrey Horner
On Fri, Mar 6, 2015 at 9:36 AM, Dirk Eddelbuettel <e...@debian.org> wrote:
>
> Jeff,
>
> Nice writeup and promising idea.  From the gimme numbers department:
>
>  - do you pass the R regression tests?

I made sure that the implementation passed 99% of the tests; however,
two gave differing results, which I think are related to traversing
hashed environments.


>  - what sort of speedups do you see on which type of benchmarks?

I wrote up some notes on the benchmark I conducted here:

https://github.com/jeffreyhorner/R-Array-Hash/tree/master/benchmarks

> When you asked about benchmark code on Twitter, I shared the somewhat
> well-known (but no R ...) http://benchmarksgame.alioth.debian.org/
> Did you write new benchmarks?  Did you try the ones once assembled by Simon?

I decided to design the benchmark very close to the one I found in:

Askitis, Nikolas, and Justin Zobel. "Redesigning the string hash
table, burst trie, and BST to exploit cache." Journal of Experimental
Algorithmics (JEA) 15 (2010): 1-7.

It's a synthetic benchmark that just measures aspects of constructing
and searching an R environment.
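
A benchmark of this general shape can be sketched in a few lines of R. This is an illustration only, not the code in the repository linked above: the key count, the key format and the use of `system.time()` are arbitrary choices.

```r
# Synthetic benchmark sketch: time building, then searching, an environment.
# n and the key format are arbitrary, not taken from the paper or the repo.
n <- 10000L
keys <- sprintf("key%06d", seq_len(n))

e <- new.env(hash = TRUE)

# Construction: one assignment per key.
t_build <- system.time(
  for (k in keys) assign(k, TRUE, envir = e)
)

# Search: one successful lookup per key.
t_search <- system.time(
  found <- vapply(keys, exists, logical(1), envir = e, inherits = FALSE)
)

c(build = t_build[["elapsed"]], search = t_search[["elapsed"]])
```

Timings from a run this small are noisy; a serious comparison would use far more keys and repeat each measurement.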


> Dirk
>
> --
> http://dirk.eddelbuettel.com | @eddelbuettel | e...@debian.org



[Rd] Hyper-dual numbers in R

2015-03-06 Thread Ravi Varadhan
Hi,
Has anyone in R core thought about providing hyper-dual numbers in R?
Hyper-dual (HD) numbers, invented by Jeffrey Fike at Stanford, are useful for
computing exact second-order derivatives (e.g., Hessian).  HD numbers extend
dual numbers, much as quaternions extend complex numbers, and have 4 parts
(one real and 3 non-real).  They seem to be available in Julia.
Obviously, the HD numbers involve a lot more bookkeeping.
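
As a rough illustration of the bookkeeping involved, here is a minimal sketch in plain R. The representation (a named numeric vector) and the function names `hd`, `hd_add` and `hd_mul` are assumptions made for this example, not an existing R API. A number is written a + b*e1 + c*e2 + d*e1e2 with e1^2 = e2^2 = 0; evaluating f at x + e1 + e2 leaves f'(x) in the e1 part and f''(x) in the e1e2 part.

```r
# Minimal hyper-dual arithmetic: a number is c(re, e1, e2, e12) with
# e1^2 = e2^2 = 0. All function names here are hypothetical.
hd <- function(re, e1 = 0, e2 = 0, e12 = 0) {
  c(re = re, e1 = e1, e2 = e2, e12 = e12)
}

hd_add <- function(x, y) x + y

hd_mul <- function(x, y) {
  hd(x[["re"]] * y[["re"]],
     x[["re"]] * y[["e1"]] + x[["e1"]] * y[["re"]],
     x[["re"]] * y[["e2"]] + x[["e2"]] * y[["re"]],
     x[["re"]] * y[["e12"]] + x[["e1"]] * y[["e2"]] +
       x[["e2"]] * y[["e1"]] + x[["e12"]] * y[["re"]])
}

# f(x) = x^3 + 2x, written with hyper-dual operations only.
f <- function(x) hd_add(hd_mul(hd_mul(x, x), x), hd_mul(hd(2), x))

# Seed both first-order parts with 1.
out <- f(hd(3, e1 = 1, e2 = 1))
out[["e1"]]   # f'(3)  = 3 * 3^2 + 2 = 29
out[["e12"]]  # f''(3) = 6 * 3 = 18
```

No step size is involved, so the derivatives carry no truncation or subtractive-cancellation error, unlike finite differences.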

http://adl.stanford.edu/hyperdual/

Thanks,
Ravi




Re: [Bioc-devel] New(ish!) Seattle Bioconductor team member

2015-03-06 Thread Jim Hester
I realized I sent this response the first time from the wrong email, so I
don't believe it made it to the mailing list.  Apologies if
you receive this twice.

With regard to using covr with RUnit tests: covr does not depend on any
particular testing framework; it simply runs whatever commands it finds in
tests/.  So, assuming `BiocGenerics:::testPackage()` (or whatever you have
in tests/) can find the tests to run from the package source root
directory, it should work.

The bigger issue seems to be supporting S4 classes as used by
Bioconductor.  I have some support for tracking coverage of S4 classes, but
Bioconductor packages leverage far more S4 features than I have tested so
far, so you are likely to run into breaking cases that I have not
encountered.  I would say S4 coverage is currently experimental at best.

Thank you for the warm welcome as well!

Jim

> I second the welcomings.  And I am quite interested in covr, but I wonder
> what we have to do to get it to work with the RUnit-based conventions that
> we've followed so far, with [basefolder]/tests and
> [basefolder]/inst/unitTests?
>
> On Wed, Mar 4, 2015 at 6:03 PM, Henrik Bengtsson <h...@biostat.ucsf.edu>
> wrote:
>
>> On Wed, Mar 4, 2015 at 2:29 PM, Michael Lawrence
>> <lawrence.mich...@gene.com> wrote:
>>> Welcome.
>>>
>>> For those who don't know, Jim is also the author of the neat lintr
>>> package, which checks your R code as you type, across multiple editors.
>>>
>>> https://github.com/jimhester/lintr
>>
>> Not to mention https://github.com/jimhester/covr - It only took me one
>> round of 'covr' to become a test-coverage-oholic.
>>
>> Jim, great to have you on board.
>>
>> /Henrik
>>
>>> Michael
>>>
>>> On Wed, Mar 4, 2015 at 2:20 PM, Martin Morgan <mtmor...@fredhutch.org>
>>> wrote:
>>>
>>>> Let me take this belated opportunity to introduce Jim Hester
>>>> <jhes...@fredhutch.org> to the Bioconductor developer community.
>>>>
>>>> Jim is working in the short term on SummarizedExperiment, including the
>>>> refactoring efforts he introduced yesterday as well as coercion methods
>>>> to and from ExpressionSet (an initial version from ExpressionSet to
>>>> SummarizedExperiment is available in the development version
>>>> GenomicRanges; iterations will include coercion in the reverse
>>>> direction as well as perhaps more 'clever' mapping between the probeset
>>>> or gene names of ExpressionSet and relevant range-based notation). Jim
>>>> will also contribute to ongoing project activities like new package
>>>> reviews, package maintenance, and upcoming release activities.
>>>>
>>>> Jim brings a lot of interesting biological and software development
>>>> experience to the project. Say hi when you have a chance!
>>>>
>>>> Martin
>>>> --
>>>> Computational Biology / Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N.
>>>> PO Box 19024 Seattle, WA 98109
>>>>
>>>> Location: Arnold Building M1 B861
>>>> Phone: (206) 667-2793