Re: [Rd] vector finalizers

2016-08-10 Thread luke-tierney

There is no plan to change R's garbage collector, and I did not say
there was. What I wrote is:

If R is built to use reference counting for determining sharing
information this does not happen, so this is likely to change and not
force a copy by 3.4.0.

So reference counting is to be used for determining sharing, _not_ for
memory management.

There is some work in progress to allow alternate representation for R
vectors that would for the most part behave like standard
vectors. There are however a lot of thorny issues: while it is nice if
passing such things to sum() or mean() behaves in the 'usual' way, it
is probably not so nice if passing to log() or to serialize() behaves
in the 'usual' way. We'll have to see over the next few months whether
these issues can be addressed in a reusable way.

Best,

luke

On Sat, 6 Aug 2016, frede...@ofb.net wrote:


Dear R Devel,

In a thread this morning Luke Tierney mentioned that R's way of
garbage collecting is going to change soon in 3.4.0. I couldn't find
this info on Google but I wanted to share what I had been discussing
in another forum, in case now is not too late to raise considerations
which could affect the design of planned changes to R's garbage
collection facilities.

I ran into a problem when trying to get R to quickly load some vectors
from disk. R should be able to do this efficiently using memory
mapping. There is a package 'ff' which implements efficient loading of
disk-based vectors using memory mapping. It works pretty well, but the
problem is that it creates a separate data type - the vectors are not
"native" R vectors. There are some wrapper functions in a package
'ffbase' which allow people to use common functions like 'sum' on
these 'ff' vectors. However, a new wrapper has to be written for every
such function, and I guess the 'ffbase' authors do not have time to
write wrappers that are as efficient as the native R functions - in my
testing, there was a 10x slow-down for 'sum'.

The situation is a bit wistful because an 'ff' vector and a native R
vector are basically the same data type: they both store elements
contiguously in memory. Apparently, what prevents 'ffbase' and 'ff'
from creating native R vectors is the fact that it is impossible to
assign a "finalizer" to a native R vector. We need a finalizer so that
R can tell us when a vector is being freed, so we can unmap the
associated memory/file. Ffbase maintainer Edwin de Jonge was even
skeptical that CRAN would accept a package implementing the hack I had
proposed to simulate native R vectors from mmap'ed 'ff' vectors. The
issue is discussed here:

https://github.com/edwindj/ffbase/issues/52

Of course, weak references and external pointers allow finalizers to
be assigned to objects, but as I understand it, such objects are
separate types from vectors - there is no way in R to synthesize a
native vector endowed with a finalizer - something which could be
passed directly to built-in functions like 'sum'.

I think a finalizer facility for vectors would be useful because it
would allow us to take advantage of the memory mapping architecture
present in all modern processors, to do fast copy-free operations on
large disk-based data structures, without having to re-implement
internal functions like 'sum' which are essentially the same algorithm
no matter where the data is stored.

Thank you,

Frederick



--
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:  319-335-3386
Department of Statistics and        Fax:    319-335-3017
   Actuarial Science
241 Schaeffer Hall                  email:  luke-tier...@uiowa.edu
Iowa City, IA 52242                 WWW:    http://www.stat.uiowa.edu

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] vector finalizers

2016-08-05 Thread frederik
Dear R Devel,

In a thread this morning Luke Tierney mentioned that R's way of
garbage collecting is going to change soon in 3.4.0. I couldn't find
this info on Google but I wanted to share what I had been discussing
in another forum, in case now is not too late to raise considerations
which could affect the design of planned changes to R's garbage
collection facilities.

I ran into a problem when trying to get R to quickly load some vectors
from disk. R should be able to do this efficiently using memory
mapping. There is a package 'ff' which implements efficient loading of
disk-based vectors using memory mapping. It works pretty well, but the
problem is that it creates a separate data type - the vectors are not
"native" R vectors. There are some wrapper functions in a package
'ffbase' which allow people to use common functions like 'sum' on
these 'ff' vectors. However, a new wrapper has to be written for every
such function, and I guess the 'ffbase' authors do not have time to
write wrappers that are as efficient as the native R functions - in my
testing, there was a 10x slow-down for 'sum'.
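
For readers unfamiliar with the idea, the following is a minimal sketch of summing a disk-based vector through a memory mapping. It is written in Python with the standard library only, as an analogy. It is not R code and not how 'ff' is implemented, and the helper names `write_doubles` and `mapped_sum` are hypothetical.

```python
import mmap
import os
import tempfile
from array import array


def write_doubles(values):
    """Write values to a temporary file as native-endian C doubles."""
    fd, path = tempfile.mkstemp(suffix=".bin")
    with os.fdopen(fd, "wb") as f:
        array("d", values).tofile(f)
    return path


def mapped_sum(path):
    """Sum the doubles in `path` without first copying the file contents
    into a separate buffer."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the OS pages data in on demand.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Reinterpret the mapped bytes as doubles. Both views are
            # released before the mapping is closed.
            with memoryview(mm) as raw, raw.cast("d") as doubles:
                return sum(doubles)


if __name__ == "__main__":
    path = write_doubles([1.0, 2.5, 4.0])
    try:
        print(mapped_sum(path))
    finally:
        os.remove(path)
```

The summing code is the ordinary built-in `sum`; only the way the data reaches memory differs, which is the point made about 'sum' above.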

The situation is a bit wistful because an 'ff' vector and a native R
vector are basically the same data type: they both store elements
contiguously in memory. Apparently, what prevents 'ffbase' and 'ff'
from creating native R vectors is the fact that it is impossible to
assign a "finalizer" to a native R vector. We need a finalizer so that
R can tell us when a vector is being freed, so we can unmap the
associated memory/file. Ffbase maintainer Edwin de Jonge was even
skeptical that CRAN would accept a package implementing the hack I had
proposed to simulate native R vectors from mmap'ed 'ff' vectors. The
issue is discussed here:

https://github.com/edwindj/ffbase/issues/52

Of course, weak references and external pointers allow finalizers to
be assigned to objects, but as I understand it, such objects are
separate types from vectors - there is no way in R to synthesize a
native vector endowed with a finalizer - something which could be
passed directly to built-in functions like 'sum'.
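
The limitation can be illustrated by analogy. In R, `reg.finalizer()` accepts environments and external pointers, not ordinary vectors. The sketch below shows the same pattern in Python, assuming a hypothetical wrapper class `MappedFile`: the finalizer is attached to a wrapper object that owns the mapping, so the mapping is released when the wrapper is collected, but the wrapper is a separate type rather than a native sequence.

```python
import gc
import mmap
import os
import tempfile
import weakref


class MappedFile:
    """Wrapper that owns a read-only mapping and unmaps it when collected."""

    def __init__(self, path):
        with open(path, "rb") as f:
            self._mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        # The callback refers to the mapping, not to `self`, so it does
        # not keep the wrapper alive.
        self._finalizer = weakref.finalize(self, self._mm.close)

    def __len__(self):
        return len(self._mm)


def demo():
    """Return True if dropping the wrapper closed the mapping."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"\x00" * 16)
    wrapper = MappedFile(path)
    mapping = wrapper._mm   # extra reference, kept only for inspection
    assert not mapping.closed
    del wrapper             # last reference to the wrapper goes away
    gc.collect()            # not needed in CPython, harmless elsewhere
    closed = mapping.closed
    os.remove(path)
    return closed
```

Any function that expects a native type still needs an adapter for `MappedFile`, which mirrors the wrapper problem described for 'ffbase'.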

I think a finalizer facility for vectors would be useful because it
would allow us to take advantage of the memory mapping architecture
present in all modern processors, to do fast copy-free operations on
large disk-based data structures, without having to re-implement
internal functions like 'sum' which are essentially the same algorithm
no matter where the data is stored.

Thank you,

Frederick
