Re: [R] how to change the ff properties of a ff-related R object after the original ff output folder has been moved

2015-06-26 Thread Jens Oehlschlägel

Tao,

I do assume that the ff-files are still at some location and not deleted 
by a finalizer. The following explains how to manipulate file locations 
with ff and ffdf objects.


Kind regards
jens

library(ff)
path1 <- "c:/tmp"
path2 <- "c:/tmp2"
# create ffdf,
# using non-standard path sets finalizer to 'close' instead of 'delete'
fdf1 <- as.ffdf(iris, col_args=list(pattern=file.path(path1, "iris")))
# let's copy the old metadata (but not the files, use clone for that)
# using ff's hybrid copying semantics
fdf2 <- fdf1
# note both are open
is.open(fdf1)
is.open(fdf2)
# close the files
close(fdf1)
# and note that
is.open(fdf1)
is.open(fdf2)
# the magic has kept physical metadata in sync even in the copy
# (virtual metadata is not kept in sync,
# which allows different virtual views into the same files,
# not unlike SQL VIEWs virtualize database TABLEs)

# filename on a ffdf
filename(fdf2)
# is a shortcut for
lapply(physical(fdf2), filename)
# so filename is a physical attribute
# actually moving the files can be done with the filename<- method
lapply(physical(fdf2), function(x) filename(x) <- sub(path1, path2, filename(x)))

# check this
filename(fdf1)
filename(fdf2)

# filename on ff
filename(fdf1$Species)
# is a shortcut for
attr(attr(fdf1$Species, "physical"), "filename")
# and if you directly manipulate this attribute
# you circumvent the filename<- method
# and the file itself will not be moved
attr(attr(fdf1$Species, "physical"), "filename") <- sub(path2, path1, filename(fdf1$Species))

# now the metadata points to a different location
filename(fdf1$Species)
# note that this physical attribute was also changed
# for the copy
filename(fdf2$Species)
# of course you can fix the erroneous metadata by
attr(attr(fdf1$Species, "physical"), "filename") <- sub(path1, path2, filename(fdf1$Species))

# or for all columns in a ffdf by
lapply(physical(fdf2), function(x) attr(attr(x, "physical"), "filename") <- sub(path2, path1, filename(x)))

# now we have your situation with broken metadata
open(fdf2)
# and can fix that by
lapply(physical(fdf2), function(x) attr(attr(x, "physical"), "filename") <- sub(path1, path2, filename(x)))

# check
open(fdf2)
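
One caveat when adapting the sub() calls above to other folder names: sub() treats its first argument as a regular expression, so a folder name containing characters such as '.' can match more than intended. The following is only a base-R sketch with made-up folder and file names (none of them come from the mail above); it shows the difference that fixed=TRUE makes when rewriting paths.

```r
# Hypothetical old and new locations of the ff files (illustration only)
old_dir <- "c:/tmp.v1"
new_dir <- "d:/data/ff"
old_files <- c("c:/tmp.v1/iris1.ff", "c:/tmpXv1/other.ff")

# Regex matching: '.' matches any character, so both paths are rewritten
regex_result <- sub(old_dir, new_dir, old_files)

# Literal matching: only the path that really starts with old_dir changes
fixed_result <- sub(old_dir, new_dir, old_files, fixed = TRUE)

print(regex_result)
print(fixed_result)
```

The vector produced with fixed=TRUE is the kind of value one would then assign to the filename attribute as shown above.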



On 26.06.2015 at 01:04, Shi, Tao wrote:

Hi all,

I'm new to the ff package, which I came to through the Bioconductor package 
crlmm.  Here is my problem:

I've created a few R objects (e.g. a CNSet) using crlmm based on my data and 
saved them in a .RData file.  crlmm heavily uses the ff package to store results 
in a local folder.  For certain reasons, I have moved the ff output folder to 
somewhere else.  Now when I go back to R, I can't open those objects (the CNSet, 
for example) anymore, as each has a property still storing the old ff output 
folder path.

My question is: is there a quick way to change these paths to the new one, so I 
don't have to re-run the analysis?

Many thanks!

Tao



__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] [R-pkgs] package bit64 with new functionality

2012-11-10 Thread Jens Oehlschlägel


Dear R community,

The new version of package 'bit64' - which extends R with fast 64-bit 
integers - now has fast (single-threaded) implementations of the most 
important univariate algorithmic operations (those based on hashing and 
sorting). Package 'bit64' now has methods for 'match', '%in%', 
'duplicated', 'unique', 'table', 'sort', 'order', 'rank', 'quantile', 
'median' and 'summary'. Regarding data management it has novel generics 
'unipos' (positions of the unique values), 'tiepos' (positions of ties), 
'keypos' (positions of values in a sorted unique table) and derived 
methods 'as.factor' and 'as.ordered'. This 64-bit functionality is 
implemented carefully to be not slower than the respective 32-bit 
operations in Base R and also to avoid excessive execution times 
observed with 'order', 'rank' and 'table' (speedup factors 20/16/200 
respectively). This increases the dataset size with which we can work truly 
interactively. The speed is achieved by simple heuristic optimizers: the 
mentioned high-level functions choose the best from multiple low-level 
algorithms and further take advantage of a novel optional caching 
method. In an example R session using a couple of these operations the 
64-bit integers performed 22x faster than base 32-bit integers, 
hash-caching improved this to 24x amortized, sortorder-caching was most 
efficient with 38x (caching both hashing and sorting is not worth it 
with 32x at doubled RAM consumption).
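
To make the list of methods above more concrete, here is a small sketch of a few of them applied to 64-bit keys. It assumes package 'bit64' is installed; the key values are invented for illustration and are not taken from the benchmark session mentioned above.

```r
library(bit64)

# Keys beyond the 32-bit integer range, e.g. imported from a database
# (the values are made up for illustration)
keys <- as.integer64(c("9007199254740993", "5", "9007199254740993", "7"))

u <- unique(keys)                     # distinct keys
d <- duplicated(keys)                 # TRUE for the repeated key
m <- match(as.integer64("7"), keys)   # position of a key in 'keys'
s <- sort(keys)                       # 64-bit sort

print(u)
print(d)
print(m)
print(s)
```

Passing the large key as a character string avoids rounding it through a double before it becomes a 64-bit integer.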


Since the package covers the most important functions for (univariate) 
data exploration and data management, I think it is now appropriate to 
claim that R has sound 64-bit integer support, for example for working 
with keys or counts imported from large databases. For details 
concerning approach, implementation and roadmap please check the 
ANNOUNCEMENT-0.9-Details.txt file and the package help files.


Kind regards


Jens Oehlschlägel
Munich, 8.11.2012

___
R-packages mailing list
r-packa...@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-packages



Re: [R] [R-sig-hpc] Quickest way to make a large empty file on disk?

2012-05-03 Thread Jens Oehlschlägel

   Jonathan,
   On some filesystems (e.g. NTFS, see below) it is possible to create 'sparse'
   memory-mapped files, i.e. reserving the space without the cost of actually
   writing initial values.
   Package 'ff' does this automatically and also allows accessing the file in
   parallel. Check the example below and see how creating a big file is
   immediate.
   Jens Oehlschlägel
library(ff)
library(snowfall)
ncpus <- 2
n <- 1e8
system.time(
  x <- ff(vmode="double", length=n, filename="c:/Temp/x.ff")
)
   user  system elapsed
   0.01    0.00    0.02
# check finalizer, with an explicit filename we should have a 'close' finalizer
finalizer(x)
   [1] "close"
# if not, set it to 'close' in order to not let slaves delete x on slave shutdown
finalizer(x) <- "close"
sfInit(parallel=TRUE, cpus=ncpus, type="SOCK")
   R Version:  R version 2.15.0 (2012-03-30)
   snowfall 1.84 initialized (using snow 0.3-9): parallel execution on 2 CPUs.
sfLibrary(ff)
   Library ff loaded.
   Library ff loaded in cluster.
   Warning message:
   In library(package = "ff", character.only = TRUE, pos = 2, warn.conflicts = TRUE,  :
     'keep.source' is deprecated and will be ignored
sfExport("x") # note: do not export the same ff multiple times
# explicitly opening avoids a gc problem;
# opening with 'mmeachflush' instead of 'mmnoflush' is a bit slower but
# prevents OS write storms when the file is larger than RAM
sfClusterEval(open(x, caching="mmeachflush"))
   [[1]]
   [1] TRUE
   [[2]]
   [1] TRUE
system.time(
  sfLapply( chunk(x, length=ncpus), function(i){
    x[i] <- runif(sum(i))
    invisible()
  })
)
   user  system elapsed
   0.00    0.00   30.78
system.time(
  s <- sfLapply( chunk(x, length=ncpus), function(i) quantile(x[i], c(0.05, 0.95)) )
)
   user  system elapsed
   0.00    0.00    4.38
# for completeness
sfClusterEval(close(x))
   [[1]]
   [1] TRUE
   [[2]]
   [1] TRUE
csummary(s)
                5%  95%
   Min.    0.04998 0.95
   1st Qu. 0.04999 0.95
   Median  0.05001 0.95
   Mean    0.05001 0.95
   3rd Qu. 0.05002 0.95
   Max.    0.05003 0.95
# stop slaves
sfStop()
   Stopping cluster
# with the close finalizer we are responsible for deleting the file
# explicitly (unless we want to keep it)
delete(x)
   [1] TRUE
# remove r-side metadata
rm(x)
# truly free memory
gc()
   Sent: Thursday, 03 May 2012 at 00:23
   From: Jonathan Greenberg j...@illinois.edu
   To: r-help r-help@r-project.org, r-sig-...@r-project.org
   Subject: [R-sig-hpc] Quickest way to make a large empty file on disk?
   R-helpers:
   What would be the absolute fastest way to make a large empty file (e.g.
   filled with all zeroes) on disk, given a byte size and a given number
   of empty values. I know I can use writeBin, but the object in
   this case may be far too large to store in main memory. I'm asking because
   I'm going to use this file in conjunction with mmap to do parallel writes
   to this file. Say, I want to create a blank file of 10,000 floating point
   numbers.
   Thanks!
   --j
   --
   Jonathan A. Greenberg, PhD
   Assistant Professor
   Department of Geography and Geographic Information Science
   University of Illinois at Urbana-Champaign
   607 South Mathews Avenue, MC 150
   Urbana, IL 61801
   Phone: 415-763-5476
   AIM: jgrn307, MSN: jgrn...@hotmail.com, Gchat: jgrn307, Skype: jgrn3007
   [1]http://www.geog.illinois.edu/people/JonathanGreenberg.html
   ___
   R-sig-hpc mailing list
   r-sig-...@r-project.org
   [2]https://stat.ethz.ch/mailman/listinfo/r-sig-hpc

References

   1. http://www.geog.illinois.edu/people/JonathanGreenberg.html
   2. https://stat.ethz.ch/mailman/listinfo/r-sig-hpc


Re: [R] ff objects saving problem

2010-11-10 Thread Jens Oehlschlägel
Xiaobo,

You indeed need external 'zip' and 'unzip' utilities in the path, citing from 
ffsave's help: "using an external zip utility, e.g. for windows in Rtools on 
http://www.murdoch-sutherland.com/Rtools/".

Please note that the mentioned utilities have a 4 GB limit for the zip file, 
AFAIK. For the next release I will look for a way to get rid of this limit and 
also of the inconsistencies in upper/lower-case spelling of drive 
letters, which can cause ffsave to fail. Note that - even without ffsave - ff 
objects can be made permanent simply by creating them with 'filename' resp. 
'pattern' outside of fftempdir and saving the R-side ff-object with the usual 
'save' or 'save.image' function. 

In a new R session, after 'library(ff)' and 'load' you again have access, 
provided your ff files are still in the same location.
And yes, each column of a ffdf dataframe is stored as a separate ff file. 
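
The route without ffsave described above can be sketched roughly as follows. This is only an illustration: the folder and file names are made up, a subfolder of tempdir() stands in for a real permanent location, and the finalizer is set to "close" explicitly so that the sketch does not depend on how ff classifies that folder.

```r
library(ff)

# Stand-in for a permanent data folder (hypothetical name)
datadir <- file.path(tempdir(), "ffdata_demo")
dir.create(datadir, showWarnings = FALSE)

# Create the ff vector with an explicit filename; finalizer "close" means
# the file is closed, not deleted, when the R-side object goes away
x <- ff(1:10, filename = file.path(datadir, "x.ff"), finalizer = "close")
close(x)

# Save only the R-side metadata with the usual save()
meta <- file.path(datadir, "x.RData")
save(x, file = meta)
rm(x)

# Later (normally in a new R session, after library(ff)): load and reopen
load(meta)
open(x)
first3 <- x[1:3]
print(first3)

# Clean up the demo file explicitly, as the 'close' finalizer never deletes it
delete(x)
rm(x)
```

If the data file is moved after the save(), the loaded metadata still points to the old location and has to be corrected before open() can succeed.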

Jens Oehlschlägel

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] [R-pkgs] ff version 2.2.0

2010-10-01 Thread Jens Oehlschlägel
Dear R community,

The next release of package ff is available on CRAN. With kind help of Brian 
Ripley it now supports the Win64 and Sun versions of R. It has three major 
functional enhancements:

a) new fast in-memory sorting and ordering functions (single-threaded)
b) ff now supports on-disk sorting and ordering of ff vectors and ffdf 
dataframes
c) ff integer vectors now can be used as subscripts of ff vectors and ffdf 
dataframes

a) is achieved by careful implementation of NA-handling and exploiting context 
information
b) although permanently stored, sorting and ordering of ff objects can be 
faster than the standard routines in R
c) applying an order to ff vectors and ffdf dataframes is substantially slower 
than in pure R because it involves disk-access AND sorting index positions (to 
avoid random access). 
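
As a small illustration of point b), the sketch below sorts and orders an ff vector with 'ffsort' and 'fforder' (both named in the NEWS further down). The data are random and tiny, chosen only so the result can be checked in RAM; this is not the benchmark setup.

```r
library(ff)

set.seed(42)
x <- ff(sample(1000L))      # integer ff vector backed by a file on disk

xs <- ffsort(x)             # sorted copy, returned as a new ff vector
o  <- fforder(x)            # ordering permutation, also an ff vector

# Pull both into RAM (cheap at this size) to compare with base R
ram_sorted <- xs[]
ram_order  <- o[]
print(head(ram_sorted))
print(head(x[][ram_order]))
```

For data that really exceed RAM one would keep working with the ff results instead of pulling them into memory with [].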

There is still room for improvement; however, the current status should already 
be useful. I ran some comparisons with SAS (see end of mail): 
- both could sort German census size (81e6 rows) on a 3GB notebook
- ff sorts and orders faster on single columns
- sorting big multicolumn-tables is faster in SAS

Win64 binaries and version 2.2.1 supporting Sun should appear during the next 
days on CRAN. For the impatient: checkout from r-forge with revision 67 or 
higher.
Non-Windows users: please note that you need to set appropriate values for 
options 'ffbatchbytes' and 'ffmaxbytes' yourself.
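
Setting the two options is an ordinary options() call, typically issued after library(ff). The byte values below are assumptions picked for illustration, not recommendations from this announcement.

```r
# Assumed budgets for illustration; choose values that suit the machine
options(
  ffbatchbytes = 16 * 2^20,    # about 16 MB per chunk in chunked processing
  ffmaxbytes   = 512 * 2^20    # about 512 MB for single-process jobs like sorting
)

print(getOption("ffbatchbytes"))
print(getOption("ffmaxbytes"))
```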

Note that virtual window support is deprecated now because it leads to overly 
complex code. Let us know if you urgently need this and why.

Feedback, ideas and contributions appreciated. To those who offered code during 
the last months: please forgive us that integrating and documenting was not 
possible with this release. 


Jens & Daniel



P.S. NEWS


CHANGES IN ff VERSION 2.2.0


NEW FEATURES

o   ff now supports the 64 bit Windows and Sun versions of R
    (thanks to Brian Ripley)
o   ff now supports sorting and ordering of ff vectors and dataframes
    (see ramsort, ffsort, ffdfsort, ramorder, fforder, ffdforder)
o   ff now supports ff vectors as subscripts of ff objects
    (currently positive integers only, booleans are planned)
o   New option 'ffmaxbytes' which allows certain ff procedures like sorting
    to use a larger limit of RAM than 'ffbatchbytes' in chunked processing.
    Such a higher limit is useful for (single-R-process) sorting compared to
    some multi-R-process chunked processing. It is a good idea to reduce
    'ffmaxbytes' on slaves or avoid ff sorting there completely.
o   New generic 'pagesize' with method 'pagesize.ff' which returns the
    current pagesize as defined on opening the ff object.


USER VISIBLE CHANGES

o   [.ff now returns with the same vmode as the ff-object
o   Certain operations are faster now because we worked around
    unnecessary copying triggered by many of R's assignment functions.
    For example reading a factor from a (well-cached) file is now 20%
    faster and thus as fast as just creating this factor in-RAM using
    levels()<- and class()<- assignments.
    (consider this tuning temporary, hoping for a generic fix in base R)
o   ff() can now open files larger than .Machine$integer.max elements
    (but gives access only to the first .Machine$integer.max elements)
o   ff now has default pattern NULL translating to the pattern in 'filename'
    (and only to the previous default 'ff' if no filename is given)
o   ff now sets the pattern in sync with a requested 'filename'
o   clone.ff now always creates a file consistent with the previous pattern
o   clone.ff now always creates a finalizer consistent with the file
    location
o   clone.ffdf has a new argument 'nrow' which allows creating an empty
    copy with a different number of rows (currently requires 'initdata=NULL')
o   clone.default now deep-copies lists and atomic vectors


DEPRECATED

o   virtual window support is deprecated. Let us know if you urgently need
    this and why.


BUG FIXES

o   read.table.ffdf now also works if transFUN filters and returns fewer rows


BUG FIXES at 2.1.4

o   [<-.ffdf no longer calculates the number of elements in an ffdf,
    which could lead to an integer overflow


BUG FIXES at 2.1.3

o   ffsave now always closes ffdf objects - also partially closed ones
o   ffsave no longer passes arguments 'add' and 'move' to 'save'
o   ffsave and friends now work around the fact that under windows getwd()
    can report the same path in upper and lower case versions.



CHANGES IN bit VERSION 1.1.5


NEW FEATURES

o   new utility functions setattr() and setattributes() allow setting
    attributes by reference
    (unlike attr()<- and attributes()<-, without copying the object)
o   new utility unattr() returns copy of input with attributes removed


USER VISIBLE CHANGES

o certain 

Re: [R] Pass By Value Questions

2010-08-20 Thread Jens Oehlschlägel
Jeff, 
R has 'environments' as a general mechanism to pass around objects by 
reference. However, that does not help with most functions like 'apply' which 
take arguments other than environments. 
> I'm familiar with FF and BigMemory, but are there any packages/tricks which 
> allow for passing such objects by reference without having to code in C? 

With ff (and I assume with bigmemory as well) you can pass around objects by 
reference without C-coding. To be more precise with regard to ff: atomic ff 
objects have 'hybrid copying semantics', which means that two references to an 
ff object will share the data and SOME features (like the 'length') while OTHER 
features (like 'dim') are copied on modify (see 'vt' for a powerful 
application of this concept). You might want to have a look at 'ffapply' and 
friends and at 'chunk'.
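
For the plain-R part of the answer, a minimal sketch of passing by reference with an environment, contrasted with the usual copy-on-modify behaviour of a list. The helper names 'increment' and 'increment_list' are made up for this illustration.

```r
# A counter kept in an environment
counter <- new.env()
counter$n <- 0L

increment <- function(env) {
  env$n <- env$n + 1L   # modifies the caller's environment, no copy is made
  invisible(env)
}

increment(counter)
increment(counter)
print(counter$n)        # the two increments are visible to the caller

# Contrast: a list is copied on modify, so the caller's object is unchanged
lst <- list(n = 0L)
increment_list <- function(l) {
  l$n <- l$n + 1L
  l
}
invisible(increment_list(lst))
print(lst$n)
```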

HTH

Jens Oehlschlägel



[R] ff objects and ordinary analytical functions.

2010-08-02 Thread Jens Oehlschlägel
Xiaobo.Gu,
> Can the plenty of analytical functions provided by base R and contributed 
> packages be called with ff objects as parameters directly, or do we have to 
> write special versions of the functions for ff objects? If it is the latter 
> case, is there a list of functions which support ff objects already?
> Xiaobo.Gu

ff is an add-on package that allows you to store and access larger datasets - 
it's not part of the language. 
ff objects have different copy semantics than standard R objects (partially by 
reference) so it is unlikely that you can write R code that uses ff objects 
exactly the same way as standard R objects. 

There is no comprehensive list, but some functions allow ff objects, e.g. 
'biglars' which you find if you look at the reverse-dependencies of ff on CRAN. 

Other functions are prepared to handle large datasets in chunks - like 'biglm' 
- and it is your responsibility to extract those chunks from ff, a database or 
whatever other source. 
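
What "extracting chunks yourself" amounts to can be sketched in base R alone. Here an ordinary numeric vector stands in for the large data source; with ff, the line that reads 'block' would fetch one chunk from disk. The chunk size and variable names are assumptions for illustration.

```r
x <- as.numeric(1:1000)   # stand-in for data too large to process at once
chunk_size <- 100
starts <- seq(1, length(x), by = chunk_size)

total <- 0
n <- 0
for (s in starts) {
  idx <- s:min(s + chunk_size - 1, length(x))
  block <- x[idx]                 # only one chunk is held at a time
  total <- total + sum(block)     # accumulate sufficient statistics
  n <- n + length(block)
}

chunked_mean <- total / n
print(chunked_mean)
```

Chunk-aware functions such as 'biglm' follow the same pattern, except that each chunk updates a model fit instead of a running sum.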

HTH


Jens Oehlschlägel



Re: [R] Can saved R object .RData files be loaded by more than one R sessions for read only purpose?

2010-08-02 Thread Jens Oehlschlägel
Xiaobo.Gu,

Shared reading should be fine. 
Shared writing is also possible, but it is important to understand that .RData 
files contain only the meta-data of ff objects, not the ff data itself. 

This means you cannot have multiple processes updating the same .RData metadata,
but you can have multiple processes writing simultaneously to the same ff 
datafile.

(It is your responsibility to avoid conflicts and to make sure you do not 
suffer problems with delayed cache refreshes, as can happen on network drives.) 

HTH

Jens Oehlschlägel


Re: [R] How to deal with more than 6GB dataset using R?

2010-07-28 Thread Jens Oehlschlägel
Matthew,

You might want to look at function read.table.ffdf in the ff package, which can 
read large csv files in chunks and store the result in a binary format on disk 
that can be quickly accessed from R. ff allows you to access complete columns 
(returned as a vector or array) or subsets of the data identified by 
row-positions (and column selection, returned as a data.frame). As Jim pointed 
out: it all depends on what you are going to do with the data. If you want to 
access subsets not by row-position but rather by search conditions, you are 
better off with an indexed database. 

Please let me know if you write a fast read.fwf.ffdf - we would be happy to 
include it into the ff package.


Jens



Re: [R] Assign Formulas to Arrays or Matrices?

2010-07-06 Thread Jens Oehlschlägel
For assigning formulas to arrays, use an array of mode list:

nr   form.arr[[31,5]]y ~ 1 + 2
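
A minimal base-R sketch of that suggestion, using the 31 x 5 dimensions and the formula y ~ 1 + 2 from the question; how the array is constructed here is an assumption and not necessarily the code originally posted.

```r
# A 31 x 5 array whose cells are list elements, so each can hold a formula
form.arr <- array(vector("list", 31 * 5), dim = c(31, 5))

for (i in 1:31) {
  for (j in 1:5) {
    form.arr[[i, j]] <- y ~ 1 + 2   # [[ ]] assigns one whole object per cell
  }
}

print(form.arr[[31, 5]])
print(class(form.arr[[31, 5]]))
```

The double brackets matter: single-bracket assignment tries to spread the three components of the formula over the cells, which is what produced the "number of items to replace" error in the question.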
Jens Oehlschlägel


-Original Message-
From: McLovin  
Sent: Jul 6, 2010 9:13:49 AM
To: r-help@r-project.org
Subject: [R] Assign Formulas to Arrays or Matrices?


Hi,

I am very new to R.  I am hoping to create formulas and assign them to
locations within an array (or matrix, if it will work).

Here's a simplified example of what I'm trying to do:

form.arr  for (i in seq(from=1, to=31, by=1)) {
   for (j in seq(from=1, to=5, by=1)) {
   form.arr[i,j,]  }
}

which results in this error:
Error in form.arr[i, j, ]incorrect number of subscripts

The reason I had made the 3rd dimension of the array size 3 is because
that's the length R tells me that formula is.

When I had tried to do this using a matrix, using this code:

form.mat  for (i in seq(from=1, to=31, by=1)) {
   for (j in seq(from=1, to=5, by=1)) {
   form.mat[i,j] = as.formula(y~1+2)
   }
}

I was told:

Error in form.mat[i, j] = as.formula(y ~ 1 + 2) : 
  number of items to replace is not a multiple of replacement length

My question is: is it possible to assign formulas within a matrix or array? 
If so, how?  thanks@real.com
-- 
View this message in context: 
http://r.789695.n4.nabble.com/Assign-Formulas-to-Arrays-or-Matrices-tp2279136p2279136.html
Sent from the R help mailing list archive at Nabble.com.



Re: [R] memory management in R

2010-06-16 Thread Jens Oehlschlägel
You might want to mention/talk about packages that enhance R's ability to work 
with less RAM / more data, such as package SOAR (transparently moving objects 
between RAM and disk) and ff (which allows vectors and dataframes larger than 
RAM and which supports dense datatypes like true boolean, short integers etc.). 

Jens Oehlschlägel



-Original Message-
From: john mull...@fastmail.fm
Sent: Jun 16, 2010 12:20:17 PM
To: r-help@r-project.org
Subject: [R] memory management in R



I have volunteered to give a short talk on memory management in R 
   to my local R user group, mainly to motivate myself to learn about it. 

The focus will be on what a typical R coder might want to know  ( e.g. how
objects are created, call by value, basics of garbage collection ) but I
want to go a little deeper just in case there are some advanced users in the
crowd. 

Here are the resources I am using right now
  Chambers book Software for Data Analysis 
  Manuals such as R Internals and Writing R Extensions 

Any suggestions on other sources of information? 

There are still some things that are not clear to me, such as
  - how to make sense of the output from various memory diagnostics such as 
memory.profile ... are these counts? 
How to get the amount of memory used: gc() and memory.size() seem to
differ
 -  what gets allocated on the heap versus stack
 - why the name cons cells for the stack allocation 

Any help with these would be greatly appreciated. 

Thanks greatly, 

John Muller



Re: [R] how to read CSV file in R?

2010-06-09 Thread Jens Oehlschlägel
If you have memory problems reading csv you can use read.csv.ffdf from package 
ff which reads in chunks.
The result is a ffdf object, say myffdf, from which binary subscripting [,] 
returns standard data.frames, such as 
myffdf[,]  # returns all data (if it fits into memory)
myffdf[somerows,] # returns a subset of data

Do read and understand the help concerning filename location and implications 
for finalizers and permanency.
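
A small sketch of the chunked read described above. It assumes package 'ff' is installed; the CSV is a tiny, purely numeric temporary file invented for the illustration, and the chunk sizes passed as 'first.rows' and 'next.rows' are arbitrary.

```r
library(ff)

# Write a small numeric CSV to a temporary file (stand-in for a large file)
csvfile <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:100, b = (1:100) / 2), csvfile, row.names = FALSE)

# Read in chunks: 20 rows first, then 40 rows per further chunk
myffdf <- read.csv.ffdf(file = csvfile, first.rows = 20, next.rows = 40)

print(dim(myffdf))         # rows and columns of the ffdf
sub_df <- myffdf[1:5, ]    # a subset comes back as an ordinary data.frame
print(class(sub_df))
print(sub_df)
```

Because no 'filename' or 'pattern' is given here, the ff files are temporary; for data that should outlive the session, the help pages on file location and finalizers mentioned above apply.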

Cheers
Jens Oehlschlägel


-Original Message-
From: Joris Meys jorism...@gmail.com
Sent: Jun 8, 2010 1:11:20 PM
To: dhanush dhana...@gmail.com
Subject: Re: [R] how to read CSV file in R?

That will be R 2.10.1 if I'm correct.

For reading in csv files, there's a function read.csv which does just that:
los <- read.csv("file.csv", header=TRUE)

But that is just a detail. You have problems with your memory, but
that's not caused by the size of your dataframe. On my system, a
matrix with 100,000 rows and 75 columns takes only 28 Mb. So I guess
your workspace is cluttered with other stuff.

Check following help pages :
?Memory
?memory.size
?"Memory-limits"

It generally doesn't make a difference, but sometimes using gc() can
set some memory free again.

If none of this information helps, please provide us with a bit more
info regarding your system and the content of your current workspace.

Cheers
Joris

On Tue, Jun 8, 2010 at 8:46 AM, dhanush dhana...@gmail.com wrote:

 I tried to read a CSV file in R. The file has about 100,000 records and 75
 columns. When used read.delim, I got this error. I am using R ver 10.1.

 los <- read.delim("file.csv", header=T, sep=",")
 Warning message:
 In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
  Reached total allocation of 1535Mb: see help(memory.size)

 Thanks
 --
 View this message in context: 
 http://r.789695.n4.nabble.com/how-to-read-CSV-file-in-R-tp2246930p2246930.html
 Sent from the R help mailing list archive at Nabble.com.





-- 
Joris Meys
Statistical consultant

Ghent University
Faculty of Bioscience Engineering
Department of Applied mathematics, biometrics and process control

tel : +32 9 264 59 87
joris.m...@ugent.be
---
Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php



Re: [R] appending objects to file created with save()

2010-05-25 Thread Jens Oehlschlägel
If you work with large data you might want to look at the ff package - useful 
if your data is close to or above your RAM. The package has ffsave, where with 
option add=TRUE you can add data to an existing ff archive. With ff data is 
stored outside of R in files, only meta-data is stored within R. An ff archive 
consists of two files, a zip file which stores the data (and to which you can 
add) and a standard .RData file which stores the meta-data using standard 
save(). 
HTH
Jens

-Original Message-
From: Jannis bt_jan...@yahoo.de
Sent: May 25, 2010 8:22:26 PM
To: r-help@r-project.org
Subject: [R] appending objects to file created with save()

Dears,


is there a way to append R objects, similar to the function save(), to a binary 
file that already contains some previously saved R objects?

I browsed the mailing list archive and only found some suggestions that 
include reading in the old file first and then saving the new objects together 
with the old ones. This would not be handy for me as my data is rather large.

I have tried dump() but this does not seem to compress my data.


Cheers
Jannis





Re: [R] ff for 64-bit windows and 64-bit R

2010-05-14 Thread Jens Oehlschlägel
Lawrence,

My understanding is that only a minor change is needed in ff's C++ layer in 
order to remove the 64bit compiler warnings/errors. 
The C++ layer is maintained by Daniel Adler, who can give you an outlook 
if/when he plans to attack this. 

Until a 64bit version of ff is available, you might consider using the 32bit 
win version of R and ff on a 64bit win machine: while 32bit R itself has 
limited memory access, ff can handle larger objects faster because it benefits 
from *all* RAM via filesystem-caching. 

Jens



-
From: Hunsicker, Lawrence lawrence-hunsic...@uiowa.edu
Sent: May 13, 2010 3:32:25 PM
To: jens.oehlschlae...@truecluster.com
Subject: ff for 64-bit windows and 64-bit R


Jens:
I am running R on a 64 bit PC, 64 bit Windows 7, and 64 bit R.  I have to 
handle rather large data sets, and I need the 64 bit environment to run some of 
my analyses.  Use of ff has been recommended to me to help with some of the 
memory problems, but I am told that ff has not yet been ported to a 64-bit 
Windows environment.  I have access, of course, to the native code, but I am 
not the world’s best compiler operator.  Do you have any plans to port ff to a 
64-bit Windows and R environment?  Is there anything that I can do to encourage 
this?  I would be happy to make a contribution to the “ff project” if such a 
thing exists.
Let me know your plans.
[L. G. Hunsicker, M.D.]
Professor, Internal Medicine
U. Iowa College of Medicine
Phone:  (319) 356-4763
Fax:  (319) 356-7488
lawrence-hunsic...@uiowa.edu


Re: [R] how to work with big matrices and the ff-package?

2010-04-15 Thread Jens Oehlschlägel
Anne,

 

 After the above step I need to convert my ff_matrix to a data.frame to 
 discretize the whole matrix and calculate the mutual information.

 The calculated result should be saved as an ffdf-object or something similar.
 disc <- as.ffdf(discretize(as.data.frame(as.ffdf(ffmat)), disc="equalwidth", 
 nbins=5))

 

ffdf are ff's equivalent to data.frames: they handle many rows (2^31-1) and a 
limited number of columns (with potentially different column types). Like 
data.frames, they are not suitable for millions of columns. 
You probably want to store your data in one big ff matrix.



If you use ff objects because you don't have the RAM for standard R objects, 
converting ff to a data.frame is not an option because it will require too much 
RAM.

If 'discretize' expects a data.frame, you cannot call it on an ff matrix 
either. But if 'discretize' works on single columns, you can call discretize on 
chunks of columns that you coerce to data.frames.

 

something like

for (i in chunk(from=1, to=ncol(ffmat), by=10))
  ffmat[,i] <- as.matrix(discretize(as.data.frame(ffmat[,i])))

 

If discretize returns integers, you might want to write the results rather to 
an integer ff matrix because this saves disk space and improves caching.
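
A minimal sketch of the chunked approach with a separate integer result matrix. It assumes that equal-width binning with base R's cut() is an acceptable stand-in for 'discretize' (whose exact arguments are not shown in this thread); the names ffmat_int and by_cols and the toy dimensions are made up for illustration.

```r
library(ff)

set.seed(1)
# Toy stand-in for the real data: 6 rows, 10 columns of doubles on disk
ffmat <- ff(rnorm(60), dim = c(6, 10))

# Integer result matrix of the same shape (4 bytes per cell instead of 8)
ffmat_int <- ff(vmode = "integer", dim = dim(ffmat))

by_cols <- 4  # number of columns processed per chunk (made-up value)
for (from in seq(1, ncol(ffmat), by = by_cols)) {
  cols  <- from:min(from + by_cols - 1, ncol(ffmat))
  block <- ffmat[, cols, drop = FALSE]      # only this chunk is in RAM
  # Assumed stand-in for discretize(): 5 equal-width bins per column
  ffmat_int[, cols] <- apply(block, 2,
                             function(v) as.integer(cut(v, breaks = 5)))
}
```

Writing to a second ff matrix rather than overwriting ffmat keeps the original doubles intact, at the cost of extra disk space.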

 

HTH

Jens Oehlschlägel

 

 

 

 



Re: [R] as.ffdf.data.frame now breaks if using pattern

2010-04-07 Thread Jens Oehlschlägel
Ramon,
for me this works

> setwd("d:/tmp")
> ffd <- as.ffdf(d, col_args=list(pattern = paste(getwd(), "/fftmp", sep = "")))
> filename(ffd)
$x
[1] "d:/tmp/fftmp35c34861.ff"

$y
[1] "d:/tmp/fftmp5be946bb.ff"

$z
[1] "d:/tmp/fftmp26c49ce.ff"

Jens


-Original Message-
From: Ramon Diaz-Uriarte rdia...@gmail.com
Sent: Apr 7, 2010 7:01:23 PM
To: r-help@r-project.org
Subject: as.ffdf.data.frame now breaks if using pattern

Dear All,

I am using package ff. In version 2.1-1 it was possible to use
pattern with as.ffdf.data.frame:

d <- data.frame(x=1:26, y=letters, z=Sys.time()+1:26)
as.ffdf(d, pattern = paste(getwd(), "/fftmp", sep = ""))

With the latest version, the last command crashes. I wonder if the new
behavior is intentional or a bug. If intentional, what is the
recommended way of using pattern now?

Thanks,

R.

-- 
Ramon Diaz-Uriarte
Structural Biology and Biocomputing Programme
Spanish National Cancer Centre (CNIO)
http://ligarto.org/rdiaz
Phone: +34-91-732-8000 ext. 3019
Fax: +-34-91-224-6972



Re: [R] ff package: ff objects don't reload completely on NFS drives from a different machine

2010-01-25 Thread Jens Oehlschlägel
Try to close the file on the first NFS client before reopening it on the second 
NFS client. NFS has something called close-to-open cache consistency.
This means that two clients which have the same NFS file open cannot rely on 
seeing the updates from the respective other client. If one client closes, and 
the other client opens thereafter, it should see the changes. If you want 
multiple clients to write at the same time, you should make sure they only 
write non-overlapping sections (and then all need to close for syncing). Let 
me know if this worked for you. 
J.



[R] [R-pkgs] New version of package ff

2010-01-24 Thread Jens Oehlschlägel
Dear R community,

Package bit version 1.1-3 and ff version 2.1.2 are available on CRAN and should 
be useful to handle large datasets.

It adds convenient utilities for managing ff objects and files (see ?ffsave) 
and removes some performance bottlenecks. 

In case you experience unexpected performance problems with ff, here are a 
few recommendations based on FAQs:

1) Compare the size of data to be written at the same time to available RAM for 
your filesystem cache. 
   If the data exceeds available RAM, then consider using caching="mmeachflush" 
instead of caching="mmnoflush"; this will make write operations predictably 
slower but prevent write storms stalling some systems (observed under NTFS 
win32+64).
   You can set ff's caching option 
   either with options(ffcaching="mmeachflush") before creating ff objects
   or create ff objects with ffobj <- ff(..., caching="mmeachflush") 
   or open your existing ff object with open(ffobj, caching="mmeachflush") 
(while it is closed).
   ff objects will remember this setting.

2) If you use caching="mmnoflush": check the writeback cache configuration of 
your filesystem (e.g. set data=writeback for ext3, tune limits for dirty pages, 
consider different filesystem, consider different OS). 

3) Choose a reasonable size for options("ffbatchbytes"), which limits the 
amount of RAM used for one chunk. 
   With too small chunks you pay more performance overhead. 
   Note that bigger chunks are not always better, for example if you distribute 
chunked processing on many cores or if some operation involved does not scale 
well with chunk size. 

Final remark: testing ff access functionality  on a Core i7 920 (4 cores, 8 
cores with HT) shows that hyperthreading with 8 parallel processes (snowfall, 
sockets) gives about 5x the performance of a single process, but already 7 
processes with HT perform worse than 4 processes without HT. Conclusion: if a 
machine is dedicated to R for RAM-critical applications, try switching 
hyperthreading off. 

Hope you find this useful. We appreciate any feedback.


Jens & Daniel

___
R-packages mailing list
r-packa...@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-packages



Re: [R] A question about the ff package

2010-01-07 Thread Jens Oehlschlägel
Peter,

ff objects are not allowed as subscripts to ff objects. You can take several 
routes:

1) use bit objects instead of logical or ff logical. This is fast and takes 
factor 32 less RAM than logicals (BTW bit objects can be coerced to ff via 
as.ff() and as.bit(), but they convert to vmode "boolean" (1 bit), not "logical" 
(2 bits)). Examples for working with bit are on 
http://ff.r-forge.r-project.org/ffbit_UseR!2009.pdf

2) convert your logicals into positive integer subscripts (assuming that there 
are not too many elements selected, as you assume if writing bigData[select,])

3) keep your logical in an ff logical or ff boolean and then do chunked looping 
over both - the ff with the subscripts and the ffdf - and in each chunk convert 
the logical selection to integers, see 2)
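
Routes 2) and 3) can be sketched roughly as follows. bigData and select are toy stand-ins for the objects in the question, and the chunk size is made up; plain seq() ranges are used for the chunks here, which is a simplifying assumption of this sketch rather than the only way to chunk.

```r
library(ff)

# Toy stand-ins for the objects in the question
bigData <- as.ffdf(data.frame(id = 1:20, value = (1:20) / 2))
select  <- as.ff(bigData$value[] > 8)      # logical selection kept in an ff

n  <- nrow(bigData)
by <- 8L                                   # rows per chunk (made-up value)
picked <- list()
for (from in seq(1L, n, by = by)) {
  to  <- min(from + by - 1L, n)
  # route 2) inside each chunk: logical -> positive integer subscripts
  idx <- which(select[from:to]) + from - 1L
  if (length(idx) > 0)
    picked[[length(picked) + 1L]] <- bigData[idx, ]
}
result <- do.call(rbind, picked)           # ordinary data.frame in RAM
```

Only one chunk of the logical selection is held in RAM at a time; the selected rows themselves still have to fit into RAM once they are combined.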

HTH


Jens Oehlschlägel


P.S. you  might want to try the newer version on r-forge. It has several 
improvements but is not yet on CRAN because there is currently some issue with 
snow leopard.



Re: [R] Ops method does not dispatch on either of two classes

2009-12-31 Thread Jens Oehlschlägel
Thanks Brian,

> This is as documented on the help page for Ops (the page in base, not 
> the one in Methods) which is linked from ?"|".  For background you 
> should also read the reference on that help page.

Unfortunately I have no access to that book.

> You are wrong in asserting that the internal method is 'for logicals': 
> please do study the help page.  It covers e.g. integer vectors which 
> is what I suspect you have here (assuming this is something to do with 
> package 'bit', unmentioned).

Yes, both bit and bitwhich are integer with an S3 class attribute (bitwhich 
sometimes is logical instead).

I am lost. What do I need to do to get "|.a" dispatched if calling 
a | b
where a and b are objects from S3 classes "a" and "b" that both have 
methods defined for "|"?

In the R Language Definition I find
"If they do not suggest a single method then the default method is used."
Does this mean it is not possible to write Ops methods for classes "a" and "b" 
such that "|.a" is called in 
a | b
?

I don't see how I can get any hook into the dispatch mechanism; my methods are 
always bypassed if the classes of e1 and e2 differ (simple example below).

Best wishes for 2010

Jens Oehlschlägel



> ca <- function(x){
+   x <- as.integer(x)
+   oldClass(x) <- "a"
+   x
+ }
> cb <- function(x){
+   x <- as.integer(x)
+   oldClass(x) <- "b"
+   x
+ }
> 
> a <- ca(1)
> b <- cb(1)
> 
> Ops.a <-
+ function(e1, e2){
+   cat("here Ops.a \n")
+   NULL
+ }
> 
> Ops.b <-
+ function(e1, e2){
+   cat("here Ops.b \n")
+   NULL
+ }
> 
> # OK, Ops.a dispatched
> a | a
here |.a 
NULL
> 
> # BUT both, Ops.a and Ops.b bypassed
> a | b
[1] TRUE
Warning message:
Incompatible methods ("|.a", "|.b") for "|" 
> 
> 
> "|.a" <- function(e1, e2){
+ cat("here |.a \n")
+ NULL
+ }
> 
> "|.b" <- function(e1, e2){
+ cat("here |.b \n")
+ NULL
+ }
> 
> # OK, |.a dispatched
> a | a
here |.a 
NULL
> 
> # BUT both, |.a and |.b bypassed
> a | b
[1] TRUE
Warning message:
Incompatible methods ("|.a", "|.b") for "|" 
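
One workaround that is consistent with the Ops help page (a method is used if the same method is found for both arguments) is to give both classes a shared parent class and define the operator only there. The following is a sketch under that assumption, not the solution adopted in package 'bit'; the class name "ab" and the variable calls are made up for illustration.

```r
# Constructors as in the example above, plus a shared parent class "ab"
# ("ab" is a hypothetical name)
ca <- function(x) structure(as.integer(x), class = c("a", "ab"))
cb <- function(x) structure(as.integer(x), class = c("b", "ab"))

calls <- character(0)   # records which method was dispatched

# Only ONE method, defined for the shared parent class
"|.ab" <- function(e1, e2) {
  calls <<- c(calls, "|.ab")
  as.logical(unclass(e1)) | as.logical(unclass(e2))
}

a <- ca(1)
b <- cb(0)

res_aa <- a | a   # dispatches to |.ab
res_ab <- a | b   # also dispatches to |.ab: same method for both arguments
res_ba <- b | a
```

Inside the shared method the arguments can then be coerced as needed, for example depending on inherits(e1, "a") and inherits(e2, "b").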



[R] Ops method does not dispatch on either of two classes

2009-12-28 Thread Jens Oehlschlägel
I have defined boolean methods for bit and bitwhich objects, for example
"|.bit" <- function(e1,e2)
and
"|.bitwhich" <- function(e1,e2)

Both methods coerce their arguments to the respective class; however, if I do 
something like 

bit_obj | bitwhich_obj

then I get a warning 

Warning message:
Incompatible methods ("|.bit", "|.bitwhich") for "|" 

and neither of the two methods is called. Instead the (internal) method for 
logicals seems to be called - not even coercing its arguments to logical. Same 
problem with Ops.bit and Ops.bitwhich.

What is the recommended way to get my methods reliably dispatched? 

Jens Oehlschlägel



Re: [R] Save workspace with ff objects

2009-11-27 Thread Jens Oehlschlägel
> My script generates a mixture of normal and ff objects. 
> Very often I would like to save the workspace for 
> each parameter setting, so that I can get back to it later on.
> Is there an easy way to do this, 
> instead of needing to save individual ff objects separately? 

With one save() you can store as many ff objects as you like. 
However, this does not save the ff files to a different location.

> I've tried the naive way of just saving the workspace, only to find that ff 
> objects are empty.

When loading the ff objects, the ff files need to be in their original 
locations.
You need to make sure that you do not overwrite those 
and they survive finalizer and tempdir remove at rm(ff) or q() time.
Do read the ff help on parameters 'filename', 'pattern', 'finalizer', 
'finonexit'.

The next version of ff will have ffsave() 
which will store a mixture of normal and ff objects 
*and* all ff-files into a ffarchive, i.e. two files 
ffarchive.RData and ffarchive.ffData 
from which you can restore all or a selection of 
ff objects / files using the ffload() command.

Regards
Jens Oehlschlägel



Re: [R] questions on the ff package

2009-11-27 Thread Jens Oehlschlägel
> I wonder how efficient it is to do the following command on a frequent 
> basis.
> nrow(matFF) <- nrow(matFF)+1

Obviously there is overhead (closing file, enlarging file, opening file). 
I recommend you measure yourself whether this is acceptable for you.

> no large file copying is needed each time the nrow is changed?

With a decent filesystem there is *no* copying from smaller to larger file.

> would you think I can open 2000 large matrices and leave them open or I
> need to close each after it is opened and used?

Not tested yet. 
I guess the number of open files can be configured when compiling your OS. 
Please test and let us know your experience.

Regards
Jens Oehlschlägel



[R] questions on the ff package

2009-11-25 Thread Jens Oehlschlägel
Jeff,

> I need to save a matrix as a memory-mapped file and load it back later. 
> To save the matrix, I use
> mat = matrix(1:20, 4, 5)
> matFF = ff(mat, dim=dim(mat), filename="~/a.mat"
> , overwrite=TRUE, dimnames = dimnames(mat))

# This stores the data in an ff file, 
# but not the metadata in R's ff object. 
# To do the latter you need to do 
save(matFF, file="~/matFF.RData")

# Assuming that your ff file remains in the same location, 
# in a new R session you simply 
load(file="~/matFF.RData")
# and the ff file is available automagically

> However, I don't always know the dimension when loading the matrix back.
> If I miss the dim attributes, ff will return it as vector. 
> Is there a way to load the matrix without specifying the dimension?

# You can create an ff object using your existing ff file by
# (vmode must match the file; matrix(1:20, 4, 5) was stored as integer)
matFF <- ff(filename="~/a.mat", vmode="integer", dim=c(4,5))

# You can do the same at unknown file size with 
matFF <- ff(filename="~/a.mat", vmode="integer")
# which gives you the length of the ff object
length(matFF)
# if you know the number of columns you can calculate the number of rows
# and give your ff object the interpretation of a matrix
dim(matFF) <- c(length(matFF)/5, 5)

> the matrix may grow in terms of the number of rows. 
> Is there an efficient way to do this?

# there are two ways to grow a matrix by rows

# 1) you create the matrix in row-major order
matFF <- ff(1:20, dim=c(4,5), dimorder=c(2:1))
# then you require a higher number of rows
nrow(matFF) <- 6
# as you can see there are new empty rows in the file
matFF

# 2) Instead of a matrix you create an ffdf data.frame
#    which you can also give more rows using nrow<-
#    An example of this is in read.table.ffdf
#    which reads a csv file in chunks and extends the 
#    number of rows in the ffdf

Jens Oehlschlägel



Re: [R] Error: cannot allocate vector of size...

2009-11-11 Thread Jens Oehlschlägel
For me with ff - on a 3 GB notebook - 3e6x100 works out of the box even without 
compression: doubles consume 2.2 GB on disk, but the R process remains under 
100MB, the rest of RAM being used by the file-system cache.
If you are under Windows, you can create the ffdf files in a compressed folder. 
For the random doubles this reduces size on disk to 230MB - which should even 
work on a 1GB notebook.
BTW: the most compressed datatype (vmode) that can handle NAs is "logical": 
it consumes 2 bits per tri-state value. The next most compressed is "byte", 
covering c(NA, -127:127) and consuming, as its name says, one byte per value 
on disk and in the fs-cache.

The code below should give an idea of how to do pairwise stats on columns where 
each pair fits easily into RAM. In the real world, you would not create the 
data but import it using read.csv.ffdf (expect that reading your file takes 
longer than reading/writing the ffdf).

Regards


Jens Oehlschlägel



library(ff)
k <- 100
n <- 3e6

# creating a ffdf dataframe of the required size
l <- vector("list", k)
for (i in 1:k)
  l[[i]] <- ff(vmode="double", length=n, update=FALSE)
names(l) <- paste("c", 1:k, sep="")
d <- do.call(ffdf, l)

# writing 100 columns of 1e6 random data takes 90 sec
system.time(
for (i in 1:k){
  cat(i, " ")
  print(system.time(d[,i] <- rnorm(n))["elapsed"])
  }
)["elapsed"]


m <- matrix(as.double(NA), k, k)

# pairwise correlating one column against all others takes ~ 17.5 sec
# pairwise correlating all combinations takes 15 min
system.time(
for (i in 2:k){
  cat(i, " ")
  print(system.time({
    x <- d[[i]][]
    for (j in 1:(i-1)){
      m[i,j] <- cor(x, d[[j]][])
    }
  })["elapsed"])
}
)["elapsed"]




[R] [R-pkgs] New version of package ff

2009-11-09 Thread Jens Oehlschlägel
Dear R community,

ff Version 2.1.1 is available on CRAN. It now supports large data.frames, 
csv import/export, packed atomic datatypes and bit filtering from package 
'bit', on which it now depends.

Some performance results in seconds from test data with 78 million rows and 7 
columns on a 3 GB notebook:

sequential reading 1 million rows: csv = 32.7  ffdf = 1.3
sequential writing 1 million rows: csv = 35.5  ffdf = 1.5

Examples of things you can do with ff and bit:
- direct random access to rows of large data-frame instead of talking to SQL 
database (?ffdf)
- store 4-level factor like A,T,G,C with 2bit instead of 32bit (?vmode)
- fast chunked iteration (?chunk)
- run linear model on large dataset using biglm (?chunk.ffdf)
- handle boolean selections by factor 32 faster and less RAM consuming (?bit)
- handle very skewed selections very fast (?bitwhich)
- parallel access to large dataset just by sending ff's small metadata from 
master to slaves (e.g. with snowfall)

ff is hosted on r-forge now and you find some presentations on ff at
http://ff.r-forge.r-project.org/

Hope you find this useful. We appreciate any feedback.


Jens & Daniel



Re: [R] Hand-crafting an .RData file

2009-11-09 Thread Jens Oehlschlägel
If you can manage to write out your data in separate binary files, one for each 
column, then another possibility is using package ff. You can link those binary 
columns into R by defining an ffdf dataframe: columns are memory mapped and you 
can access those parts you need - without initially importing them. This is 
much faster than a csv import and also works for files that are too large to 
import at once. If all your columns have the same storage.mode (vmode in ff), 
then another alternative is writing out all your data in one single binary 
matrix with major row-order (because that can be written row by row from your 
program) and link the file into R as a single ff_matrix.

Since ffdf in ff is new, I give a mini-tutorial below.
Let me know how that works for you.

Kind regards


Jens Oehlschlägel




library(ff)

# Create example csv
fnam <- "/tmp/example.csv"
write.csv(data.frame(a=1:9, b=1:9+0.1), file=fnam, row.names=FALSE)

# Create example binary files on disk.
# Reading csv into ffdf actually stores
# each column as a binary file on disk.
# Using a pattern outside fftempdir automatically sets finalizer="close"
# and thus makes those binary files permanent.
path <- "/tmp/example_"
x <- read.csv.ffdf(file=fnam, ff_args=list(pattern=path))
close(x)

# Note that a standard ffdf is made up column by column from simple ff objects.
# More complex mappings from ff objects into ffdf are possible, 
# but let's keep it simple for now.
p <- physical(x)
p

# Now let's just create an ffdf from existing binary files.
# Step one: create an ff object for each binary file (without reading them).

# Note that because we open ff files outside fftempdir, 
# the default finalizer is "close", not "delete", 
# so the file will not be deleted on finalization.
# Files are opened for memory mapping, but not read.
ffcols <- vector("list", length(p))
for (i in 1:length(p)){
  ffcols[[i]] <- ff(filename=filename(p[[i]]), vmode=vmode(p[[i]]))
}
ffcols

# Step two: bundle several ff objects into one ffdf data.frame 
# (still without reading data)
ffdafr <- ffdf(a=ffcols[[1]], b=ffcols[[2]])

# now reading rows from this will return a standard data.frame 
# (and only read the required rows)
ffdafr[1:4,]
ffdafr[5:9,]


# As an alternative create an example binary 
# (double) matrix in major row order
y <- as.ff(t(ffdafr[,]), filename="/tmp/example_single_matrix.ff")

# Again we can link this existing binary file.
# If we know the size of the matrix we can do
z <- ff(filename=filename(y), vmode="double", dim=c(9,2), dimorder=c(2,1))
z
rm(z)

# If we only know the number of columns we can do
z <- ff(filename=filename(y), vmode="double")
# and set dim later
dim(z) <- c(length(z)/2, 2)
# Note that so far we have interpreted the file in major column order
z
# To interpret the file in major row order we set dimorder 
# (a generalization for n-way arrays)
dimorder(z) <- c(2,1)
z


# removing the ff objects will trigger the finalizer 
# at next garbage collection
rm(x, ffcols, ffdafr, y, z)
gc()

# since we carefully selected the "close" finalizer, 
# the files still exist
dir(path="/tmp", pattern="example_")

# now remove them physically
unlink(file.path("/tmp", dir(path="/tmp", pattern="example_")))



Re: [R] Incremental ReadLines

2009-11-05 Thread Jens Oehlschlägel
Gene,

You might want to look at function read.csv.ffdf from package ff, which can read 
large csv files into an ffdf object. That is a kind of data.frame which is stored 
on disk and, respectively, in the file-system cache. Once you subscript part of 
it, you get a regular data.frame. 
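
A minimal sketch of this, assuming a small numeric-only csv file written to a temporary location (file name and contents are made up for illustration):

```r
library(ff)

# Hypothetical example file with two numeric columns
csv <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:9, b = 1:9 + 0.1), file = csv, row.names = FALSE)

# The columns end up in ff files on disk, not in R's memory
ffd <- read.csv.ffdf(file = csv)

# Subscripting rows returns an ordinary data.frame holding only those rows
part <- ffd[2:4, ]
```

For large files the arguments first.rows and next.rows of read.csv.ffdf control how many rows are read per chunk.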


Jens Oehlschlägel


[R] (no subject)

2009-10-01 Thread Jens Oehlschlägel
Hi,
Does anyone know where the following package is available: 

Holleczek B, Gondos A, Brenner H.
PeriodR - an R package to calculate long term survival estimates using period 
analysis.
Methods of Information in Medicine 2009; 48: 123-128.

Thanks
Jens Oehlschlägel


[R] questions on csv reading

2009-09-26 Thread Jens Oehlschlägel
Hi,

Is there any official way to determine the colClasses of a data.frame?
Why has POSIXct such a strange class structure?
Why is colClasses "ordered" not allowed (and doesn't work)?

Background
==========
I am writing a chunked csv reader that provides the functionality of read.table 
for large files (in the next version of package ff). In chunked reading, one 
wants to learn the colClasses from the data.frame returned for the first chunk 
and submit this as argument colClasses= to the following chunks (following 
calls to read.table). 

For most column types 
colClasses <- sapply(data.frame, class)
works fine. However, two column types have more than one class: 

"ordered" has c("ordered", "factor") - currently we can't tell read.table that 
a column is an ordered factor

"POSIXct" has c("POSIXt","POSIXct") - here the LESS specific class "POSIXt" is 
in the first position and would win in class-dispatch over the MORE specific 
class "POSIXct". Why?
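
For the single-class column types, the chunked pattern described above might look like the following sketch in base R. The file, column names and chunk size are made up for illustration; taking only the first class of each column is an assumption that sidesteps, rather than answers, the questions about "ordered" and "POSIXct".

```r
# Hypothetical example file; column names and chunk size are made up
csv <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10, value = (1:10) + 0.5,
                     label = letters[1:10]),
          file = csv, row.names = FALSE)

chunk_rows <- 4

# First chunk: let read.csv guess the column types
first <- read.csv(csv, nrows = chunk_rows, stringsAsFactors = FALSE)

# Keep only the first class of each column, so that multi-class columns
# would not turn the result into a list
col_classes <- vapply(first, function(x) class(x)[1], character(1))

# Following chunk: skip the header line plus the rows already read,
# and reuse names and classes learned from the first chunk
second <- read.csv(csv, header = FALSE, skip = 1 + chunk_rows,
                   nrows = chunk_rows, col.names = names(first),
                   colClasses = col_classes)
```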


Jens Oehlschlägel



Re: [R] Is there any difference between <- and =

2009-03-12 Thread Jens Oehlschlägel
Sean,

> would like to receive expert opinion to avoid potential trouble
[..]
> i think the following is the most secure way if one really
> really has to do assignment in a function call
>    f({a=3})
> and if one keeps this convention, <- can be dropped altogether.

"Secure" is relative, since due to R's lazy evaluation you never know whether a 
function's argument is being evaluated. Look at:

> f <- function(x)TRUE
> x <- 1
> f((x=2)) # obscured attempt to assign in a function call
[1] TRUE
> x
[1] 1

Thus there is dangerous advice in the referenced blog which reads:

"f(x <- 3)
which means assign 3 to x, and call f with the first argument set to the value 
3"

This might be the case in C but not in R. Actually in R f(x <- 3) means: call 
f with a first unevaluated argument "x <- 3", and if and only if f decides to 
evaluate its first argument, then the assignment is done. To make this very 
clear:

> f <- function(x)if(runif(1)>0.5) TRUE else x
> x <- 1
> print(f(x <- x + 1))
[1] TRUE
> print(f(x <- x + 1))
[1] 2
> print(f(x <- x + 1))
[1] 3
> print(f(x <- x + 1))
[1] TRUE
> print(f(x <- x + 1))
[1] 4
> print(f(x <- x + 1))
[1] 5
> print(f(x <- x + 1))
[1] TRUE
> print(f(x <- x + 1))
[1] 6
> print(f(x <- x + 1))
[1] TRUE

Here it is unpredictable whether your assignment takes place. Thus assigning 
like f({x=1}) or f((x=1)) is the most dangerous thing to do: even if you have 
a code-reviewer and the guy is aware of the danger of f(x<-1) he will probably 
miss it because f((x=1)) does look too similar to a standard call f(x=1).

According to help("<-"), R's assignment operator is rather <- than =:

"The operators <- and = assign into the environment in which they are evaluated. 
The operator <- can be used anywhere, whereas the operator = is only allowed at 
the top level (e.g., in the complete expression typed at the command prompt) or 
as one of the subexpressions in a braced list of expressions."

So my recommendation is 
1) use R's assignment operator with a space on each side (or assign()) and don't 
obscure assignments by using C's assignment operator (or other languages' 
equality operator)
2) do not assign in function arguments unless you have good reasons like in 
system.time(x <- something)
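
The difference between an argument that is never evaluated and one that is always evaluated can be made deterministic with force(). This is a small sketch in base R; the function names f_lazy and f_forced are made up for illustration.

```r
f_lazy   <- function(x) TRUE                 # never evaluates x
f_forced <- function(x) { force(x); TRUE }   # always evaluates x

x <- 1
f_lazy(x <- 2)        # promise never forced: no assignment happens
x_after_lazy <- x     # still 1

f_forced(x <- 3)      # promise forced: assignment happens in the caller
x_after_forced <- x   # now 3

# system.time() evaluates its argument, so this assignment is reliable
timing <- system.time(total <- sum(1:10))
```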

HTH


Jens Oehlschlägel

P.S. Disclaimer: you can consider me biased towards <-, never trust experts, 
whether experienced or not.

P.P.S. a puzzle, following an old tradition:

What is going on here? (and what would you need to do to prove it?)

> search()
[1] ".GlobalEnv"        "package:stats"     "package:graphics"  "package:grDevices" "package:utils"     "package:datasets"  "package:methods"  
[8] "Autoloads"         "package:base"     
> ls(all.names = TRUE)
[1] "y"
> y
[1] 1 2 3
> identical(y, 1:3)
[1] TRUE
> y[] <- 1  # assigning 1 fails
> y
[1] 1 2 3
> y[] <- 2  # assigning 2 works
> y
[1] 2 2 2
> 
> # Tip: no standard packages modified, no extra packages loaded, neither 
> # classes nor methods defined, no print methods hiding anything, if you would 
> # investigate my R you would not find any false bottom anymore
> 
> version
               _                           
platform       i386-pc-mingw32             
arch           i386                        
os             mingw32                     
system         i386, mingw32               
status                                     
major          2                           
minor          8.1                         
year           2008                        
month          12                          
day            22                          
svn rev        47281                       
language       R                           
version.string R version 2.8.1 (2008-12-22)



[R] [R-pkgs] New package: bit 1.0

2008-10-11 Thread Jens Oehlschlägel
Dear R community,

Package 'bit' Version 1.0 is available on CRAN. 
It provides bitmapped vectors of booleans (no NAs), 
coercion from and to logicals, integers and integer subscripts; 
fast boolean operators and fast summary statistics. 

With bit vectors you can store true binary booleans {FALSE,TRUE} at the expense 
of 1 bit only; on a 32 bit architecture this means factor 32 less RAM and 
factor 32 more speed on boolean operations. With this speed gain it even 
pays off to convert to bit in order to avoid a single boolean operation on 
logicals or a single set operation on (longer) integer subscripts; the pay-off 
is dramatic when such components are used more than once. 
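
A small sketch of what working with bit vectors looks like. The vector length and the selected positions are made up for illustration, and the sketch makes no claim about timings.

```r
library(bit)

n <- 64L
sel1 <- bit(n)                 # all FALSE, 1 bit per element
sel2 <- bit(n)
sel1[c(2, 5, 40)]  <- TRUE
sel2[c(5, 40, 41)] <- TRUE

both   <- sel1 & sel2          # boolean operators work on bit vectors
either <- sel1 | sel2

n_both   <- sum(both)          # number of TRUE elements
n_either <- sum(either)
pos_both <- which(as.logical(both))   # back to ordinary integer positions
```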

Reading from and writing to bit is approximately as fast as accessing standard 
logicals - mostly due to R's time for memory allocation. The package allows 
working with pre-allocated memory for return values by calling .Call() directly: 
when evaluating the speed of C-access with pre-allocated vector memory, copying 
from bit to logical requires only 70% of the time for copying from logical to 
logical; and copying from logical to bit comes at a performance penalty of 150%.

Functions 'which' and 'xor' are made S3 generic, 'xor.default' is implemented 
much faster than in base R (this should go into base R).

The package has automated regression-tests and is hopefully useful for better
handling large datasets, together with packages 'rindex' and 'ff'.

Best regards


Jens Oehlschlägel
Munich, 10.10.2008



Re: [R] R 2.7.0 is released

2008-04-22 Thread Jens Oehlschlägel
Many thanks to the core team for an impressive list of new improvements ...

 o strwidth() and strheight() gain 'font' and 'vfont' arguments and
accept in-line pars such as 'family' in the same way as text()
does. (Longstanding wish of PR#776)

... and for not having forgotten an 8 year old wish!


Jens Oehlschlaegel



Re: [R] cbind function

2007-11-13 Thread Jens Oehlschlägel
And here is a second solution, that differs in what happens if the variables 
have differing lengths:

> var1 <- 1:4
> var2 <- 1:3
> sapply(ls(patt="^var[0-9]"), get)
$var1
[1] 1 2 3 4

$var2
[1] 1 2 3

> do.call(cbind, lapply(ls(patt="^var[0-9]"), get))
     [,1] [,2]
[1,]    1    1
[2,]    2    2
[3,]    3    3
[4,]    4    1
Warning message:
In cbind(1:4, 1:3) :
  number of rows of result is not a multiple of vector length (arg 2)
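
If recycling of the shorter vector is not wanted, a third variant is to pad with NA before binding. This is a sketch in base R; the variables are placed in their own environment e so that the example does not depend on the rest of the workspace, which is an assumption of the sketch (the examples above use the global environment).

```r
e <- new.env()
assign("var1", 1:4, envir = e)
assign("var2", 1:3, envir = e)

# mget() returns a named list of the matching variables
vars <- mget(ls(e, pattern = "^var[0-9]"), envir = e)

# Pad every vector with NA up to the longest length, then bind as columns
n <- max(lengths(vars))
padded <- sapply(vars, function(v) v[seq_len(n)])
```

Here padded is a 4 x 2 matrix with column names var1 and var2 and an NA in the last row of var2.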

Best regards


Jens Oehlschlägel