[R] subset ffdf does not accept bit vector anymore (package ffbase)

2014-09-25 Thread christian.kamenik
Hi everyone

Since I updated package 'ffbase', subset.ffdf does not work with bit vectors 
anymore. Here is a short example:

data(iris)

library(ffbase)
iris.ffdf <- as.ffdf(iris)
index <- sample(c(FALSE,TRUE), nrow(iris), TRUE)
index.bit <- as.bit(index)

subset(iris.ffdf, subset=index.bit)

results in the error message:
Error in which(eval(e, nl, envir)) : argument to 'which' is not logical


My code was working prior to the update, and the help page for subset.ffdf
says:

subset: an expression, ri, bit or logical ff vector that can be 
used to index x

Any help would be highly appreciated.

Many thanks
Christian



> R.Version()
$platform
[1] "i386-w64-mingw32"

$arch
[1] "i386"

$os
[1] "mingw32"

$system
[1] "i386, mingw32"

$status
[1] ""

$major
[1] "3"

$minor
[1] "1.1"

$year
[1] "2014"

$month
[1] "07"

$day
[1] "10"

$`svn rev`
[1] "66115"

$language
[1] "R"

$version.string
[1] "R version 3.1.1 (2014-07-10)"

$nickname
[1] "Sock it to Me"

> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: i386-w64-mingw32/i386 (32-bit)

locale:
[1] LC_COLLATE=German_Switzerland.1252  LC_CTYPE=German_Switzerland.1252
    LC_MONETARY=German_Switzerland.1252
[4] LC_NUMERIC=C  LC_TIME=German_Switzerland.1252

attached base packages:
[1] stats  graphics  grDevices  utils  datasets  methods  base

other attached packages:
[1] stringr_0.6.2  ffbase_0.11.3  ff_2.2-13  bit_1.1-12  track_1.0-15

loaded via a namespace (and not attached):
[1] fastmatch_1.0-4  tools_3.1.1




__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Trouble with subset.ffdf

2014-04-28 Thread christian.kamenik
Dear all



I am having trouble with subsetting an ffdf object, hopefully somebody can 
help...



I have an index, which is a ff object of vmode logical:

> index.SAS
ff (open) logical length=4977231 (4977231)
      [1]       [2]       [3]       [4]       [5]       [6]       [7]       [8]
     TRUE      TRUE      TRUE      TRUE      TRUE      TRUE      TRUE      TRUE
          [4977224] [4977225] [4977226] [4977227] [4977228] [4977229] [4977230] [4977231]
        :      TRUE      TRUE      TRUE      TRUE      TRUE      TRUE      TRUE      TRUE



I would like to use this index to subset the ffdf object data.SAS. The number 
of rows in data.SAS equals the length of index.SAS. However, the command

> Missing.data <- subset(data.SAS, !index.SAS)



gives me the following error:

Error in ffdf(x = x) : ffdf components must be atomic ff objects



A similar command also results in an error:

> Missing.data <- data.SAS[!index.SAS,]
Error: vmode(index) == "integer" is not TRUE



I do not want to use index.SAS[], which pulls the whole index into RAM. It 
works in many cases but sometimes crashes, and as far as I understand it will 
cause trouble with really large index vectors (I would prefer using ff objects).



So I came up with the following syntax, which seems to work:

> Missing.data <- data.SAS[ffwhich(index.SAS, index.SAS==FALSE),]



...I am just not sure if this is the right approach.



I am running



platform       i386-w64-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status
major          3
minor          0.3
year           2014
month          03
day            06
svn rev        65126
language       R
version.string R version 3.0.3 (2014-03-06)
nickname       Warm Puppy



with ffbase_0.11.3 and ff_2.2-12



Many thanks in advance

Christian




Re: [R] read.table.ffdf and fixed width files

2013-08-12 Thread christian.kamenik
Dear R Users

This is a summary of the things I tried with read.table.ffdf and fixed-width 
files. I would like to thank Jan Wijffels and Jan van der Laan for their 
suggestions and the time they spent on my problem!

My objective was to import a file with 6'079'455 lines and 32 variables using 
the tools provided by the ff package. The fixed-width file I got was supposed 
to have a total width of 238. But it turned out that the last column, which 
should have had a width of four, contained either no entry, or entries with one 
or two characters followed by \n\r. The corresponding spaces were dropped when 
the file was created. This could be shown by

   lines <- readLines("my_file.txt")
   range(nchar(lines))

which resulted in 235 237 instead of 238. So the file was not really fixed 
width...
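
The check above can be tried on a toy file. This is only a minimal sketch: the three records, the declared width of 8 and the names declared_width, line_lengths and short are made up for illustration and have nothing to do with the real 238-character file.

```r
# Hypothetical toy data: records should be 8 characters wide,
# but trailing spaces were dropped from two of them.
declared_width <- 8
tmp <- tempfile(fileext = ".txt")
writeLines(c("001AB  X", "002CD", "003EF Y"), tmp)

lines <- readLines(tmp)
line_lengths <- nchar(lines)

range(line_lengths)   # shortest and longest line
table(line_lengths)   # how many lines have each length

# Records that are shorter than the declared width and would need padding
short <- which(line_lengths < declared_width)
unlink(tmp)
```

Tabulating the lengths, rather than looking only at the range, shows at a glance whether a few records are damaged or the whole file is ragged.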

I tried importing the file with

library(ff)
library(stringr)
my.data <- read.table.ffdf(file = "my_file.txt",
    FUN = "read.fwf",
    widths = my.widths,
    header = FALSE, VERBOSE = TRUE, first.rows = 10,
    col.names = my.names,
    fileEncoding = "LATIN1",
    transFUN = function(x) {
      # trim whitespace, turn empty strings into NA, convert to factor
      z <- sapply(x, function(y) {
        y <- str_trim(y)
        y[y == ""] <- NA
        factor(y)
      })
      as.data.frame(z)
    }
)

This took 4168 seconds and resulted in an object that included only 100'000 
lines instead of 6'079'455 lines (I still don't know why...).


Another approach was to use laf_open_fwf from package LaF and then laf_to_ffdf 
from package ffbase, which is really simple as long as no line is shorter than 
the declared total width (i.e. 238). So the idea was to add the missing spaces 
by running

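con <- file("my_file.txt", "rt")
out <- file("my_file_converted.txt", "wt")
system.time(
   while (TRUE) {
     # read in chunks of 100'000 lines to keep memory use bounded
     lines <- readLines(con, encoding = 'LATIN1', n = 1E5)
     if (length(lines) == 0) break
     # left-justify and pad with spaces on the right up to width 238
     lines <- sprintf("%-238s", lines)
     writeLines(lines, out, useBytes = TRUE)
   }
)
close(con)
close(out)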

and then

library(LaF)
library(ffbase)
my.data.laf <- laf_open_fwf("my_file_converted.txt", column_types = my.types,
    column_widths = my.widths, column_names = my.names)
my.data <- laf_to_ffdf(my.data.laf)

This worked really well, except that the whole process took quite some time. 
Appending the spaces took 2436 seconds, and converting the file from laf to 
ffdf took another 2628 seconds.
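
For anyone who wants to check the padding step before running it on a large file, here is a small self-contained sketch. The toy records, the width of 8 and the chunk size of 2 are assumptions chosen for illustration; only the pattern (read a chunk, pad with sprintf, write, then verify) mirrors the approach described above.

```r
# Hypothetical toy input: three records, two of them too short.
target_width <- 8
infile  <- tempfile(fileext = ".txt")
outfile <- tempfile(fileext = ".txt")
writeLines(c("001AB  X", "002CD", "003EF Y"), infile)

con <- file(infile, "rt")
out <- file(outfile, "wt")
repeat {
  chunk <- readLines(con, n = 2)   # tiny chunk size, for illustration only
  if (length(chunk) == 0) break
  # "%-8s": left-justify and pad with spaces on the right up to width 8
  chunk <- sprintf(paste0("%-", target_width, "s"), chunk)
  writeLines(chunk, out)
}
close(con)
close(out)

# Verify that every record now has the target width
padded <- readLines(outfile)
all_equal_width <- all(nchar(padded) == target_width)
unlink(c(infile, outfile))
```

Note that sprintf only pads. A record longer than the target width is written unchanged, so the final check is worth keeping.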


The third approach I tested was the fastest, but used the Unix/Linux program 
awk outside R (run on Cygwin installed on Windows 7 32-bit):

First, I converted my original file into a tab-delimited text file using awk:
awk -v FIELDWIDTHS='3 28 4 30 28 6 3 30 10 3 3 6 6 5 1 2 1 1 2 2 2 4 2 4 7 30 1 1 3 2 4 4' \
    -v OFS='\t' '{ $1=$1 ; print }' my_file.txt > my_file_delimited.txt

Then I used read.delim.ffdf provided by the ff package:

library(ff)
library(stringr)
my.data <- read.delim.ffdf(file = "my_file_delimited.txt",
  header = FALSE, VERBOSE = TRUE, first.rows = 10,
  col.names = my.names,
  colClasses = my.classes,
  fileEncoding = "LATIN1",
  transFUN = function(x) {
    # trim whitespace, turn empty strings into NA, convert to factor
    z <- sapply(x, function(y) {
      y <- str_trim(y)
      y[y == ""] <- NA
      factor(y)
    })
    as.data.frame(z)
  }
)


Running awk took only 203 seconds! And the import of the delimited file was 
finished after 1141 seconds.
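
Where awk is not available, the same fixed-width to tab-delimited conversion can be sketched in base R with substring(). This is an illustration under assumptions, not the method used above: the three-field layout, the toy records and the helper name fwf_to_tsv are hypothetical, and on a file of this size it would need chunked reading and may well be slower than awk.

```r
# Hypothetical layout: three fields of widths 3, 2 and 3
# (not the real file's 32 fields).
widths <- c(3, 2, 3)
ends   <- cumsum(widths)
starts <- ends - widths + 1

fwf_to_tsv <- function(lines, starts, ends) {
  # For each line, cut out every field and join the pieces with tabs.
  vapply(lines, function(line) {
    paste(substring(line, starts, ends), collapse = "\t")
  }, character(1), USE.NAMES = FALSE)
}

records <- c("001AB  X", "002CD", "003EF Y")   # second and third are short
tsv <- fwf_to_tsv(records, starts, ends)
```

substring() returns an empty string for a field that lies past the end of a line, so short records produce empty trailing fields instead of an error.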

What I like most about the variants of read.table.ffdf, and also about 
laf_to_ffdf, is the transFUN argument! Have a look at it; it allows a lot of 
fine tuning.

Best regards

Christian Kamenik
Project Manager

Federal Department of the Environment, Transport, Energy and Communications 
DETEC
Federal Roads Office FEDRO
Division Road Traffic
Road Accident Statistics

Mailing Address: 3003 Bern
Location: Weltpoststrasse 5, 3015 Bern

Tel +41 31 323 14 89
Fax +41 31 323 43 21

christian.kame...@astra.admin.ch
www.astra.admin.ch

From: Jan Wijffels [mailto:jwijff...@bnosac.be]
Sent: Thursday, 8 August 2013 11:46
To: Kamenik Christian ASTRA
Subject: Re: read.table.ffdf and fixed width files

Christian,

You probably misspecified column names in the transFUN. Note that 
read.table.ffdf reads your data in chunks and appends each chunk to an ffdf. In 
transFUN you get one chunk in RAM, on which you can do data manipulations. 
It should return a data.frame, which will be appended to your ffdf.
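
To make the chunk-in, data.frame-out contract concrete, here is a minimal sketch of such a function applied to a plain data.frame. The chunk contents and the name clean_chunk are hypothetical. It uses lapply with x[] <- ... so that the result stays a data.frame of factors regardless of the stringsAsFactors default, and base R's trimws in place of stringr::str_trim.

```r
# A hypothetical chunk, as a transFUN might receive it (plain data.frame in RAM).
chunk <- data.frame(
  code = c(" 001", "002 ", "   "),
  type = c("Personenwagen  ", "", "Landwirt. Traktor"),
  stringsAsFactors = FALSE
)

clean_chunk <- function(x) {
  # Process column by column; x[] <- ... keeps x a data.frame
  # with the same column names.
  x[] <- lapply(x, function(y) {
    y <- trimws(as.character(y))   # strip padding from fixed-width fields
    y[y == ""] <- NA               # empty fields become missing values
    factor(y)                      # store character data as a factor
  })
  x
}

cleaned <- clean_chunk(chunk)
```

A function of this shape could then be passed as transFUN = clean_chunk. Whether it suits a given file depends on the column types, so treat it as a starting point.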

So. This worked out fine for me.

Jan

require(ff)

Re: [R] laf_open_fwf

2013-08-09 Thread christian.kamenik
Jan,

Many thanks for your suggestion! The code runs perfectly fine on the test set. 
Applying it to the complete data set, however, results in the following error:

> while (TRUE) {
+  lines <- readLines(con, encoding='LATIN1')
+  if (length(lines) == 0) break
+  lines <- sprintf("%-238s", lines)
+  writeLines(lines, out, useBytes=TRUE) }
Error: cannot allocate vector of size 23.2 Mb


Best regards

Christian Kamenik
Project Manager



-----Original Message-----
From: Jan van der Laan [mailto:rh...@eoos.dds.nl]
Sent: Friday, 9 August 2013 10:01
To: Kamenik Christian ASTRA
Subject: Re: AW: AW: [R] laf_open_fwf

Christian,

It seems some of the lines in your file have additional characters at the end 
causing the line lengths to vary. The only way I could think of is to first add 
whitespace to the shorter lines to make all line lengths equal:

# Add whitespace to the end of the lines to make all lines the same length
con <- file("testdata.txt", "rt")
out <- file("testdata_2.txt", "wt")
while (TRUE) {
   lines <- readLines(con, n=1E5)
   if (length(lines) == 0) break
   lines <- sprintf("%-238s", lines)
   writeLines(lines, out, useBytes=TRUE)
}
close(con)
close(out)


I am then able to read your test file using LaF:

library(LaF)

column_widths <- c(3, 28, 4, 30, 28, 6, 3, 30, 10, 26, 25, 30, 2, 5, 5)
column_types <- rep("string", length(column_widths))
column_types[c(1, 3, 7)] <- "integer"

laf <- laf_open_fwf("testdata_2.txt", column_types = column_types,
    column_widths = column_widths)


HTH,
Jan







christian.kame...@astra.admin.ch wrote:

 Hello Jan

 I attached an example. Any help is highly appreciated!

 Kind regards

 Christian Kamenik
 Project Manager

 -----Original Message-----
 From: Jan van der Laan [mailto:rh...@eoos.dds.nl]
 Sent: Thursday, 8 August 2013 13:58
 To: r-help@r-project.org
 Cc: Kamenik Christian ASTRA
 Subject: Re: AW: [R] laf_open_fwf


 Without example data it is difficult to give suggestions on how you 
 might read this file.

 Are you sure your file is fixed width? Sometimes columns are neatly 
 aligned using whitespace (tabs/spaces). In that case you could use 
 read.table with the default settings.

 Another possibility might be that the file is encoded in utf8. I 
 expect that reading it in assuming another encoding (such as latin1) 
 would lead to varying line sizes. Although I would expect the lengths 
 to be larger than the sum of your column widths (as one symbol can be 
 larger than one byte).
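
 The byte-versus-character point can be illustrated with a short sketch. The example string is made up; the only claim is that a non-ASCII character such as "ü" takes two bytes in UTF-8 and one byte in latin1, so a UTF-8 line whose bytes are counted one by one looks longer than its character count.

```r
# Hypothetical example string "Zürich", forced to UTF-8
x <- enc2utf8("Z\u00fcrich")

n_chars        <- nchar(x, type = "chars")   # number of characters
n_bytes_utf8   <- nchar(x, type = "bytes")   # number of bytes in UTF-8

# The same text converted to latin1 needs one byte per character
x_latin1       <- iconv(x, from = "UTF-8", to = "latin1")
n_bytes_latin1 <- nchar(x_latin1, type = "bytes")
```

 Comparing nchar(lines, type = "bytes") with nchar(lines, type = "chars") on the real file is one way to see whether encoding plays a role at all.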

 Jan




Re: [R] laf_open_fwf

2013-08-08 Thread christian.kamenik
Dear Jan

Many thanks for your help. In fact, all lines are shorter than my column 
width...

my.column.widths:   238
range(nchar(lines)):235 237

So, it seems I have an inconsistent file structure...
I guess there is no way to handle this in an automated way?

Best regards

Christian Kamenik
Project Manager

-----Original Message-----
From: Jan van der Laan [mailto:rh...@eoos.dds.nl]
Sent: Wednesday, 7 August 2013 20:57
To: r-help@r-project.org
Cc: Kamenik Christian ASTRA
Subject: Re: [R] laf_open_fwf

Dear Christian,

Well... it shouldn't normally do that. The only way I can currently think of 
that might cause this problem is that the file has \r\n\r\n, which would mean 
that every line is followed by an empty line.
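
One way to see which line terminators a file really contains is to count the raw bytes, since readLines() strips the terminators. The following is a minimal sketch on a made-up file with mixed endings; the file contents and variable names are assumptions for illustration only.

```r
# Hypothetical toy file with mixed line endings, written byte for byte.
tmp <- tempfile(fileext = ".txt")
writeBin(charToRaw("001AB\n002CD\r\n003EF\r\n"), tmp)

bytes <- readBin(tmp, what = "raw", n = file.size(tmp))
n_lf  <- sum(bytes == as.raw(0x0a))   # "\n" (line feed)
n_cr  <- sum(bytes == as.raw(0x0d))   # "\r" (carriage return)

# readLines() accepts LF, CRLF and CR as line ends,
# so the mix is invisible at this level
n_lines <- length(readLines(tmp))
unlink(tmp)
```

If n_cr is neither zero nor equal to n_lf, the file mixes "\n" and "\r\n" endings, and a reader that works on fixed byte offsets will drift out of alignment.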

Another cause might be (although I would not really expect the results you see) 
that the sum of your column widths is larger than the actual width of the line.

You can check your line lengths using:

lines - readLines(my.filename)
nchar(lines)

Each line should have the same length and be equal to (or at least larger than) 
sum(my.column.widths)

If this is not the problem: would it be possible that you send me a small part 
of your file so that I could try to reproduce the problem? Or if you cannot 
share your data: replace the actual values with nonsense values.

Regards,
Jan

PS I read your mail by chance as I am not a regular r-help reader. When you 
have specific LaF problems it is better to also cc me directly.



[R] read.table.ffdf and fixed width files

2013-08-06 Thread christian.kamenik
Dear all

I am working on Windows 7 32-bit, and the ff package is my daily life-saver to 
overcome the inherent memory limitations. Recently, I tried using 
read.table.ffdf to import data from a fixed-width ASCII file (file size: 
1'440'865'015 bytes) with 6'079'455 lines and 32 variables using the command

read.table.ffdf(file=my.filename, FUN="read.fwf", width=my.format,
    asffdf_args=list(col_args=list(pattern = my.pattern)))

The command generates a temporary file, which has 1'629'328'120 Bytes, plus 32 
ff files following my.pattern. The latter 32 files, however, only take up 
136'000 Bytes. And the resulting R object has a dimension of 1000 x 32. To me, 
it seems that read.table.ffdf aborts the data import after 1000 lines, instead 
of importing the entire file.

I tried running read.table.ffdf with different parameter settings, I was 
browsing the help pages and the mailing lists, but I did not find any hint on 
why read.table.ffdf aborts the data import. (Does it really? - The file size of 
the temporary file suggests that all data were read.)

Any help would be highly appreciated

Best regards

Christian Kamenik
Project Manager





[R] laf_open_fwf

2013-08-06 Thread christian.kamenik
Dear all

I was trying the (fairly new) LaF package, and came across the following 
problem:

I opened a connection to a fixed width ASCII file using
laf_open_fwf(my.filename, my.column_types, my.column_widths, my.column_names)

When looking at the data, it turned out that \n (newline) and \r (carriage 
return) were considered as characters, thus destroying the structure in my data 
(the second column does not include any numbers):

> my.data[1565:1575,1:3]

   MF_FARZ1  Fahrzeugarttext      MF_MARKE
1  \n043 Landwirt. Traktor        2140
2  \n043 Landwirt. Traktor        6206
3  \n001 Personenwagen            2026
4  \n001 Personenwagen            2026
5  \r\n00 1Personenwagen          404
6  \r\n02 0Gesellschaftswagen     710
7  \r\n00 1Personenwagen          505
8  \r\n00 1Personenwagen          505
9  \r\n00 1Personenwagen          301
10 \r\n00 1Personenwagen          553
11 \r\n04 3Landwirt. Traktor      257

I am working on Windows 7 32-bit.

Any help would be highly appreciated.

Best regards

Christian Kamenik
Project Manager


