[R] subset ffdf does not accept bit vector anymore (package ffbase)
Hi everyone,

Since I updated package 'ffbase', subset.ffdf no longer works with bit vectors. Here is a short example:

    data(iris)
    library(ffbase)
    iris.ffdf <- as.ffdf(iris)
    index <- sample(c(FALSE, TRUE), nrow(iris), TRUE)
    index.bit <- as.bit(index)
    subset(iris.ffdf, subset = index.bit)

This results in the error message:

    Error in which(eval(e, nl, envir)) : argument to 'which' is not logical

My code was working prior to the update, and the help page for subset.ffdf says:

    subset: an expression, ri, bit or logical ff vector that can be used to index x

Any help would be highly appreciated.

Many thanks
Christian

R.Version()
    $platform        [1] i386-w64-mingw32
    $arch            [1] i386
    $os              [1] mingw32
    $system          [1] i386, mingw32
    $status          [1]
    $major           [1] 3
    $minor           [1] 1.1
    $year            [1] 2014
    $month           [1] 07
    $day             [1] 10
    $`svn rev`       [1] 66115
    $language        [1] R
    $version.string  [1] R version 3.1.1 (2014-07-10)
    $nickname        [1] Sock it to Me

sessionInfo()
    R version 3.1.1 (2014-07-10)
    Platform: i386-w64-mingw32/i386 (32-bit)

    locale:
    [1] LC_COLLATE=German_Switzerland.1252  LC_CTYPE=German_Switzerland.1252  LC_MONETARY=German_Switzerland.1252
    [4] LC_NUMERIC=C                        LC_TIME=German_Switzerland.1252

    attached base packages:
    [1] stats graphics grDevices utils datasets methods base

    other attached packages:
    [1] stringr_0.6.2 ffbase_0.11.3 ff_2.2-13 bit_1.1-12 track_1.0-15

    loaded via a namespace (and not attached):
    [1] fastmatch_1.0-4 tools_3.1.1

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
[R] Trouble with subset.ffdf
Dear all,

I am having trouble subsetting an ffdf object; hopefully somebody can help.

I have an index, which is an ff object of vmode logical:

    index.SAS
    ff (open) logical length=4977231 (4977231)
          [1]       [2]       [3]       [4]       [5]       [6]       [7]       [8]           [4977224] [4977225] [4977226] [4977227] [4977228] [4977229]
         TRUE      TRUE      TRUE      TRUE      TRUE      TRUE      TRUE      TRUE         :      TRUE      TRUE      TRUE      TRUE      TRUE      TRUE
    [4977230] [4977231]
         TRUE      TRUE

I would like to use this index to subset the ffdf object data.SAS. The number of rows in data.SAS equals the length of index.SAS. However, the command

    Missing.data <- subset(data.SAS, !index.SAS)

gives me the following error:

    Error in ffdf(x = x) : ffdf components must be atomic ff objects

A similar command also results in an error:

    Missing.data <- data.SAS[!index.SAS,]
    Error: vmode(index) == integer is not TRUE

I do not want to use index.SAS[] (which works in many cases, but sometimes crashes), because, as far as I understand, it pulls the whole index into RAM and will cause trouble with really large index vectors (I would prefer using ff objects). So I came up with the following syntax, which seems to work:

    Missing.data <- data.SAS[ffwhich(index.SAS, index.SAS==FALSE),]

I am just not sure if this is the right approach.

I am running

    platform        i386-w64-mingw32
    arch            i386
    os              mingw32
    system          i386, mingw32
    status
    major           3
    minor           0.3
    year            2014
    month           03
    day             06
    svn rev         65126
    language        R
    version.string  R version 3.0.3 (2014-03-06)
    nickname        Warm Puppy

with ffbase_0.11.3 and ff_2.2-12.

Many thanks in advance
Christian
Re: [R] read.table.ffdf and fixed width files
Dear R Users,

This is a summary of the things I tried with read.table.ffdf and fixed-width files. I would like to thank Jan Wijffels and Jan van der Laan for their suggestions and the time they spent on my problem!

My objective was to import a file with 6'079'455 lines and 32 variables using the tools provided by the ff package. The fixed-width file I got was supposed to have a total width of 238. But it turned out that the last column, which should have had a width of four, contained either no entry, or entries with one or two characters followed by \n\r. The corresponding spaces were dropped when the file was created. This could be shown by

    lines <- readLines("my_file.txt")
    range(nchar(lines))

which resulted in 235 237 instead of 238. So the file was not really fixed width.

First approach: I tried importing the file with

    library(ff)
    library(stringr)
    my.data <- read.table.ffdf(file="my_file.txt", FUN="read.fwf", widths = my.widths,
        header=F, VERBOSE=TRUE, first.rows=10, col.names = my.names,
        fileEncoding = "LATIN1",
        transFUN=function(x){
            z <- sapply(x, function(y) {
                y <- str_trim(y)
                y[y==""] <- NA
                factor(y)})
            as.data.frame(z)
        }
    )

This took 4168 seconds and resulted in an object that included only 100'000 lines instead of 6'079'455 lines (I still don't know why).

Second approach: use laf_open_fwf from package LaF and then laf_to_ffdf from package ffbase. This is really simple as long as no line is shorter than the specified total width (i.e. 238).
So the idea was to add the missing spaces by running

    con <- file("my_file.txt", "rt")
    out <- file("my_file_converted.txt", "wt")
    system.time(
        while (TRUE) {
            lines <- readLines(con, encoding='LATIN1', n=1E5)
            if (length(lines) == 0) break
            lines <- sprintf("%-238s", lines)
            writeLines(lines, out, useBytes=TRUE)
        }
    )
    close(con)
    close(out)

and then

    library(LaF)
    library(ffbase)
    my.data.laf <- laf_open_fwf("my_file_converted.txt", column_types=my.types,
        column_widths = my.widths, column_names = my.names)
    my.data <- laf_to_ffdf(my.data.laf)

This worked really well, except that the whole process took quite some time. Appending the spaces took 2436 seconds, and converting the file from laf to ffdf took another 2628 seconds.

Third approach: this was the fastest one I tested, but it uses the Unix/Linux program awk outside R (run on Cygwin installed on Windows 7 32-bit). First, I converted my original file into a tab-delimited text file using awk:

    awk -v FIELDWIDTHS='3 28 4 30 28 6 3 30 10 3 3 6 6 5 1 2 1 1 2 2 2 4 2 4 7 30 1 1 3 2 4 4' -v OFS='\t' '{ $1=$1 ; print }' my_file.txt > my_file_delimited.txt

Then I used read.delim.ffdf provided by the ff package:

    library(ff)
    library(stringr)
    my.data <- read.delim.ffdf(file="my_file_delimited.txt", header=F, VERBOSE=TRUE,
        first.rows=10, col.names = my.names, colClasses=my.classes,
        fileEncoding = "LATIN1",
        transFUN=function(x) {
            z <- sapply(x, function(y) {
                y <- str_trim(y)
                y[y==""] <- NA
                factor(y)})
            as.data.frame(z)
        }
    )

Running awk took only 203 seconds! And the import of the delimited file was finished after 1141 seconds.

What I like most about the variants of read.table.ffdf and also about laf_to_ffdf is the transFUN argument! Have a look at it; it allows a lot of fine tuning.
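To make the transFUN idea concrete, here is a minimal base-R sketch of the kind of cleaning function used above, applied to an in-memory data.frame rather than to an ffdf chunk. The name clean_chunk and the demo data are made up for illustration, and trimws() is used as a base-R stand-in for stringr::str_trim(); this is not the code from the post, only an approximation of what it does.

```r
# Hypothetical helper (not from the original post): clean one chunk of data.
# It trims whitespace, turns empty strings into NA, converts every column to
# a factor, and returns a data.frame.
clean_chunk <- function(x) {
  cleaned <- lapply(x, function(y) {
    y <- trimws(as.character(y))   # base-R stand-in for stringr::str_trim
    y[!is.na(y) & y == ""] <- NA   # empty fields become missing values
    factor(y)
  })
  as.data.frame(cleaned, stringsAsFactors = TRUE)
}

# Made-up chunk for illustration only
chunk <- data.frame(
  code  = c(" 043", "001 ", "   "),
  label = c("Landwirt. Traktor ", "", "Personenwagen"),
  stringsAsFactors = FALSE
)
cleaned <- clean_chunk(chunk)
str(cleaned)
```

Using lapply() instead of sapply() keeps each column as a factor rather than passing through a character matrix; whether that matters for a given import is a judgment call.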
Best regards
Christian Kamenik
Project Manager

Federal Department of the Environment, Transport, Energy and Communications DETEC
Federal Roads Office FEDRO
Division Road Traffic
Road Accident Statistics
Mailing Address: 3003 Bern
Location: Weltpoststrasse 5, 3015 Bern
Tel +41 31 323 14 89
Fax +41 31 323 43 21
christian.kame...@astra.admin.ch
www.astra.admin.ch

From: Jan Wijffels [mailto:jwijff...@bnosac.be]
Sent: Thursday, 8 August 2013 11:46
To: Kamenik Christian ASTRA
Subject: Re: read.table.ffdf and fixed width files

Christian,

You probably misspecified column names in the transFUN. Note that read.table.ffdf reads in your data in chunks and puts each chunk into an ffdf. In transFUN you get one chunk in RAM, on which you can do data manipulations. It should return a data.frame, which will be appended to your ffdf. So. This worked out fine for me.

Jan

    require(ff)
Re: [R] laf_open_fwf
Jan,

Many thanks for your suggestion! The code runs perfectly fine on the test set. Applying it to the complete data set, however, results in the following error:

    while (TRUE) {
    +   lines <- readLines(con, encoding='LATIN1')
    +   if (length(lines) == 0) break
    +   lines <- sprintf("%-238s", lines)
    +   writeLines(lines, out, useBytes=TRUE)
    }
    Error: cannot allocate vector of size 23.2 Mb

Best regards
Christian Kamenik

-----Original Message-----
From: Jan van der Laan [mailto:rh...@eoos.dds.nl]
Sent: Friday, 9 August 2013 10:01
To: Kamenik Christian ASTRA
Subject: Re: AW: AW: [R] laf_open_fwf

Christian,

It seems some of the lines in your file have additional characters at the end, causing the line lengths to vary. The only way I could think of is to first add whitespace to the shorter lines to make all line lengths equal:

    # Add whitespace to the end of the lines to make all lines the same length
    con <- file("testdata.txt", "rt")
    out <- file("testdata_2.txt", "wt")
    while (TRUE) {
        lines <- readLines(con, n=1E5)
        if (length(lines) == 0) break
        lines <- sprintf("%-238s", lines)
        writeLines(lines, out, useBytes=TRUE)
    }
    close(con)
    close(out)

I am then able to read your test file using LaF:

    library(LaF)
    column_widths <- c(3, 28, 4, 30, 28, 6, 3, 30, 10, 26, 25, 30, 2, 5, 5)
    column_types <- rep("string", length(column_widths))
    column_types[c(1, 3, 7)] <- "integer"
    laf <- laf_open_fwf("testdata_2.txt", column_types = column_types,
        column_widths = column_widths)

HTH,
Jan

christian.kame...@astra.admin.ch wrote:

Hello Jan

I attached an example. Any help is highly appreciated!
Kind regards
Christian Kamenik

-----Original Message-----
From: Jan van der Laan [mailto:rh...@eoos.dds.nl]
Sent: Thursday, 8 August 2013 13:58
To: r-help@r-project.org
Cc: Kamenik Christian ASTRA
Subject: Re: AW: [R] laf_open_fwf

Without example data it is difficult to give suggestions on how you might read this file.

Are you sure your file is fixed width? Sometimes columns are neatly aligned using whitespace (tabs/spaces). In that case you could use read.table with the default settings.

Another possibility might be that the file is encoded in utf8. I expect that reading it in assuming another encoding (such as latin1) would lead to varying line sizes, although I would expect the lengths to be larger than the sum of your column widths (as one symbol can be larger than one byte).

Jan

christian.kame...@astra.admin.ch wrote:

Dear Jan

Many thanks for your help. In fact, all lines are shorter than my column width:

    my.column.widths:    238
    range(nchar(lines)): 235 237

So, it seems I have an inconsistent file structure. I guess there is no way to handle this in an automated way?

Best regards
Christian Kamenik

-----Original Message-----
From: Jan van der Laan [mailto:rh...@eoos.dds.nl]
Sent: Wednesday, 7 August 2013 20:57
To: r-help@r-project.org
Cc: Kamenik Christian ASTRA
Subject: Re: [R] laf_open_fwf

Dear Christian,

Well... it shouldn't normally do that. The only way I can currently think of that might cause this problem is that the file has \r\n\r\n, which would mean that every line is followed by an empty line. Another cause might be (although I would not really expect the results you see) that the sum of your column widths is larger than the actual width of the line. You can check your line lengths using:

    lines <- readLines(my.filename)
    nchar(lines)

Each line should have the same length and be equal to (or at least larger than) sum(my.column.widths).

If this is not the problem: would it be possible that you send me a small part of your file so that I could try to reproduce the problem? Or if you cannot share your data: replace the actual values with nonsense values.

Regards,
Jan

PS I read your mail by chance as I am not a regular r-help reader.
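The loop that failed on the complete data set calls readLines() without the n argument, whereas Jan's version passes n=1E5; without n, readLines() reads the whole file in one go, which may be what exhausted memory on the 32-bit system. Below is a minimal base-R sketch of the chunked padding, wrapped in a function. The name pad_lines, its arguments, and the demo files are hypothetical and only serve the illustration.

```r
# Hypothetical helper: pad every line of `infile` with trailing spaces to
# `width` characters, reading at most `chunk_size` lines at a time so that
# memory use stays bounded.
pad_lines <- function(infile, outfile, width, chunk_size = 1e5) {
  con <- file(infile, "rt")
  out <- file(outfile, "wt")
  on.exit({ close(con); close(out) })
  fmt <- paste0("%-", width, "s")   # e.g. "%-238s": left-justify, pad with spaces
  repeat {
    lines <- readLines(con, n = chunk_size)   # n bounds the size of each chunk
    if (length(lines) == 0) break
    writeLines(sprintf(fmt, lines), out, useBytes = TRUE)
  }
  invisible(outfile)
}

# Small made-up demo: three short lines padded to width 5, two lines per chunk
src <- tempfile(fileext = ".txt")
dst <- tempfile(fileext = ".txt")
writeLines(c("abc", "abcde", "a"), src)
pad_lines(src, dst, width = 5, chunk_size = 2)
padded <- readLines(dst)
nchar(padded)
```

For the file discussed in this thread the call would use width = 238; the encoding argument of readLines() is left out here and would need to be added for non-ASCII input.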
Re: [R] laf_open_fwf
Dear Jan

Many thanks for your help. In fact, all lines are shorter than my column width:

    my.column.widths:    238
    range(nchar(lines)): 235 237

So, it seems I have an inconsistent file structure. I guess there is no way to handle this in an automated way?

Best regards
Christian Kamenik

-----Original Message-----
From: Jan van der Laan [mailto:rh...@eoos.dds.nl]
Sent: Wednesday, 7 August 2013 20:57
To: r-help@r-project.org
Cc: Kamenik Christian ASTRA
Subject: Re: [R] laf_open_fwf

Dear Christian,

Well... it shouldn't normally do that. The only way I can currently think of that might cause this problem is that the file has \r\n\r\n, which would mean that every line is followed by an empty line. Another cause might be (although I would not really expect the results you see) that the sum of your column widths is larger than the actual width of the line. You can check your line lengths using:

    lines <- readLines(my.filename)
    nchar(lines)

Each line should have the same length and be equal to (or at least larger than) sum(my.column.widths).

If this is not the problem: would it be possible that you send me a small part of your file so that I could try to reproduce the problem? Or if you cannot share your data: replace the actual values with nonsense values.

Regards,
Jan

PS I read your mail by chance as I am not a regular r-help reader. When you have specific LaF problems it is better to also cc me directly.
On 08/06/2013 12:35 PM, christian.kame...@astra.admin.ch wrote:

Dear all

I was trying the (fairly new) LaF package, and came across the following problem:

I opened a connection to a fixed width ASCII file using

    laf_open_fwf(my.filename, my.column_types, my.column_widths, my.column_names)

When looking at the data, it turned out that \n (newline) and \r (carriage return) were considered as characters, thus destroying the structure in my data (the second column does not include any numbers):

    my.data[1565:1575,1:3]
       MF_FARZ1  Fahrzeugarttext      MF_MARKE
    1  \n043     Landwirt. Traktor    2140
    2  \n043     Landwirt. Traktor    6206
    3  \n001     Personenwagen        2026
    4  \n001     Personenwagen        2026
    5  \r\n00    1Personenwagen       404
    6  \r\n02    0Gesellschaftswagen  710
    7  \r\n00    1Personenwagen       505
    8  \r\n00    1Personenwagen       505
    9  \r\n00    1Personenwagen       301
    10 \r\n00    1Personenwagen       553
    11 \r\n04    3Landwirt. Traktor   257

I am working on Windows 7 32-bit.

Any help would be highly appreciated.

Best regards
Christian Kamenik
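Checking line lengths with readLines(my.filename) loads the entire file, which can be a problem for large files on a 32-bit system. The following base-R sketch tabulates line lengths chunk by chunk instead. The function name line_length_table and the demo file are hypothetical, and counting bytes rather than characters is an assumption that suits single-byte encodings such as Latin-1.

```r
# Hypothetical helper: tabulate line lengths without loading the whole file.
# Returns a named integer vector: names are line lengths, values are counts.
line_length_table <- function(path, chunk_size = 1e5) {
  con <- file(path, "rt")
  on.exit(close(con))
  counts <- integer(0)
  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break
    # Assumption: bytes equal characters (single-byte encoding)
    tab <- table(nchar(lines, type = "bytes"))
    lens <- names(tab)
    counts[setdiff(lens, names(counts))] <- 0L   # register lengths not seen before
    counts[lens] <- counts[lens] + as.integer(tab)
  }
  counts[order(as.integer(names(counts)))]
}

# Made-up demo file with lines of length 3, 5, 3 and 1
demo_file <- tempfile(fileext = ".txt")
writeLines(c("abc", "abcde", "abc", "a"), demo_file)
res <- line_length_table(demo_file, chunk_size = 2)
print(res)
```

A file that is truly fixed width should yield a single entry whose name equals sum(my.column.widths).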
[R] read.table.ffdf and fixed width files
Dear all

I am working on Windows 7 32-bit, and the ff package is my daily life-saver to overcome the inherent memory limitations.

Recently, I tried using read.table.ffdf to import data from a fixed-width ASCII file (file size: 1'440'865'015 bytes) with 6'079'455 lines and 32 variables using the command

    read.table.ffdf(file=my.filename, FUN="read.fwf", width=my.format,
        asffdf_args=list(col_args=list(pattern = my.pattern)))

The command generates a temporary file, which has 1'629'328'120 bytes, plus 32 ff files following my.pattern. The latter 32 files, however, only take up 136'000 bytes, and the resulting R object has a dimension of 1000 x 32.

To me, it seems that read.table.ffdf aborts the data import after 1000 lines, instead of importing the entire file. I tried running read.table.ffdf with different parameter settings, and I browsed the help pages and the mailing lists, but I did not find any hint on why read.table.ffdf aborts the data import. (Does it really? The size of the temporary file suggests that all data were read.)

Any help would be highly appreciated.

Best regards
Christian Kamenik
[R] laf_open_fwf
Dear all

I was trying the (fairly new) LaF package, and came across the following problem:

I opened a connection to a fixed width ASCII file using

    laf_open_fwf(my.filename, my.column_types, my.column_widths, my.column_names)

When looking at the data, it turned out that \n (newline) and \r (carriage return) were considered as characters, thus destroying the structure in my data (the second column does not include any numbers):

    my.data[1565:1575,1:3]
       MF_FARZ1  Fahrzeugarttext      MF_MARKE
    1  \n043     Landwirt. Traktor    2140
    2  \n043     Landwirt. Traktor    6206
    3  \n001     Personenwagen        2026
    4  \n001     Personenwagen        2026
    5  \r\n00    1Personenwagen       404
    6  \r\n02    0Gesellschaftswagen  710
    7  \r\n00    1Personenwagen       505
    8  \r\n00    1Personenwagen       505
    9  \r\n00    1Personenwagen       301
    10 \r\n00    1Personenwagen       553
    11 \r\n04    3Landwirt. Traktor   257

I am working on Windows 7 32-bit.

Any help would be highly appreciated.

Best regards
Christian Kamenik
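When \r and \n show up inside the data, one thing that may help is to check which line terminators the file actually contains. The base-R sketch below reads the start of a file as raw bytes and counts CRLF, bare LF, and bare CR terminators. The function name count_line_endings, the n_bytes default, and the demo file are made up for illustration.

```r
# Hypothetical helper: count line-terminator patterns in the first `n_bytes`
# bytes of a file. The file is read as raw bytes so nothing gets translated.
count_line_endings <- function(path, n_bytes = 1e6) {
  bytes <- readBin(path, what = "raw", n = n_bytes)
  is_cr <- bytes == as.raw(0x0d)   # carriage return, \r
  is_lf <- bytes == as.raw(0x0a)   # line feed, \n
  n <- length(bytes)
  prev_cr <- c(FALSE, is_cr[-n])   # was the previous byte a CR?
  next_lf <- c(is_lf[-1], FALSE)   # is the next byte an LF?
  c(crlf    = sum(is_lf & prev_cr),
    lf_only = sum(is_lf & !prev_cr),
    cr_only = sum(is_cr & !next_lf))
}

# Made-up demo file with mixed terminators: CRLF, LF, CRLF
demo_path <- tempfile(fileext = ".txt")
writeBin(charToRaw("001\r\n043\n020\r\n"), demo_path)
endings <- count_line_endings(demo_path)
print(endings)
```

A file with more than one non-zero count has mixed line terminators, which would be consistent with the rows above starting partly with \n and partly with \r\n.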