Re: [R] Creating a custom connection to read from multiple files
Hello Andy,

thanks for your examples. I rewrote everything to use matrices, lapply/sapply, and rbind calls instead of appending in for loops, and it really helped. Reading the files one by one and combining them in R is now even faster than concatenating them on disk: that 8MB table is read in 3.5 seconds.

Tomas

rbind is vectorized, so you are using it (way) suboptimally. Here's an example:

> ## Create a 500 x 100 data matrix.
> x <- matrix(rnorm(5e4), 500, 100)
> ## Generate 50 filenames.
> fname <- paste("f", formatC(1:50, width=2, flag="0"), ".txt", sep="")
> ## Write the data to files 50 times.
> for (f in fname) write(t(x), file=f, ncol=ncol(x))
>
> ## Read the files into a list of data frames.
> system.time(datList <- lapply(fname, read.table, header=FALSE), gcFirst=TRUE)
[1] 11.91  0.05 12.33    NA    NA
>
> ## Specify colClasses to speed up.
> system.time(datList <- lapply(fname, read.table, colClasses=rep("numeric", 100)),
+             gcFirst=TRUE)
[1] 10.69  0.07 10.79    NA    NA
>
> ## Stack them together.
> system.time(dat <- do.call(rbind, datList), gcFirst=TRUE)
[1] 5.34 0.09 5.45   NA   NA
>
> ## Use matrices instead of data frames.
> system.time(datList <- lapply(fname,
+     function(f) matrix(scan(f), ncol=100, byrow=TRUE)), gcFirst=TRUE)
Read 50000 items
...
Read 50000 items
[1]  9.49  0.08 15.06    NA    NA
> system.time(dat <- do.call(rbind, datList), gcFirst=TRUE)
[1] 0.09 0.03 0.12   NA   NA
>
> ## Clean up the files.
> unlink(fname)

A couple of points:

- Usually specifying colClasses will make read.table() quite a bit faster, even though it's only marginally faster here (10.69 vs. 11.91 seconds). Look back in the list archive to see examples.

- If your data files are all numeric (as in this example), storing them in matrices will be much more efficient. Note the difference between rbind()ing the 50 data frames and the 50 matrices (5.34 seconds vs. 0.09!). rbind.data.frame() needs to ensure that the resulting data frame has unique rownames (a requirement for a legit data frame), and that's probably taking a big chunk of the time.
Andy

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
[R] Creating a custom connection to read from multiple files
Hello,

is it possible to create my own connection which I could use with read.table or scan? I would like to create a connection that reads from multiple files in sequence (as if they were concatenated), possibly with an option to skip the first n lines of each file. I would like to avoid using platform-specific scripts for that (currently I invoke /bin/cat from R to create a concatenation of all those files).

Thanks,
Tomas
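For illustration, a minimal platform-independent sketch of the behaviour asked for above. It is not a true custom connection: it assumes the files are small enough to hold in memory as text, collects their lines with readLines(), and hands them to read.table() through a textConnection(). The helper name read_concat and its skip argument are hypothetical, not part of R.

```r
## Hypothetical helper (not part of R): read several files as if they were
## concatenated, dropping the first `skip` lines of each file.
## Assumes at least one line remains after skipping.
read_concat <- function(files, skip = 0, ...) {
  lines <- unlist(lapply(files, function(f) {
    x <- readLines(f)
    ## Drop the leading lines of this file, if requested.
    if (skip > 0) x <- x[-seq_len(min(skip, length(x)))]
    x
  }))
  ## Present the collected lines to read.table() as a single connection.
  con <- textConnection(lines)
  on.exit(close(con))
  read.table(con, ...)
}

## Usage with two small temporary files, each starting with a header line.
f1 <- tempfile(); f2 <- tempfile()
writeLines(c("a b", "1 2", "3 4"), f1)
writeLines(c("a b", "5 6"), f2)
d <- read_concat(c(f1, f2), skip = 1)
unlink(c(f1, f2))
```

Extra arguments such as colClasses are passed through to read.table(), so the same helper could serve typed input as well.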
Re: [R] Creating a custom connection to read from multiple files
Dear Prof Ripley,

thanks for your suggestions. It's very nice that one can create custom connections directly in R, and I think that is what I need just now.

However, what is wrong with reading a file at a time and combining the results in R using rbind?

Well, the problem is performance. If I concatenate all those files, they total around 8MB, and may grow to tens of MBs in the near future. Concatenating the files and reading the single result with scan takes 5 seconds altogether (which is almost OK). However, reading the individual files with read.table and rbinding them one by one (samples = rbind(samples, newSamples)) takes minutes. The same happens when I concatenate lists manually. Using scan does not help significantly. I guess there is some overhead in detecting the dimensions of objects in rbind, or in re-allocating or copying the data?

Best regards,
Tomas Kalibera
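The two patterns contrasted above can be sketched as follows. This is an illustration only: the in-memory matrices in pieces stand in for the per-file samples, and the function names grow and bind_once are hypothetical. Growing the result inside the loop copies everything accumulated so far on every iteration, whereas a single rbind call over the whole list allocates the result once.

```r
## Illustrative data: 200 small numeric matrices standing in for per-file samples.
pieces <- lapply(1:200, function(i) matrix(rnorm(50 * 10), 50, 10))

## Pattern 1: grow the result one piece at a time.
## Each rbind() call copies all rows accumulated so far.
grow <- function(pieces) {
  samples <- NULL
  for (p in pieces) samples <- rbind(samples, p)
  samples
}

## Pattern 2: collect the pieces in a list and bind them in a single call.
bind_once <- function(pieces) do.call(rbind, pieces)

a <- grow(pieces)
b <- bind_once(pieces)
```

Both functions return the same 10000 x 10 matrix; wrapping each call in system.time() shows how the cost differs on a given machine.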
[R] Vertical labels on axes overlap
Hello,

when using horizontal labels (the default) on the x-axis of a plot, R selects a subset of the labels to draw so that they do not overlap. However, when using vertical labels, all labels are always drawn, even when they overlap. Is this a bug, or do I have to adjust some magic parameter?

The problem can be shown with these 2 tiny examples:

horizontal labels (default) [OK]:

    plot(1:100, axes=FALSE)
    axis(1, at=1:100, labels=rep("aaa", 100))

(only a subset of labels is drawn)

vertical labels [THE PROBLEM]:

    plot(1:100, axes=FALSE)
    axis(1, at=1:100, labels=rep("aaa", 100), las=2)

(all labels are drawn, and they do overlap)

Thanks,
Tomas
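One possible workaround, offered as a sketch and not as an answer from the thread, is to thin the labels by hand before calling axis(). With las=2 the labels are rotated, so the horizontal room each one needs is roughly its text height. The code below assumes equally spaced ticks; the 1.2 padding factor and the variable names are arbitrary choices for illustration.

```r
at  <- 1:100
lab <- rep("aaa", 100)

plot(1:100, axes = FALSE)

## Height of a label in inches; after rotation this is its horizontal extent.
h_in <- strheight("aaa", units = "inches")
## Number of x-axis user units per inch of plot width.
x_per_in <- diff(par("usr")[1:2]) / par("pin")[1]
## Minimum tick spacing (in user units) that avoids overlap, with 20% padding.
min_gap <- 1.2 * h_in * x_per_in
## Keep every k-th label; the ticks here are 1 user unit apart.
k   <- max(1, ceiling(min_gap / 1))
idx <- seq(1, length(at), by = k)

axis(1, at = at[idx], labels = lab[idx], las = 2)
```

The value of k depends on the device size and font, so the number of labels drawn will vary between devices.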