Re: [R] Incremental
Hello,

You have to convert y1 to class "Date" first, then do date arithmetic. The complete code would be

dat <- read.table(text="
y1, flag
24-01-2016,S
24-02-2016,R
24-03-2016,X
24-04-2016,H
24-01-2016,S
24-11-2016,R
24-10-2016,R
24-02-2016,X
24-01-2016,H
24-11-2016,S
24-02-2016,R
24-10-2016,X
24-03-2016,H
24-04-2016,S
", sep=",", header=TRUE)

str(dat)  # See what we have: y1 is a factor
dat$y1 <- as.Date(dat$y1, format = "%d-%m-%Y")
str(dat)  # now y1 is a Date

dat$x1 <- cumsum(dat$flag == "S")
dat$z2 <- unlist(tapply(dat$y1, dat$x1, function(y) y - y[1]))
dat

Instead of y - y[1] you can also use ?difftime.

Rui Barradas

On 14-10-2016 20:06, Val wrote:
> Thank you Rui, it worked! How about if the first variable is in date
> format? Like the following:
> [...]
> dat$x1 <- cumsum(dat$flag == "S")
> dat$z2 <- unlist(tapply(dat$y1, dat$x1, function(y) y - y[1]))
>
> Error message:
> In Ops.factor(y, y[1]) : ‘-’ not meaningful for factors
>
> [earlier messages quoted]

_______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
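[Editorial sketch of the ?difftime alternative Rui mentions above. The two dates below are made up for illustration, not taken from the thread's data; difftime() gives the same day count as the plain subtraction y - y[1].]

```r
# difftime() as an alternative to plain date subtraction (y - y[1]).
# The two dates are hypothetical examples, not from the thread data.
d <- as.Date(c("2016-01-24", "2016-02-24"))
gap <- as.numeric(difftime(d[2], d[1], units = "days"))
gap  # 31, identical to as.numeric(d[2] - d[1])
```

difftime() lets you name the units explicitly ("days", "weeks", "hours"), which makes the intent clearer than relying on the default unit of a "Date" subtraction.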
Re: [R] Incremental
Thank you Rui, it worked!

How about if the first variable is in date format? Like the following:

dat <- read.table(text="
y1, flag
24-01-2016,S
24-02-2016,R
24-03-2016,X
24-04-2016,H
24-01-2016,S
24-11-2016,R
24-10-2016,R
24-02-2016,X
24-01-2016,H
24-11-2016,S
24-02-2016,R
24-10-2016,X
24-03-2016,H
24-04-2016,S
", sep=",", header=TRUE)
dat

dat$x1 <- cumsum(dat$flag == "S")
dat$z2 <- unlist(tapply(dat$y1, dat$x1, function(y) y - y[1]))

Error message:
In Ops.factor(y, y[1]) : ‘-’ not meaningful for factors

On Thu, Oct 13, 2016 at 5:30 AM, Rui Barradas wrote:
> Hello,
>
> You must run the code to create x1 first, part 1), then part 2).
> I've tested with your data and all went well, the result is the following.
>
>> dput(dat)
> [...]
>
> Rui Barradas
>
> [earlier messages quoted]
Re: [R] Incremental
Hello,

You must run the code to create x1 first, part 1), then part 2).
I've tested with your data and all went well, the result is the following.

> dput(dat)
structure(list(y1 = c(39958L, 40058L, 40105L, 40294L, 40332L,
40471L, 40493L, 40533L, 40718L, 40771L, 40829L, 40892L, 41056L,
41110L, 41160L, 41222L, 41250L, 41289L, 41324L, 41355L, 41415L,
41562L, 41562L, 41586L), flag = structure(c(3L, 2L, 4L, 1L, 3L,
2L, 2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L, 2L, 2L, 2L, 2L, 4L, 2L, 4L,
4L, 1L, 3L), .Label = c("H", "R", "S", "X"), class = "factor"),
    x1 = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L,
    4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 5L), z2 = c(0L, 100L,
    147L, 336L, 0L, 139L, 161L, 201L, 386L, 0L, 58L, 121L, 285L,
    0L, 50L, 112L, 140L, 179L, 214L, 245L, 305L, 452L, 452L,
    0L)), .Names = c("y1", "flag", "x1", "z2"), row.names = c(NA,
-24L), class = "data.frame")

Rui Barradas

On 12-10-2016 21:53, Val wrote:
> Rui,
> Thank you!
>
> The second one gave me NULL:
>
> dat$z2 <- unlist(tapply(dat$y1, dat$x1, function(y) y - y[1]))
> dat$z2
> NULL
>
> [earlier messages quoted]
Re: [R] Incremental
Rui,
Thank you!

The second one gave me NULL:

dat$z2 <- unlist(tapply(dat$y1, dat$x1, function(y) y - y[1]))
dat$z2
NULL

On Wed, Oct 12, 2016 at 3:34 PM, Rui Barradas wrote:
> Hello,
>
> Seems simple:
>
> # 1)
> dat$x1 <- cumsum(dat$flag == "S")
>
> # 2)
> dat$z2 <- unlist(tapply(dat$y1, dat$x1, function(y) y - y[1]))
>
> Hope this helps,
>
> Rui Barradas
>
> On 12-10-2016 21:15, Val wrote:
>> Hi all,
>> I have a data set like [...]
>> [original question quoted]
Re: [R] Incremental
Hello,

Seems simple:

# 1)
dat$x1 <- cumsum(dat$flag == "S")

# 2)
dat$z2 <- unlist(tapply(dat$y1, dat$x1, function(y) y - y[1]))

Hope this helps,

Rui Barradas

On 12-10-2016 21:15, Val wrote:
> Hi all,
> I have a data set like [...]
> [original question quoted]
[R] Incremental
Hi all,

I have a data set like

dat <- read.table(text="
y1, flag
39958,S
40058,R
40105,X
40294,H
40332,S
40471,R
40493,R
40533,X
40718,H
40771,S
40829,R
40892,X
41056,H
41110,S
41160,R
41222,R
41250,R
41289,R
41324,X
41355,R
41415,X
41562,X
41562,H
41586,S
", sep=",", header=TRUE)

First sort the data by y1. Then I want to create two columns:

1. The first new column (x1): if flag is "S" then x1=1, and the
   following/subsequent rows are assigned 1 as well. When we reach the
   next "S" then x1=2, and the subsequent rows are assigned 2.

2. The second variable (z2): within each x1, find the difference
   between the first y1 and the subsequent y1 values.

Example for the first few rows:

y1, flag, x1, z2
39958, S, 1, 0      z2 is calculated as z2 = (39958 - 39958)
40058, R, 1, 100    z2 is calculated as z2 = (40058 - 39958)
40105, X, 1, 147    z2 is calculated as z2 = (40105 - 39958)
40294, H, 1, 336    z2 is calculated as z2 = (40294 - 39958)
40332, S, 2, 0      z2 is calculated as z2 = (40332 - 40332)
etc.

Here is the complete output for the sample data:

39958,S,1,0
40058,R,1,100
40105,X,1,147
40294,H,1,336
40332,S,2,0
40471,R,2,139
40493,R,2,161
40533,X,2,201
40718,H,2,386
40771,S,3,0
40829,R,3,58
40892,X,3,121
41056,H,3,285
41110,S,4,0
41160,R,4,50
41222,R,4,112
41250,R,4,140
41289,R,4,179
41324,X,4,214
41355,R,4,245
41415,X,4,305
41562,X,4,452
41562,H,4,452
41586,S,5,0

Val
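[Editorial note on the cumsum()/tapply() solution given in the replies above: ave() returns a vector already aligned with the data frame's rows, so the unlist() step can be dropped. A sketch on a shortened version of the sample data; the ave() substitution is a suggestion, not from the thread.]

```r
# Shortened version of the thread's sample data (first five rows).
dat <- read.table(text="
y1,flag
39958,S
40058,R
40105,X
40294,H
40332,S
", sep=",", header=TRUE)

# x1: group counter that increments at every "S" row.
dat$x1 <- cumsum(dat$flag == "S")

# z2: offset of each y1 from the first y1 of its group.
# ave() applies the function per group and keeps the original row order.
dat$z2 <- ave(dat$y1, dat$x1, FUN = function(y) y - y[1])
dat$z2  # 0 100 147 336 0
```

Because ave() writes the per-group results back into a vector of the original shape, it also sidesteps the ordering assumption that unlist(tapply(...)) relies on.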
[R] Incremental Sparse Bridge PLS algorithm (iSB-PLS)
I'm reading the paper "Predictive modeling with high-dimensional data streams: an on-line variable selection approach" by McWilliams and Montana, which introduces an interesting algorithm for dynamically selecting variables in high-dimensional data streams. Does anyone know if this (or a similar approach) has been implemented in R?
Re: [R] Incremental ReadLines
Hi again,

Changing my code by defining vectors outside the loop and combining them afterwards helped a lot, so the code no longer slows down and I was able to parse the file in less than 2 hours. Not fantastic, but it works. I will try William's last suggestion, on how to parse the file without looping, next time I have to parse a large file.

Many thanks for your help!

Frederik

On Thu, Apr 14, 2011 at 4:58 PM, William Dunlap wdun...@tibco.com wrote: [see below]

From: Frederik Lang [mailto:frederikl...@gmail.com]
Sent: Thursday, April 14, 2011 12:56 PM
To: William Dunlap
Cc: r-help@r-project.org
Subject: Re: [R] Incremental ReadLines

Hi Bill,

Thank you so much for your suggestions. I will try and alter my code.

Regarding the even shorter solution outside the loop: it looks good, but my problem is that not all observations have the same variables, so that three different observations might look like this:

Id: 1
Var1: false
Var2: 6
Var3: 8

Id: 2
missing

Id: 3
Var1: true 3 4 5
Var2: 7
Var3: 3

Doing it without looping through, I thought my data had to be quite systematic, which it is not. I might be wrong though.

Doing the simple preallocation that I describe should speed it up a lot with very little effort. It is more work to manipulate the columns one at a time instead of using data.frame subscripting, and it may not be worth it if you have lots of columns.
If you have a lot of this sort of file and feel that it will be worth the programming time to do something fancier, here is some code that reads lines of the form

> cat(lines, sep="\n")
Id: First
Var1: false
Var2: 6
Var3: 8
Id: Second
Id: Last
Var1: true
Var3: 8

and produces a matrix with the Id's along the rows and the Var's along the columns:

> f(lines)
       Var1    Var2 Var3
First  "false" "6"  "8"
Second NA      NA   NA
Last   "true"  NA   "8"

The function f is:

f <- function(lines) {
    # keep only lines with colons
    lines <- grep("^.+:", lines, value = TRUE)
    lines <- gsub("^[[:space:]]+|[[:space:]]+$", "", lines)
    isIdLine <- grepl("^Id:", lines)
    group <- cumsum(isIdLine)
    rownames <- sub("^Id:[[:space:]]*", "", lines[isIdLine])
    lines <- lines[!isIdLine]
    group <- group[!isIdLine]
    varname <- sub("[[:space:]]*:.*$", "", lines)
    value <- sub(".*:[[:space:]]*", "", lines)
    colnames <- unique(varname)
    col <- match(varname, colnames)
    retval <- array(NA_character_, c(length(rownames), length(colnames)),
                    dimnames = list(rownames, colnames))
    retval[cbind(group, col)] <- value
    retval
}

The main trick is the matrix subscript given to retval on the penultimate line.

Thanks again,

Frederik

On Thu, Apr 14, 2011 at 12:56 PM, William Dunlap wdun...@tibco.com wrote:

I have two suggestions to speed up your code, if you must use a loop.

First, don't grow your output dataset at each iteration. Instead of

cases <- 0
output <- numeric(cases)
while(length(line <- readLines(input, n=1))==1) {
  cases <- cases + 1
  output[cases] <- as.numeric(line)
}

preallocate the output vector to be about the size of its eventual length (slightly bigger is better), replacing output <- numeric(0) with the likes of

output <- numeric(50)

and when you are done with the loop, trim down the length if it is too big:

if (cases < length(output)) length(output) <- cases

Growing your dataset in a loop can cause quadratic or worse growth in time with problem size, and the above sort of code should make the time grow linearly with problem size.

Second, don't do data.frame subscripting inside your loop.
Instead of

data <- data.frame(Id=numeric(cases))
while(...) {
  data[cases, 1] <- newValue
}

do

Id <- numeric(cases)
while(...) {
  Id[cases] <- newValue
}
data <- data.frame(Id = Id)

This is just the general principle that you don't want to repeat the same operation over and over in a loop. dataFrame[i,j] first extracts column j, then extracts element i from that column. Since the column is the same every iteration, you may as well extract the column outside of the loop.

Avoiding the loop altogether is the fastest. E.g., the code you showed does the same
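[Editorial sketch of William's preallocation point. The function names are made up for illustration; both loops produce the same vector, but the preallocated one writes in place instead of repeatedly extending (and potentially copying) the vector.]

```r
# Growing a vector one element at a time: each extension may copy it.
grow <- function(n) {
  out <- numeric(0)
  for (i in seq_len(n)) out[i] <- i
  out
}

# Preallocating to the final size: each iteration writes in place.
prealloc <- function(n) {
  out <- numeric(n)
  for (i in seq_len(n)) out[i] <- i
  out
}

identical(grow(1000), prealloc(1000))  # TRUE; only the timing differs
```

Comparing system.time(grow(1e5)) with system.time(prealloc(1e5)) shows the gap, and it widens as n grows, which matches the quadratic-versus-linear behaviour described above.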
Re: [R] Incremental ReadLines
Hi there,

I am having a similar problem with reading in a large text file with around 550,000 observations, each with 10 to 100 lines of description. I am trying to parse it in R but I have trouble with the size of the file. It seems to slow down dramatically at some point. I would be happy for any suggestions. Here is my code, which works fine when I run it on a subsample of my dataset.

# Defining datasource
file <- "filename.txt"

# Creating placeholder for data and assigning column names
data <- data.frame(Id=NA)

# Starting by case = 0
case <- 0

# Opening a connection to data
input <- file(file, "rt")

# Going through cases
repeat {
  line <- readLines(input, n=1)
  if (length(line)==0) break
  if (length(grep("Id:", line)) != 0) {
    case <- case + 1
    data[case,] <- NA
    split_line <- strsplit(line, "Id:")
    data[case,1] <- as.numeric(split_line[[1]][2])
  }
}

# Closing connection
close(input)

# Saving dataframe
write.csv(data, 'data.csv')

Kind regards,
Frederik
Re: [R] Incremental ReadLines
Date: Wed, 13 Apr 2011 10:57:58 -0700
From: frederikl...@gmail.com
To: r-help@r-project.org
Subject: Re: [R] Incremental ReadLines

> Hi there,
> I am having a similar problem with reading in a large text file with
> around 550.000 observations with each 10 to 100 lines of description.
> I am trying to parse it in R but I have troubles with the size of the
> file. It seems like it is slowing down dramatically at some point. I
> would be happy for any suggestions.

This probably occurs when you run out of physical memory, but you can probably verify that by looking at the task manager. A readLines() approach wouldn't fit real well with R, as you want to hand over blocks of data so that inner loops, implemented largely in native code, can operate efficiently. The thing you want is a data structure that can use the disk more effectively and hide those details from you and the algorithm. This works best if the algorithm works with the data structure to avoid lots of disk thrashing. You could imagine a read that does nothing until each item is needed, but often people want the whole file validated before processing, and lots of details come up with exception handling as you get fancy here. Note of course that your parse output could be stored in a hash or something representing a DOM, and this could get arbitrarily large. Since that is designed for random access, it may cause lots of thrashing if it is partially on disk. Anything you can do to make access patterns more regular, for example sorting your data, would help.

> Here is my code, which works fine when I am doing a subsample of my
> dataset.
> [...]
Re: [R] Incremental ReadLines
I have two suggestions to speed up your code, if you must use a loop.

First, don't grow your output dataset at each iteration. Instead of

cases <- 0
output <- numeric(cases)
while(length(line <- readLines(input, n=1))==1) {
  cases <- cases + 1
  output[cases] <- as.numeric(line)
}

preallocate the output vector to be about the size of its eventual length (slightly bigger is better), replacing output <- numeric(0) with the likes of

output <- numeric(50)

and when you are done with the loop, trim down the length if it is too big:

if (cases < length(output)) length(output) <- cases

Growing your dataset in a loop can cause quadratic or worse growth in time with problem size, and the above sort of code should make the time grow linearly with problem size.

Second, don't do data.frame subscripting inside your loop. Instead of

data <- data.frame(Id=numeric(cases))
while(...) {
  data[cases, 1] <- newValue
}

do

Id <- numeric(cases)
while(...) {
  Id[cases] <- newValue
}
data <- data.frame(Id = Id)

This is just the general principle that you don't want to repeat the same operation over and over in a loop. dataFrame[i,j] first extracts column j, then extracts element i from that column. Since the column is the same every iteration, you may as well extract the column outside of the loop.

Avoiding the loop altogether is the fastest. E.g., the code you showed does the same thing as

idLines <- grep("Id:", readLines(file), value=TRUE)
data.frame(Id = as.numeric(sub("^.*Id:[[:space:]]*", "", idLines)))

You can also use an external process (perl or grep) to filter out the lines that are not of interest.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Freds
Sent: Wednesday, April 13, 2011 10:58 AM
To: r-help@r-project.org
Subject: Re: [R] Incremental ReadLines

> Hi there,
> I am having a similar problem with reading in a large text file with
> around 550.000 observations with each 10 to 100 lines of description.
> [...]
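[Editorial sketch of Bill's vectorized one-liner, checked on an in-memory vector of lines instead of a file; the sample lines and id values are hypothetical.]

```r
# Sample input lines standing in for readLines(file); made up for illustration.
lines <- c("Id: 101", "Var1: false", "Var2: 6",
           "Id: 102",
           "Id: 103", "Var1: true")

# Keep only the "Id:" lines, then strip everything up to the id value.
idLines <- grep("Id:", lines, value = TRUE)
ids <- as.numeric(sub("^.*Id:[[:space:]]*", "", idLines))
ids  # 101 102 103
```

Reading the whole file once with readLines() and filtering with vectorized grep()/sub() replaces the per-line repeat loop entirely, which is why it avoids the slowdown discussed in this thread.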
Re: [R] Incremental ReadLines
Hi Mike, Thanks for your comment. I must admit that I am very new to R and although it sounds interesting what you write I have no idea of where to start. Can you give some functions or examples where I can see how it can be done. I was under the impression that I had to do a loop since my blocks of observations are of varying length. Thanks again, Frederik On Thu, Apr 14, 2011 at 6:19 AM, Mike Marchywka marchy...@hotmail.comwrote: Date: Wed, 13 Apr 2011 10:57:58 -0700 From: frederikl...@gmail.com To: r-help@r-project.org Subject: Re: [R] Incremental ReadLines Hi there, I am having a similar problem with reading in a large text file with around 550.000 observations with each 10 to 100 lines of description. I am trying to parse it in R but I have troubles with the size of the file. It seems like it is slowing down dramatically at some point. I would be happy for any This probably occurs when you run out of physical memory but you can probably verify by looking at task manager. A readline() method wouldn't fit real well with R as you try to had blocks of data so that inner loops, implemented largely in native code, can operate efficiently. The thing you want is a data structure that can use disk more effectively and hide these details from you and algorightm. This works best if the algorithm works with data strcuture to avoid lots of disk thrashing. You coudl imagine that your read would do nothing until each item is needed but often people want the whole file validated before procesing, lots of details come up with exception handling as you get fancy here. Note of course that your parse output could be stored in a hash or something represnting a DOM and this could get arbitrarily large. Since it is designed for random access, this may cause lots of thrashing if partially on disk. Anything you can do to make access patterns more regular, for example sort your data, would help. suggestions. Here is my code, which works fine when I am doing a subsample of my dataset. 
#Defining datasource
file <- "filename.txt"
#Creating placeholder for data and assigning column names
data <- data.frame(Id = NA)
#Starting by case = 0
case <- 0
#Opening a connection to data
input <- file(file, "rt")
#Going through cases
repeat {
  line <- readLines(input, n = 1)
  if (length(line) == 0) break
  if (length(grep("Id:", line)) != 0) {
    case <- case + 1
    data[case, ] <- NA
    split_line <- strsplit(line, "Id:")
    data[case, 1] <- as.numeric(split_line[[1]][2])
  }
}
#Closing connection
close(input)
#Saving dataframe
write.csv(data, 'data.csv')

Kind regards,
Frederik

--
View this message in context: http://r.789695.n4.nabble.com/Incremental-ReadLines-tp878581p3447859.html
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] Incremental ReadLines
Hi Bill,

Thank you so much for your suggestions. I will try and alter my code. Regarding the even shorter solution outside the loop, it looks good, but my problem is that not all observations have the same variables, so that three different observations might look like this:

Id: 1
Var1: false
Var2: 6
Var3: 8

Id: 2
missing

Id: 3
Var1: true 3 4 5
Var2: 7
Var3: 3

Doing it without looping through, I thought my data had to be quite systematic, which it is not. I might be wrong though.

Thanks again,
Frederik

On Thu, Apr 14, 2011 at 12:56 PM, William Dunlap wdun...@tibco.com wrote:

I have two suggestions to speed up your code, if you must use a loop.

First, don't grow your output dataset at each iteration. Instead of

cases <- 0
output <- numeric(0)
while (length(line <- readLines(input, n = 1)) == 1) {
  cases <- cases + 1
  output[cases] <- as.numeric(line)
}

preallocate the output vector to be about the size of its eventual length (slightly bigger is better), replacing output <- numeric(0) with the likes of output <- numeric(50), and when you are done with the loop trim down the length if it is too big:

if (cases < length(output)) length(output) <- cases

Growing your dataset in a loop can cause quadratic or worse growth in time with problem size, and the above sort of code should make the time grow linearly with problem size.

Second, don't do data.frame subscripting inside your loop. Instead of

data <- data.frame(Id = numeric(cases))
while (...) {
  data[cases, 1] <- newValue
}

do

Id <- numeric(cases)
while (...) {
  Id[cases] <- newValue
}
data <- data.frame(Id = Id)

This is just the general principle that you don't want to repeat the same operation over and over in a loop. dataFrame[i,j] first extracts column j, then extracts element i from that column. Since the column is the same every iteration, you may as well extract the column outside of the loop.

Avoiding the loop altogether is the fastest.
E.g., the code you showed does the same thing as

idLines <- grep(value = TRUE, "Id:", readLines(file))
data.frame(Id = as.numeric(sub("^.*Id:[[:space:]]*", "", idLines)))

You can also use an external process (perl or grep) to filter out the lines that are not of interest.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

-----Original Message-----
From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Freds
Sent: Wednesday, April 13, 2011 10:58 AM
To: r-help@r-project.org
Subject: Re: [R] Incremental ReadLines

> Hi there,
> I am having a similar problem with reading in a large text file with around 550.000 observations, each with 10 to 100 lines of description. I am trying to parse it in R but I have trouble with the size of the file.
> [...]
> Kind regards,
> Frederik
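Bill's vectorized one-liner above can be checked on a small in-memory example. The sample lines and Id values here are made up for illustration; a real run would use readLines(file) instead of the literal vector:

```r
# Hypothetical sample of the file's lines, held in memory instead of read from disk.
lines <- c("Id: 101", "Var1: false", "Id: 102", "Var2: 6")
idLines <- grep(value = TRUE, "Id:", lines)     # keep only the "Id:" lines
data.frame(Id = as.numeric(sub("^.*Id:[[:space:]]*", "", idLines)))
#   Id
# 1 101
# 2 102
```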
Re: [R] Incremental ReadLines
Date: Thu, 14 Apr 2011 11:57:40 -0400
Subject: Re: [R] Incremental ReadLines
From: frederikl...@gmail.com
To: marchy...@hotmail.com
CC: r-help@r-project.org

> Hi Mike,
> Thanks for your comment. I must admit that I am very new to R and although it sounds interesting what you write I have no idea where to start. Can you give some functions or examples where I can see how it can be done?

I'm not sure I have a good R answer; I'm simply pointing out the likely issue, and maybe the rest belongs on an r-developer list or something. If you can determine that you are running out of physical memory, then you either need to partition something or make accesses more regular. My favorite example from personal experience is sorting a data set prior to piping it into a c++ program, which changed the execution time substantially by avoiding VM thrashing. R either needs a swapping buffer or has an equivalent that someone else could mention.

> I was under the impression that I had to do a loop since my blocks of observations are of varying length.
> Thanks again,
> Frederik

On Thu, Apr 14, 2011 at 6:19 AM, Mike Marchywka wrote:
> [...]
Re: [R] Incremental ReadLines
[see below]

From: Frederik Lang [mailto:frederikl...@gmail.com]
Sent: Thursday, April 14, 2011 12:56 PM
To: William Dunlap
Cc: r-help@r-project.org
Subject: Re: [R] Incremental ReadLines

> Hi Bill,
> Thank you so much for your suggestions. I will try and alter my code. Regarding the even shorter solution outside the loop, it looks good, but my problem is that not all observations have the same variables, so that three different observations might look like this:
> [...]
> Doing it without looping through, I thought my data had to be quite systematic, which it is not. I might be wrong though.

Doing the simple preallocation that I describe should speed it up a lot with very little effort. It is more work to manipulate the columns one at a time instead of using data.frame subscripting, and it may not be worth it if you have lots of columns.

If you have a lot of this sort of file and feel that it will be worth the programming time to do something fancier, here is some code that reads lines of the form

> cat(lines, sep="\n")
Id: First
Var1: false
Var2: 6
Var3: 8
Id: Second
Id: Last
Var1: true
Var3: 8

and produces a matrix with the Id's along the rows and the Var's along the columns:

> f(lines)
       Var1  Var2 Var3
First  false 6    8
Second NA    NA   NA
Last   true  NA   8

The function f is:

f <- function (lines) {
    # keep only lines with colons
    lines <- grep(value = TRUE, "^.+:", lines)
    lines <- gsub("^[[:space:]]+|[[:space:]]+$", "", lines)
    isIdLine <- grepl("^Id:", lines)
    group <- cumsum(isIdLine)
    rownames <- sub("^Id:[[:space:]]*", "", lines[isIdLine])
    lines <- lines[!isIdLine]
    group <- group[!isIdLine]
    varname <- sub("[[:space:]]*:.*$", "", lines)
    value <- sub(".*:[[:space:]]*", "", lines)
    colnames <- unique(varname)
    col <- match(varname, colnames)
    retval <- array(NA_character_, c(length(rownames), length(colnames)),
                    dimnames = list(rownames, colnames))
    retval[cbind(group, col)] <- value
    retval
}

The main trick is the matrix subscript given to retval on the penultimate line.

> Thanks again,
> Frederik

On Thu, Apr 14, 2011 at 12:56 PM, William Dunlap wdun...@tibco.com wrote:
> [...]

Bill
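William Dunlap's preallocation advice above can be made concrete with a small self-contained sketch; the function names and sizes here are invented for illustration, not taken from the thread:

```r
# Growing the result each iteration forces repeated reallocation as it grows.
grow <- function(n) {
  out <- numeric(0)
  for (i in 1:n) out[i] <- i * 2
  out
}

# Preallocate slightly more than needed, then trim the unused tail.
prealloc <- function(n) {
  out <- numeric(n + 100)
  cases <- 0
  for (i in 1:n) {
    cases <- cases + 1
    out[cases] <- i * 2
  }
  if (cases < length(out)) length(out) <- cases   # trim, as suggested
  out
}

identical(grow(10000), prealloc(10000))   # TRUE; prealloc scales linearly with n
```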
[R] incremental learning for LOESS time series model
Hi All,

I am currently working on some time series data; I know I can use a LOESS/ARIMA model. The data is written to a vector of length 1000, which works as a queue, updating every 15 minutes: the oldest value pops out while the newest value is pushed into the vector. I can rerun the whole model on a scheduler, e.g. retrain the model every 15 minutes, that is, use all 1000 values to train the LOESS model. However, this is very inefficient, as each time only one value is inserted while the other 999 values are the same as last time. So how can I achieve better performance?

Many thanks
Ying
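For what it's worth, base R's loess() has no incremental-update interface; the queue itself is easy to maintain, but each refit is still a full fit. A minimal sketch of the setup described above (buf, new_value, and the time index t are hypothetical names, and the refit remains O(window size) each time):

```r
# Hypothetical rolling buffer of 1000 observations, refit after each update.
buf <- rnorm(1000)                        # stand-in for the real series
update_and_fit <- function(buf, new_value) {
  buf <- c(buf[-1], new_value)            # pop the oldest value, push the newest
  d <- data.frame(t = seq_along(buf), y = buf)
  list(buf = buf, fit = loess(y ~ t, data = d))
}
res <- update_and_fit(buf, 0.5)           # still a full refit every 15 minutes
```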
Re: [R] Incremental ReadLines
Gene,

You might want to look at the function read.csv.ffdf from package ff, which can read large csv files into an ffdf object. That's a kind of data.frame which is stored on disk resp. in the file-system cache. Once you subscript part of it, you get a regular data.frame.

Jens Oehlschlägel
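A minimal sketch of Jens's suggestion, assuming package ff is installed from CRAN; the chunk-size arguments (first.rows, next.rows) follow ff's read.table.ffdf interface, and the file name is hypothetical:

```r
library(ff)
# Read the csv in chunks; the result lives on disk as an ffdf, not in RAM.
big <- read.csv.ffdf(file = "thefile.csv", header = TRUE,
                     first.rows = 10000, next.rows = 50000)
dim(big)               # dimensions are known without loading the data
chunk <- big[1:121, ]  # subscripting a part returns an ordinary data.frame
```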
Re: [R] Incremental ReadLines
If the headers all start with the same letter, "A" say, and the data lines only contain numbers, then just use

read.table(..., comment = "A")

On Mon, Nov 2, 2009 at 2:03 PM, Gene Leynes gleyne...@gmail.com wrote:
> I've been trying to figure out how to read in a large file for a few days now, and after extensive research I'm still not sure what to do.
> I have a large comma delimited text file that contains 59 fields in each record. There is also a header every 121 records.
> [...]
> Thanks,
> Gene
[R] Incremental ReadLines
I've been trying to figure out how to read in a large file for a few days now, and after extensive research I'm still not sure what to do.

I have a large comma delimited text file that contains 59 fields in each record. There is also a header every 121 records.

This function works well for smallish records:

getcsv <- function(fname){
  ff <- file(description = fname)
  x <- readLines(ff)
  closeAllConnections()
  x <- x[x != ""]               # REMOVE BLANKS
  x <- x[grep("^[-0-9]", x)]    # REMOVE ALL TEXT
  spl <- strsplit(x, ',')       # THIS PART IS SLOW, BUT MANAGEABLE
  xx <- t(sapply(1:length(spl),
                 function(temp) as.vector(na.omit(as.numeric(spl[[temp]])))))
  return(xx)
}

It's not elegant, but it works.
For 121,000 records it completes in 2.3 seconds
For 121,000*5 records it completes in 63 seconds
For 121,000*10 records it doesn't complete

When I try other methods to read the file in chunks (using scan), the process breaks down because I have to start at the beginning of the file on every iteration. For example:

fnn <- function(n, col){
  a <- 122*(n-1) + 2
  xx <- scan(fname, skip = a-1, nlines = 121, sep = ',',
             quiet = TRUE, what = character(0))
  xx <- xx[xx != '']
  xx <- matrix(xx, ncol = 49, byrow = TRUE)
  xx[, col]
}
system.time(sapply(1:10, fnn, c = 26))     # 0.31 Seconds
system.time(sapply(91:90, fnn, c = 26))    # 1.09 Seconds
system.time(sapply(901:910, fnn, c = 26))  # 5.78 Seconds

Even though I'm only getting the 26th column for 10 sets of records, it takes a lot longer the further into the file I go. How can I tell scan to pick up where it left off, without it starting at the beginning?? There must be a good example somewhere.

I have done a lot of research (in fact, thank you to Michael J. Crawley and others for your help thus far).

Thanks,
Gene
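One direct answer to "picking up where scan left off": both scan() and readLines() continue from the current position when given an open connection instead of a file name, so the file is traversed only once. A sketch using a hypothetical throw-away demo file with the same 121-line block structure:

```r
# Demo file: 5 "blocks" of 121 one-field records each.
tf <- tempfile()
writeLines(as.character(1:605), tf)

con <- file(tf, open = "r")   # open once; the connection remembers its position
nblocks <- 0
repeat {
  block <- scan(con, what = character(0), nlines = 121, quiet = TRUE)
  if (length(block) == 0) break
  nblocks <- nblocks + 1      # process the block here instead of counting
}
close(con)
nblocks                       # 5 blocks read, no re-scanning from the start
```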
Re: [R] Incremental ReadLines
On 11/2/2009 2:03 PM, Gene Leynes wrote:
> I've been trying to figure out how to read in a large file for a few days now, and after extensive research I'm still not sure what to do.
> I have a large comma delimited text file that contains 59 fields in each record. There is also a header every 121 records.

You can open the connection before reading, then read in blocks of lines and process those. You don't need to reopen it every time. For example,

ff <- file(fname, open = "rt")  # "rt" is "read text"
for (block in 1:nblocks) {
  x <- readLines(ff, n = 121)
  # process this block
}
close(ff)

Duncan Murdoch

> [...]
Re: [R] Incremental ReadLines
Hi Gene,

Rather than using R to parse this file, have you considered using either grep or sed to pre-process the file and then read it in? It looks like you just want lines starting with numbers, so something like

grep '^[0-9]\+' thefile.csv > otherfile.csv

should be much faster, and then you can just read in otherfile.csv using read.csv().

Best,
Jim

Gene Leynes wrote:
> I've been trying to figure out how to read in a large file for a few days now, and after extensive research I'm still not sure what to do.
> I have a large comma delimited text file that contains 59 fields in each record. There is also a header every 121 records.
> [...]
> Thanks,
> Gene

--
James W. MacDonald, M.S.
Biostatistician
Douglas Lab
University of Michigan
Department of Human Genetics
5912 Buhl
1241 E. Catherine St.
Ann Arbor MI 48109-5618
734-615-7826
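When an external grep is not available (e.g. on Windows), the same pre-filtering can be sketched in base R; thefile.csv and otherfile.csv are the same hypothetical names as in Jim's shell command:

```r
# Keep only the lines that start with a digit, then read the cleaned file.
x <- readLines("thefile.csv")
writeLines(grep("^[0-9]", x, value = TRUE), "otherfile.csv")
dat <- read.csv("otherfile.csv", header = FALSE)
```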
Re: [R] Incremental ReadLines
James,

I think those are Unix commands? I'm on Windows, so that's not an option (for now).

Also, the suggestions posed by Duncan and Phil seem to be working. Thank you so much: such a simple thing, adding the "r" or "rt" to the file connection. I read about blocking, but I didn't imagine that it meant chunks. I was thinking something more like blocking out, or guarding (perhaps for security).

On Mon, Nov 2, 2009 at 1:47 PM, James W. MacDonald jmac...@med.umich.edu wrote:
> Hi Gene,
> Rather than using R to parse this file, have you considered using either grep or sed to pre-process the file and then read it in? It looks like you just want lines starting with numbers, so something like
>
> grep '^[0-9]\+' thefile.csv > otherfile.csv
>
> should be much faster, and then you can just read in otherfile.csv using read.csv().
> [...]
> Best,
> Jim