Re: [R] Efficiency challenge: MANY subsets
Many thanks for this example, which doesn't entirely cover my case, since I have as many indexes entries as sequences entries. It was very educational nonetheless, and I used it to come up with something a bit faster than what I had before. The main trick, though, was naming all entries in sequences and indexes like so:

    names(sequences) <- seq(length(sequences))
    names(indexes) <- seq(length(indexes))

and then doing an lapply over names(indexes), which lets me access both lists easily. What I ended up with is this:

    fragments <- lapply(
      names(indexes),
      function(x){
        lapply(
          indexes[[x]],
          function(.range){
            .range <- seq.int(.range[1], .range[2])
            unlist(lapply(sequences[x], '[', .range), use.names=FALSE)
          }
        )
      }
    )

Although this is still quite slow, it's much faster than what I had before. Any further comments are highly welcome. I can send the real sequences and indexes as exported R objects ...

Thanks, Joh

jim holtman wrote:
> Try this one; it is doing a list of 7000 in under 2 seconds:
> [...]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
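A minimal sketch of the same pairing without the naming step, assuming sequences and indexes are equal-length lists in which indexes[[i]] holds the ranges for sequences[[i]], as the message describes. Map() walks both lists in parallel. The toy data are the ones from the thread; the helper name subset_ranges is hypothetical.

```r
# Toy data from the thread: one 51-residue sequence and its six ranges.
sequences <- list(
  strsplit("MGLWISFGTPPSYTYLLIMNHKLLLINNNNLTEVHTYFNININIDKMYIH*", "")[[1]]
)
indexes <- list(
  list(c(1, 22), c(22, 46), c(46, 51), c(1, 46), c(22, 51), c(1, 51))
)

# Hypothetical helper: cut one sequence by each of its own ranges.
subset_ranges <- function(s, ranges) {
  lapply(ranges, function(r) s[seq.int(r[1], r[2])])
}

# Map() pairs sequences[[i]] with indexes[[i]], so no names are needed.
fragments <- Map(subset_ranges, sequences, indexes)
```

Whether this is faster than the lapply over names on the real data is not something the thread establishes; it would need to be timed with system.time() on the actual objects.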
[R] Efficiency challenge: MANY subsets
Hello,

I have a list of character vectors like this:

    sequences <- list(
      c("M","G","L","W","I","S","F","G","T","P","P","S","Y","T","Y","L","L","I","M",
        "N","H","K","L","L","L","I","N","N","N","N","L","T","E","V","H","T","Y","F",
        "N","I","N","I","N","I","D","K","M","Y","I","H","*")
    )

and another list of subset ranges like this:

    indexes <- list(
      list(
        c(1,22),c(22,46),c(46, 51),c(1,46),c(22,51),c(1,51)
      )
    )

What I now want to do is to subset each entry in sequences (sequences[[1]]) with all ranges in the corresponding low-level list in indexes (indexes[[1]]). Here is what I came up with:

    fragments <- list()
    for(iN in seq(length(sequences))){
      cat(paste(iN,"\n"))
      tmpFragments <- sapply(
        indexes[[iN]],
        function(x){
          sequences[[iN]][seq.int(x[1],x[2])]
        }
      )
      fragments[[iN]] <- tmpFragments
    }

This works fine, but sequences contains thousands of entries and the corresponding indexes are sometimes hundreds of ranges long, so this whole process is EXTREMELY inefficient. Will somebody out there take up the challenge and show me how to speed this up?

Thanks for any hints, Joh
Re: [R] Efficiency challenge: MANY subsets
Dear Johannes,

Try this:

    sequences <- c("M","G","L","W","I","S","F","G","T","P","P","S","Y","T",
                   "Y","L","L","I","M","N","H","K","L","L","L","I","N","N","N","N","L","T","E","V",
                   "H","T","Y","F","N","I","N","I","N","I","D","K","M","Y","I","H","*")
    indexes <- matrix(c(1,22,22,46,46,51,1,46,22,51,1,51),ncol=2,byrow=TRUE)
    apply(indexes,1,function(x){
      ind <- x[1]:x[2]
      sequences[ind]
    })

HTH,
Jorge
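If the fragments are needed as strings rather than as vectors of single letters (an assumption; the thread itself works with character vectors), one option is to collapse the sequence once and let substring(), which is vectorized over start and end positions, cut all ranges in a single call. This is a different technique from the apply() call above, sketched here with the same start/end matrix layout; the names residues, seq_string and fragment_vectors are illustrative only.

```r
# Toy sequence from the thread, as a vector of single characters.
residues <- strsplit("MGLWISFGTPPSYTYLLIMNHKLLLINNNNLTEVHTYFNININIDKMYIH*", "")[[1]]

# One row per range: column 1 = start, column 2 = end.
indexes <- matrix(c(1,22, 22,46, 46,51, 1,46, 22,51, 1,51), ncol = 2, byrow = TRUE)

# Collapse once, then cut every range in a single vectorized call.
seq_string <- paste(residues, collapse = "")
fragments <- substring(seq_string, indexes[, 1], indexes[, 2])

# Optional: split back into vectors of single characters.
fragment_vectors <- strsplit(fragments, "")
```

The result is a character vector with one string per range, so any downstream code expecting a list of letter vectors would need the final strsplit() step.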
Re: [R] Efficiency challenge: MANY subsets
Try this:

    lapply(indexes[[1]], function(g) sequences[[1]][do.call(seq, as.list(g))])

-- 
Henrique Dallazuanna
Curitiba-Paraná-Brasil
25° 25' 40" S 49° 16' 22" O
Re: [R] Efficiency challenge: MANY subsets
Thanks. Very elegant, but it doesn't solve the problem of the outer for loop, since I would now rewrite the code like so:

    fragments <- list()
    for(iN in seq(length(sequences))){
      cat(paste(iN,"\n"))
      fragments[[iN]] <- lapply(indexes[[iN]], function(g) sequences[[iN]][do.call(seq, as.list(g))])
    }

This is still very slow for length(sequences) ~ 7000.

Joh

On Friday 16 January 2009 14:23:47 Henrique Dallazuanna wrote:
> Try this:
> lapply(indexes[[1]], function(g) sequences[[1]][do.call(seq, as.list(g))])
Re: [R] Efficiency challenge: MANY subsets
Try this one; it is doing a list of 7000 in under 2 seconds:

    > sequences <- list(
    +   c("M","G","L","W","I","S","F","G","T","P","P","S","Y","T","Y","L","L","I","M",
    +     "N","H","K","L","L","L","I","N","N","N","N","L","T","E","V","H","T","Y","F",
    +     "N","I","N","I","N","I","D","K","M","Y","I","H","*")
    + )
    > indexes <- list(
    +   list(
    +     c(1,22),c(22,46),c(46, 51),c(1,46),c(22,51),c(1,51)
    +   )
    + )
    > indexes <- rep(indexes,10)
    > sequences <- rep(sequences,7000)
    > system.time({
    +   fragments <- lapply(indexes, function(.seq){
    +     lapply(.seq, function(.range){
    +       .range <- seq(.range[1], .range[2])  # save since we use several times
    +       lapply(sequences, '[', .range)
    +     })
    +   })
    + })
       user  system elapsed
       1.24    0.00    1.26

On Fri, Jan 16, 2009 at 3:16 PM, Johannes Graumann johannes_graum...@web.de wrote:
> Thanks. Very elegant, but doesn't solve the problem of the outer for loop
> [...]
> still very slow for length(sequences) ~ 7000.

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?
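Note that the nested lapply in this message applies every set of ranges to every sequence, whereas the original question pairs indexes[[i]] with sequences[[i]]. A small sketch of the difference in the amount of work, assuming three copies of the toy data in each list (the count 3 is arbitrary, and the names crossed and paired are illustrative only):

```r
one_seq <- strsplit("MGLWISFGTPPSYTYLLIMNHKLLLINNNNLTEVHTYFNININIDKMYIH*", "")[[1]]
ranges <- list(c(1, 22), c(22, 46), c(46, 51), c(1, 46), c(22, 51), c(1, 51))

n <- 3  # arbitrary size for illustration
sequences <- rep(list(one_seq), n)
indexes <- rep(list(ranges), n)

# All-against-all: every range set is applied to every sequence.
crossed <- lapply(indexes, function(.seq) {
  lapply(.seq, function(.range) {
    idx <- seq.int(.range[1], .range[2])
    lapply(sequences, `[`, idx)
  })
})

# Paired: sequence i is cut only by its own range set i.
paired <- lapply(seq_along(sequences), function(i) {
  lapply(indexes[[i]], function(r) sequences[[i]][seq.int(r[1], r[2])])
})

# Number of fragments produced by each variant.
n_crossed <- sum(sapply(crossed, function(a) sum(lengths(a))))
n_paired <- sum(lengths(paired))
```

With equal-length lists the crossed variant produces n times as many fragments as the paired one, so its cost grows with the square of the number of entries.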