Re: [R] Maximum number of patterns and speed in grep
Hi,

I have a minor follow-up question: In the example below, "ann" and "nn" in the third element of text are matched. I would like to ignore all matches in which the character following the match is one of [:alpha:]. How do I do this without removing the ignore.case = TRUE argument of the strapply function?

So the output should be:

    [[1]]
    [1] "Santa Fe Gold Corp"

    [[2]]
    [1] "Starpharma Holdings"

    [[3]]
    NULL

Rather than:

    [[1]]
    [1] "Santa Fe Gold Corp"

    [[2]]
    [1] "Starpharma Holdings"

    [[3]]
    [1] "ann" "nn"

Thanks!

    require(gsubfn)

    # read in data
    data <- read.csv("https://dl.dropbox.com/u/13631687/data.csv", header = T, sep = ",")

    # define the object to be searched
    text <- c("the first is Santa Fe Gold Corp",
              "the second is Starpharma Holdings",
              "the annual earnings exceed those of last year")

    k <- 3000 # chunk size

    f <- function(from, text) {
      to <- min(from + k - 1, nrow(data))
      r <- paste(data[seq(from, to), 1], collapse = "|")
      r <- gsub("[().*?+{}]", "", r)
      strapply(text, r, ignore.case = TRUE)
    }

    ix <- seq(1, nrow(data), k)
    out <- lapply(text, function(text) unlist(lapply(ix, f, text)))

--
View this message in context: http://r.789695.n4.nabble.com/Maximum-number-of-patterns-and-speed-in-grep-tp4635613p4637458.html
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
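One possible way to express "not followed by a letter" is a negative lookahead appended to the alternation. The sketch below uses base R's gregexpr()/regmatches() with perl = TRUE rather than strapply, and a short hypothetical patterns vector standing in for the first column of data.csv. Whether the same (?!...) syntax behaves identically in the Tcl engine behind strapply has not been checked here.

```r
# Hypothetical stand-in for the first column of data.csv
patterns <- c("Santa Fe Gold Corp", "Starpharma Holdings", "ann", "nn")

text <- c("the first is Santa Fe Gold Corp",
          "the second is Starpharma Holdings",
          "the annual earnings exceed those of last year")

# Group the alternatives, then require that no letter follows the match
re <- paste0("(?:", paste(patterns, collapse = "|"), ")(?![[:alpha:]])")

# Case-insensitive matching is kept; perl = TRUE enables the lookahead
hits <- regmatches(text, gregexpr(re, text, perl = TRUE, ignore.case = TRUE))
print(hits)
```

With this approach an element without matches comes back as character(0) rather than NULL.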
Re: [R] Maximum number of patterns and speed in grep
Thanks! That worked like a charm.

Math

--
View this message in context: http://r.789695.n4.nabble.com/Maximum-number-of-patterns-and-speed-in-grep-tp4635613p4636657.html
Re: [R] Maximum number of patterns and speed in grep
Here's some data (which should give you the error messages):

    # read in data
    data <- read.csv("https://dl.dropbox.com/u/13631687/data.csv", header = T, sep = ",")

    # first paste all data
    data1 <- paste(data[,1], collapse = "|")

    # second paste subsets of the data
    data2a <- paste(data[1:750,1], collapse = "|")
    data2b <- paste(data[751:1500,1], collapse = "|")

    # define the object to be searched
    text <- c("the first is Santa Fe Gold Corp", "the second is Starpharma Holdings")

    # match
    strapplyc(text, data1)
    strapplyc(text, data2a)
    strapplyc(text, data2b)

Thanks in advance!

Math

--
View this message in context: http://r.789695.n4.nabble.com/Maximum-number-of-patterns-and-speed-in-grep-tp4635613p4636472.html
Re: [R] Maximum number of patterns and speed in grep
On Fri, Jul 13, 2012 at 1:41 PM, mdvaan <mathijsdev...@gmail.com> wrote:
> Here's some data (which should give you the error messages):
>
> # read in data
> data <- read.csv("https://dl.dropbox.com/u/13631687/data.csv", header = T, sep = ",")
>
> # first paste all data
> data1 <- paste(data[,1], collapse = "|")
>
> # second paste subsets of the data
> data2a <- paste(data[1:750,1], collapse = "|")
> data2b <- paste(data[751:1500,1], collapse = "|")
>
> # define the object to be searched
> text <- c("the first is Santa Fe Gold Corp", "the second is Starpharma Holdings")
>
> # match
> strapplyc(text, data1)
> strapplyc(text, data2a)
> strapplyc(text, data2b)
>
> Thanks in advance!

Although it seems that strapplyc can handle larger regular expressions than grep in R it seems neither can handle as many as in your example so process it in chunks:

    k <- 3000 # chunk size

    f <- function(from, text) {
      to <- min(from + k - 1, nrow(data))
      r <- paste(data[seq(from, to), 1], collapse = "|")
      r <- gsub("[().*?+{}]", "", r)
      strapply(text, r)
    }

    ix <- seq(1, nrow(data), k)
    out <- lapply(text, function(text) unlist(lapply(ix, f, text)))

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
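The gsub() call above deletes regex metacharacters from the names, so a name containing parentheses or a period would no longer match its original spelling. A variation, sketched in base R under the assumption that every pattern is meant as a literal string, escapes the metacharacters instead and then matches chunk by chunk. escape_regex() and match_in_chunks() are hypothetical helper names and the sample names are made up. PCRE (perl = TRUE) may have pattern size limits of its own, so the chunk size remains something to tune.

```r
# Put a backslash in front of every regex metacharacter
escape_regex <- function(x) {
  gsub("([\\\\^$.|?*+()\\[\\]{}])", "\\\\\\1", x, perl = TRUE)
}

# Match `text` against `patterns` in chunks of `chunk_size` alternatives
match_in_chunks <- function(text, patterns, chunk_size = 3000) {
  starts <- seq(1, length(patterns), by = chunk_size)
  per_chunk <- lapply(starts, function(from) {
    to <- min(from + chunk_size - 1, length(patterns))
    re <- paste(escape_regex(patterns[from:to]), collapse = "|")
    regmatches(text, gregexpr(re, text, perl = TRUE))
  })
  # Collect the matches from all chunks for each element of text
  lapply(seq_along(text), function(i) unlist(lapply(per_chunk, `[[`, i)))
}

# Made-up example data
patterns <- c("Santa Fe Gold Corp", "Starpharma Holdings", "Acme (US) Inc.")
text <- c("the first is Santa Fe Gold Corp",
          "the second is Starpharma Holdings",
          "a filing by Acme (US) Inc. last year")

out <- match_in_chunks(text, patterns, chunk_size = 2)
print(out)
```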
Re: [R] Maximum number of patterns and speed in grep
Thanks, I see that it is working in the sample data. My data, however, gives me an error message:

    data <- strapplyc(text, batch[[l]])
    Error in structure(.External("dotTcl", ..., PACKAGE = "tcltk"), class = "tclObj") :
      [tcl] couldn't compile regular expression pattern: parentheses () not balanced.

batch[[l]] is similar to your re string except that there is a larger variety of characters. I haven't been able to figure out which characters are causing trouble here. Any thoughts?

Thank you very much.

Math

--
View this message in context: http://r.789695.n4.nabble.com/Maximum-number-of-patterns-and-speed-in-grep-tp4635613p4636437.html
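The "parentheses () not balanced" message suggests that some names contain characters the regex engine treats as syntax. One way to track them down, sketched with base R and made-up names, is to try compiling each pattern on its own and keep the ones that fail. This uses the PCRE engine (perl = TRUE), which may not reject exactly the same patterns as the Tcl engine used by strapplyc.

```r
# Made-up names; two of them contain an unbalanced parenthesis
patterns <- c("Santa Fe Gold Corp", "Acme (Holdings", "Foo+Bar", "Widgets)")

# TRUE if the pattern compiles as a regular expression, FALSE otherwise
compiles <- vapply(patterns, function(p) {
  tryCatch({
    grepl(p, "", perl = TRUE)
    TRUE
  },
  warning = function(w) FALSE,
  error = function(e) FALSE)
}, logical(1), USE.NAMES = FALSE)

bad <- patterns[!compiles]
print(bad)
```

Patterns that compile can still change meaning (the "+" in "Foo+Bar" acts as a quantifier), so this only finds the ones that stop the whole alternation from compiling.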
Re: [R] Maximum number of patterns and speed in grep
On Fri, Jul 13, 2012 at 9:40 AM, mdvaan <mathijsdev...@gmail.com> wrote:
> Thanks, I see that it is working in the sample data. My data, however, gives me an error message:
>
> data <- strapplyc(text, batch[[l]])
> Error in structure(.External("dotTcl", ..., PACKAGE = "tcltk"), class = "tclObj") :
>   [tcl] couldn't compile regular expression pattern: parentheses () not balanced.
>
> batch[[l]] is similar to your re string except that there is a larger variety of characters. I haven't been able to figure out which characters are causing trouble here. Any thoughts?
>
> Thank you very much.
>
> Math
> ...
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

Note part on last line about posting reproducible code.

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
[R] Maximum number of patterns and speed in grep
Hi,

I am using R's grep function to find patterns in vectors of strings. The number of patterns I would like to match is 7,700 (of different sizes). I noticed that I get an error message when I do the following:

    data <- array()
    for (j in 1:length(x)) {
      array[j] <- length(grep(paste(patterns[1:7700], collapse = "|"), x[j], value = T))
    }

When I break this up into 4 chunks of patterns it works:

    data <- array()
    for (j in 1:length(x)) {
      array$chunk1[j] <- length(grep(paste(patterns[1:2500], collapse = "|"), x[j], value = T))
      array$chunk1[j] <- length(grep(paste(patterns[2501:5000], collapse = "|"), x[j], value = T))
      array$chunk1[j] <- length(grep(paste(patterns[5001:7500], collapse = "|"), x[j], value = T))
      array$chunk1[j] <- length(grep(paste(patterns[7501:7700], collapse = "|"), x[j], value = T))
    }

My questions: what's the maximum size of the patterns argument in grep? Is there a way to do this faster? It is very slow.

Thanks.

Math

Sorry for not providing a reproducible example. It's a size issue which makes it difficult to provide an example.

--
View this message in context: http://r.789695.n4.nabble.com/Maximum-number-of-patterns-and-speed-in-grep-tp4635613.html
Re: [R] Maximum number of patterns and speed in grep
Hi,

Given that you can't provide a full example, please at least provide str() on your data, more complete information on the problem, and ideally a small toy example that demonstrates precisely what you are doing.

For instance, you tell us that you get an error message but you never tell us what it is. Don't you think we might need to know what the error is to be able to diagnose and fix it?

Also, note that your working example simply overwrites array$chunk1[j] four times.

Sarah

--
Sarah Goslee
http://www.functionaldiversity.org
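The overwriting can be avoided by combining the per-chunk results instead of assigning each one to the same slot. A minimal sketch, with small made-up patterns and x and a hypothetical chunk size of 2:

```r
# Made-up data
patterns <- c("abc", "dbe", "zzz", "qqq", "xyz")
x <- c("abcd", "z", "dbef")

# Split the patterns into chunks of at most chunk_size alternatives
chunk_size <- 2
chunks <- split(patterns, ceiling(seq_along(patterns) / chunk_size))

# For each chunk, flag the elements of x that match any pattern in it
per_chunk <- lapply(chunks, function(p) grepl(paste(p, collapse = "|"), x))

# An element of x counts as matched if any chunk matched it
matched <- Reduce(`|`, per_chunk)
print(matched)
```

Because grepl() is vectorised over x, the explicit loop over j is not needed in this form.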
Re: [R] Maximum number of patterns and speed in grep
On Fri, Jul 6, 2012 at 10:45 AM, mdvaan <mathijsdev...@gmail.com> wrote:
> Hi,
>
> I am using R's grep function to find patterns in vectors of strings. The number of patterns I would like to match is 7,700 (of different sizes). I noticed that I get an error message when I do the following:
>
> data <- array()
> for (j in 1:length(x)) {
>   array[j] <- length(grep(paste(patterns[1:7700], collapse = "|"), x[j], value = T))
> }
>
> When I break this up into 4 chunks of patterns it works:
>
> data <- array()
> for (j in 1:length(x)) {
>   array$chunk1[j] <- length(grep(paste(patterns[1:2500], collapse = "|"), x[j], value = T))
>   array$chunk1[j] <- length(grep(paste(patterns[2501:5000], collapse = "|"), x[j], value = T))
>   array$chunk1[j] <- length(grep(paste(patterns[5001:7500], collapse = "|"), x[j], value = T))
>   array$chunk1[j] <- length(grep(paste(patterns[7501:7700], collapse = "|"), x[j], value = T))
> }
>
> My questions: what's the maximum size of the patterns argument in grep? Is there a way to do this faster? It is very slow.

Try strapplyc in gsubfn and see http://gsubfn.googlecode.com for more info.

    # test data
    x <- c("abcd", "z", "dbef")

    # re is regexp with 7700 alternatives
    # to test with
    g <- expand.grid(letters, letters, letters)
    gp <- do.call(paste0, g)
    gp7700 <- head(gp, 7700)
    re <- paste(gp7700, collapse = "|")

    # grep gives error message
    grep.out <- grep(re, x)

    # strapplyc works
    library(gsubfn)
    which(sapply(strapplyc(x, re), length) > 0)

--
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
Re: [R] Maximum number of patterns and speed in grep
Thanks for the quick response. I should phrase my question differently because everything is working fine, I am just trying to find a more efficient approach:

1. What's the maximum size of the patterns argument in grep? Can't find it online.
2. I am trying to match 7,700 character strings to about 10,000 vectors each containing about 5,000 strings using grep. Is there a way to do this faster? It is very slow.

Thanks

--
View this message in context: http://r.789695.n4.nabble.com/Maximum-number-of-patterns-and-speed-in-grep-tp4635613p4635626.html
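On the speed question, one option worth trying, assuming the 7,700 strings are literal text rather than regular expressions, is to loop over the patterns instead of over x and let grepl() work on the whole vector with fixed = TRUE, which does plain substring matching and so involves no regular expression at all. This is a sketch with made-up data, not a benchmark:

```r
# Made-up data: patterns are treated as literal substrings
patterns <- c("abc", "zz", "be")
x <- c("abcd", "z", "dbef")

# For each pattern, count how many elements of x contain it
counts <- vapply(patterns,
                 function(p) sum(grepl(p, x, fixed = TRUE)),
                 integer(1))
print(counts)
```

If only whole-string equality is needed rather than substring matching, x %in% patterns avoids pattern matching altogether.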