Re: [R] Maximum number of patterns and speed in grep

2012-07-23 Thread mdvaan
Hi,

I have a minor follow-up question:

In the example below, "ann" and "nn" in the third element of text are
matched. I would like to ignore all matches in which the character following
the match is one of [:alpha:]. How do I do this without removing the
ignore.case = TRUE argument of the strapply function?

So the output should be:

[[1]]
[1] "Santa Fe Gold Corp"

[[2]]
[1] "Starpharma Holdings"

[[3]]
NULL

Rather than:

[[1]]
[1] "Santa Fe Gold Corp"

[[2]]
[1] "Starpharma Holdings"

[[3]]
[1] "ann" "nn"

Thanks!


require(gsubfn)

# read in data
data <- read.csv("https://dl.dropbox.com/u/13631687/data.csv", header = T,
                 sep = ",")

# define the object to be searched
text <- c("the first is Santa Fe Gold Corp",
          "the second is Starpharma Holdings",
          "the annual earnings exceed those of last year")

k <- 3000 # chunk size

f <- function(from, text) {
  to <- min(from + k - 1, nrow(data))
  r <- paste(data[seq(from, to), 1], collapse = "|")
  r <- gsub("[().*?+{}]", "", r)
  strapply(text, r, ignore.case = TRUE)
}
ix <- seq(1, nrow(data), k)
out <- lapply(text, function(text) unlist(lapply(ix, f, text)))
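
One possible way to get the output asked for above is a negative lookahead, which rejects a match when the next character is a letter. The sketch below uses base R (regmatches/gregexpr with perl = TRUE) instead of strapply, and the short patterns vector is a hypothetical stand-in for data[, 1]; both are assumptions, not something from the thread.

```r
# Hypothetical pattern list; in the thread the patterns come from data[, 1].
patterns <- c("Santa Fe Gold Corp", "Starpharma Holdings", "ann", "nn")

text <- c("the first is Santa Fe Gold Corp",
          "the second is Starpharma Holdings",
          "the annual earnings exceed those of last year")

# (?![[:alpha:]]) is a negative lookahead: a match is dropped when the
# character right after it is a letter. Lookaheads need perl = TRUE here.
re <- paste0("(?:", paste(patterns, collapse = "|"), ")(?![[:alpha:]])")

# ignore.case = TRUE is kept, as requested in the question.
out <- regmatches(text, gregexpr(re, text, ignore.case = TRUE, perl = TRUE))
print(out)
```

Note that elements without a match come back as character(0) rather than NULL. The same lookahead could be tried inside the chunked f() above, but whether the regex engine used by strapply accepts it unchanged would need to be checked.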



--
View this message in context: 
http://r.789695.n4.nabble.com/Maximum-number-of-patterns-and-speed-in-grep-tp4635613p4637458.html
Sent from the R help mailing list archive at Nabble.com.

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Maximum number of patterns and speed in grep

2012-07-16 Thread mdvaan
Thanks! That worked like a charm.

Math


Gabor Grothendieck wrote
 
 On Fri, Jul 13, 2012 at 1:41 PM, mdvaan <mathijsdevaan@...> wrote:
 Here's some data (which should give you the error messages):

 # read in data
 data <- read.csv("https://dl.dropbox.com/u/13631687/data.csv", header = T,
                  sep = ",")

 # first paste all data
 data1 <- paste(data[,1], collapse = "|")

 # second paste subsets of the data
 data2a <- paste(data[1:750,1], collapse = "|")
 data2b <- paste(data[751:1500,1], collapse = "|")

 # define the object to be searched
 text <- c("the first is Santa Fe Gold Corp",
           "the second is Starpharma Holdings")

 # match
 strapplyc(text, data1)
 strapplyc(text, data2a)
 strapplyc(text, data2b)

 Thanks in advance!

 
 Although it seems that strapplyc can handle larger regular expressions
 than grep in R it seems neither can handle as many as in your example
 so process it in chunks:
 
 k <- 3000 # chunk size
 
 f <- function(from, text) {
   to <- min(from + k - 1, nrow(data))
   r <- paste(data[seq(from, to), 1], collapse = "|")
   r <- gsub("[().*?+{}]", "", r)
   strapply(text, r)
 }
 ix <- seq(1, nrow(data), k)
 out <- lapply(text, function(text) unlist(lapply(ix, f, text)))
 
 
 -- 
 Statistics & Software Consulting
 GKX Group, GKX Associates Inc.
 tel: 1-877-GKX-GROUP
 email: ggrothendieck at gmail.com
 
 


--
View this message in context: 
http://r.789695.n4.nabble.com/Maximum-number-of-patterns-and-speed-in-grep-tp4635613p4636657.html
Sent from the R help mailing list archive at Nabble.com.



Re: [R] Maximum number of patterns and speed in grep

2012-07-15 Thread mdvaan
Here's some data (which should give you the error messages):

# read in data
data <- read.csv("https://dl.dropbox.com/u/13631687/data.csv", header = T,
                 sep = ",")

# first paste all data
data1 <- paste(data[,1], collapse = "|")

# second paste subsets of the data
data2a <- paste(data[1:750,1], collapse = "|")
data2b <- paste(data[751:1500,1], collapse = "|")

# define the object to be searched
text <- c("the first is Santa Fe Gold Corp",
          "the second is Starpharma Holdings")

# match
strapplyc(text, data1)
strapplyc(text, data2a)
strapplyc(text, data2b)

Thanks in advance!

Math



Gabor Grothendieck wrote
 
 On Fri, Jul 13, 2012 at 9:40 AM, mdvaan <mathijsdevaan@...> wrote:
 Thanks, I see that it is working in the sample data. My data, however,
 gives
 me an error message:

 data <- strapplyc(text, batch[[l]])
 Error in structure(.External("dotTcl", ..., PACKAGE = "tcltk"), class =
 "tclObj") :
   [tcl] couldn't compile regular expression pattern: parentheses () not
 balanced.

 batch[[l]] is similar to your re string except that there is a larger
 variety of characters. I haven't been able to figure out which characters
 are causing trouble here. Any thoughts?

 Thank you very much.

 Math
 ...

 __
 R-help@ mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 
 Note part on last line about posting reproducible code.
 
 -- 
 Statistics & Software Consulting
 GKX Group, GKX Associates Inc.
 tel: 1-877-GKX-GROUP
 email: ggrothendieck at gmail.com
 
 

--
View this message in context: 
http://r.789695.n4.nabble.com/Maximum-number-of-patterns-and-speed-in-grep-tp4635613p4636472.html
Sent from the R help mailing list archive at Nabble.com.



Re: [R] Maximum number of patterns and speed in grep

2012-07-15 Thread Gabor Grothendieck
On Fri, Jul 13, 2012 at 1:41 PM, mdvaan mathijsdev...@gmail.com wrote:
 Here's some data (which should give you the error messages):

 # read in data
 data <- read.csv("https://dl.dropbox.com/u/13631687/data.csv", header = T,
                  sep = ",")

 # first paste all data
 data1 <- paste(data[,1], collapse = "|")

 # second paste subsets of the data
 data2a <- paste(data[1:750,1], collapse = "|")
 data2b <- paste(data[751:1500,1], collapse = "|")

 # define the object to be searched
 text <- c("the first is Santa Fe Gold Corp",
           "the second is Starpharma Holdings")

 # match
 strapplyc(text, data1)
 strapplyc(text, data2a)
 strapplyc(text, data2b)

 Thanks in advance!


Although it seems that strapplyc can handle larger regular expressions
than grep in R it seems neither can handle as many as in your example
so process it in chunks:

k <- 3000 # chunk size

f <- function(from, text) {
  to <- min(from + k - 1, nrow(data))
  r <- paste(data[seq(from, to), 1], collapse = "|")
  r <- gsub("[().*?+{}]", "", r)
  strapply(text, r)
}
ix <- seq(1, nrow(data), k)
out <- lapply(text, function(text) unlist(lapply(ix, f, text)))
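
The gsub() call above deletes regex metacharacters from the pattern, so a name such as "A.B (C)" would no longer match its original spelling in the text. A possible variant, sketched here under the assumption that the names should be matched literally, is to escape the metacharacters instead. The helper name escape_regex and the sample names are hypothetical, and the example uses base grepl rather than strapply.

```r
# Put a backslash in front of common regex metacharacters so that each
# name is treated as literal text when used inside a pattern.
escape_regex <- function(s) {
  gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", s)
}

# Hypothetical company names containing metacharacters
names_raw <- c("A.B (C)", "Acme Corp.")
escaped <- escape_regex(names_raw)
re <- paste(escaped, collapse = "|")

# The first string contains the literal name, the second does not.
hits <- grepl(re, c("we hold A.B (C) shares", "we hold AxB C shares"))
print(hits)
```

Whether the escaped pattern behaves the same way in the engine used by strapply is an assumption that should be verified on a small sample first.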


-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



Re: [R] Maximum number of patterns and speed in grep

2012-07-13 Thread mdvaan
Thanks, I see that it is working in the sample data. My data, however, gives
me an error message: 

data <- strapplyc(text, batch[[l]])
Error in structure(.External("dotTcl", ..., PACKAGE = "tcltk"), class =
"tclObj") :
  [tcl] couldn't compile regular expression pattern: parentheses () not
balanced.

batch[[l]] is similar to your re string except that there is a larger
variety of characters. I haven't been able to figure out which characters
are causing trouble here. Any thoughts?
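
One way to narrow down which entries cause this kind of error is to try compiling each pattern on its own and record the ones that fail. The sketch below does this with base grepl and tryCatch; the patterns vector is a hypothetical stand-in for the real list, and base R's default engine may not reject exactly the same patterns as the Tcl engine behind strapplyc.

```r
# Hypothetical patterns; the second one has an unbalanced parenthesis.
patterns <- c("Santa Fe Gold Corp", "Acme (Holdings", "Starpharma Holdings")

# Try each pattern alone against an empty string. A pattern that cannot be
# compiled raises an error, which is caught and recorded as FALSE.
compiles <- suppressWarnings(vapply(patterns, function(p) {
  tryCatch({
    grepl(p, "")
    TRUE
  }, error = function(e) FALSE)
}, logical(1)))

# Patterns that need cleaning or escaping before they are pasted together
bad <- patterns[!compiles]
print(bad)
```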

Thank you very much.

Math 




Gabor Grothendieck wrote
 
 On Fri, Jul 6, 2012 at 10:45 AM, mdvaan <mathijsdevaan@...> wrote:
 Hi,

 I am using R's grep function to find patterns in vectors of strings. The
 number of patterns I would like to match is 7,700 (of different sizes). I
 noticed that I get an error message when I do the following:

 data <- array()
 for (j in 1:length(x))
 {
 array[j] <- length(grep(paste(patterns[1:7700], collapse = "|"), x[j],
                         value = T))
 }

 When I break this up into 4 chunks of patterns it works:

 data <- array()
 for (j in 1:length(x))
 {
 array$chunk1[j] <- length(grep(paste(patterns[1:2500], collapse = "|"),
                                x[j], value = T))
 array$chunk1[j] <- length(grep(paste(patterns[2501:5000], collapse = "|"),
                                x[j], value = T))
 array$chunk1[j] <- length(grep(paste(patterns[5001:7500], collapse = "|"),
                                x[j], value = T))
 array$chunk1[j] <- length(grep(paste(patterns[7501:7700], collapse = "|"),
                                x[j], value = T))
 }

 My questions: what's the maximum size of the patterns argument in grep?
 Is
 there a way to do this faster? It is very slow.
 
 Try strapplyc in gsubfn and see
   http://gsubfn.googlecode.com
 for more info.
 
 # test data
 x <- c("abcd", "z", "dbef")
 
 # re is regexp with 7700 alternatives
 #  to test with
 g <- expand.grid(letters, letters, letters)
 gp <- do.call(paste0, g)
 gp7700 <- head(gp, 7700)
 re <- paste(gp7700, collapse = "|")
 
 # grep gives error message
 grep.out <- grep(re, x)
 
 # strapplyc works
 library(gsubfn)
 which(sapply(strapplyc(x, re), length) > 0)
 
 
 -- 
 Statistics & Software Consulting
 GKX Group, GKX Associates Inc.
 tel: 1-877-GKX-GROUP
 email: ggrothendieck at gmail.com
 
 

--
View this message in context: 
http://r.789695.n4.nabble.com/Maximum-number-of-patterns-and-speed-in-grep-tp4635613p4636437.html
Sent from the R help mailing list archive at Nabble.com.



Re: [R] Maximum number of patterns and speed in grep

2012-07-13 Thread Gabor Grothendieck
On Fri, Jul 13, 2012 at 9:40 AM, mdvaan mathijsdev...@gmail.com wrote:
 Thanks, I see that it is working in the sample data. My data, however, gives
 me an error message:

 data <- strapplyc(text, batch[[l]])
 Error in structure(.External("dotTcl", ..., PACKAGE = "tcltk"), class =
 "tclObj") :
   [tcl] couldn't compile regular expression pattern: parentheses () not
 balanced.

 batch[[l]] is similar to your re string except that there is a larger
 variety of characters. I haven't been able to figure out which characters
 are causing trouble here. Any thoughts?

 Thank you very much.

 Math
...

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.

Note part on last line about posting reproducible code.

-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



[R] Maximum number of patterns and speed in grep

2012-07-06 Thread mdvaan
Hi,

I am using R's grep function to find patterns in vectors of strings. The
number of patterns I would like to match is 7,700 (of different sizes). I
noticed that I get an error message when I do the following: 

data <- array()
for (j in 1:length(x))
{
array[j] <- length(grep(paste(patterns[1:7700], collapse = "|"), x[j],
                        value = T))
}

When I break this up into 4 chunks of patterns it works:

data <- array()
for (j in 1:length(x))
{
array$chunk1[j] <- length(grep(paste(patterns[1:2500], collapse = "|"),
                               x[j], value = T))
array$chunk1[j] <- length(grep(paste(patterns[2501:5000], collapse = "|"),
                               x[j], value = T))
array$chunk1[j] <- length(grep(paste(patterns[5001:7500], collapse = "|"),
                               x[j], value = T))
array$chunk1[j] <- length(grep(paste(patterns[7501:7700], collapse = "|"),
                               x[j], value = T))
}

My questions: what's the maximum size of the patterns argument in grep? Is
there a way to do this faster? It is very slow.

Thanks.

Math

Sorry for not providing a reproducible example. It's a size issue which
makes it difficult to provide an example.

 

--
View this message in context: 
http://r.789695.n4.nabble.com/Maximum-number-of-patterns-and-speed-in-grep-tp4635613.html
Sent from the R help mailing list archive at Nabble.com.



Re: [R] Maximum number of patterns and speed in grep

2012-07-06 Thread Sarah Goslee
Hi,

Given that you can't provide a full example, please at least provide
str() on your data, more complete information on the problem, and
ideally a small toy example that demonstrates precisely what you are
doing.

For instance, you tell us that you get an error message but you
never tell us what it is. Don't you think we might need to know what
the error is to be able to diagnose and fix it?

Also, note that your working example simply overwrites
array$chunk1[j] four times.
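
A minimal sketch of a chunked version that accumulates the result across chunks instead of overwriting it, and that lets grepl work on the whole vector at once rather than looping over x[j]. The function name any_match, the chunk size, and the sample data are hypothetical.

```r
# Return TRUE for each element of x that matches at least one pattern,
# processing the patterns in chunks to keep each regex small.
any_match <- function(x, patterns, chunk_size = 2500) {
  hit <- logical(length(x))  # starts as all FALSE
  starts <- seq(1, length(patterns), by = chunk_size)
  for (s in starts) {
    chunk <- patterns[s:min(s + chunk_size - 1, length(patterns))]
    re <- paste(chunk, collapse = "|")
    hit <- hit | grepl(re, x)  # accumulate instead of overwriting
  }
  hit
}

x <- c("abcd", "z", "dbef")
patterns <- c("abc", "xyz", "bef")
res <- any_match(x, patterns, chunk_size = 2)
print(res)
```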

Sarah

On Fri, Jul 6, 2012 at 10:45 AM, mdvaan mathijsdev...@gmail.com wrote:
 Hi,

 I am using R's grep function to find patterns in vectors of strings. The
 number of patterns I would like to match is 7,700 (of different sizes). I
 noticed that I get an error message when I do the following:

 data <- array()
 for (j in 1:length(x))
 {
 array[j] <- length(grep(paste(patterns[1:7700], collapse = "|"), x[j],
                         value = T))
 }

 When I break this up into 4 chunks of patterns it works:

 data <- array()
 for (j in 1:length(x))
 {
 array$chunk1[j] <- length(grep(paste(patterns[1:2500], collapse = "|"),
                                x[j], value = T))
 array$chunk1[j] <- length(grep(paste(patterns[2501:5000], collapse = "|"),
                                x[j], value = T))
 array$chunk1[j] <- length(grep(paste(patterns[5001:7500], collapse = "|"),
                                x[j], value = T))
 array$chunk1[j] <- length(grep(paste(patterns[7501:7700], collapse = "|"),
                                x[j], value = T))
 }

 My questions: what's the maximum size of the patterns argument in grep? Is
 there a way to do this faster? It is very slow.

 Thanks.

 Math

 Sorry for not providing a reproducible example. It's a size issue which
 makes it difficult to provide an example.



-- 
Sarah Goslee
http://www.functionaldiversity.org



Re: [R] Maximum number of patterns and speed in grep

2012-07-06 Thread Gabor Grothendieck
On Fri, Jul 6, 2012 at 10:45 AM, mdvaan mathijsdev...@gmail.com wrote:
 Hi,

 I am using R's grep function to find patterns in vectors of strings. The
 number of patterns I would like to match is 7,700 (of different sizes). I
 noticed that I get an error message when I do the following:

 data <- array()
 for (j in 1:length(x))
 {
 array[j] <- length(grep(paste(patterns[1:7700], collapse = "|"), x[j],
                         value = T))
 }

 When I break this up into 4 chunks of patterns it works:

 data <- array()
 for (j in 1:length(x))
 {
 array$chunk1[j] <- length(grep(paste(patterns[1:2500], collapse = "|"),
                                x[j], value = T))
 array$chunk1[j] <- length(grep(paste(patterns[2501:5000], collapse = "|"),
                                x[j], value = T))
 array$chunk1[j] <- length(grep(paste(patterns[5001:7500], collapse = "|"),
                                x[j], value = T))
 array$chunk1[j] <- length(grep(paste(patterns[7501:7700], collapse = "|"),
                                x[j], value = T))
 }

 My questions: what's the maximum size of the patterns argument in grep? Is
 there a way to do this faster? It is very slow.

Try strapplyc in gsubfn and see
  http://gsubfn.googlecode.com
for more info.

# test data
x <- c("abcd", "z", "dbef")

# re is regexp with 7700 alternatives
#  to test with
g <- expand.grid(letters, letters, letters)
gp <- do.call(paste0, g)
gp7700 <- head(gp, 7700)
re <- paste(gp7700, collapse = "|")

# grep gives error message
grep.out <- grep(re, x)

# strapplyc works
library(gsubfn)
which(sapply(strapplyc(x, re), length) > 0)


-- 
Statistics & Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com



Re: [R] Maximum number of patterns and speed in grep

2012-07-06 Thread mdvaan
Thanks for the quick response. I should phrase my question differently
because everything is working fine, I am just trying to find a more
efficient approach:

1. What's the maximum size of the patterns argument in grep? Can't find it
online. 
2. I am trying to match 7,700 character strings to about 10,000 vectors each
containing about 5,000 strings using grep. Is there a way to do this faster?
It is very slow. 
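
If the 7,700 strings are plain text rather than regular expressions, one option that might be worth timing is literal matching with fixed = TRUE, one pattern at a time. This avoids building a single huge regex and sidesteps metacharacter problems, but it is only a sketch on made-up data; no claim is made that it is faster on the real data.

```r
x <- c("abcd", "z", "dbef")
patterns <- c("abc", "b.f", "bef")  # "b.f" is treated literally, not as a regex

# One column per pattern, one row per element of x.
# (Assumes length(x) > 1 so that vapply returns a matrix.)
hits <- vapply(patterns, function(p) grepl(p, x, fixed = TRUE),
               logical(length(x)))

# TRUE for each element of x that contains at least one pattern
any_hit <- rowSums(hits) > 0
print(any_hit)
```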

Thanks


Sarah Goslee wrote
 
 Hi,
 
 Given that you can't provide a full example, please at least provide
 str() on your data, more complete information on the problem, and
 ideally a small toy example that demonstrates precisely what you are
 doing.
 
 For instance, you tell us that you get an error message but you
 never tell us what it is. Don't you think we might need to know what
 the error is to be able to diagnose and fix it?
 
 Also, note that your working example simply overwrites
 array$chunk1[j] four times.
 
 Sarah
 
 On Fri, Jul 6, 2012 at 10:45 AM, mdvaan <mathijsdevaan@...> wrote:
 Hi,

 I am using R's grep function to find patterns in vectors of strings. The
 number of patterns I would like to match is 7,700 (of different sizes). I
 noticed that I get an error message when I do the following:

 data <- array()
 for (j in 1:length(x))
 {
 array[j] <- length(grep(paste(patterns[1:7700], collapse = "|"), x[j],
                         value = T))
 }

 When I break this up into 4 chunks of patterns it works:

 data <- array()
 for (j in 1:length(x))
 {
 array$chunk1[j] <- length(grep(paste(patterns[1:2500], collapse = "|"),
                                x[j], value = T))
 array$chunk1[j] <- length(grep(paste(patterns[2501:5000], collapse = "|"),
                                x[j], value = T))
 array$chunk1[j] <- length(grep(paste(patterns[5001:7500], collapse = "|"),
                                x[j], value = T))
 array$chunk1[j] <- length(grep(paste(patterns[7501:7700], collapse = "|"),
                                x[j], value = T))
 }

 My questions: what's the maximum size of the patterns argument in grep?
 Is
 there a way to do this faster? It is very slow.

 Thanks.

 Math

 Sorry for not providing a reproducible example. It's a size issue which
 makes it difficult to provide an example.

 
 
 -- 
 Sarah Goslee
 http://www.functionaldiversity.org
 
 


--
View this message in context: 
http://r.789695.n4.nabble.com/Maximum-number-of-patterns-and-speed-in-grep-tp4635613p4635626.html
Sent from the R help mailing list archive at Nabble.com.
