Re: [R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe

2015-07-13 Thread Christopher W Ryan
Interesting thoughts about the partial-word matches, and speed. On another real data set, about 73,000 records and 6 columns to search through for matches (one column of which contains very long character strings, several paragraphs each), I ran both John's and Bert's solutions. John's was

Re: [R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe

2015-07-11 Thread Bert Gunter
Note that John's solution probably includes incorrect partial matches and that mine fails to match "red" in "this is red." If you change my proposal to
  sapply(strsplit(do.call(paste,zz[,2:3]),"\\W"), function(x)any(x %in% alarm.words))
it should agree with Jeff's. Note, however, that you have missed
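The difference between splitting on whitespace and splitting on non-word characters can be sketched as follows. The `txt` and `alarm.words` values here are made up for illustration and are not the original poster's data.

```r
# Hypothetical example data (not from the thread)
alarm.words <- c("red", "green")
txt <- c("this is red.", "barred door", "a green turtle")

# Splitting on whitespace keeps the trailing period, so "red." is not "red"
by_space <- sapply(strsplit(txt, "[[:space:]]+"),
                   function(x) any(x %in% alarm.words))

# Splitting on non-word characters drops the punctuation before matching
by_nonword <- sapply(strsplit(txt, "\\W"),
                     function(x) any(x %in% alarm.words))

print(by_space)    # FALSE FALSE TRUE
print(by_nonword)  # TRUE FALSE TRUE
```

In both variants "barred" is not flagged, because `%in%` compares whole tokens rather than substrings.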

Re: [R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe

2015-07-10 Thread Christopher W Ryan
Thanks everyone. John's original solution worked great. And with 27,000 records, 65 alarm.words, and 6 columns to search, it takes only about 15 seconds. That is certainly adequate for my needs. But I will try out the other strategies too. And thanks also for lots of new R things to

Re: [R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe

2015-07-10 Thread Bert Gunter
Yes. This is one of the fundamental challenges in text searching -- defining exactly what text defines a match and what doesn't. So, continuing your example, one might imagine that heroin and heroine might both be matches, but maybe heroines shouldn't be (e.g. if the text contains movie reviews).
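One way such a rule could be encoded, purely as an illustrative assumption about what counts as a match, is an optional trailing "e" inside word boundaries. The `txt` vector is hypothetical.

```r
# Hypothetical text; the pattern is one possible definition of a match
txt <- c("heroin overdose", "the heroine wins", "two heroines", "heroinlike")

# "heroin" or "heroine" as a whole word; "heroines" is excluded
hits <- grepl("\\bheroine?\\b", txt)

print(hits)  # TRUE TRUE FALSE FALSE
```

Whether "heroines" should count depends on the application, so the pattern would need to be adjusted to the matching rule actually wanted.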

Re: [R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe

2015-07-09 Thread Bert Gunter
Here's a way to do it that uses %in% (i.e. match()) and uses only a single, not a double, loop. It should be more efficient.
  > sapply(strsplit(do.call(paste,zz[,2:3]),"[[:space:]]+"),
  +        function(x)any(x %in% alarm.words))
   [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
The
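A self-contained sketch of this single-loop approach, using a small made-up data frame (the original `zz` is not fully shown in the thread, so the values below are assumptions):

```r
# Hypothetical stand-in for the poster's data frame
zz <- data.frame(
  v1 = 1:4,
  v2 = c("white bird", "green turtle", "big black dog", "quick brown fox"),
  v3 = c("harry", "ronald red", "barred owl", "chloe"),
  stringsAsFactors = FALSE
)
alarm.words <- c("red", "green", "turtle")

# Paste the text columns together row by row, split into words,
# then test each row's words against the alarm list with %in%
combined <- do.call(paste, zz[, 2:3])
words <- strsplit(combined, "[[:space:]]+")
zz$v5 <- sapply(words, function(x) any(x %in% alarm.words))

print(zz$v5)  # FALSE TRUE FALSE FALSE
```

Only one explicit loop (the `sapply` over rows) remains; the comparison against all alarm words is vectorised inside `%in%`.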

Re: [R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe

2015-07-09 Thread Jeff Newmiller
I think grep is better suited to this:
  zz$v5 <- grepl( paste0( alarm.words, collapse="|" ), do.call( paste, zz[ , 2:3 ] ) )

Re: [R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe

2015-07-09 Thread Bert Gunter
Jeff: Well, it would be much better (no loops!) except, I think, for one issue: "red" would match "barred" and I don't think that this is what is wanted: the matches should be on whole words, not just string patterns. So you would need to fix up the matching pattern to make this work, but it may be a

Re: [R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe

2015-07-09 Thread Jeff Newmiller
Just add a word break marker before and after:
  zz$v5 <- grepl( paste0( "\\b(", paste0( alarm.words, collapse="|" ), ")\\b" ), do.call( paste, zz[ , 2:3 ] ) )
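The effect of the word break markers can be checked on a few made-up strings (hypothetical data, chosen to include a partial match and trailing punctuation):

```r
# Hypothetical example data (not from the thread)
alarm.words <- c("red", "green")
txt <- c("barred owl", "red fox", "evergreen", "this is red.")

# Plain alternation: matches substrings anywhere
loose <- paste0(alarm.words, collapse = "|")
# Alternation wrapped in \\b ... \\b: matches whole words only
strict <- paste0("\\b(", loose, ")\\b")

loose_hits <- grepl(loose, txt)
strict_hits <- grepl(strict, txt)

print(loose_hits)   # TRUE TRUE TRUE TRUE
print(strict_hits)  # FALSE TRUE FALSE TRUE
```

The anchored pattern rejects "barred" and "evergreen" but still matches "red." because the period counts as a word boundary.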

Re: [R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe

2015-07-09 Thread Bert Gunter
Yup, that does it. Let grep figure out what's a word rather than doing it manually. Forgot about \b. Cheers, Bert

[R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe

2015-07-09 Thread Christopher W. Ryan
Running R 3.1.1 on Windows 7. I want to identify as a case any record in a dataframe that contains any of several keywords in any of several variables. Example:
  # create a dataframe with 4 variables and 10 records
  v2 <- c("white bird", "blue bird", "green turtle", "quick brown fox", "big black dog", "waffle

Re: [R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe

2015-07-09 Thread John Fox
Dear Chris, If I understand correctly what you want, how about the following?
  rows <- apply(zz[, 2:3], 1, function(x) any(sapply(alarm.words, grepl, x=x)))
  zz[rows, ]
           v1           v2     v3 v4
  3 -1.022329 green turtle ronald
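A runnable sketch of this nested-loop approach on made-up data (the data frame below is an assumption, not the poster's `zz`). It also shows the partial-match behaviour discussed elsewhere in the thread, since `grepl` matches substrings.

```r
# Hypothetical stand-in for the poster's data frame
zz <- data.frame(
  v1 = 1:4,
  v2 = c("white bird", "green turtle", "big black dog", "quick brown fox"),
  v3 = c("harry", "ronald", "barred owl", "chloe"),
  stringsAsFactors = FALSE
)
alarm.words <- c("red", "green")

# For each row, test every alarm word against every text column
rows <- apply(zz[, 2:3], 1,
              function(x) any(sapply(alarm.words, grepl, x = x)))

print(unname(rows))  # FALSE TRUE TRUE FALSE
print(zz[rows, ])
```

Row 3 is flagged only because "red" occurs inside "barred", which is the kind of partial match a whole-word pattern would avoid.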

[R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe

2015-07-09 Thread Christopher W Ryan
Running R 3.1.1 on Windows 7. I want to identify as a case any record in a dataframe that contains any of several keywords in any of several variables. Example:
  # create a dataframe with 4 variables and 10 records
  v2 <- c("white bird", "blue bird", "green turtle", "quick brown fox", "big black dog", "waffle

Re: [R] detecting any element in a vector of strings, appearing anywhere in any of several character variables in a dataframe

2015-07-09 Thread John Fox
Dear Christopher, My usual orientation to this kind of one-off problem is that I'm looking for a simple correct solution. Computing time is usually much smaller than programming time. That said, Bert Gunter's solution was about 5 times faster in a simple check that I ran with microbenchmark,