[R] Identify and extract a whole word of variable length using regular expressions
Hi everybody,

I'm quite weak with regular expressions, and I need some help...

I have strings of the type

a
[1,] ppe46 Rv3018c MT3098/MT3101 MTV012.32c
[2,] ppe16 Rv1135c MT1168
[3,] ppe21 Rv1548c MT1599 MTCY48.17
[4,] ppe12 Rv0755c MT0779
[5,] PE_PGRS51 Rv3367
[etc., for several hundred rows]

I want to have instead only:

[1,] Rv3018c
[2,] Rv1135c
[3,] Rv1548c
[4,] Rv0755c
[5,] Rv3367

Besides these examples, the only thing I know for sure is that the substrings I want to extract are entire words, all starting with Rv. So Rvx, preceded and followed by a space, and of variable length. I don't have any other information.

Do you know how to pick them out? I checked for their presence using grep and the \\Rv*\\ expression, and I tried some string functions from Hmisc. I also tried the other way round, substituting empty strings for everything except the Rv word, but I didn't get very far...

Could you please give me some suggestions?

Thanks a lot,
Giulio

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] Identify and extract a whole word of variable length using regular expressions
Giulio -
   This

sub('^.* ?(Rv[^ ]*) ?.*$','\\1',a)
[1] "Rv3018c" "Rv1135c" "Rv1548c" "Rv0755c" "Rv3367"

seems to do what you want.

                    - Phil Spector
                      Statistical Computing Facility
                      Department of Statistics
                      UC Berkeley
                      spec...@stat.berkeley.edu

On Mon, 28 Jun 2010, Giulio Di Giovanni wrote:

> [...] the magic substrings I want to extract are entire word all starting by Rv. So Rvx, preceded and followed by a space, and of a variable length. [...]
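For anyone trying the substitution above on their own data, here is a minimal sketch of how it behaves. It assumes `a` is a plain character vector holding only the five rows shown in the question. The `no_rv` string and the `extract_rv` helper are hypothetical additions for illustration, not part of the original answer. One thing worth knowing is that `sub` returns its input unchanged when the pattern does not match, so rows without an Rv word would pass through silently unless they are masked.

```r
# Sample data reconstructed from the question (first five rows only).
a <- c("ppe46 Rv3018c MT3098/MT3101 MTV012.32c",
       "ppe16 Rv1135c MT1168",
       "ppe21 Rv1548c MT1599 MTCY48.17",
       "ppe12 Rv0755c MT0779",
       "PE_PGRS51 Rv3367")

# The pattern from the answer: keep only the captured Rv word.
pattern <- "^.* ?(Rv[^ ]*) ?.*$"
rv <- sub(pattern, "\\1", a)
print(rv)

# Hypothetical row with no Rv word: sub() leaves it untouched.
no_rv <- "ppe99 MT0001"
unchanged <- sub(pattern, "\\1", no_rv)

# Hypothetical helper: same substitution, but NA where no "Rv" occurs at all.
extract_rv <- function(x) {
  out <- sub(pattern, "\\1", x)
  out[!grepl("Rv", x, fixed = TRUE)] <- NA_character_
  out
}
print(extract_rv(c(a, no_rv)))
```

Note that this pattern does not require the Rv word to start at a word boundary, so a token such as xRv12 would also yield Rv12.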
Re: [R] Identify and extract a whole word of variable length using regular expressions
On Mon, Jun 28, 2010 at 7:17 PM, Giulio Di Giovanni perimessagg...@hotmail.com wrote:

> [...] the magic substrings I want to extract are entire word all starting by Rv. So Rvx, preceded and followed by a space, and of a variable length. [...]

You can use strapply in gsubfn to pick out strings by content. The regular expression says: match a word boundary, followed by R, followed by v, followed by 0 or more non-space characters:

library(gsubfn)
strapply(a, "\\bRv\\S*", c, perl = TRUE, simplify = TRUE)
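If gsubfn is not installed, the same regular expression can probably be used with base R alone. The sketch below is not the strapply call from the reply, but a swapped-in equivalent built on `gregexpr` and `regmatches`. It assumes `a` is a plain character vector with the five sample rows, and the `mixed` string is a made-up example to show what the word boundary does.

```r
# Sample data reconstructed from the question (first five rows only).
a <- c("ppe46 Rv3018c MT3098/MT3101 MTV012.32c",
       "ppe16 Rv1135c MT1168",
       "ppe21 Rv1548c MT1599 MTCY48.17",
       "ppe12 Rv0755c MT0779",
       "PE_PGRS51 Rv3367")

# Word boundary, then "Rv", then zero or more non-space characters.
pattern <- "\\bRv\\S*"

# gregexpr() finds every match per string; regmatches() pulls them out
# as a list with one character vector per input string.
hits <- regmatches(a, gregexpr(pattern, a, perl = TRUE))

# Each sample row has exactly one match, so the list flattens to a vector.
rv <- unlist(hits)
print(rv)

# Made-up example: "Rv" inside a longer token is skipped because
# there is no word boundary between "c" and "R".
mixed <- "abcRv12 Rv99"
mid_word <- regmatches(mixed, gregexpr(pattern, mixed, perl = TRUE))[[1]]
print(mid_word)
```

Because the result is a list before `unlist`, rows with no match or with several matches stay distinguishable, which may matter across the several hundred real rows.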