[R] Identify and extract a whole word of variable length using regular expressions

2010-06-28 Thread Giulio Di Giovanni


Hi everybody,

I'm quite weak with regular expressions, and I need some help...
I have strings of the type

a

[1,] ppe46 Rv3018c MT3098/MT3101 MTV012.32c
[2,] ppe16 Rv1135c MT1168  
[3,] ppe21 Rv1548c MT1599 MTCY48.17
[4,] ppe12 Rv0755c MT0779  
[5,] PE_PGRS51 Rv3367  
[etc..for several hundreds]

I want to have instead only:

[1,] Rv3018c
[2,] Rv1135c
[3,] Rv1548c
[4,] Rv0755c
[5,] Rv3367


Besides these examples, the only thing I know for sure is that the magic 
substrings I want to extract are entire words, all starting with Rv. So 
Rvx, preceded and followed by a space, and of variable length. I don't 
have any other information. 

Do you know how to pick them? I checked for their presence using grep and 
the \\Rv*\\ expression. I also tried some string functions from Hmisc and, 
going the other way, substituting empty strings for everything except the 
Rv word, but I didn't get very far...
Could you please give me some suggestions?

Thanks a lot,


Giulio
  
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Identify and extract a whole word of variable length using regular expressions

2010-06-28 Thread Phil Spector

Giulio -
   This

sub('^.* ?(Rv[^ ]*) ?.*$','\\1',a)

[1] "Rv3018c" "Rv1135c" "Rv1548c" "Rv0755c" "Rv3367"

seems to do what you want.
- Phil Spector
 Statistical Computing Facility
 Department of Statistics
 UC Berkeley
 spec...@stat.berkeley.edu
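
For anyone who wants to try the sub() call above, here is a minimal,
self-contained sketch. The vector a is rebuilt from the five sample lines
in the question (the original object prints like a one-column matrix, so
using a plain character vector is an assumption), and the line passed in
for no_match is made up purely to show what happens when no Rv word is
present.

```r
# Sample data copied from the question; the real object has several hundred rows.
a <- c("ppe46 Rv3018c MT3098/MT3101 MTV012.32c",
       "ppe16 Rv1135c MT1168",
       "ppe21 Rv1548c MT1599 MTCY48.17",
       "ppe12 Rv0755c MT0779",
       "PE_PGRS51 Rv3367")

# ^.* ?       everything before the word, plus an optional space
# (Rv[^ ]*)   the captured word: "Rv" followed by a run of non-space characters
#  ?.*$       an optional space and everything after the word
rv <- sub("^.* ?(Rv[^ ]*) ?.*$", "\\1", a)
print(rv)

# Caveat: sub() returns its input unchanged when the pattern does not match,
# so a line without any Rv word comes back whole rather than as NA.
# (This input line is hypothetical, not from the original data.)
no_match <- sub("^.* ?(Rv[^ ]*) ?.*$", "\\1", "ppe99 MT0001")
print(no_match)
```

Because the whole line is replaced by the captured group, it is worth
checking with grepl("Rv", a) that every line really contains an Rv word
before trusting the result.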


Re: [R] Identify and extract a whole word of variable length using regular expressions

2010-06-28 Thread Gabor Grothendieck
On Mon, Jun 28, 2010 at 7:17 PM, Giulio Di Giovanni
perimessagg...@hotmail.com wrote:


 Could you please give me some suggestions?


You can use strapply in the gsubfn package to pick out strings by
content.  The regular expression says: match a word boundary, followed
by R, followed by v, followed by 0 or more non-space characters:

library(gsubfn)
strapply(a, "\\bRv\\S*", c, perl = TRUE, simplify = TRUE)
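
If installing gsubfn is not an option, roughly the same extraction can be
sketched in base R with regexpr() and regmatches(). Note that this swaps
in a different technique from the strapply call; it only reuses the same
pattern. The vector a is rebuilt from the sample lines in the question,
and b and rv_aligned are hypothetical names added to show how lines
without an Rv word could be kept as NA.

```r
# Sample data copied from the question.
a <- c("ppe46 Rv3018c MT3098/MT3101 MTV012.32c",
       "ppe16 Rv1135c MT1168",
       "ppe21 Rv1548c MT1599 MTCY48.17",
       "ppe12 Rv0755c MT0779",
       "PE_PGRS51 Rv3367")

# regexpr() locates the first match in each string; regmatches() extracts it.
# \\b is a word boundary and \\S* a run of non-space characters (needs perl = TRUE).
m <- regexpr("\\bRv\\S*", a, perl = TRUE)
rv <- regmatches(a, m)
print(rv)

# regmatches() silently drops strings that did not match, so the result can be
# shorter than the input. To keep one entry per line, fill in NA by hand.
# (The extra line in b is made up for illustration.)
b <- c(a, "ppe99 MT0001")
mb <- regexpr("\\bRv\\S*", b, perl = TRUE)
rv_aligned <- rep(NA_character_, length(b))
rv_aligned[as.integer(mb) > 0] <- regmatches(b, mb)
print(rv_aligned)
```

The aligned version is usually the safer one when the result has to be
put back next to the original rows.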
