Re: [R] RfW 2.3.1: regular expressions to detect pairs of identical word-final character sequences

2006-07-25 Thread Greg Snow
Using regular expression matching for this case may be overkill (the RE
engine will be doing a lot of backtracking looking at a lot of
non-matches).  Here is an alternative that splits the text into a vector
of words, extracts the last 2 letters of each word (remember if the last
3 letters match, then the last 2 have to match, so we only need to
consider the last 2), then looks at all pairwise comparisons for
matches, then pastes everything back together with the marked matches:

text <- "And this is a second rand  sentence"

tmp1 <- strsplit(text, ' ')[[1]]
tmp2 <- nchar(tmp1)
tmp3 <- substr(tmp1, tmp2-1, tmp2)

tmp4 <- which(lower.tri(diag(length(tmp3))), arr.ind=TRUE)
tmp5 <- tmp3[ tmp4[,1] ] == tmp3[ tmp4[,2] ]

tmp6 <- rep('', length(tmp1))
count <- 1
for( i in which(tmp5) ){
    tmp6[ tmp4[i,1] ] <- paste(tmp6[ tmp4[i,1] ],
                               '<r',count,'>',sep='')
    tmp6[ tmp4[i,2] ] <- paste(tmp6[ tmp4[i,2] ],
                               '<r',count,'>',sep='')
    count <- count + 1
}

out.text <- paste( tmp1,tmp6, sep='',collapse=' ')
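
As a small illustration of the pairwise-comparison step in the code above, the sketch below shows which index pairs which(lower.tri(...), arr.ind=TRUE) produces and how the matching endings are picked out. The four-word vector and the variable names are made up for this example and are not from the original post.

```r
# Hypothetical mini example: four words, compared on their last two letters
words   <- c("And", "this", "is", "second")
n       <- nchar(words)
endings <- substr(words, n - 1, n)          # "nd" "is" "is" "nd"

# Every unordered pair of positions, one row per pair (columns: row, col)
pairs <- which(lower.tri(diag(length(endings))), arr.ind = TRUE)

# TRUE where the two words of a pair share the same ending
same <- endings[pairs[, 1]] == endings[pairs[, 2]]

# Keep only the matching pairs
matched <- pairs[same, , drop = FALSE]
print(matched)   # rows (4,1) and (3,2), i.e. And/second and this/is
```

With n words there are n*(n-1)/2 rows in pairs, so this step grows quadratically with the number of words.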


If you are doing a lot of text processing like this, I would suggest
doing it in Perl rather than R.  S Poetry by Dr. Burns has a function to
take a vector of character strings in R and run a Perl script on it and
return the results.

Hope this helps,




-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[EMAIL PROTECTED]
(801) 408-8111
 

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED]] On Behalf Of Stefan Th. Gries
Sent: Saturday, July 22, 2006 7:49 PM
To: r-help@stat.math.ethz.ch
Subject: [R] RfW 2.3.1: regular expressions to detect pairs of identical
word-final character sequences

Dear all

I use R for Windows 2.3.1 on a fully updated Windows XP Home SP2 machine
and I have two related regular expression problems.

platform       i386-pc-mingw32
arch           i386
os             mingw32
system         i386, mingw32
status
major          2
minor          3.1
year           2006
month          06
day            01
svn rev        38247
language       R
version.string Version 2.3.1 (2006-06-01)


I would like to find cases of words in elements of character vectors
that end in the same character sequences; if I find such cases, I want
to add <r> to both potentially rhyming sequences. An example:

INPUT:          This is my dog.
DESIRED OUTPUT: This<r> is<r> my dog.

I found a solution for cases where the potentially rhyming words are
adjacent:

text <- "This is my dog."
gsub("(\\w+?)(\\W\\w+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)

However, with another text vector, I came across two problems I cannot
seem to solve and for which I would love to get some input.

(i) While I know what to do for non-adjacent words in general

gsub("(\\w+?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", "This not is my dog",
perl=TRUE) # I know this is not proper English ;-)

this runs into problems with overlapping matches:

text <- "And this is the second sentence"
gsub("(\\w+?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)
[1] "And<r> this is the second<r> sentence"

It finds the "nd" match, but since the "is" match is within the two
"nd"'s, it doesn't get it. Any ideas on how to get all pairwise matches?

(ii) How would one tell R to match only when there are 2+ characters
matching? If the above expression is applied to another character string

text <- "this is an example sentence."
gsub("(\\w+?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text, perl=TRUE)

it also matches the "e"'s at the end of "example" and "sentence". It's not
possible to get rid of that by specifying a range such as {2,}

text <- "this is an example sentence."
gsub("(\\w{2,}?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text,
perl=TRUE)

because, as I understand it, this requires the 2+ cases of \\w to be
identical characters:

text <- "doo yoo see mee?"
gsub("(\\w{2,}?)(\\W.+?)\\1(\\W)", "\\1<r>\\2\\1<r>\\3", text,
perl=TRUE)

Again, any ideas?
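
On point (ii), a quick check may help: in PCRE, \\w{2,} means "two or more word characters", and they do not have to be identical. The sketch below applies the same pattern to a made-up sentence ("a cat sat here" is not from the original post) and assumes the marker is the literal string <r>.

```r
# \w{2,} accepts any two or more word characters, identical or not
two_plus <- grepl("^\\w{2,}$", c("is", "oo", "e"), perl = TRUE)
print(two_plus)   # TRUE TRUE FALSE

# Same pattern as in the question, on a hypothetical sentence whose
# rhyming ending ("at") consists of two different characters
text <- "a cat sat here"
pat  <- "(\\w{2,}?)(\\W.+?)\\1(\\W)"
res  <- gsub(pat, "\\1<r>\\2\\1<r>\\3", text, perl = TRUE)
print(res)        # "cat" and "sat" are both marked
```

Note that the second group, (\\W.+?), needs at least two characters between the two endings, so this particular pattern cannot pair up endings of directly adjacent words such as "this is".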

I'd really appreciate any snippets of codes, pointers, etc.
Thanks so much,
STG
--
Stefan Th. Gries
---
University of California, Santa Barbara
http://www.linguistics.ucsb.edu/faculty/stgries

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



Re: [R] RfW 2.3.1: regular expressions to detect pairs of identical word-final character sequences

2006-07-25 Thread Gabor Grothendieck
Regarding the backtracking: we can simply compare the timings,
and they show that the two approaches are comparable in speed.

In fact the bottleneck is not the backtracking but strapply.
I had coded the regexp version for compactness of code, but if we replace
the strapply with custom gsub/strsplit code for speed, the new
regexp version is twice as fast as the for loop version.

Below, f1 is the for loop version, f2 is the original regexp version
with strapply, and f3 is the revised version using gsub/strsplit instead.

f1 <- function() {
tmp1 <- strsplit(text, ' ')[[1]]
tmp2 <- nchar(tmp1)
tmp3 <- substr(tmp1,tmp2-1,tmp2)

tmp4 <- which(lower.tri(diag(length(tmp3))), arr.ind=TRUE)
tmp5 <- tmp3[ tmp4[,1] ] == tmp3[ tmp4[,2] ]

tmp6 <- rep('', length(tmp1))
count <- 1
for( i in which(tmp5) ){
   tmp6[ tmp4[i,1] ] <- paste(tmp6[ tmp4[i,1] ],
                              '<r',count,'>',sep='')
   tmp6[ tmp4[i,2] ] <- paste(tmp6[ tmp4[i,2] ],
                              '<r',count,'>',sep='')
   count <- count + 1
}

out.text <- paste( tmp1,tmp6, sep='',collapse=' ')
}

# places <...> around first occurrences of repeated suffixes

library(gsubfn)
f2 <- function() {
text <- "And this is the second sentence"

pat <- "(\\w+)(?=\\b.+\\1\\b)"
# pat <- "(\\w\\w+)(?=\\b.+\\1\\b)"
out <- gsub(pat, "<\\1>", text, perl = TRUE)

suff <- strapply(out, "<([^>]+)>", function(x,y)y)[[1]]
gsub(paste("(", paste(suff, collapse = "|"), ")\\b", sep = ""),
     "\\1<r>", text)
}


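f3 <- function() {
text <- "And this is the second sentence"

pat <- "(\\w+)(?=\\b.+\\1\\b)"
# pat <- "(\\w\\w+)(?=\\b.+\\1\\b)"
out <- gsub(pat, "<\\1>", text, perl = TRUE)

# redo this strapply by hand for speed purposes
# suff <- strapply(out, "<([^>]+)>", function(x,y)y)[[1]]
# replace everything outside the <...> marks by a single "<";
# the first alternative is anchored so it only eats the leading text
suff <- gsub("^[^<]*<|>[^<]*<|>[^<]*$", "<", out)
# drop the leading and trailing separator, then split on it
suff <- gsub("^<|<$", "", suff)
suff <- strsplit(suff, "<")[[1]]
gsub(paste("(", paste(suff, collapse = "|"), ")\\b", sep = ""),
     "\\1<r>", text)
}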


# for loop version
system.time(for (i in 1:100) f1())  #  0.32 0.00 0.36   NA   NA

# original regexp version with strapply
system.time(for (i in 1:100) f2()) #  0.36 0.00 0.38   NA   NA

# regexp version with strapply replaced with gsub/strsplit
system.time(for (i in 1:100) f3()) # 0.15 0.00 0.16   NA   NA
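
The first three figures in each timing comment above are the user, system and elapsed seconds reported by system.time. As a minimal sketch of how such a comparison could be wrapped up, the helper below returns only the elapsed time for a number of repetitions. The name time_reps and the toy function being timed are assumptions made for this illustration; they are not part of the original post.

```r
# Hypothetical helper: elapsed seconds for `reps` calls of a zero-argument function
time_reps <- function(f, reps = 100) {
  stopifnot(is.function(f), reps >= 1)
  timing <- system.time(for (i in seq_len(reps)) f())
  timing[["elapsed"]]
}

# Toy function to time: mark every two-letter word ending (illustration only)
toy <- function() {
  gsub("(\\w\\w)\\b", "\\1<r>", "And this is the second sentence", perl = TRUE)
}

elapsed <- time_reps(toy, reps = 100)
print(elapsed)
```

Timings on a short sample string are small and noisy, so it is worth repeating the measurement or raising reps before drawing conclusions.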





Re: [R] RfW 2.3.1: regular expressions to detect pairs of identical word-final character sequences

2006-07-25 Thread Greg Snow
Before comparing times we should make sure that the functions return
the same thing.  My original function (f1) labels the potential
rhymes with match numbers as well as finding them.  If you just
want the <r> flag, then the for loop can be eliminated, giving f4 as
follows:

f4 <- function(text) {
    tmp1 <- strsplit(text, ' ')[[1]]
    tmp2 <- nchar(tmp1)
    tmp3 <- substr(tmp1,tmp2-1,tmp2)

    tmp4 <- which(lower.tri(diag(length(tmp3))), arr.ind=TRUE)
    tmp5 <- tmp3[ tmp4[,1] ] == tmp3[ tmp4[,2] ]

    tmp6 <- rep('', length(tmp1))
    tmp6[ unique(c(tmp4[tmp5,])) ] <- '<r>'
    paste( tmp1,tmp6, sep='',collapse=' ')
}

The speed of f4 is similar to the speed of f3 (even after correcting f3;
the original one just returns the original text string).

But that is on the sample string.  What if a longer string is used (more
potential for backtracking)?

Try the string generated by:

set.seed(1)
text <- paste( sample(c(letters,' ',' ',' '), 1000, replace=TRUE),
               collapse='')
text <- gsub( ' {2,}', ' ', text)

Now f4 is much faster than f3.  However, f3 can be optimized by replacing
\\w+ in pat by \\w{2}, and that makes it faster than f4 again.

It would probably be even faster to use gregexpr to just find the
matching endings then create the new regexp based on those endings and
do one substitute rather than using multiple gsubs.
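
A minimal sketch of that last idea, under a few assumptions: it uses regmatches (available in current versions of R), assumes the marker is the literal string <r>, and the function name mark_endings is made up for this example. It finds the repeated two-letter endings in one gregexpr pass and then does a single substitution.

```r
# Hypothetical helper: flag words whose last two letters recur later as a word ending
mark_endings <- function(text) {
  # two word characters at a word end that occur again, later, at a word end
  pat   <- "(\\w{2})(?=\\b.+\\1\\b)"
  found <- regmatches(text, gregexpr(pat, text, perl = TRUE))[[1]]
  endings <- unique(found)
  if (length(endings) == 0) return(text)   # nothing repeats: leave text alone

  # one alternation over the endings that were found, applied in a single gsub
  final <- paste0("(", paste(endings, collapse = "|"), ")\\b")
  gsub(final, "\\1<r>", text, perl = TRUE)
}

marked <- mark_endings("And this is the second sentence")
print(marked)
```

Because the endings consist only of \\w characters, they can be pasted into the second pattern without escaping.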



-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
[EMAIL PROTECTED]
(801) 408-8111
 


Re: [R] RfW 2.3.1: regular expressions to detect pairs of identical word-final character sequences

2006-07-25 Thread Gabor Grothendieck
Here is yet another solution.  This one consists only of
two gsubs and a function to reverse a string.  It runs
at about the same speed as f3, but its main advantage
is how compact it is.

pat could be the same as before; however, we have made
use of Greg's discussion and use \\w\\w to
avail ourselves of his speedup idea.  If single-letter
endings are OK, use \\w instead of \\w\\w.
This time the first gsub simply appends <r> to the first in any
duplicated ending.  Then we reverse the string.
In the second gsub we look for any sequence at the
start of a word for which >r< followed by that sequence
is found later in the string, and prepend >r< to that.
Finally we reverse the result.

text <- "And this is the second sentence"
strrev <- function(x) paste(rev(strsplit(x, "")[[1]]), collapse = "")

pat <- "(\\w\\w)(?=\\b.+\\1\\b)"
out <- strrev(gsub(pat, "\\1<r>", text, perl = TRUE))
strrev(gsub("\\b(\\w+)(?=.*>r<\\1)", ">r<\\1", out, perl = TRUE))
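
To make the reverse-and-match trick easier to follow, here is the same computation broken into named intermediate steps. This is only a sketch: it assumes the marker is the literal string <r> (so that it reads >r< after reversal), and the step1 to step3 names are introduced here for illustration.

```r
# Reverse a string character by character
strrev <- function(x) paste(rev(strsplit(x, "")[[1]]), collapse = "")

text <- "And this is the second sentence"

# Step 1: append <r> to the first occurrence of each repeated two-letter ending
step1 <- gsub("(\\w\\w)(?=\\b.+\\1\\b)", "\\1<r>", text, perl = TRUE)

# Step 2: reverse, so word endings become word beginnings and <r> becomes >r<
step2 <- strrev(step1)

# Step 3: a word start that reappears later right after >r< gets >r< prepended
step3 <- gsub("\\b(\\w+)(?=.*>r<\\1)", ">r<\\1", step2, perl = TRUE)

# Step 4: reverse back
result <- strrev(step3)

print(step1)
print(step2)
print(step3)
print(result)
```

The reversal is what lets a lookahead do the job of a lookbehind: after step 2, the already-marked first occurrences sit to the right of the later ones.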



Re: [R] RfW 2.3.1: regular expressions to detect pairs of identical word-final character sequences

2006-07-22 Thread Gabor Grothendieck
The following requires more than just a single gsub but it does solve
the problem.  Modify to suit.

The first gsub places <...> around the first occurrence of any
duplicated suffixes.  We use the (?=...) zero width regexp
to circumvent the nesting problem.

Then we use strapply from the gsubfn package to extract
the suffixes so marked and paste them together to pass
to a second gsub which locates them in the original
string, appending an <r> to each.   Uncomment the commented
pat if you only want to match 2+ character suffixes.

library(gsubfn)
# places <...> around first occurrences of repeated suffixes
text <- "And this is the second sentence"
pat <- "(\\w+)(?=\\b.+\\1\\b)"
# pat <- "(\\w\\w+)(?=\\b.+\\1\\b)"
out <- gsub(pat, "<\\1>", text, perl = TRUE)

suff <- strapply(out, "<([^>]+)>", function(x,y)y)[[1]]
gsub(paste("(", paste(suff, collapse = "|"), ")\\b", sep = ""), "\\1<r>", text)
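
The role of the zero-width (?=...) can be seen by comparing it with a pattern that consumes the text up to the repeat. The sketch below uses base R only (regmatches, which exists in current versions of R) and the two-character variant of the pattern; the variable names are made up for this illustration.

```r
text <- "And this is the second sentence"

# Consuming version: the match swallows everything up to the repeated ending,
# so the "is" pair nested inside it can no longer be found
consuming <- "(\\w\\w)\\b.+\\1\\b"

# Lookahead version: only the ending itself is matched; the rest is checked
# but not consumed, so nested pairs are still visible
lookahead <- "(\\w\\w)(?=\\b.+\\1\\b)"

m_consuming <- regmatches(text, gregexpr(consuming, text, perl = TRUE))[[1]]
m_lookahead <- regmatches(text, gregexpr(lookahead, text, perl = TRUE))[[1]]

print(m_consuming)   # one long match starting at the "nd" of "And"
print(m_lookahead)   # the two endings "nd" and "is"
```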

