Re: [R] gsub syntax

2005-11-27 Thread Dimitris Rizopoulos
you could use something like:

dates <- c("73", "74", "02", "1973", "1974", "2002")
###
nd <- nchar(dates)
substr(dates, ifelse(nd == 2, 1, 3), nd)
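
For anyone pasting this into a session, here is a minimal self-contained sketch of the same idea. It assumes the years are stored as character strings (so that "02" keeps its leading zero), and the helper name last_two is only illustrative, not something from the original post.

```r
# Assumption: years are character strings, so "02" keeps its leading zero.
dates <- c("73", "74", "02", "1973", "1974", "2002")

# last_two() is a hypothetical helper name used only for this sketch.
last_two <- function(x) {
  nd <- nchar(x)
  # Start at position 1 for 2-character strings, otherwise at position 3,
  # and always stop at the last character.
  substr(x, ifelse(nd == 2, 1, 3), nd)
}

print(last_two(dates))
```

As written, the sketch only covers strings of length 2 or 4; other lengths would need a different start rule.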


I hope it helps.

Best,
Dimitris


Dimitris Rizopoulos
Ph.D. Student
Biostatistical Centre
School of Public Health
Catholic University of Leuven

Address: Kapucijnenvoer 35, Leuven, Belgium
Tel: +32/(0)16/336899
Fax: +32/(0)16/337015
Web: http://www.med.kuleuven.be/biostat/
 http://www.student.kuleuven.be/~m0390867/dimitris.htm


----- Original Message -----
From: John Logsdon [EMAIL PROTECTED]
To: r-help@stat.math.ethz.ch
Sent: Sunday, November 27, 2005 11:04 AM
Subject: [R] gsub syntax


> Hello
>
> I know that R's string functions are not as extensive as those of Unix but
> I need to do some text handling totally within an R environment because
> the target is a Windows system which will not have the corresponding shell
> utilities, sed, awk etc.
>
> Can anyone explain the following gsub phenomenon to me:
>
> > dates<-c("73","74","02","1973","1974","2002")
>
> I want to take just the last two digits where it is a 4-digit year and
> both digits when it is a 2-digit year.  I should be able to use substr but
> measurement from the string end (with a negative counter or something) is
> not implemented:
>
> > substr(dates,3,4)
> [1] ""   ""   ""   "73" "74" "02"
> > substr(dates,-2,4)
> [1] "73"   "74"   "02"   "1973" "1974" "2002"
> > substr(dates,4,-2)
> [1] "" "" "" "" "" ""
>
> So I tried gsub:
>
> > gsub("[19|20]([0-9][0-9])","\\1",dates)
> [1] "73"  "74"  "02"  "973" "974" "002"
>
> As I understand it (and comparing with sed), the \\1 should take the first
> bracketed string but clearly this doesn't work.  If I try what should also
> work:
>
> > gsub("[19|20]([0-9])([0-9])","\\1\\2",dates)
> [1] "73"  "74"  "02"  "973" "974" "002"
>
> On the other hand the following does work:
>
> > gsub("[19|20]([0-9])([0-9])","\\2",dates)
> [1] "73" "74" "02" "73" "74" "02"
>
> So it appears that the substitution takes one character extra to the left
> but the following indicates that the lower limit of the selected range is
> also at fault:
>
> > s<-c("1","12","123","1234","12345","123456")
> > gsub("[12]([4-6]*)","",s)
> [1] ""     ""     "3"    "34"   "345"  "3456"
>
> Probably more elegant examples could be constructed that could home in on
> the issue.
>
> The version is R 2.0.1 on Linux so perhaps it is a little old now.
>
> Questions:
>
> 1) Am I misunderstanding the gsub use?
>
> 2) Was it a bug that has since been corrected?
>
> 3) Is it still a bug in the latest version?
>
> TIA
>
> JOhn
>
> John Logsdon                               Try to make things as simple
> Quantex Research Ltd, Manchester UK         as possible but not simpler
> [EMAIL PROTECTED]              [EMAIL PROTECTED]
> +44(0)161 445 4951/G:+44(0)7717758675       www.quantex-research.com
>
> ______________________________________________
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide!
> http://www.R-project.org/posting-guide.html
 




Re: [R] gsub syntax

2005-11-27 Thread Sundar Dorai-Raj


Hi, John,

I cannot comment on your questions since I'm no regexpr guru. However, 
it seems to me you can do the following instead:

gsub(".*([0-9][0-9])", "\\1", dates)

This works fine on Linux & Windows, R-2.2.0.
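
A short sketch of why this pattern picks out the last two digits, assuming the years are character strings as in the original question. The variable names res_greedy and res_anchored are only illustrative.

```r
# Assumption: years are character strings, as in the original question.
dates <- c("73", "74", "02", "1973", "1974", "2002")

# Greedy ".*" first swallows the whole string, then backtracks just far
# enough for "([0-9][0-9])" to match, so the group holds the LAST two digits.
res_greedy <- gsub(".*([0-9][0-9])", "\\1", dates)
print(res_greedy)

# Anchoring with "$" and using sub() states the same intent more explicitly.
res_anchored <- sub(".*([0-9][0-9])$", "\\1", dates)
print(res_anchored)
```

Both calls should give the same result on this input; the anchored form is simply harder to misread.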

HTH,

--sundar



Re: [R] gsub syntax

2005-11-27 Thread Gabor Grothendieck
On 11/27/05, John Logsdon [EMAIL PROTECTED] wrote:
> Hello
>
> I know that R's string functions are not as extensive as those of Unix but

I don't think this statement is true although I have seen it repeated.

> I need to do some text handling totally within an R environment because
> the target is a Windows system which will not have the corresponding shell
> utilities, sed, awk etc.

Free versions of these utilities are available for Windows although they
don't come with Windows, e.g. Google for gawk.


> Questions:
>
> 1) Am I misunderstanding the gsub use?
>
> 2) Was it a bug that has since been corrected?
>
> 3) Is it still a bug in the latest version?


It works the same on my system, which is R 2.2.0 patched (2005-10-24)
on Windows. At first I too thought it was a bug, but I noticed it
works the same in perl, so now I am not sure. The following perl
program, run with perl 5.8.6 on Windows, gives 002 as the answer too:

   $_ = 2002;
   s/[19|20]([0-9])([0-9])/\1\2/g;
   print;
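
The same comparison can be sketched without leaving R, assuming a version whose gsub() accepts the perl = TRUE argument. The variable names are only illustrative.

```r
x <- "2002"

# Default regular expression engine.
res_default <- gsub("[19|20]([0-9])([0-9])", "\\1\\2", x)

# perl = TRUE switches gsub() to Perl-compatible regular expressions.
res_pcre <- gsub("[19|20]([0-9])([0-9])", "\\1\\2", x, perl = TRUE)

# Both engines treat "[19|20]" as a one-character class, so the results agree.
print(c(res_default, res_pcre))
```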

In any case, it could be done like this:

   sub(".*(..)$", "\\1", dates)

or

   substring(dates, nchar(dates)-1)

or the following which appends -01-01 to the year, converts it to Date
class, implicitly converts it back to character and then extracts
the 3rd to 4th character of the result:

   substring(as.Date(sprintf("%s-01-01", dates)), 3, 4)

or



Re: [R] gsub syntax

2005-11-27 Thread Prof Brian Ripley
R is blameless here: it works as documented and in the same way as 
POSIX tools.  It agrees with 'sed' using the same syntax (modulo the 
shell-specific quoting rules) e.g. in csh

% echo 1973 | sed 's/[19|20]\([0-9][0-9]\)/\1/g'
973
% echo 1973 | sed 's/\([19|20]\)\([0-9][0-9]\)/-\1-\2-/g'
-1-97-3
% echo 73 74 02 1973 1974 2002 | sed 's/[19|20]\([0-9][0-9]\)/\1/g'
73 74 02 973 974 002

so what happened when you were 'comparing with sed'?
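
For comparison inside R, the first two sed commands above translate roughly as follows. This is only a sketch: the variable names r1 and r2 are illustrative, and the only real difference from sed is that backslashes are doubled inside R string literals and the groups need no escaping.

```r
# sed 's/[19|20]\([0-9][0-9]\)/\1/g'
r1 <- gsub("[19|20]([0-9][0-9])", "\\1", "1973")

# sed 's/\([19|20]\)\([0-9][0-9]\)/-\1-\2-/g'
# The dashes make visible what each group actually captured.
r2 <- gsub("([19|20])([0-9][0-9])", "-\\1-\\2-", "1973")

print(c(r1, r2))
```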

[19|20] is a character class (containing five characters) matching one 
character, not a match for two characters as you seem to imagine.  It does 
not mean the same as 19|20, which is what you seem to have intended (and 
you seem only to want to do the substitution once on each string, so why 
use gsub?):

> sub("19|20([0-9][0-9])", "\\1", dates)
[1] "73" "74" "02" "73" "74" "02"
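
A small sketch of the difference between the character class and the alternation, assuming the same input strings as in the question. The names chars, in_class, in_alt and stripped are only illustrative.

```r
chars <- c("1", "9", "|", "2", "0", "19", "20")

# "[19|20]" is a set of five characters and matches exactly one of them,
# including the literal "|".
in_class <- grepl("^[19|20]$", chars)
print(in_class)

# "(19|20)" is alternation between the two-character strings "19" and "20".
in_alt <- grepl("^(19|20)$", chars)
print(in_alt)

# With explicit grouping, the prefix is removed only when it is followed
# by exactly two digits.
dates <- c("73", "74", "02", "1973", "1974", "2002")
stripped <- sub("^(19|20)([0-9][0-9])$", "\\2", dates)
print(stripped)
```

The explicit parentheses around 19|20 are a design choice in this sketch: they make the scope of the alternation obvious instead of relying on operator precedence.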

A more direct way which would work e.g. for 1837 would be

sub(".*([0-9]{2}$)", "\\1", dates)

or even better (locale-independent)

sub(".*([[:digit:]]{2}$)", "\\1", dates)
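
To illustrate the point about 1837, here is a brief sketch comparing a prefix-based pattern with the suffix-based one. The vector years and the names by_prefix and by_suffix are assumptions made for the example.

```r
years <- c("73", "1837", "1973", "2002")

# A prefix-based pattern only knows about "19" and "20",
# so "1837" is left untouched.
by_prefix <- sub("^(19|20)([0-9][0-9])$", "\\2", years)
print(by_prefix)

# Keeping the final two digits works for any century.
by_suffix <- sub(".*([[:digit:]]{2}$)", "\\1", years)
print(by_suffix)
```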

Current versions of R have a help page ?regexp explaining what regexps 
are.  Even 2.0.1 did, although you were asked to update *before* posting 
(see the posting guide).  It was unambiguous:

     A _character class_ is a list of characters enclosed by '[' and
     ']' matches any single character in that list ...
                     ^^^^^^
     ...  Note that alternation does not work inside character classes,
     where \code{|} has its literal meaning.


On Sun, 27 Nov 2005, John Logsdon wrote:

> Hello
>
> I know that R's string functions are not as extensive as those of Unix but
> I need to do some text handling totally within an R environment because
> the target is a Windows system which will not have the corresponding shell
> utilities, sed, awk etc.
>
> Can anyone explain the following gsub phenomenon to me:
>
> > dates<-c("73","74","02","1973","1974","2002")
>
> I want to take just the last two digits where it is a 4-digit year and
> both digits when it is a 2-digit year.  I should be able to use substr but
> measurement from the string end (with a negative counter or something) is
> not implemented:

Why 'should' it work in a different way to that documented?

> > substr(dates,3,4)
> [1] ""   ""   ""   "73" "74" "02"
> > substr(dates,-2,4)
> [1] "73"   "74"   "02"   "1973" "1974" "2002"
> > substr(dates,4,-2)
> [1] "" "" "" "" "" ""
>
> So I tried gsub:
>
> > gsub("[19|20]([0-9][0-9])","\\1",dates)
> [1] "73"  "74"  "02"  "973" "974" "002"
>
> As I understand it (and comparing with sed), the \\1 should take the first
> bracketed string but clearly this doesn't work.  If I try what should also
> work:
>
> > gsub("[19|20]([0-9])([0-9])","\\1\\2",dates)
> [1] "73"  "74"  "02"  "973" "974" "002"
>
> On the other hand the following does work:
>
> > gsub("[19|20]([0-9])([0-9])","\\2",dates)
> [1] "73" "74" "02" "73" "74" "02"
>
> So it appears that the substitution takes one character extra to the left
> but the following indicates that the lower limit of the selected range is
> also at fault:
>
> > s<-c("1","12","123","1234","12345","123456")
> > gsub("[12]([4-6]*)","",s)
> [1] ""     ""     "3"    "34"   "345"  "3456"
>
> Probably more elegant examples could be constructed that could home in on
> the issue.
>
> The version is R 2.0.1 on Linux so perhaps it is a little old now.
>
> Questions:
>
> 1) Am I misunderstanding the gsub use?

Yes.

> 2) Was it a bug that has since been corrected?

Unfortunately the bug reported two years ago in

    library(fortunes); fortune("WTFM")

still seems extant.  See the posting guide for advice on how to correct 
it.


-- 
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
