Re: [R] gsub syntax
You could use something like:

dates <- c("73", "74", "02", "1973", "1974", "2002")
###
nd <- nchar(dates)
substr(dates, ifelse(nd == 2, 1, 3), nd)

I hope it helps.

Best,
Dimitris

Dimitris Rizopoulos
Ph.D. Student
Biostatistical Centre
School of Public Health
Catholic University of Leuven

Address: Kapucijnenvoer 35, Leuven, Belgium
Tel: +32/(0)16/336899
Fax: +32/(0)16/337015
Web: http://www.med.kuleuven.be/biostat/
     http://www.student.kuleuven.be/~m0390867/dimitris.htm

----- Original Message -----
From: John Logsdon [EMAIL PROTECTED]
To: r-help@stat.math.ethz.ch
Sent: Sunday, November 27, 2005 11:04 AM
Subject: [R] gsub syntax

> Hello
>
> I know that R's string functions are not as extensive as those of Unix but
> I need to do some text handling totally within an R environment because
> the target is a Windows system which will not have the corresponding shell
> utilities, sed, awk etc.
>
> Can anyone explain the following gsub phenomenon to me:
>
> dates<-c("73","74","02","1973","1974","2002")
>
> I want to take just the last two digits where it is a 4-digit year and
> both digits when it is a 2-digit year. I should be able to use substr but
> measurement from the string end (with a negative counter or something) is
> not implemented:
>
> substr(dates,3,4)
> [1] ""   ""   ""   "73" "74" "02"
> substr(dates,-2,4)
> [1] "73"   "74"   "02"   "1973" "1974" "2002"
> substr(dates,4,-2)
> [1] "" "" "" "" "" ""
>
> So I tried gsub:
>
> gsub("[19|20]([0-9][0-9])","\\1",dates)
> [1] "73"  "74"  "02"  "973" "974" "002"
>
> As I understand it (and comparing with sed), the \\1 should take the first
> bracketed string but clearly this doesn't work.
>
> If I try what should also work:
>
> gsub("[19|20]([0-9])([0-9])","\\1\\2",dates)
> [1] "73"  "74"  "02"  "973" "974" "002"
>
> On the other hand the following does work:
>
> gsub("[19|20]([0-9])([0-9])","\\2",dates)
> [1] "73" "74" "02" "73" "74" "02"
>
> So it appears that the substitution takes one character extra to the left
> but the following indicates that the lower limit of the selected range is
> also at fault:
>
> s<-c("1","12","123","1234","12345","123456")
> gsub("[12]([4-6]*)","",s)
> [1] ""     ""     "3"    "34"   "345"  "3456"
>
> Probably more elegant examples could be constructed that could home in on
> the issue.
>
> The version is R 2.0.1 on Linux so perhaps it is a little old now.
>
> Questions:
>
> 1) Am I misunderstanding the gsub use?
> 2) Was it a bug that has since been corrected?
> 3) Is it still a bug in the latest version?
>
> TIA
>
> John
>
> John Logsdon                               Try to make things as simple
> Quantex Research Ltd, Manchester UK        as possible but not simpler
> [EMAIL PROTECTED] [EMAIL PROTECTED]
> +44(0)161 445 4951/G:+44(0)7717758675      www.quantex-research.com

Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
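The substr/nchar idea in the reply above can be wrapped in a small function that counts from the end of the string, which is what the original question was asking substr to do. This is only a sketch: `last_n` is a hypothetical name, not a base R function, and it assumes a character vector as input.

```r
# Hypothetical helper (not part of base R): last n characters of each string.
last_n <- function(x, n = 2) {
  len <- nchar(x)
  # Start n characters before the end, but never before position 1
  substr(x, pmax(len - n + 1, 1), len)
}

dates <- c("73", "74", "02", "1973", "1974", "2002")
last_two <- last_n(dates)
print(last_two)
# [1] "73" "74" "02" "73" "74" "02"
```

Because the start position is computed per element, the function needs no special case for 2-digit versus 4-digit years.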
Re: [R] gsub syntax
John Logsdon wrote:
> Can anyone explain the following gsub phenomenon to me:
>
> dates<-c("73","74","02","1973","1974","2002")
>
> I want to take just the last two digits where it is a 4-digit year and
> both digits when it is a 2-digit year.
> [...]
> Questions:
>
> 1) Am I misunderstanding the gsub use?
> 2) Was it a bug that has since been corrected?
> 3) Is it still a bug in the latest version?

Hi, John,

I cannot comment on your questions since I'm no regexpr guru. However, it
seems to me you can do the following instead:

gsub(".*([0-9][0-9])", "\\1", dates)

This works fine on Linux and Windows, R-2.2.0.

HTH,

--sundar
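A short sketch of why the `.*([0-9][0-9])` pattern in the reply above picks out the last two digits: the greedy `.*` takes as much of the string as it can while still leaving two digits for the group. The input "1973 AD" is an invented example, used only to show what happens when the digits are not at the end of the string.

```r
dates <- c("73", "74", "02", "1973", "1974", "2002")

# Greedy ".*" consumes everything it can, then gives back just enough
# for the group to match two digits, so \\1 holds the last digit pair.
res <- sub(".*([0-9][0-9])", "\\1", dates)
print(res)
# [1] "73" "74" "02" "73" "74" "02"

# Without a "$" anchor, text after the last digit pair is left in place
# (made-up input for illustration):
trailing <- sub(".*([0-9][0-9])", "\\1", "1973 AD")
print(trailing)
# [1] "73 AD"

# With the anchor, the same input does not match and is returned unchanged:
anchored <- sub(".*([0-9][0-9])$", "\\1", "1973 AD")
print(anchored)
# [1] "1973 AD"
```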
Re: [R] gsub syntax
On 11/27/05, John Logsdon [EMAIL PROTECTED] wrote:
> Hello
>
> I know that R's string functions are not as extensive as those of Unix but

I don't think this statement is true although I have seen it repeated.

> I need to do some text handling totally within an R environment because
> the target is a Windows system which will not have the corresponding shell
> utilities, sed, awk etc.

Free versions of these utilities are available for Windows although
they don't come with Windows, e.g. Google for gawk.

> Can anyone explain the following gsub phenomenon to me:
> [...]
> gsub("[19|20]([0-9])([0-9])","\\1\\2",dates)
> [1] "73"  "74"  "02"  "973" "974" "002"
> [...]
> The version is R 2.0.1 on Linux so perhaps it is a little old now.
>
> Questions:
>
> 1) Am I misunderstanding the gsub use?
> 2) Was it a bug that has since been corrected?
> 3) Is it still a bug in the latest version?

It works the same on my system, which is 2.2.0 Windows patched
(2005-10-24). At first I too thought it was a bug, but I noticed it
works the same in perl, so now I am not sure. The following perl
program, run with perl 5.8.6 on Windows, gives 002 as the answer too:

$_ = "2002";
s/[19|20]([0-9])([0-9])/\1\2/g;
print;

In any case, it could be done like this:

sub(".*(..)$", "\\1", dates)

or

substring(dates, nchar(dates)-1)

or the following, which appends -01-01 to the year, converts it to Date
class, implicitly converts it back to character and then extracts the
3rd to 4th character of the result:

substring(as.Date(sprintf("%s-01-01", dates)), 3, 4)

or
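One way to investigate a result like the 002 above is to look at the text each match actually consumes, rather than only at the final string. The sketch below uses base R's `gregexpr` and `regmatches`; the two input strings are taken from the thread, and the variable names are illustrative.

```r
x <- c("1973", "2002")
pattern <- "[19|20]([0-9])([0-9])"

# Text consumed by each match of the pattern, one list element per input
matched <- regmatches(x, gregexpr(pattern, x))
print(matched)
# [[1]]
# [1] "197"
#
# [[2]]
# [1] "200"

# Each match is three characters long; the replacement keeps the last two
# of them, and the unmatched fourth character is left untouched.
replaced <- gsub(pattern, "\\1\\2", x)
print(replaced)
# [1] "973" "002"
```

The match covers only three characters, so the bracketed part stands for a single character and not for "19" or "20".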
Re: [R] gsub syntax
R is blameless here: it works as documented and in the same way as POSIX
tools. It agrees with 'sed' using the same syntax (modulo the
shell-specific quoting rules), e.g. in csh

% echo 1973 | sed 's/[19|20]\([0-9][0-9]\)/\1/g'
973
% echo 1973 | sed 's/\([19|20]\)\([0-9][0-9]\)/-\1-\2-/g'
-1-97-3
% echo 73 74 02 1973 1974 2002 | sed 's/[19|20]\([0-9][0-9]\)/\1/g'
73 74 02 973 974 002

so what happened when you were 'comparing with sed'?

[19|20] is a character class (containing five characters) matching one
character, not a match for two characters as you seem to imagine. It does
not mean the same as 19|20, which is what you seem to have intended (and
you seem only to want to do the substitution once on each string, so why
use gsub?):

sub("19|20([0-9][0-9])", "\\1", dates)
[1] "73" "74" "02" "73" "74" "02"

A more direct way which would work e.g. for 1837 would be

sub(".*([0-9]{2}$)", "\\1", dates)

or even better (locale-independent)

sub(".*([[:digit:]]{2}$)", "\\1", dates)

Current versions of R have a help page ?regexp explaining what regexps
are. Even 2.0.1 did, although you were asked to update *before* posting
(see the posting guide). It was unambiguous:

  A _character class_ is a list of characters enclosed by '[' and ']'
  matches any single character in that list ...
          ^^^^^^^^^^^^^^^^^^^^
  ...
  Note that alternation does not work inside character classes, where
  \code{|} has its literal meaning.

On Sun, 27 Nov 2005, John Logsdon wrote:

> [...]
> I want to take just the last two digits where it is a 4-digit year and
> both digits when it is a 2-digit year. I should be able to use substr but
> measurement from the string end (with a negative counter or something) is
> not implemented:

Why 'should' it work in a different way to that documented?

> [...]
> The version is R 2.0.1 on Linux so perhaps it is a little old now.
>
> Questions:
>
> 1) Am I misunderstanding the gsub use?

Yes.

> 2) Was it a bug that has since been corrected?

Unfortunately the bug reported two years ago in

library(fortunes); fortune("WTFM")

still seems extant. See the posting guide for advice on how to correct it.

-- 
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
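The distinction drawn above between a character class and alternation can be checked directly. The sketch below is not from the thread: the anchored, grouped pattern `^(19|20)([0-9]{2})$` is one possible way of writing the intended "19 or 20, then two digits", and the test strings are chosen for illustration.

```r
dates <- c("73", "74", "02", "1973", "1974", "2002")

# Inside [...] the "|" is an ordinary character, and the class matches
# exactly one character: any of 1, 9, |, 2, 0.
in_class <- grepl("^[19|20]$", c("1", "9", "|", "2", "0", "19", "7"))
print(in_class)
# [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE

# Alternation needs a group: (19|20) matches the two-character prefix,
# and the second group holds the two digits that follow it.
fixed <- sub("^(19|20)([0-9]{2})$", "\\2", dates)
print(fixed)
# [1] "73" "74" "02" "73" "74" "02"

# A year from another century is left alone by this pattern:
other <- sub("^(19|20)([0-9]{2})$", "\\2", "1837")
print(other)
# [1] "1837"
```

The last call shows the limitation of any pattern tied to a 19 or 20 prefix, which is why taking the final two digits regardless of what precedes them handles a year such as 1837 as well.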