Re: [R] Substring and strsplit
On Wed, 30 Aug 2006, Hans-Joerg Bibiko wrote:

> If you are using 'only' English then
>
>     str <- "dog"
>     strsplit(str, NULL)[[1]]
>
> works perfectly and it is fast.

It does also work 'perfectly' and fast in 'Unicode' in all major European and CJK languages (and many others): extending the iconv example

    > xx
    [1] "façile"
    > strsplit(xx, NULL)
    [[1]]
    [1] "f" "a" "ç" "i" "l" "e"
    > charToRaw(strsplit(xx, NULL)[[1]][3])
    [1] c3 a7

on a UTF-8 system.

> But if you are also dealing with Unicode characters have a look at
> http://wiki.r-project.org/rwiki/doku.php?id=tips:data-strings:decomposestring

That is a misleading reference (to your own opinion, and it is usual in science to make clear what your source is when citing, especially if it is yourself). Unicode itself has combining diacritical marks as separate entries in the 'character code tables' at e.g. http://www.unicode.org/charts/, so your understanding of 'character' seems to differ from Unicode's.

You write about 'combined Unicode diacritics (accents)', which is misleading, as these are not accents (and it is 'combining' not 'combined', a crucial difference). To quote Alan Wood (http://www.alanwood.net/unicode/combining_diacritical_marks.html)

    The _characters_ in this range are designed to be used in combination
    with alphanumeric _characters_, to produce a character+diacritic that
    is not present in any of the Unicode ranges. For example, a&#777; to
    produce a lower case a with a hook above.

So they are used for very rare glyphs made up from two Unicode characters, and R correctly views them as two characters. (Actually R relies on the OS services to correctly identify characters, but that appears to have happened on the example on the RWiki page.)

You could have just thanked the R developers for ensuring that strsplit() does work as documented even in Unicode locales.

--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
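A minimal sketch of the behaviour described in this message, assuming an R session that can represent UTF-8 strings. The `\u00e7` escape stands in for the typed 'ç' so that the example does not depend on the file encoding; the variable names other than `xx` are made up for illustration.

```r
# 'façile' written with the precomposed character U+00E7 (c with cedilla)
xx <- "fa\u00e7ile"

# split = NULL splits into single characters, i.e. one element per code point here
chars <- strsplit(xx, NULL)[[1]]
print(chars)

# six characters, but seven bytes, because U+00E7 takes two bytes in UTF-8
n_chars <- nchar(xx, type = "chars")
n_bytes <- nchar(xx, type = "bytes")
print(c(chars = n_chars, bytes = n_bytes))

# the third element is the single character U+00E7, stored as the bytes c3 a7
third_raw <- charToRaw(chars[3])
print(third_raw)
```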
Re: [R] Substring and strsplit
On 1 Sep 2006, at 08:22, Prof Brian Ripley wrote:

> On Wed, 30 Aug 2006, Hans-Joerg Bibiko wrote:
>
>> If you are using 'only' English then
>>
>>     str <- "dog"
>>     strsplit(str, NULL)[[1]]
>>
>> works perfectly and it is fast.
>
> It does also work 'perfectly' and fast in 'Unicode' in all major
> European and CJK languages (and many others): extending the iconv
> example

YES, of course, you are right. R supports Unicode and other encodings very well. This is one of the reasons why I've chosen R for my purposes. If you look at my first example on the RWiki page, it contains Russian, German, and two Chinese characters to illustrate that the R function strsplit can handle this perfectly.

When I wrote about 'English' and 'Unicode', my only intention was to put it simply. My experience is that if I write about 'combining diacritics' or 'combining vowels' etc., some people don't understand these topics. If I write about 'Unicode', some have a vague idea of what I'm writing about. Of course, in a scientific context this is absolutely wrong and misleading!

> http://www.unicode.org/charts/, so your understanding of 'character'
> seems to differ from Unicode's.

Well, the term 'character' is highly ambiguous. So a better term would be 'glyph', to emphasise that I mean a representation of a grapheme. But still, even the terms 'glyph', 'grapheme', 'phoneme', etc. are ambiguous. Of course, my fault was that I didn't clarify my terminology beforehand.

> You write about 'combined Unicode diacritics (accents)', which is
> misleading, as these are not accents (and it is 'combining' not
> 'combined', a crucial difference).

This was my grammatical fault. Sorry. I corrected this.

> To quote Alan Wood
> (http://www.alanwood.net/unicode/combining_diacritical_marks.html)
>
>     The _characters_ in this range are designed to be used in combination
>     with alphanumeric _characters_, to produce a character+diacritic that
>     is not present in any of the Unicode ranges. For example, a&#777; to
>     produce a lower case a with a hook above.

Yes! This is right, but ...

To illustrate MY problem I use your French example with 'façile'.

>     > xx
>     [1] "façile"
>     > strsplit(xx, NULL)
>     [[1]]
>     [1] "f" "a" "ç" "i" "l" "e"
>     > charToRaw(strsplit(xx, NULL)[[1]][3])
>     [1] c3 a7
>
> on a UTF-8 system.

There are two ways to write 'façile' in Unicode:

1) f a ç i l e
2) f a c combining cedilla (\u0327) i l e

Now I use the R function strsplit and I get two different results.

    > a <- "façile"
    > strsplit(a, NULL)
    [[1]]
    [1] "f" "a" "ç" "i" "l" "e"
    > b <- "façile"
    > strsplit(b, NULL)
    [[1]]
    [1] "f" "a" "c" "̧" "i" "l" "e"

On the computer screen you don't see any difference between 1) and 2) (if your system supports this rendering).

The questions are always:
'What do I want to split?'
'What is a character/glyph in my context?'

I added another nice example to the wiki page
http://wiki.r-project.org/rwiki/doku.php?id=tips:data-strings:decomposestring

> So they are used for very rare glyphs made up from two Unicode
> characters, and R correctly views them as two characters.

R views them correctly if a character is defined as a single code point. On the other hand, in my research I'm using hundreds of languages that use these 'rare' glyphs!

To summarise:
- My intention was only to put it simply and briefly.
- It was NOT my intention to state that the R function strsplit doesn't support Unicode. The R developers did and are still doing a great job! Thank you so much!
- Last but not least, SORRY for my incompleteness!

With regards,
Hans
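The two spellings of 'façile' discussed in this message can be sketched in base R as follows. The helper `split_with_marks()` is hypothetical (it is not part of R or of the wiki page), and it assumes that only the Combining Diacritical Marks block U+0300 to U+036F needs to be attached to the preceding character; full grapheme segmentation follows more elaborate rules than this.

```r
a <- "fa\u00e7ile"    # precomposed: U+00E7
b <- "fac\u0327ile"   # decomposed: 'c' followed by combining cedilla U+0327

# strsplit() works on code points, so the two spellings give different lengths
n_a <- length(strsplit(a, NULL)[[1]])
n_b <- length(strsplit(b, NULL)[[1]])
print(c(precomposed = n_a, decomposed = n_b))

# Hypothetical helper: keep each combining mark (U+0300..U+036F only)
# together with the code point that precedes it.
split_with_marks <- function(s) {
  cp <- utf8ToInt(s)                       # code points of the string
  is_mark <- cp >= 0x0300 & cp <= 0x036F   # combining diacritical marks
  group <- cumsum(!is_mark)                # a new group starts at every non-mark
  unname(vapply(split(cp, group), intToUtf8, character(1)))
}

pieces_b <- split_with_marks(b)
print(pieces_b)
print(utf8ToInt(pieces_b[3]))   # the third piece holds two code points
```

Whether the third piece of `b` should count as one 'character' is exactly the question raised above; the sketch only shows that either answer can be computed.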
Re: [R] Substring and strsplit
you can also use substring(), e.g.,

    substring(x3, 1:nchar(x3), 1:nchar(x3))

Best,
Dimitris

Dimitris Rizopoulos
Ph.D. Student
Biostatistical Centre
School of Public Health
Catholic University of Leuven

Address: Kapucijnenvoer 35, Leuven, Belgium
Tel: +32/(0)16/336899
Fax: +32/(0)16/337015
Web: http://med.kuleuven.be/biostat/
     http://www.student.kuleuven.be/~m0390867/dimitris.htm

----- Original Message -----
From: Erin Hodgess [EMAIL PROTECTED]
To: r-help@stat.math.ethz.ch
Sent: Wednesday, August 30, 2006 12:25 AM
Subject: [R] Substring and strsplit

> Dear R People:
>
> I am trying to split a character vector into a set of individual letters:
>
> Ideal:
>
>     > x3 <- c("dog")
>     "d" "o" "g"
>
> I tried the following:
>
>     > strsplit(x3)
>     Error in strsplit(x3) : argument "split" is missing, with no default
>     > strsplit(x3,1)
>     [[1]]
>     [1] "dog"
>
> I know that this is incredibly simple, but what am I doing wrong?
>
> Either Windows or Linux 2.3.1
>
> Thanks in advance!

Disclaimer: http://www.kuleuven.be/cwis/email_disclaimer.htm
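A small sketch of the substring() suggestion, assuming `x3` holds the single string from the original question. The names `n`, `by_substring` and `by_strsplit` are introduced here only for illustration.

```r
x3 <- "dog"
n <- nchar(x3)

# take the substrings 1..1, 2..2, 3..3: one element per character
by_substring <- substring(x3, 1:n, 1:n)
print(by_substring)

# for this input the result matches splitting on the empty string
by_strsplit <- strsplit(x3, "")[[1]]
print(identical(by_substring, by_strsplit))
```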
Re: [R] Substring and strsplit
If you are using 'only' English then

    str <- "dog"
    strsplit(str, NULL)[[1]]

works perfectly and it is fast.

But if you are also dealing with Unicode characters, have a look at
http://wiki.r-project.org/rwiki/doku.php?id=tips:data-strings:decomposestring

Cheers,
Hans

> you can also use substring(), e.g.,
>
>     substring(x3, 1:nchar(x3), 1:nchar(x3))
>
> Best,
> Dimitris
>
> ----- Original Message -----
> From: Erin Hodgess [EMAIL PROTECTED]
> To: r-help@stat.math.ethz.ch
> Sent: Wednesday, August 30, 2006 12:25 AM
> Subject: [R] Substring and strsplit
>
>> Dear R People:
>>
>> I am trying to split a character vector into a set of individual letters:
>>
>> Ideal:
>>
>>     > x3 <- c("dog")
>>     "d" "o" "g"
>>
>> I tried the following:
>>
>>     > strsplit(x3)
>>     Error in strsplit(x3) : argument "split" is missing, with no default
>>     > strsplit(x3,1)
>>     [[1]]
>>     [1] "dog"
>>
>> I know that this is incredibly simple, but what am I doing wrong?
>>
>> Either Windows or Linux 2.3.1
>>
>> Thanks in advance!
[R] Substring and strsplit
Dear R People:

I am trying to split a character vector into a set of individual letters:

Ideal:

    > x3 <- c("dog")
    "d" "o" "g"

I tried the following:

    > strsplit(x3)
    Error in strsplit(x3) : argument "split" is missing, with no default
    > strsplit(x3,1)
    [[1]]
    [1] "dog"

I know that this is incredibly simple, but what am I doing wrong?

Either Windows or Linux 2.3.1

Thanks in advance!

Sincerely,
Erin Hodgess
Associate Professor
Department of Computer and Mathematical Sciences
University of Houston - Downtown
mailto: [EMAIL PROTECTED]
Re: [R] Substring and strsplit
On Tue, 29 Aug 2006, Erin Hodgess wrote:

> Dear R People:
>
> I am trying to split a character vector into a set of individual letters:
>
> Ideal:
>
>     > x3 <- c("dog")
>     "d" "o" "g"
>
> I tried the following:
>
>     > strsplit(x3)
>     Error in strsplit(x3) : argument "split" is missing, with no default
>     > strsplit(x3,1)
>     [[1]]
>     [1] "dog"
>
> I know that this is incredibly simple, but what am I doing wrong?

This is the first example on the help page for strsplit.

    -thomas
Re: [R] Substring and strsplit
Use '' as parameter to strsplit:

    > x3 <- 'dog'
    > strsplit(x3, '')
    [[1]]
    [1] "d" "o" "g"

On 8/29/06, Erin Hodgess [EMAIL PROTECTED] wrote:

> Dear R People:
>
> I am trying to split a character vector into a set of individual letters:
>
> Ideal:
>
>     > x3 <- c("dog")
>     "d" "o" "g"
>
> I tried the following:
>
>     > strsplit(x3)
>     Error in strsplit(x3) : argument "split" is missing, with no default
>     > strsplit(x3,1)
>     [[1]]
>     [1] "dog"
>
> I know that this is incredibly simple, but what am I doing wrong?
>
> Either Windows or Linux 2.3.1
>
> Thanks in advance!

--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?
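A short sketch of why `strsplit(x3, 1)` in the question returns the string unchanged while the empty-string split works. The explanation in the comments (that a non-character `split` is coerced to a character string) is how `strsplit()` treats that argument; the string `"d1o1g"` and the variable names other than `x3` are made up for illustration.

```r
x3 <- "dog"

# split = "" splits into individual characters
letters3 <- strsplit(x3, "")[[1]]
print(letters3)

# split = 1 is coerced to the string "1", which does not occur in "dog",
# so the input comes back as a single piece
unsplit3 <- strsplit(x3, 1)[[1]]
print(unsplit3)

# the same call does split a string that contains the character "1"
on_one <- strsplit("d1o1g", 1)[[1]]
print(on_one)
```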