Re: [R] Substring and strsplit

2006-09-01 Thread Prof Brian Ripley
On Wed, 30 Aug 2006, Hans-Joerg Bibiko wrote:

 If you are using 'only' English then
 
 str <- "dog"
 strsplit(str,NULL)[[1]]
 
 works perfectly and it is fast.

It does also work 'perfectly' and fast in 'Unicode' in all major European 
and CJK languages (and many others): extending the iconv example

> xx
[1] "façile"
> strsplit(xx, NULL)
[[1]]
[1] "f" "a" "ç" "i" "l" "e"
> charToRaw(strsplit(xx, NULL)[[1]][3])
[1] c3 a7

on a UTF-8 system.
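
A minimal sketch of the characters-versus-bytes distinction in the session above, assuming a UTF-8 session. The string is built with a \u escape (an assumption made here so the example does not depend on how the file is encoded); the variable names other than xx are illustrative.

```r
# U+00E7 is the precomposed c-cedilla; the escape avoids any dependence
# on the encoding of this file.
xx <- "fa\u00e7ile"

# Splitting with split = NULL yields one element per character.
chars <- strsplit(xx, NULL)[[1]]
print(chars)

n_chars <- nchar(xx, type = "chars")   # number of characters
n_bytes <- nchar(xx, type = "bytes")   # number of bytes in UTF-8

# The third character is a single character stored as two bytes.
raw_c <- charToRaw(chars[3])
print(raw_c)
```

The character count and the byte count differ by one because only the third character needs two bytes in UTF-8.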

 But if you are also dealing with Unicode characters, have a look at

http://wiki.r-project.org/rwiki/doku.php?id=tips:data-strings:decomposestring

That is a misleading reference (to your own opinion, and it is usual in 
science to make clear what your source is when citing, especially if it is 
yourself).  Unicode itself has combining diacritical marks as separate 
entries in the 'character code tables' at e.g. 
http://www.unicode.org/charts/, so your understanding of 'character' seems 
to differ from Unicode's.

You write about 'combined Unicode diacritics (accents)', which is 
misleading, as these are not accents (and it is 'combining' not 
'combined', a crucial difference).  To quote Alan Wood 
(http://www.alanwood.net/unicode/combining_diacritical_marks.html)

  The _characters_ in this range are designed to be used in combination 
  with alphanumeric _characters_, to produce a character+diacritic that
  is not present in any of the Unicode ranges. For example, a&#777;
  to produce a lower case a with a hook above.

So they are used for very rare glyphs made up from two Unicode characters, 
and R correctly views them as two characters.  (Actually R relies on the 
OS services to correctly identify characters, but that appears to have 
happened on the example on the RWiki page.)
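
A short sketch of how the two code points in such a sequence can be inspected, assuming a UTF-8 session. It uses "a" followed by U+0309 (decimal 777, the combining hook above from the quotation); the variable names are illustrative.

```r
# "a" followed by U+0309 (COMBINING HOOK ABOVE); decimal 777 is hex 0309.
s <- "a\u0309"

# utf8ToInt() returns the code points of a single UTF-8 string.
code_points <- utf8ToInt(s)
labels <- sprintf("U+%04X", code_points)
print(labels)

# Two code points, so nchar() and strsplit() both see two characters,
# even though the sequence is drawn as one glyph.
n <- nchar(s, type = "chars")
pieces <- strsplit(s, NULL)[[1]]
```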

You could have just thanked the R developers for ensuring that strsplit() 
does work as documented even in Unicode locales.

-- 
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK  Fax:  +44 1865 272595
__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Substring and strsplit

2006-09-01 Thread Hans-Joerg Bibiko

On 1 Sep 2006, at 08:22, Prof Brian Ripley wrote:

 On Wed, 30 Aug 2006, Hans-Joerg Bibiko wrote:

 If you are using 'only' English then

 str <- "dog"
 strsplit(str,NULL)[[1]]

 works perfectly and it is fast.

 It does also work 'perfectly' and fast in 'Unicode' in all major European
 and CJK languages (and many others): extending the iconv example


YES, of course, you are right. R supports Unicode and other encodings  
very well. This is one of the reasons why I've chosen R for my purposes.

If you look at my first example at this Rwiki-site, it contains  
Russian, German, and two Chinese characters to illustrate that the R  
function strsplit can handle this perfectly.


If I wrote about 'English' and 'Unicode', my only intention was to put
it simply.
In my experience, if I write about 'combining diacritics' or
'combining vowels' etc., some people don't understand these topics.
If I write about 'Unicode', some at least have a vague idea of what I'm
writing about.
Of course, in a scientific context this is absolutely wrong and
misleading!

 http://www.unicode.org/charts/, so your understanding of 'character' seems
 to differ from Unicode's.


Well, the term 'character' is highly ambiguous. So a better term
would be 'glyph', to emphasise that I mean a representation of a grapheme.
But still, even the terms 'glyph', 'grapheme', 'phoneme', etc. are
also ambiguous.
Of course, my fault was that I didn't clarify my terminology
beforehand.

 You write about 'combined Unicode diacritics (accents)', which is
 misleading, as these are not accents (and it is 'combining' not
 'combined', a crucial difference).

This was my grammatical fault. Sorry. I corrected this.

 To quote Alan Wood
 (http://www.alanwood.net/unicode/combining_diacritical_marks.html)

   The _characters_ in this range are designed to be used in combination
   with alphanumeric _characters_, to produce a character+diacritic that
   is not present in any of the Unicode ranges. For example, a&#777;
   to produce a lower case a with a hook above.


Yes! This is right, but ...

To illustrate MY problem I use your French example with 'façile'.


 > xx
 [1] "façile"
 > strsplit(xx, NULL)
 [[1]]
 [1] "f" "a" "ç" "i" "l" "e"
 > charToRaw(strsplit(xx, NULL)[[1]][3])
 [1] c3 a7

 on a UTF-8 system.



There are two ways to write 'façile' in Unicode:
1) f a ç i l e
2) f a c combining cedilla (\u0327) i l e

Now I use the R function strsplit and I will get two different results.

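> a <- "façile"          # ç as one precomposed code point (U+00E7)
> strsplit(a,NULL)
[[1]]
[1] "f" "a" "ç" "i" "l" "e"

> b <- "fac\u0327ile"    # c followed by combining cedilla (U+0327)
> strsplit(b,NULL)
[[1]]
[1] "f" "a" "c" "̧" "i" "l" "e"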


On the computer screen you don't see any difference between 1) and 2)
(if your system supports this rendering).

Always, the questions are: 'What do I want to split?' 'What is a  
character/glyph in my context?'
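
If "character" is taken to mean a base character together with its combining marks, one possible sketch in base R is a Perl-style regular expression. The helper name split_clusters is hypothetical (it is not an R function), the approach assumes a UTF-8 session and a PCRE build with Unicode property support, and it is only an approximation of full Unicode grapheme-cluster segmentation.

```r
# Hypothetical helper, not part of base R: split one string into
# "base character plus any following combining marks".
# \P{M} matches one code point that is not a mark; \p{M}* matches the
# combining marks attached to it. Requires perl = TRUE.
split_clusters <- function(x) {
  regmatches(x, gregexpr("\\P{M}\\p{M}*", x, perl = TRUE))[[1]]
}

precomposed <- "fa\u00e7ile"     # c-cedilla as one code point (U+00E7)
decomposed  <- "fac\u0327ile"    # "c" followed by U+0327

# Splitting by code point gives a different count for each spelling.
a_points <- strsplit(precomposed, NULL)[[1]]
b_points <- strsplit(decomposed, NULL)[[1]]

# Splitting by cluster keeps "c" and its cedilla together.
b_clusters <- split_clusters(decomposed)

# The two spellings are still different strings.
same <- identical(precomposed, decomposed)
```

Here the decomposed spelling has one more code point than the precomposed one, while the cluster split has the same number of pieces as the letters seen on screen.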

I added another nice example to the wiki page:
http://wiki.r-project.org/rwiki/doku.php?id=tips:data-strings:decomposestring


 So they are used for very rare glyphs made up from two Unicode characters,
 and R correctly views them as two characters.

R views them correctly if a character is defined as a single code point.
On the other hand, in my research I'm using hundreds of languages  
using these 'rare' glyphs!

To summarise:
- My intention was only to put it simply and briefly.
- It was NOT my intention to state that the R function strsplit
doesn't support Unicode.
   The R developers did and are still doing a great job! Thank you so much!
- Last but not least, SORRY for my incompleteness!

With regards,

Hans



Re: [R] Substring and strsplit

2006-08-30 Thread Dimitris Rizopoulos
you can also use substring(), e.g.,

substring(x3, 1:nchar(x3), 1:nchar(x3))
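
A small sketch comparing this with strsplit(), assuming a single non-empty string; the variable names other than x3 are illustrative.

```r
x3 <- "dog"

# One substring per position: the start and end indices are both 1, 2, 3.
n <- nchar(x3)
by_substring <- substring(x3, 1:n, 1:n)

# The same split via strsplit(); [[1]] unwraps the one-element list.
by_strsplit <- strsplit(x3, NULL)[[1]]

print(by_substring)
```

substring() returns a plain character vector here, whereas strsplit() returns a list with one element per input string.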


Best,
Dimitris


Dimitris Rizopoulos
Ph.D. Student
Biostatistical Centre
School of Public Health
Catholic University of Leuven

Address: Kapucijnenvoer 35, Leuven, Belgium
Tel: +32/(0)16/336899
Fax: +32/(0)16/337015
Web: http://med.kuleuven.be/biostat/
 http://www.student.kuleuven.be/~m0390867/dimitris.htm


- Original Message - 
From: Erin Hodgess [EMAIL PROTECTED]
To: r-help@stat.math.ethz.ch
Sent: Wednesday, August 30, 2006 12:25 AM
Subject: [R] Substring and strsplit


 Dear R People:

 I am trying to split a character vector into a set of individual
 letters:

 Ideal:
 x3 <- c("dog")
 "d" "o" "g"

 I tried the following:
 > strsplit(x3)
 Error in strsplit(x3) : argument "split" is missing, with no default
 > strsplit(x3,1)
 [[1]]
 [1] "dog"

 I know that this is incredibly simple, but what am I doing wrong?

 Either Windows or Linux 2.3.1

 Thanks in advance!


 Sincerely,
 Erin Hodgess
 Associate Professor
 Department of Computer and Mathematical Sciences
 University of Houston - Downtown
 mailto: [EMAIL PROTECTED]

 




Re: [R] Substring and strsplit

2006-08-30 Thread Hans-Joerg Bibiko
If you are using 'only' English then

str <- "dog"
strsplit(str,NULL)[[1]]

works perfectly and it is fast.

But if you are also dealing with Unicode characters, have a look at

http://wiki.r-project.org/rwiki/doku.php?id=tips:data-strings:decomposestring

Cheers,

Hans



 you can also use substring(), e.g.,

 substring(x3, 1:nchar(x3), 1:nchar(x3))


 Best,
 Dimitris

 
 Dimitris Rizopoulos
 Ph.D. Student
 Biostatistical Centre
 School of Public Health
 Catholic University of Leuven

 Address: Kapucijnenvoer 35, Leuven, Belgium
 Tel: +32/(0)16/336899
 Fax: +32/(0)16/337015
 Web: http://med.kuleuven.be/biostat/
  http://www.student.kuleuven.be/~m0390867/dimitris.htm


 - Original Message -
 From: Erin Hodgess [EMAIL PROTECTED]
 To: r-help@stat.math.ethz.ch
 Sent: Wednesday, August 30, 2006 12:25 AM
 Subject: [R] Substring and strsplit



 Dear R People:

 I am trying to split a character vector into a set of individual
 letters:

 Ideal:
 x3 <- c("dog")
 "d" "o" "g"

 I tried the following:

 > strsplit(x3)

 Error in strsplit(x3) : argument "split" is missing, with no default

 > strsplit(x3,1)

 [[1]]
 [1] "dog"

 I know that this is incredibly simple, but what am I doing wrong?

 Either Windows or Linux 2.3.1

 Thanks in advance!


 Sincerely,
 Erin Hodgess
 Associate Professor
 Department of Computer and Mathematical Sciences
 University of Houston - Downtown
 mailto: [EMAIL PROTECTED]






[R] Substring and strsplit

2006-08-29 Thread Erin Hodgess
Dear R People:

I am trying to split a character vector into a set of individual
letters:

Ideal:
x3 <- c("dog")
"d" "o" "g"

I tried the following:
> strsplit(x3)
Error in strsplit(x3) : argument "split" is missing, with no default
> strsplit(x3,1)
[[1]]
[1] "dog"

I know that this is incredibly simple, but what am I doing wrong?

R 2.3.1 on either Windows or Linux

Thanks in advance!


Sincerely,
Erin Hodgess
Associate Professor
Department of Computer and Mathematical Sciences
University of Houston - Downtown
mailto: [EMAIL PROTECTED]



Re: [R] Substring and strsplit

2006-08-29 Thread Thomas Lumley
On Tue, 29 Aug 2006, Erin Hodgess wrote:

 Dear R People:

 I am trying to split a character vector into a set of individual
 letters:

 Ideal:
 x3 <- c("dog")
 "d" "o" "g"

 I tried the following:
 > strsplit(x3)
 Error in strsplit(x3) : argument "split" is missing, with no default
 > strsplit(x3,1)
 [[1]]
 [1] "dog"

 I know that this is incredibly simple, but what am I doing wrong?


This is the first example on the help page for strsplit.

-thomas



Re: [R] Substring and strsplit

2006-08-29 Thread jim holtman
Use '' as the split argument to strsplit:

> x3 <- 'dog'
> strsplit(x3, '')
[[1]]
[1] "d" "o" "g"
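
A sketch of why the attempt with split = 1 returned the whole word, assuming (as the behaviour in the question suggests) that a numeric split is coerced to the string "1". The variable names other than x3 are illustrative.

```r
x3 <- "dog"

# split = 1 is treated as the string "1"; "dog" contains no "1",
# so the input comes back as a single piece.
no_match <- strsplit(x3, 1)[[1]]

# The same call does split a string that contains the character "1".
with_ones <- strsplit("d1o1g", 1)[[1]]

# An empty split ("" or NULL) splits into individual characters.
by_empty <- strsplit(x3, "")[[1]]
by_null  <- strsplit(x3, NULL)[[1]]

print(by_empty)
```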




On 8/29/06, Erin Hodgess [EMAIL PROTECTED] wrote:
 Dear R People:

 I am trying to split a character vector into a set of individual
 letters:

 Ideal:
 x3 <- c("dog")
 "d" "o" "g"

 I tried the following:
 > strsplit(x3)
 Error in strsplit(x3) : argument "split" is missing, with no default
 > strsplit(x3,1)
 [[1]]
 [1] "dog"

 I know that this is incredibly simple, but what am I doing wrong?

 Either Windows or Linux 2.3.1

 Thanks in advance!


 Sincerely,
 Erin Hodgess
 Associate Professor
 Department of Computer and Mathematical Sciences
 University of Houston - Downtown
 mailto: [EMAIL PROTECTED]




-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem you are trying to solve?
