Re: [R] sort() depends on locale (and platform and build)

2014-06-15 Thread Marius Hofert
Hi,

... so something like this? [in foo.R]

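old.coll <- Sys.getlocale("LC_COLLATE")       # remember the current collation
Sys.setlocale("LC_COLLATE", locale="C")       # switch to "C" collation
## ... code that relies on the "C" sort order goes here ...
Sys.setlocale("LC_COLLATE", locale=old.coll)  # restore the previous collation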

Cheers,

Marius

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
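
A slightly more defensive variant of the save/switch/restore pattern in the message above wraps it in a function, so that the previous collation is restored even if the code in between fails. This is only a sketch: with_collate is a hypothetical helper name, not something from the thread or from base R.

```r
# Hypothetical helper (not from the thread): evaluate `expr` with a temporary
# LC_COLLATE setting and restore the previous setting afterwards.
with_collate <- function(locale, expr) {
  old <- Sys.getlocale("LC_COLLATE")
  # on.exit() runs even if `expr` throws an error, so the old value comes back.
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", locale)
  # `expr` is a lazily evaluated argument; forcing it here means it is
  # evaluated while the temporary collation is in effect.
  force(expr)
}

x <- c("L.Y", "Lu", "L.Q")
with_collate("C", sort(x))
# [1] "L.Q" "L.Y" "Lu"
```

In the "C" locale strings are compared byte by byte, and "." sorts before all letters, which gives the order reported for "C" in the thread.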


Re: [R] sort() depends on locale (and platform and build)

2014-06-15 Thread Prof Brian Ripley

On 15/06/2014 17:34, Marius Hofert wrote:
> Hi,
>
> Thanks for your help. I use R-devel under Ubuntu 14.04, here is the output of
> sessionInfo():
>
>> sessionInfo()
> R Under development (unstable) (2014-06-02 r65832)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> loaded via a namespace (and not attached):
> [1] compiler_3.2.0 tools_3.2.0
>
> I assume ICU was not found/installed when R was installed as executing the first
> couple of lines of the examples section of ?icuSetCollate leads to:
>
> Warning message:
> In icuSetCollate(case_first = "upper") : ICU is not supported on this build
> [1] "aarhus" "Aarhus" "safe"   "test"   "Zoo"
>
> Since only the (default) locale "C" gives the order I expected, I consider
> changing my ~/.Rprofile. But there was certainly a reason why I changed it to
> "en_US.UTF-8" at some point... hope that does not break anything else. Is there
> any "recommendation" what to use in ~/.Rprofile (the default?)? And is the
> 'recommended approach' to have ICU installed and change the sorting order via
> icuSetCollate if necessary?


Yes.  (You can use the locale category LC_COLLATE or icuSetCollate, but 
the recommended way to do the first is via the environment variables, 
not in .Rprofile.)
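
For illustration, assuming a Unix-alike where R is started from a shell: the locale environment variables are set before R starts (for example by launching R as `LC_COLLATE=C R`), and the running session can then be inspected as sketched below. collation_settings is a hypothetical helper name introduced here; under POSIX rules LC_ALL takes precedence over LC_COLLATE, which takes precedence over LANG.

```r
# Hypothetical helper: report the collation-related environment variables
# and the collation locale actually in effect in this session.
collation_settings <- function() {
  list(
    # NA marks a variable that is not set at all.
    env = Sys.getenv(c("LC_ALL", "LC_COLLATE", "LANG"), unset = NA),
    in_effect = Sys.getlocale("LC_COLLATE")
  )
}

collation_settings()
```

If in_effect differs from what the environment variables suggest, something later in startup (such as a Sys.setlocale() call in ~/.Rprofile) has probably changed it.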




> I would not have expected any influence of the locale on the sorting order,
> that's quite good to know. In fact, the example came up after I tried to sort
> students' grades in a class with several students having the same last name
> (which I made unique by adding the first names with a '.' separator)... quite a
> 'delicate' issue...
>
> Cheers,
>
> Marius




--
Brian D. Ripley,                  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595



Re: [R] sort() depends on locale (and platform and build)

2014-06-15 Thread Marius Hofert
Hi,

Thanks for your help. I use R-devel under Ubuntu 14.04, here is the output of
sessionInfo():

> sessionInfo()
R Under development (unstable) (2014-06-02 r65832)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_3.2.0 tools_3.2.0


I assume ICU was not found/installed when R was installed as executing the first
couple of lines of the examples section of ?icuSetCollate leads to:

Warning message:
In icuSetCollate(case_first = "upper") : ICU is not supported on this build
[1] "aarhus" "Aarhus" "safe"   "test"   "Zoo"


Since only the (default) locale "C" gives the order I expected, I consider
changing my ~/.Rprofile. But there was certainly a reason why I changed it to
"en_US.UTF-8" at some point... hope that does not break anything else. Is there
any "recommendation" what to use in ~/.Rprofile (the default?)? And is the
'recommended approach' to have ICU installed and change the sorting order via
icuSetCollate if necessary?

I would not have expected any influence of the locale on the sorting order,
that's quite good to know. In fact, the example came up after I tried to sort
students' grades in a class with several students having the same last name
(which I made unique by adding the first names with a '.' separator)... quite a
'delicate' issue...

Cheers,

Marius
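
For the grades use case described above, one way to sidestep the '.' separator altogether is to keep last and first names in separate columns and order on both. The sketch below uses made-up names and grades, and assumes R >= 3.3.0, where method = "radix" orders character vectors in the C locale regardless of LC_COLLATE.

```r
# Hypothetical class list; names and grades are invented for illustration.
grades <- data.frame(
  last  = c("Lu", "L", "L", "Lu"),
  first = c("A", "Y", "Q", "B"),
  grade = c(1.7, 2.0, 1.3, 2.3),
  stringsAsFactors = FALSE
)

# Order by last name, breaking ties by first name. With method = "radix"
# the comparison is bytewise (C locale), so it is the same on every platform.
ord <- order(grades$last, grades$first, method = "radix")
grades[ord, ]
```

Because no separator character takes part in the comparison, the result does not depend on how a given locale treats punctuation.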



Re: [R] sort() depends on locale

2014-06-15 Thread Prof Brian Ripley

On 15/06/2014 12:16, Duncan Murdoch wrote:
> On 15/06/2014, 1:15 AM, Marius Hofert wrote:
>> Hi,
>>
>> If I use invisible(Sys.setlocale("LC_COLLATE", "C")) in ~/.Rprofile, then
>>
>>> sort(c("L.Y", "Lu", "L.Q"))
>> [1] "L.Q" "L.Y" "Lu"
>>
>> whereas using invisible(Sys.setlocale("LC_COLLATE", "en_US.UTF-8")) results in
>>
>>> sort(c("L.Y", "Lu", "L.Q"))
>> [1] "L.Q" "Lu"  "L.Y"
>>
>> I know this issue has appeared already
>> (https://stat.ethz.ch/pipermail/r-help//2012-February/304089.html), I
>> just don't see a reason for the second output: either '.' comes before
>> letters, then the result should be
>> "L.Q" "L.Y" "Lu" or it comes afterwards, then it should be "Lu" "L.Q"
>> "L.Y" -- the above result thus seems inconsistent with any useful notion
>> of 'sort' (?)
>
> I don't see this either, but it appears that on your platform the "." is
> simply being ignored, which might be a useful kind of sorting in some
> contexts.


ICU implements that:

icuSetCollate(locale="en_US", alternate_handling="shifted")
sort(c("L.Y", "Lu", "L.Q"))

See ?icuSetCollate and the references there and in ?Comparison.
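
Since icuSetCollate() only warns on builds without ICU, a guarded version of the two lines above might look like the following sketch. It assumes an R version in which capabilities() reports an "ICU" entry; note also that, per ?Comparison, ICU is not used for collation when LC_COLLATE is "C" or "POSIX", so the setting may have no visible effect there.

```r
x <- c("L.Y", "Lu", "L.Q")

# isTRUE() also covers R versions where capabilities() has no "ICU" entry.
has_icu <- isTRUE(unname(capabilities("ICU")))

if (has_icu) {
  # "shifted" asks ICU to treat punctuation such as "." as ignorable
  # when comparing strings.
  icuSetCollate(locale = "en_US", alternate_handling = "shifted")
} else {
  message("ICU is not available in this build; using the platform's collation")
}

res <- sort(x)
print(res)
```

With ICU active in a non-C locale this should give the punctuation-ignoring order discussed in the thread; without ICU the result is whatever the platform's own collation produces.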





Re: [R] sort() depends on locale

2014-06-15 Thread Duncan Murdoch
On 15/06/2014, 1:15 AM, Marius Hofert wrote:
> Hi,
> 
> If I use invisible(Sys.setlocale("LC_COLLATE", "C")) in ~/.Rprofile, then
> 
>> sort(c("L.Y", "Lu", "L.Q"))
> [1] "L.Q" "L.Y" "Lu"
> 
> whereas using invisible(Sys.setlocale("LC_COLLATE", "en_US.UTF-8")) results in
> 
>> sort(c("L.Y", "Lu", "L.Q"))
> [1] "L.Q" "Lu"  "L.Y"
> 
> I know this issue has appeared already
> (https://stat.ethz.ch/pipermail/r-help//2012-February/304089.html), I
> just don't see a reason for the second output: either '.' comes before
> letters, then the result should be
> "L.Q" "L.Y" "Lu" or it comes afterwards, then it should be "Lu" "L.Q"
> "L.Y" -- the above result thus seems inconsistent with any useful notion
> of 'sort' (?)

I don't see this either, but it appears that on your platform the "." is
simply being ignored, which might be a useful kind of sorting in some
contexts.

Duncan Murdoch
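
The "." being ignored reading can be checked with a small sketch that sorts on derived keys. The key rule used here (drop every "." and fold case) is an assumption made for illustration, not the actual en_US.UTF-8 collation algorithm, and method = "radix" assumes R >= 3.3.0.

```r
x <- c("L.Y", "Lu", "L.Q")

# Build comparison keys by removing "." and lower-casing (assumed rule).
keys <- tolower(gsub(".", "", x, fixed = TRUE))

# method = "radix" orders the keys bytewise (C locale), so the outcome
# does not depend on the session's LC_COLLATE.
x_sorted <- x[order(keys, method = "radix")]
x_sorted
# [1] "L.Q" "Lu"  "L.Y"
```

The keys are "ly", "lu" and "lq", so the sorted result matches the order reported under en_US.UTF-8, which is consistent with punctuation being skipped on a first pass.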



Re: [R] sort() depends on locale (and platform and build)

2014-06-14 Thread Prof Brian Ripley

On 15/06/2014 07:45, Pascal Oettli wrote:
> Hello,
>
> Please provide your sessionInfo(). I don't see this issue with R 3.1.0
> Patched on Linux.

Nor on any of my platforms.

We would also need to know if ICU was found when R was installed: see
?Comparison.

> Regards,
> Pascal
>
> On Sun, Jun 15, 2014 at 2:15 PM, Marius Hofert wrote:
>> Hi,
>>
>> If I use invisible(Sys.setlocale("LC_COLLATE", "C")) in ~/.Rprofile, then
>>
>>> sort(c("L.Y", "Lu", "L.Q"))
>> [1] "L.Q" "L.Y" "Lu"
>>
>> whereas using invisible(Sys.setlocale("LC_COLLATE", "en_US.UTF-8")) results in
>>
>>> sort(c("L.Y", "Lu", "L.Q"))
>> [1] "L.Q" "Lu"  "L.Y"
>>
>> I know this issue has appeared already
>> (https://stat.ethz.ch/pipermail/r-help//2012-February/304089.html), I
>> just don't see a reason for the second output: either '.' comes before
>> letters, then the result should be
>> "L.Q" "L.Y" "Lu" or it comes afterwards, then it should be "Lu" "L.Q"
>> "L.Y" -- the above result thus seems inconsistent with any useful notion
>> of 'sort' (?)
>>
>> Cheers,
>>
>> Marius




Re: [R] sort() depends on locale

2014-06-14 Thread Pascal Oettli
Hello,

Please provide your sessionInfo(). I don't see this issue with R 3.1.0
Patched on Linux.

Regards,
Pascal

On Sun, Jun 15, 2014 at 2:15 PM, Marius Hofert wrote:
> Hi,
>
> If I use invisible(Sys.setlocale("LC_COLLATE", "C")) in ~/.Rprofile, then
>
>> sort(c("L.Y", "Lu", "L.Q"))
> [1] "L.Q" "L.Y" "Lu"
>
> whereas using invisible(Sys.setlocale("LC_COLLATE", "en_US.UTF-8")) results in
>
>> sort(c("L.Y", "Lu", "L.Q"))
> [1] "L.Q" "Lu"  "L.Y"
>
> I know this issue has appeared already
> (https://stat.ethz.ch/pipermail/r-help//2012-February/304089.html), I
> just don't see a reason for the second output: either '.' comes before
> letters, then the result should be
> "L.Q" "L.Y" "Lu" or it comes afterwards, then it should be "Lu" "L.Q"
> "L.Y" -- the above result thus seems inconsistent with any useful notion
> of 'sort' (?)
>
> Cheers,
>
> Marius



-- 
Pascal Oettli
Project Scientist
JAMSTEC
Yokohama, Japan



[R] sort() depends on locale

2014-06-14 Thread Marius Hofert
Hi,

If I use invisible(Sys.setlocale("LC_COLLATE", "C")) in ~/.Rprofile, then

> sort(c("L.Y", "Lu", "L.Q"))
[1] "L.Q" "L.Y" "Lu"

whereas using invisible(Sys.setlocale("LC_COLLATE", "en_US.UTF-8")) results in

> sort(c("L.Y", "Lu", "L.Q"))
[1] "L.Q" "Lu"  "L.Y"

I know this issue has appeared already
(https://stat.ethz.ch/pipermail/r-help//2012-February/304089.html), I
just don't see a reason for the second output: either '.' comes before
letters, then the result should be
"L.Q" "L.Y" "Lu" or it comes afterwards, then it should be "Lu" "L.Q"
"L.Y" -- the above result thus seems inconsistent with any useful notion
of 'sort' (?)

Cheers,

Marius
