Re: [Rd] Bug in rank with utf8?

2015-08-14 Thread peter dalgaard

> On 14 Aug 2015, at 08:10, Prof Brian Ripley wrote:
> 
> E.g. on my Yosemite system in en_US.UTF-8
> 
>> rank(c(x, y))
> [1] 1.5 1.5
> 

..which differs from my Mavericks system but not my Yosemite system, both in 
en_US.UTF-8, both with icuGetCollate returning "root"... Oh, well.

-pd

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd@cbs.dk  Priv: pda...@gmail.com

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Bug in rank with utf8?

2015-08-13 Thread Prof Brian Ripley

On 13/08/2015 15:19, peter dalgaard wrote:

> Yes, collation is a strange thing, and?


And remember that on some platforms (including yours) ICU is used, so 
LC_COLLATE is not particularly relevant (unless it is 'C').  See 
?Comparisons and ?icuGetCollate.


E.g. on my Yosemite system in en_US.UTF-8

> rank(c(x, y))
[1] 1.5 1.5
> icuGetCollate()
[1] "root"
> icuSetCollate(locale="ASCII")
> rank(c(x, y))
[1] 2 1

whereas on Fedora 21

> rank(c(x, y))
[1] 2 1
> icuGetCollate()
[1] "root"
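
A minimal sketch of how one might check which collation machinery is in effect, and pin it for a single call so that results are comparable across the platforms mentioned above. The helper name rank_c is hypothetical (it does not appear in this thread). The sketch assumes that setting LC_COLLATE to "C" falls back to plain byte-order comparison, as ?Comparisons describes, and that R is recent enough to provide icuGetCollate().

```r
# Inspect the collation setup; output varies by platform and R build.
print(capabilities("ICU"))           # TRUE if this build of R can use ICU
print(Sys.getlocale("LC_COLLATE"))   # collation locale of the C library
print(icuGetCollate())               # ICU collator locale, if ICU is in use

x <- "\u0663"   # ARABIC-INDIC DIGIT THREE
y <- "3"

# rank_c (hypothetical helper): rank under the "C" collation, then restore
# the previous LC_COLLATE setting when the function exits.
rank_c <- function(v) {
  old <- Sys.getlocale("LC_COLLATE")
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  rank(v)
}

rank_c(c(x, y))   # the thread reports 2 1 under the "C" collation
```

Restoring the old setting in on.exit() keeps the change local, so the rest of the session still sorts according to the user's locale.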












--
Brian D. Ripley,  rip...@stats.ox.ac.uk
Emeritus Professor of Applied Statistics, University of Oxford
1 South Parks Road, Oxford OX1 3TG, UK



Re: [Rd] Bug in rank with utf8?

2015-08-13 Thread peter dalgaard

Yes, collation is a strange thing, and?

Collation order will depend on locale settings, and there are quite a few cases 
where the collation order of two items is not defined. 

To add to the confusion, on OSX Mavericks, I see

> x <- "\u0663"
> y <- 3
> 
> x == y
[1] FALSE
> rank(c(x, y))
[1] 2 1
> x
[1] "٣"
> x == y
[1] FALSE
> x > y
[1] TRUE
> x < y
[1] FALSE

> Sys.getlocale()
[1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"
> Sys.getlocale("LC_COLLATE")
[1] "en_US.UTF-8"

Notice the differences from en_US.UTF8 (sans hyphen) on your system

-pd

On 13 Aug 2015, at 16:01, John McKown wrote:

> 2015-08-13 8:39 GMT-05:00 Hadley Wickham:
> 
>> x <- "\u0663"
>> y <- 3
>> 
>> x == y
>> # FALSE
>> rank(c(x, y))
>> # c(1.5, 1.5)
>> 
>
> Also interesting, and confusing to me:
> 
>> x == y
> [1] FALSE
>> x > y
> [1] FALSE
>> x < y
> [1] FALSE
>> 
> 
> With some slight changes:
> 
>> x <- "\u0663"
>> y <- "3"
>> xy <- c(x,y)
>> rank(xy);
> [1] 1.5 1.5
>> Sys.getlocale();
> [1]
> "LC_CTYPE=en_US.UTF8;LC_NUMERIC=C;LC_TIME=en_US.UTF8;LC_COLLATE=en_US.UTF8;LC_MONETARY=en_US.UTF8;LC_MESSAGES=en_US.UTF8;LC_PAPER=en_US.UTF8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF8;LC_IDENTIFICATION=C"
>> Sys.setlocale(category="LC_COLLATE", locale="C");
> [1] "C"
>> rank(xy);
> [1] 2 1

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd@cbs.dk  Priv: pda...@gmail.com



Re: [Rd] Bug in rank with utf8?

2015-08-13 Thread John McKown

Once again, I did not read the Subject correctly. I switched away
from UTF8 in my second test.

On Thu, Aug 13, 2015 at 9:01 AM, John McKown wrote:

> 2015-08-13 8:39 GMT-05:00 Hadley Wickham :
>
>> x <- "\u0663"
>> y <- 3
>>
>> x == y
>> # FALSE
>> rank(c(x, y))
>> # c(1.5, 1.5)
>>
>
> Also interesting, and confusing to me:
>
> > x == y
> [1] FALSE
> > x > y
> [1] FALSE
> > x < y
> [1] FALSE
> >
>
> With some slight changes:
>
> > x <- "\u0663"
> > y <- "3"
> > xy <- c(x,y)
> > rank(xy);
> [1] 1.5 1.5
> > Sys.getlocale();
> [1]
> "LC_CTYPE=en_US.UTF8;LC_NUMERIC=C;LC_TIME=en_US.UTF8;LC_COLLATE=en_US.UTF8;LC_MONETARY=en_US.UTF8;LC_MESSAGES=en_US.UTF8;LC_PAPER=en_US.UTF8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF8;LC_IDENTIFICATION=C"
> > Sys.setlocale(category="LC_COLLATE", locale="C");
> [1] "C"
> > rank(xy);
> [1] 2 1



-- 

Schrodinger's backup: The condition of any backup is unknown until a
restore is attempted.

Yoda of Borg, we are. Futile, resistance is, yes. Assimilated, you will be.

He's about as useful as a wax frying pan.

10 to the 12th power microphones = 1 Megaphone

Maranatha! <><
John McKown



Re: [Rd] Bug in rank with utf8?

2015-08-13 Thread John McKown

2015-08-13 8:39 GMT-05:00 Hadley Wickham:

> x <- "\u0663"
> y <- 3
>
> x == y
> # FALSE
> rank(c(x, y))
> # c(1.5, 1.5)
>

Also interesting, and confusing to me:

> x == y
[1] FALSE
> x > y
[1] FALSE
> x < y
[1] FALSE
>

With some slight changes:

> x <- "\u0663"
> y <- "3"
> xy <- c(x,y)
> rank(xy);
[1] 1.5 1.5
> Sys.getlocale();
[1]
"LC_CTYPE=en_US.UTF8;LC_NUMERIC=C;LC_TIME=en_US.UTF8;LC_COLLATE=en_US.UTF8;LC_MONETARY=en_US.UTF8;LC_MESSAGES=en_US.UTF8;LC_PAPER=en_US.UTF8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF8;LC_IDENTIFICATION=C"
> Sys.setlocale(category="LC_COLLATE", locale="C");
[1] "C"
> rank(xy);
[1] 2 1
>
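
One possible reading of the three FALSE results above, offered as a sketch rather than a statement about the implementation: `==` on character vectors tests whether the two strings are the same, whereas `<`, `>`, sort() and rank() go through the collating sequence, and a collator may give two different digit characters the same weight. The helper names collation_tie and radix_rank are hypothetical. The second helper assumes an R version newer than this thread (3.3.0 or later), where order(method = "radix") is documented to use C-locale ordering for character vectors.

```r
x <- "\u0663"   # ARABIC-INDIC DIGIT THREE
y <- "3"

# collation_tie (hypothetical helper): TRUE when neither string sorts
# before the other under the collation currently in use.
collation_tie <- function(a, b) !(a < b) && !(a > b)

x == y                 # FALSE: the two strings are different
collation_tie(x, y)    # platform-dependent; TRUE would explain the 1.5 1.5

# Assumption: method = "radix" orders character vectors as in the C locale,
# whatever the session locale is. radix_rank (hypothetical helper) turns
# that ordering into ranks; identical strings are ranked in input order.
radix_rank <- function(v) order(order(v, method = "radix"))
radix_rank(c(x, y))
```

Ranking this way trades linguistic ordering for reproducibility, which may or may not be what a given analysis wants.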



-- 

John McKown



[Rd] Bug in rank with utf8?

2015-08-13 Thread Hadley Wickham

x <- "\u0663"
y <- 3

x == y
# FALSE
rank(c(x, y))
# c(1.5, 1.5)
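
Since y is numeric in this report, it may help to note that both c(x, y) and x == y coerce it to the string "3" before anything is compared, so the question is really about two character strings. A small sketch using only base R:

```r
x <- "\u0663"   # ARABIC-INDIC DIGIT THREE
y <- 3          # numeric, as in the report

xy <- c(x, y)           # c() coerces to the common type, character
typeof(xy)              # "character"
identical(xy[2], "3")   # TRUE: the number has become the string "3"

utf8ToInt(x)            # 1635, i.e. U+0663
utf8ToInt("3")          # 51, i.e. U+0033
```

The two code points are different, which is why x == y is FALSE regardless of how the collation in use orders them.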



-- 
http://had.co.nz/
