Re: [Rd] Bug in rank with utf8?
> On 14 Aug 2015, at 08:10 , Prof Brian Ripley wrote:
>
> E.g. on my Yosemite system in en_US.UTF-8
>
>> rank(c(x, y))
> [1] 1.5 1.5

..which differs from my Mavericks system but not my Yosemite system, both in en_US.UTF-8, both with icuGetCollate returning "root"... Oh, well.

-pd

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd@cbs.dk  Priv: pda...@gmail.com

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Bug in rank with utf8?
On 13/08/2015 15:19, peter dalgaard wrote:
> Yes, collation is a strange thing, and?

And remember that on some platforms (including yours) ICU is used, so LC_COLLATE is not particularly relevant (unless it is 'C'). See ?Comparisons and ?icuGetCollate.

E.g. on my Yosemite system in en_US.UTF-8

rank(c(x, y))
[1] 1.5 1.5
icuGetCollate()
[1] "root"
icuSetCollate(locale="ASCII")
rank(c(x, y))
[1] 2 1

whereas on Fedora 21

rank(c(x, y))
[1] 2 1
icuGetCollate()
[1] "root"

--
Brian D. Ripley, rip...@stats.ox.ac.uk
Emeritus Professor of Applied Statistics, University of Oxford
1 South Parks Road, Oxford OX1 3TG, UK
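
For anyone wanting to reproduce the comparison on their own machine, here is a minimal sketch of the check described in this message. It assumes an R build recent enough to have capabilities("ICU") and icuGetCollate(); the variable names (xy, old_collate, rank_c) are only illustrative. The first rank() result is platform-dependent, which is the point of the thread, so no expected value is given for it.

```r
# Sketch: compare rank() under the session's collation and under C collation.
x  <- "\u0663"   # ARABIC-INDIC DIGIT THREE
y  <- "3"        # ASCII digit three
xy <- c(x, y)

# Report which collator the session uses, if this build has ICU support.
if (capabilities("ICU")) {
  cat("ICU collation locale:", icuGetCollate(), "\n")
} else {
  cat("ICU is not available in this build\n")
}

# Result under the current collation (varies across platforms and versions).
print(rank(xy))

# Switch LC_COLLATE to "C" for one call, then restore the previous setting.
old_collate <- Sys.getlocale("LC_COLLATE")
Sys.setlocale("LC_COLLATE", "C")
rank_c <- rank(xy)
Sys.setlocale("LC_COLLATE", old_collate)

# In the thread, C collation gave 2 1 on every system that reported it.
print(rank_c)
```

Restoring LC_COLLATE afterwards keeps the rest of the session on its original collation, which matters if other code in the same session sorts or ranks strings.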
Re: [Rd] Bug in rank with utf8?
Yes, collation is a strange thing, and?

Collation order will depend on locale settings, and there are quite a few cases where the collation order of two items is not defined.

To add to the confusion, on OSX Mavericks, I see

> x <- "\u0663"
> y <- 3
>
> x == y
[1] FALSE
> rank(c(x, y))
[1] 2 1
> x
[1] "٣"
> x == y
[1] FALSE
> x > y
[1] TRUE
> x < y
[1] FALSE
> Sys.getlocale()
[1] "en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8"
> Sys.getlocale("LC_COLLATE")
[1] "en_US.UTF-8"

Notice the differences from en_US.UTF8 (sans hyphen) on your system

-pd

On 13 Aug 2015, at 16:01 , John McKown wrote:

> With some slight changes:
>
>> x <- "\u0663"
>> y <- "3"
>> xy <- c(x,y)
>> rank(xy);
> [1] 1.5 1.5
>> Sys.getlocale();
> [1] "LC_CTYPE=en_US.UTF8;LC_NUMERIC=C;LC_TIME=en_US.UTF8;LC_COLLATE=en_US.UTF8;LC_MONETARY=en_US.UTF8;LC_MESSAGES=en_US.UTF8;LC_PAPER=en_US.UTF8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF8;LC_IDENTIFICATION=C"
>> Sys.setlocale(category="LC_COLLATE", locale="C");
> [1] "C"
>> rank(xy);
> [1] 2 1

--
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd@cbs.dk  Priv: pda...@gmail.com
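
A small sketch that may help explain why the C-collation result is stable while the locale-aware one is not: c(x, y) first coerces the number 3 to the string "3", and a plain byte or code-point comparison then always puts "\u0663" after "3". That C collation compares byte values is my reading of ?Comparisons, so treat it as an assumption; the names xy and cp are illustrative only.

```r
# Sketch: what rank() actually receives, and how the two strings compare
# by code point and by UTF-8 bytes (independent of any collation setting).
x <- "\u0663"   # ARABIC-INDIC DIGIT THREE
y <- 3          # numeric, as in the original post

# c() coerces the number to a character string before rank() sees it.
xy <- c(x, y)
print(is.character(xy))   # TRUE
print(xy[2] == "3")       # TRUE

# Unicode code points of the two one-character strings.
cp <- c(utf8ToInt(x), utf8ToInt("3"))
print(cp)                 # 1635 51

# UTF-8 bytes: the first byte of x (0xd9) is larger than "3" (0x33).
print(charToRaw(enc2utf8(x)))   # d9 a3
print(charToRaw("3"))           # 33
```

A locale-aware collator is free to treat the two digits as equal at some comparison level, which would be consistent with the tied ranks of 1.5 reported elsewhere in the thread.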
Re: [Rd] Bug in rank with utf8?
Once again, I did not read the Subject correctly. I switched away from UTF8 in my second test.

On Thu, Aug 13, 2015 at 9:01 AM, John McKown wrote:

> > Sys.setlocale(category="LC_COLLATE", locale="C");
> [1] "C"
> > rank(xy);
> [1] 2 1

--
John McKown
Re: [Rd] Bug in rank with utf8?
2015-08-13 8:39 GMT-05:00 Hadley Wickham:

> x <- "\u0663"
> y <- 3
>
> x == y
> # FALSE
> rank(c(x, y))
> # c(1.5, 1.5)

also interesting, and confusing to me:

> x == y
[1] FALSE
> x > y
[1] FALSE
> x < y
[1] FALSE

With some slight changes:

> x <- "\u0663"
> y <- "3"
> xy <- c(x,y)
> rank(xy);
[1] 1.5 1.5
> Sys.getlocale();
[1] "LC_CTYPE=en_US.UTF8;LC_NUMERIC=C;LC_TIME=en_US.UTF8;LC_COLLATE=en_US.UTF8;LC_MONETARY=en_US.UTF8;LC_MESSAGES=en_US.UTF8;LC_PAPER=en_US.UTF8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF8;LC_IDENTIFICATION=C"
> Sys.setlocale(category="LC_COLLATE", locale="C");
[1] "C"
> rank(xy);
[1] 2 1

--
Schrodinger's backup: The condition of any backup is unknown until a restore is attempted.

Yoda of Borg, we are. Futile, resistance is, yes. Assimilated, you will be.

He's about as useful as a wax frying pan.

10 to the 12th power microphones = 1 Megaphone

Maranatha! <><
John McKown

[[alternative HTML version deleted]]
[Rd] Bug in rank with utf8?
x <- "\u0663"
y <- 3

x == y
# FALSE
rank(c(x, y))
# c(1.5, 1.5)

--
http://had.co.nz/