Re: [Rd] RFC: hexadecimal constants and decimal points
> On Sun, 17 Apr 2005, Jan T. Kim wrote:
>
>> On Sun, Apr 17, 2005 at 12:38:10PM +0100, Prof Brian Ripley wrote:
>>> These are some points stimulated by reading about C history (and
>>> related in their implementation).
>>>
>>> 1) On some platforms
>>>
>>> > as.integer("0xA")
>>> [1] 10
>>>
>>> but not all (not on Solaris nor Windows).  We do not define what is
>>> allowed, and rely on the OS's implementation of strtod (yes, not
>>> strtol).  It seems that glibc does allow hex: C99 mandates it but C89
>>> seems not to allow it.
>>>
>>> I think that was a mistake, and strtol should have been used.  Then
>>> C89 does mandate the handling of hex constants and also octal ones.
>>> So changing to strtol would change the meaning of as.integer("011").
>>
>> I think interpretation of a leading 0 as a prefix indicating an octal
>> representation should indeed be avoided. People not familiar with C
>> will have a hard time understanding and getting used to this concept,
>> and in addition, it happens way too often that numeric data are
>> provided left-padded with zeros.

I agree with this: 011 should be 11, it should not be 9.

>>> Proposal: we handle this ourselves and define what values are
>>> acceptable, namely for as.integer:
>>>
>>> [+|-][0-9]+
>>> NA
>>> 0[x|X][0-9A-Fa-f]+
>>
>> It can be a somewhat mixed blessing if the string representation of
>> numeric values contains information about their base, in the form of
>> the 0x prefix in this case. The base argument (#3) of C's strtol
>> function can be set to a base explicitly or to 0, which gives the
>> prefix-based auto-selection behaviour. On the R level, such a base
>> argument (to as.integer) could be included and a default could be set.
>
> A lot of this is internal, not at R level.
>
>> Personally, I would be equally happy with the default being 0
>> (auto-select) or 10. Considering the perhaps limited spread of
>> familiarity with C's 0x idiom, I somewhat favour a consistent and
>> stubborn decimal behaviour (base defaults to 10), though.
>
> Some people already rely on it, and those who don't know about it are
> unlikely to ever enter what they think is an illegal value, surely?

As long as we document it, I think the 0x prefix is fine.

We should provide a way to use other bases on input and output. This
could be through format specifiers, but it would be enough to have a
pair of dedicated functions to do the conversions.

Duncan Murdoch

__
R-devel@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
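[Editorial sketch, not part of the original message.] The input rule
discussed above (plain decimal digits, the literal NA, or a 0x-prefixed
hex string, with a leading 0 never meaning octal) could be written
roughly as follows. The function name parse_int_strict and its return
convention are hypothetical; this is not R's actual implementation, only
an illustration of the proposed grammar.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not R's code.  Accepts only
 *     [+-]?[0-9]+    |    NA    |    0[xX][0-9A-Fa-f]+
 * Returns 1 and stores the value in *out on success.  Returns 0 for
 * "NA", for malformed input, and for values outside the int range. */
int parse_int_strict(const char *s, int *out)
{
    assert(s != NULL && out != NULL);
    if (strcmp(s, "NA") == 0)
        return 0;

    int base = 10;
    const char *digits = s;
    if (s[0] == '0' && (s[1] == 'x' || s[1] == 'X')) {
        base = 16;
        digits = s + 2;
    } else if (s[0] == '+' || s[0] == '-') {
        digits = s + 1;
    }

    /* Require at least one digit, and only digits valid for the base. */
    if (*digits == '\0')
        return 0;
    for (const char *p = digits; *p != '\0'; p++) {
        int ok = (base == 16) ? isxdigit((unsigned char)*p)
                              : isdigit((unsigned char)*p);
        if (!ok)
            return 0;
    }

    /* The base is passed explicitly, so "011" is decimal 11, not octal 9. */
    errno = 0;
    long v = strtol(s, NULL, base);
    if (errno == ERANGE || v > INT_MAX || v < INT_MIN)
        return 0;
    *out = (int)v;
    return 1;
}
```

Under this sketch "011" yields 11 and "0xA" yields 10, while "12abc",
"0x" and the empty string are rejected. A signed hex string such as
"+0x1F" is also rejected, because the grammar as written allows a sign
only on the decimal form.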
Re: [Rd] RFC: hexadecimal constants and decimal points
BDR == Prof Brian Ripley [EMAIL PROTECTED]
    on Sun, 17 Apr 2005 12:38:10 +0100 (BST) writes:

    BDR> These are some points stimulated by reading about C history (and
    BDR> related in their implementation).

    ...

    BDR> 2) R does not have integer constants.  It would be
    BDR> convenient if it did, and I can see no difficulty in
    BDR> allowing the same conversions when parsing as when
    BDR> coercing.  This would have the side effect that 100
    BDR> would be integer (but the coercion rules would come
    BDR> into play) but 20 would be double.  And
    BDR> x <- 0xce80 would be valid.

Hmm, I'm not sure if this (parser change, mainly) is worth the
potential problems.  Of course you (Brian) know better than anyone
here that, when that change was implemented for S-plus, I think
Mathsoft (the predecessor of 'Insightful') did also change all their
legacy S code and translate all 'n' to 'n.' just in order to make sure
that things stayed back compatible.  And, IIRC, they recommended users
to do so similarly with their own S source files.  I had found this
extremely ugly at the time, but it was mandated by the fact they
didn't want to break existing code which in some places did assume
that e.g. '0' was a double but became an integer in the new version of
S-plus {and e.g., as.double(.) became absolutely mandated before
passing things to C --- of course, using as.double(.) ``everywhere''
before passing to C has been recommended for a long time, which didn't
prevent people from relying on the current behavior (in R) that almost
all numbers are double}.

We (or rather the less sophisticated members of the R community) may
get into similar problems when, e.g., matrix(0, 3,4) suddenly
produces an integer matrix instead of a double precision one.

    BDR> 3) We do allow setting LC_NUMERIC, but it partially breaks R if the
    BDR> decimal point is not ".".  (I know of no locale in which it is not "." or
    BDR> ",", and we cannot allow "," as part of numeric constants when parsing.)
    BDR> E.g.:

    BDR> > Sys.setlocale("LC_NUMERIC", "fr_FR")
    BDR> [1] "fr_FR"
    BDR> Warning message:
    BDR> setting 'LC_NUMERIC' may cause R to function strangely in:
    BDR> setlocale(category, locale)
    BDR> > x <- 3.12
    BDR> > x
    BDR> [1] 3
    BDR> > as.numeric("3,12")
    BDR> [1] 3,12
    BDR> > as.numeric("3.12")
    BDR> [1] NA
    BDR> Warning message:
    BDR> NAs introduced by coercion

    BDR> We could do better by insisting that "." was the decimal point in all
    BDR> internal conversions _to_ numeric.  Then the effect of setting LC_NUMERIC
    BDR> would primarily be on conversions _from_ numeric, especially printing and
    BDR> graphical output.  (One issue would be what to do with scan(), which has a
    BDR> `dec' argument but is implemented assuming LC_NUMERIC=C.  I would hope to
    BDR> continue to have `dec' but perhaps with a locale-dependent default.)  The
    BDR> resulting asymmetry (R would not be able to parse its own output) would be
    BDR> unhappy, but seems inevitable.  (This could be implemented easily by having
    BDR> a `dec' arg to EncodeReal and EncodeComplex, and using LC_NUMERIC to
    BDR> control that rather than actually setting the locale category.  For
    BDR> example, deparsing needs to be done in LC_NUMERIC=C.)

Yes, I like this quite a bit:

- Only allow "." as decimal point in conversions to numeric.

- Allowing "," (or other locale settings if there are any) for
  conversions _from_ numeric will be very attractive to some (not to
  me) and will make the use of R's ``reporting facility'' much more
  natural to them.

The asymmetry is a bit unhappy -- and that will be a good reason to
advocate (to the user community) that using "," for the decimal point
may be a bad idea in general.

Martin Maechler
ETH Zurich

    BDR> All of these could be implemented by customized versions of
    BDR> strtod/strtol.

    BDR> --
    BDR> Brian D. Ripley,                  [EMAIL PROTECTED]
    BDR> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
    BDR> University of Oxford,             Tel:  +44 1865 272861 (self)
    BDR> 1 South Parks Road,                     +44 1865 272866 (PA)
    BDR> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
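[Editorial sketch, not part of the original message.] The asymmetry
discussed above (output may use a locale-style decimal mark, input
accepts only ".") can be illustrated in C. Both function names are
hypothetical; format_real_dec merely mimics the idea of passing a `dec'
argument to an encoder such as EncodeReal and is not R's code. The
sketch assumes the process stays in the default "C" numeric locale, so
that snprintf and strtod themselves always use ".".

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical output helper: format x with `digits` decimals, then
 * replace the '.' with the caller's decimal mark `dec`.
 * Returns the string length, or -1 if buf is too small. */
int format_real_dec(char *buf, size_t n, double x, int digits, char dec)
{
    assert(buf != NULL && n > 0);
    int len = snprintf(buf, n, "%.*f", digits, x);
    if (len < 0 || (size_t)len >= n)
        return -1;
    char *p = strchr(buf, '.');
    if (p != NULL)
        *p = dec;
    return len;
}

/* Hypothetical input helper: the whole string must be consumed by
 * strtod, so in the "C" locale "3,12" is rejected rather than read
 * as 3.  Returns 1 on success, 0 otherwise. */
int parse_real_dot(const char *s, double *out)
{
    assert(s != NULL && out != NULL);
    char *end;
    double v = strtod(s, &end);
    if (end == s || *end != '\0')
        return 0;
    *out = v;
    return 1;
}
```

Formatting 3.12 with dec = ',' gives "3,12", and feeding that string
back to parse_real_dot fails, which is exactly the "cannot parse its own
output" asymmetry described in the message. Note that strtod also
accepts forms such as leading whitespace, "inf" and hex floats; a
stricter parser would have to filter those separately.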
Re: [Rd] RFC: hexadecimal constants and decimal points
Martin Maechler [EMAIL PROTECTED] writes:

>     BDR> We could do better by insisting that "." was the decimal
>     BDR> point in all internal conversions _to_ numeric.  Then the
>     BDR> effect of setting LC_NUMERIC would primarily be on
>     BDR> conversions _from_ numeric, especially printing and
>     BDR> graphical output.  (One issue would be what to do with
>     BDR> scan(), which has a `dec' argument but is implemented
>     BDR> assuming LC_NUMERIC=C.  I would hope to continue to have
>     BDR> `dec' but perhaps with a locale-dependent default.)  The
>     BDR> resulting asymmetry (R would not be able to parse its own
>     BDR> output) would be unhappy, but seems inevitable.  (This could
>     BDR> be implemented easily by having a `dec' arg to EncodeReal
>     BDR> and EncodeComplex, and using LC_NUMERIC to control that
>     BDR> rather than actually setting the locale category.  For
>     BDR> example, deparsing needs to be done in LC_NUMERIC=C.)
>
> Yes, I like this quite a bit:
>
> - Only allow "." as decimal point in conversions to numeric.
>
> - Allowing "," (or other locale settings if there are any) for
>   conversions _from_ numeric will be very attractive to some (not to
>   me) and will make the use of R's ``reporting facility'' much more
>   natural to them.
>
> The asymmetry is a bit unhappy -- and that will be a good reason to
> advocate (to the user community) that using "," for the decimal point
> may be a bad idea in general.

Could I suggest that we tread very carefully here? This issue has
caused several trip-ups historically:

- The locale-dependent comma-separated variables format, in some cases
  not separated by commas. And it seems that you can still get Excel
  files that use comma both for separation and as decimal point (I
  thought that problem disappeared with early versions of Paradox, but
  apparently not, according to a recent query on r-help).

- Exports from SAS as a text file cannot be read by SPSS and vice
  versa.

etc.

Quite possibly, the computer world missed the opportunity to agree on
an international standard (what's the big deal with using commas
anyway?). As it is we probably have to adjust to it, but we have to
distinguish very carefully between reports, code, and data, and choose
appropriate conventions for each case.

--
   O__        Peter Dalgaard             Blegdamsvej 3
  c/ /'_ ---  Dept. of Biostatistics     2200 Cph. N
 (*) \(*) --  University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~ -          ([EMAIL PROTECTED])                     FAX: (+45) 35327907
Re: [Rd] RFC: hexadecimal constants and decimal points
On Mon, 18 Apr 2005, Peter Dalgaard wrote:

> Martin Maechler [EMAIL PROTECTED] writes:
>
>>     BDR> We could do better by insisting that "." was the decimal
>>     BDR> point in all internal conversions _to_ numeric.  Then the
>>     BDR> effect of setting LC_NUMERIC would primarily be on
>>     BDR> conversions _from_ numeric, especially printing and
>>     BDR> graphical output.  (One issue would be what to do with
>>     BDR> scan(), which has a `dec' argument but is implemented
>>     BDR> assuming LC_NUMERIC=C.  I would hope to continue to have
>>     BDR> `dec' but perhaps with a locale-dependent default.)  The
>>     BDR> resulting asymmetry (R would not be able to parse its own
>>     BDR> output) would be unhappy, but seems inevitable.  (This could
>>     BDR> be implemented easily by having a `dec' arg to EncodeReal
>>     BDR> and EncodeComplex, and using LC_NUMERIC to control that
>>     BDR> rather than actually setting the locale category.  For
>>     BDR> example, deparsing needs to be done in LC_NUMERIC=C.)
>>
>> Yes, I like this quite a bit:
>>
>> - Only allow "." as decimal point in conversions to numeric.
>>
>> - Allowing "," (or other locale settings if there are any) for
>>   conversions _from_ numeric will be very attractive to some (not to
>>   me) and will make the use of R's ``reporting facility'' much more
>>   natural to them.
>>
>> The asymmetry is a bit unhappy -- and that will be a good reason to
>> advocate (to the user community) that using "," for the decimal point
>> may be a bad idea in general.
>
> Could I suggest that we tread very carefully here? This issue has
> caused several trip-ups historically:
>
> - The locale-dependent comma-separated variables format, in some cases
>   not separated by commas. And it seems that you can still get Excel
>   files that use comma both for separation and as decimal point (I
>   thought that problem disappeared with early versions of Paradox, but
>   apparently not, according to a recent query on r-help).
>
> - Exports from SAS as a text file cannot be read by SPSS and vice
>   versa.
>
> etc.
>
> Quite possibly, the computer world missed the opportunity to agree on
> an international standard (what's the big deal with using commas
> anyway?). As it is we probably have to adjust to it, but we have to
> distinguish very carefully between reports, code, and data, and choose
> appropriate conventions for each case.

I was treading _very_ carefully.  Nowhere did I suggest altering any of
write.table and friends.  I did not even suggest altering read.table.  I
tentatively suggested the default in scan() might be locale-specific, but
was otherwise leaving import/export completely alone.

The aim is to allow people to have commas in printed output and graph
labels if they want.  Note, nothing would be done unless people explicitly
did something like

Sys.setlocale("LC_NUMERIC", "fr_FR")

so this would not affect naive users in any way.

Brian

--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
Re: [Rd] RFC: hexadecimal constants and decimal points
On 4/17/05, Prof Brian Ripley [EMAIL PROTECTED] wrote:
> These are some points stimulated by reading about C history (and
> related in their implementation).
>
> 1) On some platforms
>
> > as.integer("0xA")
> [1] 10
>
> but not all (not on Solaris nor Windows).  We do not define what is
> allowed, and rely on the OS's implementation of strtod (yes, not
> strtol).  It seems that glibc does allow hex: C99 mandates it but C89
> seems not to allow it.
>
> I think that was a mistake, and strtol should have been used.  Then
> C89 does mandate the handling of hex constants and also octal ones.
> So changing to strtol would change the meaning of as.integer("011").

In the Windows batch language the following (translated to R):

month <- substr("20050817", 5, 6)

must be further processed to remove any leading zero.  Mostly people
don't even realize this and just wind up writing erroneous programs.
It's actually a big nuisance IMHO.
Re: [Rd] RFC: hexadecimal constants and decimal points
On Sun, Apr 17, 2005 at 12:38:10PM +0100, Prof Brian Ripley wrote:
> These are some points stimulated by reading about C history (and
> related in their implementation).
>
> 1) On some platforms
>
> > as.integer("0xA")
> [1] 10
>
> but not all (not on Solaris nor Windows).  We do not define what is
> allowed, and rely on the OS's implementation of strtod (yes, not
> strtol).  It seems that glibc does allow hex: C99 mandates it but C89
> seems not to allow it.
>
> I think that was a mistake, and strtol should have been used.  Then
> C89 does mandate the handling of hex constants and also octal ones.
> So changing to strtol would change the meaning of as.integer("011").

I think interpretation of a leading 0 as a prefix indicating an octal
representation should indeed be avoided. People not familiar with C
will have a hard time understanding and getting used to this concept,
and in addition, it happens way too often that numeric data are
provided left-padded with zeros.

> Proposal: we handle this ourselves and define what values are
> acceptable, namely for as.integer:
>
> [+|-][0-9]+
> NA
> 0[x|X][0-9A-Fa-f]+

It can be a somewhat mixed blessing if the string representation of
numeric values contains information about their base, in the form of
the 0x prefix in this case. The base argument (#3) of C's strtol
function can be set to a base explicitly or to 0, which gives the
prefix-based auto-selection behaviour. On the R level, such a base
argument (to as.integer) could be included and a default could be set.

Personally, I would be equally happy with the default being 0
(auto-select) or 10. Considering the perhaps limited spread of
familiarity with C's 0x idiom, I somewhat favour a consistent and
stubborn decimal behaviour (base defaults to 10), though.

Best regards, Jan
--
 +- Jan T. Kim ---+
 |    *NEW*    email: [EMAIL PROTECTED] |
 |    *NEW*    WWW:   http://www.cmp.uea.ac.uk/people/jtk |
 *-=  hierarchical systems are for files, not for humans  =-*
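[Editorial sketch, not part of the original message.] The effect of
strtol's third argument, as described above, can be seen with a thin
wrapper. The name to_long_base is hypothetical; it only forwards to the
standard strtol, and it ignores trailing characters and overflow, which
a real conversion routine would need to check.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical wrapper around strtol.  `base` has strtol's meaning:
 * 0 selects the base from the prefix (0x = hex, leading 0 = octal),
 * any value from 2 to 36 forces that base. */
long to_long_base(const char *s, int base)
{
    assert(s != NULL);
    assert(base == 0 || (base >= 2 && base <= 36));
    return strtol(s, NULL, base);
}

/* to_long_base("011", 0)  == 9    auto-select: leading 0 means octal
 * to_long_base("011", 10) == 11   forced decimal: zero-padded data is safe
 * to_long_base("0xA", 0)  == 10   auto-select: 0x means hex
 * to_long_base("0xA", 10) == 0    decimal: parsing stops at the 'x'
 * to_long_base("0xA", 16) == 10   base 16 tolerates an optional 0x prefix */
```

The second and fourth cases show the trade-off raised in the message: a
default of 10 protects zero-padded data, but then a hex string silently
reads as 0 unless the caller also checks where parsing stopped.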
Re: [Rd] RFC: hexadecimal constants and decimal points
On Sun, 17 Apr 2005, Jan T. Kim wrote:

> On Sun, Apr 17, 2005 at 12:38:10PM +0100, Prof Brian Ripley wrote:
>> These are some points stimulated by reading about C history (and
>> related in their implementation).
>>
>> 1) On some platforms
>>
>> > as.integer("0xA")
>> [1] 10
>>
>> but not all (not on Solaris nor Windows).  We do not define what is
>> allowed, and rely on the OS's implementation of strtod (yes, not
>> strtol).  It seems that glibc does allow hex: C99 mandates it but C89
>> seems not to allow it.
>>
>> I think that was a mistake, and strtol should have been used.  Then
>> C89 does mandate the handling of hex constants and also octal ones.
>> So changing to strtol would change the meaning of as.integer("011").
>
> I think interpretation of a leading 0 as a prefix indicating an octal
> representation should indeed be avoided. People not familiar with C
> will have a hard time understanding and getting used to this concept,
> and in addition, it happens way too often that numeric data are
> provided left-padded with zeros.
>
>> Proposal: we handle this ourselves and define what values are
>> acceptable, namely for as.integer:
>>
>> [+|-][0-9]+
>> NA
>> 0[x|X][0-9A-Fa-f]+
>
> It can be a somewhat mixed blessing if the string representation of
> numeric values contains information about their base, in the form of
> the 0x prefix in this case. The base argument (#3) of C's strtol
> function can be set to a base explicitly or to 0, which gives the
> prefix-based auto-selection behaviour. On the R level, such a base
> argument (to as.integer) could be included and a default could be set.

A lot of this is internal, not at R level.

> Personally, I would be equally happy with the default being 0
> (auto-select) or 10. Considering the perhaps limited spread of
> familiarity with C's 0x idiom, I somewhat favour a consistent and
> stubborn decimal behaviour (base defaults to 10), though.

Some people already rely on it, and those who don't know about it are
unlikely to ever enter what they think is an illegal value, surely?

--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595