Re: [R] Reading in a table with ISO-latin1 encoding in MacOS-X (Intel)

2006-06-22 Thread Antti Arppe

Dear colleagues,

With the help of a colleague of mine here in Helsinki (Seppo Nyrkkö), 
who looked at the innards of the R source code for Mac, it turned out 
that this was indeed an issue with the Mac locale and its settings, 
and not with R.


Though we had tried this earlier by changing the LANG variable to 
'fi_FI', we hadn't looked hard enough at the available locales (with 
locale -a) to select exactly the right value, namely:


LANG=fi_FI.ISO8859-1;
export LANG;

With this configuration R was able to happily read in my original 
table with the Scandinavian characters in the header, without any 
fuss.
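
For anyone wanting to confirm from inside R which locale the session actually picked up, a minimal sketch follows. It only reads the current settings; the commented-out Sys.setlocale() call is an assumption about how one might switch within a session, and it only succeeds if the named locale appears in the output of locale -a on that machine.

```r
# Report the character-type locale that R inherited from the environment
# (for example from LANG, as set above).
ctype <- Sys.getlocale("LC_CTYPE")
print(ctype)

# l10n_info() summarises whether the session is multibyte, UTF-8 or Latin-1.
info <- l10n_info()
print(info[c("MBCS", "UTF-8", "Latin-1")])

# Assumption: switching inside a running session may also work, provided
# the locale name is installed on the system.
# Sys.setlocale("LC_CTYPE", "fi_FI.ISO8859-1")
```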


Thanks for your advice, and wishing all a good Midsummer,

-Antti Arppe

On Thu, 8 Jun 2006, Peter Dalgaard wrote:

so one can only guess that you have a local or Mac-specific setup
issue.

On Thu, 8 Jun 2006, Prof Brian Ripley wrote:

If so, you need to investigate the locale in use, as which letters are valid
depends on the locale: on Linux UTF-8 locales all letters in all languages
are valid in R names, but that is not necessarily the MacOS interpretation.
(Invalid characters in names will be converted to ., and if the locale is
wrong so may be the interpretation of bytes as characters.)
You might find more informed help on the r-sig-mac list.


__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html

Re: [R] Reading in a table with ISO-latin1 encoding in MacOS-X (Intel)

2006-06-08 Thread Charles Plessy
On Thu, Jun 08, 2006 at 04:10:08PM +0300, Antti Arppe wrote:
 
 Converting the file from ISO-latin-1 to UTF8 (with Mac's TextEdit 
 application) allows the file to be read in in its entirety, but still 
 the Scandinavian character in the heading is coerced to a period '.', 
 or two, in fact (i.e. 'miettiä' -> 'miett..')

Dear Antti,

I may be wrong, but the Unicode accented Latin letters are not encoded
the same way on Linux and Macintosh. On Linux, ä is ä, but on Macintosh, it
is "+a (please read the quotes as if there were an umlaut). Did you try
to just retype the headers with a Macintosh text editor?
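
The two forms alluded to here correspond to what Unicode calls the precomposed and decomposed representations of the same letter. The sketch below only shows how to tell them apart in R; the variable names are illustrative, and whether this difference played any part in the problem discussed in this thread is not established.

```r
# Two ways of writing what is displayed as the same letter.
precomposed <- "\u00e4"   # U+00E4, LATIN SMALL LETTER A WITH DIAERESIS
decomposed  <- "a\u0308"  # "a" followed by U+0308, COMBINING DIAERESIS

# The code points differ, so the strings do not compare as identical.
print(utf8ToInt(precomposed))
print(utf8ToInt(decomposed))
print(identical(precomposed, decomposed))
```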

Good luck!

-- 
Charles Plessy
Wako, Saitama, Japan



Re: [R] Reading in a table with ISO-latin1 encoding in MacOS-X (Intel)

2006-06-08 Thread Peter Dalgaard
Antti Arppe [EMAIL PROTECTED] writes:

 Converting the file from ISO-latin-1 to UTF8 (with Mac's TextEdit
 application) allows the file to be read in in its entirety, but still
 the Scandinavian character in the heading is coerced to a period '.',
 or two, in fact (i.e. 'miettiä' -> 'miett..')

I think you probably need check.names=FALSE. (Presumably, you cannot
have Finnish characters in variable names either on the Mac?)
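
A minimal sketch of that suggestion, using a small made-up table instead of the real file. The text= argument belongs to more recent versions of R than the one discussed in this thread, and the data are hypothetical. With check.names = FALSE the header is taken as it stands rather than passed through make.names(), so no characters are replaced by periods.

```r
# A tiny stand-in for the table in this thread (hypothetical data).
# "\u00e4" is the letter a-umlaut, written as an escape so the example
# does not depend on the encoding of the script file.
txt <- "ajatella mietti\u00e4 pohtia\nFALSE FALSE TRUE\nTRUE FALSE FALSE"

# check.names = FALSE keeps the header names exactly as read.
d <- read.table(text = txt, header = TRUE, check.names = FALSE)
print(names(d))
print(d)
```

How the middle name is displayed still depends on the locale of the session.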
 
 Have I possibly misunderstood how the 'file' function should be used
 in conjunction with 'read.table', or might the problem with
 latin1-to-utf conversion be somewhere else?

Not really, text encodings are just a pain. The blame for this fact
can be shifted in various directions, but it doesn't really help...
(My personal angle is that ISO-8859 was terribly shortsighted, and
stuck in a Western Europe mindset. As soon as the iron curtain
disappeared and we started to deal with people from Slavic countries,
the weakness was revealed.)

The basic structure looks OK, and works for me on Linux:

 read.table(file("xx.data", encoding="latin1"), header=TRUE)
  æh bøh
1  1   2

so one can only guess that you have a local or Mac-specific setup
issue. 

-- 
   O__   Peter Dalgaard Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark  Ph:  (+45) 35327918
~~ - ([EMAIL PROTECTED])  FAX: (+45) 35327907



Re: [R] Reading in a table with ISO-latin1 encoding in MacOS-X (Intel)

2006-06-08 Thread Prof Brian Ripley
You are using this as intended, although your email message came in latin9 
not latin1, which does not affect your examples.  Have you actually 
checked (e.g. via a hex dump) that the file is in latin1?
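
Such a check can also be done from within R. The sketch below writes the Latin-1 bytes of 'miettiä' to a temporary file (a stand-in, not the real data file) and inspects them. validUTF8() exists only in more recent versions of R, and the test is a heuristic rather than proof of the encoding.

```r
# Write the Latin-1 byte sequence for "miettiä" plus a newline to a
# temporary file; 0xE4 is a-umlaut in ISO-8859-1.
path <- tempfile()
writeBin(as.raw(c(0x6d, 0x69, 0x65, 0x74, 0x74, 0x69, 0xe4, 0x0a)), path)

# Read the file back as raw bytes: a hex dump in all but name.
bytes <- readBin(path, what = "raw", n = file.size(path))
print(bytes)

# In Latin-1 the letter is the single byte e4; in UTF-8 it would be c3 a4.
has_latin1_a_umlaut <- any(bytes == as.raw(0xe4))

# A lone e4 byte is not valid UTF-8, which hints that the file is Latin-1.
looks_like_utf8 <- validUTF8(rawToChar(bytes))
print(c(has_latin1_a_umlaut, looks_like_utf8))
```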


I assume that if you converted the file to UTF-8 you then used

read.table("R_data/hs+sfnet.T.060505.tbl4", header=TRUE)

If so, you need to investigate the locale in use, as which letters are 
valid depends on the locale: on Linux UTF-8 locales all letters in all 
languages are valid in R names, but that is not necessarily the MacOS 
interpretation.  (Invalid characters in names will be converted to ., and 
if the locale is wrong so may be the interpretation of bytes as 
characters.)


You might find more informed help on the r-sig-mac list.


On Thu, 8 Jun 2006, Antti Arppe wrote:


Dear colleagues in R,

I have earlier been working with R in Linux, where reading in a table 
containing Scandinavian letters (ä, ö, and å) in the header as part of 
variable names has not caused any problem whatsoever.


However, doing the same in R on a new MacOS-X machine (with an Intel 
processor), with the same original text table, does not seem to work 
whichever way I try. Following the recommendations on the R site and using 
the 'file' function to set the encoding breaks down at the first encounter 
with a Scandinavian character:


THINK <- read.table(file("R_data/hs+sfnet.T.060505.tbl4", 
encoding="latin1"), header=TRUE)

Warning messages:
1: invalid input found on input connection 'R_data/hs+sfnet.T.060505.tbl4'
2: incomplete final line found by readTableHeader on 
'R_data/hs+sfnet.T.060505.tbl4'


A sample exemplifying such characters as variable labels is below (for which 
the behavior of R on the Mac is the same as for the larger file referred to 
above):


   ajatella miettiä pohtia
1     FALSE   FALSE   TRUE
2     FALSE   FALSE  FALSE
3     FALSE    TRUE  FALSE
4     FALSE    TRUE  FALSE
5      TRUE   FALSE  FALSE
6      TRUE   FALSE  FALSE
7     FALSE   FALSE  FALSE
8     FALSE    TRUE  FALSE
9     FALSE    TRUE  FALSE
10    FALSE   FALSE  FALSE

Converting the file from ISO-latin-1 to UTF8 (with Mac's TextEdit 
application) allows the file to be read in in its entirety, but still the 
Scandinavian character in the heading is coerced to a period '.', or two, in 
fact (i.e. 'miettiä' -> 'miett..')


Have I possibly misunderstood how the 'file' function should be used in 
conjunction with 'read.table', or might the problem with latin1-to-utf 
conversion be somewhere else?


Appreciating any help on this matter,




--
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595