Re: [R] Encoding() and strsplit()

2008-11-07 Thread Prof Brian Ripley

See the 'R Internals' manual.

Strings that contain only ASCII characters are never marked as Latin-1 or
UTF-8; Encoding() always reports them as "unknown".
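
The behaviour described above can be checked with a short sketch. It is only an illustration, not taken from the manual: the vector x and the use of byte escapes (so that the result does not depend on the locale or on the encoding of the script file) are assumptions made for this example.

```r
# "\xe4" is the Latin-1 byte for a-umlaut; the other two elements are ASCII.
x <- c("a", "b", "\xe4")

# Try to declare every element as Latin-1.
Encoding(x) <- "latin1"

# Only the non-ASCII element carries the mark; ASCII elements stay "unknown".
Encoding(x)

# The mark is not an R-level attribute, which is why attributes() and str()
# show nothing about it.
attributes(x)

# Converting to UTF-8 follows the same rule: only the non-ASCII element
# ends up with a declared encoding.
y <- enc2utf8(x)
Encoding(y)
```

This also suggests why the declared encoding is invisible to str(): it appears to be a per-string internal mark rather than something attached to the vector, which is the topic the 'R Internals' manual covers.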

On Fri, 7 Nov 2008, Heinz Tuechler wrote:


Dear All,

Encoding() goes beyond my understanding; see the example. From reading the
help for Encoding(), I would expect strsplit() to preserve the encoding of
each resulting element, but for plain ASCII letters it gets lost.
It also seems that an encoding cannot be declared for plain letters: they
remain "unknown" in any case. In paste(), latin1 seems to dominate unknown.
What kind of property of an object is the encoding? It does not show up as
an attribute, and str() does not give me any hint either.

Where can I find some explanation regarding encoding?

Thanks

Heinz

###   Encoding() and strsplit
u <- 'abcäöü'
Encoding(u)
# [1] "latin1"
Encoding(u) <- 'latin1' # to be sure about encoding
us <- strsplit(u, '')[[1]] # split into single-character strings
Encoding(us)
# [1] "unknown" "unknown" "unknown" "latin1"  "latin1"  "latin1"
Encoding(us) <- rep('latin1', length(us))
Encoding(us)
# [1] "unknown" "unknown" "unknown" "latin1"  "latin1"  "latin1"
pus <- paste(us[1], us[5], sep='')
Encoding(pus)
# [1] "latin1"

Version:
platform = i386-pc-mingw32
arch = i386
os = mingw32
system = i386, mingw32
status = Patched
major = 2
minor = 8.0
year = 2008
month = 11
day = 04
svn rev = 46830
language = R
version.string = R version 2.8.0 Patched (2008-11-04 r46830)

Windows XP (build 2600) Service Pack 2

Locale:
LC_COLLATE=German_Austria.1252;LC_CTYPE=German_Austria.1252;LC_MONETARY=German_Austria.1252;LC_NUMERIC=C;LC_TIME=German_Austria.1252

Search Path:
.GlobalEnv, package:stats, package:graphics, package:grDevices, 
package:utils, package:datasets, package:methods, Autoloads, package:base


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



--
Brian D. Ripley,  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford, Tel:  +44 1865 272861 (self)
1 South Parks Road, +44 1865 272866 (PA)
Oxford OX1 3TG, UK  Fax:  +44 1865 272595


Re: [R] Encoding() and strsplit()

2008-11-07 Thread Heinz Tuechler

At 09:15 07.11.2008, Prof Brian Ripley wrote:

See the 'R Internals' manual.


Thank you, now I understand a little more.
My real problem, however, is a data frame produced
by spss.get(). Is there a simple way to mark all
character strings in that data frame (except pure
ASCII ones), including the levels of factors, as latin1?


Heinz Tüchler
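
One possible approach to the question above is sketched here. The helper mark_latin1() and the toy data frame d are hypothetical names invented for this illustration; they are not part of base R or of the package providing spss.get(). The sketch assumes that the non-ASCII bytes in the data really are Latin-1: Encoding<- only declares an encoding, it does not convert anything.

```r
# Hypothetical helper: declare all non-ASCII strings in a data frame,
# including factor levels, to be Latin-1. ASCII strings are left "unknown"
# automatically, so no special handling is needed for them.
mark_latin1 <- function(df) {
  mark_one <- function(col) {
    if (is.character(col)) {
      Encoding(col) <- "latin1"
    } else if (is.factor(col)) {
      lev <- levels(col)
      Encoding(lev) <- "latin1"
      levels(col) <- lev
    }
    col
  }
  # df[] <- ... keeps the data frame structure while replacing each column.
  df[] <- lapply(df, mark_one)
  df
}

# Toy data with Latin-1 bytes written as escapes (locale independent).
d <- data.frame(
  id = 1:2,
  s  = c("abc", "\xe4"),
  f  = factor(c("x", "\xf6"), levels = c("x", "\xf6")),
  stringsAsFactors = FALSE
)

out <- mark_latin1(d)
Encoding(out$s)
Encoding(levels(out$f))
```

Other places where strings may be stored, such as variable labels kept in column attributes, are not touched by this sketch and would need the same treatment if they contain non-ASCII text.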





[R] Encoding() and strsplit()

2008-11-06 Thread Heinz Tuechler

Dear All,

Encoding() goes beyond my understanding; see the
example. From reading the help for Encoding(), I
would expect strsplit() to preserve the encoding of
each resulting element, but for plain ASCII letters it gets lost.
It also seems that an encoding cannot be declared
for plain letters: they remain "unknown" in any
case. In paste(), latin1 seems to dominate unknown.
What kind of property of an object is the
encoding? It does not show up as an attribute, and
str() does not give me any hint either.

Where can I find some explanation regarding encoding?

Thanks

Heinz

###   Encoding() and strsplit
u <- 'abcäöü'
Encoding(u)
# [1] "latin1"
Encoding(u) <- 'latin1' # to be sure about encoding
us <- strsplit(u, '')[[1]] # split into single-character strings
Encoding(us)
# [1] "unknown" "unknown" "unknown" "latin1"  "latin1"  "latin1"
Encoding(us) <- rep('latin1', length(us))
Encoding(us)
# [1] "unknown" "unknown" "unknown" "latin1"  "latin1"  "latin1"
pus <- paste(us[1], us[5], sep='')
Encoding(pus)
# [1] "latin1"

Version:
 platform = i386-pc-mingw32
 arch = i386
 os = mingw32
 system = i386, mingw32
 status = Patched
 major = 2
 minor = 8.0
 year = 2008
 month = 11
 day = 04
 svn rev = 46830
 language = R
 version.string = R version 2.8.0 Patched (2008-11-04 r46830)

Windows XP (build 2600) Service Pack 2

Locale:
LC_COLLATE=German_Austria.1252;LC_CTYPE=German_Austria.1252;LC_MONETARY=German_Austria.1252;LC_NUMERIC=C;LC_TIME=German_Austria.1252

Search Path:
 .GlobalEnv, package:stats, package:graphics, 
package:grDevices, package:utils, 
package:datasets, package:methods, Autoloads, package:base

