Re: [R] Weird 'xmlEventParse' encoding issue

2013-07-16 Thread Duncan Temple Lang
Hi Sascha

 Your code gives the correct results on my machine (OS X),
either reading from the file directly or via readLines() and passing
the text to xmlEventParse().

 The problem might be the version of the XML package or your environment
settings.  It is important to report the session information, so please
provide the output from

   sessionInfo()
   Sys.getenv()
   libxmlVersion()
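
A minimal sketch of gathering that information in one go, in case it is useful. The helper name collect_diagnostics() is hypothetical, and restricting Sys.getenv() to the locale-related variables is an assumption made here to keep the output short; the full Sys.getenv() output may still be what is wanted.

```r
# Hypothetical helper: gather the requested details into one list.
# The XML-specific entries are only filled in when the XML package is installed.
collect_diagnostics <- function() {
  info <- list(
    session = sessionInfo(),
    locale  = Sys.getlocale(),
    l10n    = l10n_info(),
    # Assumption: only the locale-related environment variables are kept here.
    env     = Sys.getenv(c("LANG", "LC_ALL", "LC_CTYPE"))
  )
  if (requireNamespace("XML", quietly = TRUE)) {
    info$xml_package <- as.character(utils::packageVersion("XML"))
    info$libxml <- XML::libxmlVersion()
  }
  info
}

diag <- collect_diagnostics()
print(diag$locale)
print(diag$l10n)
```

The locale and l10n entries show whether the session runs in a UTF-8 or a Latin-1 locale, which is often the relevant difference between machines for encoding problems.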


 D

On 7/15/13 4:41 AM, Sascha Wolfer wrote:
> Dear list,
> 
> I have got a weird encoding problem with the xmlEventParse() function from 
> the 'XML' package.
> 
> I tried finding an answer on the web for several hours and a Stack Exchange 
> question came back without success :(
> 
> So here's the problem. I created a small XML test file, which looks like this:
> 
> 
> 
> auch der Schulleiter steht dafür zur Verfügung. Das ist 
> seßhaft mit ä und ö...
> 
> This file is encoded with the iso-8859-1 encoding which is also defined in 
> its header.
> 
> I have 3 handler functions, definitions as follows:
> 
> sE2 <- function (name, attrs) {
>   if (name == "s") {
>     get.text <<- T
>   }
> }
> 
> eE2 <- function (name, attrs) {
>   if (name == "s") {
>     get.text <<- F
>   }
> }
> 
> tS2 <- function (content, ...) {
>   if (get.text & nchar(content) > 0) {
>     collected.text <<- c(collected.text, content)
>   }
> }
> 
> I have one wrapper function around xmlEventParse(), definition as follows:
> 
> get.all.text <- function (file) {
>   t1 <- Sys.time()
>   read.file <- paste(readLines(file, encoding = ""), collapse = " ")
>   print(read.file)
>   assign("collected.text", c(), env = .GlobalEnv)
>   assign("get.text", F, env = .GlobalEnv)
>   xmlEventParse(read.file, asText = T, list(startElement = sE2,
>                                             endElement = eE2,
>                                             text = tS2),
>                 error = function (...) { },
>                 saxVersion = 1)
>   t2 <- Sys.time()
>   cat("That took", round(difftime(t2,t1, units="secs"), 1), "seconds.\n")
>   cat("Result of reading is in variable 'collected.text'.\n")
>   collected.text
> }
> 
> The output of calling get.all.text() is as follows:
> [1] "   type=\"manual\">auch der Schulleiter steht
> dafür zur Verfügung. Das ist seßhaft mit ä und ö... "
> That took 0 seconds.
> Result of reading is in variable 'collected.text'.
> [1] "auch der Schulleiter steht daf""ür zur 
> Verfügung. Das ist seßhaft mit ä und ö..."
> 
> Now the REALLY weird thing (for me) is that R obviously reads in the file 
> correctly (first output) with 'readLines()'.
> Then this output is passed to xmlEventParse(). Afterwards the output is broken 
> and it sometimes also inserts weird breaks
> where special characters occur.
> 
> Do you have any ideas how to solve this problem?
> 
> I cannot use the xmlParse() function because I need the SAX functionality of 
> xmlEventParse(). I also tried reading the
> file with xmlEventParse() directly (with asText = F). No changes...
> 
> Thanks a lot,
> Sascha W.
> 
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



[R] Weird 'xmlEventParse' encoding issue

2013-07-15 Thread Sascha Wolfer

Dear list,

I have got a weird encoding problem with the xmlEventParse() function 
from the 'XML' package.


I tried finding an answer on the web for several hours and a Stack 
Exchange question came back without success :(


So here's the problem. I created a small XML test file, which looks like 
this:




auch der Schulleiter steht dafür zur Verfügung. Das ist 
seßhaft mit ä und ö...


This file is encoded with the iso-8859-1 encoding which is also defined 
in its header.
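
As a minimal sketch of what the encoding declaration means on the R side (base R only, with a throwaway temporary file standing in for the real XML file, which is an assumption), the following writes the word "dafür" as ISO-8859-1 bytes and shows two ways of getting it into UTF-8:

```r
# Write "dafür" plus a newline as raw ISO-8859-1 bytes (ü is the single byte 0xFC).
tmp <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x64, 0x61, 0x66, 0xfc, 0x72, 0x0a)), tmp)

# encoding = "latin1" only marks the strings as Latin-1; it does not re-encode them.
x <- readLines(tmp, encoding = "latin1")
print(Encoding(x))

# enc2utf8() converts according to that mark.
x_utf8 <- enc2utf8(x)
print(Encoding(x_utf8))
print(utf8ToInt(x_utf8))  # code points; 252 is U+00FC (ü)

# Alternatively, convert the unmarked bytes explicitly with iconv().
y <- iconv(readLines(tmp), from = "latin1", to = "UTF-8")
print(utf8ToInt(y))

unlink(tmp)
```

Whether the strings are correctly marked before they reach the parser, and how they are marked when they come back, depends on the session locale, so this only illustrates the conversion step itself.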


I have 3 handler functions, definitions as follows:

sE2 <- function (name, attrs) {
  if (name == "s") {
    get.text <<- T
  }
}

eE2 <- function (name, attrs) {
  if (name == "s") {
    get.text <<- F
  }
}

tS2 <- function (content, ...) {
  if (get.text & nchar(content) > 0) {
    collected.text <<- c(collected.text, content)
  }
}

I have one wrapper function around xmlEventParse(), definition as follows:

get.all.text <- function (file) {
  t1 <- Sys.time()
  read.file <- paste(readLines(file, encoding = ""), collapse = " ")
  print(read.file)
  assign("collected.text", c(), env = .GlobalEnv)
  assign("get.text", F, env = .GlobalEnv)
  xmlEventParse(read.file, asText = T, list(startElement = sE2,
                                            endElement = eE2,
                                            text = tS2),
                error = function (...) { },
                saxVersion = 1)
  t2 <- Sys.time()
  cat("That took", round(difftime(t2,t1, units="secs"), 1), "seconds.\n")
  cat("Result of reading is in variable 'collected.text'.\n")
  collected.text
}

The output of calling get.all.text() is as follows:
[1] "  
auch der Schulleiter steht dafür zur Verfügung. Das 
ist seßhaft mit ä und ö... "

That took 0 seconds.
Result of reading is in variable 'collected.text'.
[1] "auch der Schulleiter steht daf""ür zur 
Verfügung. Das ist seßhaft mit ä und ö..."


Now the REALLY weird thing (for me) is that R obviously reads in the 
file correctly (first output) with 'readLines()'. Then this output is 
passed to xmlEventParse(). Afterwards the output is broken and it 
sometimes also inserts weird breaks where special characters occur.
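
One possible explanation for the breaks, offered as an assumption rather than a diagnosis: SAX parsers are generally allowed to deliver the text of a single element in several chunks, so the text handler may fire more than once per element. A minimal sketch of handlers that join the chunks per element and keep their state in a closure instead of the global environment follows. The name make_text_collector() and the extra result() accessor are hypothetical, and the events are simulated by hand so the sketch runs without the XML package:

```r
# Hypothetical helper: build SAX-style handlers that share state via a closure.
make_text_collector <- function(element = "s") {
  inside <- FALSE            # are we currently inside the target element?
  current <- character()     # text chunks seen for the current element
  collected <- character()   # one joined string per finished element

  list(
    startElement = function(name, attrs, ...) {
      if (name == element) {
        inside <<- TRUE
        current <<- character()
      }
    },
    endElement = function(name, ...) {
      if (name == element) {
        # Join all chunks of this element into a single string.
        collected <<- c(collected, paste(current, collapse = ""))
        inside <<- FALSE
      }
    },
    text = function(content, ...) {
      if (inside && nzchar(content)) current <<- c(current, content)
    },
    result = function() collected
  )
}

# Simulate a parser that splits the text at a special character.
h <- make_text_collector("s")
h$startElement("s", NULL)
h$text("auch der Schulleiter steht daf")
h$text("\u00fcr zur Verf\u00fcgung.")
h$endElement("s")
print(h$result())
```

With the real parser, the three handlers could be passed as xmlEventParse(file, handlers = h[c("startElement", "endElement", "text")], saxVersion = 1) and the text read back afterwards with h$result(). This addresses only the splitting; if the characters themselves still come back garbled, the locale and library versions remain the more likely suspects.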


Do you have any ideas how to solve this problem?

I cannot use the xmlParse() function because I need the SAX 
functionality of xmlEventParse(). I also tried reading the file with 
xmlEventParse() directly (with asText = F). No changes...


Thanks a lot,
Sascha W.
