Manuel Collado wrote:
> Using XXE Personal Edition 4.1.0 on WindowsXP SP3 Spanish.
> 
> A problem arises when using external tools to make some complex
> transformations of XHTML document fragments. For instance, syntax
> highlighting of program listings.
> 
> The XHTML configuration has been customized and now includes a command
> for processing the selection with an external AWK script:
> 
>    <command name="awk">
>      <macro>
>        <sequence>
>          <command name="run" parameter='"%C\awk" "%0" "%F"' />

The "run" command captures the bytes that the external program prints on
stdout. If these bytes represent text, they must be encoded in the
native encoding of your platform (windows-1252).

The problem is that awk prints UTF-8 bytes on stdout, while the "run"
command interprets them as windows-1252 bytes.
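
To illustrate the mismatch outside XXE, here is a minimal Python sketch.
It is not part of XXE, and the sample string is an assumption (an
inverted question mark stands in for any non-ASCII character). It
encodes text as UTF-8 and then decodes the same bytes as windows-1252:

```python
# Illustration only: what happens when UTF-8 bytes are read as windows-1252.
# The sample text is hypothetical; other non-ASCII characters behave similarly.
def misread_utf8_as_cp1252(text: str) -> str:
    utf8_bytes = text.encode("utf-8")     # the bytes the external tool writes
    return utf8_bytes.decode("cp1252")    # the same bytes read as windows-1252

original = "¿Cantidad?"
mangled = misread_utf8_as_cp1252(original)

# ascii() escapes non-ASCII characters, so the output is console-independent.
print(ascii(original))
print(ascii(mangled))
```

Each non-ASCII character occupies two or more bytes in UTF-8, so it
comes back as two or more windows-1252 characters, which matches the
extra character seen in the pasted text.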



>          <command name="paste" parameter="to %_" />

%_ is a Java string (that is, a sequence of 16-bit chars). There is no
concept of encoding at this stage.

If %_ contains a string starting with "<?xml version="1.0"
encoding="UTF-8"?>...", then encoding="UTF-8" is ignored because such a
specification makes no sense in this context.



>        </sequence>
>      </macro>
>    </command>
> 
> It works OK for text fragments, but not for elements. After selecting a 
> simple paragraph, the exported %F file looks like:
> 
>    <?xml version="1.0" encoding="UTF-8"?>
>    <p
>    xmlns="http://www.w3.org/1999/xhtml"
>    xmlns:ns="http://www.w3.org/1999/xhtml"
>    >WriteString("??Cantidad ?");</p
>    >
> 
> If the external transformation does nothing and just reproduces its
> input, non-ASCII characters are mangled. The original paragraph:
> 
>     WriteString("?Cantidad ?");
> 
> is changed into
> 
>     WriteString("??Cantidad ?");
> 
> The output of the external command keeps the original <?xml..>
> declaration, but it seems that XXE ignores it, or at least ignores the
> encoding specification, and pastes the text nodes as if they were
> encoded in the default platform encoding (windows-1252 ~= latin1).
> 
> Instead of using awk, the problem can probably be reproduced by any 
> external utility that just copies the input to the output.
> 
> My workaround is to externally convert the data from UTF-8 to 
> ISO-8859-1, but this is just a dirty hack, probably unreliable.
> 
> Is this a bug? Please tell me if I'm missing something relevant.

No, it works as expected. You really need to convert UTF-8 to
windows-1252 to force awk to print windows-1252 bytes on stdout.
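
One possible way to do that conversion is sketched below in Python. This
is an assumption about how such a step could look, not an XXE feature:
the function name utf8_to_cp1252 is hypothetical, and the choice to turn
characters that windows-1252 cannot represent into numeric character
references is mine. An XML parser should read those references back
correctly in text content and attribute values.

```python
# Sketch of a re-encoding step for the external tool's output; not an XXE API.
def utf8_to_cp1252(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as windows-1252 bytes."""
    text = data.decode("utf-8")
    # Characters with no windows-1252 equivalent become numeric character
    # references such as &#8594; instead of raising an error.
    return text.encode("cp1252", errors="xmlcharrefreplace")

# Hypothetical sample: one character that windows-1252 can represent
# and one (the arrow) that it cannot.
sample = "<p>¿Cantidad? →</p>".encode("utf-8")
converted = utf8_to_cp1252(sample)
print(converted)
```

In a real filter the function would be applied to the bytes read from
the tool's output (for example sys.stdin.buffer.read()) and the result
written with sys.stdout.buffer.write(), so that no further text
encoding is applied on the way out.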


---
PS: No, there is no way to make the "run" command use an encoding other
than UTF-8 for "%F".







