Hi,

On Wed, 2006-02-15 at 08:50 -0500, Rob Richards wrote:
> After reading this thread and the comments in the bug report I have a 
> few questions/comments.
> 
> Kasimier Buchcik wrote:
> >>   To me the most logical would be to do surgery on your input stream
> >> you are modifying it by changing its encoding, you should then also 
> >> change or remove the encoding declaration of the xmlDecl if present.
> >>     
> > We are doing this in our Delphi DOM-wrapper and lxml does it as well.
> > I guess PHP does something similar.
> >
> > Since in Delphi we defined the DOMString to be little-endian with
> > no BOM, we currently do the following if parsing a DOMString: 
> >   
> PHP doesn't play around with encoding or even implement a DOMString in 
> the DOM extension. If any special encoding needs to be handled using a 
> string it's up to the user to encode it as needed. The specified 
> document encoding or BOM is what is used to determine encoding as I 

This is not restricted to parsing of a DOMString.

With the DOM Load & Save module you can override the encoding
declaration of the XML entity via the LSInput.encoding property:

"For other sources of input [other than DOMString], an encoding
specified by means of this attribute will override any encoding
specified in the XML declaration or the Text declaration, or an encoding
obtained from a higher level protocol, such as HTTP [IETF RFC 2616]."

http://www.w3.org/TR/2003/CR-DOM-Level-3-LS-20031107/load-save.html#LS-LSInput-encoding

> really don't agree with overriding encoding and haven't heard any 
> complaints yet.

Then PHP doesn't use (hasn't implemented) the LS module.

For LSInput.stringData (which is of type DOMString) it reads:
"String data to parse. If provided, this will always be treated as a
sequence of 16-bit units (UTF-16 encoded characters)."

http://www.w3.org/TR/2004/REC-DOM-Level-3-LS-20040407/load-save.html#LS-LSInput-stringData
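
To make the two quoted rules concrete, here is a minimal sketch in Python.
It is only an illustration, not code from libxml2, lxml or any DOM
implementation: the function name and its parameters are hypothetical
stand-ins for the LSInput properties, and what happens when no encoding
is available at all (autodetection) is outside what the quotes cover, so
the sketch just returns None in that case.

```python
from typing import Optional


def effective_input_encoding(
    from_string_data: bool,                   # True if the source is LSInput.stringData
    ls_input_encoding: Optional[str] = None,  # stand-in for LSInput.encoding
    declared_encoding: Optional[str] = None,  # encoding from the XML/Text declaration
) -> Optional[str]:
    """Return the encoding a parser would assume under the quoted rules."""
    if from_string_data:
        # stringData is a DOMString: always a sequence of 16-bit units.
        return "UTF-16"
    if ls_input_encoding:
        # For other sources, LSInput.encoding overrides the declaration.
        return ls_input_encoding
    # No override given: fall back to the declared encoding (may be None).
    return declared_encoding


if __name__ == "__main__":
    # Byte input declaring ISO-8859-1, but overridden by the caller.
    print(effective_input_encoding(False, "UTF-8", "ISO-8859-1"))
    # String input: the declaration is ignored.
    print(effective_input_encoding(True, None, "ISO-8859-1"))
```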

> I do have a question on Kasimier's latest comment in the bug report 
> about keeping any specified encoding of the document. If this value is 
> not kept, then what encoding is used when the document is serialized and 
> not explicitly passed to the save functions? Would it use the overriding 
> value rather than the original one specified in the XMLDecl?

If one is using the Load & Save module [1], then this is defined in:
http://www.w3.org/TR/2004/REC-DOM-Level-3-LS-20040407/load-save.html#LS-LSSerializer-write

So the sequence of obtaining the encoding for serialization is:
1) LSOutput.encoding
2) Document.inputEncoding
3) Document.xmlEncoding

... with a fallback to UTF-8 if none of the above is specified.
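
The lookup order above can be sketched as a small helper. Again this is
only an illustration in Python, not the API of any real library: the
function name is made up, and its three parameters are hypothetical
stand-ins for LSOutput.encoding, Document.inputEncoding and
Document.xmlEncoding.

```python
from typing import Optional


def pick_serialization_encoding(
    output_encoding: Optional[str] = None,  # stand-in for LSOutput.encoding
    input_encoding: Optional[str] = None,   # stand-in for Document.inputEncoding
    xml_encoding: Optional[str] = None,     # stand-in for Document.xmlEncoding
) -> str:
    """Return the first encoding that is set, in the order listed above."""
    for candidate in (output_encoding, input_encoding, xml_encoding):
        if candidate:
            return candidate
    # None of the three is specified: fall back to UTF-8.
    return "UTF-8"


if __name__ == "__main__":
    # Nothing passed to the save call; the document was parsed with an
    # overriding encoding that differs from its XML declaration.
    print(pick_serialization_encoding(None, "UTF-16", "ISO-8859-1"))
    # Nothing known at all.
    print(pick_serialization_encoding())
```

So in this model, if the encoding was overridden at parse time and
nothing is passed when saving, the overriding value (inputEncoding) wins
over the one from the XMLDecl.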

> In any event whatever change is made to this I doubt it will have any 
> impact on my side in terms of breakage since I don't muck around with 
> encoding while parsing and use different I/O routines in the event any 
> changes are made here for some sort of encoding detection (i.e. http 
> headers, etc..).
> 
> Rob

Regards,

Kasimier


_______________________________________________
xml mailing list, project page  http://xmlsoft.org/
xml@gnome.org
http://mail.gnome.org/mailman/listinfo/xml
