Hi,

On Wed, 2006-02-15 at 08:50 -0500, Rob Richards wrote:
> After reading this thread and the comments in the bug report I have a
> few questions/comments.
>
> Kasimier Buchcik wrote:
> >> To me the most logical would be to do surgery on your input stream
> >> you are modifying it by changing its encoding, you should then also
> >> change or remove the encoding declaration of the xmlDecl if present.
> >>
> > We are doing this in our Delphi DOM-wrapper and lxml does it as well.
> > I guess PHP does something similar.
> >
> > Since in Delphi we defined the DOMString to be little-endian with
> > no BOM, we currently do the following if parsing a DOMString:
> >
> PHP doesn't play around with encoding or even implement a DOMString in
> the DOM extension. If any special encoding needs to be handled using a
> string it's up to the user to encode it as needed. The specified
> document encoding or BOM is what is used to determine encoding as I
This is not restricted to parsing of a DOMString. With the DOM Load &
Save module you can override the encoding declaration of the XML entity
via the LSInput.encoding property:

"For other sources of input [other than DOMString], an encoding
specified by means of this attribute will override any encoding
specified in the XML declaration or the Text declaration, or an
encoding obtained from a higher level protocol, such as HTTP
[IETF RFC 2616]."
http://www.w3.org/TR/2003/CR-DOM-Level-3-LS-20031107/load-save.html#LS-LSInput-encoding

> really dont agree with overriding encoding and haven't heard any
> complaints yet.

Then PHP doesn't use (hasn't implemented) the LS module.

For LSInput.stringData (which is of type DOMString) it reads:

"String data to parse. If provided, this will always be treated as a
sequence of 16-bit units (UTF-16 encoded characters)."
http://www.w3.org/TR/2004/REC-DOM-Level-3-LS-20040407/load-save.html#LS-LSInput-stringData

> I do have a question on Kasimier's latest comment in the bug report
> about keeping any specified encoding if the document. If this value is
> not kept, then what encoding is used when the document is serialized and
> not explicitly passed to the save functions? Would it use the overriding
> value rather than the origional one specified in the XMLDecl?

If one is using the Load & Save module [1], then this is defined in:
http://www.w3.org/TR/2004/REC-DOM-Level-3-LS-20040407/load-save.html#LS-LSSerializer-write

So the encoding for serialization is looked up in this order:
1) LSOutput.encoding
2) Document.inputEncoding
3) Document.xmlEncoding

... with a fallback to UTF-8 if none of the above is specified.

> In any event whatever change is made to this I doubt it will have any
> impact on my side in terms of breakage since I don't muck around with
> encoding while parsing and use different I/O routines in the event any
> changes are made here for some sort of encoding detection (i.e. http
> headers, etc..).
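To make the two LS rules quoted above a bit more concrete, here is a
minimal Python sketch. It is only an illustration: the function and
parameter names are made up for this mail and are not part of libxml2,
PHP or any DOM binding, and "detected_encoding" is an assumption that
stands for whatever the parser would otherwise determine (declaration,
BOM, HTTP header).

```python
from typing import Optional


def input_encoding(is_string_data: bool,
                   ls_input_encoding: Optional[str],
                   detected_encoding: Optional[str]) -> Optional[str]:
    """Hypothetical helper: which encoding is used to decode the input.

    detected_encoding is an assumed placeholder for the value the parser
    would otherwise use (XML/Text declaration, BOM, higher-level protocol).
    """
    if is_string_data:
        # LSInput.stringData is always treated as 16-bit units (UTF-16).
        return "UTF-16"
    # For other sources, LSInput.encoding overrides everything else.
    return ls_input_encoding or detected_encoding


def output_encoding(ls_output_encoding: Optional[str],
                    document_input_encoding: Optional[str],
                    document_xml_encoding: Optional[str]) -> str:
    """Hypothetical helper: lookup order for the serialization encoding."""
    for candidate in (ls_output_encoding,
                      document_input_encoding,
                      document_xml_encoding):
        if candidate:
            return candidate
    # None of the three is specified: fall back to UTF-8.
    return "UTF-8"
```

With this, a document that was parsed with an overriding encoding and
then saved without an explicit LSOutput.encoding would be written in
Document.inputEncoding, not in the encoding from the original XMLDecl.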
>
> Rob

Regards,

Kasimier