DO NOT REPLY [Bug 14378] - Error parsing XML document with a leading white space character.

bugzilla Sun, 10 Nov 2002 20:34:09 -0800

DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=14378>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.


http://nagoya.apache.org/bugzilla/show_bug.cgi?id=14378

Error parsing XML document with a leading white space character.

[EMAIL PROTECTED] changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|REOPENED                    |RESOLVED
         Resolution|                            |INVALID



------- Additional Comments From [EMAIL PROTECTED]  2002-11-11 04:33 -------
This is part of the XML specification and all conforming processors must have 
this behavior.

The spec says:

2.1 Well-Formed XML Documents

[Definition: A textual object is a well-formed XML document if:]

1. Taken as a whole, it matches the production labeled document.
2. ...

Productions:

[1]    document    ::=    prolog element Misc* 
[22]    prolog     ::=    XMLDecl? Misc* (doctypedecl Misc*)? 
[23]    XMLDecl    ::=    '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>' 

As you can see, there is no whitespace permitted in these productions before 
the XML declaration.  An XML declaration, if present, must be the very first 
thing in the document and the very beginning of the XML declaration is the 
literal character sequence '<?xml'.
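
The rule is easy to check against any conforming parser. The following is 
a minimal sketch that uses Python's bundled expat bindings rather than Xerces, 
so the wording of the error will differ from what Xerces reports; the helper 
name try_parse is made up for this illustration.

```python
import xml.parsers.expat


def try_parse(data: bytes):
    """Return None if data is well-formed, else the parser's error text."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)  # True marks this as the final chunk
        return None
    except xml.parsers.expat.ExpatError as exc:
        return str(exc)


# Declaration at the very start: accepted.
print(try_parse(b'<?xml version="1.0"?><a/>'))
# One leading space before the declaration: rejected.
print(try_parse(b' <?xml version="1.0"?><a/>'))
# Leading space with no declaration at all: accepted, since Misc* allows S.
print(try_parse(b' <a/>'))
```

Note the third case: leading whitespace is only a problem when an XML 
declaration follows it.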

The rationale for this is well described in Appendix F:

"The XML encoding declaration functions as an internal label on each entity, 
indicating which character encoding is in use. Before an XML processor can read 
the internal label, however, it apparently has to know what character encoding 
is in use--which is what the internal label is trying to indicate. In the 
general case, this is a hopeless situation. It is not entirely hopeless in XML, 
however, because XML limits the general case in two ways: each implementation 
is assumed to support only a finite set of character encodings, and the XML 
encoding declaration is restricted in position and content in order to make it 
feasible to autodetect the character encoding in use in each entity in normal 
cases."

Therefore, the restriction that the XML declaration appear first in the 
document is quite intentional.
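
To make the autodetection argument concrete, here is a rough sketch of the 
kind of byte sniffing Appendix F describes. It is not how Xerces implements 
it; the function name sniff_encoding and the returned labels are invented for 
this example, and only the byte signatures listed in Appendix F are covered.

```python
def sniff_encoding(head: bytes) -> str:
    """Guess the encoding family from the first bytes of a document."""
    signatures = [
        # With a byte order mark. The four-byte marks come first because
        # FF FE is a prefix of FF FE 00 00.
        (b"\x00\x00\xfe\xff", "UCS-4 big-endian (BOM)"),
        (b"\xff\xfe\x00\x00", "UCS-4 little-endian (BOM)"),
        (b"\xef\xbb\xbf", "UTF-8 (BOM)"),
        (b"\xfe\xff", "UTF-16 big-endian (BOM)"),
        (b"\xff\xfe", "UTF-16 little-endian (BOM)"),
        # Without a byte order mark: the encoded form of '<?xm' or '<?'.
        (b"\x00\x00\x00<", "UCS-4 big-endian"),
        (b"<\x00\x00\x00", "UCS-4 little-endian"),
        (b"\x00<\x00?", "UTF-16 big-endian"),
        (b"<\x00?\x00", "UTF-16 little-endian"),
        (b"<?xm", "ASCII-compatible; read the encoding declaration"),
        (b"\x4c\x6f\xa7\x94", "EBCDIC family; read the encoding declaration"),
    ]
    for signature, label in signatures:
        if head.startswith(signature):
            return label
    return "no signature; assume UTF-8"


decl = '<?xml version="1.0" encoding="UTF-16BE"?><a/>'
print(sniff_encoding(decl.encode("utf-16-be")))          # UTF-16 big-endian
print(sniff_encoding((" " + decl).encode("utf-16-be")))  # no signature; assume UTF-8
```

With a single leading space the first bytes no longer match any signature, 
so the processor has nothing to go on before it tries to read the declaration.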

As to having a clearer error message, the confusion comes from the syntax of 
the XML declaration being very similar to the syntax of a processing 
instruction.  There is no code to recognize an XML declaration anywhere other 
than at the very start of the document entity because that is the only place 
where it is allowed to occur.  What is legal at that point in the document is 
a processing instruction, so the parser sees the '<?' and dispatches to the 
code that parses the syntax of a processing instruction, which is:

[16]    PI    ::=    '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>' 
[17]    PITarget    ::=    Name - (('X' | 'x') ('M' | 'm') ('L' | 'l')) 

As you can see, the processing instruction target is specifically prohibited 
from being 'xml' and an appropriate error message is emitted that reflects that 
restriction.
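
The dispatch described above can be sketched roughly as follows. This is an 
illustration only, not Xerces code: classify_markup and its return strings are 
hypothetical, and the Name pattern is simplified to ASCII.

```python
import re

# Simplified, ASCII-only stand-in for the Name production (an assumption;
# the real production admits many more characters).
_NAME = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*")


def classify_markup(doc: str, pos: int) -> str:
    """Classify the '<?' construct that starts at doc[pos]."""
    if not doc.startswith("<?", pos):
        raise ValueError("expected '<?' at the given position")
    match = _NAME.match(doc, pos + 2)
    if match is None:
        return "error: missing PI target"
    target = match.group(0)
    # Only at offset 0 is the literal target 'xml' an XML declaration.
    if target == "xml" and pos == 0:
        return "XML declaration"
    # Anywhere else it is parsed as a PI, and production [17] rejects it.
    if target.lower() == "xml":
        return "error: PI target must not match [xX][mM][lL]"
    return f"processing instruction '{target}'"


print(classify_markup('<?xml version="1.0"?><a/>', 0))
print(classify_markup(' <?xml version="1.0"?><a/>', 1))
print(classify_markup('<?xml-stylesheet href="a.xsl"?><a/>', 0))
```

The second call is the case from this bug: the declaration starts at offset 1, 
so it is treated as a processing instruction with a reserved target.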

In any case, continually reopening this same defect is not the best way to 
discuss such questions about the XML spec or the behavior of Xerces as an 
implementation of that spec.  There are mailing lists for such discussions, 
some of which are already receiving copies of this exchange from Bugzilla, 
which is generally considered quite inappropriate.

