Hi Matt,

I can reproduce the behaviour you describe, and the reason is this:  When
the parser is reading UTF-8 from some source, it reads the input in chunks
for performance and transcodes each chunk into an internal buffer.  The
routines that look through the markup performing tokenization,
well-formedness checking etc. operate on this internal buffer, where
everything's already in UTF-16.  The error reporting routines report
positions relative to where those markup routines currently are, since
that's where most problems arise and markup is the natural domain of an
XML parser.

When the markup routines have finished the XML declaration, they'll ask for
more text, which will cause the transcoding routines to go merrily along
their way to fill the requisite buffer.  When the transcoder finds
something it can't stomach it complains, but the error reporting logic only
knows where the parser left off looking for markup, so that is the position
you see: the end of the XML declaration rather than the offending byte.
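
To make the mechanism concrete, here is a toy model in Python.  It is not
Xerces-C code: the function name, the chunk size and the two-phase
structure are assumptions made purely for illustration, and it only models
the first buffer refill after the XML declaration.  The sample document
assumes the offending Latin1 character is a degree sign (byte 0xB0).

```python
import codecs


def toy_parse(data: bytes, chunk_size: int = 4096):
    """Toy model: return None if the first chunk transcodes cleanly,
    otherwise the (line, column) the error reporter would hand back."""
    # Phase 1: the markup scanner consumes the XML declaration and keeps
    # track of its own line/column while doing so.
    # (Assumes the input starts with an ASCII XML declaration.)
    decl_end = data.index(b"?>") + 2
    line, col = 1, 1
    for ch in data[:decl_end].decode("ascii"):
        if ch == "\n":
            line, col = line + 1, 1
        else:
            col += 1

    # Phase 2: the scanner asks for more text, so the transcoder converts
    # the next chunk in one go.  The scanner position is NOT advanced while
    # this happens, so it still points just past the declaration.
    decoder = codecs.getincrementaldecoder("utf-8")()
    chunk = data[decl_end:decl_end + chunk_size]
    try:
        decoder.decode(chunk)
    except UnicodeDecodeError:
        # The only position known here is the scanner's.
        return (line, col)
    return None


# Hypothetical reconstruction of the reported file; 0xB0 is an assumption.
bad_doc = (b'<?xml version="1.0" encoding="utf-8" ?>\n'
           b'<tag>\nTemperature 90\xb0F\n</tag>\n')
print(toy_parse(bad_doc))
```

With this input the model reports (1, 40), i.e. the first column after the
XML declaration, even though the bad byte sits on line 3.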

So yes, this is a bug.  But it wouldn't be all that easy to fix, especially
for transcoders that we don't own.  So I'm afraid the probability of this
being addressed in the near future isn't high.
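
In the meantime, the real location of the bad byte can be found outside
the parser.  The following is a minimal sketch in Python, not part of
Xerces-C; the function name is hypothetical, the column is counted in
characters starting from 1, and the sample bytes again assume the stray
Latin1 character is 0xB0.

```python
def first_invalid_utf8(data: bytes):
    """Return (line, column, byte_value) of the first byte that is not
    valid UTF-8, or None if the whole input decodes cleanly."""
    try:
        data.decode("utf-8")
        return None
    except UnicodeDecodeError as err:
        offset = err.start                      # byte offset of the problem
        prefix = data[:offset]                  # everything before it is valid
        line = prefix.count(b"\n") + 1
        line_start = prefix.rfind(b"\n") + 1    # 0 if there is no newline yet
        # Count characters, not bytes, so multi-byte text earlier on the
        # line does not skew the column.
        col = len(data[line_start:offset].decode("utf-8")) + 1
        return (line, col, data[offset])


# Hypothetical reconstruction of the reported file; 0xB0 is an assumption.
sample = (b'<?xml version="1.0" encoding="utf-8" ?>\n'
          b'<tag>\nTemperature 90\xb0F\n</tag>\n')
print(first_invalid_utf8(sample))
```

For the sample above this yields line 3, column 15, which is where the
invalid character actually is.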

You might want to file a Bugzilla report to keep this on the radar,
in case anyone ever has the cycles to give it a serious run.

Cheers,
Neil
Neil Graham
XML Parser Development
IBM Toronto Lab
Phone:  905-413-3519, T/L 969-3519
E-mail:  [EMAIL PROTECTED]




                                                                                       
                                                
From:     "Matt Nemenman" <[EMAIL PROTECTED]>
Date:     09/23/2003 08:23 PM
To:       [EMAIL PROTECTED]
cc:
Subject:  Possible bug: invalid byte 1 (...) of a 1-byte sequence.
Please respond to xerces-c-dev
                                                
                                                                                       
                                                
                                                                                       
                                                




Hi,

While trying to parse the file below (also in attachment), I got an
error at line 1, position 40: "An exception occurred!
Type:UTFDataFormatException, Message:invalid byte 1 () of a 1-byte
sequence."

<?xml version="1.0" encoding="utf-8" ?>
<tag>
Temperature 90F
</tag>


The file does indeed contain an invalid UTF-8 character (a Latin1
character), but that character is at line 3, position 15, nowhere near
the reported location. I have seen this problem quite often: an invalid
character error is reported at the very end of the XML declaration
(line 1), even if the invalid character is a thousand lines down the file.

Am I missing something, or is it a bug?

Thanks a lot,

      -- Matt


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

#### f has been removed from this note on September 23 2003 by Neil Graham


