On Mon, Sep 6, 2010 at 10:30 AM, Walter Underwood <wun...@wunderwood.org> wrote:
> On Sep 6, 2010, at 1:49 AM, Lance Norskog wrote:
>
>> 1) The XML file must include the UTF-8 encoding metadata in the first line.
>
> If it requires that, it isn't a legal XML parser. The encoding declaration is
> optional and it defaults to UTF-8.

Correct, the default is UTF-8.

And actually... the charset *inside* the XML is currently ignored. We pay
attention to the charset from the HTTP Content-type, and default to UTF-8
if that's not set.

It would probably be better if we passed the raw byte stream to the XML
parser if the charset is missing in Content-type (so it could presumably
snoop the XML for the right charset), but it's never been a high priority
issue.

-Yonik
http://lucenerevolution.org Lucene/Solr Conference, Boston Oct 7-8
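
To make the difference concrete, here is a rough Python sketch of the two behaviours described above. It is not Solr's code: the function names (`charset_from_content_type`, `parse_current`, `parse_proposed`) are made up for illustration, and Python's standard-library XML parser stands in for whatever parser the server actually uses.

```python
import xml.etree.ElementTree as ET
from email.message import Message


def charset_from_content_type(content_type):
    """Return the charset parameter of a Content-type header, or None."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, None when absent


def parse_current(body, content_type=None):
    """Behaviour as described: the HTTP charset wins, otherwise UTF-8.

    The encoding declaration inside the XML is overridden either way.
    """
    charset = charset_from_content_type(content_type) or "utf-8"
    return ET.fromstring(body, parser=ET.XMLParser(encoding=charset))


def parse_proposed(body, content_type=None):
    """Suggested behaviour: with no HTTP charset, hand over the raw bytes.

    The parser can then read the XML declaration itself, and falls back
    to UTF-8 when there is none.
    """
    charset = charset_from_content_type(content_type)
    if charset is None:
        return ET.fromstring(body)
    return ET.fromstring(body, parser=ET.XMLParser(encoding=charset))


if __name__ == "__main__":
    # Latin-1 bytes whose only charset hint is the XML declaration.
    body = ('<?xml version="1.0" encoding="ISO-8859-1"?>'
            "<doc>caf\u00e9</doc>").encode("iso-8859-1")
    print(parse_proposed(body).text)
    print(parse_current(body, "text/xml; charset=ISO-8859-1").text)
```

With a document like the one in the `__main__` block, the first approach only decodes correctly when the client sends the charset in Content-type; the second also works when the header omits it, because the parser sees the declaration in the raw bytes.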