Just a little refresher on how input sources are used in the C++ parser, for those new to Xerces-C. It also covers the recent changes, for those who have been using it.

The parser works completely in terms of 'input sources', as defined by the SAX API. An input source is a very abstract concept that just represents some source of XML data to be parsed. Where that data is and how it's gotten to is something that the parser doesn't care about. It just wants to ask for data from an XML entity, so that it can parse it. The input source presents the '10,000 foot level' view of XML data that the parser cares about.

The input source has a few jobs:

1) It holds the SYSTEM and PUBLIC ids that represent the XML data source. These may be parsed from an external entity declaration in the DTD (e.g. <!ENTITY FooBar SYSTEM "http://booger.com/path/someentity.xml">), or they may be provided programmatically by the client program.

2) It is the factory for input streams that can read data from the source. The parser works in terms of input streams; for the C++ parser, that means the abstract class BinInputStream. When the parser wants to get to the data represented by an input source, it asks the input source object to create a new stream that can access that data. The parser doesn't care about the details of how that data gets read. It just knows that it has a stream that is capable of reading it.

3) It allows you to force the encoding. Usually, the parser figures out what encoding an XML entity is in by poking around in the XML data. It looks for key beginning sequences and for the encoding="" statement to figure this out. However, if you know you are parsing XML with a particular encoding, you can force the encoding up front by calling setEncoding() on the input source before you give it to the parser to parse. The parser will then skip any internal auto-sensing of the encoding and just take your word for it.
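To make the three jobs above concrete, here is a minimal, self-contained sketch of the contract. It is only loosely modeled on InputSource and BinInputStream: everything in namespace sketch (ByteStream, Source, MemStream, MemSource and their member names) is a hypothetical stand-in written for illustration, not the actual Xerces-C classes or signatures.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <string>
#include <utility>

// Simplified stand-ins for illustration only. These are NOT the Xerces-C
// classes, and every name in namespace sketch is hypothetical.
namespace sketch {

// Plays the role of BinInputStream: hands out raw bytes, nothing more.
class ByteStream {
public:
    virtual ~ByteStream() = default;
    // Copies up to maxToRead bytes into toFill; returns 0 at end of data.
    virtual std::size_t readBytes(unsigned char* toFill, std::size_t maxToRead) = 0;
};

// Plays the role of InputSource: ids, stream factory, optional forced encoding.
class Source {
public:
    virtual ~Source() = default;

    // Job 1: hold the SYSTEM and PUBLIC ids.
    const std::string& systemId() const { return systemId_; }
    const std::string& publicId() const { return publicId_; }

    // Job 2: create a new stream over the data each time one is requested.
    virtual std::unique_ptr<ByteStream> makeStream() const = 0;

    // Job 3: let the caller force the encoding. An empty name means
    // "not forced", so the consumer should auto-sense instead.
    void setEncoding(const std::string& name) { encoding_ = name; }
    const std::string& encoding() const { return encoding_; }

protected:
    explicit Source(std::string systemId, std::string publicId = "")
        : systemId_(std::move(systemId)), publicId_(std::move(publicId)) {}

private:
    std::string systemId_;
    std::string publicId_;
    std::string encoding_;
};

// A stream over a caller-owned memory buffer.
class MemStream : public ByteStream {
public:
    MemStream(const unsigned char* data, std::size_t size)
        : data_(data), size_(size), pos_(0)
    {
        assert(data != nullptr || size == 0);
    }

    std::size_t readBytes(unsigned char* toFill, std::size_t maxToRead) override
    {
        const std::size_t count = std::min(maxToRead, size_ - pos_);
        if (count > 0)
            std::memcpy(toFill, data_ + pos_, count);
        pos_ += count;
        return count;
    }

private:
    const unsigned char* data_;
    std::size_t size_;
    std::size_t pos_;
};

// A source wrapping a memory buffer. Each makeStream() call returns a
// fresh stream that starts reading from the beginning of the buffer.
class MemSource : public Source {
public:
    MemSource(const unsigned char* data, std::size_t size, std::string systemId)
        : Source(std::move(systemId)), data_(data), size_(size) {}

    std::unique_ptr<ByteStream> makeStream() const override
    {
        return std::unique_ptr<ByteStream>(new MemStream(data_, size_));
    }

private:
    const unsigned char* data_;
    std::size_t size_;
};

} // namespace sketch
```

The point of the design is that the consumer only ever sees Source and ByteStream. Where the bytes come from is entirely the business of the derived classes.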
Forcing the encoding might be desirable if you know the file has been transcoded such that the encoding="" line is no longer correct.

Since the base InputSource class is abstract, you must create a derived class that knows how to handle some sort of data source. The parser (as of the upcoming 3.1.0/1.1.0 release) comes with:

1 - LocalFileInputSource - For files on the local file system. This is the optimized way to parse local files.
2 - MemBufInputSource - For parsing from in-memory buffers (which you might have built programmatically or just read into memory from some other source).
3 - URLInputSource - For parsing from some URL-based source.

In many cases, you will just call the parser like this:

    myParser.parse("SomePathOrURL");

Since the parser works only in terms of input sources, this type of API is merely a convenience for you. Internally, the parser must try to figure out whether this is a local file or a URL and create the correct type of input source to use. If you know what type of entity you are parsing, it will generally be more efficient to create the input source yourself, to avoid this ambiguity. What the parser does to figure this out itself is:

1 - Try to parse it as a URL.
2 - If this works, assume it's a URL and create a URLInputSource.
3 - Else, assume it's a file and give it our best shot with a LocalFileInputSource.

When any XML entity references another entity, it can give either a fully qualified path or URL, or a partially qualified path or URL. Fully qualified SYSTEM ids are taken at face value and used as is. Relative ids are assumed to be relative to the path of the entity which references them. So, if "foo.xml" references "../bar.xml" which references "../baz.xml", then baz.xml will be two directory levels above foo.xml. This is because ../bar.xml is relative to foo.xml, so it's one level up, and ../baz.xml is relative to bar.xml, so it's up another level.

* Note that this did not work correctly in versions prior to the upcoming 3.1.0/1.1.0 release.
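The relative-id rule described above can be sketched with a small helper. This is only an illustration and makes several assumptions: resolveSystemId is a hypothetical function (not something the parser exposes), it handles local file paths only (no URLs), and it leans on C++17 std::filesystem for the path arithmetic rather than on anything in the parser itself.

```cpp
#include <cassert>
#include <filesystem>
#include <string>

namespace fs = std::filesystem;

// Hypothetical helper, not a Xerces-C function: resolve a SYSTEM id against
// the path of the entity that contains the reference.
std::string resolveSystemId(const std::string& referrer, const std::string& systemId)
{
    assert(!systemId.empty());  // an empty id has nothing to resolve

    const fs::path ref(systemId);

    // Fully qualified ids are taken at face value.
    if (ref.is_absolute())
        return ref.lexically_normal().generic_string();

    // Relative ids are relative to the directory of the referencing entity,
    // not to the directory of the primary document.
    const fs::path baseDir = fs::path(referrer).parent_path();
    return (baseDir / ref).lexically_normal().generic_string();
}
```

Chaining two calls mirrors the foo/bar/baz example: resolve "../bar.xml" against the path of foo.xml, then resolve "../baz.xml" against the result of that first call. Resolving "../baz.xml" directly against foo.xml instead lands only one level up, which is not where the rule above puts baz.xml.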
In previous versions, relative references were incorrectly resolved against the primary document in all cases. So, though this is a change in semantics, it is in the direction of correctness.

You can play many tricks with input sources. Since writing your own input source allows you to insert your own types of binary input streams into the parser, it allows you to get data into the parser from any source which can supply that data. You could have an input source for a database that extracts a record into memory and returns a memory input stream to parse it. You could have an input source which gets a mail message from another machine, pulls some XML payload from it, and returns a memory input stream to parse it. You could even have an input source that effectively parses a continuous stream of data which is broken into legal chunks of XML. To do that, your stream class recognizes some embedded end-of-entity marker and stops reading there, and your input source just keeps handing out a handle to the same source of data to each successive input stream requested by the parser.

Be aware that, as the code stands right now, if you force the encoding on an entity, all internal smarts about the encoding are skipped. So, if you force the encoding to UTF-16LE, and there is a BOM, the parser won't try to skip it, and the parse will fail. So currently, if you do this, skip the BOM yourself. If there is a BOM, and you don't know the endianness, check the BOM and set the appropriate encoding for that endianness. The reasoning here is that by forcing the encoding, you are telling the parser to stay out of it. If you installed your own transcoder, and set an encoding of "foobar", the parser wouldn't have any idea what that meant or whether an 0xFFFE at the start of the file is a BOM or not. Any attempt to second-guess by special-casing some forced encoding names would only be a partial solution.
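As a rough sketch of the "check the BOM yourself" advice above, the helper below looks at the first two bytes of a buffer and reports which UTF-16 flavor they indicate and how many bytes to skip. The names BomInfo and sniffUtf16Bom are made up for this example and are not part of the parser. It only considers UTF-16, so a UTF-32LE file (whose BOM also begins with FF FE) would be misreported.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical result type for this sketch.
struct BomInfo {
    std::string encoding;  // empty when no UTF-16 BOM was found
    std::size_t length;    // number of leading bytes to skip (0 or 2)
};

// Hypothetical helper: inspect the start of a buffer for a UTF-16 BOM.
// Only UTF-16 is considered; UTF-32 BOMs are not distinguished.
BomInfo sniffUtf16Bom(const unsigned char* data, std::size_t size)
{
    assert(data != nullptr || size == 0);

    if (size >= 2) {
        // U+FEFF stored little-endian is the byte pair FF FE.
        if (data[0] == 0xFF && data[1] == 0xFE)
            return {"UTF-16LE", 2};
        // U+FEFF stored big-endian is the byte pair FE FF.
        if (data[0] == 0xFE && data[1] == 0xFF)
            return {"UTF-16BE", 2};
    }
    return {"", 0};
}
```

If a BOM is found, the idea would be to hand the parser only the bytes after the first `length` bytes (for example through a MemBufInputSource over the remainder of the buffer) and to pass the reported name to setEncoding(). If no BOM is found, you have to know the encoding some other way or leave it to the parser's own auto-sensing.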
There is currently some discussion on this issue with one of our customers. If you have any thoughts or opinions on this, please post them.

----------------------------------------
Dean Roddey
Software Weenie
IBM Center for Java Technology - Silicon Valley
[EMAIL PROTECTED]