Two of the long-running shortcomings of the Xerces parser have been:

1) It does not handle multiply nested entities that use relative system ids (i.e. they all end up being relative to the main XML entity).

2) It does not support http or ftp type references to entities.

The changes just checked in address these issues. This document is just a heads-up as to what has changed in order to provide this new functionality.

1) File Movements - This is only semi-related, but I wanted to get this done before we make a reference release, since we have to live with that for a while. The two input source derivatives (for files and memory buffers) that used to be in internal/ were moved to framework/, where they should have been all along. This hopefully, finally, gets everything where it really should be.

2) The URL class was substantially reworked. It now does reasonably full-featured parsing of file, http, and ftp type URLs, and breaks them into their correct constituent parts. It will surely get some improvements in the future, but it's good enough for now to get the job done.

3) The NetAccessor abstraction is now wired into the system. This abstraction allows you to implement and plug in an object that will support http, ftp, and non-local file type URLs. The parser is now wired up to use this abstraction. So, if you implement this class and the appropriate type of BinInputStream to go with it, and install it during your platform init, the parser will use it, and everything should just work to have the parser read data from your socket-based stream. More on this below.

4) There is a new LocalFileInputSource if you want to provide an input source that is really a local file. You can always just pass a file name, but in the entity redirection stuff you need to be able to return an input source. So this class provides a way to do that.
Also, if you want to force the encoding for an entity, you have to create an input source, set the encoding on it, and pass that in. So this provides a way to do that for a local file, as opposed to URLs, which are handled via URLInputSource.

5) Each entity on the entity stack now stores its full path. When a referenced entity is seen, a search is made up the entity stack to the last external entity, whose path is used as the base for completing a relative entity reference.

6) The getBasePath() method in the platform utilities was changed to getFullPath() and now has different semantics. Whereas before it gave back a completed path without the trailing file name, it now gives back the fully completed path with the file name still attached. So this will require a small change in each platform utility file.

7) A new method was added to the platform utilities called weavePaths(), which takes a base path and a (possibly) relative path and weaves them together. The code for this will probably be almost identical on most platforms. However, in order to provide maximum flexibility for all platforms, it is implemented per-platform. For many platforms, a simple ripoff of the code I did for the Win32 platform will be sufficient.

So, if you give us just a system id for the main XML entity, the code looks like this:

    void XMLScanner::scanDocument(  const   XMLCh* const    systemId
                                    , const bool            reuseValidator)
    {
        //
        //  First we try to parse it as a URL. If that fails, we assume it's
        //  a file and try it that way.
        //
        InputSource* srcToUse = 0;
        try
        {
            //
            //  Create a temporary URL. Since this is the primary document,
            //  it has to be fully qualified. If not, then assume we are just
            //  mistaking a file for a URL.
            //
            URL tmpURL(systemId);
            if (tmpURL.isRelative())
                ThrowXML(MalformedURLException, XML4CExcepts::URL_NoProtocolPresent);
            srcToUse = new URLInputSource(tmpURL);
        }

        catch(const MalformedURLException&)
        {
            srcToUse = new LocalFileInputSource(systemId);
        }

        catch(...)
        {
            // Just rethrow this, since it's not our problem
            throw;
        }

        Janitor<InputSource> janSrc(srcToUse);
        scanDocument(*srcToUse, reuseValidator);
    }

Note that for the primary entity, a URL must be fully qualified. If it's a local file, the LocalFileInputSource() constructor that we call will automatically complete the path if it's relative, thus creating a fully qualified path to begin with. All subsequent files or URLs are either fully qualified or relative to the path of the last external entity they were referenced from.

When we see a reference to an external entity in the source and have to parse it, we call the reader manager to get it to create a reader for us, passing it the id. The code looks like this:

    XMLReader* ReaderMgr::createReader( const   XMLCh* const        sysId
                                        , const XMLCh* const        pubId
                                        , const bool                xmlDecl
                                        , const XMLReader::RefFrom  refFrom
                                        , const XMLReader::Types    type
                                        , const XMLReader::Sources  source
                                        ,       InputSource*&       srcToFill)
    {
        // Create a buffer for expanding the system id
        XMLBuffer expSysId;

        //
        //  Allow the entity handler to expand the system id if they choose
        //  to do so.
        //
        if (fEntityHandler)
        {
            if (!fEntityHandler->expandSystemId(sysId, expSysId))
                expSysId.set(sysId);
        }
        else
        {
            expSysId.set(sysId);
        }

        // Call the entity resolver interface to get an input source
        srcToFill = 0;
        if (fEntityHandler)
        {
            srcToFill = fEntityHandler->resolveEntity
            (
                pubId
                , expSysId.getRawBuffer()
            );
        }

        //
        //  If they didn't create a source via the entity resolver, then we
        //  have to create one on our own.
        //
        if (!srcToFill)
        {
            LastExtEntityInfo lastInfo;
            getLastExtEntityInfo(lastInfo);

            try
            {
                URL urlTmp(lastInfo.systemId, expSysId.getRawBuffer());
                srcToFill = new URLInputSource(urlTmp);
            }

            catch(const MalformedURLException&)
            {
                // It's not a URL, so let's assume it's a local file name.
                srcToFill = new LocalFileInputSource
                (
                    lastInfo.systemId
                    , expSysId.getRawBuffer()
                );
            }
        }

        // Put a janitor on the input source
        Janitor<InputSource> janSrc(srcToFill);

        //
        //  Now call the other version with the input source that we have, and
        //  return the resulting reader.
        //
        XMLReader* retVal = createReader
        (
            *srcToFill
            , xmlDecl
            , refFrom
            , type
            , source
        );

        // Either way, we can release the input source now
        janSrc.orphan();

        // If it failed for any reason, then return zero.
        if (!retVal)
            return 0;

        // Give this reader the next available reader number and return it
        retVal->setReaderNum(fNextReaderNum++);
        return retVal;
    }

The primary difference is that in the latter, we get the last external entity info and use the system id of that entity as the base for the current entity. The URLInputSource and LocalFileInputSource classes have constructors that take either a fully qualified path or a base and a (possibly) relative path.

We don't know if we will have any implementation of the NetAccessor abstraction for the upcoming 1.1.0 (3.1.0 for XML4C) release. Perhaps we will try to do one based on LibWWW. But at least the abstraction is fully wired in, so you can implement one for yourself using local services in the meantime. For instance, creating one for Windows using WinInet would probably be relatively simple.

The abstraction works like this:

1) You define a derivative of NetAccessor and compile and link it in along with your platform driver file.

2) During your platform init, you create one of these and store it in the XMLPlatformUtils::fgNetAccessor static member.

3) When a URL is used as a system ID, a URLInputSource object is created. This object has a URL member to which the system id is given. If it is valid, it's parsed into its constituent parts.

4) When it's time to create a reader for this referenced entity, the input source is asked to create a stream for its source. In this case, it will just turn around and ask the URL to make a stream.
5) If it's file:/// or file://localhost/, then it's short-circuited and a file input stream is created intrinsically.

6) Else, if no net accessor object is installed, an exception is thrown with an unsupported protocol error.

7) If one is installed, then its makeNew() method is called. This is the only method it has, and it just makes a new binary input stream and returns it. The stream's class must derive from the BinInputStream class defined by the parser. The URL object is passed to this call, so you can easily get out the protocol, host, user, password, query, fragment, and other parts of it.

The implementation of the stream derivative is up to you. In most cases, you would probably want to provide a simple static API on your net accessor class to create a socket handle, read from it, and close it. You can then write your stream class in terms of this simple API. Or, since the implementation of your stream derivative lives in your implementation file, you can call directly to system services from the stream class. Whatever you want to do is fine, since it's your own implementation and only you look inside it.

That's basically it. I've checked this into the repository, and the Win32 version should be happy with all this. Anapam and Arundhati are now going to update the platform drivers for the other platforms, do the change to float XMLCh to wchar_t, and get this checked back in, hopefully by late tomorrow or early the next day, at which time all the platforms will be buildable again.

Sorry about the disruption, but we really wanted to get these fundamental issues dealt with before we have to do a long-lived reference release on this code. Let us know if you find anything wrong with these changes.

----------------------------------------
Dean Roddey
Software Weenie
IBM Center for Java Technology - Silicon Valley
[EMAIL PROTECTED]
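
P.S. For anyone doing the per-platform weavePaths() work described in item 7, here is a minimal sketch of the general idea. It is not the checked-in Win32 code. The name weavePathsSketch, the use of std::string instead of XMLCh buffers, and the '/'-only separator handling are all assumptions made for illustration. It takes a base path with the file name still attached (as getFullPath() now returns it), drops the file name, and then eats leading "./" and "../" segments of the relative path.

```cpp
#include <string>

// Hypothetical helper, NOT the Xerces API: weaves a base path (file name
// still attached) together with a possibly relative path. Assumes '/' is
// the only separator and that a leading '/' means "already absolute".
std::string weavePathsSketch(const std::string& basePath,
                             const std::string& relPath)
{
    // A path that is already absolute needs no weaving.
    if (!relPath.empty() && relPath[0] == '/')
        return relPath;

    // Drop the file name from the base, keeping the trailing slash.
    const std::string::size_type lastSlash = basePath.rfind('/');
    std::string result = (lastSlash == std::string::npos)
                       ? std::string()
                       : basePath.substr(0, lastSlash + 1);

    // Consume leading "./" and "../" segments of the relative part.
    std::string::size_type pos = 0;
    while (true)
    {
        if (relPath.compare(pos, 2, "./") == 0)
        {
            pos += 2;
        }
        else if (relPath.compare(pos, 3, "../") == 0)
        {
            pos += 3;

            // Back up one directory, if there is one left to back up over.
            if (result.size() > 1)
            {
                const std::string::size_type prev =
                    result.rfind('/', result.size() - 2);
                result.erase(prev == std::string::npos ? 0 : prev + 1);
            }
        }
        else
        {
            break;
        }
    }

    return result + relPath.substr(pos);
}
```

With this sketch, weaving "/docs/main/book.xml" with "../dtds/book.dtd" gives "/docs/dtds/book.dtd". A real platform version would also have to deal with that platform's separators, drive letters or volume names, and whatever the platform considers an absolute path.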