New changes to the Xerces C++ code base

roddey 12 Jan 2000 01:40:19 -0000

Two of the long running shortcomings of the Xerces parser have been:

1) It does not handle multiply nested entities that use relative system ids
(i.e. they all end up being relative to the main XML entity.)
2) It does not support http or ftp type references to entities.

The changes just checked in address these issues. This document is a
heads-up as to what has changed in order to provide this new
functionality.

1) File Movements - This is only semi-related, but... I wanted to get this
done before we make a reference release, since we have to live with that
for a while. The two input source derivatives (for files and memory
buffers) that used to be in internal/ were moved to framework/, where they
should have been all along. This hopefully gets everything where it
really should be.

2) The URL class was substantially reworked. It will now do reasonably
full featured parsing of file, http, and ftp type URLs, and
break them into their correct constituent parts. It will surely get some
improvements in the future, but it's good enough for now to get the job
done.

3) The NetAccessor abstraction is now wired into the system. This
abstraction allows you to implement and plug in an object that will support
http, ftp, and non-local file type URLs. The parser is now wired up to use
this abstraction. So, if you implement this class and the appropriate type
of BinInputStream to go with it, and install it during your platform init,
the parser will use it and everything should just work, with the
parser reading data from your socket based stream. More on this below.

4) There is a new LocalFileInputSource if you want to provide an input
source that is really a local file. You can always just pass a file name,
but in the entity redirection stuff you need to be able to return an input
source, so this class provides a way to do that. Also, if you want to force
the encoding for an entity, you have to create an input source, set the
encoding on it, and pass that in. This class lets you do that for a
local file, as opposed to URLs, which are handled via URLInputSource.

5) Each entity on the entity stack now stores its full path. When a
referenced entity is seen, a search is made up the entity stack to the
last external entity, whose path is used as the base for completing a
relative entity reference.

6) The getBasePath() method in the platform utilities was changed to
getFullPath() and now has different semantics. Whereas before it would
give back a completed path without the trailing file name, now it gives
back the fully completed path with the file name still attached. So this
will require a small change in each platform utility file.

7) A new method was added to the platform utilities called weavePaths(),
which takes a base path and a (possibly) relative path and weaves them
together. The code for this will probably be almost identical on most
platforms. However, in order to provide maximum flexibility for all
platforms, it is implemented per-platform. For many platforms, a simple
ripoff of the code I did for the Win32 platform will be sufficient.
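
To give a feel for what the weaving in item 7 involves, here is a minimal
sketch. It is not the Win32 code that was checked in: it uses std::string
instead of XMLCh, only understands '/' separators, and the underflow
handling is an assumption. The base is expected to still carry its file
name, as getFullPath() now returns it.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a per-platform weavePaths(). Illustrative only:
// std::string instead of XMLCh, and '/' is the only separator recognized.
std::string weavePaths(const std::string& basePath,
                       const std::string& relativePath)
{
    // An already absolute "relative" path needs no weaving.
    if (!relativePath.empty() && relativePath[0] == '/')
        return relativePath;

    // Drop the file name from the base: keep everything up to and
    // including the last slash.
    const std::string::size_type lastSlash = basePath.rfind('/');
    if (lastSlash == std::string::npos)
        return relativePath;    // the base has no directory part to use
    std::string dir = basePath.substr(0, lastSlash + 1);

    std::string rel = relativePath;
    for (;;)
    {
        if (rel.compare(0, 2, "./") == 0)
        {
            // A single period means "this directory", so just eat it.
            rel.erase(0, 2);
        }
        else if (rel.compare(0, 3, "../") == 0)
        {
            // Two periods mean "go up one level" in the base directory.
            if (dir.size() < 2)
                throw std::runtime_error("base path underflow");
            const std::string::size_type prev = dir.rfind('/', dir.size() - 2);
            if (prev == std::string::npos)
                throw std::runtime_error("base path underflow");
            dir.erase(prev + 1);
            rel.erase(0, 3);
        }
        else
        {
            break;
        }
    }
    return dir + rel;
}
```

With this sketch, weaving "/xml/docs/main.xml" with "../dtd/book.dtd"
gives "/xml/dtd/book.dtd", and weaving it with "chap1.ent" gives
"/xml/docs/chap1.ent".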


So, if you give us just a system id for the main XML entity, the code looks
like this:

void XMLScanner::scanDocument(  const   XMLCh* const    systemId
                                , const bool            reuseValidator)
{
    //
    //  First we try to parse it as a URL. If that fails, we assume it's
    //  a file and try it that way.
    //
    InputSource* srcToUse = 0;
    try
    {
        //
        //  Create a temporary URL. Since this is the primary document,
        //  it has to be fully qualified. If not, then assume we are just
        //  mistaking a file for a URL.
        //
        URL tmpURL(systemId);
        if (tmpURL.isRelative())
            ThrowXML(MalformedURLException,
                     XML4CExcepts::URL_NoProtocolPresent);
        srcToUse = new URLInputSource(tmpURL);
    }

    catch(const MalformedURLException&)
    {
        srcToUse = new LocalFileInputSource(systemId);
    }

    catch(...)
    {
        // Just rethrow this, since it's not our problem
        throw;
    }

    Janitor<InputSource> janSrc(srcToUse);
    scanDocument(*srcToUse, reuseValidator);
}


Note that for the primary entity, a URL must be fully qualified. If it's a
local file, the LocalFileInputSource() constructor that we call will
automatically complete the path if it's relative, thus creating a fully
qualified path to begin with. All subsequent files or URLs are either fully
qualified or they are relative to the path of the last external entity they
were referenced from.


When we see a reference to an external entity in the source and have to
parse it, we call the reader manager to get it to create a reader for us,
passing it the id. The code looks like this:

XMLReader* ReaderMgr::createReader( const   XMLCh* const        sysId
                                    , const XMLCh* const        pubId
                                    , const bool                xmlDecl
                                    , const XMLReader::RefFrom  refFrom
                                    , const XMLReader::Types    type
                                    , const XMLReader::Sources  source
                                    ,       InputSource*&       srcToFill)
{
    // Create a buffer for expanding the system id
    XMLBuffer expSysId;

    //
    //  Allow the entity handler to expand the system id if they choose
    //  to do so.
    //
    if (fEntityHandler)
    {
        if (!fEntityHandler->expandSystemId(sysId, expSysId))
            expSysId.set(sysId);
    }
     else
    {
        expSysId.set(sysId);
    }

    // Call the entity resolver interface to get an input source
    srcToFill = 0;
    if (fEntityHandler)
    {
        srcToFill = fEntityHandler->resolveEntity
        (
            pubId
            , expSysId.getRawBuffer()
        );
    }

    //
    //  If they didn't create a source via the entity resolver, then we
    //  have to create one on our own.
    //
    if (!srcToFill)
    {
        LastExtEntityInfo lastInfo;
        getLastExtEntityInfo(lastInfo);

        try
        {
            URL urlTmp(lastInfo.systemId, expSysId.getRawBuffer());
            srcToFill = new URLInputSource(urlTmp);
        }

        catch(const MalformedURLException&)
        {
            // It's not a URL, so let's assume it's a local file name.
            srcToFill = new LocalFileInputSource
            (
                lastInfo.systemId
                , expSysId.getRawBuffer()
            );
        }
    }

    // Put a janitor on the input source
    Janitor<InputSource> janSrc(srcToFill);

    //
    //  Now call the other version with the input source that we have, and
    //  return the resulting reader.
    //
    XMLReader* retVal = createReader
    (
        *srcToFill
        , xmlDecl
        , refFrom
        , type
        , source
    );

    // Either way, we can release the input source now
    janSrc.orphan();

    // If it failed for any reason, then return zero.
    if (!retVal)
        return 0;

    // Give this reader the next available reader number and return it
    retVal->setReaderNum(fNextReaderNum++);
    return retVal;
}


The primary difference involved is that in the latter, we get the last
external entity info and use the system id of that entity as the base for
the current entity. The URLInputSource and LocalFileInputSource classes
have versions that take either a fully qualified path or a base and a
(possibly) relative path.
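
The "last external entity" lookup that supplies the base can be pictured
with the following sketch. The EntityEntry struct and lastExternalPath()
function are hypothetical stand-ins invented for illustration; the real
reader manager keeps richer state and fills in a LastExtEntityInfo, as
shown in the code above.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for one entry on the parser's entity stack.
struct EntityEntry
{
    bool        isExternal;  // false for internal (in-document) entities
    std::string fullPath;    // fully qualified system id, if external
};

// Walk from the top of the stack (the back of the vector) downwards and
// return the path of the nearest external entity. Returns an empty
// string if there is none.
std::string lastExternalPath(const std::vector<EntityEntry>& stack)
{
    for (std::vector<EntityEntry>::const_reverse_iterator it = stack.rbegin();
         it != stack.rend(); ++it)
    {
        if (it->isExternal)
            return it->fullPath;
    }
    return std::string();
}
```

Internal entities are skipped because they have no path of their own, so
a relative reference inside one is completed against the external entity
that contains it.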


We don't know if we will have any implementation of the NetAccessor
abstraction for the upcoming 1.1.0 (3.1.0 for XML4C) release.  Perhaps we
will try to do one based on LibWWW. But, at least the abstraction is fully
wired in, so you can implement one for yourself using local services in the
meantime. For instance, creating one for Windows using WinInet probably
would be relatively simple.

The abstraction works like this:

1) You define a derivative of NetAccessor and compile and link it in along
with your platform driver file.
2) During your platform init, you create one of these and store it in the
XMLPlatformUtils::fgNetAccessor static member.
3) When a URL is used as a system ID, a URLInputSource object is created.
This object has a URL member to which the system id is given. If it is
valid, it's parsed into its constituent parts.
4) When it's time to create a reader for this referenced entity, the input
source is asked to create a stream for its source. In this case, it will
just turn around and ask the URL to make a stream.
5) If it's file:/// or file://localhost/ then it's short circuited and a
file input stream is created intrinsically.
6) Else, if no net accessor object is installed, an exception is thrown
with an unsupported protocol error.
7) If one is installed, then its makeNew() method is called. This is the
only method it has, and it just makes a new binary input stream and returns
it. This class must derive from the BinInputStream class defined by the
parser. The URL object is passed to this call, so you can easily get out
the protocol, host, user, password, query, fragment, etc... parts of it.
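
Steps 5 through 7 can be sketched roughly as follows. Everything here is a
stand-in written for illustration: the real BinInputStream, NetAccessor,
and URL classes have richer interfaces, SimpleURL and the "canned" classes
are hypothetical, and gNetAccessor merely plays the role of
XMLPlatformUtils::fgNetAccessor. A real accessor would open a socket
rather than return canned bytes.

```cpp
#include <stdexcept>
#include <string>

// Stand-in for the parser's binary input stream base class.
struct BinInputStream
{
    virtual ~BinInputStream() {}
    virtual unsigned int readBytes(unsigned char* toFill,
                                   unsigned int maxToRead) = 0;
};

// Hypothetical, much reduced URL: just the already parsed parts.
struct SimpleURL
{
    std::string protocol;
    std::string host;
    std::string path;
};

// Stand-in for the NetAccessor abstraction: one factory method.
struct NetAccessor
{
    virtual ~NetAccessor() {}
    virtual BinInputStream* makeNew(const SimpleURL& urlSrc) = 0;
};

// Hypothetical stream that serves canned bytes instead of a socket.
class CannedStream : public BinInputStream
{
public:
    explicit CannedStream(const std::string& data) : fData(data), fPos(0) {}

    unsigned int readBytes(unsigned char* toFill, unsigned int maxToRead)
    {
        unsigned int count = 0;
        while (count < maxToRead && fPos < fData.size())
            toFill[count++] = static_cast<unsigned char>(fData[fPos++]);
        return count;
    }

private:
    std::string            fData;
    std::string::size_type fPos;
};

// Hypothetical accessor: a real one would connect to urlSrc.host.
class CannedNetAccessor : public NetAccessor
{
public:
    BinInputStream* makeNew(const SimpleURL& urlSrc)
    {
        return new CannedStream("<!-- from " + urlSrc.host + " -->");
    }
};

// Plays the role of XMLPlatformUtils::fgNetAccessor in this sketch.
NetAccessor* gNetAccessor = 0;

// Steps 5-7: local files are short circuited, everything else needs an
// installed accessor.
BinInputStream* makeStream(const SimpleURL& url)
{
    if (url.protocol == "file" && (url.host.empty() || url.host == "localhost"))
        return new CannedStream("");  // the parser opens a file stream here
    if (!gNetAccessor)
        throw std::runtime_error("unsupported protocol: " + url.protocol);
    return gNetAccessor->makeNew(url);
}
```

In this sketch, asking for an http URL before an accessor is installed
throws, while the same request after setting gNetAccessor to a
CannedNetAccessor instance returns a stream the caller can read from and
must delete.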

The implementation of the stream derivative is up to you. In most cases,
you would probably want to provide a simple static API on your net accessor
class to create a socket handle, read from it, and close it. You can then
write your stream class in terms of this simple API. Or, since the
implementation of your stream derivative lives in your implementation file,
you can call directly to system services from the stream class. Whatever
you want to do is fine, since it's your own implementation and only you look
inside it.



That's basically it. I've checked this into the repository, and the Win32
version should be happy with all this. Anapam and Arundhati are now going
to update the platform drivers for the other platforms, and do the change
to float XMLCh to wchar_t and get this checked back in hopefully by late
tomorrow or early the next day, at which time all the platforms will be
buildable again. Sorry about the disruption, but we really wanted to get
these fundamental issues dealt with before we have to do a long lived
reference release on this code.

Let us know if you find anything wrong with these changes.

----------------------------------------
Dean Roddey
Software Weenie
IBM Center for Java Technology - Silicon Valley
[EMAIL PROTECTED]