String sysId = resourceIdentifier.getExpandedSystemId();
Is there some particular reason this uses the expanded system ID rather than using getLiteralSystemId()?
I've got a problem with some XML files I'm processing with Cocoon. The files all contain a DOCTYPE that uses a relative path for the system ID, i.e.
<!DOCTYPE record SYSTEM "dcr4.5.dtd">
The documents are created by another application, and I can't affect what it puts in there. Trying to read the files generates a parser error, since the DTD isn't present in the directory containing the documents. No problem, I thought, just use a suitable entry in the catalog used by Cocoon's EntityResolver. So, following the other entries, I added
SYSTEM "dcr4.5.dtd" "interwoven/dcr4.5.dtd"
and copied the DTD into WEB-INF\entities\interwoven. However, it still doesn't find the DTD. Turning up the logging (and this is where it becomes more relevant to Xerces than Cocoon, and why I'm asking here rather than on cocoon-user), I discovered that the system ID being passed in to the catalog resolver already had the full path to the file, so it doesn't match the above entry in the catalog. Since the path to the documents could be more or less anything, I can't use a (prefix-based) rewrite entry in the catalog. Likewise, it's impractical to include a system entry for every possible path, since I don't know in advance what they're going to be.
Digging through the Cocoon & Xerces source code, I discovered that the path received by the catalog resolver comes from the EntityResolverWrapper, i.e. the resourceIdentifier.getExpandedSystemId() call I mentioned above. Presumably, if that had used getLiteralSystemId() instead, the catalog resolver would have received just "dcr4.5.dtd" for the system ID rather than the full path, and would have matched it okay. But I'm wary of changing it myself, since I don't know what else might be affected (and I'd rather avoid using a custom-built Xerces in our Cocoon app, to minimise the risk of introducing other side-effects).
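To illustrate the mismatch, here is a rough sketch of what expanding the system ID does to it. It uses java.net.URI rather than Xerces' own expansion code, so the exact strings may differ from what Xerces produces; the class name and the document locations are made up for the example.

```java
import java.net.URI;

class ExpandedSystemIdDemo {
    /** Resolves a literal system ID against the URI of the document that declares it. */
    static String expand(String baseUri, String literalSystemId) {
        return URI.create(baseUri).resolve(literalSystemId).toString();
    }

    public static void main(String[] args) {
        // Hypothetical document locations; the real ones are not known in advance.
        String[] docs = {
            "file:///data/site-a/records/r1.xml",
            "file:///data/site-b/archive/r2.xml"
        };
        for (String doc : docs) {
            String expanded = expand(doc, "dcr4.5.dtd");
            // The expanded ID carries the document's directory, so it differs per
            // location and is never the bare literal used in the catalog entry.
            System.out.println(expanded + " equals literal? " + expanded.equals("dcr4.5.dtd"));
        }
    }
}
```

Every document directory yields a different expanded ID, which is why a single SYSTEM entry keyed on "dcr4.5.dtd" never sees the string it expects.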
I notice in the current CVS HEAD, there's an EntityResolver2Wrapper class; this one does use getLiteralSystemId(), in fact the latest CVS log message on that class says
"Fixing a bug. The systemId passed to EntityResolver2.resolveEntity may be an absolute or relative URI. That is it should be the literal system identifier, not the expanded one which resolved from the base URI."
However, I also found an old (> 2 years) mailing list message (http://mail-archives.apache.org/eyebrowse/[EMAIL PROTECTED]&msgId=568021) which says that
"The reason Xerces now returns fully-expanded URI's to the Entity resolver is that SAX quite explicitly states that this is what XML processors are supposed to do."
So now I'm twice as confused. Do the SAX2 Extensions 1.1 say that EntityResolver2 should behave differently from EntityResolver? Or have things changed since EntityResolverWrapper switched to using getExpandedSystemId(), and should it now be using getLiteralSystemId() after all?
In the meantime I can work around my problem by plugging in a custom EntityResolver which replaces any system IDs ending with "dcr4.5.dtd" with just that string, before passing it on to the XML commons catalog resolver as before. But it'd be nice if it could be clarified how exactly Xerces' wrapper classes are supposed to work, so I know if I should be raising a bug :-)
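A minimal sketch of the workaround I have in mind is below. The class name is my own invention, and the delegate would be whatever resolver Cocoon already uses (the XML commons catalog resolver in my case); the sketch only relies on the standard SAX EntityResolver interface. It matches on "/dcr4.5.dtd" rather than plain "dcr4.5.dtd", which is slightly stricter than described above, so that a file like "xdcr4.5.dtd" isn't rewritten by accident.

```java
import java.io.IOException;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical wrapper: turns an expanded system ID back into a known literal before delegating. */
class LiteralSystemIdResolver implements EntityResolver {
    private final EntityResolver delegate;
    private final String literalSystemId;

    LiteralSystemIdResolver(EntityResolver delegate, String literalSystemId) {
        this.delegate = delegate;
        this.literalSystemId = literalSystemId;
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId)
            throws SAXException, IOException {
        String effective = systemId;
        // If the expanded ID ends in "/<literal>", hand only the literal to the delegate
        // so that a catalog entry such as SYSTEM "dcr4.5.dtd" can match it.
        if (systemId != null && systemId.endsWith("/" + literalSystemId)) {
            effective = literalSystemId;
        }
        return delegate.resolveEntity(publicId, effective);
    }
}
```

It would be constructed as new LiteralSystemIdResolver(catalogResolver, "dcr4.5.dtd") and registered wherever the original resolver was. All other system IDs pass through untouched.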
Andrew.