John L. Clark wrote:
> I believe this is related to the problem I reported several days ago
> (regarding the mechanism by which XXE identifies and loads XML files
> with no associated DTD), but I wanted to make sure that the problem had
> been identified.  I discovered that with such documents (DTD-less
> DocBook files), if they contained the 0xa0 (non-breaking-space)
> character, it would be translated automatically to the entity &nbsp;
> when the file is saved.  However, the file is still saved without a DTD
> reference and so is invalid (the &nbsp; entity is undefined).

DTD-less DocBook files are not *documents*. You can call them fragments, 
modules, external entities, whatever.



> For example, given the (well-formed) input file:
> 
> ---
> <?xml version="1.0" encoding="UTF-8"?>
> <article>
>   <title>Entity &amp;#160; problems</title>
> 
>   <para>We want a non-breaking-space.&#160; There should be two spaces between
>   the first sentence and the second.</para>
> </article>
> ---
> 
> If it is then opened with XXE and resaved, it is saved as:
> 
> ---
> <?xml version="1.0" encoding="UTF-8"?>
> <article>
>   <title>Entity &amp;#160; problems</title>
> 
>   <para>We want a non-breaking-space.&nbsp; There should be two spaces between
>   the first sentence and the second.</para>
> </article>
> ---
> 
> Which is clearly not well-formed. 

I don't agree: the above article is a perfectly valid external entity 
which is supposed to be referenced by a master document which has the 
proper <!DOCTYPE>.

If you use Emacs to write ``by hand'' a DocBook article which is 
intended to be an external entity referenced by a DocBook book, you 
would write "&nbsp;" not "&#160;".
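
To make the distinction concrete, here is a minimal sketch in Python (standard library only). It is a simplification, not what XXE does: instead of a real master document that pulls the article in as an external entity and gets "&nbsp;" from the DocBook DTD, the hypothetical MASTER string below inlines the article and declares nbsp itself in an internal subset. The names FRAGMENT, MASTER and parses_standalone are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# The article as saved by XXE: no <!DOCTYPE>, so "nbsp" is declared nowhere.
FRAGMENT = (
    "<article>"
    "<title>Entity problems</title>"
    "<para>We want a non-breaking-space.&nbsp; Two spaces.</para>"
    "</article>"
)


def parses_standalone(xml_text):
    """Return True if xml_text parses on its own, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Stand-in for a master document (assumption: a real one would reference the
# article as an external entity and take "nbsp" from the DocBook DTD).
MASTER = (
    '<!DOCTYPE book [<!ENTITY nbsp "&#160;">]>'
    "<book>" + FRAGMENT + "</book>"
)

standalone_ok = parses_standalone(FRAGMENT)
master_root = ET.fromstring(MASTER)
para_text = master_root.find("article/para").text
```

Parsed on its own, the fragment is rejected because of the undeclared entity; parsed in a context that declares nbsp, the same text yields a paragraph containing U+00A0.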




> Again, I think this is related to the mechanism that XXE uses, which
> includes the original file as an external entity in order to validate
> it, but I wanted to make sure the scope of the problem was exposed to
> your development team.

We have already fixed your XML declaration problem. That was a real bug 
(because we had overlooked something in the way we trick XXE into 
treating mere fragments as first-class documents). What you describe in 
this email is clearly *not a bug*.

* If your article is a ``module'' which is intended to be part of a 
master document, outputting "&nbsp;" is OK because other applications 
(saxon, xmllint, etc.) are not supposed to load your article as a 
stand-alone *document*.

* If you feel uncomfortable with this (I don't see why, but...), never 
ever use <!DOCTYPE>-less modules. Always add a <!DOCTYPE> to all your 
document templates. If, after that, you want to use them as modules, use 
XIncludes and not references to external entities (Options dialog box, 
Edit tab).

* If you still feel uncomfortable with this, another solution is to 
configure XXE to not save characters as entity references (Options 
dialog box, Save tab).
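
For files that were already saved with "&nbsp;", the effect of that last option can be approximated after the fact. The following Python sketch is only an illustration and has nothing to do with XXE's own code: the helper to_numeric_refs is hypothetical, it takes entity names from Python's HTML 4 table (html.entities.name2codepoint) rather than from the DocBook DTD, and it does not treat CDATA sections or comments specially.

```python
import re
import xml.etree.ElementTree as ET
from html.entities import name2codepoint

# The five entities every XML parser knows without a DTD; leave them alone.
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}


def to_numeric_refs(xml_text):
    """Rewrite named entity references such as &nbsp; as numeric ones (&#160;)."""
    def repl(match):
        name = match.group(1)
        if name in XML_PREDEFINED or name not in name2codepoint:
            return match.group(0)  # predefined or unknown name: keep as written
        return "&#%d;" % name2codepoint[name]

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", repl, xml_text)


saved = "<para>We want a non-breaking-space.&nbsp; There.</para>"
fixed = to_numeric_refs(saved)

# With a numeric reference the fragment parses without any <!DOCTYPE>.
root = ET.fromstring(fixed)
```

The rewritten fragment no longer depends on a DTD for its non-breaking space, which is the same result the Save option gives you directly.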


