On 11/12/2009 12:08 PM, Sebastien Koechlin wrote:
On Wed, Nov 11, 2009 at 03:50:12PM -0500, DM Smith wrote:
We have a few modules that have entities in them. These are of the fashion
&nbsp; (a character entity), &#85; (a numeric decimal entity) and &#xC5;
(a numeric hex entity).

These cause various problems:
This is because osis2mod does not use an XML parser.

I'm not seeing the problem in OSIS modules, but in ThML modules. They are perfectly valid in ThML modules, but are problematic. I will be going over all the modules looking for these and will report problematic CrossWire modules in www.crosswire.org/bugs. And I'll pass along any problems I find in the Xiphos and Bible.org modules.

My understanding is that a true XML parser has strict requirements as to how it is to handle errors: put out an error message and die.

If we used a true XML parser for osis2mod, it would die on the first character entity that was not &amp;, &lt;, &gt; or &quot; unless it were defined in the schema. OSIS does not define additional character entities.
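To illustrate the behavior described above, here is a minimal sketch using Python's standard-library parser (not osis2mod, which per this thread does not use an XML parser). The helper name `try_parse` and the `<w>` wrapper element are made up for the example. It suggests that numeric references and predefined entities resolve, while an undeclared named entity such as &nbsp; stops the parse.

```python
import xml.etree.ElementTree as ET


def try_parse(xml_text):
    """Return the root element's text, or the parser's error message on failure.

    Hypothetical helper for illustration only.
    """
    try:
        return ET.fromstring(xml_text).text
    except ET.ParseError as err:
        return f"error: {err}"


# Numeric references need no declaration; the parser resolves them.
numeric_decimal = try_parse("<w>&#85;</w>")   # "U"
numeric_hex = try_parse("<w>&#xC5;</w>")      # "Å"

# Entities predefined by XML itself also resolve.
predefined = try_parse("<w>&amp;</w>")        # "&"

# A named entity with no declaration is a fatal error for the parser.
undefined_named = try_parse("<w>&nbsp;</w>")

for label, value in [
    ("decimal", numeric_decimal),
    ("hex", numeric_hex),
    ("predefined", predefined),
    ("named", undefined_named),
]:
    print(label, "->", value)
```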

We make the assumption that input to osis2mod has been validated against the OSIS schema. If this is true then there are no character entities in the input.

  Character entities are
just a useful way to write characters that you cannot or do not want to
put in your XML file. When parsed and resolved, they must not be
distinguishable from other characters. The same applies to CDATA sections.

I agree with the statement above as far as it goes. But what is the XML parser to do when it discovers a character entity that it cannot resolve?


osis2mod should not keep entities when reading an OSIS file. I think it's a
big mistake, and we should not rely on external programs that many people will
have trouble running.

I'd agree that numeric entities should be converted. And I think that osis2mod should complain if it finds entities that are not valid for an OSIS document and prompt the user to validate the input document.
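A rough sketch of that policy, written in Python for brevity rather than as osis2mod code: numeric references are converted to characters, and any named entity outside XML's predefined set is collected so the caller can complain and suggest validating the input. The names `convert_entities`, `ENTITY_RE` and `XML_PREDEFINED` are invented for this example. The set holds the four entities named in this thread plus &apos;, which XML also predefines. References that resolve to markup characters are left alone so the output stays well-formed.

```python
import re

# The four entities named in this thread, plus &apos; (also predefined by XML).
XML_PREDEFINED = {"amp", "lt", "gt", "quot", "apos"}

# Matches decimal references, hex references, and named entities.
ENTITY_RE = re.compile(r"&(#[0-9]+|#x[0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);")


def convert_entities(text):
    """Convert numeric references; return (new_text, unknown_named_entities)."""
    unknown = []

    def replace(match):
        body = match.group(1)
        if body.startswith("#x"):
            char = chr(int(body[2:], 16))
        elif body.startswith("#"):
            char = chr(int(body[1:]))
        else:
            if body not in XML_PREDEFINED:
                unknown.append(body)
            return match.group(0)  # named entities are never rewritten here
        # Keep references to markup characters so the result stays well-formed.
        return match.group(0) if char in "&<>" else char

    return ENTITY_RE.sub(replace, text), unknown


converted, unknown = convert_entities("a&nbsp;b &#85; &#xC5; &amp;")
for name in unknown:
    print(f"warning: &{name}; is not valid in OSIS; please validate the input")
```

Here `converted` keeps `&nbsp;` and `&amp;` as written and `unknown` lists `nbsp`, which is the point at which a tool could prompt the user to validate the document.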

Regarding module writers having trouble running tools, we've talked about having a web service at CrossWire.org that would provide the appropriate validation, conversion, creation, etc. of an OSIS text. We've just not had a volunteer step up to the task.


We also had trouble with non-canonical Unicode sequences, and I think
osis2mod was corrected.

Named entities such as nbsp come from HTML and should not be used in OSIS, as
they are not declared in osisCore.2.1.1.xsd; using them results in an invalid
document. BUT, as we do not use an XML parser, we can use the HTML DTD[1] to
resolve them and be more friendly to OSIS writers.

The problem with using entities that are not allowed in OSIS is that one cannot validate against the OSIS schema. And because OSIS is not HTML, one cannot validate against it either.

For osis2mod to handle character entities other than the 4 mentioned above means that it cannot expect valid OSIS.


[1] See these URLs; from them, a Perl program can produce a .cc or .h file.
        http://www.w3.org/TR/html4/HTMLlat1.ent
        http://www.w3.org/TR/html4/HTMLsymbol.ent
        http://www.w3.org/TR/html4/HTMLspecial.ent

The code I provided handles many more than just these character entities.
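For readers who want to see what the table-driven approach from footnote [1] might look like, here is a small sketch in Python. It is not the code referred to above. It relies on the standard library's `html.entities.name2codepoint`, which maps HTML 4 entity names to code points. The function name `resolve_html_names` and the `KEEP` set are assumptions made for the example.

```python
import re
from html.entities import name2codepoint  # HTML 4 entity names -> code points

NAMED_RE = re.compile(r"&([A-Za-z][A-Za-z0-9]*);")

# Entities that must stay escaped so the markup remains well-formed.
KEEP = {"amp", "lt", "gt", "quot", "apos"}


def resolve_html_names(text):
    """Replace known HTML 4 named entities with their characters.

    Unknown names and the XML-predefined entities are left untouched.
    """
    def replace(match):
        name = match.group(1)
        if name in KEEP or name not in name2codepoint:
            return match.group(0)
        return chr(name2codepoint[name])

    return NAMED_RE.sub(replace, text)


resolved = resolve_html_names("a&nbsp;b &Aring; &amp; &bogus;")
print(repr(resolved))
```

Unknown names such as the made-up `&bogus;` pass through unchanged, so a later validation step can still report them.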


(Sorry if my message looks rude, I'm not a native English speaker)

I didn't take your response as rude. I appreciate your input. I think our goals are the same: to produce the highest quality modules while minimizing the effort to do so.
All for God's glory.

In Him,
    DM

