Re: [sword-devel] Are individual verses in a module "well formed"

DM Smith Wed, 20 Apr 2005 13:38:49 -0700

There are various contexts in which it makes sense to display individual verses, e.g. quotations. Search also returns individual verses; adjacent hits may be merged into a passage before being presented to the user, and different clients do this differently. For these to be handled without error, either we need more information (e.g. the boundaries of what constitutes a well-formed segment) or the code needs to do sophisticated error recovery.

When an individual verse is expanded to include other verses, the client needs to know that what was asked for and what was returned are different, and how they differ, so that it can handle the result well and communicate it to the user in a useful fashion (e.g. show the context and highlight the hit).

The approach I was thinking about was to analyze any verse that failed to validate for unmatched tags. If an unmatched end tag is found, prepend the previous verse and try again, repeating until success or until some defined threshold of pain is reached. A missing end tag is handled similarly, except that the following verses are appended instead. If both begin and end tags are missing, grow each side until the problem is solved or it is too painful.

I mention the problem of pain because, in the case of the KJV, the notes are badly encoded and this technique won't work. E.g. <note/>...</note> has a missing begin tag for which none will ever be found, because what was supposed to be the begin tag was encoded as an empty tag.
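
To make the grow-until-well-formed idea concrete, here is a minimal sketch in Python (chosen only for brevity; it is not code from SWORD or JSword). The names `unmatched`, `expand_until_well_formed` and `max_steps` are hypothetical, the tag scanner is a simple regular expression rather than a real parser, and badly nested tags are simply reported as unmatched on both sides.

```python
import re

# Naive tag scanner: groups are (closing slash, tag name, empty-tag slash).
# Assumption: attribute values never contain '<' or '>'.
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^<>]*?(/?)>")

def unmatched(markup):
    """Return (stray_end_tags, unclosed_start_tags) for a markup fragment."""
    open_stack, stray_ends = [], []
    for m in TAG.finditer(markup):
        closing, name, empty = m.groups()
        if empty:                      # <tag/> opens and closes nothing
            continue
        if closing:
            if open_stack and open_stack[-1] == name:
                open_stack.pop()
            else:
                stray_ends.append(name)
        else:
            open_stack.append(name)
    return stray_ends, open_stack

def expand_until_well_formed(verses, index, max_steps=5):
    """Grow the range around verses[index] until its tags balance.

    Returns an inclusive (start, end) pair, or None when the threshold
    of pain (max_steps) is hit or there is nothing left to add.
    """
    start = end = index
    steps = 0
    while True:
        stray_ends, unclosed = unmatched("".join(verses[start:end + 1]))
        if not stray_ends and not unclosed:
            return start, end
        if steps == max_steps:
            return None
        steps += 1
        # A stray end tag means the begin tag is earlier; an unclosed
        # begin tag means the end tag is later.
        new_start = start - 1 if stray_ends and start > 0 else start
        new_end = end + 1 if unclosed and end < len(verses) - 1 else end
        if (new_start, new_end) == (start, end):
            return None
        start, end = new_start, new_end

verses = [
    "<p>In the beginning",
    "<q>Let there be light",
    "and there was light</q>",
    "and it was good.</p>",
    "<note/>stray</note>",             # the KJV-style encoding problem
]
print(expand_until_well_formed(verses, 1))   # (1, 2)
print(expand_until_well_formed(verses, 3))   # (0, 3)
print(expand_until_well_formed(verses, 4))   # None
```

The last call shows the KJV case: the search walks back to the first verse without ever finding a begin tag for the note, so it gives up.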

I also thought about adding artificial begin or end tags to the segment. In this case the text needs to be analyzed to determine what the invalid tags are. I think this is more difficult as some tags have required/expected attributes. While it may pass the well-formed test, it may be invalid and that may cause it to be rendered badly.

JSword's current technique is to gradually strip out stuff (bad characters, reserved characters and finally all xml stuff) and ultimately be left with something that frequently looks very bad.
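
A rough sketch of that kind of progressive fallback, in Python for brevity. This is not JSword's code: the order of the steps follows the description above, but the regular expressions, the `salvage` and `parses` names, and the use of ElementTree as the well-formedness check are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")   # chars XML 1.0 forbids
BARE_AMP = re.compile(r"&(?![A-Za-z]+;|#\d+;|#x[0-9A-Fa-f]+;)")
ANY_TAG = re.compile(r"<[^<>]*>")

def parses(fragment):
    """True if the fragment is well formed once wrapped in a dummy root."""
    try:
        ET.fromstring(f"<root>{fragment}</root>")
        return True
    except ET.ParseError:
        return False

def salvage(fragment):
    """Apply increasingly destructive clean-ups until the fragment parses."""
    steps = [
        lambda s: s,                                   # try it untouched first
        lambda s: CONTROL.sub("", s),                  # drop bad characters
        lambda s: BARE_AMP.sub("&amp;", s),            # escape reserved '&'
        lambda s: ANY_TAG.sub("", s).replace("<", "&lt;"),  # drop all markup
    ]
    for step in steps:
        fragment = step(fragment)
        if parses(fragment):
            return fragment
    return fragment                                    # best effort

print(salvage("<b>ok</b>"))      # <b>ok</b>
print(salvage("a & b"))          # a &amp; b
print(salvage("text</q>"))       # text
```

The last line is the "looks very bad" outcome: the verse survives, but every bit of markup in it is gone.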

What is going through my head as a possibly workable solution is to create another index for well-formed boundaries. The basic idea is that for every verse that is not well formed, a start and end verse would be given for well-formedness. Essentially this is a map of book structure (book/section/paragraph/list/table/poetry...) from a verse perspective. In some cases, like poetry, it may be best to use the entire poem as the context, as it may render better than the smallest well-formed context in which a verse resides. Even in this case, the verse may be well formed, but be in a context that should be rendered as a whole. Personally, I would try to do this in Lucene, with the verse being indexed and the start and end stored with the verse's doc.
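
A small sketch of what such a boundary index could hold, using a plain Python dict instead of Lucene. Everything here is an assumption for illustration: the span tuples `(kind, first_verse, last_verse)` stand for elements whose begin tag is in `first_verse` and end tag in `last_verse`, and `whole_kinds` names the structures (poetry here) that should always be rendered as a whole.

```python
def context_for(verse, spans, whole_kinds):
    """Return the inclusive (start, end) range verse should be rendered in."""
    start = end = verse
    changed = True
    while changed:                      # repeat until the range stops growing
        changed = False
        for kind, first, last in spans:
            # The range holds one end of this span, so it needs the other.
            touches_boundary = start <= first <= end or start <= last <= end
            # The range sits inside a structure that is rendered whole.
            inside_whole = kind in whole_kinds and first <= start and end <= last
            if (touches_boundary or inside_whole) and (first < start or last > end):
                start, end = min(start, first), max(end, last)
                changed = True
    return start, end

def build_boundary_index(verse_count, spans, whole_kinds=frozenset({"poetry"})):
    """Store an entry only for verses that cannot stand alone."""
    index = {}
    for verse in range(verse_count):
        context = context_for(verse, spans, whole_kinds)
        if context != (verse, verse):
            index[verse] = context
    return index

spans = [("paragraph", 1, 4), ("quote", 2, 3), ("poetry", 6, 9)]
index = build_boundary_index(10, spans)
print(index.get(2, (2, 2)))   # (2, 3): the quote is enough
print(index.get(4, (4, 4)))   # (1, 4): needs the whole paragraph
print(index.get(7, (7, 7)))   # (6, 9): whole poem, though verse 7 has no tags
print(index.get(5, (5, 5)))   # (5, 5): stands alone, not stored
```

Because the index is computed once when the module is built, a client would only need a lookup at display time, and the difference between the requested verse and the returned range is available for highlighting the hit.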

Note, this is not just an XML problem. GBF has the notion of matched begin and end tags and these may be in different verses.

Troy A. Griffitts wrote:

DM, No, SWORD currently does no work to promise any retrievable segment of text as valid markup. I have talked with a few XML experts and have had a number of ideas brewing for the past few years how we might offer such information, as it is a necessary obstacle to overcome.
The question, more generally, really is: how can one package and send a segment of an XML document? Steve DeRose has pointed me to at least one project/standard which tries to address this issue. I need to review my email archives and study their solution. My ideas, very generally, are either:

With each retrieved segment of text from the API, provide a context tag stack object which describes the tag context at the start of the segment.
or
Do the actual work of returning valid XML for a segment of text, and provide an attribute in all supplied markup to designate it as such:

<verse osisID="Mat.6.10"><q who="Jesus" level="1" sID="Mat.5.3.q1" misc="phantom" /><q who="Jesus" level="2" sID="Mat.6.9.q1" misc="phantom" />Your kingdom come. Your will be done, On earth as it is in heaven<q eID="Mat.6.9.q1" misc="phantom" /><q eID="Mat.5.3.q1" misc="phantom" /></verse>

Note that this last example doesn't really supply any tags REQUIRED FOR XML VALIDITY, but does provide the more important tags required to represent the text correctly. Also note that any 'phantom' TRUE END TAGS will not be identifiable, as we cannot supply an attribute on an end tag.

I think the first option works best for our engine design. When a client iterates a chapter, making 1 call for each verse, they aren't concerned with valid XML for each verse, but rather, they want any context when they start the segment (chapter in our example) and then they may want to close any remaining open tags when done rendering.
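
A minimal sketch of the context tag stack idea, in Python rather than the engine's own language. It is illustrative only and not SWORD API: `tag_stack_after`, `context_stacks` and `render_segment` are hypothetical names, the scanner is a naive regular expression, and the sample uses container tags rather than OSIS sID/eID milestones. The stack keeps each begin tag verbatim so that attributes are not lost when the context is replayed.

```python
import re

# Groups: (closing slash, tag name, empty-tag slash).
# Assumption: attribute values never contain '<' or '>'.
TAG = re.compile(r"<(/?)([A-Za-z][\w.-]*)[^<>]*?(/?)>")

def tag_stack_after(markup, stack=()):
    """Return the open-tag stack after reading markup, starting from stack.

    Each entry is (name, raw_begin_tag).
    """
    stack = list(stack)
    for m in TAG.finditer(markup):
        closing, name, empty = m.groups()
        if empty:
            continue
        if closing:
            if stack and stack[-1][0] == name:
                stack.pop()
        else:
            stack.append((name, m.group(0)))
    return tuple(stack)

def context_stacks(verses):
    """stacks[i] is the context at the start of verse i; stacks[-1] is the
    context left open after the last verse."""
    stacks = [()]
    for markup in verses:
        stacks.append(tag_stack_after(markup, stacks[-1]))
    return stacks

def render_segment(verses, first, last):
    """Replay the context before the segment and close what is left open."""
    stacks = context_stacks(verses)
    opening = "".join(raw for _, raw in stacks[first])
    body = "".join(verses[first:last + 1])
    closing = "".join(f"</{name}>" for name, _ in reversed(stacks[last + 1]))
    return opening + body + closing

verses = ['<q who="Jesus">Our Father,', "your kingdom come.", "Amen.</q>"]
print(render_segment(verses, 1, 1))
# <q who="Jesus">your kingdom come.</q>
```

A client iterating a whole chapter would call `render_segment(verses, 0, len(verses) - 1)` once, so the per-verse calls inside the loop never pay for the context.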

But it's all still just rumbling around in my mind, so any ideas are very welcome.
    -Troy.
DM Smith wrote:
I asked this earlier on another thread, but it was lost in the noise of that thread.

Does SWORD, in making a module, ensure (or try to ensure) that each verse is well formed? That is, for every begin feature marker, is there a corresponding end feature marker? In XML and ThML it would be a <tag>...</tag> or <tag/>, but in GBF it might be a matched pair <TAG> <tAG>.

If not, is there some boundary (e.g. chapter) that is guaranteed to be a well-formed unit? And any suggestions on how to manage individual verses that are not well formed?

_______________________________________________
sword-devel mailing list: [email protected]
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page
