Re: [sword-devel] XSLT vs. C++

DM Smith Wed, 01 Dec 2010 10:11:38 -0800

This is not so much about Troy's comment on Plato's Form as about the model that JSword uses. It is meant for illumination.

JSword converts ThML, GBF, PlainText and OSIS on a verse-by-verse basis into well-defined fragments of XML. These fragments use the tags of OSIS, but might not produce a valid fragment. For ease of explanation, we say that it is converted into OSIS. If for some reason a verse in ThML or OSIS is not well-formed, it is hacked by successively stripping out XML parts until it parses or until only the text remains. This hack is rather unfortunate and should be removed or improved. E.g. notes and xrefs should never be inlined as plain text if they are marked up properly.

Though it can, JSword does not use XSLT on a verse-by-verse basis to render a verse. Rather it gathers all the verses as XML fragments into an XML document. Typically this is a chapter of verses, but it might also be the set of verses returned from a search result, specified by the user, or given as a cross-reference. JSword will also collect verses from several modules into the document for parallel display.

It is this document that is rendered. How this document is rendered is up to the application. It could use SAX. It could walk the DOM. But Bible Desktop uses XSLT and many other JSword front-ends do so as well. In answer to an earlier question, the XSLT is read once and reused for all rendering of modules. It is way too expensive to do this frequently. Once per run or only when the underlying file changes is sufficient.
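The "read once, reload only when the file changes" idea can be sketched roughly as follows. This is not JSword's code (JSword is Java); it is a minimal C++ illustration in which the class name StylesheetCache and the two callbacks are hypothetical, and a plain string stands in for whatever compiled stylesheet object a real XSLT library would hand back.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>

// Hypothetical sketch: keep one parsed stylesheet and rebuild it only when
// the underlying file's change stamp differs from the one last seen.
// A std::string stands in for the compiled form an XSLT library would return.
class StylesheetCache {
public:
    using Stamp = long long;

    // stamp: returns a value that changes whenever the file changes
    //        (for example its modification time).
    // load:  reads and parses the stylesheet; assumed to be expensive.
    StylesheetCache(std::function<Stamp()> stamp,
                    std::function<std::string()> load)
        : stamp_(std::move(stamp)), load_(std::move(load)) {
        assert(stamp_ && load_);  // both callbacks must be set
    }

    // Returns the cached stylesheet, reloading only on first use or change.
    const std::string &get() {
        const Stamp now = stamp_();
        if (!loaded_ || now != lastStamp_) {
            compiled_ = load_();
            lastStamp_ = now;
            loaded_ = true;
        }
        return compiled_;
    }

private:
    std::function<Stamp()> stamp_;
    std::function<std::string()> load_;
    std::string compiled_;
    Stamp lastStamp_ = 0;
    bool loaded_ = false;
};
```

A front-end would create one such cache per stylesheet and call get() before each render, so the cost of parsing is paid once per run unless the file is edited.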

There is one thing that JSword dictates to a processor of the document: all rendering/filtering happens within it. The BD style sheet is parametrized for each render option. Using these parameters it shows/hides notes, xrefs, strongs, and morph; does verse per line; changes the representation of the verse number; and so forth.

There are several advantages to rendering a chapter as a whole. There are many constructs that can include more than one verse. One can start a tag in the middle of one verse and close it in another. If one only rendered verse-by-verse the start and end might not be matched up correctly. For example, SWORD's osishtmlhref filter has a quote stack and a highlight stack. If a quote starts in one verse and ends in another, the stack is reset going from one verse to another. So the quote marks might not match up. (Note: osis2mod is aware of this shortcoming and adjusts for it. However, if the module maker uses imp2mod or vpl2mod it can happen.) For the <hi> tag, when an opening tag is found, it is pushed on a stack (allowing for nesting). When an end tag is found, the stack is consulted to see what the start tag was. If it was bold then it closes bold, otherwise it closes italics. However, if the stack is empty, it closes italics.
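To make the <hi> behaviour concrete, here is a small self-contained sketch of the stack logic as described above. It is not the real osishtmlhref filter; HiRenderer, resetForNewVerse and renderSpanningBold are made-up names, and the HTML tags chosen are only an assumption for illustration.

```cpp
#include <cassert>
#include <stack>
#include <string>

// Hypothetical sketch of the <hi> handling described above.
// It is not the real osishtmlhref filter; all names are made up.
class HiRenderer {
public:
    // Opening <hi type="...">: remember the type so the end tag can match it.
    std::string open(const std::string &type) {
        assert(!type.empty());  // the sketch expects an explicit type
        hiStack_.push(type);
        return type == "bold" ? "<b>" : "<i>";
    }

    // Closing </hi>: consult the stack; an empty stack falls back to italics.
    std::string close() {
        if (hiStack_.empty()) return "</i>";
        const std::string type = hiStack_.top();
        hiStack_.pop();
        return type == "bold" ? "</b>" : "</i>";
    }

    // Verse-by-verse rendering discards the state between verses.
    void resetForNewVerse() {
        while (!hiStack_.empty()) hiStack_.pop();
    }

private:
    std::stack<std::string> hiStack_;
};

// Renders a bold span that opens in one verse and closes in the next.
// When resetBetweenVerses is true this mimics verse-by-verse rendering.
std::string renderSpanningBold(bool resetBetweenVerses) {
    HiRenderer r;
    std::string out = r.open("bold") + "end of verse 1 ";
    if (resetBetweenVerses) r.resetForNewVerse();
    out += "start of verse 2" + r.close();
    return out;
}
```

Rendered as one unit, the span opens with <b> and closes with </b>. With the reset between verses, the same input opens with <b> but closes with </i>, which is the mismatch described above.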

This spanning problem affects JSword's rendering of a collection of arbitrary verses. A tag can be open in one verse, but because the verse is not shown in context, it is never closed.

There is also an advantage of using XSLT over SAX: it is not limited to a single pass of the document. For example, this is used in Bible Desktop to show margin notes.

Regarding TEI, JSword pretends it is OSIS. This is not a far stretch since OSIS was influenced by TEI. The XSLT has a few entries to be able to display key elements. Since TEI is rather open, and in flux, not all of what we will use will be found in it. I haven't looked at it but Chris has a TEI schema he uses for validation. That could be used to improve the XSLT or for TEI modules to have their own XSLT.

Regarding ThML, JSword would do well to not convert it to OSIS but have XSLT for it as well.

Regarding the speed of XSLT vs. SAX vs. SWORD's renderers: except for handhelds (PDA, phone, ...) it is a moot point. I figure that 5-6 years is the maximum useful lifespan of a computer. The processing power of a computer from those years, even a netbook, is sufficient to run XSLT fast enough over a chapter's worth of verses to satisfy end users. I have an old 486, Windows 98 laptop with limited memory that runs it acceptably. Even my OLPC (One Laptop per Child) is fast enough.

Beyond JSword and how it could be used in SWORD without much change to the current library:

I'm not sure, but I think any SWORD front-end can try out XSLT if they like on OSIS documents using the osisosis.cpp filter. The filter does not attempt to do too much except reconstruct verses. It might need to be modified to output milestoned verse markers instead of the begin/end tags it does now. Using begin/end tags makes the assumption that a verse is a well-formed fragment. Just use it to "render" a chapter and then pass that chapter to XSLT.
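As a rough illustration of what milestoned verse markers buy you, the sketch below joins verses into a chapter using empty OSIS-style start and end milestones (sID/eID) rather than wrapping each verse in a container. It is not what osisosis.cpp does today; chapterWithMilestones and the Verse alias are hypothetical, and the IDs are assumed to need no XML escaping.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// One verse: its osisID and its raw content, which on its own may not be
// a well-formed XML fragment.
using Verse = std::pair<std::string, std::string>;

// Hypothetical helper: joins verses into a chapter, marking verse boundaries
// with empty milestone elements (sID/eID) instead of <verse>...</verse>
// containers. IDs are assumed to contain no characters that need escaping.
std::string chapterWithMilestones(const std::string &chapterID,
                                  const std::vector<Verse> &verses) {
    std::string out = "<chapter osisID=\"" + chapterID + "\">";
    for (const Verse &v : verses) {
        assert(!v.first.empty());  // every verse needs an ID
        out += "<verse osisID=\"" + v.first + "\" sID=\"" + v.first + "\"/>";
        out += v.second;
        out += "<verse eID=\"" + v.first + "\"/>";
    }
    out += "</chapter>";
    return out;
}
```

If a <q> opens in Gen.1.1 and closes in Gen.1.2, neither verse is well-formed alone, but the chapter built this way is, because the verse markers are empty elements that never have to contain the quote. With begin/end verse tags the same input would produce overlapping elements.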

I'm hearing that lots of people won't seriously look at XSLT. It has a steep but short learning curve. Kind of like Perl. There are two basic programming models using XSLT: one understands the containment model of the schema; the other handles the tags as they appear, not caring whether the document is structured correctly. They have their pros and cons. (BD's XSLT uses the latter model.) But there are more and more systems that are using XPath notation and people are becoming more familiar with it. I think the audience of users that fairly easily buy into XSLT are those that work with XML and DOM all the time. This includes web developers.

As more and more of our front-ends are targeting browser engines for display, it is or will become feasible for the transformation to be done directly by the browser. Today, all current browsers (IE, Firefox, Safari, Chrome, WebKit, Opera) can directly do the transformations. For an example see:

http://www.w3schools.com/XSL/xsl_transformation.asp

I imagine it is possible, but I don't know how to pass parameters to the stylesheet when done this way.

I don't know if it works with embedded browsers (xulrunner, webkit, ie), but I'd guess it does.

There may be no need for SWORD to have HTML render filters. Just transform the module into well-formed XML and feed it to a browser along with a stylesheet.

Some things are hard to do in XSLT. Some are not possible/feasible. Some are way too slow. So there will always be a need for a pre-processor to do some up-front work. Or for the XSLT to call out to another program.


Hope this is helpful.

In Him,
    DM

On 12/01/2010 07:20 AM, Troy A. Griffitts wrote:

The logic to get from any Publisher Source Document to rendered HTML is
a very complex task to solve.

We conceptually create Plato's Form of, say, a Bible, and try to fit
imperfect Publisher markup into this concept.  A Bible has verses,
headings between verses, chapter intros, footnotes, crossrefs, lemma
information, etc.

If we do not do this, then we become a PDF reader-- there are already
PDF readers and we lose the ability to do Bible specific things with our
software.  For example, if we didn't normalize the concept of crossref
across all Books, then we couldn't turn them on and off; we couldn't
provide a crossref panel in the reader which fills according to which
crossref is hovered over, etc.  Same with notes, strongs, headings, etc.

This causes us to impose our Form onto a publisher's text.  I understand
why some people may not like this, but it is very much to our end users'
benefit that we do this.  Without this, we become a web-browser or a PDF
reader.  Which are fine for their purpose, but we intend to provide
common, familiar, and sometimes novel Bible study aids to our reader.

The current processing model is dark magic and I apologize for this.  It
should be well documented and easy to modify.  I will attempt to improve
the dissemination of knowledge of exactly WHAT our Forms are, how we
impose those Forms on publishers' texts and improve the documentation
and code to help others understand and have the ability to improve the code.

I'll attempt to post a few easy to swallow SWORD 101 classes in email,
which will help us gather our thoughts and documents on how all this works.


Troy



On 12/01/2010 12:09 AM, Greg Hellings wrote:

On Tue, Nov 30, 2010 at 1:08 PM, Troy A. Griffitts <[email protected]> wrote:

Having finally returned from a hectic 2 weeks of conferences, and lots
to do before leaving for Christmas, I'm not sure I'm up for a heated,
passionate debate about technologies right now, but by all means, please
commence the public discussion.

Let me start by saying that everyone (I believe) agrees that we would
like to have an HTML output from the engine which is more generic and
would allow CSS to be applied if a frontend would like to do this.
Currently HTMLHREF output from the engine is used by the widest number
of frontends (to my knowledge) and would benefit everyone involved by
becoming much more generic. e.g.,

<title>  ->  <h1>
rather than
<title>  ->  <b><br />

<transChange type="added">  ->  <span class="tcAdded">
rather than
<transChange type="added">  ->  <i>

etc.

I believe this will solve a number of issues and possibly get the BT and
MacSword teams onboard to using the same HTML output filters as the
other projects involved (or at least subclassing them and using the
majority of their functionality).

I think this is our pretty well accepted premise.  The current filters
stink to various degrees and currently no one is willing to step up
and tackle them.


Now, as to the other issue of using XSLT internally in the engine to
process OSIS ->  HTML

I will throw a few melons into the air for target practice, and let the
shooting commence.

_____________________________
*Multiple Language*

XSLT is a programming language in the same sense that C++ is a
programming language.

The SWORD Project C++ engine is written in C++.  It is not a Python
engine; it is not a Perl engine; it is not a Java engine; it is C++.

One might say, "Well, you can use XSLT from C++.  Doesn't JSword do this
from Java?"  Well, yes, of course you can, and DM can comment, if he
feels the desire to recommend his decision to incorporate an XSLT engine
into the JSword logic flow.  But simply because one CAN doesn't mean one
SHOULD.  We COULD incorporate a Perl text processing engine in our C++
code, or an Awk processing engine...  that doesn't mean we SHOULD.  I'm
sure some would say we SHOULD.  And obviously DM has thought he SHOULD
incorporate XSLT processing for JSword, so I'm not intending to say it
is a BAD decision, just that it is not a decision I would make; in the
same way as our projects each chose C++ vs. Java to implement our objective.

If a developer is going to develop OSIS ->  HTML filters, for instance,
we are already assuming they know OSIS and HTML.  OSIS is XML and HTML
is SGML (though most of our work is probably targeting a more
XML-dialect of HTML).  XSLT is also XML.  Formally, it is not even a
programming language, but just a set of formatting/processing
instructions in XML.

Any developer using XML who is worth their salt should at least be
familiar with the basics of XSL - they may not be a guru of XPath
expressions or have every attribute of XSL memorized - and would
probably expect a library which handles XML as its preferred input
method to utilize one of the standard XML processing methods.  I know
I'm not the only person who was surprised to look in the library
filters and see neither DOM, SAX nor XSLT technologies in use.  That
was when I first ran and hid.

Of course, this portion of the discussion is only relevant for the
from-OSIS filters.

_______________________
*XSLT better than C++*

One might say, "well, XSLT is better suited to process XML than C++."
That's a loaded and unquantified statement.

Certainly the C++ language specification doesn't include facilities to
easily process XML, but that doesn't mean a plethora of C++ libraries
don't exist for assisting in this task.

The SWORD engine includes classes like XMLTag and SWBasicFilter which
implement a SAX processing model.

The current filters do not all use SWBasicFilter, nor XMLTag.  They've
been written over 15 years and many before these classes existed.  Some
are ugly and need to be rewritten for readability, certainly.  But not
necessarily in a different programming language.

XSLT being "better" is, yes, a matter of complete subjectivity.  And,
as I mentioned above, is only useful when our source is XML to begin
with.  For GBF or Plaintext sources, XSLT is clearly not even
applicable.

But the current C++ is so good that you seem the only person willing
to touch it.  Peter just mentioned he tried once and couldn't get it.
I have gone into the filters before with a singular goal in mind and
was able to produce my desired changes, but it was long, drawn-out and
painful.  Doing the same tasks in XSL would have taken me mere
seconds.  I know a few other people, at least, have said they would
know how to do a task if XSLT was used instead of C++.  Of course,
that is a hypothetical - I can't know that they would have done so,
but that was their claim at the time.

Our recent discussion about the use of the "n" attribute for footnotes
in ThML is a perfect example.  Maintaining the attribute in XSL would
have been a trivial task I could have handled in seconds.  Instead, it
required you, myself and Karl and took about 10 days to get fixed.
You had to alert Karl and me to the presence of the attributes, I provided
him a preliminary patch to incorporate the values, then he had to
heavily modify the patch to operate correctly in non-ThML source and a
few other corner cases.  And, in the end, the fix is only in Xiphos'
code base - I would have to go through 2 of those three steps again in
Bibletime, BPBible, MacSword and any other applications I wanted to
see proper behavior in.  Alternatively I could tackle the filters -
but I'm not really inclined to do so.

Is XSLT "better"?  For me, it would be better because I could more
easily modify its behavior based on the fact that I know XML and could
easily locate the necessary processing directive.  For you, maybe not.
  Are there things you simply cannot do in XSL that C++ can? Yes.  IMO
the benefits of XSL outweigh the benefits of C++ for this task, but
you clearly disagree. :)  I would also say that DOM or SAX processing
would be better for all the same reasons - it shields the user from
having to see the XML parsing and handle inconsistencies in
whitespace, validation, etc and is still a decently well-known
technology among XML users (even if it's slightly less well-known than
XSL).  And with a DOM or SAX parser, you could still happily employ
the full power of C++.

________________________
*COMPLEXITY*

The task of enumerating all types of OSIS <title> tags, and deciding
what to do with each, and how to classify all <title> tags from all
possible OSIS documents into our enumeration is still going to be a
complex task using XSLT.  <title> is a complex example, but certainly
not the most complex.

It is a tall task to generalize all elements of all documents from all
publishers into one conceptual model with one chosen output for a
frontend-- whether that be for an audience on the Desktop, web-based, or
a handheld.

The complex processing required by the engine will require long, complex
XSLT-- which likely will incorporate callbacks to C++.  It will not be
more simple-- only mixed language.

I could also argue that the XSL would not require a developer to
mentally filter out the code that just identifies and locates XML
elements and attributes and parses them from the code that transforms
them and generates the output.  Thus yes, it might include some
extension functions into C++ but it would be simpler.  And it would
also be more expressive.

The enumeration of every OSIS <title> tag is a moot point for the
decision.  You need to enumerate them all in C++ as well and decide
what to do with them.  That doesn't change in the XSL - just the
method used.  An XSL match along the lines of <xsl:template
match="title[@type='psalm']"> still has to be done in C++ with some sort
of if (tag.name() == "title" && tag.attr("type") == "psalm") or whatever
the syntax is.  And that is assuming the current filter is using
XMLTag and isn't comparing character strings directly.

_______________________
*Semantic vs. Display*

Some will say (and have), "well, let everything be display oriented and
let the publisher decide".  Fine, then you lose 2 things: the ability to
display differently per user preference, per display device; and you
also give up the promise to actually do any interesting research on the
text.  When you lose semantic markup, then you lose all interesting
information about WHAT is being marked up.

I just want to be clear that I'm not advocating the use of display
over semantics as a general choice.  My statements are strictly based
around my specific task and the fact that OSIS support in SWORD and
the front ends is not as good as the support of ThML.  Largely this is
because most applications display in HTML and my required task is
framed entirely in terms of the presentation and display - not the
semantics.  I would love and prefer to use OSIS for this task, but I
simply cannot accomplish it with the state of SWORD at this time.

_______________________
*More than a Rendering Engine*

The SWORD C++ Engine is more than simply a text rendering engine-- it is
a Biblical text research engine.

If I'd like to know the morphology of word 3 in 2Thes 2.13 of the WHNU
Greek text, the entire program to do such is:

SWMgr library;
SWModule *whnu = library.getModule("WHNU");
whnu->setKey("2th.2.13");
whnu->RenderText();

cout << "The morphology of word three is: " <<
whnu->getEntryAttributes()["Word"]["003"]["Morph"] << endl;


That reads nice (at least in my opinion).  I don't need to know about
XML, XSLT, care what markup the WHNU module uses, I don't even have to
know how to make a SWORD filter.  The current filters do all the work of
breaking out these attributes and making them available in a nice and
interesting map.

I'd like to be clear again, that XSL would only be useful for material
already in OSIS formats (or in valid ThML - I think TEI is also an XML
format?).  I doubt many modules in ThML are strictly valid at their
import times, so XSL wouldn't be very useful, and GBF is a monster
unto itself.  Doing the above in XSL from an OSIS source would not be
much different in complexity than what you have listed there.

<xsl:template match="verse[@osisid='2thes.2.13']/w[@n=3]">
The morphology of word three is: <xsl:value-of select="@morph" />
</xsl:template>

Or something similar (my knowledge of exact OSIS attribute names and
values wanes and it's been two or three weeks since I wrote an XPath
expression).

Of course, the string processing portion of SWORD would continue to be
of great importance for any modules in GBF format or similar to bring
them into a useful form.  In that way, SWORD would continue to be more
than just a text rendering engine.  It would continue to offer all of
its features, its buffering from the system and from the format, its
indexing, its module fetching and storing, etc.

______________________


And finally, if bullets aren't flying already, I'll stir the heat up with...

XSLT sucks.  A good C++ programmer can do anything in C++ better than
any XSLT programmer.


:)

A C++ programmer can definitely do more, since C++ is actually a
programming language and XSLT is a set of processing instructions.
Better?  That depends on what the criteria are.  For me, in my current
role as a module creator, the use of C++ is not currently better
because it is less flexible and extensible.  For you, as the library
maintainer, perhaps C++ is better because it's what you are already
comfortable with and because it has largely been your hand in the
filters.

*duck*
Have fun.

Troy

PS.  In summary, I understand the current filters are sometimes overly
complex and need cleanup, standardization, etc.  It comes down to the
fact that they mostly work, and other things which don't get priority,
so they don't get much attention.  But honestly, I think one might be
oversimplifying the problem at hand without realizing it, if one simply
thinks switching to XSLT will make things easier.

I think one is also oversimplifying the options.  My dreamlist is that
SWORD produce a well-formed, valid, complete OSIS document for an
arbitrary KeyList that I pass it with FMT_OSIS set.  That basically
boils down to getting the *OSIS filters up to snuff and standardized.
The second item on the list is a readily extensible mechanism for
SWORD outputting HTML from that OSIS.  If that choice is providing an
XSL stylesheet with the library, a C++ SAX processor that a front-end
can readily extend, a DOM interface that can be easily customized is
immaterial to me.  I like all three of those, and can easily
understand and extend all of them.

I think any of those technologies would be an improvement over all
in-house C++ for the second half of any such processing.  If we are
using XML in Open Source Software, let's leverage the work of others
who have happily given us permission to use their libraries!

--Greg

_______________________________________________
sword-devel mailing list: [email protected]
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page

