Re: V11n was Re: [sword-devel] Jonah 1.17 / 2.1

DM Smith Thu, 23 Mar 2006 19:26:43 -0800

Troy,
   I think you miss my point.

Re-engineering the index to use Lucene is a good thing. It provides an industrial-strength implementation of an index that is not bounded by a fixed array size. This could be used for GBF, OSIS, ThML, and PlainText. It can also be used for General Books, Dictionaries, Commentaries, Devotionals, Topicals, .... Chris has repeatedly said that OSIS is the format of choice for the future. So my suggestion did not focus on GBF, ThML or PlainText. This is not "nothing more than OSIS documents". This could be used for all current module types. All that would be needed is to look up the offset and length using Lucene instead of a proprietary solution.

Re-engineering the module content to not throw away the material outside of verses is a good thing. Preserving original content is a good thing. It does not matter whether the content is GBF, OSIS, ThML or PlainText. Or whether it is a General Book, Dictionary, Commentary, Devotional, Topical, .... Chris has repeatedly said that OSIS is the format of choice for the future. So my suggestion did not focus on GBF, ThML or PlainText. This is not "nothing more than OSIS documents". All that would need to be changed is the routine to fetch the content of a verse. More could be done, since the architecture allows for it, but it is not necessary.

And nowhere did I say anything about xslt. I did mention xml processors, but I was specifically thinking of an xml parser, which is required to die on xml that is not well formed. It is a good thing to use industry standard xml parsers to parse xml data!

Let me quote what I said before: <quote>The big question is whether it is a proper direction and one that we are all willing to embark on.</quote>

Troy, obviously you have some objections, but I am not at all clear what they are. What am I missing?

Let's use sword-devel as a forum for discussing development ideas. I agree that we should embrace the cooperative environment we share here at CrossWire.

What should the architecture and implementation of a v11n solution be? If we discuss it here and come to agreement, documenting it well, perhaps C++ coders will step up to doing the actual implementation.

For my part I have provided a concrete architecture with a fairly obvious implementation and have offered to create a prototype index and a prototype module that satisfies the suggestion. Isn't that cooperation? (BTW, I was going to do it in Perl!)


In His Service,
   DM

Troy A. Griffitts wrote:

I realize that Bibletime, and obviously JSword, don't see much benefit in the SWORD engine, as it stands now.
Basically, what you are both suggesting is removing the entire concept of a common engine-- which JSword doesn't use now anyway-- and Bibletime seems to currently have a 'work in spite of the limitations of' mentality.
I obviously don't condone such a move. It will remove the synergy between all of our efforts, providing nothing more than OSIS documents and XSLT, in common. These 2 things are not bad things, but currently we have so much more to offer than just these.
I would encourage teams to better work together and consider the contribution they might make to all projects by augmenting the engine, and thus embracing the cooperative environment we share here at CrossWire.
It is obviously your choice to proceed how you feel the Lord is directing you, but I would hope you embrace the benefits we've gleaned from 13 years of having a common framework and codebase.
    -Troy.


DM Smith wrote:
Martin Gruner wrote:
DM,
your proposal is excellent. Working with OSIS files directly is something Joachim and I have talked about already, seems to be a good way to go.
Joe and I have talked about "OSIS direct" too and we think that it is the next big architectural addition for JSword. So what I outlined is what we have been figuring out. It just happens that "OSIS direct" gives us v11n for free.
I know that v11n is "next" for Sword, too. I just want to make sure that JSword can handle whatever Sword decides for v11n, but if we can lead that's great too.
So, I am planning to get started after I finish the KJV work.
For the mapping, there would have to be some kind of object that is able to "translate" OsisIDs from one v11n scheme to another. This could probably be done by using an "absolute" (theoretical, nonexistent) v11n scheme and mapping all others to this one. With your system, this would not have to care about the order of the books. Might be done with a (c)lucene index too. I'll take this part if you do the module access. =)
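
As a rough illustration of the "absolute" scheme idea (a sketch only, not code from SWORD or JSword), each v11n scheme could carry a table from its own osisIDs to identifiers in the absolute scheme, and translation would go through that pivot. The scheme names, the "abs.N" identifiers and the function translate() are all hypothetical, and the data is made up.

```python
# Hypothetical data: two made-up schemes that number the same text differently.
# schemeB counts a leading title as verse 1, so its numbers run one ahead.
TO_ABSOLUTE = {
    "schemeA": {"Book.1.1": "abs.2", "Book.1.2": "abs.3"},
    "schemeB": {"Book.1.1": "abs.1", "Book.1.2": "abs.2", "Book.1.3": "abs.3"},
}


def translate(osis_id, source, target):
    """Map an osisID from one scheme to another via the absolute scheme.

    Returns None when the verse has no counterpart in the target scheme.
    """
    absolute = TO_ABSOLUTE[source].get(osis_id)
    if absolute is None:
        return None
    # Invert the target table: absolute identifier -> local osisID.
    from_absolute = {a: local for local, a in TO_ABSOLUTE[target].items()}
    return from_absolute.get(absolute)


print(translate("Book.1.1", "schemeA", "schemeB"))  # Book.1.2
print(translate("Book.1.1", "schemeB", "schemeA"))  # None
```

With a pivot like this, adding a new scheme means writing one table instead of one table per pair of schemes.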
I haven't written in C in ages. So maybe someone can port the Java code after I write it.
Are clucene and lucene (and lucene4c etc.) indexes identical, and portable?
Troy and I did some experiments with this using clucene and lucene for 1.4.3. The indexes were not identical from a byte comparison. However, they were identical from a practical perspective. They worked just fine for both, giving the same results for a set of queries. There are also Python and Perl ports and perhaps others.
They are also portable to any OS.
I also tested compatibility with the upcoming lucene 2.0 (currently called 1.9.1). Lucene 1.9.1 can read indexes created by 1.4.3 without any problem, but 1.4.3 can't use indexes built by 1.9.1.
Could they be distributed and used by different frontends in parallel?
I'm not sure what you are asking. We can zip them up and move them to different machines. So they can be distributed. Two applications, same or different, can use the same index at the same time. When the index is being modified, it is locked for any other threads or processes.
You are aware of the fact that this would mean a complete paradigm shift for the Sword API?
I realize that it is very different. I also think it is a bit simpler, too. From a client perspective, I don't think it is that big a shift. Within JSword, we code to interfaces and I don't think the basic client ones will have to change at all.
The big question is whether it is a proper direction and one that we are all willing to embark on.
If so we will have to have a new "module" type for the conf and it will need to be version specific to lucene (e.g. MinimumLucene=1.4.3).
Once I get the KJV (nearly) done, I'll build an index for it as I have described (but not with the extra contexts at first). Then we can play with the index to see if there are any gotchas that we did not anticipate.
I think I should be pretty near done with the KJV in a couple of weeks.
mg

On Thursday, 23 March 2006 19:11, DM Smith wrote:
[EMAIL PROTECTED] wrote:
Hi,

I also have several issues with osis2mod, and I was getting ready to
post.  The fact is that there are several versification schemes for
both Old and New Testaments.  I was having a similar problem with
re-versification in Tischendorf's Greek New Testament.  It has John
1:52, because an earlier verse is sub-divided.  But it also has 3John
15 and Rev 12:18, which agrees with UBS 4.
How can we get osis2mod to recognize true variations in versification,
and not "standardize" everything?
A SWORD module consists of text (possibly compressed) and an index into
that text. (Compressed modules will have additional tables marking the
start and end of the compression unit. But I am ignoring them in the
discussion below.)
In a nutshell, both the code that creates the index and the code that
reads it need to be changed.

Here is an overview of how it all hangs together. This may be a bit
imprecise because the JSword implementation, which I work on and am
familiar with, may be slightly different from the actual SWORD API
implementation.
The index is a big fixed size array with each entry giving the start and
length of each verse. There are slots for "introductions" to chapters
and books, e.g. Gen.0 would give the intro to Genesis and Gen.1.0 would
give an introduction to Genesis Chapter 1.
Lookup happens in this fashion: the verse reference is first normalized
(e.g. Matthew 1:5 might become Matt.1.5), and then this is re-normalized
into 40.1.5. Then that normalization is converted into an index into the
fixed size array via a lookup table.
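
The lookup chain described above can be sketched roughly as follows. This is an illustration only, not the SWORD or JSword code: the two-book canon is heavily truncated, the slot layout (one slot per book intro, chapter intro and verse) is an assumption, and all names (CANON, normalize, renormalize, build_slot_table, slot_for) are hypothetical.

```python
import re

# Hypothetical, heavily truncated canon: book name -> verses per chapter.
CANON = [("Gen", [31, 25]), ("Matt", [25, 23])]
BOOK_NUMBERS = {"Gen": 1, "Matt": 40}
ALIASES = {"genesis": "Gen", "gen": "Gen", "matthew": "Matt", "matt": "Matt"}


def normalize(reference):
    """First normalization, e.g. 'Matthew 1:5' -> 'Matt.1.5'."""
    match = re.fullmatch(r"\s*(\w+)\s+(\d+):(\d+)\s*", reference)
    if match is None:
        raise ValueError(f"unrecognized reference: {reference!r}")
    book = ALIASES[match.group(1).lower()]
    return f"{book}.{int(match.group(2))}.{int(match.group(3))}"


def renormalize(osis_id):
    """Second normalization, e.g. 'Matt.1.5' -> (40, 1, 5)."""
    book, chapter, verse = osis_id.split(".")
    return (BOOK_NUMBERS[book], int(chapter), int(verse))


def build_slot_table(canon):
    """Assign one array slot per book intro, chapter intro and verse."""
    table = {}
    slot = 0
    for book, chapters in canon:
        number = BOOK_NUMBERS[book]
        table[(number, 0, 0)] = slot  # book intro, e.g. Gen.0
        slot += 1
        for chapter, verse_count in enumerate(chapters, start=1):
            # Verse 0 is the chapter intro, e.g. Gen.1.0.
            for verse in range(verse_count + 1):
                table[(number, chapter, verse)] = slot
                slot += 1
    return table


SLOTS = build_slot_table(CANON)


def slot_for(reference):
    """Reference -> position in the fixed size array."""
    return SLOTS[renormalize(normalize(reference))]


print(normalize("Matthew 1:5"))    # Matt.1.5
print(renormalize("Matt.1.5"))     # (40, 1, 5)
```

Because the table is built from a fixed list of verse counts, a reference such as Matt 1:27 simply has no slot, which is where the trouble described below begins.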

In the same fashion, the index is created. As the input is parsed, the
verse body is substringed and titles which are immediately before the
verse are marked as pre-verse and prepended to the verse. The verse
reference is converted into the array index. The verse is written to the
output file and the start of that verse in the output file is recorded
in the index along with its length.

You will note that the verses are laid down in the output file in the
order that they are in the input file. If a verse exists more than once
in the input, I think both get written to the output file, but the last
one overwrites the first in the index. If a verse pertains to more than
one KJV verse (e.g. <verse osisID="Gen.1.1 Gen.1.2"> text of Genesis 1.1
and Genesis 1.2</verse>) then this is recorded in two index slots that
point to the same place in the output file. It is possible to feed a
correction to a module of just the changed verses. This will then be
appended to the output file and the index will be updated to reflect the
new material. The old material still remains.
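
The append-and-overwrite behaviour described above can be sketched like this. It is a simplified model, not the osis2mod code: the index here is a dict keyed by osisID instead of a fixed size array, pre-verse titles are ignored, and the names append_verses and fetch are hypothetical.

```python
def append_verses(data, index, verses):
    """Append each verse to the data buffer and point its index entries at it.

    verses is a list of (osis_ids, text); osis_ids may hold several
    space-separated IDs, which then all share one location.
    """
    for osis_ids, text in verses:
        encoded = text.encode("utf-8")
        start = len(data)
        data.extend(encoded)
        for osis_id in osis_ids.split():
            # A later entry for the same ID replaces the earlier one.
            index[osis_id] = (start, len(encoded))


def fetch(data, index, osis_id):
    """Read a verse back using its (start, length) index entry."""
    start, length = index[osis_id]
    return bytes(data[start:start + length]).decode("utf-8")


data, index = bytearray(), {}
append_verses(data, index, [
    ("Gen.1.1 Gen.1.2", "text of Genesis 1.1 and 1.2"),
    ("Gen.1.3", "first version"),
    ("Gen.1.3", "second version"),
])
print(fetch(data, index, "Gen.1.3"))           # second version
print(index["Gen.1.1"] == index["Gen.1.2"])    # True

# A correction is appended; the old bytes stay in the buffer.
append_verses(data, index, [("Gen.1.3", "corrected")])
print(fetch(data, index, "Gen.1.3"))           # corrected
print(b"first version" in data)                # True
```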
When a verse reference is outside of the KJV v11n, it is recognized as a
problem. Now there are only so many ways that the program can handle it.
It could reject it. Or in the case of JSword, if the "book" and
"chapter" are in the KJV v11n, then it figures out which verse is really
meant by adding it to the start of the chapter. So Matt 1:27 would silently
become Matt 2:2. Later when Matt 2:2 is seen, it would overwrite the
earlier entry in the index and Matt 1:27 would be lost. There may be
other strategies. But in every case it will not produce the desired
results.
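
The Matt 1:27 case can be reproduced with a small sketch. This assumes plain verse counting (the verse number is added to the start of its chapter and the overflow spills into the next chapter); the actual JSword code may differ in detail, and the names MATT and resolve are hypothetical.

```python
# Verses in Matt 1 and Matt 2 in the KJV v11n; the rest of the book is omitted.
MATT = [25, 23]


def resolve(chapters, chapter, verse):
    """Resolve chapter:verse by counting from the start of the chapter.

    A verse number past the end of its chapter spills into the next one.
    """
    ordinal = sum(chapters[:chapter - 1]) + verse
    for number, verse_count in enumerate(chapters, start=1):
        if ordinal <= verse_count:
            return (number, ordinal)
        ordinal -= verse_count
    raise ValueError("reference is past the end of the book")


print(resolve(MATT, 1, 27))  # (2, 2)

# Both verses land on the same index entry, so the first one is lost.
index = {}
index[resolve(MATT, 1, 27)] = "text of Matt 1:27"
index[resolve(MATT, 2, 2)] = "text of Matt 2:2"
print(index)  # {(2, 2): 'text of Matt 2:2'}
```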
Here is how I would suggest implementing a solution to this problem: use
OSIS documents and use lucene with osisIDs as the keys.

I have found that lucene is very fast. Input references would be
normalized to osisIDs and these would be used for lookup. Rather than
storing the document in this index, the original would be left on disk
as is (perhaps compressed by verse, chapter or book as we do today). The
index would store the start offset and end offset for each and every
osisID in the document. The start offset would be to the beginning of
the element and the end offset would be to the end of the element. In
the case of milestoned elements, it would be from the start of the sID
element to the end of the corresponding eID element. It could also
handle multiple documents by storing the document names as well.
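
A minimal sketch of such an offset index follows. It makes several assumptions: a plain dict stands in for the lucene index, only verse elements are handled, offsets are character offsets into an in-memory string, milestone start tags list osisID before sID, and regular expressions are used for brevity where a real implementation would use an xml parser. The name build_offsets is hypothetical.

```python
import re

# Container form: <verse osisID="...">...</verse>
CONTAINER = re.compile(r'<verse\s+osisID="([^"]+)"\s*>.*?</verse>', re.S)
# Milestone form: <verse osisID="..." sID="..."/> ... <verse eID="..."/>
MILESTONE_START = re.compile(r'<verse\s+osisID="([^"]+)"\s+sID="([^"]+)"\s*/>')
MILESTONE_END = re.compile(r'<verse\s+eID="([^"]+)"\s*/>')


def build_offsets(document):
    """Return {osisID: (start offset, end offset)} for every verse found."""
    offsets = {}
    for match in CONTAINER.finditer(document):
        for osis_id in match.group(1).split():
            offsets[osis_id] = (match.start(), match.end())
    # Remember where each milestone opens, then close it at its eID.
    open_milestones = {}
    for match in MILESTONE_START.finditer(document):
        open_milestones[match.group(2)] = (match.group(1), match.start())
    for match in MILESTONE_END.finditer(document):
        if match.group(1) in open_milestones:
            osis_ids, start = open_milestones.pop(match.group(1))
            for osis_id in osis_ids.split():
                offsets[osis_id] = (start, match.end())
    return offsets


doc = (
    '<chapter osisID="Gen.1">'
    '<verse osisID="Gen.1.1">In the beginning</verse>'
    '<verse osisID="Gen.1.2" sID="v2"/>And the earth<verse eID="v2"/>'
    '</chapter>'
)
offsets = build_offsets(doc)
start, end = offsets["Gen.1.1"]
print(doc[start:end])  # <verse osisID="Gen.1.1">In the beginning</verse>
```

The original document is never rewritten; fetching a verse is a slice between two stored offsets.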

Handling a "passage", say Gen 50:1 - Ex 2 would become an osisRef of
Gen.50.1-Exod.2. This in turn would indicate the start and end of the
fragment in the document as the start offset of Gen.50.1 and the end
offset of Exod.2.
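
Given an index of the kind described above, resolving a passage is a matter of two lookups. The sketch below assumes the index is available as a dict of osisID to (start offset, end offset); the offsets are made up for illustration and the name passage_span is hypothetical.

```python
def passage_span(offsets, osis_ref):
    """Return (start, end) of the document fragment covered by an osisRef.

    A range such as 'Gen.50.1-Exod.2' runs from the start of its first
    osisID to the end of its last; a single osisID covers just itself.
    """
    first, _, last = osis_ref.partition("-")
    last = last or first
    return (offsets[first][0], offsets[last][1])


# Made-up offsets, for illustration only.
offsets = {"Gen.50.1": (1000, 1100), "Exod.2": (5000, 6200)}
print(passage_span(offsets, "Gen.50.1-Exod.2"))  # (1000, 6200)
print(passage_span(offsets, "Gen.50.1"))         # (1000, 1100)
```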

This solution allows:
    for books of the bible to be in any order as required for a
      particular work,
    for there to be any number of chapters in a book,
    for there to be any number of verses in a chapter,
    for there to be prefaces, introductions, titles, colophons,
      appendices and any other elements allowed by OSIS,
    for the apocrypha to be before or after the NT or in a separate file,
    for each book or a set of books to be in separate files (in fact,
      one could go to the absurd level of doing it by paragraph),
    for any other book (e.g. dictionary, Koran, ...) with a well defined
      hierarchical system of reference to be indexed or stored,
    for the OSIS documents to be used for any other purpose by any other
      system that can handle OSIS docs (ignoring compression and
      encryption;)
      (Maybe we don't want this last one;)

I would also advocate storing two other contexts: one for a minimal
well-formed xml fragment and one for a minimum display context (which
would also be a well-formed xml fragment). The reason for these is that
OSIS does not require that a verse, chapter or any other division be
well formed. It only requires that the divs that are children of the
osisText element be well formed.
Well-formedness is a requirement for using xml processors (which JSword
uses). So having a minimal xml context will solve that.

The display context is needed to provide enough information to render
the verse correctly. Two examples: First, in poetry (e.g. a Psalm), a
verse may be wholly contained in a line of a "poem" and thus be well
formed, but unless it is seen as part of the whole, it cannot be
correctly rendered. Second, consider the words of Jesus (always a good
idea:). It may be that a much earlier verse records that Jesus begins
speaking and a much later verse records that his speech ends. Looking
at the verse in isolation, it is impossible to know that the verse
contains Jesus' words. So trying to apply red-letter text to his words
would fail when looking at the verse alone.
The trick would be deciding what constitutes a display context. It
should at least encompass the larger of the paragraphs, quotes, speeches
or line groups in which the verse appears/intersects, if any.

The other advantage to using Lucene is that the indexes can be changed
to add more information at a later time and existing processes would not
need to be changed unless they were to take advantage of the additions.
A given application, say BibleTime, could augment the index with further
information (e.g. notes, internal processing info, ...) and BibleDesktop
could use that index without needing to handle that additional info.

Of course, the above does not solve the mapping of one v11n scheme to
another.
_______________________________________________
sword-devel mailing list: sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page