On 04/05/2010 01:44 PM, Weston Ruter wrote:
DM:

    But what we really need is not a parser but a tokenizer. I'm
    thinking about writing one (my degree work was in compiler
    writing). Basically, we repeat the same tokenization code in
    several places. It should be trivial to write a complete, accurate
    one.


I've also been wanting to work on a tokenizer. At Open Scriptures, the text of a work is currently represented by two models <http://github.com/openscriptures/api/blob/master/models.py> (database tables): Token <http://github.com/openscriptures/api/blob/master/models.py#L242> and Structure <http://github.com/openscriptures/api/blob/master/models.py#L315>. Tokens are the smallest divisible units of text, such as words, punctuation, and whitespace; and structures are the spans of tokens that form logical units, such as verses, paragraphs, quotes, etc. The structures are standoff-markup for the tokens. With the underlying data stored in this way, it can then be serialized in whichever hierarchy desired (book-section-paragraph, book-chapter-verse, all-milestoned, etc) or whichever data format is needed (OSIS, SWORD Module, XHTML, etc.)
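
To make the standoff-markup idea concrete, here is a minimal sketch of what Token and Structure might look like as Django models. The field names are illustrative only; the authoritative definitions are in the models.py linked above.

# Illustrative sketch only -- these field names are assumptions, not the
# actual Open Scriptures schema; see the linked models.py for the real code.
from django.db import models

class Token(models.Model):
    """Smallest divisible unit of text: a word, punctuation, or whitespace."""
    WORD, PUNCTUATION, WHITESPACE = 1, 2, 3
    data = models.CharField(max_length=255)              # the token's text
    type = models.PositiveSmallIntegerField(choices=(
        (WORD, "word"), (PUNCTUATION, "punctuation"), (WHITESPACE, "whitespace")))
    position = models.PositiveIntegerField()             # order within the work

class Structure(models.Model):
    """Standoff span of tokens forming a logical unit (verse, paragraph, quote)."""
    VERSE, PARAGRAPH, QUOTE = 1, 2, 3
    type = models.PositiveSmallIntegerField(choices=(
        (VERSE, "verse"), (PARAGRAPH, "paragraph"), (QUOTE, "quote")))
    osis_id = models.CharField(max_length=32, blank=True)   # e.g. "John.3.16"
    start_token = models.ForeignKey(Token, related_name="starts",
                                    on_delete=models.CASCADE)
    end_token = models.ForeignKey(Token, related_name="ends",
                                  on_delete=models.CASCADE)

Because structures only point at spans of tokens, the same token stream can be re-serialized under any of those hierarchies without ever duplicating the text.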

This is a lot lower-level than what an XML tokenizer needs; it would be a tokenizer for the text between tags. Having a single tokenizer that does both would be more efficient when both are wanted, but slower when only XML tokens are needed.

I think a model could be constructed that does both and allows one to ask for the depth of tokenization that is needed, roughly as sketched below.
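
As a rough illustration of that idea (the names and regular expressions here are invented for the example, not existing code), the caller would simply ask for the depth it needs:

# Invented sketch of a depth-aware tokenizer interface.
import re
from enum import Enum

class Depth(Enum):
    MARKUP = 1   # only markup-level tokens: tags and raw text runs
    TEXT = 2     # additionally split text runs into words, punctuation, space

TAG = re.compile(r"<[^>]+>")
PIECE = re.compile(r"\w+|\s+|[^\w\s]", re.UNICODE)

def tokenize(source, depth=Depth.MARKUP):
    """Yield (kind, value) pairs at the requested depth."""
    pos = 0
    for tag in TAG.finditer(source):
        if tag.start() > pos:
            yield from _text_tokens(source[pos:tag.start()], depth)
        yield ("tag", tag.group())
        pos = tag.end()
    if pos < len(source):
        yield from _text_tokens(source[pos:], depth)

def _text_tokens(run, depth):
    if depth is Depth.MARKUP:
        yield ("text", run)
        return
    for piece in PIECE.finditer(run):
        value = piece.group()
        kind = "word" if value[0].isalnum() else ("space" if value.isspace() else "punct")
        yield (kind, value)

# tokenize('<w lemma="strong:G2316">Θεός</w>,', Depth.TEXT)
#   -> ("tag", ...), ("word", "Θεός"), ("tag", ...), ("punct", ",")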

There is a big complication with the parsing of text: it is language dependent. For example, Thai has words but no word breaks. Basically, the task will require a Unicode-aware and somewhat language-aware word-break algorithm. The best I've seen is in ICU.
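
For illustration, this is roughly how ICU's word-break iterator can be driven from Python via the PyICU binding (the locale and sample text are just examples):

# Sketch using PyICU (pip install PyICU); a thin wrapper over ICU's BreakIterator.
from icu import BreakIterator, Locale

def words(text, locale="th"):
    """Yield word tokens using ICU's language-aware word-break rules."""
    bi = BreakIterator.createWordInstance(Locale(locale))
    bi.setText(text)
    start = bi.first()
    for end in bi:                  # iterating yields successive boundary offsets
        token = text[start:end]
        start = end
        if not token.isspace():
            yield token

# Thai has no spaces between words, yet ICU still finds the boundaries:
# list(words(u"ภาษาไทย"))  ->  [u"ภาษา", u"ไทย"]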

Lucene has a wonderful example of how to do tokenization in their Jira issue database (LUCENE-1488, if I remember).



So what I'm currently ruminating on is the process of importing the raw data into the Token and Structure models. I wrote an importer <http://github.com/openscriptures/api/blob/master/importers/Tischendorf-2.5.py> for the Tischendorf GNT data which does everything, both tokenizing and parsing, but obviously there is going to be a lot of code in common with the other importers that get written. So I too am thinking about how these importers can be reduced to the bare minimum needed to handle the unique aspects of the raw data (i.e. normalize it), and then stream the tokens back to a central importer that parses the input and stores it in the Token and Structure models. This central importer facility could be a web service.
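
As a sketch of the shape this could take (the function names, endpoint URL, and file name below are hypothetical, not existing code), each importer would boil down to a normalizer that yields tokens, with everything else handled centrally:

# Hypothetical sketch: the endpoint URL and token format are illustrative only.
import json
import urllib.request

def normalize(path):
    """The only work-specific part: turn the raw source into a stream of tokens."""
    with open(path, encoding="utf-8") as source:
        for line in source:
            for word in line.split():          # a real importer does far more here
                yield {"type": "word", "data": word}
                yield {"type": "whitespace", "data": " "}

def submit(tokens, url="http://example.org/import"):    # hypothetical endpoint
    """Stream the normalized tokens to the central importer / web service."""
    payload = json.dumps(list(tokens)).encode("utf-8")
    request = urllib.request.Request(url, data=payload,
                                     headers={"Content-Type": "application/json"})
    return urllib.request.urlopen(request)

# submit(normalize("tischendorf-raw.txt"))      # file name is illustrative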

I'd love to collaborate with you on this. We could come up with a common tokenizer that can be used by both SWORD and Open Scriptures. The importer web service could take tokens as input and, as output, generate a SWORD module and also populate the Open Scriptures models at the same time.

Thoughts?

Sounds good to me, too.

In Him,
    DM


Weston



On Mon, Apr 5, 2010 at 10:24 AM, Daniel Owens <dhow...@pmbx.net> wrote:

    Yes, I agree, and if there were a feedback mechanism for the
    module creator to let them know how to start fixing an OSIS file
    or conf file, it would save Chris (or whoever else approves
    modules) time on the basic stuff.

    Daniel


    On 4/5/2010 11:09 AM, DM Smith wrote:

        This is a great idea. Rather than emailing source to modules
        at crosswire dot org, one could upload it via a web service.
        We could have stages of validation (xmllint) and construction
        (osis2mod). Such a service could evaluate the quality of the
        submission.
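
Those stages could, at least at first, be as thin as shelling out to the existing tools. A rough sketch (the paths are placeholders, and a real service would add schema validation and conf-file checks):

# Rough sketch of staged validation and construction; paths are placeholders.
import subprocess

def check_submission(osis_file, module_dir):
    """Run the checks submitters hit most often and return a simple report."""
    report = {}
    # Stage 1: well-formedness check with xmllint
    lint = subprocess.run(["xmllint", "--noout", osis_file],
                          capture_output=True, text=True)
    report["xmllint"] = {"ok": lint.returncode == 0, "errors": lint.stderr}
    # Stage 2: module construction with osis2mod
    build = subprocess.run(["osis2mod", module_dir, osis_file],
                           capture_output=True, text=True)
    report["osis2mod"] = {"ok": build.returncode == 0, "errors": build.stderr}
    return report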

        In Him,
           DM

        On 04/05/2010 12:01 PM, Weston Ruter wrote:

            Why not turn osis2mod into a web service? Then it wouldn't
            matter how it is implemented, since it would be abstracted
            away by the web service interface. It could use the best
            XML libraries available today and be written in the
            programming language of choice, both of which would make
            maintenance and the addition of new features much easier.

            Weston




On Mon, Apr 5, 2010 at 9:05 AM, DM Smith <dmsm...@crosswire.org> wrote:

    On 04/05/2010 09:03 AM, Dmitrijs Ledkovs wrote:

        On 5 April 2010 13:55, Manfred Bergmann
        <manfred.bergm...@me.com> wrote:

            Hi DM.

            On 05.04.2010 at 13:21, DM Smith wrote:


                Regarding using a "real" parser, it is a good idea.
                But we don't want SWORD to be dependent on an external
                parser.

            What's the reason for that?
            I could understand it if it meant that the user had to
            install certain libraries manually, but when the sources
            can be integrated into the project and have the
            appropriate licence, then why not?


            Manfred


        IMHO there is no harm in bringing in libxml or a much more
        lightweight parser like GMarkup. The build system just needs
        to be adjusted to link e.g. libxml into the osis2mod binary
        and not into the shared sword library. It could even be a new
        tool, called osisxml2mod for example, and be built optionally
        so that you can still have a full sword dev environment
        without libxml.

        Tools for creating modules do not have to be linked with
        sword or even live in the sword tarball / svn, although that
        does help with consistent distribution of the tools.

    I don't remember all of Troy's reasoning when I argued for a true
    parser.

    From what I recall:
    o To maintain freedom to re-license SWORD (e.g. for some other
    Bible society) we need to keep third-party library dependencies
    well managed. The license needs to be compatible with the GPL but
    cannot be the GPL.

    o The parser that we have is minimal and simple, sacrificing
    accuracy and completeness for speed. Regarding accuracy, e.g. the
    parser allows for spaces around = in attribute declarations.
    Regarding completeness, e.g. it does not handle namespaces, CDATA,
    DTDs/schemas, .... Significantly, it does not require a
    well-formed document, allowing for fragments; rather than raising
    an error, it continues where an XML parser is required to stop.
    (A small illustration follows this list.)

    o This parser has better error reporting in that it is based upon
    knowledge of the input, e.g. it reports the verse that has the
    problem.

    o By SWORD having the parser, we are not dependent on finding an
    implementation for every platform (e.g. Windows).
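
To illustrate the well-formedness point, here is a made-up fragment (not taken from any real module source); a conforming XML parser is required to stop on it, where the SWORD parser keeps going:

# Made-up OSIS-style fragment: milestoned verse markers with no single root.
import xml.etree.ElementTree as ET

fragment = '<verse osisID="Gen.1.1" sID="Gen.1.1"/>In the beginning<verse eID="Gen.1.1"/>'
try:
    ET.fromstring(fragment)
except ET.ParseError as err:
    print("conforming parser stops:", err)   # e.g. "junk after document element"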

    There may be other reasons. I'm willing to live with it.

    But what we really need is not a parser but a tokenizer. I'm
    thinking about writing one (my degree work was in compiler
    writing). Basically, we repeat the same tokenization code in
    several places. It should be trivial to write a complete, accurate
    one.

    In His Service,
       DM


    _______________________________________________
    sword-devel mailing list: sword-devel@crosswire.org
    http://www.crosswire.org/mailman/listinfo/sword-devel
    Instructions to unsubscribe/change your settings at above page


