Greg has nailed it.

In practice, I first try to work out whether the file follows any kind of pattern or is just a pile of junk. Too often the latter is the case, and life is too short to bother.

If there is a pattern, it may be expressed in CSS, in HTML tags, or in combinations of the two. Some patterns may exist only in the actual text.
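One way to get a first impression of whether a pattern exists is to tally which tag, class and style combinations actually occur. The following is only a rough sketch using Python's standard `html.parser`; the names `MarkupInventory` and `inventory` and the sample markup are made up for illustration, not taken from any real source file.

```python
from collections import Counter
from html.parser import HTMLParser


class MarkupInventory(HTMLParser):
    """Tally each (tag, class, style) combination seen in an HTML document."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # Attributes without a value come through as None, so fall back to "".
        key = (tag, attributes.get("class") or "", attributes.get("style") or "")
        self.counts[key] += 1


def inventory(html_text):
    """Return a Counter of (tag, class, style) combinations in html_text."""
    parser = MarkupInventory()
    parser.feed(html_text)
    parser.close()
    return parser.counts


if __name__ == "__main__":
    # Invented sample; a real run would read the downloaded HTML file instead.
    sample = (
        '<p class="v"><sup class="verse">1</sup>In the beginning</p>'
        '<p class="v"><sup class="verse">2</sup>And the earth</p>'
        '<p><sup style="color: blue">a</sup>A note</p>'
    )
    for (tag, cls, style), count in inventory(sample).most_common():
        print(count, tag, repr(cls), repr(style))
```

A handful of combinations with high counts suggests a usable pattern; hundreds of one-off combinations suggests the pile of junk.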

My approach has always been to recognise as many patterns as I can find and then nuke the rest, and then to use any technology I know of (regex, XSL, whatever) to transform each bit into something useful in OSIS.

Usually this is an iterative process, with some patterns only emerging as I go along and others turning out to be less clear than I originally thought.

Peter

Sent from my mobile. Please forgive shortness, typos and weird autocorrects.


-------- Original Message --------
Subject: Re: [sword-devel] Tool for convertion html to osis
From: Greg Hellings
To: SWORD Developers' Collaboration Forum
CC:


On its surface, this is a very straightforward process.

1. Convert the HTML (a specific set of defined tags using the SGML grammar) into XML. This does not specifically target XHTML, which is a slightly different grammar. Wherever HTML violates XML rules it can be rendered into an XML-compatible form, as long as the HTML is well-formed, since XML is essentially a strict subset of SGML that requires certain things SGML leaves optional.

There might be other tools to do this specifically, but you can get by with the command line tool `osx` from the Open Jade[0] framework. If you use Fedora this is available from the "opensp" package. I presume other Linux distributions have it similarly packaged.
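If `osx` is not at hand, the idea behind step 1 can be approximated in a few lines of Python. This is a minimal sketch using only the standard library, not a substitute for a real SGML tool: the names `XmlSerializer` and `html_to_xml` are hypothetical, comments and doctypes are dropped, and it assumes the input has a single root element and no duplicate attributes.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# HTML elements that never have content or an end tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class XmlSerializer(HTMLParser):
    """Re-emit HTML as well-formed XML: close everything, quote everything."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. "disabled") get their own name as value.
        rendered = "".join(
            " {}={}".format(name, quoteattr(value if value is not None else name))
            for name, value in attrs
        )
        if tag in VOID:
            self.out.append("<{}{}/>".format(tag, rendered))
        else:
            self.out.append("<{}{}>".format(tag, rendered))
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID or tag not in self.open_tags:
            return  # stray or redundant end tag: ignore it
        # Close any elements left open inside the one being closed.
        while self.open_tags:
            current = self.open_tags.pop()
            self.out.append("</{}>".format(current))
            if current == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))

    def result(self):
        self.close()  # flush buffered text
        while self.open_tags:
            self.out.append("</{}>".format(self.open_tags.pop()))
        return "".join(self.out)


def html_to_xml(html_text):
    serializer = XmlSerializer()
    serializer.feed(html_text)
    return serializer.result()


if __name__ == "__main__":
    xml_text = html_to_xml('<div><p>Tom &amp; Jerry<br><img src="a.png" alt=x></p>')
    print(xml_text)
    ET.fromstring(xml_text)  # raises ParseError if the result is not well-formed
```

Note that this only guarantees well-formedness. An unclosed `<p>` followed by another `<p>` ends up nested rather than as siblings, which is the kind of thing a proper tool handles better.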

2. Convert the XML version of the HTML into OSIS using an XSLT.

Although the technical outline of this is relatively straightforward, that doesn't mean the actual implementation is. Step 1 is pretty simple as long as you start from a well-formed HTML document. Step 2 sounds deceptively simple. If the HTML embeds CSS or, worse yet, references an external CSS document, then you might need to consider that. If there is active JavaScript in the document, then you'll need to figure out if that does anything important to the text that needs to be preserved.

Additionally, HTML is a presentation format, despite some people's efforts to push it away from that. They've pretty much failed at that endeavor. So you'll have to figure out what the presentation markup means and convert that into OSIS. As an example, a superscript number might always be a verse number. But it might not. Encountering `<sup>1</sup>` might be easily translated to a meaning in your OSIS document, but it also might not, because it might be used by both the verses and the footnotes. Of course, those might be delimited by `<sup class="verse">1</sup>` and `<sup class="footnote">1</sup>` but it's equally possible that the difference is `<sup style="color: green">1</sup>` and `<sup style="color: blue">1</sup>` and now what does THAT mean? Of course, they might have gone with `<span style="vertical-align: super; font-size: 50%; color: blue; cursor: pointer" onclick="show_box();">1</span>` and now you've got to parse the value of `show_box` defined in JavaScript somewhere to figure out what's been done and what type of character this is.
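To make that concrete, here is a rough sketch of the classification step in Python, run against the XML produced by step 1. The rule tables are purely hypothetical: which class or color means "verse" and which means "footnote" is exactly what you have to work out per file, and the label strings are my own invention, not OSIS terms.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules for one imaginary source file; every real file needs its own.
CLASS_RULES = {"verse": "verse-number", "footnote": "footnote-marker"}
COLOR_RULES = {"green": "verse-number", "blue": "footnote-marker"}


def parse_style(style):
    """Turn 'color: blue; font-size: 50%' into {'color': 'blue', 'font-size': '50%'}."""
    declarations = {}
    for part in style.split(";"):
        name, sep, value = part.partition(":")
        if sep:
            declarations[name.strip().lower()] = value.strip().lower()
    return declarations


def is_superscript(element):
    """True for <sup> and for anything styled with vertical-align: super."""
    if element.tag == "sup":
        return True
    return parse_style(element.get("style", "")).get("vertical-align") == "super"


def classify(element):
    """Return a label for superscript elements, or None for everything else."""
    if not is_superscript(element):
        return None
    cls = element.get("class", "")
    if cls in CLASS_RULES:
        return CLASS_RULES[cls]
    color = parse_style(element.get("style", "")).get("color")
    if color in COLOR_RULES:
        return COLOR_RULES[color]
    return "unknown"  # needs a human to look at it


if __name__ == "__main__":
    sample = (
        '<p><sup class="verse">1</sup>In the beginning'
        '<sup style="color: blue">1</sup>'
        '<span style="vertical-align: super; font-size: 50%">2</span>'
        '<sup>3</sup></p>'
    )
    for element in ET.fromstring(sample).iter():
        print(element.tag, classify(element))
```

Anything that comes back as "unknown" is a pattern you have not recognised yet, and the `onclick` case still requires reading the JavaScript by hand.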

So the simplicity of #2 really boils down to the nature of the HTML you're dealing with, and if it is exceedingly complex in its own right, how much of its own information you need to preserve in the OSIS that you're getting out the other end. And without any visibility into the file, none of the rest of us can begin to guess at the complexity of that process. But it CAN be automated. Like John, I've invested a lot of time back in the day on converting Logos XML to OSIS, and I'm happy to say these things are possible (just not always easy).

There are a number of people on this list who are, or could be, qualified to assist you if you provided a lot more information to fill in the details of what I've just described. However, whether you can engage us will depend on the nature of the text you have, the way you've been given it, and any distribution requirements and rights it's held under.

--Greg


On Fri, Feb 1, 2019 at 10:27 AM Cyrille <lafricai...@gmail.com> wrote:
Hello,
It's all in the title: does anyone have a Linux tool to convert HTML files
to OSIS?
In this case it is for the KD module. I downloaded the HTML source files,
but I don't want to spend a lot of work on them. I will work on Bible issues
first, not commentaries. But if someone has a tool to do the job quickly...

_______________________________________________
sword-devel mailing list: sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page
