On Wed, Oct 12, 2011 at 2:18 PM, David Haslam <dfh...@googlemail.com> wrote:
> Hi Troy,
>
> Yes - you're probably right about lack of a readily available tool for
> direct conversion.
>
> Had I been tackling the task, I might have considered these steps:
>
> 1. Open each HTML file using MS Word, save each file as RTF.
> 2. Open each RTF file using WordPad, save again as RTF (smaller and simpler
> file structure).
> 3. Create & run a script to process the RTF tags for italics attribute and
> for red font colour.
> 4. Open the processed RTF files using WordPad, save as Unicode text
> (encoded as UTF-16 LE).
> 5. Use a suitable editor to open the Unicode text files and change encoding
> to UTF-8 (without BOM).

This seems far more complicated than it needs to be, and filtering HTML through MS Word is probably a terrible idea. We frequently talk about format-shifting and the information loss that results from it. Every programming language a person is likely to know has a library for parsing HTML directly in some fashion. If you have any knowledge of scripting and coding, it is probably a much better idea to use one of those libraries and go from HTML to OSIS in a single step.

I have done this at least twice now. With only a small amount of work you can adapt a script so that it processes any source text from a given source format. With Wycliffe we have two source formats, both proprietary SGML formats akin to HTML. We wrote parsing scripts using well-established SGML and XML tools, and we use them for automated processing of around 800 different source texts.

Moreover, most scripting languages have a simple mechanism that handles the encoding conversion as well. In Python, a single line in the script is enough to convert from any given source encoding into UTF-8. Assume the variable 'text' holds the source as a byte string in encoding 'enc'. Just execute text.decode(enc).encode('utf-8') and you're done. The SWORD library has similar functionality in SWBuf, and I am fairly sure Perl has similar abilities.

All in all, you're much better off writing a script that goes straight from the source markup (HTML in this case) to OSIS. Yes, you'd need a new script for each source, as each one will use different HTML constructs, but a single script could be used to - for instance - lift all the translations on Biblegateway into a person's local repository. A single script could run through his website, scrape it, and dump it into an OSIS text with little effort. The markup format is simple and readily handled by many HTML loading/parsing libraries.

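A minimal sketch of that direct route, in Python with only the standard library, might look like the following. The HTML layout it expects (a <p class="verse"> paragraph holding a <span class="num"> verse number, <i> for italics and <font color="red"> for red letters) is purely an assumption for illustration, as are the names to_utf8, VerseToOsis and html_to_osis; a real source will use different constructs and need its own mapping. The output is a run of OSIS verse fragments, not a complete OSIS document.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape


def to_utf8(raw: bytes, enc: str) -> bytes:
    """Re-encode a byte string from a known source encoding to UTF-8."""
    return raw.decode(enc).encode("utf-8")


class VerseToOsis(HTMLParser):
    """Map an assumed verse-per-paragraph HTML layout to OSIS fragments."""

    def __init__(self, book: str, chapter: int):
        super().__init__(convert_charrefs=True)
        self.book = book
        self.chapter = chapter
        self.out = []            # output fragments, joined at the end
        self.open_tags = []      # (html_tag, osis_closing_tag) pairs
        self.in_verse = False    # inside <p class="verse">?
        self.in_num = False      # inside <span class="num">?
        self.verse_open = False  # has a <verse> element been emitted?

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        closing = ""
        if tag == "p" and attrs.get("class") == "verse":
            self.in_verse = True
        elif tag == "span" and attrs.get("class") == "num":
            self.in_num = True
        elif tag == "i" and self.in_verse:
            self.out.append('<hi type="italic">')
            closing = "</hi>"
        elif tag == "font" and attrs.get("color") == "red" and self.in_verse:
            # Assumption: red text marks the words of Jesus.
            self.out.append('<q who="Jesus">')
            closing = "</q>"
        self.open_tags.append((tag, closing))

    def handle_endtag(self, tag):
        # Close the most recent matching start tag; ignore stray end tags.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i][0] == tag:
                self.out.append(self.open_tags.pop(i)[1])
                break
        if tag == "span":
            self.in_num = False
        elif tag == "p":
            if self.verse_open:
                self.out.append("</verse>\n")
            self.in_verse = False
            self.verse_open = False

    def handle_data(self, data):
        if self.in_num:
            # The verse number becomes part of the osisID, not verse text.
            osis_id = f"{self.book}.{self.chapter}.{data.strip()}"
            self.out.append(f'<verse osisID="{osis_id}">')
            self.verse_open = True
        elif self.in_verse:
            self.out.append(escape(data))


def html_to_osis(raw: bytes, enc: str, book: str, chapter: int) -> bytes:
    """Decode the source, convert its markup, and return UTF-8 bytes."""
    parser = VerseToOsis(book, chapter)
    parser.feed(raw.decode(enc))
    parser.close()
    return "".join(parser.out).encode("utf-8")
```

Called as html_to_osis(raw_bytes, "latin-1", "John", 3), this would turn each verse paragraph of one chapter into a <verse osisID="John.3.N"> element. Italics and red letters are carried across in the same pass, so there is no intermediate format where they can be lost.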
--Greg

> After step 5 you'd have something similar to where you began converting
> plain text to OSIS, but with some ingenuity at step 3, you'd also have some
> elementary markup for italics and red letters that survives the complete
> loss of formatting attributes at step 4.
>
> During my Go Bible activities, I've used this approach more times than I can
> recall.
>
> /The steepest part of the learning curve is getting used to the format of
> RTF files when viewed by an ordinary text editor/.
>
> After step 5, it's often simpler to do the next conversion to USFM, and then
> use usfm2osis.pl
>
> Best regards,
> David
>
> --
> View this message in context:
> http://sword-dev.350566.n4.nabble.com/EMTV-text-source-URL-is-now-unrelated-tp3871411p3899264.html
> Sent from the SWORD Dev mailing list archive at Nabble.com.
>
> _______________________________________________
> sword-devel mailing list: sword-devel@crosswire.org
> http://www.crosswire.org/mailman/listinfo/sword-devel
> Instructions to unsubscribe/change your settings at above page