IIRC, Huffman encoding seems to produce an optimal compression. The
basic idea is to build a trie with the shortest paths through the trie
being the most frequent patterns. The algorithms that I saw did this
on input assuming a single byte character encoding such as ASCII or
Latin-1. It is readily adaptable to UTF-8, by considering bytes rather
than characters.

I don't think this is typically true. At least for text, LZW-type compression generally gives a better compression ratio, though not necessarily better speed.
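For reference, the byte-oriented trie construction described above can be sketched roughly as follows. This is only a minimal illustration, not code from SWORD or any existing library; the function names `huffman_codes`, `encode`, and `decode` are made up for this example, and the bits are kept as a text string of "0"/"1" for readability rather than packed into bytes. Because it works on bytes, UTF-8 input needs no special handling.

```python
import heapq
from collections import Counter


def huffman_codes(data: bytes) -> dict[int, str]:
    """Map each byte value in data to a bit string; frequent bytes get shorter codes."""
    freq = Counter(data)
    if not freq:
        return {}
    if len(freq) == 1:
        # A single distinct byte still needs a one-bit code.
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tie_breaker, {byte: code_so_far}).
    # The unique tie_breaker keeps the dicts from ever being compared.
    heap = [(w, i, {b: ""}) for i, (b, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees prepends one more branch bit to every code in them.
        merged = {b: "0" + c for b, c in left.items()}
        merged.update({b: "1" + c for b, c in right.items()})
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]


def encode(data: bytes, codes: dict[int, str]) -> str:
    """Concatenate the code of every byte in data."""
    return "".join(codes[b] for b in data)


def decode(bits: str, codes: dict[int, str]) -> bytes:
    """Walk the bit string, emitting a byte whenever a complete code is seen."""
    inverse = {c: b for b, c in codes.items()}
    out = bytearray()
    current = ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    return bytes(out)
```

A quick way to try it is `codes = huffman_codes(text.encode("utf-8"))` followed by `decode(encode(raw, codes), codes)`, which should give back the original bytes.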

I am not aware of any available code to do this. It might exist. But
it probably would need to be written.

Is it worth the effort? I don't think so at this point in time. There
is enough else to do that this keeps getting pushed further down my
todo list. And unless it makes sense in the SWORD world as a
contribution, it would only be an academic exercise for me (which I
love doing).

I think that in the LCDBible world, it would make lots of sense.

A year or so ago, I set up a SourceForge project, BibleDb, that would be optimized for Bible decompression/decryption/search speed (not necessarily for compression ratio). The idea was to encode each word with a variable number of bits (6, 10, or 14), chosen from an analysis of word frequency. All tags would be stored externally as lengths/offsets rather than in the actual content, in order to optimize searching.
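To make the 6/10/14-bit idea concrete, here is a rough sketch of one way such a table could be built. Everything specific in it is an assumption for illustration and not the BibleDb design: the prefix bits ("0", "10", "11") used to tell the three code lengths apart, the resulting tier sizes (32, 256, and 4096 words), and the names `TIERS`, `build_table`, and `encode_words`. With this particular split only 4384 distinct words fit, so a real dictionary would need a different split or an escape mechanism.

```python
from collections import Counter

# Hypothetical tiers: (prefix bits, index width in bits).
# Prefix plus index gives total code lengths of 6, 10 and 14 bits.
TIERS = [("0", 5), ("10", 8), ("11", 12)]
CAPACITY = sum(2 ** width for _, width in TIERS)  # 32 + 256 + 4096 = 4384


def build_table(words: list[str]) -> dict[str, str]:
    """Give the most frequent words the 6-bit codes, the next ones 10-bit, then 14-bit."""
    ranked = [word for word, _ in Counter(words).most_common()]
    if len(ranked) > CAPACITY:
        raise ValueError("too many distinct words for this assumed tier layout")
    table = {}
    start = 0
    for prefix, width in TIERS:
        tier_words = ranked[start:start + 2 ** width]
        for index, word in enumerate(tier_words):
            # Fixed-width binary index within the tier, after the tier prefix.
            table[word] = prefix + format(index, f"0{width}b")
        start += 2 ** width
    return table


def encode_words(words: list[str], table: dict[str, str]) -> str:
    """Concatenate the variable-length codes as a string of '0'/'1' characters."""
    return "".join(table[word] for word in words)
```

Since no code is a prefix of another, the bit stream can be decoded left to right without separators, and a search could compare encoded words directly instead of decompressing the text first.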

As a group, all English Bibles have a fairly small number of words (about 16,000, give or take a thousand or so, depending on how you count capitalization, plurals, possessives, contractions, etc.), and the dictionary is very static. The ESV and WEB would have almost exactly the same dictionary; the KJV-1769/ASV would be only slightly different. A single dictionary would suffice for all English translations (or maybe separate dictionaries for the OT and NT?).

One intent was to have searching integrated into this (sort of like how Lucene works?), which would also make dictionaries and concordances feasible.

After some wrestling with it, I realized I don't have the time or math background or aptitude to have much of a chance of making it work. BibleDB is only in pre-alpha stage.
http://sourceforge.net/project/admin/?group_id=117234

_______________________________________________
sword-devel mailing list: [email protected]
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page
