On Wed, 28 Nov 2001, Andreas Jung wrote: > I think the software "MG" from the book "Managing Gigabytes" is GPLed and > currently > released as mg-1.21. Walking through the TOC of the book, it seems to be a > very detailed > sources about text processing and gives very much informations about > different indexes types. > But I miss some explanations about current data structures like suffix > arrays or suffix tree > that have several advantages for text processing compared to B-Trees.
Suffix Trees/Tries take up a *lot* of space. But they are very fast, and useful for searching for substrings. The main gist of the stuff in 'Managing Gigabytes' is that it is possible to store an ascending list of integers in a compressed form, such that on average each integer requires only 4 bits to represent it. This is obviously much more compact than a straight list of 32 or 64 bit integers/longs (plus any overhead python adds to its inbuild list type). The other point is that you can read and decode the lists very quickly (you don't need to decompress the entire list first before reading it). Also consecutive numbers only take 1 bit of storage, this means that 'stopwords' that are normally omitted from indexes due to their very high frequency (and hence bloat of the index) can be stored very efficiently. One problem is that all of the research done in MG is based on much older hardware than is currently availible and they try to make certain optimisations, which nowadays don't save much time. -Matt -- Matt Hamilton [EMAIL PROTECTED] Netsight Internet Solutions, Ltd. Business Vision on the Internet http://www.netsight.co.uk +44 (0)117 9090901 Web Hosting | Web Design | Domain Names | Co-location | DB Integration _______________________________________________ Zope-Dev maillist - [EMAIL PROTECTED] http://lists.zope.org/mailman/listinfo/zope-dev ** No cross posts or HTML encoding! ** (Related lists - http://lists.zope.org/mailman/listinfo/zope-announce http://lists.zope.org/mailman/listinfo/zope )