This is a refinement of FASD that proposes a simpler form of inverted index. It is compact and fast, and is effective for Asian languages. I will only briefly sketch out the idea here, saving the many details for further discussion.
Conceptually, the indexing scheme is as follows: 1) A document (or abstract, description, title ...) to be indexed is converted to 32-bit Unicode. 2) All n-grams of length 1 through N are extracted from the Unicode text. I have found N=4 to be a good choice for English, and N=3 to be a good choice for Japanese. 3) A bit vector is formed by taking each n-gram to represent an integer, and setting the corresponding bit on. 4) The bit vector is compressed by some scheme which is compact and easily uncompressed. This bit vector is the document surrogate used for retrieval. Conceptually, the retrieval scheme is as follows: 1) A full-text query is converted to Unicode. 2) All n-grams of length 1 through N are extracted from the Unicode text (N need not be the same as for indexing). 3) A bit vector is formed of all n-grams in the query, just as for the document surrogate. 4) A compressed form of the bit vector is used as a FASD retrieval key. The similarity function is the number of bits that are on in the bit-wise AND of the query vector and a document vector. Many further refinements are possible, such as stop-word pruning, searches for misspelled words, multiple-vector Boolean queries, and weighting of search words. I think that as much as possible these features should be provided by the client, rather than the many servers. I prefer n-gram approaches over word-based approachs because many languages do not put spaces between words, making parsing of documents into words very difficult. Also, searches for phrases are as easy as searches for words, with no need to store word positions in the index. The main disadvantage of the n-gram approach is that a search will retrieve a false positive whenever all of the n-grams within a query word or phrase appear in a document, but not the word or phrase itself. In my experience this is not a major problem, and if desired retrieved documents can be scanned by the client to eliminate false positives. I have found the use of compressed bit vectors as document surrogates to be a fast and compact scheme, requiring on average two bits in the compressed surrogate for each unique word in a document. I have no references at hand to back up my opinions, but I do have some twenty years of experience building full-text search systems, and I do my best to keep up with the relevant academic literature. _______________________________________________ Tech mailing list [EMAIL PROTECTED] http://hawk.freenetproject.org/cgi-bin/mailman/listinfo/tech
