Hello Scott,

I was hoping that you would read my message. Many thanks for your reply!

> UPDATE and DELETE need to have the previous document text, because the
> docids are embedded in the index, and there is no docid->term index
> (or, put another way, the previous document text _is_ the docid->term
> index).

This is very understandable given the present design.

> Keeping track of that information would probably double the
> size of the index.

With your estimate, the SQLite full text index (without document storage) would still take up only 50% of the documents' size. In my opinion, this is still a very good ratio, even if some specialized full text search engines apparently get away with less than 30%. I think you have done an enormous job on FTS2!

I am optimistic that a proper implementation will use even less than 50%. My modifications are completely rudimentary and not at all optimized: the column to store the document text still exists. The only difference is that it is not used - it stores a NULL value, which could be omitted as well. In fact, the entire FTS table (the one without the suffixes) would not be needed, which would cut down storage space further.

> A thing I've considered doing is to keep deletions
> as a special index to the side,

Would this open the door to "insert only, but no-modify and no-delete" indexes? I am sure users would gladly pay this cost for the benefit of even smaller FTS indexes!

> which would allow older data to be
> deleted during segment merges. Unfortunately, I suspect that this
> would slow things down by introducing another bit of data which needs
> to be considered during merges.

I found that _not_ adding the original text turned out to be a great time saver. This makes sense given that the original text is about 4 times the size of the index. Storing lots of text is already quite time consuming by itself, even without creating an FTS index. So I do not expect really bad slowdowns from adding a docid->term index.
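
To make the docid->term idea concrete, here is a minimal Python sketch of a toy inverted index that keeps a docid->terms map on the side instead of the document text. It is only an illustration under simplifying assumptions (whitespace tokenization, everything in memory); the class `ToyIndex` and its methods are hypothetical names and do not reflect how FTS2 actually stores its segments or doclists.

```python
from collections import defaultdict


class ToyIndex:
    """Toy inverted index that keeps a docid->terms map instead of the text."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {docid: [positions]}
        self.doc_terms = {}                # docid -> set of terms (side index)

    def insert(self, docid, text):
        terms = text.lower().split()       # assumption: whitespace tokenizer
        for pos, term in enumerate(terms):
            self.postings[term].setdefault(docid, []).append(pos)
        self.doc_terms[docid] = set(terms)  # the text itself is not kept

    def delete(self, docid):
        # The side index tells us which posting lists to touch,
        # so the original document text is not required.
        for term in self.doc_terms.pop(docid, ()):
            del self.postings[term][docid]
            if not self.postings[term]:
                del self.postings[term]

    def search(self, term):
        return sorted(self.postings.get(term.lower(), {}))


idx = ToyIndex()
idx.insert(1, "full text search")
idx.insert(2, "text storage")
idx.delete(1)              # works without re-supplying "full text search"
print(idx.search("text"))
```

The extra cost is the `doc_terms` map, which corresponds to the "double the size of the index" estimate quoted above; an insert-only index could drop it entirely.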

> Of course, there's no way the current system could generate snippets
> without the original text, because doclists don't record the set of
> adjacent terms. That information could be recorded, but it's doubtful
> that doing so would be an improvement on simply storing the original
> text in the first place. The current system _does_ have everything
> needed to generate the offsets to hits even without the original text,
> so the client application could generate snippets, though the code is
> not currently in place to expose this information.

Snippets are of course nice to have out of the box, as they are right now. But even without storing the original text, snippets could be created by:

1. Supplying the text through other means (an additional parameter or a callback function), so that the application rather than FTS would read it from a disk file or decompress it from a database field.
2. Constructing token-only snippets from the document tokens and offsets. This would of course exclude all non-word characters, but would still return legible information.

> Being able to have an index without storing the original data was a
> weak goal when fts1 was being developed, but every time we visited
> it, we found that the negatives of that approach were substantial
> enough to discourage us for a time. [The "we" in that sentence means
> "me and the various people I run wacky ideas past."] I'm keeping an
> eye out for interesting implementation strategies and the time to
> explore them, though.

Maybe my arguments could influence the opinion of "we"? I would love to see FTS without text storage, especially since I just lost a project to another FTS product because duplicating data was unfortunately "out of disk space".

All the best, and keep up your good work,

Ralf
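
For illustration, the two snippet options described in this message can be sketched in Python. This is a rough sketch, not FTS code: `tokenize`, `token_snippet`, `text_snippet`, and the `load_text` callback are hypothetical names, the regular-expression tokenizer is only a stand-in for a real one, and offsets are character offsets here for simplicity.

```python
import re


def tokenize(text):
    """Return (token, offset) pairs; a stand-in for a real tokenizer."""
    return [(m.group().lower(), m.start()) for m in re.finditer(r"\w+", text)]


def token_snippet(tokens, hit_index, context=3):
    """Option 2: build a snippet from stored tokens only (punctuation is lost)."""
    lo = max(0, hit_index - context)
    hi = min(len(tokens), hit_index + context + 1)
    words = [tok for tok, _ in tokens[lo:hi]]
    words[hit_index - lo] = "[" + words[hit_index - lo] + "]"
    return " ".join(words)


def text_snippet(load_text, docid, offset, length, context=20):
    """Option 1: the application supplies the text through a callback."""
    text = load_text(docid)  # e.g. read a file or decompress a database field
    start = max(0, offset - context)
    end = min(len(text), offset + length + context)
    hit_end = offset + length
    return text[start:offset] + "[" + text[offset:hit_end] + "]" + text[hit_end:end]


# The application, not the index, owns the document text.
docs = {7: "The index is small, fast and compact."}
tokens = tokenize(docs[7])
hit = next(i for i, (tok, _) in enumerate(tokens) if tok == "fast")

print(token_snippet(tokens, hit, context=2))
print(text_snippet(docs.get, 7, tokens[hit][1], len("fast"), context=8))
```

The token-only variant needs the per-document token list, which is the same information a docid->term index would have to hold; the callback variant needs only the hit offsets.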