On Wed, Jan 6, 2010 at 10:48 AM, Chris Anderson <[email protected]> wrote:
> The only catch is that you'll end up with a large index file in the
> long run. Lucene's indexes should be more compact on disk. Lucene also
> has more stemming options and will generally be smarter than your
> tokenizer.
>
> That said, if it works, it works.
Thanks Chris. I do have a decent amount of experience with Lucene as well, so I realize it's a great product; I just didn't want to add another dependency, especially considering that CouchDB is still changing quite a bit under the hood.

Is there any way to get insight into how big the index is? I can see how big my database is (78M with ~11k docs), but I'd be curious to know how much space that view takes up in memory.

One question I have is that it seems rather inefficient to store each word/id pair individually. Would there be any value in adding a reduce step that groups them, so that the view would be word -> [id array] instead? I will admit the reduce() step is one I am still grappling with a bit.
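To make that concrete, here's a rough sketch of the kind of map/reduce pair I'm picturing (the doc.body field and the naive word splitting are just placeholders, not my actual schema or tokenizer):

// map: emit one row per (word, doc id) pair
function (doc) {
  var words = (doc.body || "").toLowerCase().split(/\W+/);
  for (var i = 0; i < words.length; i++) {
    if (words[i]) {
      emit(words[i], doc._id);
    }
  }
}

// reduce: collapse the rows for each word into a single id array,
// so the grouped view reads word -> [id array]
function (keys, values, rereduce) {
  if (rereduce) {
    // on rereduce the values are already arrays of ids, so flatten them
    return [].concat.apply([], values);
  }
  // on the first pass the values are the individual doc ids
  return values;
}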
-Nic