On Wed, Jan 6, 2010 at 11:10 AM, Nic Pottier <[email protected]> wrote:
> On Wed, Jan 6, 2010 at 10:48 AM, Chris Anderson <[email protected]> wrote:
>> The only catch is that you'll end up with a large index file in the
>> long run. Lucene's indexes should be more compact on disk. Lucene also
>> has more stemming options and will generally be smarter than your
>> tokenizer.
>>
>> That said, if it works, it works.
>
> Thanks Chris. I do have a decent amount of experience with Lucene as
> well, so I realize that it is a great product; I just didn't want to add
> another dependency, especially considering that CouchDB is still
> changing quite a bit under the hood.
>
> Is there any way to get insight into how big the index is? I can see how
> big my database is (78M with ~11k docs) but I'd be curious to know how
> big that view is stored in memory.
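(For context, a minimal sketch of the kind of tokenizing map view being
discussed; the "text" field name and the regex split are assumptions for
illustration, not Nic's actual code:)

    // Map: emit one (word, doc id) row per token found in the
    // document's text field. "text" is an assumed field name.
    function (doc) {
      if (!doc.text) return;
      var words = doc.text.toLowerCase().split(/[^a-z0-9]+/);
      for (var i = 0; i < words.length; i++) {
        if (words[i].length > 1) {
          emit(words[i], doc._id);  // one row per word/id pair
        }
      }
    }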
The view is stored on disk. Look in the CouchDB data directory
(/usr/local/var/lib/couchdb) for the view directory.

> One question I have is that it seems rather inefficient to store each
> word/id pair individually. Would there be any value in adding a reduce
> step that groups them, so that the view would be word -> [id array]
> instead? I will admit the reduce() step is one I am still grappling
> with a bit.

Our reduce is not key-bounded, so [id array] would end up being the list
of unique ids in the entire database for a full reduce. The storage
inefficiency you describe is likely what would force you from a pure
CouchDB solution to a Lucene full-text-indexing (FTI) solution first, as
your data begins to scale.

Chris

--
Chris Anderson
http://jchrisa.net
http://couch.io
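(To make the key-bounded point above concrete, here is a sketch of the
grouping reduce Nic describes. It is illustrative, not from the thread,
and it only behaves as intended when the view is queried with group=true,
which bounds each reduction to a single word's rows:)

    // Reduce: collect the ids emitted for one key into an array.
    // Safe only with ?group=true; an unbounded full reduce would try
    // to accumulate every id in the database into one value, which is
    // exactly the problem described above.
    function (keys, values, rereduce) {
      if (rereduce) {
        // values is a list of id arrays from earlier reductions
        var out = [];
        for (var i = 0; i < values.length; i++) {
          out = out.concat(values[i]);
        }
        return out;
      }
      return values;  // already the list of ids for this key
    }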
