Dear Distinguished Colleagues:
I need to add full-text search and somewhat free-form queries to my
application. Our data is made up of "items" that are stored in a single
column family, and we have a bunch of secondary indices for lookups.
An item has header fields and data fields. The items CF is a super
column family in which the row key is the item's natural ID, with one
super column for the header and one super column for the data.
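To make the layout concrete, here is a rough sketch of it as nested
Python dicts. It only illustrates the shape (row key, then super column,
then column); the item ID, field names, and values are made up, and the
helper get_field is hypothetical, not part of any Cassandra client.

```python
# Illustrative only: the items CF modeled as nested dicts.
# Shape: row key -> super column name -> column name -> value.
# The ID, field names, and values below are invented for the example.
items_cf = {
    "item-0001": {
        "header": {"type": "invoice", "created": "2012-03-01"},
        "data": {"customer": "ACME", "amount": "125.00"},
    },
}


def get_field(cf, row_key, super_column, column):
    """Return one column value, or None if any level is missing."""
    return cf.get(row_key, {}).get(super_column, {}).get(column)


header_type = get_field(items_cf, "item-0001", "header", "type")
missing = get_field(items_cf, "no-such-item", "header", "type")
```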
Our application is made up of several redundant, load-balanced servers,
all pointing at a Cassandra cluster. Our servers run embedded Jetty.
I need to be able to find items by a combination of field values.
Currently I have an index of items by field value, which works
reasonably well. I could also add support for data types and index
items by typed field values, so that we can run range queries on
items.
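As a sketch of what I mean by indexing by field value, with typed
values enabling range queries, here is a toy in-memory version in
Python. It is not how our Cassandra index is actually laid out; the
class FieldIndex and its method names are hypothetical. The point is
only that keeping the distinct values of each field in sorted order
(which requires comparable, typed values) is what makes a range scan
possible, and that a multi-field query is an intersection of lookups.

```python
import bisect
from collections import defaultdict


class FieldIndex:
    """Toy secondary index: (field, value) -> item IDs.

    Distinct values per field are kept sorted so range scans work.
    Values of one field must be mutually comparable (same type).
    """

    def __init__(self):
        self._ids = defaultdict(set)      # (field, value) -> item IDs
        self._values = defaultdict(list)  # field -> sorted distinct values

    def add(self, item_id, field, value):
        key = (field, value)
        if key not in self._ids:
            # First time this value is seen: keep the value list sorted.
            bisect.insort(self._values[field], value)
        self._ids[key].add(item_id)

    def lookup(self, field, value):
        """Exact match on a single field."""
        return set(self._ids.get((field, value), ()))

    def range(self, field, low, high):
        """All items whose field value lies in [low, high]."""
        values = self._values.get(field, [])
        start = bisect.bisect_left(values, low)
        end = bisect.bisect_right(values, high)
        result = set()
        for value in values[start:end]:
            result |= self._ids[(field, value)]
        return result

    def find_all(self, criteria):
        """Items matching every field=value pair in the criteria dict."""
        sets = [self.lookup(f, v) for f, v in criteria.items()]
        return set.intersection(*sets) if sets else set()


index = FieldIndex()
index.add("item-1", "type", "invoice")
index.add("item-2", "type", "invoice")
index.add("item-1", "amount", 125)
index.add("item-2", "amount", 900)
```

With that data, index.range("amount", 100, 500) finds only item-1, and
index.find_all({"type": "invoice", "amount": 900}) finds only item-2.
None of this covers free text, which is why I am looking at the options
below.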
Ultimately, though, what we want is full text search with suggestions
and human language sensitivity. We want to search by date ranges, by
field values, etc. I did some homework on this topic, and here is what
I see as options:
1) Use an SQL database as a helper. This is rather clunky, and I am not
sure what it gets us, since just about anything that can be done in SQL
can be done in Cassandra with the proper structures. The other problem
is where I would find an open source database that can handle the
workload. Probably nowhere, and I would not get natural language
support either.
2) Each of our servers can index data using Lucene, but again we would
have to come up with a clunky mechanism: either one of the servers does
the indexing and the results are replicated to the others, or each
server does its own indexing.
3) We can use Solr as is. Perhaps with some small modifications it can
run within our server JVM, since we already run embedded Jetty. I
actually like this idea, but I know that Solr indexing doesn't take
advantage of Cassandra.
4) DataStax Enterprise with search presumably supports Solr indexing
of existing column families, but for the life of me I couldn't figure
out how exactly it does that. The Wikipedia example shows that Solr can
create column families based on Solr schemas, which I can then query
using Cassandra itself (which is great), and supposedly I can modify
those column families directly and Solr will reindex them (which is
even better). But I am not sure how that fits into our server design.
The other concern is lock-in to a commercial product, something I am
very much worried about.
So, one possibility I can see is embedding Solr within our own server
but storing its indexes in the file system, outside of Cassandra. This
is not optimal, and maybe over time I can add my own support for
storing the Solr index in Cassandra without relying on the DataStax
solution.
In any case, what are your thoughts and experiences?
Regards,
Oleg