On Wed, Sep 2, 2009 at 10:44 PM, Zhenyu Zhong <zhongresea...@gmail.com> wrote:

> Dear all,
>
> I am very interested in Solr and would like to deploy Solr for distributed
> indexing and searching. I hope you are the right Solr expert who can help
> me out.
> However, I have concerns about the scalability and management overhead of
> Solr. I am wondering if anyone could give me some guidance on Solr.
>
> Basically, I have the following questions,
> For indexing
> 1. How does Solr handle the distributed indexing? It seems Solr generates
> index on a single box. What if the index is huge and can't sit on one box?
>

Solr leaves the distribution of the index up to the user. If you think your
index will not fit on one box, you choose a sharding strategy (such as
hashing or round-robin) and index your collection into the shards
accordingly. Solr supports distributed search, so a single query can use all
the shards to produce the results.

> 2. Is it possible for Solr to generate index in HDFS?
>

I have never tried it, but it seems so. See Jason's response and the Jira
issue he mentioned.

> For searching
> 3. Solr provides Master/Slave framework. How does the Solr distribute the
> search? Does Solr know which index/shard to deliver the query to? Or does
> it have to do a multicast query to all the nodes?
>

For a full-text search it is hard to pick the correct shards, because
matching documents could live anywhere (unless your data lends itself to a
very clever sharding scheme). Each shard is queried, and the results are
merged and returned as if you had queried a single Solr server.

> For fault tolerance
> 4. Does Solr handle the management overhead automatically? suppose master
> goes down, how does Solr recover the master in order to get the latest
> index updates? Do we have to code ourselves to handle this?
>

It does not. Currently you have to handle that yourself. Similar topics have
been discussed on this list in the past and some workarounds have been
suggested. I suggest you search the archives.

> 5. Suppose master goes down immediately after the index updates, while the
> updates haven't been replicated to the slaves, data loss seems to happen.
> Does Solr have any mechanism to deal with that?
>

No. If you want, you can set up a backup master and index on both the master
and the backup machine to achieve redundancy. However, switching between the
master and the backup would have to be done by you.

> Performance of real-time index updating
> 6. How is the performance of this realtime index updating? Suppose we are
> updating a million records for a huge index with billions of records
> frequently. Can Solr provides a reasonable performance and low latency on
> that? (Probably it is related to Lucene library)
>

How frequently? With careful sharding, you can distribute your write load.
Depending on your data, you may also be able to split your indexes into a
more frequently updated one and an older archive index.

A lot of work is in progress in this area. Lucene 2.9 has support for near
real time search, with more improvements planned in the coming days. Solr 1.4
will not have support for these new Lucene features, but with 1.5 things
should be a lot better.

-- 
Regards,
Shalin Shekhar Mangar.
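
As an illustration of the hash-based sharding and distributed search described in the answer to question 1, here is a minimal Python sketch. The host names and the helper names (`shard_for`, `distributed_query_url`) are hypothetical, and it assumes each shard exposes the standard `/select` handler and that the query is fanned out with the `shards` request parameter. It only shows the routing logic, not actual indexing.

```python
import hashlib
from urllib.parse import urlencode

# Hypothetical shard addresses; replace with your own hosts.
SHARDS = [
    "solr1.example.com:8983/solr",
    "solr2.example.com:8983/solr",
    "solr3.example.com:8983/solr",
]


def shard_for(doc_id, shards=SHARDS):
    """Pick the shard for a document by hashing its id.

    md5 is used only as a stable hash, so the same id always maps to the
    same shard across processes and restarts.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]


def distributed_query_url(query, shards=SHARDS):
    """Build a select URL asking the first shard to query all shards
    and merge the results."""
    params = urlencode({"q": query, "shards": ",".join(shards)})
    return "http://" + shards[0] + "/select?" + params


if __name__ == "__main__":
    for doc_id in ("doc-1", "doc-2", "doc-3"):
        print(doc_id, "->", shard_for(doc_id))
    print(distributed_query_url("title:solr"))
```

Note that with plain modulo hashing, changing the number of shards changes where most documents belong, so adding a shard generally means reindexing.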