Wow, Steven, you really did your homework well! Major A+ in my book. :) Otis ---- Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch Lucene ecosystem search :: http://search-lucene.com/
----- Original Message ---- > From: Steven Noels <[email protected]> > To: hbase-user <[email protected]> > Sent: Thu, June 3, 2010 3:33:04 AM > Subject: Re: elastic search or other Lucene for HBase? > > On Sat, Mar 27, 2010 at 8:46 PM, Tim Robertson < > ymailto="mailto:[email protected]" > href="mailto:[email protected]">[email protected]>wrote: > > Hi all, > > Is anyone using elastic search as an indexing layer to > HBase content? > It looks to have a really nice API, and was thinking of > setting up an > EC2 test where I maintain an ES index storing only the Key > to HBase > rows. So ES provides all search returning Key lists and > all single > record Get being served from HBase. > > Or is > there a preferred distributed Lucene approach for HBase from the > few > that have been popping up? I have not had a chance to really dig > > into the options but I know there has been a lot of chatter on > this. > For Lily - www.lilycms.org, we opted for SOLR. Here's > some rationale behind that (copy-pasted from our draft Lily > website): Selecting a search solution: SOLR For search, the choice > for Lucene as core technology was pretty much a given. In Daisy, our previous > CMS, we used Lucene only for full-text search and performed structural > searches on the SQL database. We merged the results from those two different > search technologies on the fly, supporting mixed structural and full-text > queries. However, this merging, combined with other high-level features of > Daisy, was not designed to handle very large data sets. For Lily, we decided > that a better approach would be to perform all searching using one > technology, Lucene. A downside to Lucene is that index updates are only > visible with some delay to searchers, though work is ongoing to improve this. > At its heart it is a text-search library, though with its fielded documents > and the trie-range queries, it handles more data-oriented queries quite > well. Lucene in itself is a library, not a standalone application, nor a > scalable search solution. But all this can be built on top. The best known > standalone search server on top of Lucene is SOLR, which we decided to use in > Lily. But before we made that choice, we considered a lot of the > available options: - Katta < > href="http://katta.sourceforge.net/" target=_blank > >http://katta.sourceforge.net/>. Katta provides a powerful > scalable search model whereby each node is responsible for searching > on a number of shards, replicas of the shards are present on multiple > of the nodes. This provides scaling for both index size and number of > users, and gracefully handles node failures since the shards that > were on a failed node will be available online on some other nodes. > However, Katta is only a search solution, not an indexing solution, > and does not offer extra search features such as faceting. > - Hadoop contrib/index< > href="http://svn.apache.org/repos/asf/hadoop/mapreduce/trunk/src/contrib/index/README" > > target=_blank > >http://svn.apache.org/repos/asf/hadoop/mapreduce/trunk/src/contrib/index/README>. > This is a MapReduce solution for building Lucene indexes. The nice > thing about it is that the MR framework manages spreading the index > building work over multiple nodes, reschedules failed jobs, and so > on. It can also be used to update existing indexes. The number of > index shards is determined by the number of reduce tasks. Hadoop > contrib/index is an ideal complement to Katta. The downside is that > it is inherently batch-oriented, which excludes profiting from the > ongoing Lucene near-real time (NRT) work. - The tools > from LinkedIn < > >http://sna-projects.com/sna/> (blog< > href="http://invertedindex.blogspot.com/" target=_blank > >http://invertedindex.blogspot.com/>). LinkedIn has made > available some cool Lucene-related projects like Bobo< > href="http://sna-projects.com/bobo/" target=_blank > >http://sna-projects.com/bobo/>, an optimized facet browser > that does not rely on cached bitsets, and Zoie< > href="http://sna-projects.com/zoie/" target=_blank > >http://sna-projects.com/zoie/>, a real-time index-search > solution (built in a different way than what is available in Lucene > 3). They are apparently integrating it all in Sensei< > href="http://sna-projects.com/sensei/" target=_blank > >http://sna-projects.com/sensei/>. It is interesting to study > the design of these projects. - ElasticSearch < > href="http://www.elasticsearch.com/" target=_blank > >http://www.elasticsearch.com/>. ElasticSearch (ES) is a very > new project, that appeared as a one-man project at about the same time > we made our choice for SOLR. One can easily launch a number of ES > nodes, they find themselves without configuration. Multiple indexes > can be created using a simple REST API. When creating an index, you > specify the number of shards and replicas you desire. It is designed > to work on cloud computing solutions like EC2, where the local disk > is only a temporary storage. There is a lot more to tell, but you can > read that on their website. Despite the name 'elastic', it does not > support indexes growing dynamically in the same way as tables can > grow in HBase: the number of shards is fixed when creating the index. > However, if you find yourself in need of more shards, you can create > a new index with more shards and re-index your content into that. The > number of shards is not related to the number of nodes, so you can plan > for growth by choosing e.g. 10 shards even if you have just one or > two nodes to start with. - Lucandra< > href="http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/" > > target=_blank > >http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/>, > Lucehbase < > >http://github.com/thkoch2001/lucehbase> and Hbasene< > href="http://github.com/akkumar/hbasene" target=_blank > >http://github.com/akkumar/hbasene>. These projects work by > storing the inverted index on top of Cassandra respectively HBase. > The use of a database is quite different from Lucene's segment-based > approach. While it makes the storage of the inverted index scalable, > it does not necessarily make all of Lucene's functionality scalable, > such as sorting and faceting which depend on the field caches and > bitset-based filters. Moreover, for HBase, which we know best, the > storage is not as scalable as it may seem, since terms are stored as > rows and the postings lists (= the documents containing the term) as > columns. Usually the number of terms in a corpus is relatively > limited, while the number of documents can be huge, but columns in > HBase do not scale in the same way as rows. We think the scaling > (sharding, replication) needs to happen on the level of Lucene > instances itself, rather than just the storage below it. Still, it is > interesting to watch how these projects will evolve. - > Building our own. Another option was to just take Lucene itself and > build our own scalable search solution using it. In this case we > would have gone for a Katta/ElasticSearch-like approach to sharding > and replication, with a focus on the search features we are most > interested in (such as faceting). However, we decided that this would > take too much of our time. - Our choice, SOLR < > href="http://lucene.apache.org/solr/" target=_blank > >http://lucene.apache.org/solr/>. SOLR is a standalone > Lucene-based search server made publicly available in 2006, making it > the oldest of the solutions listed here. It makes a lot of Lucene > functionality easily available, adds a schema with field types, > faceting, different kinds of caches and cache warming, a > user-friendly safe query syntax, and more. SOLR supports replicas and > distributed searching over multiple SOLR instances, though you are > responsible for setting it up all yourself, it is very much a static > solution. Work on cloudifying SOLR< > href="http://svn.apache.org/repos/asf/lucene/solr/branches/cloud" > target=_blank > >http://svn.apache.org/repos/asf/lucene/solr/branches/cloud>is ongoing. > SOLR has lots of users, there is a book, there are companies > supporting it, it has a large team < > target=_blank >http://www.ohloh.net/p/solr>, and the Lucene > and SOLR projects recently merged. We don't expect Lily to be > deployed on EC2 infrastructure, but more in private cloud/datacenter settings > with customers. However, should ES have appeared sooner on our radar, there's > a good chance we would have looked at it more in-depth. SOLR cloud uses > ZooKeeper which we'll need between Lily clients and servers anyway, so that > was a nice fit with our architecture. I haven't compared the search > language/features between SOLR and ES myself though, it could be that you > need specific stuff from ES (like: the EC2 focus). Hope this > helps, Steven. -- Steven Noels > > href="http://outerthought.org/" target=_blank > >http://outerthought.org/ Outerthought > Open Source Java > & XML stevenn at outerthought.org > Makers of the Daisy CMS
