Re: elastic search or other Lucene for HBase?

Otis Gospodnetic Thu, 03 Jun 2010 07:59:16 -0700

Wow, Steven, you really did your homework well!  Major A+ in my book. :)

 Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/




----- Original Message ----
> From: Steven Noels <[email protected]>
> To: hbase-user <[email protected]>
> Sent: Thu, June 3, 2010 3:33:04 AM
> Subject: Re: elastic search or other Lucene for HBase?
> 
On Sat, Mar 27, 2010 at 8:46 PM, Tim Robertson <[email protected]> wrote:

> 
> Hi all,
>
> Is anyone using elastic search as an indexing layer to HBase content?
> It looks to have a really nice API, and I was thinking of setting up an
> EC2 test where I maintain an ES index storing only the Key to HBase
> rows.  ES would serve all searches, returning lists of Keys, and every
> single-record Get would be served from HBase.
>
> Or is there a preferred distributed Lucene approach for HBase from the
> few that have been popping up?  I have not had a chance to really dig
> into the options but I know there has been a lot of chatter on this.
>
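
The layout Tim describes (the index stores only row keys, the row store serves the records) can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions: the class and method names are hypothetical, and plain in-memory maps stand in for both ElasticSearch and HBase, so no real client API is shown.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical stand-in: a real setup would use a search client for the
// index and the HBase client for the row store.
class KeyOnlyIndexSketch {
    // term -> row keys of the records containing that term (the "search index")
    private final Map<String, SortedSet<String>> index = new HashMap<>();
    // row key -> full record (the "HBase table")
    private final Map<String, String> rowStore = new HashMap<>();

    // Store the record and index its terms; only the key goes into the index.
    void put(String rowKey, String text) {
        rowStore.put(rowKey, text);
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                index.computeIfAbsent(term, t -> new TreeSet<>()).add(rowKey);
            }
        }
    }

    // Search step: returns row keys only, never record contents.
    List<String> searchKeys(String term) {
        SortedSet<String> keys =
                index.getOrDefault(term.toLowerCase(), Collections.emptySortedSet());
        return new ArrayList<>(keys);
    }

    // Fetch step: a single-record Get against the row store.
    String get(String rowKey) {
        return rowStore.get(rowKey);
    }

    public static void main(String[] args) {
        KeyOnlyIndexSketch s = new KeyOnlyIndexSketch();
        s.put("row-1", "Lucene index shards");
        s.put("row-2", "HBase rows and Lucene");
        for (String key : s.searchKeys("lucene")) {
            System.out.println(key + " -> " + s.get(key));
        }
    }
}
```

The point of the split is that the index stays small (keys only), at the cost of one extra Get per hit when results are displayed.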


For Lily - www.lilycms.org, we opted for SOLR. Here's some rationale behind
that (copy-pasted from our draft Lily website):

Selecting a search solution: SOLR

For search, the choice of Lucene as core technology was pretty much a
given. In Daisy, our previous CMS, we used Lucene only for full-text search
and performed structural searches on the SQL database. We merged the results
from those two different search technologies on the fly, supporting mixed
structural and full-text queries. However, this merging, combined with other
high-level features of Daisy, was not designed to handle very large data
sets. For Lily, we decided that a better approach would be to perform all
searching using one technology, Lucene.

A downside to Lucene is that index updates are only visible with some delay
to searchers, though work is ongoing to improve this. At its heart it is a
text-search library, though with its fielded documents and the trie-range
queries, it handles more data-oriented queries quite well.

Lucene in itself is a library, not a standalone application, nor a scalable
search solution. But all this can be built on top. The best known standalone
search server on top of Lucene is SOLR, which we decided to use in Lily.

But before we made that choice, we considered a lot of the available
options:

   - Katta <http://katta.sourceforge.net/>. Katta provides a powerful
   scalable search model whereby each node is responsible for searching
   a number of shards, with replicas of each shard present on several
   nodes. This provides scaling for both index size and number of users,
   and gracefully handles node failures, since the shards that were on a
   failed node will be available online on some other nodes. However,
   Katta is only a search solution, not an indexing solution, and does
   not offer extra search features such as faceting.
   - Hadoop contrib/index
   <http://svn.apache.org/repos/asf/hadoop/mapreduce/trunk/src/contrib/index/README>.
   This is a MapReduce solution for building Lucene indexes. The nice
   thing about it is that the MR framework manages spreading the index
   building work over multiple nodes, reschedules failed jobs, and so
   on. It can also be used to update existing indexes. The number of
   index shards is determined by the number of reduce tasks. Hadoop
   contrib/index is an ideal complement to Katta. The downside is that
   it is inherently batch-oriented, which excludes profiting from the
   ongoing Lucene near-real time (NRT) work.
   - The tools from LinkedIn <http://sna-projects.com/sna/> (blog
   <http://invertedindex.blogspot.com/>). LinkedIn has made available
   some cool Lucene-related projects like Bobo
   <http://sna-projects.com/bobo/>, an optimized facet browser that does
   not rely on cached bitsets, and Zoie <http://sna-projects.com/zoie/>,
   a real-time index-search solution (built in a different way than what
   is available in Lucene 3). They are apparently integrating it all in
   Sensei <http://sna-projects.com/sensei/>. It is interesting to study
   the design of these projects.
   - ElasticSearch <http://www.elasticsearch.com/>. ElasticSearch (ES)
   is a very new project that appeared as a one-man project at about the
   same time we made our choice for SOLR. One can easily launch a number
   of ES nodes, and they find each other without configuration. Multiple
   indexes can be created using a simple REST API. When creating an
   index, you specify the number of shards and replicas you desire. It
   is designed to work on cloud computing solutions like EC2, where the
   local disk is only temporary storage. There is a lot more to tell,
   but you can read that on their website. Despite the name 'elastic',
   it does not support indexes growing dynamically in the same way as
   tables can grow in HBase: the number of shards is fixed when creating
   the index. However, if you find yourself in need of more shards, you
   can create a new index with more shards and re-index your content
   into that. The number of shards is not related to the number of
   nodes, so you can plan for growth by choosing e.g. 10 shards even if
   you have just one or two nodes to start with.
   - Lucandra
   <http://blog.sematext.com/2010/02/09/lucandra-a-cassandra-based-lucene-backend/>,
   Lucehbase <http://github.com/thkoch2001/lucehbase> and Hbasene
   <http://github.com/akkumar/hbasene>. These projects work by storing
   the inverted index on top of Cassandra or HBase, respectively. The
   use of a database is quite different from Lucene's segment-based
   approach. While it makes the storage of the inverted index scalable,
   it does not necessarily make all of Lucene's functionality scalable,
   such as sorting and faceting, which depend on the field caches and
   bitset-based filters. Moreover, for HBase, which we know best, the
   storage is not as scalable as it may seem, since terms are stored as
   rows and the postings lists (= the documents containing the term) as
   columns. Usually the number of terms in a corpus is relatively
   limited, while the number of documents can be huge, but columns in
   HBase do not scale in the same way as rows. We think the scaling
   (sharding, replication) needs to happen on the level of the Lucene
   instances themselves, rather than just the storage below them.
   Still, it is interesting to watch how these projects will evolve.
   - Building our own. Another option was to just take Lucene itself and
   build our own scalable search solution using it. In this case we
   would have gone for a Katta/ElasticSearch-like approach to sharding
   and replication, with a focus on the search features we are most
   interested in (such as faceting). However, we decided that this would
   take too much of our time.
   - Our choice, SOLR <http://lucene.apache.org/solr/>. SOLR is a
   standalone Lucene-based search server made publicly available in
   2006, making it the oldest of the solutions listed here. It makes a
   lot of Lucene functionality easily available, adds a schema with
   field types, faceting, different kinds of caches and cache warming, a
   user-friendly safe query syntax, and more. SOLR supports replicas and
   distributed searching over multiple SOLR instances, though you are
   responsible for setting it all up yourself; it is very much a static
   solution. Work on cloudifying SOLR
   <http://svn.apache.org/repos/asf/lucene/solr/branches/cloud> is
   ongoing. SOLR has lots of users, there is a book, there are companies
   supporting it, it has a large team <http://www.ohloh.net/p/solr>, and
   the Lucene and SOLR projects recently merged.
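
To make the ElasticSearch point about a fixed shard count concrete, here is a minimal Java sketch. It assumes documents are routed with a hash of the document id modulo the number of shards and that shards are spread over nodes round-robin; both are illustrative assumptions, not taken from the ES source, and all names are hypothetical.

```java
import java.util.Map;
import java.util.TreeMap;

class FixedShardRoutingSketch {
    // Route a document id to a shard (assumed scheme: hash modulo shard count).
    // Math.floorMod keeps the result non-negative when hashCode() is negative.
    static int shardFor(String docId, int numShards) {
        return Math.floorMod(docId.hashCode(), numShards);
    }

    // Spread shards over nodes round-robin. Adding nodes moves whole shards
    // around but never changes which shard a document belongs to.
    static Map<Integer, Integer> assignShardsToNodes(int numShards, int numNodes) {
        Map<Integer, Integer> shardToNode = new TreeMap<>();
        for (int shard = 0; shard < numShards; shard++) {
            shardToNode.put(shard, shard % numNodes);
        }
        return shardToNode;
    }

    public static void main(String[] args) {
        int numShards = 10; // fixed when the index is created
        String docId = "doc-42";
        int shard = shardFor(docId, numShards);
        System.out.println(docId + " lives in shard " + shard);
        System.out.println("with 2 nodes: node "
                + assignShardsToNodes(numShards, 2).get(shard));
        System.out.println("with 5 nodes: node "
                + assignShardsToNodes(numShards, 5).get(shard));
    }
}
```

Under this scheme, growing the cluster only reassigns shards to nodes, while changing `numShards` would change the shard of most documents, which is why more shards means creating a new index and re-indexing.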

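The remark about storing the inverted index in HBase (terms as rows, postings as columns) can be illustrated with a small in-memory model. This is only a sketch: nested sorted maps stand in for an HBase table, the names are hypothetical, and the actual schemas of Lucandra, Lucehbase, and Hbasene may differ.

```java
import java.util.SortedMap;
import java.util.TreeMap;

class TermRowLayoutSketch {
    // row key = term, column qualifier = document id, cell value = term frequency
    private final SortedMap<String, SortedMap<String, Integer>> table = new TreeMap<>();

    // Add one posting per (term, document) pair, counting repeated terms.
    void indexDocument(String docId, String text) {
        for (String term : text.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            table.computeIfAbsent(term, t -> new TreeMap<>())
                 .merge(docId, 1, Integer::sum);
        }
    }

    // Number of rows: grows with the vocabulary of the corpus.
    int rowCount() {
        return table.size();
    }

    // Number of columns in the widest row: grows with the number of documents.
    int widestRow() {
        int max = 0;
        for (SortedMap<String, Integer> row : table.values()) {
            max = Math.max(max, row.size());
        }
        return max;
    }

    public static void main(String[] args) {
        TermRowLayoutSketch index = new TermRowLayoutSketch();
        for (int i = 0; i < 1000; i++) {
            index.indexDocument("doc-" + i, "search hbase lucene");
        }
        System.out.println("rows (terms): " + index.rowCount());
        System.out.println("columns in widest row (documents): " + index.widestRow());
    }
}
```

With a small vocabulary and many documents, the table ends up with few rows and very wide rows, which is the shape the text argues does not scale well in HBase.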

We don't expect Lily to be deployed on EC2 infrastructure, but more in
private cloud/datacenter settings with customers. However, had ES
appeared sooner on our radar, there's a good chance we would have looked at
it more in-depth. SOLR cloud uses ZooKeeper, which we'll need between Lily
clients and servers anyway, so that was a nice fit with our architecture.

I haven't compared the search language/features between SOLR and ES
myself, though; it could be that you need specific stuff from ES (like the
EC2 focus).

Hope this helps,

Steven.
-- 
Steven Noels                            http://outerthought.org/
Outerthought                            Open Source Java & XML
stevenn at outerthought.org             Makers of the Daisy CMS
