[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828961#action_12828961 ]
Ted Dunning commented on SOLR-1301:
-----------------------------------

{quote}
Based on these observations, I have a few questions. (I am a beginner to the Hadoop & Solr world, so please forgive me if my questions are silly.)
1. As per the above observations, the SOLR-1045 patch is functionally better (performance I have not verified yet). Can anyone tell me what actual advantage the SOLR-1301 patch offers over the SOLR-1045 patch?
2. If both JIRA issues are trying to solve the same problem, do we really need two separate issues?
{quote}

In the katta community, the recommended practice started with SOLR-1045 behavior (what I call map-side indexing), but I think the consensus now is that SOLR-1301 behavior (what I call reduce-side indexing) is much, much better. This is not necessarily the obvious result given your observations, and there are some operational differences between katta and Solr that might make the conclusions different, but what I have observed is the following:

a) Index merging is a really bad idea that seems very attractive to begin with: it is actually pretty expensive, and it doesn't solve the real problem of bad document distribution across shards. It is much better to simply have lots of shards per machine (aka micro-sharding) and use one reducer per shard (see the partitioner sketch after this list). For large indexes this gives entirely acceptable performance; on a pretty small cluster we can index 50-100 million large documents, in multiple ways, in 2-3 hours. Index merging gives you no benefit over reduce-side indexing and just increases code complexity.

b) Map-side indexing leaves you with indexes that are heavily skewed, because each index is composed of documents from a single input split. At retrieval time this means that different shards have very different term frequency profiles and very different numbers of relevant documents, which makes lots of statistics very difficult, including term frequency computation, term weighting, and determining the number of documents to retrieve. Map-side indexing virtually guarantees that you have to do two cluster queries: one to gather term frequency statistics and another to do the actual query. With reduce-side indexing you can put strong probabilistic bounds on how much the statistics in each shard can differ (see the note after the sketch below), so you can use local term statistics, and you can depend on the score distribution being the same across shards, which radically decreases the number of documents you need to retrieve from each shard.

c) Reduce-side indexing improves the balance of computation during retrieval. If (as is the rule) some document subset is hotter than another, say due to data-source boosting or recency boosting, the skewed shards from map-side indexing give you very bad cluster utilization, whereas with reduce-side indexing all shards cost about the same for any query, leading to good cluster utilization and faster queries.

d) Reduce-side indexing has properties that can be mathematically stated and proved. Map-side indexing only has comparable properties if you make unrealistic assumptions about your original data.

e) Micro-sharding allows very simple and very effective use of multiple cores on multiple machines in a search cluster. This can be very difficult to do with large shards or a single index.
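To make point (a) concrete, here is a minimal sketch of reduce-side sharding under the old (0.19) mapred API, which is what this patch builds against: documents are routed to shards by hashing their unique id, so each reducer builds exactly one shard. This is essentially what Hadoop's default HashPartitioner already does; the explicit version below is for illustration only, and the Text key/value types are an assumption, not anything taken from the patch.

{code:java}
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

/**
 * Illustrative only: assign each document to one of numShards shards by
 * hashing its unique id. Because the assignment is (pseudo-)random and
 * independent of the input split, every shard receives a uniform sample
 * of the whole corpus. One reducer per shard then builds that shard.
 */
public class ShardPartitioner implements Partitioner<Text, Text> {
  public void configure(JobConf job) {}

  public int getPartition(Text docId, Text doc, int numShards) {
    // Mask off the sign bit so the modulus is always non-negative.
    return (docId.hashCode() & Integer.MAX_VALUE) % numShards;
  }
}
{code}

Wiring it in is just conf.setPartitionerClass(ShardPartitioner.class) and conf.setNumReduceTasks(totalShards). This uniform assignment is also what underlies the bounds claimed in (b): if documents land on shards independently and uniformly at random, the number of documents in one shard containing a term t is Binomial(df(t), 1/numShards), so its relative deviation from the mean falls off roughly like sqrt(numShards/df(t)), which is negligible for any term common enough to matter to scoring.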
Now, as you say, these advantages may evaporate if you are looking to produce a single output index. That seems, however, to contradict the whole point of scaling: if you need to scale indexing, presumably you also need to scale search speed and throughput, so you probably want many shards rather than few. Conversely, if you can stand to search a single index, then you probably can stand to index on a single machine.

Another thing to think about is the fact that Solr doesn't yet do micro-sharding or clustering very well and, in particular, doesn't handle multiple shards per core. That will be changing before long, however, and it is very dangerous to design for the past rather than the future.

In case you didn't notice: I strongly suggest you stick with reduce-side indexing.

> Solr + Hadoop
> -------------
>
>                 Key: SOLR-1301
>                 URL: https://issues.apache.org/jira/browse/SOLR-1301
>             Project: Solr
>          Issue Type: Improvement
>    Affects Versions: 1.4
>            Reporter: Andrzej Bialecki
>             Fix For: 1.5
>
>         Attachments: commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java
>
> This patch contains a contrib module that provides distributed indexing (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is twofold:
> * provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat
> * avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat consumes data produced by reduce tasks directly, without storing it in intermediate files. Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network.
>
> Design
> ----------
> Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer, and it also instantiates an implementation of SolrDocumentConverter, which is responsible for turning a Hadoop (key, value) pair into a SolrInputDocument. This data is then added to a batch, which is periodically submitted to the EmbeddedSolrServer. When the reduce task completes and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer.
> The API provides facilities to specify an arbitrary existing solr.home directory, from which the conf/ and lib/ files will be taken.
> This process results in the creation of as many partial Solr home directories as there were reduce tasks. The output shards are placed in the output directory on the default filesystem (e.g. HDFS). Such part-NNNNN directories can be used to run N shard servers. Additionally, users can specify the number of reduce tasks, in particular 1 reduce task, in which case the output will consist of a single shard.
> An example application is provided that processes large CSV files and uses this API. It uses custom CSV processing to avoid (de)serialization overhead.
> This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this issue; you should put it in contrib/hadoop/lib.
> Note: the development of this patch was sponsored by an anonymous contributor and approved for release under Apache License.
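To make the design described above a little more tangible, here is a rough sketch of what an implementation of SolrDocumentConverter for the CSV example might look like. The convert() name, the generic signature, and the field names ("id", "text") are assumptions for illustration only; the actual contract is defined in SOLR-1301.patch / SolrRecordWriter.java.

{code:java}
import java.util.Collection;
import java.util.Collections;

import org.apache.hadoop.io.Text;
import org.apache.solr.common.SolrInputDocument;

/**
 * Hypothetical converter: turns one CSV line into a SolrInputDocument.
 * Signature and field names are illustrative, not taken from the patch.
 */
public class CsvDocumentConverter extends SolrDocumentConverter<Text, Text> {
  public Collection<SolrInputDocument> convert(Text key, Text line) {
    String[] cols = line.toString().split(",");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", key.toString());              // unique key from the mapper
    doc.addField("text", cols.length > 0 ? cols[0] : ""); // guard: "," splits to an empty array
    return Collections.singletonList(doc);
  }
}
{code}

The job would then register SolrOutputFormat as its output format and point it at an existing solr.home, as the description says; SolrRecordWriter handles batching, commit(), and optimize() on the embedded server.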