[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828915#action_12828915 ]

shyjuThomas commented on SOLR-1301:
-----------------------------------

I need to perform Solr indexing in a MapReduce task, to achieve 
parallelism. I have noticed two Jira issues related to that: SOLR-1045 and 
SOLR-1301. 

I have tried out the patches attached to both issues, and my observations 
are given below:
1. The SOLR-1301 patch performs the input-record to key-value conversion in the 
Map phase; the Hadoop (key, value) to SolrInputDocument conversion and the 
actual indexing happen in the Reduce phase (a rough map-side sketch follows 
this list).
Meanwhile, the SOLR-1045 patch performs the record-to-document conversion and 
the actual indexing in the Map phase; the Reducer can then be used to merge 
multiple indices (if required). Alternatively, the number of reducers can be 
configured to match the number of shards. 
2. The SOLR-1301 patch doesn't support merging of the indices, while the 
SOLR-1045 patch does.
3. With the SOLR-1301 patch, no big activity happens in the Map phase (only 
the input-record to key-value conversion). Most of the heavy work, especially 
the indexing, happens in the Reduce phase. If we need the final output as a 
single index, we can use only one reducer, which makes the Reducer a 
bottleneck and leaves almost the whole operation running without parallelism. 
                       The case is different with the SOLR-1045 patch: it 
achieves better parallelism when the number of map tasks is greater than the 
number of reduce tasks, which is usually the case.
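
To make observation 1 concrete, here is roughly what the map-side role looks 
like under SOLR-1301 as I understand it (old 0.19-era mapred API; the 
tab-separated record layout and the RecordMapper name are only my assumptions 
for illustration, not taken from the patch):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class RecordMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
      public void map(LongWritable offset, Text line,
                      OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        // Only reshape the input record into a (key, value) pair here;
        // the SolrInputDocument conversion and the indexing happen later,
        // in the Reduce phase.
        String s = line.toString();
        int tab = s.indexOf('\t');
        if (tab < 0) return; // skip malformed records
        out.collect(new Text(s.substring(0, tab)),
                    new Text(s.substring(tab + 1)));
      }
    }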

Based on these observations, I have a few questions. (I am a beginner in the 
Hadoop and Solr world, so please forgive me if my questions are silly.)
1. As per the observations above, the SOLR-1045 patch is functionally better 
(I have not verified performance yet). Can anyone tell me what actual 
advantage the SOLR-1301 patch offers over the SOLR-1045 patch?
2. If both Jira issues are trying to solve the same problem, do we really 
need two separate issues?

NOTE: I felt this Jira issue is more active than SOLR-1045; that's why I 
posted my comment here.

> Solr + Hadoop
> -------------
>
>                 Key: SOLR-1301
>                 URL: https://issues.apache.org/jira/browse/SOLR-1301
>             Project: Solr
>          Issue Type: Improvement
>    Affects Versions: 1.4
>            Reporter: Andrzej Bialecki 
>             Fix For: 1.5
>
>         Attachments: commons-logging-1.0.4.jar, 
> commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, 
> log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, 
> SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, 
> SOLR-1301.patch, SolrRecordWriter.java
>
>
> This patch contains a contrib module that provides distributed indexing 
> (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is 
> twofold:
> * provide an API that is familiar to Hadoop developers, i.e. that of 
> OutputFormat
> * avoid unnecessary export and (de)serialization of data maintained on HDFS. 
> SolrOutputFormat consumes data produced by reduce tasks directly, without 
> storing it in intermediate files. Furthermore, by using an 
> EmbeddedSolrServer, the indexing task is split into as many parts as there 
> are reducers, and the data to be indexed is not sent over the network.
> Design
> ----------
> Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, 
> which in turn uses SolrRecordWriter to write this data. SolrRecordWriter 
> instantiates an EmbeddedSolrServer, and it also instantiates an 
> implementation of SolrDocumentConverter, which is responsible for turning 
> Hadoop (key, value) into a SolrInputDocument. This data is then added to a 
> batch, which is periodically submitted to EmbeddedSolrServer. When the reduce 
> task completes and the OutputFormat is closed, SolrRecordWriter calls 
> commit() and optimize() on the EmbeddedSolrServer.
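> For illustration, a converter might look like the following; the exact
> SolrDocumentConverter signature is defined by this patch, so the shape
> below is an assumption, and the "id" and "text" field names are made up:
>
>     import java.util.ArrayList;
>     import java.util.Collection;
>     import org.apache.hadoop.io.Text;
>     import org.apache.solr.common.SolrInputDocument;
>
>     public class TextConverter extends SolrDocumentConverter<Text, Text> {
>       public Collection<SolrInputDocument> convert(Text key, Text value) {
>         // Map the Hadoop key to the unique id field and the value to a
>         // content field; adjust to the target schema as needed.
>         SolrInputDocument doc = new SolrInputDocument();
>         doc.addField("id", key.toString());
>         doc.addField("text", value.toString());
>         Collection<SolrInputDocument> docs =
>             new ArrayList<SolrInputDocument>();
>         docs.add(doc);
>         return docs;
>       }
>     }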
> The API provides facilities to specify an arbitrary existing solr.home 
> directory, from which the conf/ and lib/ files will be taken.
> This process results in the creation of as many partial Solr home directories 
> as there were reduce tasks. The output shards are placed in the output 
> directory on the default filesystem (e.g. HDFS). Such part-NNNNN directories 
> can be used to run N shard servers. Additionally, users can specify the 
> number of reduce tasks, in particular 1 reduce task, in which case the output 
> will consist of a single shard.
> An example application is provided that processes large CSV files and uses 
> this API. It uses custom CSV processing to avoid (de)serialization overhead.
> This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this 
> issue; you should put it in contrib/hadoop/lib.
> Note: the development of this patch was sponsored by an anonymous contributor 
> and approved for release under Apache License.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
