[ https://issues.apache.org/jira/browse/SOLR-1301?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12828961#action_12828961 ]
Ted Dunning commented on SOLR-1301:
-----------------------------------

{quote}
Based on these observations, I have a few questions. (I am a beginner to the Hadoop & Solr world, so please forgive me if my questions are silly.)
1. As per the above observations, the SOLR-1045 patch is functionally better (performance I have not verified yet). Can anyone tell me what actual advantage the SOLR-1301 patch offers over the SOLR-1045 patch?
2. If both JIRA issues are trying to solve the same problem, do we really need two separate issues?
{quote}

In the katta community, the recommended practice started with SOLR-1045 behavior (what I call map-side indexing), but I think the consensus now is that SOLR-1301 behavior (what I call reduce-side indexing) is much, much better. This is not necessarily the obvious result given your observations, and there are some operational differences between katta and Solr that might make the conclusions different, but what I have observed is the following:

a) Index merging is a really bad idea that seems very attractive to begin with: it is actually pretty expensive, and it doesn't solve the real problem of bad document distribution across shards. It is much better to simply have lots of shards per machine (aka micro-sharding) and use one reducer per shard (see the partitioner sketch after this list). For large indexes this gives entirely acceptable performance; on a pretty small cluster we can index 50-100 million large documents, in multiple ways, in 2-3 hours. Index merging gives you no benefit over reduce-side indexing and just increases code complexity.

b) Map-side indexing leaves you with indexes that are heavily skewed, because each index is composed of documents from a single input split. At retrieval time this means that different shards have very different term frequency profiles and very different numbers of relevant documents, which makes lots of statistics very difficult, including term frequency computation, term weighting, and determining the number of documents to retrieve. Map-side indexing virtually guarantees that you have to do two cluster queries: one to gather term frequency statistics and another to do the actual query. With reduce-side indexing you can put strong probabilistic bounds on how much the statistics in each shard can differ (see the note after the sketch below), so you can use local term statistics, and you can depend on the score distribution being the same across shards, which radically decreases the number of documents you need to retrieve from each shard.

c) Reduce-side indexing improves the balance of computation during retrieval. If (as is the rule) some document subset is hotter than another, say due to data-source boosting or recency boosting, the skewed shards from map-side indexing give you very bad cluster utilization, whereas with reduce-side indexing all shards cost about the same for any query, leading to good cluster utilization and faster queries.

d) Reduce-side indexing has properties that can be mathematically stated and proved. Map-side indexing only has comparable properties if you make unrealistic assumptions about your original data.

e) Micro-sharding allows very simple and very effective use of multiple cores on multiple machines in a search cluster. This can be very difficult to do with large shards or a single index.
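To make point (a) concrete, here is a minimal sketch of reduce-side sharding under the old (0.19) mapred API, which is what this patch builds against: documents are routed to shards by hashing their unique id, so each reducer builds exactly one shard. This is essentially what Hadoop's default HashPartitioner already does; the explicit version below is for illustration only, and the Text key/value types are an assumption, not anything taken from the patch.

{code:java}
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

/**
 * Illustrative only: assign each document to one of numShards shards by
 * hashing its unique id. Because the assignment is (pseudo-)random and
 * independent of the input split, every shard receives a uniform sample
 * of the whole corpus. One reducer per shard then builds that shard.
 */
public class ShardPartitioner implements Partitioner<Text, Text> {
  public void configure(JobConf job) {}

  public int getPartition(Text docId, Text doc, int numShards) {
    // Mask off the sign bit so the modulus is always non-negative.
    return (docId.hashCode() & Integer.MAX_VALUE) % numShards;
  }
}
{code}

Wiring it in is just conf.setPartitionerClass(ShardPartitioner.class) and conf.setNumReduceTasks(totalShards). This uniform assignment is also what underlies the bounds claimed in (b): if documents land on shards independently and uniformly at random, the number of documents in one shard containing a term t is Binomial(df(t), 1/numShards), so its relative deviation from the mean falls off roughly like sqrt(numShards/df(t)), which is negligible for any term common enough to matter to scoring.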
Now, as you say, these advantages may evaporate if you are looking to produce a single output index. That seems, however, to contradict the whole point of scaling: if you need to scale indexing, presumably you also need to scale search speed and throughput, so you probably want many shards rather than few. Conversely, if you can stand to search a single index, then you probably can stand to index on a single machine.

Another thing to think about is the fact that Solr doesn't yet do micro-sharding or clustering very well and, in particular, doesn't handle multiple shards per core. That will be changing before long, however, and it is very dangerous to design for the past rather than the future.

In case you didn't notice: I strongly suggest you stick with reduce-side indexing.

> Solr + Hadoop
> -------------
>
>                 Key: SOLR-1301
>                 URL: https://issues.apache.org/jira/browse/SOLR-1301
>             Project: Solr
>          Issue Type: Improvement
>    Affects Versions: 1.4
>            Reporter: Andrzej Bialecki
>             Fix For: 1.5
>
>         Attachments: commons-logging-1.0.4.jar, commons-logging-api-1.0.4.jar, hadoop-0.19.1-core.jar, hadoop.patch, log4j-1.2.15.jar, README.txt, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SOLR-1301.patch, SolrRecordWriter.java
>
> This patch contains a contrib module that provides distributed indexing (using Hadoop) to Solr EmbeddedSolrServer. The idea behind this module is twofold:
> * provide an API that is familiar to Hadoop developers, i.e. that of OutputFormat
> * avoid unnecessary export and (de)serialization of data maintained on HDFS. SolrOutputFormat consumes data produced by reduce tasks directly, without storing it in intermediate files. Furthermore, by using an EmbeddedSolrServer, the indexing task is split into as many parts as there are reducers, and the data to be indexed is not sent over the network.
>
> Design
> ----------
> Key/value pairs produced by reduce tasks are passed to SolrOutputFormat, which in turn uses SolrRecordWriter to write this data. SolrRecordWriter instantiates an EmbeddedSolrServer, and it also instantiates an implementation of SolrDocumentConverter, which is responsible for turning a Hadoop (key, value) pair into a SolrInputDocument. This data is then added to a batch, which is periodically submitted to the EmbeddedSolrServer. When the reduce task completes and the OutputFormat is closed, SolrRecordWriter calls commit() and optimize() on the EmbeddedSolrServer.
> The API provides facilities to specify an arbitrary existing solr.home directory, from which the conf/ and lib/ files will be taken.
> This process results in the creation of as many partial Solr home directories as there were reduce tasks. The output shards are placed in the output directory on the default filesystem (e.g. HDFS). Such part-NNNNN directories can be used to run N shard servers. Additionally, users can specify the number of reduce tasks, in particular 1 reduce task, in which case the output will consist of a single shard.
> An example application is provided that processes large CSV files and uses this API. It uses custom CSV processing to avoid (de)serialization overhead.
> This patch relies on hadoop-core-0.19.1.jar - I attached the jar to this issue; you should put it in contrib/hadoop/lib.
> Note: the development of this patch was sponsored by an anonymous contributor and approved for release under Apache License.
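To make the design described above a little more tangible, here is a rough sketch of what an implementation of SolrDocumentConverter for the CSV example might look like. The convert() name, the generic signature, and the field names ("id", "text") are assumptions for illustration only; the actual contract is defined in SOLR-1301.patch / SolrRecordWriter.java.

{code:java}
import java.util.Collection;
import java.util.Collections;

import org.apache.hadoop.io.Text;
import org.apache.solr.common.SolrInputDocument;

/**
 * Hypothetical converter: turns one CSV line into a SolrInputDocument.
 * Signature and field names are illustrative, not taken from the patch.
 */
public class CsvDocumentConverter extends SolrDocumentConverter<Text, Text> {
  public Collection<SolrInputDocument> convert(Text key, Text line) {
    String[] cols = line.toString().split(",");
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", key.toString());              // unique key from the mapper
    doc.addField("text", cols.length > 0 ? cols[0] : ""); // guard: "," splits to an empty array
    return Collections.singletonList(doc);
  }
}
{code}

The job would then register SolrOutputFormat as its output format and point it at an existing solr.home, as the description says; SolrRecordWriter handles batching, commit(), and optimize() on the embedded server.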