Hi, I have a SolrCloud collection (Solr 4.4, writing to HDFS on CDH-5.3) that is populated by a Flume Morphline sink. The Flume agent reads data from Kafka and writes to the Solr collection.
The issue is that the Solr indexing rate is abysmally poor across the cluster: ~6k docs/sec at best, with dips to a few hundred per second. The incoming document rate is about 30-40k/sec.

What I have tried so far:

Cluster shape: I have gone wide/thin with 18 nodes, each with 8GB (Java heap) + 4GB (non-heap) memory and 18 shards, and narrow/thick with the current set of 5 dedicated nodes, each with 36GB (Java heap) + 16GB (non-heap) memory and 5 shards.

Flume: I have gone from 5 Flume instances, each with a single sink, to 5 sinks per Flume instance. I have also tweaked batchSize and batchDuration.

Load: I checked the ZooKeeper load and don't see it stressed. Neither are the datanodes. On the Solr nodes, Solr is consuming all the allocated memory (32GB), but I don't see it hitting any CPU limits.

*But* the indexing rate stubbornly stays at ~6k docs/sec. When I bounce the Flume agent, the rate jumps momentarily to several hundred thousand docs/sec, then drops back to ~6k/sec, and the Flume channels get saturated within seconds.

Any clues or pointers for troubleshooting would be appreciated.

Thanks,
Tim
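
P.S. To put the rate mismatch in numbers, here is a rough back-of-the-envelope sketch of how quickly a bounded Flume channel fills when events arrive at 30-40k/sec and drain at ~6k/sec. The channel capacity of 100,000 events is a hypothetical figure for illustration only, not my actual setting, and the function name is made up.

```python
def seconds_to_fill(capacity_events, in_rate, out_rate):
    """Seconds until a bounded channel fills when ingest outpaces drain.

    Returns infinity if the sink keeps up with the source.
    """
    net_rate = in_rate - out_rate  # events/sec accumulating in the channel
    if net_rate <= 0:
        return float("inf")
    return capacity_events / net_rate


CHANNEL_CAPACITY = 100_000  # hypothetical capacity, for illustration only
INDEX_RATE = 6_000          # observed steady-state Solr indexing rate (docs/sec)

for in_rate in (30_000, 40_000):  # observed incoming rate range (docs/sec)
    t = seconds_to_fill(CHANNEL_CAPACITY, in_rate, INDEX_RATE)
    print(f"in={in_rate}/s out={INDEX_RATE}/s -> channel full in {t:.1f}s")
```

With those assumed numbers the channel fills in a few seconds, which matches what I see. So the channels are only reflecting the slow sink side; the bottleneck is somewhere downstream of them.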