Hari Sekhon created SOLR-7393:
---------------------------------

             Summary: HDFS bulk indexing performance
                 Key: SOLR-7393
                 URL: https://issues.apache.org/jira/browse/SOLR-7393
             Project: Solr
          Issue Type: Bug
          Components: Hadoop Integration, hdfs, SolrCloud
    Affects Versions: 4.10.3, 4.7.2
         Environment: HDP 2.2 / HDP Search + LucidWorks Hive SerDe
            Reporter: Hari Sekhon
            Priority: Critical

When switching SolrCloud from a local dataDir to the HDFS directory factory, indexing performance falls through the floor. A Hive-to-SolrCloud online indexing job that previously took 2 hours for 620M rows was projected to take 20+ hours and never completed, usually breaking around the 16-17 hour mark when left overnight.

It's worth noting that I had to disable the HDFS write cache, which was causing index corruption (SOLR-7255), on the advice of Mark Miller, who tells me this doesn't make much performance difference anyway.

This is probably also related to SolrCloud not respecting the HDFS replication factor, effectively making 4 copies of the data instead of 2 (SOLR-6528), but that alone doesn't account for the massive performance drop going from vanilla SolrCloud to SolrCloud on HDFS HA + Kerberos.

Hari Sekhon
http://www.linkedin.com/in/harisekhon
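
As a rough check on the figures in the report, the throughput drop can be worked out directly. This is only a back-of-the-envelope sketch: it treats the projected 20+ hours as a lower bound of exactly 20, and it assumes (this is not stated in the report) that write cost scales linearly with the number of copies HDFS ends up storing.

```python
# Figures from the report: 620M rows, 2 hours on a local dataDir,
# a projected 20+ hours on HDFS (the job never actually completed).
ROWS = 620_000_000
LOCAL_HOURS = 2
HDFS_HOURS = 20  # lower bound taken from "20+ hours"


def rows_per_second(rows: int, hours: float) -> float:
    """Average indexing throughput over the whole job."""
    return rows / (hours * 3600)


local_rate = rows_per_second(ROWS, LOCAL_HOURS)
hdfs_rate = rows_per_second(ROWS, HDFS_HOURS)
slowdown = local_rate / hdfs_rate

# Assumption, not from the report: cost grows linearly with copies written.
COPIES_EXPECTED = 2
COPIES_OBSERVED = 4
replication_overhead = COPIES_OBSERVED / COPIES_EXPECTED

# Slowdown left over once the extra copies are accounted for.
unexplained = slowdown / replication_overhead

print(f"local dataDir: {local_rate:,.0f} rows/s")
print(f"HDFS:          {hdfs_rate:,.0f} rows/s (at best)")
print(f"slowdown:      {slowdown:.1f}x, of which {unexplained:.1f}x "
      f"is not explained by the extra copies")
```

Under those assumptions the job drops from roughly 86,000 rows/s to at most about 8,600 rows/s, a slowdown of at least 10x, while the doubled copy count would explain only a factor of 2.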