Hi, I am trying to index 12MM docs faster than Solr currently manages (using SolrJ). We have identified Solr's add method as the bottleneck, not commit, which is tuned reasonably well through mergeFactor, ramBufferSizeMB, and JVM RAM.
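For context, our indexing loop is roughly the following. This is a minimal sketch rather than our production code: the server URL, the id values, and the loadContent helper are placeholders, and only the content field is shown.

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class BatchIndexer {
        private static final int BATCH_SIZE = 1000;

        public static void main(String[] args) throws Exception {
            // EmbeddedSolrServer was tried as well; the difference was marginal
            CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://localhost:8983/solr");

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(BATCH_SIZE);
            for (int i = 0; i < 12000000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", i);
                doc.addField("content", loadContent(i)); // ~2K tokens on average
                // shingledContent is filled in server-side via copyField
                batch.add(doc);

                if (batch.size() == BATCH_SIZE) {
                    server.add(batch); // <-- this call is where the time goes
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                server.add(batch);
            }
            server.commit(); // we also commit periodically; one commit shown for brevity
        }

        // placeholder for wherever the document bodies actually come from
        private static String loadContent(int i) {
            return "document body " + i;
        }
    }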
Adding 1,000 docs takes approximately 25 seconds. We make sure to add and commit in batches, and we have tried both CommonsHttpSolrServer and EmbeddedSolrServer (assuming that removing the HTTP overhead by embedding would speed things up), but the difference is marginal.

The docs being indexed have about 20 fields each on average, mostly indexed but none stored. The major size contributors are two fields: content, and shingledContent (populated using a copyField of content). The length of the content field is likely Gaussian-distributed: a few large docs of 50-80K tokens, but the majority around 2K tokens. We use shingledContent to support phrase queries and content for unigram queries, following the advice of Solr Enterprise Search Server (p. 305, section "The Solution: Shingling"); a trimmed sketch of the relevant schema entries is at the end of this post.

Clearly the size of the docs contributes to the slow adds, confirmed by the fact that removing these two fields halves the indexing time. We have also tried compressed=true, but that is not working.

Any guidance on how to support our application logic (without having to change the schema too much) and speed up indexing (from the current 212 days for 12MM docs) would be much appreciated.

thank you
Peyman
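Schema excerpt, for reference. This is a reconstruction from the description above rather than a verbatim copy: the field type name, the tokenizer, and the ShingleFilterFactory settings (maxShingleSize, outputUnigrams) are illustrative.

    <field name="content" type="text" indexed="true" stored="false"/>
    <field name="shingledContent" type="shingledText" indexed="true" stored="false"/>
    <copyField source="content" dest="shingledContent"/>

    <fieldType name="shingledText" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- bigram shingles for phrase queries; unigrams stay in the content field -->
        <filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="false"/>
      </analyzer>
    </fieldType>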