Forgot to attach the server and Solr configuration: SolrCloud 4.1, internal Zookeeper, 16 shards, custom Java importer. Server: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz, 32 cores, 192 GB RAM, 10 TB SSD and 50 TB SAS storage.
On Thu, Jul 25, 2013 at 3:20 PM, Radu Ghita <r...@wmds.ro> wrote:
>
> Hi,
>
> We have a client whose business model requires indexing a billion rows
> from MySQL into Solr each month within a small time-frame. The documents
> are very light, but the number is very high and we need to achieve speeds
> of around 80-100k docs/s. The built-in Solr indexer tops out at 40-50k,
> and after some hours (~12h) it crashes, with the speed slowing down as
> the hours go by.
>
> Therefore we have developed a custom Java importer that connects directly
> to MySQL and SolrCloud via Zookeeper, grabs data from MySQL, creates
> documents and then imports them into Solr. This helps because we open ~50
> threads and the indexing process speeds up. We have optimized the MySQL
> queries (MySQL was the initial bottleneck) and the speeds we get now are
> over 100k/s, but as the index grows, Solr spends a very long time on
> adding documents. I assume something in solrconfig is making Solr stall
> and eventually block after 100 million documents are indexed.
>
> Here is the Java code that creates documents and then adds them to the
> Solr server:
>
> public void createDocuments() throws SQLException, SolrServerException, IOException
> {
>     App.logger.write("Creating documents..");
>     this.docs = new ArrayList<SolrInputDocument>();
>     App.logger.incrementNumberOfRows(this.size);
>     while (this.results.next())
>     {
>         this.docs.add(this.getDocumentFromResultSet(this.results));
>     }
>     this.results.close();
>     this.statement.close();
> }
>
> public void commitDocuments() throws SolrServerException, IOException
> {
>     App.logger.write("Committing..");
>     App.solrServer.add(this.docs); // here it stays very long and then blocks
>     App.logger.incrementNumberOfRows(this.docs.size());
>     this.docs.clear();
> }
>
> I am also pasting the solrconfig.xml parameters relevant to this discussion:
>
> <maxIndexingThreads>128</maxIndexingThreads>
> <useCompoundFile>false</useCompoundFile>
> <ramBufferSizeMB>10000</ramBufferSizeMB>
> <maxBufferedDocs>1000000</maxBufferedDocs>
> <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>   <int name="maxMergeAtOnce">20000</int>
>   <int name="segmentsPerTier">1000000</int>
>   <int name="maxMergeAtOnceExplicit">10000</int>
> </mergePolicy>
> <mergeFactor>100</mergeFactor>
> <termIndexInterval>1024</termIndexInterval>
> <autoCommit>
>   <maxTime>15000</maxTime>
>   <maxDocs>1000000</maxDocs>
>   <openSearcher>false</openSearcher>
> </autoCommit>
> <autoSoftCommit>
>   <maxTime>2000000</maxTime>
> </autoSoftCommit>
>
> The big problem is on the Solr side: I have run the MySQL queries on their
> own and the speed is great, but as time passes the Solr add call takes far
> too long and then blocks, even though the server is top level and has lots
> of resources.
>
> I'm new to this, so please assist. Thanks,
>
> --
> Radu Ghita
> --------------------------------
> Tel: +40 721 18 18 68
> Fax: +40 351 81 85 52

--
Radu Ghita
--------------------------------
Tel: +40 721 18 18 68
Fax: +40 351 81 85 52
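
For reference, below is a minimal sketch (not the original importer) of how the adds described above could be sent in fixed-size batches through SolrJ's CloudSolrServer, so a single add() call never carries an unbounded list of documents buffered from the ResultSet. The Zookeeper address, collection name and batch size are assumptions for illustration only; commits are left to autoCommit as configured in solrconfig.xml.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchedImporter {

    // Assumed values for illustration; replace with the real Zookeeper
    // ensemble, collection name and batch size.
    private static final String ZK_HOST = "localhost:2181";
    private static final String COLLECTION = "collection1";
    private static final int BATCH_SIZE = 10000;

    private final CloudSolrServer server;
    private final List<SolrInputDocument> buffer =
            new ArrayList<SolrInputDocument>(BATCH_SIZE);

    public BatchedImporter() throws Exception {
        server = new CloudSolrServer(ZK_HOST);
        server.setDefaultCollection(COLLECTION);
    }

    // Buffer one document; flush to Solr whenever the buffer is full, so
    // each add() sends a bounded, predictable amount of work to the cluster.
    public void addDocument(SolrInputDocument doc)
            throws SolrServerException, IOException {
        buffer.add(doc);
        if (buffer.size() >= BATCH_SIZE) {
            flush();
        }
    }

    // Send the current batch; no explicit commit here, durability is left
    // to the autoCommit settings on the server side.
    public void flush() throws SolrServerException, IOException {
        if (!buffer.isEmpty()) {
            server.add(buffer);
            buffer.clear();
        }
    }

    public void close() {
        server.shutdown();
    }
}

Compared with buffering every row of a ResultSet before a single add(), this keeps client memory bounded per thread and gives Solr smaller units of work, which tends to make stalls easier to localize; it is only a sketch under the assumptions stated above, not a drop-in replacement for the importer quoted above.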