I don't seem to be seeing a significant slowdown over time when I use the old defaults for merge threads and max merges.
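For reference, this is roughly what I mean by the old defaults - something
along these lines under <indexConfig> in solrconfig.xml. The 3/5 numbers just
mirror the pre-4.1 behavior described below, so treat this as a sketch to
experiment with rather than a drop-in config:

  <indexConfig>
    <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler">
      <!-- more concurrent merges and a bigger backlog before add() stalls -->
      <int name="maxThreadCount">3</int>
      <int name="maxMergeCount">5</int>
    </mergeScheduler>
  </indexConfig>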
- Mark

On Jul 25, 2013, at 10:17 AM, Mark Miller <markrmil...@gmail.com> wrote:

> I'm looking into some possible slow-down-after-long-indexing issues when I
> get back from vacation. This could be related. Very early guess though.
>
> Another thing you might try - Lucene recently changed the merge scheduler
> defaults (in 4.1) - it used to use up to 3 threads to merge and have a max
> merge setting of that + 2. It now defaults to 1 and 2, and that can really
> impact how fast documents are added, by a significant amount. It also
> causes indexing threads to pause and wait for merges *way* more, especially
> when your index gets large and the merges start taking a long time. The
> tradeoff was supposedly that merges are faster, but honestly, I think it's
> a poor default, especially if you are measuring indexing speed and not
> really paying attention to how long merges go on after you finish indexing,
> and especially if you have beefy hardware. You might play with those
> settings.
>
> - Mark
>
> On Jul 25, 2013, at 8:36 AM, Radu Ghita <r...@wmds.ro> wrote:
>
>> Forgot to attach the server and Solr configurations:
>>
>> SolrCloud 4.1, internal ZooKeeper, 16 shards, custom Java importer.
>> Server: Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz, 32 cores, 192GB RAM,
>> 10TB SSD and 50TB SAS storage.
>>
>>
>> On Thu, Jul 25, 2013 at 3:20 PM, Radu Ghita <r...@wmds.ro> wrote:
>>
>>> Hi,
>>>
>>> We have a client whose business model requires indexing a billion rows
>>> a month into Solr from MySQL, in a small time frame. The documents are
>>> very light, but the number is very high and we need to achieve speeds
>>> of around 80-100k docs/s. The built-in Solr indexer tops out at 40-50k,
>>> but after some hours (~12h) it crashes, and the speed drops as the
>>> hours go by.
>>>
>>> Therefore we have developed a custom Java importer that connects
>>> directly to MySQL and to SolrCloud via ZooKeeper, grabs data from
>>> MySQL, creates documents and then imports them into Solr. This helps
>>> because we open ~50 threads and the indexing process speeds up. We have
>>> optimized the MySQL queries (MySQL was the initial bottleneck) and the
>>> speeds we get now are over 100k/s, but as the index grows, Solr takes a
>>> very long time to add documents. I assume something in solrconfig is
>>> making Solr stall and even block after 100 million documents have been
>>> indexed.
>>> Here is the Java code that creates the documents and then adds them to
>>> the Solr server:
>>>
>>> public void createDocuments() throws SQLException, SolrServerException,
>>>         IOException
>>> {
>>>     App.logger.write("Creating documents..");
>>>     this.docs = new ArrayList<SolrInputDocument>();
>>>     App.logger.incrementNumberOfRows(this.size);
>>>     while (this.results.next())
>>>     {
>>>         this.docs.add(this.getDocumentFromResultSet(this.results));
>>>     }
>>>     this.results.close();
>>>     this.statement.close();
>>> }
>>>
>>> public void commitDocuments() throws SolrServerException, IOException
>>> {
>>>     App.logger.write("Committing..");
>>>     App.solrServer.add(this.docs); // here it stays very long and then blocks
>>>     App.logger.incrementNumberOfRows(this.docs.size());
>>>     this.docs.clear();
>>> }
>>>
>>> I am also pasting the solrconfig.xml parameters relevant to this
>>> discussion:
>>>
>>> <maxIndexingThreads>128</maxIndexingThreads>
>>> <useCompoundFile>false</useCompoundFile>
>>> <ramBufferSizeMB>10000</ramBufferSizeMB>
>>> <maxBufferedDocs>1000000</maxBufferedDocs>
>>> <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
>>>   <int name="maxMergeAtOnce">20000</int>
>>>   <int name="segmentsPerTier">1000000</int>
>>>   <int name="maxMergeAtOnceExplicit">10000</int>
>>> </mergePolicy>
>>> <mergeFactor>100</mergeFactor>
>>> <termIndexInterval>1024</termIndexInterval>
>>> <autoCommit>
>>>   <maxTime>15000</maxTime>
>>>   <maxDocs>1000000</maxDocs>
>>>   <openSearcher>false</openSearcher>
>>> </autoCommit>
>>> <autoSoftCommit>
>>>   <maxTime>2000000</maxTime>
>>> </autoSoftCommit>
>>>
>>> The big problem is in Solr: I've run the MySQL queries on their own and
>>> the speed is great, but as time passes the Solr add call takes far too
>>> long and then blocks, even though the server is top of the line and has
>>> plenty of resources.
>>>
>>> I'm new to this, so please assist. Thanks,
>>>
>>> --
>>> Radu Ghita
>>> --------------------------------
>>> Tel: +40 721 18 18 68
>>> Fax: +40 351 81 85 52
>>
>>
>> --
>> Radu Ghita
>> --------------------------------
>> Tel: +40 721 18 18 68
>> Fax: +40 351 81 85 52
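For anyone following along, here is a minimal, self-contained sketch of the
kind of importer loop described in this thread: a SolrJ 4.x client that
connects to SolrCloud through ZooKeeper and sends batched adds, relying on
autoCommit instead of explicit commits. The ZooKeeper address, collection
name, batch size and id values are made-up placeholders, not taken from the
actual importer code.

  import java.util.ArrayList;
  import java.util.List;

  import org.apache.solr.client.solrj.impl.CloudSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class BulkImportSketch {

      public static void main(String[] args) throws Exception {
          // Connect through ZooKeeper so SolrJ routes documents to the right shards.
          CloudSolrServer solr = new CloudSolrServer("zk1:2181");
          solr.setDefaultCollection("collection1");

          int batchSize = 10000;
          List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(batchSize);

          // Stand-in for the MySQL result-set loop in the real importer.
          for (int i = 0; i < 1000000; i++) {
              SolrInputDocument doc = new SolrInputDocument();
              doc.addField("id", "row-" + i);
              batch.add(doc);

              if (batch.size() >= batchSize) {
                  // This add() is where indexing threads can stall once the merge
                  // scheduler's backlog (maxMergeCount) is full; no explicit commit
                  // here - autoCommit in solrconfig.xml takes care of it.
                  solr.add(batch);
                  batch.clear();
              }
          }
          if (!batch.isEmpty()) {
              solr.add(batch);
          }
          solr.shutdown();
      }
  }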