Here is a good article from IBM, with code, on hybrid/cloud computing:
http://www.ibm.com/developerworks/library/x-cloudpt1/

Dennis Gearon

Signature Warning
----------------
EARTH has a Right To Life, otherwise we all die.
Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php

--- On Thu, 5/20/10, Nagelberg, Kallin <knagelb...@globeandmail.com> wrote:

> From: Nagelberg, Kallin <knagelb...@globeandmail.com>
> Subject: RE: Machine utilization while indexing
> To: "'solr-user@lucene.apache.org'" <solr-user@lucene.apache.org>
> Date: Thursday, May 20, 2010, 8:16 AM
>
> How about throwing a BlockingQueue,
> http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/BlockingQueue.html,
> between your document creator and the SolrServer? Give it a size of
> 10,000 or something, with one thread feeding it and one thread waiting
> for it to get near full, then draining it. Take the drained results and
> add them to the server (and maybe try not using
> StreamingUpdateSolrServer). Something like that worked well for me:
> about 5,000,000 documents of ~5 KB each took about 8 hours. [A sketch
> of this approach follows the quoted thread.]
>
> -Kallin Nagelberg
>
> -----Original Message-----
> From: Thijs [mailto:vonk.th...@gmail.com]
> Sent: Thursday, May 20, 2010 11:02 AM
> To: solr-user@lucene.apache.org
> Subject: Machine utilization while indexing
>
> Hi,
>
> I have a question about how I can get Solr to index more quickly than
> it does at the moment.
>
> I have to index (and re-index) some 3-5 million documents. These
> documents are preprocessed by a Java application that effectively joins
> multiple database tables to form the SolrInputDocument.
>
> What I'm seeing, however, is that the queue of documents ready to be
> sent to the Solr server exceeds my preset limit, telling me that Solr
> somehow can't process the documents fast enough.
>
> (I have created my own queue in front of SolrJ's
> StreamingUpdateSolrServer, because it would not process the documents
> fast enough, causing OutOfMemoryErrors due to the large number of
> documents building up in its queue.)
>
> The index consists for about 95% of IDs (longs). We don't do any
> analysis on the fields being indexed, and the schema is rather
> straightforward; most fields look like:
>
> <fieldType name="long" class="solr.LongField" omitNorms="true"/>
> <field name="objectId" type="long" stored="true" indexed="true" required="true"/>
> <field name="listId" type="long" stored="false" indexed="true" multiValued="true"/>
>
> The relevant part of solrconfig.xml:
>
> <indexDefaults>
>   <useCompoundFile>false</useCompoundFile>
>   <mergeFactor>100</mergeFactor>
>   <RAMBufferSizeMB>256</RAMBufferSizeMB>
>   <maxMergeDocs>2147483647</maxMergeDocs>
>   <maxFieldLength>10000</maxFieldLength>
>   <writeLockTimeout>1000</writeLockTimeout>
>   <commitLockTimeout>10000</commitLockTimeout>
>   <lockType>single</lockType>
> </indexDefaults>
>
> The machines I'm testing on have an Intel(R) Core(TM)2 Quad CPU Q9550 @
> 2.83GHz with 4 GB of RAM, running Linux with Java 1.6.0_17, Tomcat 6
> and Solr 1.4.
>
> What I'm seeing is that the network almost never exceeds 10% of the
> 1 Gb/s connection, CPU utilization stays below 25% (one core is used,
> not the others), and there is no heavy disk I/O. While indexing, memory
> consumption is:
> Free memory: 212.15 MB  Total memory: 509.12 MB  Max memory: 2730.68 MB
>
> In the beginning (with an empty index) I get 2 ms per insert, but this
> slows to 18-19 ms per insert.
>
> Are there any tips or tricks I can use to speed up my indexing? I have
> a feeling that my machine is capable of doing more (using more CPUs);
> I just can't figure out how.
>
> Thijs
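
Below is a minimal sketch of the bounded-queue approach Kallin describes, assuming SolrJ 1.4. It uses CommonsHttpSolrServer rather than StreamingUpdateSolrServer, as he suggests; the Solr URL, queue and batch sizes, and createDocuments() are placeholders for the application's own document-building logic.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class QueuedIndexer {

    private static final int QUEUE_SIZE = 10000; // upper bound on buffered docs
    private static final int BATCH_SIZE = 1000;  // docs sent per add() call

    public static void main(String[] args) throws Exception {
        final BlockingQueue<SolrInputDocument> queue =
                new LinkedBlockingQueue<SolrInputDocument>(QUEUE_SIZE);
        final SolrServer solr =
                new CommonsHttpSolrServer("http://localhost:8983/solr");

        // Producer: builds documents and blocks when the queue is full,
        // so memory use stays bounded.
        Thread producer = new Thread(new Runnable() {
            public void run() {
                try {
                    for (SolrInputDocument doc : createDocuments()) {
                        queue.put(doc); // blocks while the queue is full
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        producer.start();

        // Consumer: drains the queue in batches and sends each batch
        // to Solr in a single add() call.
        List<SolrInputDocument> batch =
                new ArrayList<SolrInputDocument>(BATCH_SIZE);
        while (producer.isAlive() || !queue.isEmpty()) {
            queue.drainTo(batch, BATCH_SIZE);
            if (batch.isEmpty()) {
                Thread.sleep(50); // queue momentarily empty; wait for producer
            } else {
                solr.add(batch);
                batch.clear();
            }
        }
        solr.commit();
    }

    // Hypothetical stand-in for the application's database-join logic.
    private static Iterable<SolrInputDocument> createDocuments() {
        List<SolrInputDocument> docs = new ArrayList<SolrInputDocument>();
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("objectId", 1L);
        doc.addField("listId", 42L);
        docs.add(doc);
        return docs;
    }
}

Sending each drained batch through a single add() call avoids one HTTP round trip per document, and the bounded queue keeps the producer from running ahead of the indexer and exhausting memory.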
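
For comparison, a construction sketch of StreamingUpdateSolrServer itself: in SolrJ 1.4 it takes an internal queue size and a number of background sender threads (the URL and sizes here are placeholders).

import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;

public class StreamingExample {
    public static void main(String[] args) throws Exception {
        // Buffer up to 10,000 documents internally; 4 threads send updates.
        StreamingUpdateSolrServer solr =
                new StreamingUpdateSolrServer("http://localhost:8983/solr", 10000, 4);
    }
}

More sender threads can spread update handling across connections, though as noted above its internal queue was the source of the memory build-up in this case.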