Re: Machine utilization while indexing
Sorry, I missed it in the solrconfig.xml (my bad). I wasn't looking for it in the right place.

Thijs

On 27-5-2010 6:41, Chris Hostetter wrote:
: So now I wonder why BinaryRequestWriter (and BinaryUpdateRequestHandler)
: aren't turned on by default (esp. considering some threads on the dev-list).

I don't really understand this question -- the BinaryUpdateRequestHandler is registered with the path /update/javabin in the example solrconfig.xml -- that's about as close to turning something on by default as Solr supports.

-Hoss
Re: Machine utilization while indexing
Hi all,

I did some further investigation and (after turning off some filters in YourKit) found that it was actually the machine sending the files to Solr that was slowing things down. At first I couldn't find this, as it turned out that YourKit hides org.apache.* classes. When I removed this filter, it turned out that at least 50% of the CPU time was taken by org.apache.solr.client.solrj.util.ClientUtils.writeXML(SolrInputDocument, Writer). This was taking so much time that the commit queues were filling up on the client side instead of on the Solr server.

I have now switched back to my custom BlockingQueue with multiple CommonsHttpSolrServers that use the BinaryRequestWriter. I'm now able to index 80 documents in 8 minutes (including optimize), and 2.9 million documents in 32 minutes (incl. optimize). As the StreamingUpdateSolrServer only supports XML, I can't use that.

So now I wonder why BinaryRequestWriter (and BinaryUpdateRequestHandler) aren't turned on by default (esp. considering some threads on the dev-list some time ago about setting a default schema for optimum performance). Also, finding out about this performance enhancement wasn't easy, as it's hardly mentioned on the wiki. I'll see if I can update this.

Thanks for all the advice, and especially for the great work on Solr/Lucene.

Thijs

On 20-5-2010 21:34, Chris Hostetter wrote:
: StreamingUpdateSolrServer already has multiple threads and uses multiple
: connections under the covers. At least the api says ' Uses an internal

Hmmm... I think one of us misunderstands the point behind StreamingUpdateSolrServer and its internal threads/queues. [...]

-Hoss
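The switch Thijs describes is a small client-side change. A minimal sketch against the SolrJ 1.4 API (the URL is a placeholder, and this assumes the /update/javabin handler is registered in solrconfig.xml as Hoss notes):

```java
import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

// Sketch: send updates as javabin instead of XML, avoiding the
// ClientUtils.writeXML overhead measured above. URL is a placeholder.
CommonsHttpSolrServer server =
    new CommonsHttpSolrServer("http://localhost:8983/solr");
server.setRequestWriter(new BinaryRequestWriter());
```

This is a configuration fragment, not a complete program; the same setRequestWriter call works on each server instance if you pool several of them as Thijs did.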
RE: Machine utilization while indexing
How about throwing a BlockingQueue (http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/BlockingQueue.html) between your document creator and the Solr server? Give it a size of 10,000 or something, with one thread trying to feed it, and one thread waiting for it to get near full and then draining it. Take the drained results and add them to the server (maybe try not using StreamingUpdateSolrServer). Something like that worked well for me with about 5,000,000 documents, each ~5k, taking about 8 hours.

-Kallin Nagelberg

-----Original Message-----
From: Thijs [mailto:vonk.th...@gmail.com]
Sent: Thursday, May 20, 2010 11:02 AM
To: solr-user@lucene.apache.org
Subject: Machine utilization while indexing

Hi.

I have a question about how I can get Solr to index quicker than it does at the moment. I have to index (and re-index) some 3-5 million documents. These documents are preprocessed by a Java application that effectively combines multiple database tables with each other to form the SolrInputDocument.

What I'm seeing, however, is that the queue of documents that are ready to be sent to the Solr server exceeds my preset limit, telling me that Solr somehow can't process the documents fast enough. (I have created my own queue in front of Solrj.StreamingUpdateSolrServer, as it would not process the documents fast enough, causing OutOfMemoryExceptions due to the large number of documents building up in its queue.)

I have an index that for 95% consists of IDs (Long). We don't do any analysis on the fields that are being indexed. The schema is rather straightforward; most fields look like:

  <fieldType name="long" class="solr.LongField" omitNorms="true"/>
  <field name="objectId" type="long" stored="true" indexed="true" required="true"/>
  <field name="listId" type="long" stored="false" indexed="true" multiValued="true"/>

The relevant solrconfig.xml:

  <indexDefaults>
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>100</mergeFactor>
    <RAMBufferSizeMB>256</RAMBufferSizeMB>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>1</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>1</commitLockTimeout>
    <lockType>single</lockType>
  </indexDefaults>

The machines I'm testing on have an Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz with 4GB of RAM, running on Linux with Java 1.6.0_17, Tomcat 6, and Solr 1.4.

What I'm seeing is that the network almost never reaches more than 10% of the 1Gb/s connection, that the CPU utilization is always below 25% (1 core is used, not the others), and I don't see heavy disk I/O. Also, while indexing, the memory consumption is:

  Free memory: 212.15 MB
  Total memory: 509.12 MB
  Max memory: 2730.68 MB

And in the beginning (with an empty index) I get 2ms per insert, but this slows to 18-19ms per insert.

Are there any tips/tricks I can use to speed up my indexing? Because I have a feeling that my machine is capable of doing more (using more CPUs). I just can't figure out how.

Thijs
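The queue pattern Kallin describes can be sketched with plain java.util.concurrent; sendBatch() here is a hypothetical stand-in for SolrServer.add(batch), and the queue/document counts are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch: one producer fills a bounded queue, one consumer drains it in
// batches and hands each batch to the indexer. sendBatch() is a stand-in
// for server.add(batch) in SolrJ.
public class BatchingIndexer {
    static final BlockingQueue<String> queue = new ArrayBlockingQueue<>(10_000);
    static int indexed = 0;

    static void sendBatch(List<String> batch) {
        indexed += batch.size();   // real code would call SolrServer.add(batch)
    }

    public static void main(String[] args) throws InterruptedException {
        Thread producer = new Thread(() -> {
            for (int i = 0; i < 50_000; i++) {
                try { queue.put("doc-" + i); } catch (InterruptedException e) { return; }
            }
        });
        producer.start();
        List<String> batch = new ArrayList<>();
        while (indexed < 50_000) {
            batch.add(queue.take());    // block until at least one doc arrives
            queue.drainTo(batch, 999);  // then grab up to a full batch of 1000
            sendBatch(batch);
            batch.clear();
        }
        producer.join();
        System.out.println("indexed=" + indexed);
    }
}
```

The bounded queue is what prevents the OutOfMemoryExceptions Thijs mentions: put() blocks the producer once the consumer falls behind, instead of letting documents pile up.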
RE: Machine utilization while indexing
It takes that long to do indexing? I'm HOPING to have a site that has low tens of millions of documents, up to billions. Sounds to me like I will DEFINITELY need a cloud account at indexing time.

For the original author of this thread, that's what I'd recommend:
1/ Optimize as best as you can on one machine.
2/ Set up an Amazon EC2 (Elastic Compute Cloud) account. Spawn/shard the indexing over to 5-10 machines during indexing. Combine the index, then shut down the EC2 instances.

Probably could get it down to 1/2 hour, without impacting your current queries.

Dennis Gearon

Signature Warning
EARTH has a Right To Life, otherwise we all die.
Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php

--- On Thu, 5/20/10, Nagelberg, Kallin knagelb...@globeandmail.com wrote:

From: Nagelberg, Kallin knagelb...@globeandmail.com
Subject: RE: Machine utilization while indexing
To: 'solr-user@lucene.apache.org' solr-user@lucene.apache.org
Date: Thursday, May 20, 2010, 8:16 AM

How about throwing a blockingqueue between your document creator and the Solr server? [...]
Re: Machine utilization while indexing
I already have a BlockingQueue in place (that's my custom queue), and luckily I'm indexing faster than what you're doing. Currently it takes about 2 hours to index the 5m documents I'm talking about. But I still feel as if my machine is under-utilized.

Thijs

On 20-5-2010 17:16, Nagelberg, Kallin wrote:

How about throwing a blockingqueue between your document creator and the Solr server? [...]
Re: Machine utilization while indexing
Why would I need faster hardware if my current hardware isn't reaching its max capacity? I'm already using separate machines for querying and indexing, so while indexing, the queries aren't affected. Pulling an optimized snapshot isn't even noticeable on the query machines.

Thijs

On 20-5-2010 17:25, Dennis Gearon wrote:

It takes that long to do indexing? I'm HOPING to have a site that has low tens of millions of documents, up to billions. [...]
RE: Machine utilization while indexing
Well, to be fair, I'm indexing on a modest virtualized machine with only 2 gigs of RAM, and a doc size of 5-10k, maybe substantially larger than what you have. They could be substantially smaller too. As another point of reference, my index ends up being about 20 gigs with the 5 million docs.

I should also point out I only need to do this once. I'm not constantly reindexing everything. My indexed documents rarely change, and when they do, we have a process that selectively updates the few that need it. Combine that with a constant trickle of new documents, and indexing performance isn't much of a concern.

You should be able to experiment with a small subset of your documents to speedily test new schemas, etc. In my case I selected a representative sample and stored them in my project for unit testing.

-Kallin Nagelberg

-----Original Message-----
From: Dennis Gearon [mailto:gear...@sbcglobal.net]
Sent: Thursday, May 20, 2010 11:25 AM
To: solr-user@lucene.apache.org
Subject: RE: Machine utilization while indexing

It takes that long to do indexing? [...]
RE: Machine utilization while indexing
You're sure it's not blocking on indexing IO? If not, then I guess it must be a thread waiting unnecessarily in Solr or your loading program. To get my loader running at full speed, I hooked it up to JProfiler's thread views to see where the stalls were, and optimized from there.

-Kallin Nagelberg

-----Original Message-----
From: Thijs [mailto:vonk.th...@gmail.com]
Sent: Thursday, May 20, 2010 11:25 AM
To: solr-user@lucene.apache.org
Subject: Re: Machine utilization while indexing

I already have a blockingqueue in place (that's my custom queue) and luckily I'm indexing faster than what you're doing. [...]
RE: Machine utilization while indexing
Here is a good article from IBM, with code, on how to do hybrid/cloud computing: http://www.ibm.com/developerworks/library/x-cloudpt1/

Dennis Gearon

Signature Warning
EARTH has a Right To Life, otherwise we all die.
Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php

--- On Thu, 5/20/10, Nagelberg, Kallin knagelb...@globeandmail.com wrote:

How about throwing a blockingqueue between your document creator and the Solr server? [...]
Re: Machine utilization while indexing
I'm really only guessing here, but based on your description of what you are doing, it sounds like you only have one thread streaming documents to Solr (via a single StreamingUpdateSolrServer instance, which creates a single HTTP connection).

Have you at all attempted to have parallel threads in your client initiate parallel connections to Solr, via multiple instances of StreamingUpdateSolrServer objects?

-Hoss
RE: Machine utilization while indexing
StreamingUpdateSolrServer already has multiple threads and uses multiple connections under the covers. At least the API says 'Uses an internal MultiThreadedHttpConnectionManager to manage http connections'. The constructor allows you to specify the number of threads used: http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.html#StreamingUpdateSolrServer(java.lang.String, int, int)

-Kallin Nagelberg

-----Original Message-----
From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
Sent: Thursday, May 20, 2010 3:14 PM
To: solr-user@lucene.apache.org
Subject: Re: Machine utilization while indexing

I'm really only guessing here, but based on your description of what you are doing it sounds like you only have one thread streaming documents to Solr. [...]
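The constructor Kallin links to can be sketched as follows; the URL and sizes are illustrative only, with queueSize being the request buffer and threadCount the number of runner threads per the javadoc:

```java
import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;

// Sketch: buffer up to 1000 update requests and drain them with 4
// runner threads / connections. URL and sizes are placeholders.
StreamingUpdateSolrServer server =
    new StreamingUpdateSolrServer("http://localhost:8983/solr", 1000, 4);
```

This is a configuration fragment; whether those internal threads actually keep Solr busy is exactly what the rest of this thread debates.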
RE: Machine utilization while indexing
: StreamingUpdateSolrServer already has multiple threads and uses multiple
: connections under the covers. At least the api says ' Uses an internal

Hmmm... I think one of us misunderstands the point behind StreamingUpdateSolrServer and its internal threads/queues (it's very possible that it's me).

My understanding is that this allows it to manage the batching of multiple operations for you, reusing connections as it goes -- so the queueSize is how many individual requests it buffers before sending the batch to Solr, and the threadCount controls how many batches it can send in parallel (in the event that one thread is still waiting for the response when the queue next fills up).

But if you are only using a single thread to feed SolrRequests to a single instance of StreamingUpdateSolrServer, then there can still be lots of opportunities for Solr itself to be idle -- as I said, it's not clear to me if you are using multiple threads to write to your StreamingUpdateSolrServer... even if you reuse the same StreamingUpdateSolrServer instance, multiple threads in your client code may increase the throughput (assuming that at the moment the threads in StreamingUpdateSolrServer are largely idle).

But as I said... this is all mostly a guess. I'm not intimately familiar with solrj.

-Hoss
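Hoss's suggestion of multiple client feeder threads can be sketched with an ExecutorService; addDoc() is a hypothetical stand-in for add() on a shared server instance, and the thread and document counts are illustrative:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: several client threads feeding one shared, thread-safe server
// instance. addDoc() stands in for sharedServer.add(doc) in SolrJ.
public class ParallelFeeder {
    static final AtomicInteger sent = new AtomicInteger();

    static void addDoc(String doc) {
        sent.incrementAndGet();   // real code: sharedServer.add(buildDoc(doc))
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int t = 0; t < 4; t++) {
            final int shard = t;
            pool.execute(() -> {
                // each feeder thread handles a slice of the document ids
                for (int i = shard; i < 20_000; i += 4) {
                    addDoc("doc-" + i);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println("sent=" + sent.get());
    }
}
```

The point of the sketch is that document preparation itself runs in parallel, so a single slow producer thread can no longer starve the server's internal runner threads.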