Re: Machine utilization while indexing

2010-05-27 Thread Thijs
Sorry I missed it in the solrconfig.xml (my bad). I wasn't looking for 
it in the right place.


Thijs

On 27-5-2010 6:41, Chris Hostetter wrote:







Re: Machine utilization while indexing

2010-05-26 Thread Chris Hostetter

: So now I wonder why BinaryRequestWriter (and BinaryUpdateRequestHandler)
: aren't turned on by default. (eps considering some threads on the dev-list

I don't really understand this question -- the BinaryUpdateRequestHandler
is registered at the path /update/javabin in the example solrconfig.xml
-- that's about as close to turning something on by default as Solr
supports.



-Hoss



Re: Machine utilization while indexing

2010-05-25 Thread Thijs

Hi all,

I did some further investigation and (after turning off some filters in
YourKit) found that it was actually the machine sending the documents to
Solr that was slowing things down.


At first I couldn't find this, as it turned out that YourKit hides
org.apache.* classes. When I removed this filter, it turned out that
at least 50% of the CPU time was taken by
org.apache.solr.client.solrj.util.ClientUtils.writeXML(SolrInputDocument, Writer).
This was taking so much time that the commit queues were filling up on
the client side instead of on the Solr server.


I have now switched back to my custom BlockingQueue with multiple
CommonsHttpSolrServers that use the BinaryRequestWriter. I'm now able
to index 80 documents in 8 minutes (including optimize), and
2.9 million documents in 32 minutes (incl. optimize).

As StreamingUpdateSolrServer only supports XML, I can't use it.

So now I wonder why BinaryRequestWriter (and BinaryUpdateRequestHandler)
aren't turned on by default (esp. considering some threads on the
dev-list some time ago about setting a default schema for optimum
performance).
Also, finding out about this performance enhancement wasn't easy, as it's
hardly mentioned on the wiki. I'll see if I can update this.


Thanks for all the advice and esp. the great work on Solr/Lucene.
Thijs
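The client-side switch Thijs describes is small. A sketch against the Solr 1.4 SolrJ API (the URL and field name are placeholders, and the /update/javabin handler must be registered in solrconfig.xml):

```java
import java.io.IOException;
import java.net.MalformedURLException;

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.BinaryRequestWriter;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BinaryFeeder {
    public static void feed() throws MalformedURLException, SolrServerException, IOException {
        CommonsHttpSolrServer server =
            new CommonsHttpSolrServer("http://localhost:8983/solr");
        // Send updates as javabin instead of XML; this avoids the
        // expensive ClientUtils.writeXML serialization on the client.
        server.setRequestWriter(new BinaryRequestWriter());

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("objectId", 42L);
        server.add(doc);
        server.commit();
    }
}
```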


On 20-5-2010 21:34, Chris Hostetter wrote:







RE: Machine utilization while indexing

2010-05-20 Thread Nagelberg, Kallin
How about throwing a BlockingQueue
(http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/BlockingQueue.html)
between your document creator and the SolrServer? Give it a size of 10,000 or
something, with one thread trying to feed it and one thread waiting for it to
get near full and then draining it. Take the drained results and add them to the
server (maybe try not using StreamingUpdateSolrServer). Something like that worked
well for me with about 5,000,000 documents of ~5 KB each, taking about 8 hours.

-Kallin Nagelberg
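The pattern described above can be sketched with plain java.util.concurrent. The indexBatch() stand-in below is hypothetical -- a real loader would call SolrServer.add() there -- and the sizes are illustrative, not recommendations:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

public class QueueFeeder {
    static final AtomicInteger INDEXED = new AtomicInteger();

    // Hypothetical stand-in: a real loader would call SolrServer.add(batch) here.
    static void indexBatch(List<String> batch) {
        INDEXED.addAndGet(batch.size());
    }

    public static void main(String[] args) throws InterruptedException {
        final BlockingQueue<String> queue = new ArrayBlockingQueue<String>(10000);
        final String POISON = "__no_more_docs__";

        Thread producer = new Thread(new Runnable() {
            public void run() {
                try {
                    for (int i = 0; i < 50000; i++) {
                        queue.put("doc-" + i); // blocks when the queue is full
                    }
                    queue.put(POISON);         // signal end of input
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });

        Thread drainer = new Thread(new Runnable() {
            public void run() {
                List<String> batch = new ArrayList<String>();
                try {
                    while (true) {
                        batch.add(queue.take()); // block until work arrives
                        queue.drainTo(batch);    // then grab everything waiting
                        boolean done = batch.remove(POISON);
                        if (!batch.isEmpty()) {
                            indexBatch(batch);
                        }
                        batch.clear();
                        if (done) {
                            return;
                        }
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });

        producer.start();
        drainer.start();
        producer.join();
        drainer.join();
    }
}
```

The take()-then-drainTo() combination is what keeps batches large without busy-waiting: the drainer blocks while the queue is empty, then empties it in one call.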

-Original Message-
From: Thijs [mailto:vonk.th...@gmail.com] 
Sent: Thursday, May 20, 2010 11:02 AM
To: solr-user@lucene.apache.org
Subject: Machine utilization while indexing

Hi.

I have a question about how I can get Solr to index quicker than it does
at the moment.

I have to index (and re-index) some 3-5 million documents. These
documents are preprocessed by a Java application that effectively
combines multiple database tables with each other to form the
SolrInputDocument.

What I'm seeing, however, is that the queue of documents that are ready
to be sent to the Solr server exceeds my preset limit, telling me that
Solr somehow can't process the documents fast enough.

(I have created my own queue in front of SolrJ's StreamingUpdateSolrServer,
as it would not process the documents fast enough, causing
OutOfMemoryExceptions due to the large number of documents building up
in its queue.)

I have an index that for 95% consists of IDs (Long). We don't do any
analysis on the fields that are being indexed. The schema is rather
straightforward.

most fields look like

<fieldType name="long" class="solr.LongField" omitNorms="true"/>
<field name="objectId" type="long" stored="true" indexed="true" required="true"/>
<field name="listId" type="long" stored="false" indexed="true" multiValued="true"/>

the relevant solrconfig.xml:

<indexDefaults>
  <useCompoundFile>false</useCompoundFile>
  <mergeFactor>100</mergeFactor>
  <RAMBufferSizeMB>256</RAMBufferSizeMB>
  <maxMergeDocs>2147483647</maxMergeDocs>
  <maxFieldLength>1</maxFieldLength>
  <writeLockTimeout>1000</writeLockTimeout>
  <commitLockTimeout>1</commitLockTimeout>
  <lockType>single</lockType>
</indexDefaults>


The machines I'm testing on have an Intel(R) Core(TM)2 Quad CPU Q9550
@ 2.83GHz with 4GB of RAM, running Linux with Java 1.6.0_17, Tomcat 6,
and Solr 1.4.

What I'm seeing is that the network almost never reaches more than 10%
of the 1Gb/s connection, that the CPU utilization is always below 25%
(one core is used, not the others), and I don't see heavy disk IO.
Also, while indexing, the memory consumption is:
Free memory: 212.15 MB Total memory: 509.12 MB Max memory: 2730.68 MB

And in the beginning (with an empty index) I get 2ms per insert, but
this slows to 18-19ms per insert.

Are there any tips/tricks I can use to speed up my indexing? Because I
have a feeling that my machine is capable of doing more (using more
CPUs). I just can't figure out how.

Thijs


RE: Machine utilization while indexing

2010-05-20 Thread Dennis Gearon
It takes that long to do indexing? I'm HOPING to have a site that has low
tens of millions of documents, up to billions.

Sounds to me like I will DEFINITELY need a cloud account at indexing time. For 
the original author of this thread, that's what I'd recommend.

1/ Optimize as best as you can on one machine.
2/ Set up an Amazon EC2 (Elastic Compute Cloud) account. Spawn/shard the
indexing over to 5-10 machines during indexing. Combine the indexes, then
shut down the EC2 instances. You could probably get it down to half an hour
without impacting your current queries.
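For step 2/, the per-shard indexes can be combined offline with Lucene's IndexMergeTool from the misc contrib (a sketch; jar names and index paths are placeholders):

```shell
# Merge two shard indexes into one new index directory (paths illustrative).
java -cp lucene-core.jar:lucene-misc.jar \
  org.apache.lucene.misc.IndexMergeTool \
  /data/merged-index /data/shard1/index /data/shard2/index
```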


Dennis Gearon

Signature Warning

EARTH has a Right To Life,
  otherwise we all die.

Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php


--- On Thu, 5/20/10, Nagelberg, Kallin knagelb...@globeandmail.com wrote:




Re: Machine utilization while indexing

2010-05-20 Thread Thijs
I already have a BlockingQueue in place (that's my custom queue), and
luckily I'm indexing faster than what you're doing. Currently it takes
about 2 hours to index the 5M documents I'm talking about. But I still
feel as if my machine is underutilized.


Thijs


On 20-5-2010 17:16, Nagelberg, Kallin wrote:





Re: Machine utilization while indexing

2010-05-20 Thread Thijs
Why would I need faster hardware if my current hardware isn't reaching
its max capacity?

I'm already using different machines for querying and indexing, so while
indexing, the queries aren't affected. Pulling an optimized snapshot
isn't even noticeable on the query machines.


Thijs


On 20-5-2010 17:25, Dennis Gearon wrote:






RE: Machine utilization while indexing

2010-05-20 Thread Nagelberg, Kallin
Well, to be fair, I'm indexing on a modest virtualized machine with only
2 GB of RAM, and a doc size of 5-10 KB, maybe substantially larger than
what you have. They could be substantially smaller too. As another point
of reference, my index ends up being about 20 GB with the 5 million docs.

I should also point out that I only need to do this once; I'm not
constantly reindexing everything. My indexed documents rarely change,
and when they do, we have a process that selectively updates the few
that need it. Combine that with a constant trickle of new documents,
and indexing performance isn't much of a concern.

You should be able to experiment with a small subset of your documents
to speedily test new schemas, etc. In my case I selected a representative
sample and stored them in my project for unit testing.

-Kallin Nagelberg


-Original Message-
From: Dennis Gearon [mailto:gear...@sbcglobal.net] 
Sent: Thursday, May 20, 2010 11:25 AM
To: solr-user@lucene.apache.org
Subject: RE: Machine utilization while indexing


RE: Machine utilization while indexing

2010-05-20 Thread Nagelberg, Kallin
You're sure it's not blocking on indexing IO? If not, then I guess it must
be a thread waiting unnecessarily in Solr or in your loading program. To
get my loader running at full speed, I hooked it up to JProfiler's thread
views to see where the stalls were and optimized from there.

-Kallin Nagelberg

-Original Message-
From: Thijs [mailto:vonk.th...@gmail.com] 
Sent: Thursday, May 20, 2010 11:25 AM
To: solr-user@lucene.apache.org
Subject: Re: Machine utilization while indexing




RE: Machine utilization while indexing

2010-05-20 Thread Dennis Gearon
Here is a good article from IBM, with code, on how to do hybrid/cloud computing.

http://www.ibm.com/developerworks/library/x-cloudpt1/


Dennis Gearon



--- On Thu, 5/20/10, Nagelberg, Kallin knagelb...@globeandmail.com wrote:




Re: Machine utilization while indexing

2010-05-20 Thread Chris Hostetter

I'm really only guessing here, but based on your description of what you
are doing, it sounds like you only have one thread streaming documents to
Solr (via a single StreamingUpdateSolrServer instance, which creates a
single HTTP connection).

Have you at all attempted to have parallel threads in your client initiate
parallel connections to Solr via multiple instances of
StreamingUpdateSolrServer objects?


-Hoss



RE: Machine utilization while indexing

2010-05-20 Thread Nagelberg, Kallin
StreamingUpdateSolrServer already has multiple threads and uses multiple
connections under the covers. At least the API says 'Uses an internal
MultiThreadedHttpConnectionManager to manage http connections'. The
constructor allows you to specify the number of threads used:
http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.html#StreamingUpdateSolrServer(java.lang.String, int, int)

-Kallin Nagelberg
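That constructor can be used as below (a sketch against the 1.4 SolrJ API; the URL, queue size, and thread count are illustrative, not tuned recommendations):

```java
import java.net.MalformedURLException;

import org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer;

public class StreamingSetup {
    public static StreamingUpdateSolrServer create() throws MalformedURLException {
        // Buffer up to 4096 update requests client-side and drain them
        // over 4 concurrent HTTP connections to Solr.
        return new StreamingUpdateSolrServer("http://localhost:8983/solr", 4096, 4);
    }
}
```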

-Original Message-
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: Thursday, May 20, 2010 3:14 PM
To: solr-user@lucene.apache.org
Subject: Re: Machine utilization while indexing





RE: Machine utilization while indexing

2010-05-20 Thread Chris Hostetter

: StreamingUpdateSolrServer already has multiple threads and uses multiple 
: connections under the covers. At least the api says ' Uses an internal 

Hmmm... I think one of us misunderstands the point behind
StreamingUpdateSolrServer and its internal threads/queues. (It's very
possible that it's me.)

My understanding is that this allows it to manage the batching of multiple
operations for you, reusing connections as it goes -- so the
queueSize is how many individual requests it buffers before sending the
batch to Solr, and the threadCount controls how many batches it can send
in parallel (in the event that one thread is still waiting for the
response when the queue next fills up).

But if you are only using a single thread to feed SolrRequests to a single
instance of StreamingUpdateSolrServer, then there can still be lots of
opportunities for Solr itself to be idle -- as I said, it's not clear to
me whether you are using multiple threads to write to your
StreamingUpdateSolrServer ... even if you reuse the same
StreamingUpdateSolrServer instance, multiple threads in your client code
may increase the throughput (assuming that at the moment the threads in
StreamingUpdateSolrServer are largely idle).

But as I said ... this is all mostly a guess. I'm not intimately
familiar with SolrJ.


-Hoss
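The "multiple client threads, one shared instance" idea above reduces to the pattern below, shown with plain java.util.concurrent; the AtomicInteger is a hypothetical stand-in for a shared, thread-safe StreamingUpdateSolrServer, and the counts are illustrative:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelFeed {
    // Hypothetical stand-in for one shared, thread-safe
    // StreamingUpdateSolrServer; incrementAndGet() stands in for add(doc).
    static final AtomicInteger SHARED = new AtomicInteger();

    public static int feed(int threads, final int docsPerThread) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            pool.execute(new Runnable() {
                public void run() {
                    // Each client thread builds and submits its own documents,
                    // so document construction and serialization use every core.
                    for (int i = 0; i < docsPerThread; i++) {
                        SHARED.incrementAndGet();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1L, TimeUnit.MINUTES);
        return SHARED.get();
    }
}
```

The point is that the expensive per-document work (which Thijs measured in ClientUtils.writeXML) happens on the feeding threads, so adding feeders can raise CPU utilization even when the server object itself is shared.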