Re: Question on Batch process

2011-04-28 Thread Otis Gospodnetic
Charles,

Maybe the question to ask is why you are committing at all?  Do you need 
somebody to see index changes while you are indexing?  If not, commit just at 
the end.  And optimize if you won't touch the index for a while.

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Charles Wardell charles.ward...@bcsolution.com
 To: solr-user@lucene.apache.org
 Sent: Wed, April 27, 2011 7:51:20 PM
 Subject: Re: Question on Batch process
 
 Thank you for your response. I did not make the StreamingUpdate application 
 yet, but I did change the other settings that you mentioned. It gave me a huge 
 boost in indexing speed. (I am still using post.sh but hope to change that 
 soon).
 
 One thing I noticed is the indexing speed was incredibly fast last night, but 
 today the commits are taking so long. Is this to be expected?
 
 -- 
 Best Regards,
 
 Charles Wardell
 Blue Chips Technology, Inc.
 www.bcsolution.com


Re: Question on Batch process

2011-04-27 Thread Otis Gospodnetic
Hi Charles,

Yes, the threads I was referring to are in the context of the client/indexer, 
so one of the params for StreamingUpdateSolrServer.
post.sh/jar are just there because they are handy.  Don't use them for 
production.

It's impossible to tell how long indexing of 100M documents may take. They 
could be very big or very small. You could perform very light or no analysis 
or heavy analysis. They could contain 1 or 100 fields. :)

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Charles Wardell charles.ward...@bcsolution.com
 To: solr-user@lucene.apache.org
 Sent: Tue, April 26, 2011 8:01:28 PM
 Subject: Re: Question on Batch process
 
 Thank you Otis.
 Without trying to appear too stupid: when you refer to having the params 
 matching your # of CPU cores, you are talking about the # of threads I can 
 spawn with the StreamingUpdateSolrServer object?
 Up until now, I have been just utilizing post.sh or post.jar. Are these 
 capable of that, or do I need to write some code to collect a bunch of files 
 into the buffer and send it off?
 
 Also, do you have a sense for how long it should take to index 100,000 files, 
 or in my case 100,000,000 documents?
 StreamingUpdateSolrServer
 public StreamingUpdateSolrServer(String solrServerUrl, int queueSize, int 
 threadCount) throws MalformedURLException
 
 Thanks again,
 Charlie
 
 -- 
 Best Regards,
 
 Charles Wardell
 Blue Chips Technology, Inc.
 www.bcsolution.com
 


Re: Question on Batch process

2011-04-27 Thread Charles Wardell
Thank you for your response. I did not make the StreamingUpdate application 
yet, but I did change the other settings that you mentioned. It gave me a huge 
boost in indexing speed. (I am still using post.sh but hope to change that 
soon).

One thing I noticed is the indexing speed was incredibly fast last night, but 
today the commits are taking so long. Is this to be expected?



-- 
Best Regards,

Charles Wardell
Blue Chips Technology, Inc.
www.bcsolution.com

On Wednesday, April 27, 2011 at 6:15 PM, Otis Gospodnetic wrote: 
 Hi Charles,
 
 Yes, the threads I was referring to are in the context of the client/indexer, 
 so one of the params for StreamingUpdateSolrServer.
 post.sh/jar are just there because they are handy. Don't use them for 
 production.
 
 It's impossible to tell how long indexing of 100M documents may take. They 
 could be very big or very small. You could perform very light or no analysis 
 or heavy analysis. They could contain 1 or 100 fields. :)
 
 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/
 
 
 


Question on Batch process

2011-04-26 Thread Charles Wardell
I am sure that this question has been asked a few times, but I can't seem to 
find the sweet spot for indexing.

I have about 100,000 files, each containing 1,000 XML documents, ready to be 
posted to Solr. My desire is to have it index as quickly as possible; once 
that completes, the daily stream of adds will be small in comparison.

The individual documents are small: essentially web postings from the net, 
with Title, postPostContent, and date fields. 

What would be the ideal configuration for ramBufferSizeMB, mergeFactor, 
maxBufferedDocs, etc.?

My machine is a quad core hyper-threaded, so it shows up as 8 CPUs in top.
I have 16GB of available RAM.


Thanks in advance.
Charlie

Re: Question on Batch process

2011-04-26 Thread Otis Gospodnetic
Charlie,

How's this:
* -Xmx2g
* ramBufferSizeMB 512
* mergeFactor 10 (default, but you could up it to 20, 30, if ulimit -n allows)
* ignore/delete maxBufferedDocs - not used if you set ramBufferSizeMB
* use StreamingUpdateSolrServer (with params matching your number of CPU cores) 
or send batches of say 1000 docs with the other SolrServer impl using N threads 
(N=# of your CPU cores)
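In solrconfig.xml terms, the list above might look like the following sketch (Solr 1.4/3.x-era `<indexDefaults>` section). The values are the ones suggested in this thread, not shipped defaults, and -Xmx2g goes on the JVM command line rather than in this file.

```xml
<!-- solrconfig.xml sketch: batch-indexing settings from the advice above;
     tune to your own hardware. -->
<indexDefaults>
  <ramBufferSizeMB>512</ramBufferSizeMB>  <!-- flush by RAM used, not doc count -->
  <mergeFactor>10</mergeFactor>           <!-- raise to 20-30 if ulimit -n allows -->
  <!-- omit maxBufferedDocs entirely; ramBufferSizeMB supersedes it -->
</indexDefaults>
```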

Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



- Original Message 
 From: Charles Wardell charles.ward...@bcsolution.com
 To: solr-user@lucene.apache.org
 Sent: Tue, April 26, 2011 2:32:29 PM
 Subject: Question on Batch process
 
 I am sure that this question has been asked a few times, but I can't seem to 
 find the sweet spot for indexing.
 
 I have about 100,000 files, each containing 1,000 XML documents, ready to be 
 posted to Solr. My desire is to have it index as quickly as possible; once 
 that completes, the daily stream of adds will be small in comparison.
 
 The individual documents are small: essentially web postings from the net, 
 with Title, postPostContent, and date fields. 
 
 What would be the ideal configuration for ramBufferSizeMB, mergeFactor, 
 maxBufferedDocs, etc.?
 
 My machine is a quad core hyper-threaded, so it shows up as 8 CPUs in top.
 I have 16GB of available RAM.
 
 Thanks in advance.
 Charlie


Re: Question on Batch process

2011-04-26 Thread Charles Wardell
Thank you Otis.
Without trying to appear too stupid: when you refer to having the params 
matching your # of CPU cores, you are talking about the # of threads I can 
spawn with the StreamingUpdateSolrServer object?
Up until now, I have been just utilizing post.sh or post.jar. Are these capable 
of that, or do I need to write some code to collect a bunch of files into the 
buffer and send it off?

Also, do you have a sense for how long it should take to index 100,000 files, 
or in my case 100,000,000 documents?
StreamingUpdateSolrServer
public StreamingUpdateSolrServer(String solrServerUrl, int queueSize, int 
threadCount) throws MalformedURLException
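The queueSize/threadCount pair in that signature is a producer/consumer setup: a bounded queue fed by the indexing code and drained by threadCount sender threads. A hedged Python sketch of the same pattern (the real class is Java/SolrJ; `run_indexer` and `send` are hypothetical names, with `send` standing in for the HTTP POST each worker performs):

```python
# Sketch of the bounded-queue-plus-workers pattern behind
# StreamingUpdateSolrServer (hypothetical Python stand-in).
import queue
import threading

def run_indexer(batches, send, queue_size=10, thread_count=4):
    """Feed batches through a bounded queue to `thread_count` sender threads."""
    q = queue.Queue(maxsize=queue_size)

    def worker():
        while True:
            batch = q.get()
            if batch is None:      # sentinel: no more work for this thread
                return
            send(batch)            # e.g. HTTP POST of one batch to /update

    threads = [threading.Thread(target=worker) for _ in range(thread_count)]
    for t in threads:
        t.start()
    for batch in batches:          # producer blocks whenever the queue is full
        q.put(batch)
    for _ in threads:              # one sentinel per worker thread
        q.put(None)
    for t in threads:
        t.join()
```

Matching thread_count to the number of CPU cores, as suggested earlier in the thread, keeps all senders busy without oversubscribing the machine.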

Thanks again,
Charlie

-- 
Best Regards,

Charles Wardell
Blue Chips Technology, Inc.
www.bcsolution.com

On Tuesday, April 26, 2011 at 5:12 PM, Otis Gospodnetic wrote: 
 Charlie,
 
 How's this:
 * -Xmx2g
 * ramBufferSizeMB 512
 * mergeFactor 10 (default, but you could up it to 20, 30, if ulimit -n allows)
 * ignore/delete maxBufferedDocs - not used if you set ramBufferSizeMB
 * use StreamingUpdateSolrServer (with params matching your number of CPU 
 cores) 
 or send batches of say 1000 docs with the other SolrServer impl using N 
 threads 
 (N=# of your CPU cores)
 
 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/
 
 
 