Re: indexing best practices

2010-07-20 Thread Lance Norskog
"Nomerge" has struck me as somewhat uncontrollable. There is also a
"balanced" merge policy in the trunk, courtesy of LinkedIn.

-- 
Lance Norskog
goks...@gmail.com


RE: indexing best practices

2010-07-19 Thread Burton-West, Tom
Hi Ken,

This is all very dependent on your documents, your indexing setup and your 
hardware. Just as an extreme data point, I'll describe our experience.  

We run 5 clients on each of 6 machines to send documents to Solr using the 
standard HTTP XML update process.  Our documents contain about 10 fields, but 
one field contains the OCR for the full text of a book.  The documents are 
about 700KB in size.
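
For readers who have not used the XML update process, here is a minimal sketch 
of building an add message with Python's standard library. The field names 
("id", "ocr") are made up for illustration and are not our actual schema; the 
resulting string would be POSTed to the shard's update handler with a 
Content-Type of text/xml.

```python
import xml.etree.ElementTree as ET


def build_add_message(docs):
    """Serialize a list of {field_name: value} dicts as a Solr <add> message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # One <field name="..."> element per field; ElementTree escapes the text.
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")


# Hypothetical document: a real one would carry ~700KB of OCR text.
message = build_add_message([{"id": "book-1", "ocr": "full text of the book"}])
print(message)
```

Batching several documents into one add message is a common way to cut down on 
HTTP round trips, although with documents this large it probably matters less.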

Each client sends Solr documents to one of 10 Solr shards on a round-robin 
basis.  We are running 5 shards on each of two dedicated indexing machines, 
each with 144GB of memory and 2 x quad-core Intel Xeon E5540 2.53GHz processors 
(Nehalem).  What we generally see is that once the index gets large enough for 
significant merging, our producers can send documents to Solr faster than it 
can index them.
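
The round-robin part can be sketched roughly as below. The host names and URL 
layout are assumptions for illustration only (this message only states 10 
shards, 5 on each of two machines), and the function just pairs documents with 
shards rather than sending anything.

```python
from itertools import cycle

# Hypothetical shard URLs: shards 0-4 on "indexer1", shards 5-9 on "indexer2".
SHARD_URLS = [
    f"http://indexer{1 + i // 5}:8983/solr/shard{i}/update" for i in range(10)
]


def assign_round_robin(doc_ids, shard_urls):
    """Pair each document ID with the next shard URL, wrapping around at the end."""
    shards = cycle(shard_urls)
    return [(doc_id, next(shards)) for doc_id in doc_ids]


assignments = assign_round_robin(range(25), SHARD_URLS)
for doc_id, url in assignments[:3]:
    print(doc_id, url)
```

Round robin keeps the shards close to the same document count, but it does not 
balance by document size, so shards can still differ in bytes indexed.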

We suspect that our bottleneck is simply disk I/O for index merging on the Solr 
build machines.  We are currently experimenting with changing the 
ramBufferSizeMB setting and various merge policies/merge factors to see if we 
can speed up the Solr end of the indexing process.  Since we optimize our 
index down to two segments, we are also planning to experiment with using the 
"nomerge" merge policy. I hope to have some results to report on our blog 
sometime in the next month or so.
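
To show why the merge factor matters for disk I/O, here is a toy model of 
level-based merging. It is a deliberate simplification and not Lucene's actual 
merge policy: every flush is assumed to write a segment of size 1, and whenever 
a level holds merge_factor segments they are rewritten into one segment on the 
next level.

```python
def simulate_merges(num_flushes, merge_factor):
    """Return (units flushed, units rewritten by merges) for a toy merge model."""
    levels = [[]]  # levels[i] holds the sizes of the segments currently at level i
    merged_units = 0
    for _ in range(num_flushes):
        levels[0].append(1)  # each flush adds one unit-sized segment at level 0
        level = 0
        while len(levels[level]) >= merge_factor:
            merged_size = sum(levels[level])
            merged_units += merged_size  # a merge rewrites all of its input data
            levels[level] = []
            if level + 1 == len(levels):
                levels.append([])
            levels[level + 1].append(merged_size)
            level += 1
    return num_flushes, merged_units


for merge_factor in (10, 100):
    flushed, merged = simulate_merges(1000, merge_factor)
    print(f"mergeFactor={merge_factor}: flushed {flushed}, rewritten {merged}")
```

In this model, 1000 flushes with a merge factor of 10 rewrite 3000 units of 
data, against 1000 units with a merge factor of 100. A larger RAM buffer has a 
similar effect from the other side, since it means fewer and larger flushes. 
The price of a higher merge factor is more segments on disk until the final 
optimize.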

Tom Burton-West
www.hathitrust.org/blogs


Re: indexing best practices

2010-07-18 Thread Geert-Jan Brits
Have you read:
http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr

In short, there are only guidelines (see the links), no definitive answers.
If you have followed the guidelines for improving indexing speed on a single
box and, after testing various settings, indexing is still too slow, you may
want to test this scenario:
1. Index to several boxes/shards (using round robin or something similar).
2. Copy all of the created indexes to one box.
3. Use IndexWriter.addIndexes to merge the indexes.

Doing steps 1-3 on SSDs will of course boost performance a lot as well
(on large indexes, because small ones may fit in the disk cache entirely).
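
The real merge in step 3 is done by Lucene's IndexWriter.addIndexes in Java.
The sketch below is only a conceptual toy in Python, with a made-up data
layout, to show what combining separately built indexes amounts to: posting
lists are concatenated and the document IDs of each later index are shifted by
the number of documents already merged.

```python
def merge_indexes(indexes):
    """Merge toy indexes of the form (doc_count, {term: [doc_id, ...]}).

    Document IDs in every source index start at 0, so each index is shifted
    by the number of documents merged before it.
    """
    merged = {}
    doc_base = 0
    for doc_count, postings in indexes:
        for term, doc_ids in postings.items():
            merged.setdefault(term, []).extend(doc_base + d for d in doc_ids)
        doc_base += doc_count
    return doc_base, merged


# Two hypothetical shard indexes built independently on different boxes.
shard_a = (2, {"solr": [0, 1], "merge": [1]})
shard_b = (3, {"solr": [2], "index": [0, 1]})
total_docs, merged = merge_indexes([shard_a, shard_b])
print(total_docs, merged)
```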

Hope that helps a bit,
Geert-Jan


Re: indexing best practices

2010-07-18 Thread kenf_nc

Has no one done a performance analysis? Or does anyone have a link to one?

Basically, I want the fastest way to get documents into Solr. With so many
options available, which is the fastest:
1) file import (XML, CSV) vs. DIH vs. POSTing
2) number of concurrent clients: 1 vs. 10 vs. 100 ... is there a point of
diminishing returns?
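
One way to look for that point empirically is to time the same batch of
documents with different client counts. The sketch below uses a stand-in
fake_send function (an assumption, it only sleeps) where a real test would
POST to Solr, so the numbers it prints say nothing about Solr itself.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_send(doc):
    """Stand-in for an HTTP POST to Solr; replace with a real client call."""
    time.sleep(0.001)
    return 1


def time_import(docs, num_clients, send=fake_send):
    """Send all docs using num_clients worker threads; return (sent, seconds)."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_clients) as pool:
        sent = sum(pool.map(send, docs))
    return sent, time.perf_counter() - start


if __name__ == "__main__":
    docs = [{"id": i} for i in range(200)]
    for clients in (1, 10, 100):
        sent, seconds = time_import(docs, clients)
        print(f"{clients:>3} clients: {sent / seconds:8.0f} docs/sec")
```

Against a real server, throughput would be expected to flatten once Solr's
CPU or disk is saturated, and the client count where that happens is the
number being asked about.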

I have 16 million small (8 to 10 fields, no large text fields) docs that get
updated monthly and 2.5 million largish (20 to 30 fields, a couple of HTML text
fields) docs that get updated monthly. It currently takes about 20 hours to do
a full import. I would like to cut that down as much as possible.
Thanks,
Ken
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/indexing-best-practices-tp973274p976313.html
Sent from the Solr - User mailing list archive at Nabble.com.
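
For scale, the figures in this message work out to roughly 257 documents per
second. The short calculation below assumes that a full import covers all 18.5
million documents; the target durations are arbitrary examples.

```python
small_docs = 16_000_000
large_docs = 2_500_000
import_hours = 20

total_docs = small_docs + large_docs
docs_per_second = total_docs / (import_hours * 3600)
print(f"current rate: {docs_per_second:.0f} docs/sec")


def required_rate(total, target_hours):
    """Average docs/sec needed to import `total` documents in `target_hours`."""
    return total / (target_hours * 3600)


# Hypothetical targets, only to show how the required rate scales.
for target in (10, 5, 2):
    print(f"{target:>2} h import needs {required_rate(total_docs, target):.0f} docs/sec")
```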