Hi Otis,

Thanks for this. Are you using a flavor of Linux, and is it 64-bit? How much heap are you giving your JVM?

Thanks again
Brendan

On Nov 21, 2007, at 2:03 AM, Otis Gospodnetic wrote:

Mike is right about the occasional slow-down, which appears as a pause and is caused by the merging of large Lucene index segments. This should go away with newer versions of Lucene, which perform merges in the background.
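
(For anyone hitting the same pauses: in recent Lucene, the background merging Otis mentions is handled by ConcurrentMergeScheduler, which is the default in modern versions. Below is a minimal sketch against a recent Lucene API; this thread itself predates that API, and the index path and buffer size are illustrative only.)

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.ConcurrentMergeScheduler;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class BackgroundMergeWriter {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig cfg = new IndexWriterConfig(new StandardAnalyzer());
        // Run segment merges on background threads so add/update calls
        // are not blocked for the duration of a large merge.
        cfg.setMergeScheduler(new ConcurrentMergeScheduler());
        // A larger RAM buffer means fewer, larger flushes during bulk indexing.
        cfg.setRAMBufferSizeMB(128);
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("/tmp/index")), cfg)) {
            // ... add documents here ...
            writer.commit();
        }
    }
}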

That said, we just indexed about 20MM documents on a single 8-core machine with 8 GB of RAM, resulting in a nearly 20 GB index. The whole process took a little less than 10 hours - that's over 550 docs/second. The vanilla approach, before some of our changes, apparently required several days to index the same amount of data.
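
(The message doesn't say how the document feeding was parallelized; below is a hedged sketch of one common pattern on an 8-core box, using a SolrJ-style client. The URL, field names, batch size, thread count, and document count are illustrative only.)

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr =
                new HttpSolrClient.Builder("http://localhost:8983/solr/core1").build();
        ExecutorService pool = Executors.newFixedThreadPool(8); // one feeder per core
        for (int t = 0; t < 8; t++) {
            final int shard = t;
            pool.submit(() -> {
                List<SolrInputDocument> batch = new ArrayList<>();
                // Each thread handles every 8th document.
                for (int i = shard; i < 1_000_000; i += 8) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", Integer.toString(i));
                    doc.addField("text", "document body " + i);
                    batch.add(doc);
                    if (batch.size() == 1000) {  // send in large batches
                        solr.add(batch);
                        batch.clear();
                    }
                }
                if (!batch.isEmpty()) solr.add(batch);
                return null;
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
        solr.commit();  // single commit at the end, per the advice below
        solr.close();
    }
}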

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Mike Klaas <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Monday, November 19, 2007 5:50:19 PM
Subject: Re: Any tips for indexing large amounts of data?

There should be some slowdown in larger indices as occasionally large
segment merge operations must occur.  However, this shouldn't really
affect overall speed too much.

You haven't really given us enough data to tell you anything useful.
I would recommend trying to do the indexing via the Solr webapp
directly, to eliminate your own code as a possible factor.  Then look
for signs of what is happening when indexing slows.  For instance, is
Solr high in CPU, is the machine thrashing, etc.?
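
(One hedged way to watch for heap pressure from inside the indexing process is to sample the standard JMX memory bean while indexing runs; the class name and sampling interval below are illustrative only.)

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapSampler implements Runnable {
    @Override
    public void run() {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        while (!Thread.currentThread().isInterrupted()) {
            MemoryUsage heap = mem.getHeapMemoryUsage();
            // Log heap usage so a growing trend shows up alongside the slowdown.
            System.err.printf("heap used=%dMB committed=%dMB max=%dMB%n",
                    heap.getUsed() >> 20, heap.getCommitted() >> 20, heap.getMax() >> 20);
            try {
                Thread.sleep(10_000);  // sample every 10 seconds
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
}
// Usage: new Thread(new HeapSampler(), "heap-sampler").start();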

-Mike

On 19-Nov-07, at 2:44 PM, Brendan Grainger wrote:

Hi,

Thanks for answering this question a while back. I have implemented
some of the suggestions you mentioned, i.e. not committing until I've
finished indexing. What I am seeing, though, is that as the index gets
larger (around 1 GB), indexing takes a lot longer; in fact, it slows
to a crawl. Do you have any pointers as to what I might be doing wrong?

Also, I was looking at using MultiCore Solr. Could this help in
some way?

Thank you
Brendan

On Oct 31, 2007, at 10:09 PM, Chris Hostetter wrote:


: I would think you would see better performance by allowing auto
: commit to handle the commit size instead of reopening the
: connection all the time.

if your goal is "fast" indexing, don't use autoCommit at all ... just
index everything, and don't commit until you are completely done.

autoCommitting will slow your indexing down (the benefit being that
more results will be visible to searchers as you proceed).
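
(To make this concrete, here is a minimal sketch with a SolrJ-style client, which postdates parts of this thread: add everything first, commit exactly once. The URL, field names, and document count are illustrative.)

import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CommitOnceAtEnd {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr =
                new HttpSolrClient.Builder("http://localhost:8983/solr/core1").build()) {
            for (int i = 0; i < 100_000; i++) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", Integer.toString(i));
                doc.addField("text", "body " + i);
                solr.add(doc);  // no commit inside the loop
            }
            solr.commit();      // one commit when everything is indexed
        }
    }
}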




-Hoss