From: Grant Ingersoll <gsing...@apache.org>
Subject: Re: Importing large datasets
To: solr-user@lucene.apache.org
Date: Wednesday, June 2, 2010, 3:42 AM
On Jun 1, 2010, at 9:54 PM, Blargy wrote:
> We have around 5 million items in our index and each item has a
> description located on a separate physical database. These item
> descriptions vary in size and for the most part are quite large.
> Currently we are only indexing items and not their corresponding
> descriptions, and a full import takes around 4 hours. Ideally we want
> to index both our items and their descriptions, but after some quick
> profiling I determined that a full import would take in excess of
> 24 hours.
> - How would I profile the indexing process to determine whether the
> bottleneck is Solr or our database?
As a data point, I routinely see clients index 5M items on normal
hardware in approx. 1 hour (give or take 30 minutes).

When you say "quite large", what do you mean? Are we talking books
here, or a couple of pages of text, or just a couple KB of data?

How long does it take you to get that data out (and, from the sounds
of it, merge it with your item) w/o going to Solr?
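
One quick way to answer that is to run the same fetch-and-merge loop
on its own and throw the rows away instead of indexing them. Here is a
minimal sketch of that idea using plain JDBC; the JDBC URL, table, and
column names are hypothetical placeholders, and for simplicity it
pretends both tables are reachable from one connection (with your
separate description database you would time the two fetches plus the
merge instead):

import java.sql.*;

// Times the DB side alone: fetch items, pull in descriptions, discard
// the result. If this alone is anywhere near your 24-hour estimate,
// the bottleneck is the database, not Solr.
public class FetchTimer {
    public static void main(String[] args) throws Exception {
        long start = System.currentTimeMillis();
        long count = 0;
        try (Connection conn = DriverManager.getConnection("jdbc:..."); // placeholder URL
             Statement stmt = conn.createStatement()) {
            stmt.setFetchSize(500); // hint: stream rows instead of buffering the whole table
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT i.id, i.name, d.description"
                    + " FROM items i JOIN item_descriptions d ON d.item_id = i.id")) {
                while (rs.next()) {
                    rs.getString("description"); // force the driver to materialize the value
                    count++;
                }
            }
        }
        long secs = (System.currentTimeMillis() - start) / 1000;
        System.out.println(count + " rows in " + secs + "s");
    }
}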
> - In either case, how would one speed up this process? Is there a way
> to run parallel import processes and then merge them together at the
> end? Possibly use some sort of distributed computing?
DataImportHandler now supports multiple threads. The absolute fastest
way that I know of to index is via multiple threads sending batches of
documents at a time (at least 100). Often, with databases, one can
split up the table via SQL statements that can then be fetched
separately. You may want to write your own multithreaded client to
index.
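
For example, here is a minimal sketch of such a client, assuming SolrJ
and a numeric primary key you can partition into ranges. The Solr URL,
JDBC URL, worker count, and field names are all placeholder
assumptions, and the query again pretends items and descriptions are
joinable from one connection; adapt it to however you merge the two
databases:

import java.sql.*;
import java.util.*;
import java.util.concurrent.*;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

// Each worker owns one ID range of the items table, streams it over
// its own DB connection, and sends documents to Solr in batches.
public class ParallelIndexer {
    static final int BATCH = 500; // batch size; Grant suggests at least 100

    public static void main(String[] args) throws Exception {
        int workers = 4;
        long maxId = 5_000_000L; // in practice, SELECT MAX(id) up front
        long slice = maxId / workers + 1;
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int w = 0; w < workers; w++) {
            final long lo = w * slice, hi = lo + slice;
            pool.submit(() -> { indexRange(lo, hi); return null; });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }

    static void indexRange(long lo, long hi) throws Exception {
        try (HttpSolrClient solr =
                 new HttpSolrClient.Builder("http://localhost:8983/solr/items").build();
             Connection conn = DriverManager.getConnection("jdbc:..."); // placeholder URL
             PreparedStatement ps = conn.prepareStatement(
                 "SELECT i.id, i.name, d.description"
                 + " FROM items i JOIN item_descriptions d ON d.item_id = i.id"
                 + " WHERE i.id >= ? AND i.id < ?")) {
            ps.setLong(1, lo);
            ps.setLong(2, hi);
            List<SolrInputDocument> batch = new ArrayList<>(BATCH);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    SolrInputDocument doc = new SolrInputDocument();
                    doc.addField("id", rs.getLong("id"));
                    doc.addField("name", rs.getString("name"));
                    doc.addField("description", rs.getString("description"));
                    batch.add(doc);
                    if (batch.size() == BATCH) { solr.add(batch); batch.clear(); }
                }
            }
            if (!batch.isEmpty()) solr.add(batch);
            solr.commit(); // one commit per worker; a single commit at the end also works
        }
    }
}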
--------------------------
Grant Ingersoll
http://www.lucidimagination.com/
Search the Lucene ecosystem using Solr/Lucene:
http://www.lucidimagination.com/search