On 3 Jun 2010, at 02:58, Dennis Gearon <gear...@sbcglobal.net> wrote:

When adding data continuously, that data is available after committing and is indexed, right?
Yes
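For example, with SolrJ (a minimal sketch against a local Solr 1.4 instance; the field names are made up):

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CommitVisibility {
        public static void main(String[] args) throws Exception {
            SolrServer server =
                    new CommonsHttpSolrServer("http://localhost:8983/solr");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "item-1");         // hypothetical fields
            doc.addField("name", "example item");
            server.add(doc);
            // At this point the doc is indexed but not yet searchable;
            // it only shows up in query results after the commit.
            server.commit();
        }
    }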

If so, how often does reindexing do any good?
You should only need to reindex if the data changes or you change your schema. The DataImportHandler (DIH) in Solr 1.4 supports delta imports, so you should only really be adding or updating (which is actually deleting and re-adding) items when necessary.
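Triggering a delta is just a command against the DIH handler; for example (a sketch, assuming DIH is registered at /dataimport in solrconfig.xml and deltaQuery/deltaImportQuery are already defined in data-config.xml):

    import java.io.InputStream;
    import java.net.URL;

    public class DeltaImportTrigger {
        public static void main(String[] args) throws Exception {
            // Fires a delta-import against the configured DIH handler.
            URL url = new URL(
                "http://localhost:8983/solr/dataimport?command=delta-import");
            InputStream in = url.openStream();
            in.close();
            // Poll .../dataimport?command=status to watch progress.
        }
    }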

Dennis Gearon



--- On Wed, 6/2/10, Andrzej Bialecki <a...@getopt.org> wrote:

From: Andrzej Bialecki <a...@getopt.org>
Subject: Re: Importing large datasets
To: solr-user@lucene.apache.org
Date: Wednesday, June 2, 2010, 4:52 AM
On 2010-06-02 13:12, Grant Ingersoll wrote:

On Jun 2, 2010, at 6:53 AM, Andrzej Bialecki wrote:

On 2010-06-02 12:42, Grant Ingersoll wrote:

On Jun 1, 2010, at 9:54 PM, Blargy wrote:


We have around 5 million items in our index, and each item has a description located in a separate physical database. These item descriptions vary in size and for the most part are quite large. Currently we are only indexing items and not their corresponding descriptions, and a full import takes around 4 hours. Ideally we want to index both our items and their descriptions, but after some quick profiling I determined that a full import would take in excess of 24 hours.

- How would I profile the indexing process to determine whether the bottleneck is Solr or our database?

As a data point, I routinely see clients index 5M items on normal hardware in approx. 1 hour (give or take 30 minutes).

When you say "quite large", what do you mean? Are we talking books here, or maybe a couple of pages of text, or just a couple of KB of data?

How long does it take you to get that data out (and, from the sounds of it, merge it with your item) w/o going to Solr?
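One crude way to find out is to time the raw fetch/merge by itself, without sending anything to Solr. A JDBC sketch (the connection string, table, and column names are made up):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class FetchTimer {
        public static void main(String[] args) throws Exception {
            // Hypothetical connection string and schema.
            Connection conn = DriverManager.getConnection(
                    "jdbc:mysql://dbhost/items", "user", "pass");
            Statement stmt = conn.createStatement();

            long start = System.currentTimeMillis();
            ResultSet rs = stmt.executeQuery(
                    "SELECT i.id, i.name, d.description "
                  + "FROM item i JOIN item_description d ON d.item_id = i.id");
            long count = 0;
            while (rs.next()) {
                rs.getString("description");  // force the data to be read
                count++;
            }
            long elapsed = System.currentTimeMillis() - start;
            System.out.println(count + " rows in " + elapsed + " ms");
            // If this alone approaches the 24h estimate, the DB (or the
            // join) is the bottleneck, not Solr.
            conn.close();
        }
    }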

- In either case, how would one speed up this process? Is there a way to run parallel import processes and then merge them together at the end? Possibly use some sort of distributed computing?

DataImportHandler now supports multiple threads. The absolute fastest way that I know of to index is via multiple threads sending batches of documents at a time (at least 100). Often, from DBs one can split up the table via SQL statements that can then be fetched separately. You may want to write your own multithreaded client to index.
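A bare-bones client along those lines might look like this (a sketch, not production code; the URL, id ranges, and fields are made up, and each worker would really fill its documents from a DB query such as "SELECT ... WHERE id BETWEEN ? AND ?"):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class ParallelIndexer {
        static final int BATCH_SIZE = 100;  // "at least 100" docs per add
        static final int THREADS = 4;
        static final int TOTAL = 5000000;   // ~5M items

        public static void main(String[] args) throws Exception {
            // CommonsHttpSolrServer is thread-safe, so one instance is shared.
            final CommonsHttpSolrServer server =
                    new CommonsHttpSolrServer("http://localhost:8983/solr");
            ExecutorService pool = Executors.newFixedThreadPool(THREADS);

            int perThread = TOTAL / THREADS;
            for (int t = 0; t < THREADS; t++) {
                final int from = t * perThread;
                final int to = from + perThread;
                pool.execute(new Runnable() {
                    public void run() {
                        try {
                            List<SolrInputDocument> batch =
                                    new ArrayList<SolrInputDocument>(BATCH_SIZE);
                            for (int id = from; id < to; id++) {
                                SolrInputDocument doc = new SolrInputDocument();
                                doc.addField("id", id);  // real fields would
                                batch.add(doc);          // come from the DB row
                                if (batch.size() == BATCH_SIZE) {
                                    server.add(batch);
                                    batch.clear();
                                }
                            }
                            if (!batch.isEmpty()) server.add(batch);
                        } catch (Exception e) {
                            e.printStackTrace();
                        }
                    }
                });
            }
            pool.shutdown();
            pool.awaitTermination(24, TimeUnit.HOURS);
            server.commit();  // one commit at the end, not per batch
        }
    }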

SOLR-1301 is also an option if you are familiar with Hadoop ...


If the bottleneck is the DB, will that do much?


Nope. But the workflow could be set up so that during night hours a DB export takes place that results in a CSV or SolrXML file (there you could measure the time it takes to do this export), and then indexing can work from this file.
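Indexing that file can then be as simple as streaming it at the CSV handler; roughly (a sketch, assuming the CSV handler is enabled at /update/csv, the file path is hypothetical, and the file's first line names the fields):

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class CsvPoster {
        public static void main(String[] args) throws Exception {
            URL url = new URL(
                "http://localhost:8983/solr/update/csv?commit=true");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type",
                    "text/plain; charset=utf-8");

            // Stream the nightly export straight into the request body.
            InputStream in = new FileInputStream("items-export.csv");
            OutputStream out = conn.getOutputStream();
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) != -1; ) {
                out.write(buf, 0, n);
            }
            out.close();
            in.close();
            System.out.println("Solr responded: " + conn.getResponseCode());
            conn.disconnect();
        }
    }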


--
Best regards,
Andrzej Bialecki     <><
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

