Otis Gospodnetic wrote:
Hi,
Some quick notes, since it's late here.
- You'll need to wait for SOLR-303 - there is no way even a big machine will be
able to search such a large index in a reasonable amount of time, plus you may
simply not have enough RAM for such a large index.
Are you basing this on data similar to what Mike Klaas outlines?
Quoting Mike Klaas:
"That's 280K tokens per document, assuming ~5 chars/word. That's 2
trillion tokens. Lucene's posting list compression is decent, but
you're still talking about a minimum of 2-4TB for the index (that's
assuming 1 or 2 bytes per token). "
and
"Well, the average compressed posting list will be at least 80MB that
needs to be read from the NAS and decoded and ranked. Since the size is
exponentially distributed, common terms will be much bigger and rarer
terms much smaller."
End of quote from Mike Klaas.
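Just to check that I'm reading those numbers the same way, here is the
back-of-envelope arithmetic I get for the full ~7M-document / ~10 TB
dataset (a rough sketch only; the per-document size, chars-per-token, and
bytes-per-token figures are the ones quoted above):

    # Back-of-envelope check of the index-size estimate (rough sketch;
    # figures taken from Mike's numbers and this thread).
    docs = 7_000_000            # projected full dataset (~10 TB of OCR)
    bytes_per_doc = 1.4e6       # ~1.4 MB of OCR text per document
    chars_per_token = 5         # ~5 chars/word

    tokens_per_doc = bytes_per_doc / chars_per_token   # ~280K tokens/doc
    total_tokens = docs * tokens_per_doc               # ~2 trillion tokens

    for bytes_per_token in (1, 2):
        tb = total_tokens * bytes_per_token / 1e12
        print(f"{bytes_per_token} byte(s)/token -> ~{tb:.1f} TB of postings")
    # prints ~2.0 TB and ~3.9 TB, i.e. the 2-4 TB range quoted above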
We would need all 7M ids scored so we could push them through a filter
query to reduce them to a much smaller number, on the order of
100-10,000, representing just those that correspond to items in a
collection.
So, to ask again: do you think it's possible to do this in, say, under 15
seconds? (I think I'm giving up on 0.5 sec...)
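To make the filter-query idea concrete, here is a minimal sketch of the
kind of request we have in mind. The host and the field names (ocr_text,
collection_id) are placeholders for whatever our schema ends up being,
but q, fq, fl, and rows are standard Solr parameters:

    # Minimal sketch: a scored full-text query restricted to one collection
    # via a filter query, so only the matching 100-10,000 ids come back.
    # Host and field names (ocr_text, collection_id) are placeholders.
    import json
    import urllib.parse
    import urllib.request

    params = {
        "q": "ocr_text:(some search terms)",  # scored full-text query
        "fq": "collection_id:1234",           # unscored, cacheable filter
        "fl": "id,score",                     # return only id and score
        "rows": 10000,                        # cap on the reduced result set
        "wt": "json",
    }
    url = "http://solr-host:8983/solr/select?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    print(data["response"]["numFound"], "ids match within the collection")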
- I'd suggest you wait for Solr 1.3 (or some -dev version that uses the
about-to-be-released Lucene 2.3)... for performance reasons.
- As for avoiding index duplication - how about having a SAN with a single copy
of the index that all searchers (and the master) point to?
Yes, we're thinking of a single copy of the index, using hardware-based
snapshot technology for the readers, while a dedicated indexing Solr
instance updates the index. Reasonable?
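To spell out the division of labor we're imagining (all hostnames below
are hypothetical): all updates go to the one dedicated indexing instance,
and all queries go to the read-only instances that see the shared copy on
the NAS. Roughly:

    # Rough sketch of the split: one writer, N read-only searchers that all
    # point at the single index copy on the NAS. Hostnames are placeholders.
    import random
    import urllib.parse
    import urllib.request

    INDEXER = "http://indexer-blade:8983/solr"
    READERS = ["http://search-blade-1:8983/solr",
               "http://search-blade-2:8983/solr"]

    def update(xml_body: bytes) -> None:
        """Send <add>/<commit> traffic to the dedicated indexer only."""
        req = urllib.request.Request(
            INDEXER + "/update", data=xml_body,
            headers={"Content-Type": "text/xml; charset=utf-8"})
        urllib.request.urlopen(req).read()

    def query(params: dict) -> bytes:
        """Spread read traffic across the read-only searchers."""
        base = random.choice(READERS)
        return urllib.request.urlopen(
            base + "/select?" + urllib.parse.urlencode(params)).read()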
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
From: Phillip Farber <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Friday, January 18, 2008 5:26:21 PM
Subject: Solr feasibility with terabyte-scale data
Hello everyone,
We are considering Solr 1.2 to index and search a terabyte-scale
dataset
of OCR. Initially our requirements are simple: basic tokenizing, score
sorting only, no faceting. The schema is simple too: a document consists
of a numeric id (stored and indexed) and a large text field (indexed but
not stored) containing the OCR, typically ~1.4 MB. Some limited faceting
or additional metadata fields may be added later.
The data in question currently amounts to about 1.1 TB of OCR (about 1M
docs), which we expect to increase to 10 TB over time. Pilot tests on a
desktop (2.6 GHz P4, 2.5 GB memory, 1 GB Java heap) indexing ~180 MB of
data via HTTP suggest we can index at a rate sufficient to keep up with
the inputs (after getting over the initial 1.1 TB hump). We envision
nightly commits/optimizes.
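For what it's worth, the pilot indexing is just posting documents to the
XML update handler over HTTP, roughly along the lines of the sketch below
(the host and the field names id and ocr_text stand in for our actual
schema):

    # Rough sketch of indexing one OCR document via Solr's XML update handler.
    # Host and field names (id, ocr_text) stand in for our actual schema.
    from xml.sax.saxutils import escape
    import urllib.request

    def add_document(doc_id, ocr_text,
                     solr_url="http://solr-host:8983/solr/update"):
        """POST a single <add>; commits happen separately (nightly)."""
        body = ("<add><doc>"
                '<field name="id">%s</field>'
                '<field name="ocr_text">%s</field>'
                "</doc></add>") % (escape(doc_id), escape(ocr_text))
        req = urllib.request.Request(
            solr_url, data=body.encode("utf-8"),
            headers={"Content-Type": "text/xml; charset=utf-8"})
        urllib.request.urlopen(req).read()

    # e.g. add_document("doc-000001", open("page_ocr.txt").read())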
We expect a low query rate (<10 QPS) and probably will not need
millisecond query response.
Our environment makes available Apache on blade servers (Dell 1955, dual
dual-core 3.x GHz Xeons with 8 GB RAM) connected to a *large*,
high-performance NAS system over a dedicated (out-of-band) GbE switch
(Dell PowerConnect 5324) using a 9K MTU (jumbo frames). We are starting
with 2 blades and will add more as demand requires.
While we have a lot of storage, the idea of master/slave Solr Collection
Distribution to add more Solr instances clearly means duplicating an
immense index. Is it possible to use one instance to update the index on
the NAS while the other instances only read the index and issue commits
to keep their caches warm instead?
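Concretely, by "commit to keep their caches warm" I mean something like
the sketch below: after the shared index on the NAS has been updated,
each read-only instance is sent an empty commit so it opens a new
searcher (and autowarms its caches) against the new files. Hostnames are
placeholders.

    # Minimal sketch: tell each read-only Solr instance to reopen its searcher
    # after the shared index on the NAS has been updated by the indexer.
    import urllib.request

    READERS = ["http://search-blade-1:8983/solr",
               "http://search-blade-2:8983/solr"]

    def reopen_searcher(base_url):
        """POST an empty <commit/> so the instance picks up the new index."""
        req = urllib.request.Request(
            base_url + "/update", data=b"<commit/>",
            headers={"Content-Type": "text/xml; charset=utf-8"})
        urllib.request.urlopen(req).read()

    for reader in READERS:
        reopen_searcher(reader)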
Should we expect Solr indexing time to slow significantly as we scale
up? What kind of query performance could we expect? Is it totally
naive even to consider Solr at this kind of scale?
Given these parameters is it realistic to think that Solr could handle
the task?
Any advice/wisdom greatly appreciated,
Phil