Yatir,

I actually think you may be OK with a single machine for 60M docs, though.
You should be able to quickly do a test where you use SolrJ to post to Solr and 
get docs/second.

Use SOlr 1.3.  Use 2-3 indexing threads going against a single Solr instance.  
Increase the buffer size param and increase mergeFactor slightly.  Then 
determine docs/sec and see if that's high enough for you.  I'll bet it will be 
enough, unless you have some crazy analyzers.

TSVs will be faster, but if it takes you 3-4 hours to assemble them every 
night, the overall time may not be shorter.

But this is just indexing.  You may want to copy the index to a different 
box(es) for searching, as you don't wnat the high indexing IO to affect 
searching.

Your QPS is low and 5 sec for query latency should give you plenty of room.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



----- Original Message ----
> From: "Ben Shlomo, Yatir" <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Wednesday, September 24, 2008 2:50:54 AM
> Subject: help required: how to design a large scale solr system
> 
> Hi!
> 
> I am already using solr 1.2 and happy with it.
> 
> In a new project with very tight dead line (10 development days from
> today) I need to setup a more ambitious system in terms of scale
> Here is the spec:
> 
> 
> 
> *                                 I need to index about 60,000,000
> documents 
> 
> *         Each document is has 11 textual fields to be indexed & stored
> and 4 more fields to be stored only 
> 
> *         Most fields are short (2-14 characters) however 2 indexed
> fields can be up to 1KB and another stored field is up to 1KB 
> 
> *         On average every document is about 0.5 KB to be stored and
> 0.4KB to be indexed 
> 
> *         The SLA for data freshness is a full nightly re-index ( I
> cannot obtain an incremental update/delete lists of the modified
> documents) 
> 
> *         The SLA for query time is 5 seconds 
> 
> *         the number of expected queries is 2-3 queries per second 
> 
> *         the queries are simple a combination of Boolean operation and
> name searches (no fancy fuzzy searches and levinstien distances, no
> faceting, etc) 
> 
> *         I have a 64 bit Dell 2950 4-cpu machine  (2 dual cores ) with
> RAID 10, 200 GB HD space, and 8GB RAM memory 
> 
> *         The documents are not given to me explicitly - I am given a
> raw-documents in RAM - one by one, from which I create my document in
> RAM.
> and then I can either http-post is to index it directly or append it to
> a tsv file for later indexing 
> 
> *         Each document has a unique ID
> 
> 
> 
> I have a few directions I am thinking about
> 
> 
> 
> The simple approach
> 
> *                                 Have one solr instance that will index
> the entire document set (from files). I am afraid this will take too
> much time
> 
> 
> 
> Direction 1
> 
> *                                 Create TSV files from all the
> documents - this will take around 3-4 hours 
> 
> *                                 Have all the documents partitioned
> into several subsets (how many should I choose? ) 
> 
> *                                 Have multiple solr instances on the
> same machine 
> 
> *                                 Let each solr instance concurrently
> index the appropriate subset 
> 
> *                                 At the end merge all the indices using
> the IndexMergeTool - (how much time will it take ?)
> 
> 
> 
> Direction 2
> 
> *                                 Like  the previous but instead of
> using the IndexMergeTool , use distributed search with shards (upgrading
> to solr 1.3)
> 
> 
> 
> Direction 3,4
> 
> *                                 Like previous directions only avoid
> using TSV files at all and directly index the documents from RAM
> 
> Questions:
> 
> *         Which direction do you recommend in order to meet the SLAs in
> the fastest way? 
> 
> *         Since I have RAID on the machine can I gain performance by
> using multiple solr instances on the same machine or only multiple
> machines will help me 
> 
> *         What's the minimal number of machines I should require (I
> might get more weaker machines) 
> 
> *         How many concurrent indexers are recommended? 
> 
> *         Do you agree that the bottle neck is the indexing time?
> 
> Any help is appreciated 
> 
> Thanks in advance
> 
> yatir

Reply via email to