Yatir, I actually think you may be OK with a single machine for 60M docs, though. You should be able to quickly run a test where you use SolrJ to post to Solr and measure docs/second.
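A minimal harness for that throughput test might look like the sketch below. Note this only exercises the threading/batching skeleton: postBatch() is a counting stub standing in for a real SolrJ call (in Solr 1.3, CommonsHttpSolrServer.add(Collection<SolrInputDocument>) followed by commit()), and the thread/batch counts are illustrative, not tuned.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class IndexThroughputTest {
    static final int THREADS = 3;       // 2-3 indexing threads, per the advice above
    static final int BATCH_SIZE = 500;  // docs per add() call (illustrative)
    static final int TOTAL_DOCS = 10000;

    static final AtomicLong indexed = new AtomicLong();

    // Stand-in for solrServer.add(batch); here it just counts documents.
    static void postBatch(List<String> batch) {
        indexed.addAndGet(batch.size());
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        long start = System.nanoTime();
        List<String> batch = new ArrayList<>();
        for (int i = 0; i < TOTAL_DOCS; i++) {
            batch.add("doc-" + i);  // build your SolrInputDocument here instead
            if (batch.size() == BATCH_SIZE) {
                final List<String> toSend = batch;
                pool.submit(() -> postBatch(toSend));
                batch = new ArrayList<>();
            }
        }
        if (!batch.isEmpty()) {
            final List<String> toSend = batch;
            pool.submit(() -> postBatch(toSend));
        }
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.MINUTES);
        double secs = (System.nanoTime() - start) / 1e9;
        System.out.printf("indexed %d docs, %.0f docs/sec%n",
                indexed.get(), indexed.get() / secs);
    }
}
```

With the stub replaced by a real SolrJ client pointed at a single Solr 1.3 instance, the printed docs/sec figure tells you whether a nightly 60M-doc re-index fits your window.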
Use Solr 1.3. Use 2-3 indexing threads going against a single Solr instance. Increase the buffer size param and increase mergeFactor slightly. Then measure docs/sec and see if that's high enough for you. I'll bet it will be enough, unless you have some crazy analyzers. TSVs will be faster, but if it takes you 3-4 hours to assemble them every night, the overall time may not be shorter.

But this is just indexing. You may want to copy the index to a different box (or boxes) for searching, as you don't want the high indexing I/O to affect searching. Your QPS is low, and 5 sec for query latency should give you plenty of room.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: "Ben Shlomo, Yatir" <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Wednesday, September 24, 2008 2:50:54 AM
> Subject: help required: how to design a large scale solr system
>
> Hi!
>
> I am already using Solr 1.2 and I'm happy with it.
>
> In a new project with a very tight deadline (10 development days from
> today) I need to set up a more ambitious system in terms of scale.
> Here is the spec:
>
> * I need to index about 60,000,000 documents.
> * Each document has 11 textual fields to be indexed & stored,
>   and 4 more fields to be stored only.
> * Most fields are short (2-14 characters); however, 2 indexed
>   fields can be up to 1 KB, and another stored field is up to 1 KB.
> * On average, every document is about 0.5 KB to be stored and
>   0.4 KB to be indexed.
> * The SLA for data freshness is a full nightly re-index (I
>   cannot obtain incremental update/delete lists of the modified
>   documents).
> * The SLA for query time is 5 seconds.
> * The number of expected queries is 2-3 queries per second.
> * The queries are simple: a combination of Boolean operations and
>   name searches (no fancy fuzzy searches or Levenshtein distances,
>   no faceting, etc.).
> * I have a 64-bit Dell 2950 4-CPU machine (2 dual cores) with
>   RAID 10, 200 GB HD space, and 8 GB RAM.
> * The documents are not given to me explicitly - I am given
>   raw documents in RAM, one by one, from which I create my document
>   in RAM, and then I can either http-post it to index it directly
>   or append it to a TSV file for later indexing.
> * Each document has a unique ID.
>
> I have a few directions I am thinking about.
>
> The simple approach:
>
> * Have one Solr instance index the entire document set (from
>   files). I am afraid this will take too much time.
>
> Direction 1:
>
> * Create TSV files from all the documents - this will take
>   around 3-4 hours.
> * Partition all the documents into several subsets (how many
>   should I choose?).
> * Have multiple Solr instances on the same machine.
> * Let each Solr instance concurrently index the appropriate subset.
> * At the end, merge all the indices using the IndexMergeTool
>   (how much time will it take?).
>
> Direction 2:
>
> * Like the previous, but instead of using the IndexMergeTool,
>   use distributed search with shards (upgrading to Solr 1.3).
>
> Directions 3, 4:
>
> * Like the previous directions, only avoid using TSV files at
>   all and directly index the documents from RAM.
>
> Questions:
>
> * Which direction do you recommend in order to meet the SLAs in
>   the fastest way?
> * Since I have RAID on the machine, can I gain performance by
>   using multiple Solr instances on the same machine, or will only
>   multiple machines help me?
> * What's the minimal number of machines I should require (I
>   might get more, weaker machines)?
> * How many concurrent indexers are recommended?
> * Do you agree that the bottleneck is the indexing time?
>
> Any help is appreciated.
>
> Thanks in advance,
> yatir
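For reference, the "buffer size param" and mergeFactor that Otis recommends raising are set in solrconfig.xml. A sketch of the relevant excerpt, with illustrative (untuned) values - larger values favor bulk-indexing throughput at the cost of RAM and some search-time merge debt:

```xml
<!-- solrconfig.xml excerpt: values are illustrative starting points, not tuned -->
<indexDefaults>
  <!-- RAM held before flushing a segment; larger = fewer flushes during bulk indexing -->
  <ramBufferSizeMB>256</ramBufferSizeMB>
  <!-- higher mergeFactor defers merges, speeding indexing but leaving more segments -->
  <mergeFactor>20</mergeFactor>
</indexDefaults>
```

After the nightly bulk load, an optimize (or a lower mergeFactor on the search copy) keeps the segment count down for query latency.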