help required: how to design a large scale solr system
Hi! I am already using Solr 1.2 and happy with it. In a new project with a very tight deadline (10 development days from today) I need to set up a more ambitious system in terms of scale. Here is the spec:

* I need to index about 60,000,000 documents
* Each document has 11 textual fields to be indexed and stored, and 4 more fields to be stored only
* Most fields are short (2-14 characters); however, 2 indexed fields can be up to 1KB and another stored field is up to 1KB
* On average every document is about 0.5KB to be stored and 0.4KB to be indexed
* The SLA for data freshness is a full nightly re-index (I cannot obtain incremental update/delete lists of the modified documents)
* The SLA for query time is 5 seconds
* The number of expected queries is 2-3 queries per second
* The queries are simple: a combination of Boolean operations and name searches (no fancy fuzzy searches or Levenshtein distances, no faceting, etc.)
* I have a 64-bit Dell 2950 4-CPU machine (2 dual cores) with RAID 10, 200GB HD space, and 8GB RAM
* The documents are not given to me explicitly - I am given raw documents in RAM, one by one, from which I create my document in RAM. Then I can either http-post it to index it directly or append it to a TSV file for later indexing
* Each document has a unique ID

I have a few directions I am thinking about.

The simple approach
* Have one Solr instance index the entire document set (from files). I am afraid this will take too much time.

Direction 1
* Create TSV files from all the documents - this will take around 3-4 hours
* Partition all the documents into several subsets (how many should I choose?)
* Have multiple Solr instances on the same machine
* Let each Solr instance concurrently index the appropriate subset
* At the end merge all the indices using the IndexMergeTool (how much time will it take?)

Direction 2
* Like the previous, but instead of using the IndexMergeTool, use distributed search with shards (upgrading to Solr 1.3)

Direction 3, 4
* Like the previous directions, only avoid using TSV files at all and directly index the documents from RAM

Questions:
* Which direction do you recommend in order to meet the SLAs in the fastest way?
* Since I have RAID on the machine, can I gain performance by using multiple Solr instances on the same machine, or will only multiple machines help me?
* What's the minimal number of machines I should require (I might get more, weaker machines)?
* How many concurrent indexers are recommended?
* Do you agree that the bottleneck is the indexing time?

Any help is appreciated. Thanks in advance,
yatir
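Since the documents are built in RAM and then either posted directly or appended to a TSV file, the TSV-writing step might be sketched like this (the field names and the backslash-escaping convention are assumptions for illustration, not from the thread; whatever convention is used must match the parameters later passed to the bulk loader):

```python
# Sketch: serialize in-RAM documents to a TSV file for later bulk loading.
# Field names (id, title, body) and backslash-escaping are assumptions;
# the escaping here must match what the loader is told to expect.

FIELDS = ["id", "title", "body"]  # hypothetical subset of the 15 fields

def escape(value):
    """Escape characters that would break a tab-separated row."""
    return (str(value)
            .replace("\\", "\\\\")
            .replace("\t", "\\t")
            .replace("\n", "\\n"))

def to_tsv_line(doc):
    """Render one document dict as a single TSV line."""
    return "\t".join(escape(doc.get(f, "")) for f in FIELDS)

def write_tsv(docs, path):
    """Write a header row plus one line per document."""
    with open(path, "w", encoding="utf-8") as out:
        out.write("\t".join(FIELDS) + "\n")
        for doc in docs:
            out.write(to_tsv_line(doc) + "\n")
```

With short fields like these, writing 60M lines is mostly disk-bound, which is consistent with the 3-4 hour estimate above.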
Re: help required: how to design a large scale solr system
From my limited experience: I think you might have a bit of trouble getting 60 million docs on a single machine. Cached queries will probably still be *very* fast, but non-cached queries are going to be very slow in many cases. Is that 5 seconds for all queries? You will never meet that on first-run queries with 60M docs on that machine. The light query load might make things workable... but you're near the limits of a single machine (4 cores or not) with 60M. You want to use a very good stopword list - common-term queries will be killer. The docs being so small will be your only possible savior if you go the one-machine route - that and cached hits. You don't have enough RAM to get as much of the filesystem into RAM as you'd like for 60M docs either.

I think you might try two machines with 30M, 3 with 20M, or 4 with 15M. The more you spread, even with slower machines, the faster you're likely to index, which, as you say, will take a long time for 60M docs (start today <g>). Multiple machines will help the indexing speed the most for sure - it's still going to take a long time. I don't think you will get much advantage using more than one Solr install on a single machine - if you do, that should be addressed in the code, even with RAID. So I say, spread if you can. Faster indexing, faster search, easy to expand later. Distributed search is so easy with Solr 1.3, you won't regret it.

I think there is a bug to be addressed if you're needing this in a week, though - in my experience with distributed search, for every million docs on a machine beyond the first, you lose a doc in a search across all machines (i.e. with 1M on machine 1 and 1M on machine 2, a *:* search will be missing 1 doc; with 10M each on 3 machines, a *:* search will be missing 30). Not a big deal, but could be a concern for some with picky, look-at-everything customers.

- Mark
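Spreading the 60M docs across 2-4 machines, as Mark suggests, raises the question of how to assign documents to shards. Solr 1.3 distributed search leaves partitioning to the indexer; one common, stable scheme (a sketch, not anything the thread prescribes) is to hash the unique document ID:

```python
# Sketch: deterministic assignment of documents to N shards by unique ID.
# Hashing the ID keeps the assignment stable across a full nightly
# re-index, so the same doc always lands on the same shard.
import zlib

def shard_for(doc_id, num_shards):
    """Map a document ID to a shard index in [0, num_shards)."""
    # crc32 rather than hash() so the result is stable across runs
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards

def partition(doc_ids, num_shards):
    """Group document IDs into per-shard lists."""
    shards = [[] for _ in range(num_shards)]
    for doc_id in doc_ids:
        shards[shard_for(doc_id, num_shards)].append(doc_id)
    return shards
```

At query time the shards are then joined with the `shards=` request parameter of Solr 1.3 distributed search.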
RE: help required: how to design a large scale solr system
Thanks, Mark! Do you have any comment regarding the performance differences between indexing TSV files as opposed to directly indexing each document via HTTP POST?

-----Original Message-----
From: Mark Miller [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 24, 2008 2:12 PM
To: solr-user@lucene.apache.org
Subject: Re: help required: how to design a large scale solr system
Re: help required: how to design a large scale solr system
Hi, I'm very new to search engines in general. I've been using the Zend_Search_Lucene class before to try Lucene, and though it surely works, it's not what I'm looking for performance-wise. I recently installed Solr on a newly installed Ubuntu (Hardy Heron) machine. I have about 207k docs (currently, and I'm getting about 100k each month from now on), and that's why I decided to throw myself into something real for once. As I'm learning from today, I was wondering two main things. I'm using Jetty as the Java container, and PHP5 to handle the search requests from an agent.

If I start Solr using java -jar start.jar in the example directory, everything works fine. I even manage to populate the index with the example data as documented in the tutorials. How can I set up Solr to run as a service, so I don't need to have an SSH connection open? Sorry for being stupid here, btw.

I'm working on a multi-lingual search. So a company (doc) exists in, say, Poland; what schema design should I read/work on to be able to write Poland/Polen/Polska (Poland in different languages) and still hit the same results? I have the data from geonames.org for this, but I can't really grasp how I should be working the schema.xml. The easiest solution would be to populate each document with each possible hit word, but this would give me a bunch of duplicates.

Yours, Martin Iwanowski
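For the Poland/Polen/Polska case, one possible approach (a sketch only; the field type name is invented, and the synonyms file would be generated from the geonames alternate-names data) is a field type that expands synonyms at index time via Solr's SynonymFilterFactory:

```xml
<!-- schema.xml sketch: a field type that expands place-name synonyms at
     index time. Assumes a synonyms.txt generated from geonames, e.g.:
         Poland,Polen,Polska
-->
<fieldType name="text_geo" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
```

With `expand="true"`, indexing any one of the names indexes all of them, so a query for "Polska" matches a document that only said "Poland" - without duplicating the document itself.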
Re: help required: how to design a large scale solr system
Yes. You will definitely see a speed increase by avoiding HTTP (especially doc-at-a-time HTTP) and using the direct CSV loader: http://wiki.apache.org/solr/UpdateCSV

- Mark

Ben Shlomo, Yatir wrote: Thanks, Mark! Do you have any comment regarding the performance differences between indexing TSV files as opposed to directly indexing each document via HTTP POST?
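The CSV loader Mark links to is driven entirely by request parameters; for a tab-separated file the request might be assembled like this (host, port, and file path are placeholders; the parameter names come from the UpdateCSV wiki page):

```python
# Sketch: build the request URL for Solr's CSV/TSV bulk loader
# (http://wiki.apache.org/solr/UpdateCSV). Host, port, and the TSV
# path are placeholders, not values from the thread.
from urllib.parse import urlencode

def csv_update_url(base="http://localhost:8983/solr",
                   tsv_path="/data/docs.tsv"):
    params = {
        "separator": "\t",        # tab-separated rather than comma
        "escape": "\\",           # must match how the file was written
        "stream.file": tsv_path,  # let Solr read the file from local disk
        "commit": "true",         # commit once the whole file is loaded
    }
    return base + "/update/csv?" + urlencode(params)
```

Using `stream.file` lets Solr stream the TSV from its own disk, skipping the HTTP upload of the file body entirely, which is where much of the speed-up over doc-at-a-time posting comes from.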
Re: help required: how to design a large scale solr system
On Wed, 24 Sep 2008 07:46:57 -0400, Mark Miller [EMAIL PROTECTED] wrote:

Yes. You will definitely see a speed increase by avoiding HTTP (especially doc-at-a-time HTTP) and using the direct CSV loader. http://wiki.apache.org/solr/UpdateCSV

...and for the obvious reason that if, for whatever reason, something breaks while you are indexing directly from memory, can you restart the import? It may be just easier to keep it on disk and keep track of where you are up to in adding to the index...

B

_
{Beto|Norberto|Numard} Meijome

Sysadmins can't be sued for malpractice, but surgeons don't have to deal with patients who install new versions of their own innards.

I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
Re: help required: how to design a large scale solr system
Norberto Meijome wrote: ...it may be just easier to keep it on disk and keep track of where you are up to in adding to the index...

Nothing to stop you from breaking up the tsv/csv files into multiple tsv/csv files.
Re: help required: how to design a large scale solr system
On Wed, 24 Sep 2008 11:45:34 -0400, Mark Miller [EMAIL PROTECTED] wrote:

Nothing to stop you from breaking up the tsv/csv files into multiple tsv/csv files.

Absolutely agreeing with you... in one system where I implemented Solr, I have a process run through the file system and lazily pick up new files as they come in. If something breaks (and it will, as the files are user-generated in many cases...), report it / leave it for later... move on.

b

_
{Beto|Norberto|Numard} Meijome

I used to hate weddings; all the Grandmas would poke me and say, "You're next, sonny!" They stopped doing that when I started doing it to them at funerals.

I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
Re: help required: how to design a large scale solr system
Yatir,

I actually think you may be OK with a single machine for 60M docs, though. You should be able to quickly do a test where you use SolrJ to post to Solr and measure docs/second. Use Solr 1.3. Use 2-3 indexing threads going against a single Solr instance. Increase the buffer size param and increase mergeFactor slightly. Then determine docs/sec and see if that's high enough for you. I'll bet it will be enough, unless you have some crazy analyzers. TSVs will be faster, but if it takes you 3-4 hours to assemble them every night, the overall time may not be shorter.

But this is just indexing. You may want to copy the index to a different box (or boxes) for searching, as you don't want the high indexing IO to affect searching. Your QPS is low, and 5 sec for query latency should give you plenty of room.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Ben Shlomo, Yatir [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Wednesday, September 24, 2008 2:50:54 AM
Subject: help required: how to design a large scale solr system
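The "determine docs/sec" test Otis suggests has a concrete target: the full re-index of 60M documents must fit inside the nightly window. A quick back-of-the-envelope check (the 8-hour window is an assumption; the thread only says "nightly"):

```python
# Sketch: required sustained indexing rate for the nightly full re-index.
# The 8-hour window is an assumed value; only "nightly" is given.

def required_docs_per_sec(total_docs, window_hours):
    """Docs/sec needed to index total_docs within the window."""
    return total_docs / (window_hours * 3600)

rate = required_docs_per_sec(60_000_000, 8)
# roughly 2083 docs/sec overall; split across 3 shards,
# each indexer needs to sustain about 694 docs/sec
```

Comparing the measured SolrJ docs/sec against this number directly answers whether one machine is enough or the index must be sharded.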
Re: help required: how to design a large scale solr system
Martin Iwanowski wrote: How can I setup to run Solr as a service, so I don't need to have a SSH connection open? The advice that I was given on this very list was to use daemontools. I set it up and it is really great - starts when the machine boots, auto-restart on failures, easy to bring up/down on demand. Search the archive for my post on the subject, I explained how to set it up in detail. (I've also had success using launchd to manage Solr on Mac OS X in case anyone wants to try running it on their desktop.) -jsd-