Re: Shared Directory for two Solr Clouds(Writer and Reader)
Hi Erick,

Thanks for your kind reply.

To handle a larger document volume in SolrCloud, we plan to use many collections, each of which will also have several shards. The basic idea is that when a collection fills up with data, we create a new one; that is, we keep creating collections to hold more data. Currently we plan to use each file's timestamp to decide which collection receives which documents. For example, we would create a new collection roughly every day or every hour, but not at a fixed interval: we maintain the start/end time of each collection, and once a collection reaches its document-count limit, we create a new one. To avoid too many active collections in one SolrCloud, we are also considering unloading (disabling) old collections without deleting their index files, so that in the future a collection could be enabled again on demand, in a different SolrCloud. This is how we plan to deal with many documents.

The problem is that our document ingest rate is very high, so most of the resources (CPU/memory) in Solr are consumed by indexing. When we run queries on the same machines, the queries can be slow due to the lack of resources, and the queries in turn reduce indexing performance. So we have been investigating using two separate Solr Clouds, one for indexing and the other for query. The two clouds share data but use separate computing resources.

Here is what we have already set up in our prototype.

Setup

Currently we set up two separate Solr Clouds:

1. Two Solr Clouds.
2. One ZooKeeper per SolrCloud. The indexing SolrCloud needs to know the search SolrCloud's ZooKeeper address, but the search SolrCloud doesn't need to know the indexing ZooKeeper.
3. The indexing SolrCloud and query SolrCloud have the same collection name and the same number of shards for the collection.
4. The indexing SolrCloud and query SolrCloud use their own solrHome, but the index data directories are shared between them.
5. In the indexing SolrCloud, each shard has only one node.
6. In the query SolrCloud, each shard can have more than one node for more query capacity.

How it works

To keep a consistent view between the indexing and search Solr Clouds:

1. The search Solr Cloud has no updateHandler/commits. It uses ReadOnlyDirectory(Factory) and NoOpUpdateHandler for /update, and sets solrcloud.skip.autorecovery=true.
2. The search Solr Cloud doesn't open a searcher by itself. It opens a searcher only when it receives an openSearcher command from the indexing Solr Cloud.
3. The indexing Solr Cloud sends the openSearcher command to the search Solr Cloud after commit. That is, after each commit on the indexing SolrCloud, it schedules the openSearcher command with a remoteOpenSearcherMaxDelayAfterCommit interval. After the interval (default 80 seconds), the indexing SolrCloud sends the openSearcher command to the search Solr Cloud.
4. The indexing SolrCloud has its own deletionPolicy to keep old commit points that might still be used by queries running on the search cloud. Currently the indexing SolrCloud keeps the last 20 minutes of commit points.

Any feedback or opinions would be very helpful to us. Thanks in advance.

Jae

On Tue, Oct 21, 2014 at 7:30 AM, Erick Erickson erickerick...@gmail.com wrote: [quoted text elided]
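The time-bucketed collection scheme described above (roll to a new collection on a document-count limit, keep start/end times per collection so old, unloaded collections can be found again) can be sketched roughly as follows. This is a minimal illustration of the routing idea only; the class names, the `max_docs` limit, and the helper are mine, not the actual prototype code:

```python
from dataclasses import dataclass

@dataclass
class Collection:
    name: str
    start_ts: int   # first timestamp routed to this collection
    end_ts: int     # last timestamp routed so far (grows as docs arrive)
    doc_count: int

class TimeBucketRouter:
    """Route documents to collections by timestamp; roll over when full."""

    def __init__(self, max_docs=1_000_000):
        self.max_docs = max_docs        # illustrative limit, not a real value
        self.collections = []

    def route(self, timestamp: int) -> str:
        # Fill the newest collection while it has room, extending its range;
        # there is no fixed interval -- rollover is driven by doc count.
        if self.collections:
            cur = self.collections[-1]
            if cur.doc_count < self.max_docs:
                cur.end_ts = max(cur.end_ts, timestamp)
                cur.doc_count += 1
                return cur.name
        # Current collection is full (or none exists yet): start a new one.
        c = Collection(name=f"coll_{len(self.collections)}",
                       start_ts=timestamp, end_ts=timestamp, doc_count=1)
        self.collections.append(c)
        return c.name

    def collections_for_range(self, t0: int, t1: int):
        """Which (possibly unloaded) collections overlap a query time range."""
        return [c.name for c in self.collections
                if c.start_ts <= t1 and t0 <= c.end_ts]
```

The per-collection start/end bookkeeping is what makes re-enabling old collections on demand possible: a time-ranged query only needs the collections whose ranges overlap it.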
Re: Shared Directory for two Solr Clouds(Writer and Reader)
Hmmm, I sure hope you have _lots_ of shards. At that rate, a single shard is probably going to run up against internal limits in a _very_ short time (the most docs I've seen successfully served on a single shard run around 300M). It seems, to handle any reasonable retention period, you need lots and lots and lots of physical machines out there. Which hints at using regular SolrCloud, since each machine would then be handling much less of the load.

This is what I mean by the XY problem. Your setup, at least from what you've told us so far, has so many unknowns that it's impossible to say much.

If you go with your original e-mail and get it all set up and running on, say, 3 shards, it would work fine for about an hour. At that point you would have 300M docs on each shard and your query performance would start having... problems. You'd be hitting the hard limit of 2B docs/shard in less than 10 hours. And all the work you've put into this complex coordination setup would be totally wasted.

So, you _really_ have to explain a lot more about the problem before we talk about writing code. You might want to review: http://wiki.apache.org/solr/UsingMailingLists

Best,
Erick

On Tue, Oct 21, 2014 at 12:34 AM, Jaeyoung Yoon jaeyoungy...@gmail.com wrote: [quoted text elided]
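Erick's back-of-envelope numbers can be checked directly. This sketch assumes his hypothetical of 3 shards and a uniform spread of Jae's stated 300K docs/sec:

```python
# Back-of-envelope check of the shard-capacity argument above.
INGEST_RATE = 300_000               # docs/sec, from Jae's numbers
SHARDS = 3                          # Erick's hypothetical shard count
LUCENE_HARD_LIMIT = 2_147_483_647   # max docs per Lucene index (signed 32-bit int)
COMFORT_LIMIT = 300_000_000         # most docs Erick reports seeing served well per shard

docs_per_shard_per_hour = INGEST_RATE * 3600 // SHARDS   # 360M docs/shard/hour

# Hours until each shard passes the "comfortable" size, then the hard limit.
hours_to_comfort = COMFORT_LIMIT / docs_per_shard_per_hour
hours_to_hard_limit = LUCENE_HARD_LIMIT / docs_per_shard_per_hour

print(f"{docs_per_shard_per_hour:,} docs/shard/hour")
print(f"comfortable size reached in ~{hours_to_comfort:.1f} h")
print(f"hard limit reached in ~{hours_to_hard_limit:.1f} h")
```

At 360M docs/shard/hour, the 300M "comfortable" size is passed in under an hour and the 2.1B hard limit in roughly 6 hours, consistent with Erick's "about an hour" and "less than 10 hours".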
Shared Directory for two Solr Clouds(Writer and Reader)
Hi Folks,

Here are some of my ideas for using a shared file system with two separate Solr Clouds (a Writer Solr Cloud and a Reader Solr Cloud). I would like to get your valuable feedback.

As a prototype, I set up two separate Solr Clouds (one for the Writer and the other for the Reader). The big picture of my prototype is as follows:

1. The Reader and Writer Solr Clouds share the same directory.
2. The Writer SolrCloud sends openSearcher commands to the Reader Solr Cloud inside a postCommit event handler. That is, when new data are added to the Writer Solr Cloud, the Writer sends an openSearcher command to the Reader Solr Cloud.
3. The Reader opens a searcher only when it receives an openSearcher command from the Writer SolrCloud.
4. The Writer has its own deletionPolicy to keep old commit points that might still be used by queries running on the Reader Solr Cloud while a new searcher is opened on the Reader SolrCloud.
5. The Reader has no updates and no commits; everything on the Reader Solr Cloud is read-only. It also creates its searcher from the directory, not from the indexer (nrtMode=false).

That is, in the Writer Solr Cloud I added a postCommit event listener. Inside the listener, the Writer sends an openSearcher command to a handler on the Reader Solr Cloud. The Reader Solr Cloud then opens a searcher directly, without a commit, and returns the Writer's request. With this approach, the Writer and Reader can use the same commit points in the shared file system in a synchronous way.

When a Reader SolrCloud starts, it doesn't open a searcher by itself. Instead, the Writer Solr Cloud listens to the ZooKeeper of the Reader Solr Cloud; on any change in the Reader SolrCloud, the Writer sends an openSearcher command to the Reader Solr Cloud.

Does it make sense? Or am I missing some important stuff? Any feedback would be very helpful to me.

Thanks,
Jae
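For the writer side, the two config pieces described above (a postCommit listener and a time-based deletion policy) might look roughly like this solrconfig.xml sketch. The listener class `com.example.RemoteOpenSearcherListener` and its parameters are hypothetical stand-ins for the prototype's custom code; `solr.SolrDeletionPolicy` with `maxCommitAge`/`maxCommitsToKeep` is standard Solr configuration:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hypothetical custom listener that sends the openSearcher command
       to the Reader cloud after each hard commit. -->
  <listener event="postCommit" class="com.example.RemoteOpenSearcherListener">
    <str name="readerZkHost">reader-zk:2181</str>
    <int name="remoteOpenSearcherMaxDelayAfterCommit">80</int>
  </listener>
</updateHandler>

<indexConfig>
  <!-- Keep commit points around for 20 minutes so searchers on the Reader
       cloud can finish queries against older commits before their files
       are deleted out from under them. -->
  <deletionPolicy class="solr.SolrDeletionPolicy">
    <str name="maxCommitsToKeep">1</str>
    <str name="maxCommitAge">20MINUTES</str>
  </deletionPolicy>
</indexConfig>
```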
Re: Shared Directory for two Solr Clouds(Writer and Reader)
Hi Jae,

Sounds a bit complicated and messy to me, but maybe I'm missing something. What are you trying to accomplish with this approach? Which problems do you have that are making you look for a non-straightforward setup?

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/

On Mon, Oct 20, 2014 at 7:35 PM, Jaeyoung Yoon jaeyoungy...@gmail.com wrote: [quoted text elided]
Re: Shared Directory for two Solr Clouds(Writer and Reader)
I guess I'm not quite sure what the point is. So can you back up a bit and explain what problem this is trying to solve?

Because all it really appears to be doing that's not already done with stock Solr is saving some disk space, and perhaps your reader SolrCloud has some more cycles to devote to serving queries rather than indexing.

So I'm curious why (1) a standard SolrCloud with selective hard and soft commits doesn't satisfy the need, and (2) if (1) is not reasonable, why older-style master/slave replication doesn't work.

Unless there's a compelling use-case for this, it seems like there's a lot of complexity here for questionable value. Please note I'm not saying this is a bad idea. It would just be good to understand what problem it's trying to solve. I'm reluctant to introduce complexity without discussing the use-case. Perhaps the existing code could provide a good-enough solution.

Best,
Erick

On Mon, Oct 20, 2014 at 7:35 PM, Jaeyoung Yoon jaeyoungy...@gmail.com wrote: [quoted text elided]
Re: Shared Directory for two Solr Clouds(Writer and Reader)
In my case, the ingest rate is very high (above 300K docs/sec) and data keep being inserted, so CPU is already a bottleneck because of indexing. Older-style master/slave replication over http or scp takes too long to copy big files from master to slave. That's why I set up two separate Solr Clouds, one for indexing and the other for query.

Thanks,
Jae

On Mon, Oct 20, 2014 at 6:22 PM, Erick Erickson erickerick...@gmail.com wrote: [quoted text elided]
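The writer/reader coordination discussed in this thread (writer commits, notifies the reader to reopen, and retains 20 minutes of commit points so in-flight reader queries never lose their files) can be modeled as a toy sketch. All names here are mine, and the notification is simplified to a direct call; the real prototype sends the command to the other cloud after a configurable delay:

```python
class Reader:
    """Toy model of the search cloud: read-only, reopens only on command."""
    def __init__(self):
        self.current_commit = None

    def open_searcher(self, commit_id):
        # Open a searcher directly from the shared directory at this commit.
        self.current_commit = commit_id

class Writer:
    """Toy model of the indexing cloud's commit / notify / purge cycle."""
    def __init__(self, reader, retention_secs=20 * 60):
        self.reader = reader
        self.retention_secs = retention_secs   # keep 20 min of commit points
        self.commit_points = []                # (commit_id, commit time in secs)

    def commit(self, commit_id, now):
        self.commit_points.append((commit_id, now))
        # Simplification: notify the reader immediately; the prototype
        # schedules this with remoteOpenSearcherMaxDelayAfterCommit.
        self.reader.open_searcher(commit_id)
        self._purge(now)

    def _purge(self, now):
        # Custom deletion policy: drop only commit points older than the
        # retention window, so reader queries against recent commits are safe.
        self.commit_points = [(c, t) for c, t in self.commit_points
                              if now - t <= self.retention_secs]
```

The key property the retention window buys is that a reader searcher opened against any commit from the last 20 minutes still finds all of that commit's files on the shared directory.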