Re: Shared Directory for two Solr Clouds(Writer and Reader)

2014-10-29 Thread Jaeyoung Yoon
Hi Erick,

Thanks for your kind reply.

In order to deal with more documents in SolrCloud, we plan to use many
collections, and each collection will also have several shards. The basic
idea is that when a collection is filled with data, we will create a new
collection; that is, we will create many collections to hold more data.
Currently we plan to use the timestamp of each file to decide which
collection will contain its documents.
For example, we might create a new collection every day or every hour, but
not at a fixed interval: we will maintain the start/end time of each
collection, and once a collection reaches its limit on the number of
documents, we will create a new one. To avoid having too many running
collections in one SolrCloud, we are also considering unloading (disabling)
old collections without deleting their index files, so that in the future a
collection could be enabled again on demand in a different SolrCloud.

This is how we plan to deal with a large number of documents.
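The time-based routing described above can be sketched as a small lookup:
each collection records the start/end timestamps it covers, a new collection
is opened when the current one reaches its document limit, and a document's
timestamp selects the collection. This is an illustrative sketch only (class
and method names are made up, not Solr APIs):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of timestamp-based collection routing: each
// collection covers a [start, end) time range; a new collection is
// opened when the current one reaches its document limit.
public class CollectionRouter {
    static class TimeRange {
        final String collection;
        final long startMillis;
        long endMillis; // Long.MAX_VALUE while the collection is still open

        TimeRange(String collection, long startMillis) {
            this.collection = collection;
            this.startMillis = startMillis;
            this.endMillis = Long.MAX_VALUE;
        }
    }

    private final List<TimeRange> ranges = new ArrayList<>();
    private final long docLimit;
    private long docsInCurrent = 0;
    private int seq = 0;

    public CollectionRouter(long docLimit) {
        this.docLimit = docLimit;
        ranges.add(new TimeRange("col-" + seq, 0L));
    }

    // Returns the collection that should hold a document with this
    // timestamp, rolling over to a new collection once the current is full.
    public String route(long timestampMillis) {
        TimeRange current = ranges.get(ranges.size() - 1);
        if (docsInCurrent >= docLimit) {
            current.endMillis = timestampMillis;   // close the old range
            seq++;
            current = new TimeRange("col-" + seq, timestampMillis);
            ranges.add(current);
            docsInCurrent = 0;
        }
        docsInCurrent++;
        return current.collection;
    }

    // Lookup for queries: which collection covers this timestamp?
    public String collectionFor(long timestampMillis) {
        for (TimeRange r : ranges) {
            if (timestampMillis >= r.startMillis && timestampMillis < r.endMillis) {
                return r.collection;
            }
        }
        return null;
    }
}
```

The same start/end-time table would also let an unloaded collection be
re-enabled on demand, since queries for old timestamps still resolve to the
right collection name.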

The problem is that, because our document ingest rate is very high, most
Solr resources (CPU/memory) are used for indexing. So when we run queries
on the same machines, the queries can be slow due to lack of resources, and
the queries in turn reduce indexing performance.
So we have been investigating using two separate Solr Clouds: one for
indexing and the other for queries. These two clouds will share data but
use separate computing resources.


Here is what we have already set up in our prototype.

Setup

Currently we have set up two separate Solr Clouds.

   1. Two Solr Clouds.
   2. One ZooKeeper per SolrCloud. The indexing SolrCloud needs to know the
   search SolrCloud's ZooKeeper address, but the search SolrCloud does not
   need to know the indexing ZooKeeper.
   3. The indexing SolrCloud and the query SolrCloud have the same
   collection names and the same number of shards per collection.
   4. The indexing SolrCloud and the query SolrCloud use their own
   solrHome, but the index data directories are shared between them.
   5. In the indexing SolrCloud, each shard has only one node.
   6. In the query SolrCloud, each shard can have more than one node for
   more query capacity.

How it works

In order to keep a consistent view between the indexing and search Solr
Clouds:

   1. The search SolrCloud has no updateHandler/commits. It uses a
   ReadOnlyDirectory(Factory) and a NoOpUpdateHandler for /update, and runs
   with solrcloud.skip.autorecovery=true.
   2. The search SolrCloud does not open a searcher by itself. It opens a
   searcher only when it receives an openSearcher command from the indexing
   SolrCloud.
   3. The indexing SolrCloud sends the openSearcher command to the search
   SolrCloud after a commit. That is, after each commit on the indexing
   SolrCloud, it schedules the command with a
   remoteOpenSearcherMaxDelayAfterCommit interval (default 80 seconds);
   after the interval, the indexing SolrCloud sends the openSearcher
   command to the search SolrCloud.
   4. The indexing SolrCloud has its own deletionPolicy to keep old commit
   points that might still be used by queries running on the search cloud.
   Currently the indexing SolrCloud keeps the last 20 minutes of commit
   points.
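Step 3 of the list above (delaying the openSearcher command after each
commit) could be sketched with a ScheduledExecutorService. The
remoteOpenSearcherMaxDelayAfterCommit name comes from the prototype; the
rest is illustrative, including the assumption that a new commit inside the
delay window replaces the pending command rather than queuing another:

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Illustrative sketch: after each commit on the indexing cloud, schedule
// an openSearcher command to the search cloud after a fixed delay. If
// another commit arrives before the delay elapses, the pending command is
// replaced, so bursts of commits collapse into one openSearcher.
public class OpenSearcherScheduler {
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();
    private final long delayMillis;             // e.g. 80_000 for the 80 s default
    private final Runnable sendOpenSearcherCmd; // e.g. an HTTP call to the search cloud
    private ScheduledFuture<?> pending;

    public OpenSearcherScheduler(long delayMillis, Runnable sendOpenSearcherCmd) {
        this.delayMillis = delayMillis;
        this.sendOpenSearcherCmd = sendOpenSearcherCmd;
    }

    // Called from the indexing cloud's postCommit hook.
    public synchronized void onCommit() {
        if (pending != null) {
            pending.cancel(false); // replace any not-yet-sent command
        }
        pending = executor.schedule(sendOpenSearcherCmd, delayMillis,
                TimeUnit.MILLISECONDS);
    }

    public void shutdown() {
        executor.shutdown();
    }
}
```

Whether the real prototype collapses overlapping commits this way or sends
one command per commit is not stated above; the delay itself is the only
behavior taken from the description.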

Any feedback or opinions would be very helpful to us.

Thanks in advance.
Jae



Re: Shared Directory for two Solr Clouds(Writer and Reader)

2014-10-21 Thread Erick Erickson
Hmmm, I sure hope you have _lots_ of shards. At that rate, a single
shard is probably going to run up against internal limits in a _very_
short time (the most docs I've seen successfully served on a single
shard run around 300M).

It seems, to handle any reasonable retention period, you need lots and
lots and lots of physical machines out there. Which hints at using
regular SolrCloud since each machine would then be handling much less
of the load.

This is what I mean by the XY problem. Your setup, at least from
what you've told us so far, has so many unknowns that it's impossible
to say much. If you go with your original e-mail and get it all set up
and running on, say, 3 shards, it would work fine for about an hour.
At that point you would have 300M docs on each shard and your query
performance would start having... problems. You'd be hitting the hard
limit of 2B docs/shard in less than 10 hours. And all the work you've
put into this complex coordination setup would be totally wasted.

So, you _really_ have to explain a lot more about the problem before
we talk about writing code. You might want to review:
http://wiki.apache.org/solr/UsingMailingLists

Best,
Erick



Shared Directory for two Solr Clouds(Writer and Reader)

2014-10-20 Thread Jaeyoung Yoon
Hi Folks,

Here are some of my ideas for using a shared file system with two separate
Solr Clouds (a Writer SolrCloud and a Reader SolrCloud).

I would like to get your feedback.

For a prototype, I set up two separate Solr Clouds (one for the Writer and
the other for the Reader).

The big picture of my prototype is as follows.

1. The Reader and Writer Solr Clouds share the same directory.
2. The Writer SolrCloud sends openSearcher commands to the Reader SolrCloud
from a postCommit event handler. That is, when new data are added to the
Writer SolrCloud, the Writer sends its openSearcher command to the Reader
SolrCloud.
3. The Reader opens a searcher only when it receives an openSearcher
command from the Writer SolrCloud.
4. The Writer has its own deletionPolicy to keep old commit points that
might still be used by queries running on the Reader SolrCloud when a new
searcher is opened on the Reader.
5. The Reader has no updates and no commits. Everything on the Reader
SolrCloud is read-only. It also creates its searcher from the directory,
not from the indexer (nrtMode=false).
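A retention-window deletion policy like the one in point 4 (keep roughly
the last 20 minutes of commit points so in-flight Reader queries can still
use them) might look like this in outline. This is a plain-Java sketch of
the retention rule only, not Solr's actual IndexDeletionPolicy API:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of a retention-window deletion policy: commit
// points younger than the window are kept; older ones are dropped,
// except that the most recent commit is always retained.
public class RetentionDeletionPolicy {
    static class CommitPoint {
        final String name;
        final long timestampMillis;
        CommitPoint(String name, long timestampMillis) {
            this.name = name;
            this.timestampMillis = timestampMillis;
        }
    }

    private final long windowMillis; // e.g. 20 * 60 * 1000 for 20 minutes

    public RetentionDeletionPolicy(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Given all commit points (oldest first) and the current time, return
    // the ones to keep; the rest would be deleted.
    public List<CommitPoint> commitsToKeep(List<CommitPoint> commits, long nowMillis) {
        List<CommitPoint> keep = new ArrayList<>();
        for (int i = 0; i < commits.size(); i++) {
            CommitPoint c = commits.get(i);
            boolean isLatest = (i == commits.size() - 1);
            if (isLatest || nowMillis - c.timestampMillis <= windowMillis) {
                keep.add(c);
            }
        }
        return keep;
    }
}
```

The window has to be at least as long as the slowest query the Reader is
expected to serve, otherwise a segment can be deleted out from under a
running search.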

That is, in the Writer SolrCloud I added a postCommit eventListener. Inside
the postCommit eventListener, it sends the openSearcher command to a
dedicated handler on the Reader SolrCloud. The Reader SolrCloud then opens
a searcher directly, without a commit, and responds to the Writer's
request.

With this approach, the Writer and Reader can use the same commit points in
the shared file system in a synchronous way.
When a Reader SolrCloud starts, it doesn't open a searcher by itself.
Instead, the Writer SolrCloud watches the Reader SolrCloud's ZooKeeper; on
any change in the Reader SolrCloud, the Writer sends an openSearcher
command to the Reader SolrCloud.

Does this make sense? Or am I missing something important?

Any feedback would be very helpful to me.

Thanks,
Jae


Re: Shared Directory for two Solr Clouds(Writer and Reader)

2014-10-20 Thread Otis Gospodnetic
Hi Jae,

Sounds a bit complicated and messy to me, but maybe I'm missing something.
What are you trying to accomplish with this approach? Which problems do
you have that are making you look for a non-straightforward setup?

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/





Re: Shared Directory for two Solr Clouds(Writer and Reader)

2014-10-20 Thread Erick Erickson
I guess I'm not quite sure what the point is. So can you back up a bit
and explain what problem this is trying to solve? Because all it
really appears to be doing that's not already done with stock Solr
is saving some disk space, and perhaps your reader SolrCloud
is having some more cycles to devote to serving queries rather
than indexing.

So I'm curious why:
1) a standard SolrCloud with selective hard and soft commits doesn't
satisfy the need, and
2) if 1) is not reasonable, why older-style master/slave replication
doesn't work.

Unless there's a compelling use-case for this, it seems like there's
a lot of complexity here for questionable value.

Please note I'm not saying this is a bad idea. It would just be good
to understand what problem it's trying to solve. I'm reluctant to
introduce complexity without discussing the use-case. Perhaps
the existing code could provide a good enough solution.

Best,
Erick



Re: Shared Directory for two Solr Clouds(Writer and Reader)

2014-10-20 Thread Jaeyoung Yoon
In my case, the ingest rate is very high (above 300K docs/sec) and data are
continuously inserted, so CPU is already a bottleneck because of indexing.

Older-style master/slave replication over http or scp takes a long time to
copy big files from master to slave.

That's why I set up two separate Solr Clouds: one for indexing and the
other for queries.

Thanks,
Jae
