Hi Erick,

Thanks for your kind reply.

In order to handle more documents in SolrCloud, we are planning to use
many collections, each with several shards.
The basic idea is that once a collection is filled with data, we create
a new collection; that is, we keep creating collections to hold more
data. Currently we plan to use the timestamp of each file to decide
which collection will contain which documents.
For example, we might create a new collection roughly every day or every
hour, but not at a fixed interval. We will maintain the start/end time
of each collection, and once a collection reaches its document-count
limit, we will create a new one. To avoid having too many active
collections in one SolrCloud, we are also considering unloading
(disabling) old collections without deleting their index files, so that
in the future we could re-enable a collection on demand in a different
SolrCloud.

This is how we plan to deal with a large number of documents.
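As an illustrative sketch only (the collection names, time ranges, and
document limit below are hypothetical, not our actual configuration),
the timestamp-based routing described above could look roughly like
this:

```python
from datetime import datetime

class CollectionRouter:
    """Sketch of routing documents to time-bounded collections.
    In practice the collection metadata (start/end time, doc count)
    would live in ZooKeeper or an external store, not in memory."""

    def __init__(self, doc_limit):
        self.doc_limit = doc_limit   # max docs per collection before rollover
        self.collections = []        # dicts: name, start, end, count

    def route(self, doc_timestamp):
        """Return the collection that should receive a document,
        creating a new collection when the current one is full."""
        if not self.collections or self.collections[-1]["count"] >= self.doc_limit:
            name = "coll_%s" % doc_timestamp.strftime("%Y%m%d%H%M%S")
            self.collections.append(
                {"name": name, "start": doc_timestamp,
                 "end": doc_timestamp, "count": 0})
        current = self.collections[-1]
        current["count"] += 1
        # track the time range actually covered by this collection
        current["end"] = max(current["end"], doc_timestamp)
        return current["name"]

    def find(self, ts):
        """Locate the (possibly unloaded) collection covering a
        query timestamp, so it could be re-enabled on demand."""
        for c in self.collections:
            if c["start"] <= ts <= c["end"]:
                return c["name"]
        return None
```

The point of keeping start/end times per collection is that queries over
a time range only need to touch (and, if necessary, re-enable) the
collections whose ranges overlap it.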

The problem is that our document ingest rate is very high, so most of
the resources (CPU/memory) in Solr are consumed by indexing. When we run
queries on the same machines, the queries can be slow due to the lack of
resources, and the queries also reduce indexing performance.
So we have been investigating using two separate SolrClouds: one for
indexing and the other for queries. The two clouds share index data but
use separate computing resources.


Here is what we have already set up in our prototype.

Setup

Currently we run two separate SolrClouds.

   1. Two SolrClouds.
   2. One ZooKeeper ensemble for each SolrCloud. The indexing SolrCloud
   needs to know the search SolrCloud's ZooKeeper address, but the search
   SolrCloud does not need to know the indexing ZooKeeper.
   3. The indexing and search SolrClouds have the same collection names
   and the same number of shards per collection.
   4. The indexing and search SolrClouds each use their own solrHome, but
   the index data directories are shared between them.
   5. In the indexing SolrCloud, each shard has only one node.
   6. In the search SolrCloud, each shard can have more than one node for
   additional query capacity.
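For reference, a search-side node in a setup like this would be started
with the skip-autorecovery flag mentioned below; the host name, port,
and solrHome path here are placeholder values, not our actual ones
(this assumes a Solr 4.x Jetty start.jar deployment):

```shell
# Start one node of the search (read-only) SolrCloud.
# search-zk:2181 and /shared/search-solrhome are hypothetical placeholders.
java -Dsolrcloud.skip.autorecovery=true \
     -DzkHost=search-zk:2181 \
     -Dsolr.solr.home=/shared/search-solrhome \
     -jar start.jar
```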

How it works

In order to keep a consistent view between the indexing and search
SolrClouds,

   1. The search SolrCloud has no update handler and performs no
   commits. It uses ReadOnlyDirectory(Factory) and a NoOpUpdateHandler
   for /update, and runs with solrcloud.skip.autorecovery=true.
   2. The search SolrCloud does not open a searcher by itself. It opens
   a searcher only when it receives an "openSearcherCmd" from the
   indexing SolrCloud.
   3. The indexing SolrCloud sends "openSearcherCmd" to the search
   SolrCloud after commit. That is, after each commit on the indexing
   SolrCloud, it schedules an "openSearcherCmd" with a
   remoteOpenSearcherMaxDelayAfterCommit interval (default 80 seconds).
   After the interval, the indexing SolrCloud sends "openSearcherCmd"
   to the search SolrCloud.
   4. The indexing SolrCloud has its own deletionPolicy to keep old
   commit points that might still be used by running queries on the
   search SolrCloud. Currently the indexing SolrCloud keeps the last
   20 minutes of commit points.
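For item 4, one way to express the retention is with Solr's standard
deletion policy in solrconfig.xml on the indexing side. This is a
sketch, not our exact configuration; the 20-minute value matches the
retention described above, and maxCommitAge is a standard
SolrDeletionPolicy setting:

```xml
<!-- solrconfig.xml on the indexing SolrCloud -->
<deletionPolicy class="solr.SolrDeletionPolicy">
  <!-- keep commit points for 20 minutes so in-flight queries on the
       search SolrCloud can still read the index files they reference -->
  <str name="maxCommitAge">20MINUTES</str>
</deletionPolicy>
```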

Any feedback or opinions would be very helpful to us.

Thanks in advance.
Jae


On Tue, Oct 21, 2014 at 7:30 AM, Erick Erickson <erickerick...@gmail.com>
wrote:

> Hmmm, I sure hope you have _lots_ of shards. At that rate, a single
> shard is probably going to run up against internal limits in a _very_
> short time (the most docs I've seen successfully served on a single
> shard run around 300M).
>
> It seems, to handle any reasonable retention period, you need lots and
> lots and lots of physical machines out there. Which hints at using
> regular SolrCloud since each machine would then be handling much less
> of the load.
>
> This is what I mean by "the XY problem". Your setup, at least from
> what you've told us so far, has so many unknowns that it's impossible
> to say much. If you go with your original e-mail and get it all set up
> and running on, say, 3 shards, it would work fine for about an hour.
> At that point you would have 300M docs on each shard and your query
> performance would start having... problems. You'd be hitting the hard
> limit of 2B docs/shard in less than 10 hours. And all the work you've
> put into this complex coordination setup would be totally wasted.
>
> So, you _really_ have to explain a lot more about the problem before
> we talk about writing code. You might want to review:
> http://wiki.apache.org/solr/UsingMailingLists
>
> Best,
> Erick
>
> On Tue, Oct 21, 2014 at 12:34 AM, Jaeyoung Yoon <jaeyoungy...@gmail.com>
> wrote:
> > In my case, injest rate is very high(above 300K docs/sec) and data are
> kept
> > inserted. So CPU is already bottleneck because of indexing.
> >
> > older-style master/slave replication with http or scp takes long to copy
> > big files from master/slave.
> >
> > That's why I setup two separate Solr Clouds. One for indexing and the
> other
> > for query.
> >
> > Thanks,
> > Jae
> >
> > On Mon, Oct 20, 2014 at 6:22 PM, Erick Erickson <erickerick...@gmail.com
> >
> > wrote:
> >
> >> I guess I'm not quite sure what the point is. So can you back up a bit
> >> and explain what problem this is trying to solve? Because all it
> >> really appears to be doing that's not already done with stock Solr
> >> is saving some disk space, and perhaps your "reader" SolrCloud
> >> is having some more cycles to devote to serving queries rather
> >> than indexing.
> >>
> >> So I'm curious why
> >> 1> standard SolrCloud with selective hard and soft commits doesn't
> >> satisfy the need
> >> and
> >> 2> If <1> is not reasonable, why older-style master/slave replication
> >> doesn't work.
> >>
> >> Unless there's a compelling use-case for this, it seems like there's
> >> a lot of complexity here for questionable value.
> >>
> >> Please note I'm not saying this is a bad idea. It would just be good
> >> to  understand what problem it's trying to solve. I'm reluctant to
> >> introduce complexity without discussing the use-case. Perhaps
> >> the existing code could provide a "good enough" solution.
> >>
> >> Best,
> >> Erick
> >>
> >> On Mon, Oct 20, 2014 at 7:35 PM, Jaeyoung Yoon <jaeyoungy...@gmail.com>
> >> wrote:
> >> > Hi Folks,
> >> >
> >> > Here are some my ideas to use shared file system with two separate
> Solr
> >> > Clouds(Writer Solr Cloud and Reader Solr Cloud).
> >> >
> >> > I want to get your valuable feedbacks
> >> >
> >> > For prototype, I setup two separate Solr Clouds(one for Writer and the
> >> > other for Reader).
> >> >
> >> > Basically big picture of my prototype is like below.
> >> >
> >> > 1. Reader and Writer Solr clouds share the same directory
> >> > 2. Writer SolrCloud sends the "openSearcher" commands to Reader Solr
> >> Cloud
> >> > inside postCommit eventHandler. That is, when new data are added to
> >> Writer
> >> > Solr Cloud, writer Solr Cloud sends own openSearcher command to Reader
> >> Solr
> >> > Cloud.
> >> > 3. Reader opens "searcher" only when it receives "openSearcher"
> commands
> >> > from Writer SolrCloud
> >> > 4. Writer has own deletionPolicy to keep old commit points which
> might be
> >> > used by running queries on Reader Solr Cloud when new searcher is
> opened
> >> on
> >> > reader SolrCloud.
> >> > 5. Reader has no update/no commits. Everything on reader Solr Cloud
> are
> >> > read-only. It also creates searcher from directory not from
> >> > indexer(nrtMode=false).
> >> >
> >> > That is,
> >> > In Writer Solr Cloud, I added postCommit eventListner. Inside the
> >> > postCommit eventListner, it sends own "openSearcher" command to reader
> >> Solr
> >> > Cloud's own handler. Then reader Solr Cloud will create openSearcher
> >> > directly without commit and return the writer's request.
> >> >
> >> > With this approach, Writer and Reader can use the same commit points
> in
> >> > shared file system in synchronous way.
> >> > When a Reader SolrCloud starts, it doesn't create openSearcher.
> Instead.
> >> > Writer Solr Cloud listens the zookeeper of Reader Solr Cloud. Any
> change
> >> in
> >> > the reader SolrCloud, writer sends "openSearcher" command to reader
> Solr
> >> > Cloud.
> >> >
> >> > Does it make sense? Or am I missing some important stuff?
> >> >
> >> > any feedback would be very helpful to me.
> >> >
> >> > Thanks,
> >> > Jae
> >>
>
