Joe,

Yes, that will be a really nice feature to have.
Is there a JIRA or wiki page on the discussion you guys had about HA NCM?

Srikanth

On Wed, Sep 2, 2015 at 11:08 PM, Joe Witt <joe.w...@gmail.com> wrote:

> <- general commentary not specific to the solr case ->
>
> This concept of being able to have nodes share information about
> 'which partition' they should be responsible for is a generically
> useful and very powerful thing.  We need to support it.  It isn't
> immediately obvious to me how best to do this as a generic and useful
> thing but a controller service on the NCM could potentially assign
> 'partitions' to the nodes.  Zookeeper could be an important part.  I
> think we need to tackle the HA NCM construct we talked about months
> ago before we can do this one nicely.
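The kind of assignment a controller service on the NCM could hand out might look like the sketch below. This is purely illustrative, not a NiFi API: `assign_partitions` is a hypothetical helper showing that a deterministic rule (round-robin over sorted node ids) lets every node agree on which partitions it owns given the same inputs.

```python
# Hypothetical sketch of the partition assignment described above: divide a
# list of partitions (e.g. Solr shards) among the currently live nodes.
# Nothing here is part of NiFi; the names are made up for illustration.
def assign_partitions(partitions, nodes):
    """Round-robin partitions over sorted node ids, so any node that knows
    the full node list computes the same assignment independently."""
    ordered = sorted(nodes)
    assignment = {node: [] for node in ordered}
    for i, part in enumerate(sorted(partitions)):
        assignment[ordered[i % len(ordered)]].append(part)
    return assignment
```

If a node drops out of the live set (e.g. as reported by ZooKeeper), re-running the function over the surviving nodes rebalances its partitions automatically.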
>
> On Wed, Sep 2, 2015 at 7:47 PM, Srikanth <srikanth...@gmail.com> wrote:
> > Bryan,
> >
> > <Bryan> --> "I'm still a little bit unclear about the use case for
> > querying the shards individually... is the reason to do this because of a
> > performance/failover concern?"
> > <Srikanth> --> The reason to do this is to achieve better performance with
> > the convenience of automatic failover.
> > In the current mode, we do get very good failover offered by Solr.
> > Failover is seamless.
> > At the same time, we are not getting the best performance. I think it's
> > clear to us why having each NiFi processor query a shard with
> > distrib=false will give better performance.
> >
> > Now, the question is how we achieve this. Making the user configure one
> > NiFi processor for each Solr node is one way to go.
> > I'm afraid this will make failover a tricky process. It may even need
> > human intervention.
> >
> > Another approach is to have the cluster master in NiFi talk to ZK and
> > decide which shards to query, then divide those shards among the slave
> > nodes.
> > My understanding is that the NiFi cluster master is not intended for such
> > a purpose. I'm not sure if this is even possible.
> >
> > Hope I'm a bit more clear now.
> >
> > Srikanth
> >
> > On Wed, Sep 2, 2015 at 5:58 PM, Bryan Bende <bbe...@gmail.com> wrote:
> >>
> >> Srikanth,
> >>
> >> Sorry you hadn't seen the reply, but hopefully you are subscribed to
> >> both the dev and users lists now :)
> >>
> >> I'm still a little bit unclear about the use case for querying the
> >> shards individually... is the reason to do this because of a
> >> performance/failover concern? Or is it something specific about how the
> >> data is shared?
> >>
> >> Let's say you have your Solr cluster with 10 shards, each on their own
> >> node for simplicity, and then your ZooKeeper cluster.
> >> Then you also have a NiFi cluster with 3 nodes, each with their own NiFi
> >> instance, the first node designated as the primary, and a fourth node as
> >> the cluster manager.
> >>
> >> Now if you want to extract data from your Solr cluster, you would do the
> >> following...
> >> - Drag GetSolr on to the graph
> >> - Set type to "cloud"
> >> - Set the Solr Location to the ZK hosts string
> >> - Set the scheduling to "Primary Node"
> >>
> >> When you start the processor it is now only running on the first NiFi
> >> node, and it is extracting data from all your shards at the same time.
> >> If a Solr shard/node fails, this would be handled for us by the SolrJ
> >> SolrCloudClient, which is using ZooKeeper to know about the state of
> >> things, and would choose a healthy replica of the shard if it existed.
> >> If the primary NiFi node failed, you would manually elect a new primary
> >> node and the extraction would resume on that node (this will get better
> >> in the future).
> >>
> >> I think if we expose distrib=false it would allow you to query shards
> >> individually, either by having a NiFi instance with a GetSolr processor
> >> per shard, or several mini-NiFis each with a single GetSolr, but I'm not
> >> sure if we could achieve the dynamic assignment you are thinking of.
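For reference, distrib=false is a standard Solr query parameter: at the HTTP level, a per-shard query is just a /select request against a single core with that flag set, which stops Solr from fanning the request out. A minimal sketch (host, core, and helper names are made up for illustration):

```python
from urllib.parse import urlencode

def shard_query_url(shard_base_url, query, rows=100):
    """Build a Solr /select URL that queries only the addressed core:
    distrib=false tells Solr not to distribute the request to other shards."""
    params = {"q": query, "rows": rows, "distrib": "false", "wt": "json"}
    return "%s/select?%s" % (shard_base_url.rstrip("/"), urlencode(params))
```

For example, `shard_query_url("http://solr1:8983/solr/collection1_shard1_replica1", "*:*")` targets just that one replica, which is what one GetSolr-per-shard would amount to.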
> >>
> >> Let me know if I'm not making sense, happy to keep discussing and trying
> >> to figure out what else can be done.
> >>
> >> -Bryan
> >>
> >> On Wed, Sep 2, 2015 at 4:38 PM, Srikanth <srikanth...@gmail.com> wrote:
> >>>
> >>>
> >>> Bryan,
> >>>
> >>> That is correct, having the ability to query nodes with "distrib=false"
> >>> is what I was talking about.
> >>>
> >>> Instead of the user having to configure each Solr node in a separate
> >>> NiFi processor, can we provide a single configuration?
> >>> It would be great if we could take just the ZooKeeper (ZK) host as
> >>> input from the user and
> >>>   i) Determine all nodes for a collection from ZK
> >>>   ii) Let each NiFi processor take ownership of querying a node with
> >>> "distrib=false"
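Step i) is plausible because Solr keeps the cluster layout in ZooKeeper. A hedged sketch of picking one query target per shard from that state, assuming the classic clusterstate.json field layout; actually fetching the JSON from ZK is omitted, and `shard_leaders` is an illustrative helper, not an existing API:

```python
import json

def shard_leaders(cluster_state_json, collection):
    """Map each shard of a collection to its active leader's core URL,
    given the cluster-state JSON that Solr stores in ZooKeeper."""
    state = json.loads(cluster_state_json)
    leaders = {}
    for shard, info in state[collection]["shards"].items():
        for replica in info["replicas"].values():
            if replica.get("leader") == "true" and replica.get("state") == "active":
                leaders[shard] = "%s/%s" % (replica["base_url"].rstrip("/"),
                                            replica["core"])
    return leaders
```

The resulting shard-to-URL map is exactly the list one would then divide among the NiFi nodes for step ii).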
> >>>
> >>> From what I understand, NiFi slaves in a cluster can't talk to each
> >>> other.
> >>> Would it be possible to do the ZK query part in the cluster master and
> >>> have the individual Solr nodes propagated to each slave?
> >>> I don't know how we can achieve this in NiFi, if at all.
> >>>
> >>> This would make the Solr interface to NiFi much simpler. The user
> >>> needs to provide just ZK.
> >>> We'll be able to take care of the rest, including failing over to an
> >>> alternate Solr node when the current one fails.
> >>>
> >>> Let me know your thoughts.
> >>>
> >>> Rgds,
> >>> Srikanth
> >>>
> >>> P.S.: I had subscribed only to the digest and didn't receive your
> >>> original reply. Had to pull this up from the mail archive.
> >>> Only the dev list is on Nabble!
> >>>
> >>>
> >>>
> ***************************************************************************************************
> >>>
> >>> Hi Srikanth,
> >>>
> >>> You are correct that in a NiFi cluster the intent would be to schedule
> >>> GetSolr on the primary node only (on the scheduling tab) so that only
> >>> one node in your cluster was extracting data.
> >>>
> >>> GetSolr determines which SolrJ client to use based on the "Solr Type"
> >>> property, so if you select "Cloud" it will use SolrCloudClient. It
> >>> would send the query to one node based on the cluster state from
> >>> ZooKeeper, and then that Solr node performs the distributed query.
> >>>
> >>> Did you have a specific use case where you wanted to query each shard
> >>> individually?
> >>>
> >>> I think it would be straightforward to expose something on GetSolr
> >>> that would set "distrib=false" on the query so that Solr would not
> >>> execute a distributed query. You would then most likely create separate
> >>> instances of GetSolr and configure them as Standard type pointing at
> >>> the respective shards. Let us know if that is something you are
> >>> interested in.
> >>>
> >>> Thanks,
> >>>
> >>> Bryan
> >>>
> >>>
> >>> On Sun, Aug 30, 2015 at 7:32 PM, Srikanth <srikanth...@gmail.com>
> wrote:
> >>>
> >>> > Hello,
> >>> >
> >>> > I started to explore the NiFi project a few days back. I'm still
> >>> > trying it out.
> >>> >
> >>> > I have a few basic questions on GetSolr.
> >>> >
> >>> > Should GetSolr be run as an Isolated Processor?
> >>> >
> >>> > If I have SolrCloud with 4 shards/nodes and a NiFi cluster with 4
> >>> > nodes, will GetSolr be able to query each shard from one specific
> >>> > NiFi node? I'm guessing it doesn't work that way.
> >>> >
> >>> >
> >>> > Thanks,
> >>> > Srikanth
> >>> >
> >>> >
> >>
> >>
> >
>
