Srikanth/Joe,

I think I understand the scenario a little better now, and to Joe's points
- it will probably be clearer how to do this in a more generic way as we
work towards the High-Availability NCM.

Thinking out loud here given the current state of things, I'm wondering if
the desired functionality could be achieved by doing something similar to
ListHDFS and FetchHDFS... what if there were a DistributeSolrCommand and an
ExecuteSolrCommand?

DistributeSolrCommand would be scheduled to run on the Primary Node and
would be configured with properties similar to what GetSolr has now (ZK
hosts, a query, a timestamp field, distrib=false, etc.). It would query
ZooKeeper and produce a FlowFile for each shard, and that FlowFile would
capture, in either its attributes or its payload, all of the processor
property values plus the shard information, basically producing a command
for a downstream processor to run.
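
Just to make that concrete, a rough SolrJ sketch of the ZooKeeper lookup
part might look like the following (the ZK string, collection name, and
class name are made up for illustration, this isn't a worked-out processor):

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;
import org.apache.solr.common.cloud.ZkCoreNodeProps;

public class ListShards {
    public static void main(String[] args) throws Exception {
        // Placeholders for what the processor properties would supply.
        String zkHosts = "zk1:2181,zk2:2181,zk3:2181";
        String collection = "myCollection";

        try (CloudSolrClient client = new CloudSolrClient(zkHosts)) {
            client.connect();
            ClusterState state = client.getZkStateReader().getClusterState();
            // One entry per shard; in the processor each of these would
            // become a FlowFile carrying the query, timestamp field, and
            // shard URL as attributes.
            for (Slice slice : state.getCollection(collection).getActiveSlices()) {
                Replica leader = slice.getLeader();
                String leaderUrl = new ZkCoreNodeProps(leader).getCoreUrl();
                System.out.println(slice.getName() + " -> " + leaderUrl);
            }
        }
    }
}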

ExecuteSolrCommand would run on every node and would be responsible for
interpreting the incoming FlowFile, executing whatever operation it
specifies, and then passing on the results.
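
The per-shard query side could then be roughly this kind of SolrJ call,
with the shard URL and query pulled from the incoming FlowFile (again just
an illustrative sketch with made-up values, not actual processor code):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class QuerySingleShard {
    public static void main(String[] args) throws Exception {
        // Placeholders for values that would come off the incoming FlowFile.
        String shardUrl = "http://solr1:8983/solr/myCollection_shard1_replica1";
        String query = "*:*";

        try (HttpSolrClient client = new HttpSolrClient(shardUrl)) {
            SolrQuery q = new SolrQuery(query);
            // Restrict the query to this core only; no distributed fan-out.
            q.set("distrib", "false");
            QueryResponse response = client.query(q);
            System.out.println("Docs from this shard: "
                    + response.getResults().getNumFound());
        }
    }
}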

In a cluster you would likely set this up by having DistributeSolrCommand
send to a Remote Process Group that points back to an input port on the
same cluster, with that input port feeding into ExecuteSolrCommand.
This would get you the automatic querying of each shard, but you would
still have DistributeSolrCommand running on one node and needing to be
manually failed over until we address the HA stuff.

This would be a fair amount of work, but food for thought.

-Bryan


On Wed, Sep 2, 2015 at 11:08 PM, Joe Witt <joe.w...@gmail.com> wrote:

> <- general commentary not specific to the solr case ->
>
> This concept of being able to have nodes share information about
> 'which partition' they should be responsible for is a generically
> useful and very powerful thing.  We need to support it.  It isn't
> immediately obvious to me how best to do this as a generic and useful
> thing, but a controller service on the NCM could potentially assign
> 'partitions' to the nodes.  ZooKeeper could be an important part.  I
> think we need to tackle the HA NCM construct we talked about months
> ago before we can do this one nicely.
>
> On Wed, Sep 2, 2015 at 7:47 PM, Srikanth <srikanth...@gmail.com> wrote:
> > Bryan,
> >
> > <Bryan> --> "I'm still a little bit unclear about the use case for
> > querying the shards individually... is the reason to do this because of
> > a performance/failover concern?"
> > <Srikanth> --> The reason to do this is to achieve better performance
> > with the convenience of automatic failover.
> > In the current mode, we do get very good failover from Solr; failover
> > is seamless.
> > At the same time, we are not getting the best performance. I guess it's
> > clear to us why having each NiFi node query a shard with distrib=false
> > will give better performance.
> >
> > Now, the question is how do we achieve this. Making the user configure
> > one NiFi processor for each Solr node is one way to go.
> > I'm afraid this would make failover a tricky process, possibly even
> > requiring human intervention.
> >
> > Another approach is to have the cluster master in NiFi talk to ZK,
> > decide which shards to query, and divide those shards among the slave
> > nodes.
> > My understanding is that the NiFi cluster master is not intended for
> > such a purpose. I'm not sure if this is even possible.
> >
> > Hope I'm a bit more clear now.
> >
> > Srikanth
> >
> > On Wed, Sep 2, 2015 at 5:58 PM, Bryan Bende <bbe...@gmail.com> wrote:
> >>
> >> Srikanth,
> >>
> >> Sorry you hadn't seen the reply, but hopefully you are subscribed to
> >> both the dev and users lists now :)
> >>
> >> I'm still a little bit unclear about the use case for querying the
> >> shards individually... is the reason to do this because of a
> >> performance/failover concern? Or is it something specific about how the
> >> data is shared?
> >>
> >> Let's say you have your Solr cluster with 10 shards, each on their own
> >> node for simplicity, and then your ZooKeeper cluster.
> >> Then you also have a NiFi cluster with 3 nodes, each running its own
> >> NiFi instance, with the first node designated as the primary, and a
> >> fourth node as the cluster manager.
> >>
> >> Now if you want to extract data from your Solr cluster, you would do the
> >> following...
> >> - Drag GetSolr onto the graph
> >> - Set the Solr Type to "Cloud"
> >> - Set the Solr Location to the ZK hosts string
> >> - Set the scheduling strategy to "Primary Node"
> >>
> >> When you start the processor it is now only running on the first NiFi
> >> node, and it is extracting data from all your shards at the same time.
> >> If a Solr shard/node fails, this would be handled for us by the SolrJ
> >> CloudSolrClient, which uses ZooKeeper to know about the state of things
> >> and would choose a healthy replica of the shard if one existed.
> >> If the primary NiFi node failed, you would manually elect a new primary
> >> node and the extraction would resume on that node (this will get better
> >> in the future).
> >>
> >> I think if we expose distrib=false it would allow you to query shards
> >> individually, either by having a NiFi instance with a GetSolr processor
> >> per shard, or several mini-NiFis each with a single GetSolr, but I'm
> >> not sure if we could achieve the dynamic assignment you are thinking
> >> of.
> >>
> >> Let me know if I'm not making sense, happy to keep discussing and trying
> >> to figure out what else can be done.
> >>
> >> -Bryan
> >>
> >> On Wed, Sep 2, 2015 at 4:38 PM, Srikanth <srikanth...@gmail.com> wrote:
> >>>
> >>>
> >>> Bryan,
> >>>
> >>> That is correct, having the ability to query nodes with "distrib=false"
> >>> is what I was talking about.
> >>>
> >>> Instead of the user having to configure each Solr node in a separate
> >>> NiFi processor, can we provide a single configuration?
> >>> It would be great if we could take just the ZooKeeper (ZK) hosts as
> >>> input from the user and
> >>>   i) Determine all nodes for a collection from ZK
> >>>   ii) Let each NiFi processor take ownership of querying a node with
> >>> "distrib=false"
> >>>
> >>> From what I understand, NiFi slaves in a cluster can't talk to each
> >>> other.
> >>> Would it be possible to do the ZK query part in the cluster master and
> >>> have the individual Solr nodes propagated to each slave?
> >>> I don't know how we can achieve this in NiFi, if at all.
> >>>
> >>> This would make the Solr interface to NiFi much simpler. The user
> >>> needs to provide just ZK.
> >>> We'll be able to take care of the rest, including failing over to an
> >>> alternate Solr node when the current one fails.
> >>>
> >>> Let me know your thoughts.
> >>>
> >>> Rgds,
> >>> Srikanth
> >>>
> >>> P.S.: I had subscribed only to the digest and didn't receive your
> >>> original reply. Had to pull this up from the mail archive.
> >>> Only the Dev list is in Nabble!!
> >>>
> >>>
> >>>
> >>> ***************************************************************************************************
> >>>
> >>> Hi Srikanth,
> >>>
> >>> You are correct that in a NiFi cluster the intent would be to schedule
> >>> GetSolr on the primary node only (on the scheduling tab) so that only
> >>> one node in your cluster was extracting data.
> >>>
> >>> GetSolr determines which SolrJ client to use based on the "Solr Type"
> >>> property, so if you select "Cloud" it will use CloudSolrClient. It
> >>> would send the query to one node based on the cluster state from
> >>> ZooKeeper, and then that Solr node performs the distributed query.
> >>>
> >>> Did you have a specific use case where you wanted to query each shard
> >>> individually?
> >>>
> >>> I think it would be straightforward to expose something on GetSolr
> >>> that would set "distrib=false" on the query so that Solr would not
> >>> execute a distributed query. You would then most likely create
> >>> separate instances of GetSolr and configure them as Standard type,
> >>> pointing at the respective shards. Let us know if that is something
> >>> you are interested in.
> >>>
> >>> Thanks,
> >>>
> >>> Bryan
> >>>
> >>>
> >>> On Sun, Aug 30, 2015 at 7:32 PM, Srikanth <srikanth...@gmail.com> wrote:
> >>>
> >>> > Hello,
> >>> >
> >>> > I started to explore the NiFi project a few days back. I'm still
> >>> > trying it out.
> >>> >
> >>> > I have a few basic questions on GetSolr.
> >>> >
> >>> > Should GetSolr be run as an Isolated Processor?
> >>> >
> >>> > If I have SolrCloud with 4 shards/nodes and a NiFi cluster with 4
> >>> > nodes, will GetSolr be able to query each shard from one specific
> >>> > NiFi node? I'm guessing it doesn't work that way.
> >>> >
> >>> >
> >>> > Thanks,
> >>> > Srikanth
> >>> >
> >>> >
> >>
> >>
> >
>
