Srikanth,

Sure... if you have two NiFi instances, or clusters, you can send data
between them using site-to-site communication. You can do a push or pull
model, but let's say a push model...
The receiving NiFi instance would have an Input Port on the graph, and the
sending NiFi instance would have a Remote Process Group connected to that
Input Port. Any data routed to that Remote Process Group will get sent
directly to the other instance/cluster.

Now within a cluster you can use the same technique. On the graph you
would have:
- Input Port -> Processor2
- Processor1 (Primary Node Only scheduling) -> Remote Process Group
(connected to its own Input Port)

When data leaves Processor1 and goes into the Remote Process Group, NiFi
will then distribute the data to all of the other nodes, each of which has
the Input Port.
Currently, this is the way to redistribute the data across the cluster from
a processor that is only running on the primary node.

-Bryan

On Thu, Sep 3, 2015 at 10:25 PM, Srikanth <srikanth...@gmail.com> wrote:

> Bryan,
>
> That was my first thought too, but then I felt I was overcomplicating it
> ;-)
> I didn't realize such a pattern was used in other processors.
>
> If you don't mind, can you elaborate more on this...
>
>> *"In a cluster you would likely set this up by having
>> DistributeSolrCommand send to a Remote Process Group that points back to an
>> input port of itself, and the input port feeds into ExecuteSolrCommand."*
>
>
> Srikanth
>
> On Thu, Sep 3, 2015 at 9:32 AM, Bryan Bende <bbe...@gmail.com> wrote:
>
>> Srikanth/Joe,
>>
>> I think I understand the scenario a little better now, and to Joe's
>> points - it will probably be clearer how to do this in a more generic way
>> as we work towards the High-Availability NCM.
>>
>> Thinking out loud here given the current state of things, I'm wondering
>> if the desired functionality could be achieved by doing something similar to
>> ListHDFS and FetchHDFS... what if there was a DistributeSolrCommand and
>> ExecuteSolrCommand?
>>
>> DistributeSolrCommand would be set to run on the Primary Node and would
>> be configured with similar properties to what GetSolr has now (ZK hosts, a
>> query, timestamp field, distrib=false, etc). It would query ZooKeeper and
>> produce a FlowFile for each shard, and the FlowFile would use either the
>> attributes or the payload to capture all of the processor property values
>> plus the shard information, basically producing a command for a downstream
>> processor to run.
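>>
>> Roughly, the shard-enumeration part could look like this (just a sketch
>> against SolrJ 5.x-style APIs; the ZK connect string and collection name
>> are made up for illustration):
>>
>>   import org.apache.solr.client.solrj.impl.CloudSolrClient;
>>   import org.apache.solr.common.cloud.ClusterState;
>>   import org.apache.solr.common.cloud.Slice;
>>   import org.apache.solr.common.cloud.ZkCoreNodeProps;
>>
>>   public class ListShardLeaders {
>>       public static void main(String[] args) throws Exception {
>>           try (CloudSolrClient cloud =
>>                    new CloudSolrClient("zk1:2181,zk2:2181/solr")) {
>>               cloud.connect();
>>               ClusterState state = cloud.getZkStateReader().getClusterState();
>>               for (Slice slice : state.getCollection("myCollection").getSlices()) {
>>                   // each shard would become one FlowFile whose attributes
>>                   // carry the shard name and its leader's core URL
>>                   String url = new ZkCoreNodeProps(slice.getLeader()).getCoreUrl();
>>                   System.out.println(slice.getName() + " -> " + url);
>>               }
>>           }
>>       }
>>   }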
>>
>> ExecuteSolrCommand would be running on every node and would be
>> responsible for interpreting the incoming FlowFile and executing whatever
>> operation was being specified, and then passing on the results.
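>>
>> And the per-shard execution might boil down to something like this (again
>> just a sketch; the core URL is a made-up example of what would arrive on
>> the incoming FlowFile):
>>
>>   import org.apache.solr.client.solrj.SolrQuery;
>>   import org.apache.solr.client.solrj.impl.HttpSolrClient;
>>   import org.apache.solr.client.solrj.response.QueryResponse;
>>
>>   public class QueryOneShard {
>>       public static void main(String[] args) throws Exception {
>>           String coreUrl =
>>               "http://solr1:8983/solr/myCollection_shard1_replica1";
>>           try (HttpSolrClient shardClient = new HttpSolrClient(coreUrl)) {
>>               SolrQuery q = new SolrQuery("*:*");
>>               q.set("distrib", false); // query only this shard, no fan-out
>>               QueryResponse rsp = shardClient.query(q);
>>               System.out.println(rsp.getResults().getNumFound()
>>                   + " docs on this shard");
>>           }
>>       }
>>   }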
>>
>> In a cluster you would likely set this up by having DistributeSolrCommand
>> send to a Remote Process Group that points back to an input port of itself,
>> and the input port feeds into ExecuteSolrCommand.
>> This would give you automatic querying of each shard, but you would still
>> have DistributeSolrCommand running on one node and needing to be manually
>> failed over until we address the HA stuff.
>>
>> This would be a fair amount of work, but food for thought.
>>
>> -Bryan
>>
>>
>> On Wed, Sep 2, 2015 at 11:08 PM, Joe Witt <joe.w...@gmail.com> wrote:
>>
>>> <- general commentary not specific to the solr case ->
>>>
>>> This concept of being able to have nodes share information about
>>> 'which partition' they should be responsible for is a generically
>>> useful and very powerful thing.  We need to support it.  It isn't
>>> immediately obvious to me how best to do this as a generic and useful
>>> thing but a controller service on the NCM could potentially assign
>>> 'partitions' to the nodes.  Zookeeper could be an important part.  I
>>> think we need to tackle the HA NCM construct we talked about months
>>> ago before we can do this one nicely.
>>>
>>> On Wed, Sep 2, 2015 at 7:47 PM, Srikanth <srikanth...@gmail.com> wrote:
>>> > Bryan,
>>> >
>>> > <Bryan> --> "I'm still a little bit unclear about the use case for
>>> > querying the shards individually... is the reason to do this because of
>>> > a performance/failover concern?"
>>> > <Srikanth> --> The reason to do this is to achieve better performance
>>> > with the convenience of automatic failover.
>>> > In the current mode, we do get very good failover offered by Solr.
>>> > Failover is seamless.
>>> > At the same time, we are not getting the best performance. I guess it's
>>> > clear to us why having each NiFi process query each shard with
>>> > distrib=false will give better performance.
>>> >
>>> > Now, the question is how we achieve this. Making the user configure one
>>> > NiFi processor for each Solr node is one way to go.
>>> > I'm afraid this will make failover a tricky process. It may even need
>>> > human intervention.
>>> >
>>> > Another approach is to have the cluster master in NiFi talk to ZK and
>>> > decide which shards to query, then divide these shards among the slave
>>> > nodes.
>>> > My understanding is the NiFi cluster master is not intended for such a
>>> > purpose. I'm not sure if this is even possible.
>>> >
>>> > Hope I'm a bit more clear now.
>>> >
>>> > Srikanth
>>> >
>>> > On Wed, Sep 2, 2015 at 5:58 PM, Bryan Bende <bbe...@gmail.com> wrote:
>>> >>
>>> >> Srikanth,
>>> >>
>>> >> Sorry you hadn't seen the reply, but hopefully you are subscribed to
>>> >> both the dev and users lists now :)
>>> >>
>>> >> I'm still a little bit unclear about the use case for querying the
>>> >> shards individually... is the reason to do this because of a
>>> >> performance/failover concern? Or is it something specific about how the
>>> >> data is shared?
>>> >>
>>> >> Let's say you have your Solr cluster with 10 shards, each on their own
>>> >> node for simplicity, and then your ZooKeeper cluster.
>>> >> Then you also have a NiFi cluster with 3 nodes, each with their own NiFi
>>> >> instance, the first node designated as the primary, and a fourth node as
>>> >> the cluster manager.
>>> >>
>>> >> Now if you want to extract data from your Solr cluster, you would do
>>> >> the following...
>>> >> - Drag GetSolr onto the graph
>>> >> - Set type to "cloud"
>>> >> - Set the Solr Location to the ZK hosts string
>>> >> - Set the scheduling to "Primary Node"
>>> >>
>>> >> When you start the processor it is now only running on the first NiFi
>>> >> node, and it is extracting data from all your shards at the same time.
>>> >> If a Solr shard/node fails, this would be handled for us by the SolrJ
>>> >> SolrCloudClient, which is using ZooKeeper to know about the state of
>>> >> things, and would choose a healthy replica of the shard if it existed.
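>>> >>
>>> >> For example, this is roughly all the flow needs when it goes through
>>> >> the cloud client (a sketch against SolrJ 5.x-style APIs; the ZK connect
>>> >> string and collection name are made up); node selection and failover
>>> >> come from the cluster state in ZooKeeper:
>>> >>
>>> >>   import org.apache.solr.client.solrj.SolrQuery;
>>> >>   import org.apache.solr.client.solrj.impl.CloudSolrClient;
>>> >>   import org.apache.solr.client.solrj.response.QueryResponse;
>>> >>
>>> >>   public class CloudQuery {
>>> >>       public static void main(String[] args) throws Exception {
>>> >>           // only ZooKeeper is configured; the client picks a live node
>>> >>           try (CloudSolrClient cloud =
>>> >>                    new CloudSolrClient("zk1:2181,zk2:2181/solr")) {
>>> >>               cloud.setDefaultCollection("myCollection");
>>> >>               QueryResponse rsp = cloud.query(new SolrQuery("*:*"));
>>> >>               System.out.println(rsp.getResults().getNumFound());
>>> >>           }
>>> >>       }
>>> >>   }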
>>> >> If the primary NiFi node failed, you would manually elect a new primary
>>> >> node and the extraction would resume on that node (this will get better
>>> >> in the future).
>>> >>
>>> >> I think if we expose distrib=false it would allow you to query shards
>>> >> individually, either by having a NiFi instance with a GetSolr processor
>>> >> per shard, or several mini-NiFis each with a single GetSolr, but
>>> >> I'm not sure if we could achieve the dynamic assignment you are thinking
>>> >> of.
>>> >>
>>> >> Let me know if I'm not making sense; happy to keep discussing and
>>> >> trying to figure out what else can be done.
>>> >>
>>> >> -Bryan
>>> >>
>>> >> On Wed, Sep 2, 2015 at 4:38 PM, Srikanth <srikanth...@gmail.com>
>>> wrote:
>>> >>>
>>> >>>
>>> >>> Bryan,
>>> >>>
>>> >>> That is correct, having the ability to query nodes with "distrib=false"
>>> >>> is what I was talking about.
>>> >>>
>>> >>> Instead of the user having to configure each Solr node in a separate
>>> >>> NiFi processor, can we provide a single configuration?
>>> >>> It would be great if we could take just the ZooKeeper (ZK) host as input
>>> >>> from the user and
>>> >>>   i) Determine all nodes for a container from ZK
>>> >>>   ii) Let each NiFi processor take ownership of querying a node with
>>> >>> "distrib=false"
>>> >>>
>>> >>> From what I understand, NiFi slaves in a cluster can't talk to each
>>> >>> other.
>>> >>> Will it be possible to do the ZK query part in the cluster master and
>>> >>> have individual Solr nodes propagated to each slave?
>>> >>> I don't know how we can achieve this in NiFi, if at all.
>>> >>>
>>> >>> This will make the Solr interface to NiFi much simpler. The user needs
>>> >>> to provide just ZK.
>>> >>> We'll be able to take care of the rest, including failing over to an
>>> >>> alternate Solr node when the current one fails.
>>> >>>
>>> >>> Let me know your thoughts.
>>> >>>
>>> >>> Rgds,
>>> >>> Srikanth
>>> >>>
>>> >>> P.S.: I had subscribed only to the digest and didn't receive your
>>> >>> original reply. I had to pull this up from the mail archive.
>>> >>> Only the dev list is on Nabble!
>>> >>>
>>> >>>
>>> >>>
>>> ***************************************************************************************************
>>> >>>
>>> >>> Hi Srikanth,
>>> >>>
>>> >>> You are correct that in a NiFi cluster the intent would be to schedule
>>> >>> GetSolr on the primary node only (on the scheduling tab) so that only
>>> >>> one node in your cluster was extracting data.
>>> >>>
>>> >>> GetSolr determines which SolrJ client to use based on the "Solr Type"
>>> >>> property, so if you select "Cloud" it will use SolrCloudClient. It would
>>> >>> send the query to one node based on the cluster state from ZooKeeper,
>>> >>> and then that Solr node performs the distributed query.
>>> >>>
>>> >>> Did you have a specific use case where you wanted to query each shard
>>> >>> individually?
>>> >>>
>>> >>> I think it would be straightforward to expose something on GetSolr that
>>> >>> would set "distrib=false" on the query so that Solr would not execute a
>>> >>> distributed query. You would then most likely create separate instances
>>> >>> of GetSolr and configure them as Standard type pointing at the
>>> >>> respective shards. Let us know if that is something you are interested in.
>>> >>>
>>> >>> Thanks,
>>> >>>
>>> >>> Bryan
>>> >>>
>>> >>>
>>> >>> On Sun, Aug 30, 2015 at 7:32 PM, Srikanth <srikanth...@gmail.com>
>>> wrote:
>>> >>>
>>> >>> > Hello,
>>> >>> >
>>> >>> > I started to explore the NiFi project a few days back. I'm still
>>> >>> > trying it out.
>>> >>> >
>>> >>> > I have a few basic questions on GetSolr.
>>> >>> >
>>> >>> > Should GetSolr be run as an Isolated Processor?
>>> >>> >
>>> >>> > If I have SolrCloud with 4 shards/nodes and a NiFi cluster with 4
>>> >>> > nodes, will GetSolr be able to query each shard from one specific NiFi
>>> >>> > node? I'm guessing it doesn't work that way.
>>> >>> >
>>> >>> >
>>> >>> > Thanks,
>>> >>> > Srikanth
>>> >>> >
>>> >>> >
>>> >>
>>> >>
>>> >
>>>
>>
>>
>
