Understood. Thanks!

Srikanth
On Fri, Sep 4, 2015 at 8:31 AM, Bryan Bende <bbe...@gmail.com> wrote:

> Srikanth,
>
> Sure... if you have two NiFi instances, or clusters, you can send data
> between them using site-to-site communication. You can use a push or a pull
> model, but let's say a push model...
> The receiving NiFi instance would have an Input Port on the graph, and the
> sending NiFi instance would have a Remote Process Group connected to that
> Input Port. Any data routed to that Remote Process Group will get sent
> directly to the other instance/cluster.
>
> Within a cluster you can use the same technique. On the graph you would have:
> - Input Port -> Processor2
> - Processor1 (Primary Node Only scheduling) -> Remote Process Group
>   (connected to its own Input Port)
>
> When data leaves Processor1 and goes into the Remote Process Group, NiFi
> will then distribute the data to all of the other nodes, each of which has
> the Input Port.
> Currently this is the way to re-distribute the data across the cluster
> from a processor that is only running on the primary node.
>
> -Bryan
>
> On Thu, Sep 3, 2015 at 10:25 PM, Srikanth <srikanth...@gmail.com> wrote:
>
>> Bryan,
>>
>> That was my first thought too, but then I felt I was overcomplicating it ;-)
>> I didn't realize such a pattern was used in other processors.
>>
>> If you don't mind, can you elaborate more on this...
>>
>>> *"In a cluster you would likely set this up by having DistributeSolrCommand
>>> send to a Remote Process Group that points back to an input port of itself,
>>> and the input port feeds into ExecuteSolrCommand."*
>>
>>
>> Srikanth
>>
>> On Thu, Sep 3, 2015 at 9:32 AM, Bryan Bende <bbe...@gmail.com> wrote:
>>
>>> Srikanth/Joe,
>>>
>>> I think I understand the scenario a little better now, and to Joe's
>>> points - it will probably be clearer how to do this in a more generic way
>>> as we work towards the High-Availability NCM.
>>>
>>> Thinking out loud here given the current state of things, I'm wondering
>>> if the desired functionality could be achieved by doing something similar
>>> to ListHDFS and FetchHDFS... what if there were a DistributeSolrCommand
>>> and an ExecuteSolrCommand?
>>>
>>> DistributeSolrCommand would be set to run on the Primary Node and would
>>> be configured with properties similar to what GetSolr has now (ZK hosts,
>>> a query, timestamp field, distrib=false, etc.). It would query ZooKeeper
>>> and produce a FlowFile for each shard, and the FlowFile would use either
>>> the attributes or the payload to capture all of the processor property
>>> values plus the shard information, basically producing a command for a
>>> downstream processor to run.
>>>
>>> ExecuteSolrCommand would be running on every node and would be
>>> responsible for interpreting the incoming FlowFile, executing whatever
>>> operation was specified, and then passing on the results.
>>>
>>> In a cluster you would likely set this up by having DistributeSolrCommand
>>> send to a Remote Process Group that points back to an input port of
>>> itself, and the input port feeds into ExecuteSolrCommand.
>>> This would get you the automatic querying of each shard, but you would
>>> still have DistributeSolrCommand running on one node and needing to be
>>> manually failed over, until we address the HA stuff.
>>>
>>> This would be a fair amount of work, but food for thought.
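For reference, a minimal sketch of what the core of such a DistributeSolrCommand might look like, assuming hypothetical property and attribute names and the SolrJ 5.x CloudSolrClient API (neither processor exists in NiFi today; this only illustrates the idea):

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;

// Hypothetical processor: runs on the Primary Node and emits one "command"
// FlowFile per shard. (PropertyDescriptors and getRelationships() omitted.)
public class DistributeSolrCommand extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success").description("One command FlowFile per shard").build();

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session)
            throws ProcessException {
        // Hypothetical property names, mirroring what GetSolr exposes today.
        final String zkHosts = context.getProperty("ZooKeeper Hosts").getValue();
        final String collection = context.getProperty("Collection").getValue();
        final String solrQuery = context.getProperty("Solr Query").getValue();

        // SolrJ 5.x-style client; the cluster state comes from ZooKeeper.
        try (CloudSolrClient cloud = new CloudSolrClient(zkHosts)) {
            cloud.connect();
            final DocCollection coll =
                    cloud.getZkStateReader().getClusterState().getCollection(collection);

            for (final Slice shard : coll.getSlices()) {
                final Replica leader = shard.getLeader();
                FlowFile flowFile = session.create();
                // Attributes carry everything a downstream ExecuteSolrCommand
                // would need to run this shard's query with distrib=false.
                flowFile = session.putAttribute(flowFile, "solr.query", solrQuery);
                flowFile = session.putAttribute(flowFile, "solr.shard", shard.getName());
                flowFile = session.putAttribute(flowFile, "solr.base.url", leader.getStr("base_url"));
                flowFile = session.putAttribute(flowFile, "solr.core", leader.getStr("core"));
                flowFile = session.putAttribute(flowFile, "solr.distrib", "false");
                session.transfer(flowFile, REL_SUCCESS);
            }
        } catch (final Exception e) {
            throw new ProcessException("Failed to read SolrCloud cluster state", e);
        }
    }
}

ExecuteSolrCommand, running on every node behind the input port, would then read those attributes and issue the per-shard query.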
>>>
>>> -Bryan
>>>
>>>
>>> On Wed, Sep 2, 2015 at 11:08 PM, Joe Witt <joe.w...@gmail.com> wrote:
>>>
>>>> <- general commentary not specific to the Solr case ->
>>>>
>>>> This concept of being able to have nodes share information about
>>>> 'which partition' they should be responsible for is a generically
>>>> useful and very powerful thing. We need to support it. It isn't
>>>> immediately obvious to me how best to do this as a generic and useful
>>>> thing, but a controller service on the NCM could potentially assign
>>>> 'partitions' to the nodes. ZooKeeper could be an important part. I
>>>> think we need to tackle the HA NCM construct we talked about months
>>>> ago before we can do this one nicely.
>>>>
>>>> On Wed, Sep 2, 2015 at 7:47 PM, Srikanth <srikanth...@gmail.com> wrote:
>>>> > Bryan,
>>>> >
>>>> > <Bryan> --> "I'm still a little bit unclear about the use case for
>>>> > querying the shards individually... is the reason to do this because
>>>> > of a performance/failover concern?"
>>>> > <Srikanth> --> The reason to do this is to achieve better performance
>>>> > with the convenience of automatic failover.
>>>> > In the current mode, we do get very good failover offered by Solr.
>>>> > Failover is seamless.
>>>> > At the same time, we are not getting the best performance. I think it's
>>>> > clear to us why having each NiFi process query each shard with
>>>> > distrib=false will give better performance.
>>>> >
>>>> > Now, the question is how we achieve this. Making the user configure one
>>>> > NiFi processor for each Solr node is one way to go.
>>>> > I'm afraid this will make failover a tricky process. It may even need
>>>> > human intervention.
>>>> >
>>>> > Another approach is to have the cluster master in NiFi talk to ZK and
>>>> > decide which shards to query, and divide those shards among the slave
>>>> > nodes.
>>>> > My understanding is that the NiFi cluster master is not intended for
>>>> > such a purpose. I'm not sure if this is even possible.
>>>> >
>>>> > Hope I'm a bit clearer now.
>>>> >
>>>> > Srikanth
>>>> >
>>>> > On Wed, Sep 2, 2015 at 5:58 PM, Bryan Bende <bbe...@gmail.com> wrote:
>>>> >>
>>>> >> Srikanth,
>>>> >>
>>>> >> Sorry you hadn't seen the reply, but hopefully you are subscribed to
>>>> >> both the dev and users lists now :)
>>>> >>
>>>> >> I'm still a little bit unclear about the use case for querying the
>>>> >> shards individually... is the reason to do this because of a
>>>> >> performance/failover concern? Or is it something specific about how
>>>> >> the data is shared?
>>>> >>
>>>> >> Let's say you have your Solr cluster with 10 shards, each on their own
>>>> >> node for simplicity, and then your ZooKeeper cluster.
>>>> >> Then you also have a NiFi cluster with 3 nodes, each with their own
>>>> >> NiFi instance, the first node designated as the primary, and a fourth
>>>> >> node as the cluster manager.
>>>> >>
>>>> >> Now if you want to extract data from your Solr cluster, you would do
>>>> >> the following...
>>>> >> - Drag GetSolr onto the graph
>>>> >> - Set the type to "Cloud"
>>>> >> - Set the Solr Location to the ZK hosts string
>>>> >> - Set the scheduling to "Primary Node"
>>>> >>
>>>> >> When you start the processor it is now only running on the first NiFi
>>>> >> node, and it is extracting data from all your shards at the same time.
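Under the hood, that "Cloud" configuration essentially amounts to a single SolrJ CloudSolrClient built from the ZK hosts string issuing an ordinary (distributed) query. A minimal sketch, assuming the SolrJ 5.x constructor and made-up ZK hosts, collection, and field names:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CloudQueryExample {
    public static void main(String[] args) throws Exception {
        // The client discovers the shards via ZooKeeper; whichever Solr node
        // receives the query fans it out to all shards (distrib=true by default).
        try (CloudSolrClient solr = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181")) {
            solr.setDefaultCollection("myCollection");
            // Incremental pull by timestamp, similar in spirit to GetSolr.
            SolrQuery query = new SolrQuery("timestamp:[2015-09-01T00:00:00Z TO NOW]");
            query.setRows(100);
            QueryResponse rsp = solr.query(query);
            System.out.println("Docs found: " + rsp.getResults().getNumFound());
        }
    }
}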
>>>> >> If a Solr shard/node fails, this would be handled for us by the SolrJ
>>>> >> SolrCloudClient, which uses ZooKeeper to know about the state of
>>>> >> things and would choose a healthy replica of the shard if one existed.
>>>> >> If the primary NiFi node failed, you would manually elect a new primary
>>>> >> node and the extraction would resume on that node (this will get better
>>>> >> in the future).
>>>> >>
>>>> >> I think if we expose distrib=false it would allow you to query shards
>>>> >> individually, either by having a NiFi instance with a GetSolr processor
>>>> >> per shard, or several mini-NiFis each with a single GetSolr, but
>>>> >> I'm not sure if we could achieve the dynamic assignment you are
>>>> >> thinking of.
>>>> >>
>>>> >> Let me know if I'm not making sense, happy to keep discussing and
>>>> >> trying to figure out what else can be done.
>>>> >>
>>>> >> -Bryan
>>>> >>
>>>> >> On Wed, Sep 2, 2015 at 4:38 PM, Srikanth <srikanth...@gmail.com> wrote:
>>>> >>>
>>>> >>> Bryan,
>>>> >>>
>>>> >>> That is correct, having the ability to query nodes with "distrib=false"
>>>> >>> is what I was talking about.
>>>> >>>
>>>> >>> Instead of the user having to configure each Solr node in a separate
>>>> >>> NiFi processor, can we provide a single configuration?
>>>> >>> It would be great if we could take just the ZooKeeper (ZK) host as
>>>> >>> input from the user and
>>>> >>> i) Determine all nodes for a container from ZK
>>>> >>> ii) Let each NiFi processor take ownership of querying a node with
>>>> >>> "distrib=false"
>>>> >>>
>>>> >>> From what I understand, NiFi slaves in a cluster can't talk to each
>>>> >>> other.
>>>> >>> Will it be possible to do the ZK query part in the cluster master and
>>>> >>> have the individual Solr nodes propagated to each slave?
>>>> >>> I don't know how we can achieve this in NiFi, if at all.
>>>> >>>
>>>> >>> This will make the Solr interface to NiFi much simpler. The user needs
>>>> >>> to provide just ZK.
>>>> >>> We'll be able to take care of the rest, including failing over to an
>>>> >>> alternate Solr node when the current one fails.
>>>> >>>
>>>> >>> Let me know your thoughts.
>>>> >>>
>>>> >>> Rgds,
>>>> >>> Srikanth
>>>> >>>
>>>> >>> P.S.: I had subscribed only to the digest and didn't receive your
>>>> >>> original reply. Had to pull this up from the mail archive.
>>>> >>> Only the dev list is in Nabble!!
>>>> >>>
>>>> >>>
>>>> >>> ***************************************************************************************************
>>>> >>>
>>>> >>> Hi Srikanth,
>>>> >>>
>>>> >>> You are correct that in a NiFi cluster the intent would be to schedule
>>>> >>> GetSolr on the primary node only (on the Scheduling tab) so that only
>>>> >>> one node in your cluster was extracting data.
>>>> >>>
>>>> >>> GetSolr determines which SolrJ client to use based on the "Solr Type"
>>>> >>> property, so if you select "Cloud" it will use SolrCloudClient. It
>>>> >>> would send the query to one node based on the cluster state from
>>>> >>> ZooKeeper, and then that Solr node performs the distributed query.
>>>> >>>
>>>> >>> Did you have a specific use case where you wanted to query each shard
>>>> >>> individually?
>>>> >>>
>>>> >>> I think it would be straightforward to expose something on GetSolr
>>>> >>> that would set "distrib=false" on the query so that Solr would not
>>>> >>> execute a distributed query.
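For comparison, a minimal SolrJ sketch of that per-shard variant: point a client at one shard's core and set distrib=false so that core answers only for its own data (SolrJ 5.x-style constructor; the shard core URL is a made-up example):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ShardQueryExample {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient shard =
                 new HttpSolrClient("http://solr1:8983/solr/myCollection_shard1_replica1")) {
            SolrQuery query = new SolrQuery("*:*");
            query.set("distrib", false);   // do not fan the query out to the other shards
            QueryResponse rsp = shard.query(query);
            System.out.println("Docs in this shard: " + rsp.getResults().getNumFound());
        }
    }
}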
>>>> >>> You would then most likely create separate instances of GetSolr and
>>>> >>> configure them as the Standard type, pointing at the respective
>>>> >>> shards. Let us know if that is something you are interested in.
>>>> >>>
>>>> >>> Thanks,
>>>> >>>
>>>> >>> Bryan
>>>> >>>
>>>> >>>
>>>> >>> On Sun, Aug 30, 2015 at 7:32 PM, Srikanth <srikanth...@gmail.com> wrote:
>>>> >>>
>>>> >>> > Hello,
>>>> >>> >
>>>> >>> > I started to explore the NiFi project a few days back. I'm still
>>>> >>> > trying it out.
>>>> >>> >
>>>> >>> > I have a few basic questions on GetSolr.
>>>> >>> >
>>>> >>> > Should GetSolr be run as an Isolated Processor?
>>>> >>> >
>>>> >>> > If I have SolrCloud with 4 shards/nodes and a NiFi cluster with 4
>>>> >>> > nodes, will GetSolr be able to query each shard from one specific
>>>> >>> > NiFi node? I'm guessing it doesn't work that way.
>>>> >>> >
>>>> >>> >
>>>> >>> > Thanks,
>>>> >>> > Srikanth