Understood. Thanks!

Srikanth
On Fri, Sep 4, 2015 at 8:31 AM, Bryan Bende <bbe...@gmail.com> wrote:

> Srikanth,
>
> Sure... if you have two NiFi instances, or clusters, you can send data
> between them using site-to-site communication. You can use a push or a pull
> model, but let's say a push model...
> The receiving NiFi instance would have an Input Port on the graph, and the
> sending NiFi instance would have a Remote Process Group connected to that
> Input Port. Any data routed to that Remote Process Group will get sent
> directly to the other instance/cluster.
>
> Within a cluster you can use the same technique. On the graph you would have:
> - Input Port -> Processor2
> - Processor1 (Primary Node Only scheduling) -> Remote Process Group
>   (connected to its own Input Port)
>
> When data leaves Processor1 and goes into the Remote Process Group, NiFi
> will then distribute the data to all of the other nodes, each of which has
> the Input Port.
> Currently this is the way to re-distribute the data across the cluster
> from a processor that is only running on the primary node.
>
> -Bryan
>
> On Thu, Sep 3, 2015 at 10:25 PM, Srikanth <srikanth...@gmail.com> wrote:
>
>> Bryan,
>>
>> That was my first thought too, but then I felt I was overcomplicating it ;-)
>> I didn't realize such a pattern was used in other processors.
>>
>> If you don't mind, can you elaborate more on this...
>>
>>> *"In a cluster you would likely set this up by having DistributeSolrCommand
>>> send to a Remote Process Group that points back to an input port of itself,
>>> and the input port feeds into ExecuteSolrCommand."*
>>
>>
>> Srikanth
>>
>> On Thu, Sep 3, 2015 at 9:32 AM, Bryan Bende <bbe...@gmail.com> wrote:
>>
>>> Srikanth/Joe,
>>>
>>> I think I understand the scenario a little better now, and to Joe's
>>> points - it will probably be clearer how to do this in a more generic way
>>> as we work towards the High-Availability NCM.
>>>
>>> Thinking out loud here given the current state of things, I'm wondering
>>> if the desired functionality could be achieved by doing something similar
>>> to ListHDFS and FetchHDFS... what if there were a DistributeSolrCommand
>>> and an ExecuteSolrCommand?
>>>
>>> DistributeSolrCommand would be set to run on the Primary Node and would
>>> be configured with properties similar to what GetSolr has now (ZK hosts,
>>> a query, timestamp field, distrib=false, etc.). It would query ZooKeeper
>>> and produce a FlowFile for each shard, and the FlowFile would use either
>>> the attributes or the payload to capture all of the processor property
>>> values plus the shard information, basically producing a command for a
>>> downstream processor to run.
>>>
>>> ExecuteSolrCommand would be running on every node and would be
>>> responsible for interpreting the incoming FlowFile, executing whatever
>>> operation was specified, and then passing on the results.
>>>
>>> In a cluster you would likely set this up by having DistributeSolrCommand
>>> send to a Remote Process Group that points back to an input port of
>>> itself, and the input port feeds into ExecuteSolrCommand.
>>> This would get you the automatic querying of each shard, but you would
>>> still have DistributeSolrCommand running on one node and needing to be
>>> manually failed over, until we address the HA stuff.
>>>
>>> This would be a fair amount of work, but food for thought.
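For reference, a minimal sketch of what the core of such a DistributeSolrCommand might look like, assuming hypothetical property and attribute names and the SolrJ 5.x CloudSolrClient API (neither processor exists in NiFi today; this only illustrates the idea):

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;

// Hypothetical processor: runs on the Primary Node and emits one "command"
// FlowFile per shard. (PropertyDescriptors and getRelationships() omitted.)
public class DistributeSolrCommand extends AbstractProcessor {

    static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success").description("One command FlowFile per shard").build();

    @Override
    public void onTrigger(final ProcessContext context, final ProcessSession session)
            throws ProcessException {
        // Hypothetical property names, mirroring what GetSolr exposes today.
        final String zkHosts = context.getProperty("ZooKeeper Hosts").getValue();
        final String collection = context.getProperty("Collection").getValue();
        final String solrQuery = context.getProperty("Solr Query").getValue();

        // SolrJ 5.x-style client; the cluster state comes from ZooKeeper.
        try (CloudSolrClient cloud = new CloudSolrClient(zkHosts)) {
            cloud.connect();
            final DocCollection coll =
                    cloud.getZkStateReader().getClusterState().getCollection(collection);

            for (final Slice shard : coll.getSlices()) {
                final Replica leader = shard.getLeader();
                FlowFile flowFile = session.create();
                // Attributes carry everything a downstream ExecuteSolrCommand
                // would need to run this shard's query with distrib=false.
                flowFile = session.putAttribute(flowFile, "solr.query", solrQuery);
                flowFile = session.putAttribute(flowFile, "solr.shard", shard.getName());
                flowFile = session.putAttribute(flowFile, "solr.base.url", leader.getStr("base_url"));
                flowFile = session.putAttribute(flowFile, "solr.core", leader.getStr("core"));
                flowFile = session.putAttribute(flowFile, "solr.distrib", "false");
                session.transfer(flowFile, REL_SUCCESS);
            }
        } catch (final Exception e) {
            throw new ProcessException("Failed to read SolrCloud cluster state", e);
        }
    }
}

ExecuteSolrCommand, running on every node behind the input port, would then read those attributes and issue the per-shard query.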
>>>
>>> -Bryan
>>>
>>>
>>> On Wed, Sep 2, 2015 at 11:08 PM, Joe Witt <joe.w...@gmail.com> wrote:
>>>
>>>> <- general commentary not specific to the Solr case ->
>>>>
>>>> This concept of being able to have nodes share information about
>>>> 'which partition' they should be responsible for is a generically
>>>> useful and very powerful thing. We need to support it. It isn't
>>>> immediately obvious to me how best to do this as a generic and useful
>>>> thing, but a controller service on the NCM could potentially assign
>>>> 'partitions' to the nodes. ZooKeeper could be an important part. I
>>>> think we need to tackle the HA NCM construct we talked about months
>>>> ago before we can do this one nicely.
>>>>
>>>> On Wed, Sep 2, 2015 at 7:47 PM, Srikanth <srikanth...@gmail.com> wrote:
>>>> > Bryan,
>>>> >
>>>> > <Bryan> --> "I'm still a little bit unclear about the use case for
>>>> > querying the shards individually... is the reason to do this because
>>>> > of a performance/failover concern?"
>>>> > <Srikanth> --> The reason to do this is to achieve better performance
>>>> > with the convenience of automatic failover.
>>>> > In the current mode, we do get very good failover offered by Solr.
>>>> > Failover is seamless.
>>>> > At the same time, we are not getting the best performance. I think it's
>>>> > clear to us why having each NiFi process query each shard with
>>>> > distrib=false will give better performance.
>>>> >
>>>> > Now, the question is how we achieve this. Making the user configure one
>>>> > NiFi processor for each Solr node is one way to go.
>>>> > I'm afraid this will make failover a tricky process. It may even need
>>>> > human intervention.
>>>> >
>>>> > Another approach is to have the cluster master in NiFi talk to ZK and
>>>> > decide which shards to query, and divide those shards among the slave
>>>> > nodes.
>>>> > My understanding is that the NiFi cluster master is not intended for
>>>> > such a purpose. I'm not sure if this is even possible.
>>>> >
>>>> > Hope I'm a bit clearer now.
>>>> >
>>>> > Srikanth
>>>> >
>>>> > On Wed, Sep 2, 2015 at 5:58 PM, Bryan Bende <bbe...@gmail.com> wrote:
>>>> >>
>>>> >> Srikanth,
>>>> >>
>>>> >> Sorry you hadn't seen the reply, but hopefully you are subscribed to
>>>> >> both the dev and users lists now :)
>>>> >>
>>>> >> I'm still a little bit unclear about the use case for querying the
>>>> >> shards individually... is the reason to do this because of a
>>>> >> performance/failover concern? Or is it something specific about how
>>>> >> the data is shared?
>>>> >>
>>>> >> Let's say you have your Solr cluster with 10 shards, each on their own
>>>> >> node for simplicity, and then your ZooKeeper cluster.
>>>> >> Then you also have a NiFi cluster with 3 nodes, each with their own
>>>> >> NiFi instance, the first node designated as the primary, and a fourth
>>>> >> node as the cluster manager.
>>>> >>
>>>> >> Now if you want to extract data from your Solr cluster, you would do
>>>> >> the following...
>>>> >> - Drag GetSolr onto the graph
>>>> >> - Set the type to "Cloud"
>>>> >> - Set the Solr Location to the ZK hosts string
>>>> >> - Set the scheduling to "Primary Node"
>>>> >>
>>>> >> When you start the processor it is now only running on the first NiFi
>>>> >> node, and it is extracting data from all your shards at the same time.
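Under the hood, that "Cloud" configuration essentially amounts to a single SolrJ CloudSolrClient built from the ZK hosts string issuing an ordinary (distributed) query. A minimal sketch, assuming the SolrJ 5.x constructor and made-up ZK hosts, collection, and field names:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CloudQueryExample {
    public static void main(String[] args) throws Exception {
        // The client discovers the shards via ZooKeeper; whichever Solr node
        // receives the query fans it out to all shards (distrib=true by default).
        try (CloudSolrClient solr = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181")) {
            solr.setDefaultCollection("myCollection");
            // Incremental pull by timestamp, similar in spirit to GetSolr.
            SolrQuery query = new SolrQuery("timestamp:[2015-09-01T00:00:00Z TO NOW]");
            query.setRows(100);
            QueryResponse rsp = solr.query(query);
            System.out.println("Docs found: " + rsp.getResults().getNumFound());
        }
    }
}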
>>>> >> If a Solr shard/node fails, this would be handled for us by the SolrJ
>>>> >> SolrCloudClient, which uses ZooKeeper to know about the state of
>>>> >> things and would choose a healthy replica of the shard if one existed.
>>>> >> If the primary NiFi node failed, you would manually elect a new primary
>>>> >> node and the extraction would resume on that node (this will get better
>>>> >> in the future).
>>>> >>
>>>> >> I think if we expose distrib=false it would allow you to query shards
>>>> >> individually, either by having a NiFi instance with a GetSolr processor
>>>> >> per shard, or several mini-NiFis each with a single GetSolr, but
>>>> >> I'm not sure if we could achieve the dynamic assignment you are
>>>> >> thinking of.
>>>> >>
>>>> >> Let me know if I'm not making sense, happy to keep discussing and
>>>> >> trying to figure out what else can be done.
>>>> >>
>>>> >> -Bryan
>>>> >>
>>>> >> On Wed, Sep 2, 2015 at 4:38 PM, Srikanth <srikanth...@gmail.com> wrote:
>>>> >>>
>>>> >>> Bryan,
>>>> >>>
>>>> >>> That is correct, having the ability to query nodes with "distrib=false"
>>>> >>> is what I was talking about.
>>>> >>>
>>>> >>> Instead of the user having to configure each Solr node in a separate
>>>> >>> NiFi processor, can we provide a single configuration?
>>>> >>> It would be great if we could take just the ZooKeeper (ZK) host as
>>>> >>> input from the user and
>>>> >>> i) Determine all nodes for a container from ZK
>>>> >>> ii) Let each NiFi processor take ownership of querying a node with
>>>> >>> "distrib=false"
>>>> >>>
>>>> >>> From what I understand, NiFi slaves in a cluster can't talk to each
>>>> >>> other.
>>>> >>> Will it be possible to do the ZK query part in the cluster master and
>>>> >>> have the individual Solr nodes propagated to each slave?
>>>> >>> I don't know how we can achieve this in NiFi, if at all.
>>>> >>>
>>>> >>> This will make the Solr interface to NiFi much simpler. The user needs
>>>> >>> to provide just ZK.
>>>> >>> We'll be able to take care of the rest, including failing over to an
>>>> >>> alternate Solr node when the current one fails.
>>>> >>>
>>>> >>> Let me know your thoughts.
>>>> >>>
>>>> >>> Rgds,
>>>> >>> Srikanth
>>>> >>>
>>>> >>> P.S.: I had subscribed only to the digest and didn't receive your
>>>> >>> original reply. Had to pull this up from the mail archive.
>>>> >>> Only the dev list is in Nabble!!
>>>> >>>
>>>> >>>
>>>> >>> ***************************************************************************************************
>>>> >>>
>>>> >>> Hi Srikanth,
>>>> >>>
>>>> >>> You are correct that in a NiFi cluster the intent would be to schedule
>>>> >>> GetSolr on the primary node only (on the Scheduling tab) so that only
>>>> >>> one node in your cluster was extracting data.
>>>> >>>
>>>> >>> GetSolr determines which SolrJ client to use based on the "Solr Type"
>>>> >>> property, so if you select "Cloud" it will use SolrCloudClient. It
>>>> >>> would send the query to one node based on the cluster state from
>>>> >>> ZooKeeper, and then that Solr node performs the distributed query.
>>>> >>>
>>>> >>> Did you have a specific use case where you wanted to query each shard
>>>> >>> individually?
>>>> >>>
>>>> >>> I think it would be straightforward to expose something on GetSolr
>>>> >>> that would set "distrib=false" on the query so that Solr would not
>>>> >>> execute a distributed query.
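For comparison, a minimal SolrJ sketch of that per-shard variant: point a client at one shard's core and set distrib=false so that core answers only for its own data (SolrJ 5.x-style constructor; the shard core URL is a made-up example):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ShardQueryExample {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient shard =
                 new HttpSolrClient("http://solr1:8983/solr/myCollection_shard1_replica1")) {
            SolrQuery query = new SolrQuery("*:*");
            query.set("distrib", false);   // do not fan the query out to the other shards
            QueryResponse rsp = shard.query(query);
            System.out.println("Docs in this shard: " + rsp.getResults().getNumFound());
        }
    }
}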
>>>> >>> You would then most likely create separate instances of GetSolr and
>>>> >>> configure them as the Standard type, pointing at the respective
>>>> >>> shards. Let us know if that is something you are interested in.
>>>> >>>
>>>> >>> Thanks,
>>>> >>>
>>>> >>> Bryan
>>>> >>>
>>>> >>>
>>>> >>> On Sun, Aug 30, 2015 at 7:32 PM, Srikanth <srikanth...@gmail.com> wrote:
>>>> >>>
>>>> >>> > Hello,
>>>> >>> >
>>>> >>> > I started to explore the NiFi project a few days back. I'm still
>>>> >>> > trying it out.
>>>> >>> >
>>>> >>> > I have a few basic questions on GetSolr.
>>>> >>> >
>>>> >>> > Should GetSolr be run as an Isolated Processor?
>>>> >>> >
>>>> >>> > If I have SolrCloud with 4 shards/nodes and a NiFi cluster with 4
>>>> >>> > nodes, will GetSolr be able to query each shard from one specific
>>>> >>> > NiFi node? I'm guessing it doesn't work that way.
>>>> >>> >
>>>> >>> >
>>>> >>> > Thanks,
>>>> >>> > Srikanth