Here is a JIRA for it: https://issues.apache.org/jira/browse/NIFI-338 And a related one: https://issues.apache.org/jira/browse/NIFI-540
There may also be some discussion in the mailing list about it, but I
think the main points are captured in those JIRA tickets.

Thanks
Joe

On Thu, Sep 3, 2015 at 7:15 PM, Srikanth <srikanth...@gmail.com> wrote:
> Joe,
>
> Yes, that will be a really nice feature to have.
> Is there a JIRA or wiki on the discussion you guys had on HA NCM?
>
> Srikanth
>
> On Wed, Sep 2, 2015 at 11:08 PM, Joe Witt <joe.w...@gmail.com> wrote:
>>
>> <- general commentary not specific to the solr case ->
>>
>> This concept of being able to have nodes share information about
>> 'which partition' they should be responsible for is a generically
>> useful and very powerful thing. We need to support it. It isn't
>> immediately obvious to me how best to do this as a generic and useful
>> thing, but a controller service on the NCM could potentially assign
>> 'partitions' to the nodes. ZooKeeper could be an important part. I
>> think we need to tackle the HA NCM construct we talked about months
>> ago before we can do this one nicely.
>>
>> On Wed, Sep 2, 2015 at 7:47 PM, Srikanth <srikanth...@gmail.com> wrote:
>> > Bryan,
>> >
>> > <Bryan> --> "I'm still a little bit unclear about the use case for
>> > querying the shards individually... is the reason to do this because
>> > of a performance/failover concern?"
>> > <Srikanth> --> The reason to do this is to achieve better performance
>> > with the convenience of automatic failover.
>> > In the current mode, we do get very good failover offered by Solr.
>> > Failover is seamless.
>> > At the same time, we are not getting the best performance. I guess
>> > it's clear to us why having each NiFi processor query each shard with
>> > distrib=false will give better performance.
>> >
>> > Now, the question is how we achieve this. Making the user configure
>> > one NiFi processor for each Solr node is one way to go.
>> > I'm afraid this will make failover a tricky process. It may even need
>> > human intervention.
>> >
>> > Another approach is to have the cluster master in NiFi talk to ZK and
>> > decide which shards to query, then divide those shards among the
>> > slave nodes.
>> > My understanding is that the NiFi cluster master is not intended for
>> > such a purpose. I'm not sure if this is even possible.
>> >
>> > Hope I'm a bit clearer now.
>> >
>> > Srikanth
>> >
>> > On Wed, Sep 2, 2015 at 5:58 PM, Bryan Bende <bbe...@gmail.com> wrote:
>> >>
>> >> Srikanth,
>> >>
>> >> Sorry you hadn't seen the reply, but hopefully you are subscribed to
>> >> both the dev and users lists now :)
>> >>
>> >> I'm still a little bit unclear about the use case for querying the
>> >> shards individually... is the reason to do this because of a
>> >> performance/failover concern? Or is it something specific about how
>> >> the data is sharded?
>> >>
>> >> Let's say you have your Solr cluster with 10 shards, each on its own
>> >> node for simplicity, and then your ZooKeeper cluster.
>> >> Then you also have a NiFi cluster with three nodes, each with its
>> >> own NiFi instance, the first node designated as the primary, and a
>> >> fourth node as the cluster manager.
>> >>
>> >> Now if you want to extract data from your Solr cluster, you would do
>> >> the following...
>> >> - Drag GetSolr on to the graph
>> >> - Set type to "Cloud"
>> >> - Set the Solr Location to the ZK hosts string
>> >> - Set the scheduling to "Primary Node"
>> >>
>> >> When you start the processor it is now only running on the first
>> >> NiFi node, and it is extracting data from all your shards at the
>> >> same time.
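>> >>
>> >> Under the covers, Cloud mode amounts to roughly the following SolrJ
>> >> usage (an untested sketch using SolrJ 5.x-style names; the ZK string
>> >> and collection name are placeholders, and GetSolr's actual
>> >> implementation may differ):
>> >>
>> >>   import org.apache.solr.client.solrj.SolrQuery;
>> >>   import org.apache.solr.client.solrj.impl.CloudSolrClient;
>> >>   import org.apache.solr.client.solrj.response.QueryResponse;
>> >>
>> >>   // CloudSolrClient watches ZooKeeper for cluster state and picks a
>> >>   // healthy node to send the request to. ZK hosts are placeholders.
>> >>   CloudSolrClient client =
>> >>       new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
>> >>   client.setDefaultCollection("myCollection"); // placeholder name
>> >>
>> >>   SolrQuery query = new SolrQuery("*:*");
>> >>   // A normal query is distributed: the receiving Solr node fans it
>> >>   // out to every shard and merges results before responding.
>> >>   // (Exception handling omitted for brevity.)
>> >>   QueryResponse response = client.query(query);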
>> >> If a Solr shard/node fails, this would be handled for us by the
>> >> SolrJ CloudSolrClient, which uses ZooKeeper to know about the state
>> >> of things and would choose a healthy replica of the shard if one
>> >> existed.
>> >> If the primary NiFi node failed, you would manually elect a new
>> >> primary node and the extraction would resume on that node (this will
>> >> get better in the future).
>> >>
>> >> I think if we expose distrib=false it would allow you to query
>> >> shards individually, either by having a NiFi instance with a GetSolr
>> >> processor per shard, or several mini-NiFis each with a single
>> >> GetSolr, but I'm not sure we could achieve the dynamic assignment
>> >> you are thinking of.
>> >>
>> >> Let me know if I'm not making sense; happy to keep discussing and
>> >> trying to figure out what else can be done.
>> >>
>> >> -Bryan
>> >>
>> >> On Wed, Sep 2, 2015 at 4:38 PM, Srikanth <srikanth...@gmail.com> wrote:
>> >>>
>> >>> Bryan,
>> >>>
>> >>> That is correct, having the ability to query nodes with
>> >>> "distrib=false" is what I was talking about.
>> >>>
>> >>> Instead of the user having to configure each Solr node in a
>> >>> separate NiFi processor, can we provide a single configuration?
>> >>> It would be great if we could take just the ZooKeeper (ZK) host as
>> >>> input from the user and
>> >>> i) determine all nodes for a collection from ZK, and
>> >>> ii) let each NiFi processor take ownership of querying a node with
>> >>> "distrib=false".
>> >>>
>> >>> From what I understand, NiFi slaves in a cluster can't talk to each
>> >>> other. Would it be possible to do the ZK query part in the cluster
>> >>> master and have individual Solr nodes propagated to each slave?
>> >>> I don't know how we can achieve this in NiFi, if at all.
>> >>>
>> >>> This would make the Solr interface to NiFi much simpler. The user
>> >>> needs to provide just ZK, and we would take care of the rest,
>> >>> including failing over to an alternate Solr node when the current
>> >>> one fails.
>> >>>
>> >>> Let me know your thoughts.
>> >>>
>> >>> Rgds,
>> >>> Srikanth
>> >>>
>> >>> P.S.: I had subscribed only to the digest and didn't receive your
>> >>> original reply; I had to pull this up from the mail archive. Only
>> >>> the dev list is on Nabble!
>> >>>
>> >>> ***************************************************************************************************
>> >>>
>> >>> Hi Srikanth,
>> >>>
>> >>> You are correct that in a NiFi cluster the intent would be to
>> >>> schedule GetSolr on the primary node only (on the Scheduling tab)
>> >>> so that only one node in your cluster is extracting data.
>> >>>
>> >>> GetSolr determines which SolrJ client to use based on the "Solr
>> >>> Type" property, so if you select "Cloud" it will use
>> >>> CloudSolrClient. It would send the query to one node based on the
>> >>> cluster state from ZooKeeper, and then that Solr node performs the
>> >>> distributed query.
>> >>>
>> >>> Did you have a specific use case where you wanted to query each
>> >>> shard individually?
>> >>>
>> >>> I think it would be straightforward to expose something on GetSolr
>> >>> that would set "distrib=false" on the query so that Solr would not
>> >>> execute a distributed query. You would then most likely create
>> >>> separate instances of GetSolr and configure them as the Standard
>> >>> type, pointing at the respective shards. Let us know if that is
>> >>> something you are interested in.
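>> >>>
>> >>> Roughly, the non-distributed variant would look something like this
>> >>> (an untested sketch with SolrJ 5.x-style names; the shard URL is a
>> >>> placeholder for whichever core/replica you point the processor at):
>> >>>
>> >>>   import org.apache.solr.client.solrj.SolrQuery;
>> >>>   import org.apache.solr.client.solrj.impl.HttpSolrClient;
>> >>>   import org.apache.solr.client.solrj.response.QueryResponse;
>> >>>
>> >>>   // Standard (non-cloud) client pointed directly at one shard's
>> >>>   // core. URL below is a made-up example, not a real endpoint.
>> >>>   HttpSolrClient shard = new HttpSolrClient(
>> >>>       "http://solr-node1:8983/solr/collection1_shard1_replica1");
>> >>>
>> >>>   SolrQuery query = new SolrQuery("*:*");
>> >>>   query.set("distrib", "false"); // query only this core, no fan-out
>> >>>   // (Exception handling omitted for brevity.)
>> >>>   QueryResponse response = shard.query(query);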
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Bryan
>> >>>
>> >>> On Sun, Aug 30, 2015 at 7:32 PM, Srikanth <srikanth...@gmail.com>
>> >>> wrote:
>> >>>
>> >>> > Hello,
>> >>> >
>> >>> > I started to explore the NiFi project a few days back. I'm still
>> >>> > trying it out.
>> >>> >
>> >>> > I have a few basic questions on GetSolr.
>> >>> >
>> >>> > Should GetSolr be run as an isolated processor?
>> >>> >
>> >>> > If I have SolrCloud with 4 shards/nodes and a NiFi cluster with 4
>> >>> > nodes, will GetSolr be able to query each shard from one specific
>> >>> > NiFi node? I'm guessing it doesn't work that way.
>> >>> >
>> >>> > Thanks,
>> >>> > Srikanth