Joe,

Yes, that will be a really nice feature to have. Is there a JIRA or wiki page covering the discussion you guys had on HA NCM?

Srikanth

On Wed, Sep 2, 2015 at 11:08 PM, Joe Witt <joe.w...@gmail.com> wrote:

> <- general commentary not specific to the solr case ->
>
> This concept of being able to have nodes share information about
> 'which partition' they should be responsible for is a generically
> useful and very powerful thing. We need to support it. It isn't
> immediately obvious to me how best to do this as a generic and useful
> thing but a controller service on the NCM could potentially assign
> 'partitions' to the nodes. Zookeeper could be an important part. I
> think we need to tackle the HA NCM construct we talked about months
> ago before we can do this one nicely.
>
> On Wed, Sep 2, 2015 at 7:47 PM, Srikanth <srikanth...@gmail.com> wrote:
> > Bryan,
> >
> > <Bryan> --> "I'm still a little bit unclear about the use case for querying
> > the shards individually... is the reason to do this because of a
> > performance/failover concern?"
> > <Srikanth> --> Reason to do this is to achieve better performance with the
> > convenience of automatic failover.
> > In the current mode, we do get very good failover offered by Solr. Failover
> > is seamless.
> > At the same time, we are not getting best performance. I guess it's clear to
> > us why having each NiFi process query each shard with distrib=false will
> > give better performance.
> >
> > Now, question is how do we achieve this. Making user configure one NiFi
> > processor for each Solr node is one way to go.
> > I'm afraid this will make failover a tricky process. May even need human
> > intervention.
> >
> > Another approach is to have cluster master in NiFi talk to ZK and decide
> > which shards to query. Divide these shards among slave nodes.
> > My understanding is NiFi cluster master is not intended for such purpose.
> > I'm not sure if this is even possible.
> >
> > Hope I'm a bit more clear now.
> >
> > Srikanth
> >
> > On Wed, Sep 2, 2015 at 5:58 PM, Bryan Bende <bbe...@gmail.com> wrote:
> >>
> >> Srikanth,
> >>
> >> Sorry you hadn't seen the reply, but hopefully you are subscribed to both
> >> the dev and users list now :)
> >>
> >> I'm still a little bit unclear about the use case for querying the shards
> >> individually... is the reason to do this because of a performance/failover
> >> concern? or is it something specific about how the data is shared?
> >>
> >> Let's say you have your Solr cluster with 10 shards, each on their own node
> >> for simplicity, and then your ZooKeeper cluster.
> >> Then you also have a NiFi cluster with 3 nodes each with their own nifi
> >> instance, the first node designated as the primary, and a fourth node as the
> >> cluster manager.
> >>
> >> Now if you want to extract data from your Solr cluster, you would do the
> >> following...
> >> - Drag GetSolr on to the graph
> >> - Set type to "cloud"
> >> - Set the Solr Location to the ZK hosts string
> >> - Set the scheduling to "Primary Node"
> >>
> >> When you start the processor it is now only running on the first NiFi
> >> node, and it is extracting data from all your shards at the same time.
> >> If a Solr shard/node fails this would be handled for us by the SolrJ
> >> SolrCloudClient which is using ZooKeeper to know about the state of things,
> >> and would choose a healthy replica of the shard if it existed.
> >> If the primary NiFi node failed, you would manually elect a new primary
> >> node and the extraction would resume on that node (this will get better in
> >> the future).
> >>
> >> I think if we expose the distrib=false it would allow you to query shards
> >> individually, either by having a nifi instance with a GetSolr processor per
> >> shard, or several mini-NiFis each with a single GetSolr, but
> >> I'm not sure if we could achieve the dynamic assignment you are thinking
> >> of.
> >>
> >> Let me know if I'm not making sense, happy to keep discussing and trying
> >> to figure out what else can be done.
> >>
> >> -Bryan
> >>
> >> On Wed, Sep 2, 2015 at 4:38 PM, Srikanth <srikanth...@gmail.com> wrote:
> >>>
> >>>
> >>> Bryan,
> >>>
> >>> That is correct, having the ability to query nodes with "distrib=false"
> >>> is what I was talking about.
> >>>
> >>> Instead of user having to configure each Solr node in a separate NiFi
> >>> processor, can we provide a single configuration?
> >>> It would be great if we can take just Zookeeper(ZK) host as input from
> >>> user and
> >>> i) Determine all nodes for a container from ZK
> >>> ii) Let each NiFi processor take ownership of querying a node with
> >>> "distrib=false"
> >>>
> >>> From what I understand, NiFi slaves in cluster can't talk to each other.
> >>> Will it be possible to do the ZK query part in cluster master and have
> >>> individual Solr nodes propagated to each slave?
> >>> I don't know how we can achieve this in NiFi, if at all.
> >>>
> >>> This will make Solr interface to NiFi much simpler. User needs to provide
> >>> just ZK.
> >>> We'll be able to take care of the rest. Including failing over to an
> >>> alternate Solr node when the current one fails.
> >>>
> >>> Let me know your thoughts.
> >>>
> >>> Rgds,
> >>> Srikanth
> >>>
> >>> P.S : I had subscribed only to digest and didn't receive your original
> >>> reply. Had to pull this up from mail archive.
> >>> Only Dev list is in Nabble!!
> >>>
> >>>
> >>> ***************************************************************************************************
> >>>
> >>> Hi Srikanth,
> >>>
> >>> You are correct that in a NiFi cluster the intent would be to schedule
> >>> GetSolr on the primary node only (on the scheduling tab) so that only one
> >>> node in your cluster was extracting data.
> >>>
> >>> GetSolr determines which SolrJ client to use based on the "Solr Type"
> >>> property, so if you select "Cloud" it will use SolrCloudClient. It would
> >>> send the query to one node based on the cluster state from ZooKeeper, and
> >>> then that Solr node performs the distributed query.
> >>>
> >>> Did you have a specific use case where you wanted to query each shard
> >>> individually?
> >>>
> >>> I think it would be straightforward to expose something on GetSolr that
> >>> would set "distrib=false" on the query so that Solr would not execute a
> >>> distributed query. You would then most likely create separate instances of
> >>> GetSolr and configure them as Standard type pointing at the respective
> >>> shards. Let us know if that is something you are interested in.
> >>>
> >>> Thanks,
> >>>
> >>> Bryan
> >>>
> >>>
> >>> On Sun, Aug 30, 2015 at 7:32 PM, Srikanth <srikanth...@gmail.com> wrote:
> >>>
> >>> > Hello,
> >>> >
> >>> > I started to explore NiFi project a few days back. I'm still trying it
> >>> > out.
> >>> >
> >>> > I have a few basic questions on GetSolr.
> >>> >
> >>> > Should GetSolr be run as an Isolated Processor?
> >>> >
> >>> > If I have SolrCloud with 4 shards/nodes and NiFi cluster with 4 nodes,
> >>> > will GetSolr be able to query each shard from one specific NiFi node? I'm
> >>> > guessing it doesn't work that way.
> >>> >
> >>> >
> >>> > Thanks,
> >>> > Srikanth
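
To make the shard-assignment idea discussed in this thread concrete, here is a minimal Python sketch. It is not NiFi or SolrJ code and does not reflect any existing NiFi feature. It assumes the shard list has already been read from ZooKeeper and that the set of live NiFi nodes is known; the helper names (assign_shards, shard_query_params) and the node and shard names are hypothetical. Shards are dealt out round-robin, and recomputing the assignment after a node drops out hands that node's shards to the survivors. The only Solr-specific detail is the distrib=false request parameter mentioned above.

```python
def assign_shards(shards, nodes):
    """Deal shards out round-robin across the live nodes.

    Hypothetical helper: both lists are assumed to come from elsewhere
    (for example, shard names read from ZooKeeper).
    """
    if not nodes:
        raise ValueError("no live nodes to assign shards to")
    ordered_nodes = sorted(nodes)  # sort so every caller computes the same plan
    assignment = {node: [] for node in ordered_nodes}
    for index, shard in enumerate(sorted(shards)):
        owner = ordered_nodes[index % len(ordered_nodes)]
        assignment[owner].append(shard)
    return assignment


def shard_query_params(query):
    """Request parameters for a non-distributed query against a single shard."""
    # distrib=false asks Solr not to fan the query out to other shards.
    return {"q": query, "distrib": "false"}


if __name__ == "__main__":
    shards = ["shard1", "shard2", "shard3", "shard4"]
    nodes = ["nifi-1", "nifi-2", "nifi-3"]

    print(assign_shards(shards, nodes))

    # Simulate nifi-2 failing: recompute with the surviving nodes only.
    survivors = [n for n in nodes if n != "nifi-2"]
    print(assign_shards(shards, survivors))

    print(shard_query_params("*:*"))
```

Sorting both inputs keeps the plan deterministic, so whichever component computes it (a cluster manager or a controller service, as suggested in the thread) would arrive at the same result from the same inputs.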