Here is a JIRA for it: https://issues.apache.org/jira/browse/NIFI-338 And a related one: https://issues.apache.org/jira/browse/NIFI-540
There may also be some discussion in the mailing list about it, but I
think the main points are captured in those JIRA tickets.

Thanks
Joe

On Thu, Sep 3, 2015 at 7:15 PM, Srikanth <srikanth...@gmail.com> wrote:
> Joe,
>
> Yes, that will be a really nice feature to have.
> Is there a JIRA or wiki on the discussion you guys had on HA NCM?
>
> Srikanth
>
> On Wed, Sep 2, 2015 at 11:08 PM, Joe Witt <joe.w...@gmail.com> wrote:
>>
>> <- general commentary not specific to the solr case ->
>>
>> This concept of being able to have nodes share information about
>> 'which partition' they should be responsible for is a generically
>> useful and very powerful thing. We need to support it. It isn't
>> immediately obvious to me how best to do this as a generic and useful
>> thing, but a controller service on the NCM could potentially assign
>> 'partitions' to the nodes. ZooKeeper could be an important part. I
>> think we need to tackle the HA NCM construct we talked about months
>> ago before we can do this one nicely.
>>
>> On Wed, Sep 2, 2015 at 7:47 PM, Srikanth <srikanth...@gmail.com> wrote:
>> > Bryan,
>> >
>> > <Bryan> --> "I'm still a little bit unclear about the use case for
>> > querying the shards individually... is the reason to do this because
>> > of a performance/failover concern?"
>> > <Srikanth> --> The reason to do this is to achieve better performance
>> > with the convenience of automatic failover.
>> > In the current mode, we do get very good failover offered by Solr.
>> > Failover is seamless.
>> > At the same time, we are not getting the best performance. I guess
>> > it's clear to us why having each NiFi processor query each shard with
>> > distrib=false will give better performance.
>> >
>> > Now, the question is how we achieve this. Making the user configure
>> > one NiFi processor for each Solr node is one way to go.
>> > I'm afraid this will make failover a tricky process. It may even need
>> > human intervention.
>> >
>> > Another approach is to have the cluster master in NiFi talk to ZK and
>> > decide which shards to query, then divide those shards among the
>> > slave nodes.
>> > My understanding is that the NiFi cluster master is not intended for
>> > such a purpose. I'm not sure if this is even possible.
>> >
>> > Hope I'm a bit clearer now.
>> >
>> > Srikanth
>> >
>> > On Wed, Sep 2, 2015 at 5:58 PM, Bryan Bende <bbe...@gmail.com> wrote:
>> >>
>> >> Srikanth,
>> >>
>> >> Sorry you hadn't seen the reply, but hopefully you are subscribed to
>> >> both the dev and users lists now :)
>> >>
>> >> I'm still a little bit unclear about the use case for querying the
>> >> shards individually... is the reason to do this because of a
>> >> performance/failover concern? Or is it something specific about how
>> >> the data is sharded?
>> >>
>> >> Let's say you have your Solr cluster with 10 shards, each on its own
>> >> node for simplicity, and then your ZooKeeper cluster.
>> >> Then you also have a NiFi cluster with three nodes, each with its
>> >> own NiFi instance, the first node designated as the primary, and a
>> >> fourth node as the cluster manager.
>> >>
>> >> Now if you want to extract data from your Solr cluster, you would do
>> >> the following...
>> >> - Drag GetSolr on to the graph
>> >> - Set type to "Cloud"
>> >> - Set the Solr Location to the ZK hosts string
>> >> - Set the scheduling to "Primary Node"
>> >>
>> >> When you start the processor it is now only running on the first
>> >> NiFi node, and it is extracting data from all your shards at the
>> >> same time.
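>> >>
>> >> Under the covers, Cloud mode amounts to roughly the following SolrJ
>> >> usage (an untested sketch using SolrJ 5.x-style names; the ZK string
>> >> and collection name are placeholders, and GetSolr's actual
>> >> implementation may differ):
>> >>
>> >>   import org.apache.solr.client.solrj.SolrQuery;
>> >>   import org.apache.solr.client.solrj.impl.CloudSolrClient;
>> >>   import org.apache.solr.client.solrj.response.QueryResponse;
>> >>
>> >>   // CloudSolrClient watches ZooKeeper for cluster state and picks a
>> >>   // healthy node to send the request to. ZK hosts are placeholders.
>> >>   CloudSolrClient client =
>> >>       new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
>> >>   client.setDefaultCollection("myCollection"); // placeholder name
>> >>
>> >>   SolrQuery query = new SolrQuery("*:*");
>> >>   // A normal query is distributed: the receiving Solr node fans it
>> >>   // out to every shard and merges results before responding.
>> >>   // (Exception handling omitted for brevity.)
>> >>   QueryResponse response = client.query(query);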
>> >> If a Solr shard/node fails, this would be handled for us by the
>> >> SolrJ CloudSolrClient, which uses ZooKeeper to know about the state
>> >> of things and would choose a healthy replica of the shard if one
>> >> existed.
>> >> If the primary NiFi node failed, you would manually elect a new
>> >> primary node and the extraction would resume on that node (this will
>> >> get better in the future).
>> >>
>> >> I think if we expose distrib=false it would allow you to query
>> >> shards individually, either by having a NiFi instance with a GetSolr
>> >> processor per shard, or several mini-NiFis each with a single
>> >> GetSolr, but I'm not sure we could achieve the dynamic assignment
>> >> you are thinking of.
>> >>
>> >> Let me know if I'm not making sense; happy to keep discussing and
>> >> trying to figure out what else can be done.
>> >>
>> >> -Bryan
>> >>
>> >> On Wed, Sep 2, 2015 at 4:38 PM, Srikanth <srikanth...@gmail.com> wrote:
>> >>>
>> >>> Bryan,
>> >>>
>> >>> That is correct, having the ability to query nodes with
>> >>> "distrib=false" is what I was talking about.
>> >>>
>> >>> Instead of the user having to configure each Solr node in a
>> >>> separate NiFi processor, can we provide a single configuration?
>> >>> It would be great if we could take just the ZooKeeper (ZK) host as
>> >>> input from the user and
>> >>> i) determine all nodes for a collection from ZK, and
>> >>> ii) let each NiFi processor take ownership of querying a node with
>> >>> "distrib=false".
>> >>>
>> >>> From what I understand, NiFi slaves in a cluster can't talk to each
>> >>> other. Would it be possible to do the ZK query part in the cluster
>> >>> master and have individual Solr nodes propagated to each slave?
>> >>> I don't know how we can achieve this in NiFi, if at all.
>> >>>
>> >>> This would make the Solr interface to NiFi much simpler. The user
>> >>> needs to provide just ZK, and we would take care of the rest,
>> >>> including failing over to an alternate Solr node when the current
>> >>> one fails.
>> >>>
>> >>> Let me know your thoughts.
>> >>>
>> >>> Rgds,
>> >>> Srikanth
>> >>>
>> >>> P.S.: I had subscribed only to the digest and didn't receive your
>> >>> original reply; I had to pull this up from the mail archive. Only
>> >>> the dev list is on Nabble!
>> >>>
>> >>> ***************************************************************************************************
>> >>>
>> >>> Hi Srikanth,
>> >>>
>> >>> You are correct that in a NiFi cluster the intent would be to
>> >>> schedule GetSolr on the primary node only (on the Scheduling tab)
>> >>> so that only one node in your cluster is extracting data.
>> >>>
>> >>> GetSolr determines which SolrJ client to use based on the "Solr
>> >>> Type" property, so if you select "Cloud" it will use
>> >>> CloudSolrClient. It would send the query to one node based on the
>> >>> cluster state from ZooKeeper, and then that Solr node performs the
>> >>> distributed query.
>> >>>
>> >>> Did you have a specific use case where you wanted to query each
>> >>> shard individually?
>> >>>
>> >>> I think it would be straightforward to expose something on GetSolr
>> >>> that would set "distrib=false" on the query so that Solr would not
>> >>> execute a distributed query. You would then most likely create
>> >>> separate instances of GetSolr and configure them as the Standard
>> >>> type, pointing at the respective shards. Let us know if that is
>> >>> something you are interested in.
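>> >>>
>> >>> Roughly, the non-distributed variant would look something like this
>> >>> (an untested sketch with SolrJ 5.x-style names; the shard URL is a
>> >>> placeholder for whichever core/replica you point the processor at):
>> >>>
>> >>>   import org.apache.solr.client.solrj.SolrQuery;
>> >>>   import org.apache.solr.client.solrj.impl.HttpSolrClient;
>> >>>   import org.apache.solr.client.solrj.response.QueryResponse;
>> >>>
>> >>>   // Standard (non-cloud) client pointed directly at one shard's
>> >>>   // core. URL below is a made-up example, not a real endpoint.
>> >>>   HttpSolrClient shard = new HttpSolrClient(
>> >>>       "http://solr-node1:8983/solr/collection1_shard1_replica1");
>> >>>
>> >>>   SolrQuery query = new SolrQuery("*:*");
>> >>>   query.set("distrib", "false"); // query only this core, no fan-out
>> >>>   // (Exception handling omitted for brevity.)
>> >>>   QueryResponse response = shard.query(query);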
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Bryan
>> >>>
>> >>> On Sun, Aug 30, 2015 at 7:32 PM, Srikanth <srikanth...@gmail.com>
>> >>> wrote:
>> >>>
>> >>> > Hello,
>> >>> >
>> >>> > I started to explore the NiFi project a few days back. I'm still
>> >>> > trying it out.
>> >>> >
>> >>> > I have a few basic questions on GetSolr.
>> >>> >
>> >>> > Should GetSolr be run as an isolated processor?
>> >>> >
>> >>> > If I have SolrCloud with 4 shards/nodes and a NiFi cluster with 4
>> >>> > nodes, will GetSolr be able to query each shard from one specific
>> >>> > NiFi node? I'm guessing it doesn't work that way.
>> >>> >
>> >>> > Thanks,
>> >>> > Srikanth