Re: Solr Cloud support

Joe Witt Wed, 02 Sep 2015 20:09:19 -0700

<- general commentary not specific to the solr case ->

This concept of being able to have nodes share information about
'which partition' they should be responsible for is a generically
useful and very powerful thing.  We need to support it.  It isn't
immediately obvious to me how best to do this as a generic and useful
thing but a controller service on the NCM could potentially assign
'partitions' to the nodes.  Zookeeper could be an important part.  I
think we need to tackle the HA NCM construct we talked about months
ago before we can do this one nicely.


On Wed, Sep 2, 2015 at 7:47 PM, Srikanth <srikanth...@gmail.com> wrote:
> Bryan,
>
> <Bryan> --> "I'm still a little bit unclear about the use case for querying
> the shards individually... is the reason to do this because of a
> performance/failover concern?"
> <Srikanth> --> Reason to do this is to achieve better performance with the
> convenience of automatic failover.
> In the current mode, we do get very good failover offered by Solr. Failover
> is seamless.
> At the same time, we are not getting best performance. I guess its clear to
> us why having each NiFi process query each shard with distrib=false will
> give better performance.
>
> Now, question is how do we achieve this. Making user configure one NiFi
> processor for each Solr node is one way to go.
> I'm afraid this will make failover a tricky process. May even need human
> intervention.
>
> Another approach is to have cluster master in NiFi talk to ZK and decide
> which shards to query. Divide these shards among slave nodes.
> My understanding is NiFi cluster master is not indented for such purpose.
> I'm not sure if this even possible.
>
> Hope I'm a bit more clear now.
>
> Srikanth
>
> On Wed, Sep 2, 2015 at 5:58 PM, Bryan Bende <bbe...@gmail.com> wrote:
>>
>> Srikanth,
>>
>> Sorry you hadn't seen the reply, but hopefully you are subscribed to both
>> the dev and users list now :)
>>
>> I'm still a little bit unclear about the use case for querying the shards
>> individually... is the reason to do this because of a performance/failover
>> concern? or is it something specific about how the data is shared?
>>
>> Lets say you have your Solr cluster with 10 shards, each on their own node
>> for simplicity, and then your ZooKeeper cluster.
>> Then you also have a NiFi cluster with 3 nodes each with their own nifi
>> instance, the first node designated as the primary, and a fourth node as the
>> cluster manager.
>>
>> Now if you want to extract data from your Solr cluster, you would do the
>> following...
>> - Drag GetSolr on to the graph
>> - Set type to "cloud"
>> - Set the Solr Location to the ZK hosts string
>> - Set the scheduling to "Primary Node"
>>
>> When you start the processor it is now only running on the first NiFi
>> node, and it it is extracting data from all your shards at the same time.
>> If a Solr shard/node fails this would be handled for us by the SolrJ
>> SolrCloudClient which is using ZooKeeper to know about the state of things,
>> and would choose a healthy replica of the shard if it existed.
>> If the primary NiFi node failed, you would manually elect a new primary
>> node and the extraction would resume on that node (this will get better in
>> the future).
>>
>> I think if we expose the distrib=false it would allow you to query shards
>> individually, either by having a nifi instance with a GetSolr processor per
>> shard, or several mini-NiFis each with a single GetSolr, but
>> I'm not sure if we could achieve the dynamic assignment you are thinking
>> of.
>>
>> Let me know if I'm not making sense, happy to keep discussing and trying
>> to figure out what else can be done.
>>
>> -Bryan
>>
>> On Wed, Sep 2, 2015 at 4:38 PM, Srikanth <srikanth...@gmail.com> wrote:
>>>
>>>
>>> Bryan,
>>>
>>> That is correct, having the ability to query nodes with "distrib=false"
>>> is what I was talking about.
>>>
>>> Instead of user having to configure each Solr node in a separate NiFi
>>> processor, can we provide a single configuration??
>>> It would be great if we can take just Zookeeper(ZK) host as input from
>>> user and
>>>   i) Determine all nodes for a container from ZK
>>>   ii) Let each NiFi processor takes ownership of querying a node with
>>> "distrib=false"
>>>
>>> From what I understand, NiFi slaves in cluster can't talk to each other.
>>> Will it be possible to do the ZK query part in cluster master and have
>>> individual Solr nodes propagated to each slave?
>>> I don't know how we can achieve this in NiFi, if at all.
>>>
>>> This will make Solr interface to NiFi much simpler. User needs to provide
>>> just ZK.
>>> We'll be able to take care rest. Including failing over to an alternate
>>> Solr node with current one fails.
>>>
>>> Let me know your thoughts.
>>>
>>> Rgds,
>>> Srikanth
>>>
>>> P.S : I had subscribed only to digest and didn't receive your original
>>> reply. Had to pull this up from mail archive.
>>> Only Dev list is in Nabble!!
>>>
>>>
>>> ***************************************************************************************************
>>>
>>> Hi Srikanth,
>>>
>>> You are correct that in a NiFi cluster the intent would be to schedule
>>> GetSolr on the primary node only (on the scheduling tab) so that only one
>>> node in your cluster was extracting data.
>>>
>>> GetSolr determines which SolrJ client to use based on the "Solr Type"
>>> property, so if you select "Cloud" it will use SolrCloudClient. It would
>>> send the query to one node based on the cluster state from ZooKeeper, and
>>> then that Solr node performs the distributed query.
>>>
>>> Did you have a specific use case where you wanted to query each shard
>>> individually?
>>>
>>> I think it would be straight forward to expose something on GetSolr that
>>> would set "distrib=false" on the query so that Solr would not execute a
>>> distributed query. You would then most likely create separate instances
>>> of
>>> GetSolr and configure them as Standard type pointing at the respective
>>> shards. Let us know if that is something you are interested in.
>>>
>>> Thanks,
>>>
>>> Bryan
>>>
>>>
>>> On Sun, Aug 30, 2015 at 7:32 PM, Srikanth <srikanth...@gmail.com> wrote:
>>>
>>> > Hello,
>>> >
>>> > I started to explore NiFi project a few days back. I'm still trying it
>>> > out.
>>> >
>>> > I have a few basic question on GetSolr.
>>> >
>>> > Should GetSolr be run as an Isolated Processor?
>>> >
>>> > If I have SolrCloud with 4 shards/nodes and NiFi cluster with 4 nodes,
>>> > will GetSolr be able to query each shard from one specific NiFi node?
>>> > I'm
>>> > guessing it doesn't work that way.
>>> >
>>> >
>>> > Thanks,
>>> > Srikanth
>>> >
>>> >
>>
>>
>

Re: Solr Cloud support

Reply via email to