Re: Correct way to redistribute work from disconnected instances?

2016-10-19 Thread Lei Xia
is just sitting) >>>> >>>> Then drop N1: >>>> - N2 becomes LEADER >>>> - Nothing happens to N3 >>>> >>>> Naively, I would have expected N3 to transition from Offline to >>>> Standby, but that doesn't happen. >>>> >>>> I can force redistribution from GenericHelixController#onLiveInstanceChange >>>> by >>>> - dropping non-live instances from the cluster >>>> - calling rebalance >>>> >>>> The instance dropping seems pretty unsafe! Is there a better way? >>>> >>> >>> >> > -- Lei Xia

Re: Correct way to redistribute work from disconnected instances?

2016-10-20 Thread Lei Xia
> partition and 1 replica. Each resource would reside on exactly 1 node, and > there is no limit on the number of resources per node. > > On Wed, Oct 19, 2016 at 9:23 PM, Lei Xia wrote: > >> Hi, Michael >> >> Could you be more specific on the issue you see? Specificall

Re: Too-aggressive FULL_AUTO rebalancing? (maybe fixed @ master)

2016-10-24 Thread Lei Xia
o the cluster. For example, with 2 nodes + 1 resource (1 replica, 1 partition) + OnlineOffline: https://gist.github.com/mkscrg/628ab964995c0be914d44654d26ae561/99348c870e9f028048c1d1cfdd15976325f293f9 >>>>>>> >>>>>>> However, this seems to be fixed at the current master branch on GitHub: https://gist.github.com/mkscrg/628ab964995c0be914d44654d26ae561/ec26a64a74b50c8c125ccd1f9bde1d8aa848a0b5 >>>>>>> >>>>>>> Will this fix be released in a 0.6.x version? >>>>>>> >>>>>> >>>>> >>> >> > -- Lei Xia, Senior Software Engineer, Data Infra/Nuage & Helix, LinkedIn l...@linkedin.com www.linkedin.com/in/lxia1

[ANNOUNCE] Apache Helix 0.6.6 Release

2016-11-10 Thread Lei Xia
The Apache Helix Team is pleased to announce the 9th release, 0.6.6, of the Apache Helix project. Apache Helix is a generic cluster management framework that makes it easy to build partitioned, fault tolerant, and scalable distributed systems. The full release notes are available here: http://heli

Re: Merge 0.6.x and 0.7.x to new 0.8.x branch

2016-11-22 Thread Lei Xia
if you have any suggestions. Thanks Lei On Mon, Nov 21, 2016 at 10:28 PM, kishore g wrote: > I like the overall idea. One concern is that it might be hard to maintain > backward compatibility with both 0.6 and 0.7. > > On Mon, Nov 21, 2016 at 10:17 PM, Lei Xia wrote: >

[ANNOUNCE] Apache Helix 0.6.7 Release

2017-01-26 Thread Lei Xia
The Apache Helix Team is pleased to announce the 10th release, 0.6.7, of the Apache Helix project. Apache Helix is a generic cluster management framework that makes it easy to build partitioned, fault tolerant, and scalable distributed systems. The full release notes are available here: http://he

Re: Dynamic partition management

2017-06-07 Thread Lei Xia
Hi, Subramanian Helix actually allows you to dynamically change the number of partitions in a resource. If you are using your own customized rebalancer, i.e., the rebalance mode set in the resource's IdealState is CUSTOMIZED, what you can do is manipulate the IdealState's MapFields when adding
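A minimal sketch of the idea in the message above, assuming the IdealState's map fields can be modelled as a plain map from partition name to (instance name, state) pairs. The class `IdealStateSketch` and its methods are hypothetical stand-ins for illustration, not the Helix `IdealState` API; the partition and instance names in the usage note are made up.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the map fields of a resource's IdealState:
// partition name -> (instance name -> desired state).
final class IdealStateSketch {
    private final Map<String, Map<String, String>> mapFields = new LinkedHashMap<>();

    // Adds a partition if it is new, or updates the state of one replica.
    void assign(String partition, String instance, String state) {
        mapFields.computeIfAbsent(partition, p -> new LinkedHashMap<>())
                 .put(instance, state);
    }

    // Removes a partition and all of its replica placements.
    void drop(String partition) {
        mapFields.remove(partition);
    }

    // The partition count simply follows the number of map-field entries.
    int numPartitions() {
        return mapFields.size();
    }

    // Returns the placement of one partition, or an empty map if unknown.
    Map<String, String> placement(String partition) {
        return mapFields.getOrDefault(partition, Collections.emptyMap());
    }
}
```

Calling `assign("myDB_2", "node1", "MASTER")` on a sketch that already holds `myDB_0` and `myDB_1` grows it to three partitions; in a real cluster the updated IdealState would then have to be written back for the controller to act on it.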

Re: Data synch between replicas

2017-10-31 Thread Lei Xia
Hi, Leela The Master/Slave model does not support that because there is no way Helix can differentiate two slave replicas unless your application has customized logic to perform the check. However, for your case, you can create your own state model instead of using the default MasterSlave model. Yo

Re: differentiate between bootstrap and a soft failure

2018-01-23 Thread Lei Xia
cipant will be back online soon and also you can tolerate losing one or more replicas in the short term, then you can set a delay time here. Helix will not bring up a new replica before this time has elapsed. Hope that makes it more clear. Thanks Lei Lei Xia Data Infra/Helix l...@linkedin.c
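The delay described above can be sketched as a single time comparison. This is an illustration only, assuming the controller knows when the participant went offline; the class `DelayedRebalanceSketch` and its parameter names are hypothetical and not taken from Helix.

```java
// Hypothetical model of a rebalance delay: a replacement replica is only
// created once the participant has been offline for at least delayMs.
final class DelayedRebalanceSketch {
    private DelayedRebalanceSketch() {}

    // offlineSinceMs: time the participant was first seen offline.
    // nowMs:          current time on the same clock.
    // delayMs:        configured delay; 0 means replace immediately.
    static boolean shouldReplaceReplica(long offlineSinceMs, long nowMs, long delayMs) {
        return nowMs - offlineSinceMs >= delayMs;
    }
}
```

With a delay of 60000 ms, a participant that returns within the minute keeps its replicas, which is the soft-failure case the thread is about; a longer outage falls through to normal re-assignment.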

Re: differentiate between bootstrap and a soft failure

2018-01-26 Thread Lei Xia
at a >>>> time? During a state transition, a participant needs to setup proper >>>> replication upstream for itself (in the case where it is transiting to >>>> Slave) or other replicas (in the case it is transiting to Master). So the >>

Re: build failure for helix-front

2018-01-31 Thread Lei Xia
helix-ui is written in node.js and it does not publish any Jar or other artifact along with our release, that is why we did not find this issue in our release process. Our release script did not bump the version in helix-ui submodule pom file. Let us fix the script and regenerate our release cand

Re: build failure for helix-front

2018-01-31 Thread Lei Xia
By helix-ui I meant helix-front. Lei On Wed, Jan 31, 2018 at 8:49 AM Lei Xia wrote: > helix-ui is written in node.js and it does not publish any Jar or other > artifact along with our release, that is why we did not find this issue in > our release process. Our release script did

Re: build failure for helix-front

2018-01-31 Thread Lei Xia
n use > the admin features like adding a cluster etc. > > On Jan 31, 2018 08:51, "Lei Xia" wrote: > >> By helix-ui I meant helix-front. >> >> >> Lei >> >> On Wed, Jan 31, 2018 at 8:49 AM Lei Xia wrote: >> >>> helix-ui is writt

Re: Topology doesn't work as expected

2018-02-13 Thread Lei Xia
Hi, Bo Please add "TOPOLOGY_AWARE_ENABLED" : "true" to your clusterConfig and try again. Thanks Lei On Tue, Feb 13, 2018 at 2:48 PM, Bo Liu wrote: > Hi Helix Team, > > I am doing some tests for the Helix topology > > The cluster configuration is: > DELAY_REBALANCE_DISABLE : "false" DELAY_REB

Re: Is there a way to reset an OnlineOffline resource?

2018-02-16 Thread Lei Xia
et hosted on the resource. >> We tried to disable&enable the resource. It doesn't change the states of >> any partitions. So I guess it only disables Helix management for the >> resource? >> >> -- >> Best regards, >> Bo >> >> > -- Lei Xia

Re: Throttle at partition level?

2018-02-16 Thread Lei Xia
upport throttling state transition at partition level? >>>> I only find cluster, resource and instance level throttling as below: >>>> >>>> public enum ThrottleScope { >>>> CLUSTER, >>>> RESOURCE, >>>> INSTANCE >>>> } >>>> >>>> >>>> >>>> -- >>>> Best regards, >>>> Bo >>>> >>>> >>> >>> >>> -- >>> Best regards, >>> Bo >>> >>> >> >> >> -- >> Best regards, >> Bo >> >> > -- Lei Xia

Re: Is there a way to reset an OnlineOffline resource?

2018-02-16 Thread Lei Xia
We tried to disable it through helix ui and restful api. > > Yes, I think it's not caused by the delay feature. Because the disabled > resource stayed at online state forever. > > On Feb 16, 2018 08:48, "Lei Xia" wrote: > >> Hi, Bo >> >> Disable a res

Re: Unable to get Rebalance Delay to work using the distributed lock manager recipe

2018-02-24 Thread Lei Xia
ed By
> ==
> lock-group_0 localhost_12000
> lock-group_1 localhost_12001
> lock-group_10 localhost_12002
> lock-group_11 localhost_12000
> lock-group_2 localhost_12001
> lock-group_3 localhost_12002
> lock-group_4 localhost_12000
> lock-group_5 localhost_12001
> lock-group_6 localhost_12002
> lock-group_7 localhost_12000
> lock-group_8 localhost_12001
> lock-group_9 localhost_12002
>
-- Lei Xia

Re: Unable to get Rebalance Delay to work using the distributed lock manager recipe

2018-03-01 Thread Lei Xia
host_12002
> lock-group_11 localhost_12000
> lock-group_2 localhost_12001
> lock-group_3 localhost_12002
> lock-group_4 localhost_12000
> lock-group_5 localhost_12001
> lock-group_6 localhost_12002
> lock-group_7 localhost_12000
> lock-group_8 localhost_12001
>

Re: protect a cluster during broad range outage

2018-03-19 Thread Lei Xia
Sorry, I totally missed this email thread. Yes, we do have such a feature in 0.8 to protect the cluster in case of a disaster. A new config option "MAX_OFFLINE_INSTANCES_ALLOWED" can be set in ClusterConfig. If it is set, and the number of offline instances reaches the set limit in the c

Re: protect a cluster during broad range outage

2018-03-19 Thread Lei Xia
ve to call HelixAdmin.enableMaintenanceMode() manually to exit the maintenance mode. Support for automatically exiting maintenance mode is on our road-map. Lei Lei Xia Data Infra/Helix l...@linkedin.com www.linkedin.com/in/lxia1
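The behaviour described in this thread can be sketched as a small guard: once the number of offline instances reaches the configured limit, the cluster enters maintenance mode and stays there until someone exits it by hand. This is a simplified model under stated assumptions, not Helix code; the class `OfflineGuardSketch` is hypothetical, and the `>=` comparison follows the wording "reaches the set limit" in the message, so the exact check in Helix may differ.

```java
// Hypothetical model of the MAX_OFFLINE_INSTANCES_ALLOWED guard.
final class OfflineGuardSketch {
    private final int maxOfflineAllowed;
    private boolean inMaintenance;

    OfflineGuardSketch(int maxOfflineAllowed) {
        this.maxOfflineAllowed = maxOfflineAllowed;
    }

    // Called whenever the set of live instances changes.
    void onLiveInstanceChange(int totalInstances, int liveInstances) {
        int offline = totalInstances - liveInstances;
        if (offline >= maxOfflineAllowed) {
            inMaintenance = true;
        }
        // Deliberately no automatic exit: instances coming back
        // do not clear the flag.
    }

    // Models the manual exit mentioned in the message.
    void exitMaintenanceManually() {
        inMaintenance = false;
    }

    boolean isInMaintenance() {
        return inMaintenance;
    }
}
```

The one-way flag is the point of the design: a broad outage freezes rebalancing, and recovery of the nodes alone is not treated as proof that it is safe to resume.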

Re: protect a cluster during broad range outage

2018-03-19 Thread Lei Xia
tting started for the first time. Will it >> get enabled only after min nodes are started? >> >> thanks >> >> On Mon, Mar 19, 2018 at 6:42 PM, Lei Xia wrote: >> >>> Actually we already supported maintenance mode in 0.8.0. My bad. >>> >>>

Re: Helix UI issue

2018-06-03 Thread Lei Xia
issue before? > > Thanks, > -- > Best regards, > Bo > > -- Lei Xia

Re: Does participant create new thread to handle state transition messages?

2018-07-16 Thread Lei Xia
Hi, Bo Helix participant creates a thread-pool to handle the state transition by default, and the application can supply its own thread-pool for specific state-transition too. The default thread-pool size is 40, which is configurable. Lei On Mon, Jul 16, 2018 at 11:26 AM, Bo Liu wrote: > H
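A rough sketch of the arrangement described above, using only `java.util.concurrent`: one default pool of 40 threads, plus optional pools registered for specific transitions. The class `TransitionExecutorsSketch` and its method names are hypothetical and only illustrate the lookup; how a participant actually registers a custom pool with Helix is not shown in this snippet.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical model: a default pool plus per-transition overrides.
final class TransitionExecutorsSketch {
    // Default size quoted in the message; treated here as configurable.
    static final int DEFAULT_POOL_SIZE = 40;

    private final ExecutorService defaultPool;
    private final Map<String, ExecutorService> perTransition = new ConcurrentHashMap<>();

    TransitionExecutorsSketch(int defaultPoolSize) {
        // Threads in a fixed pool are created lazily as tasks arrive.
        this.defaultPool = Executors.newFixedThreadPool(defaultPoolSize);
    }

    private static String key(String fromState, String toState) {
        return fromState + "->" + toState;
    }

    // Registers an application-supplied pool for one specific transition.
    void register(String fromState, String toState, ExecutorService pool) {
        perTransition.put(key(fromState, toState), pool);
    }

    // Picks the custom pool if one is registered, else the default pool.
    ExecutorService executorFor(String fromState, String toState) {
        return perTransition.getOrDefault(key(fromState, toState), defaultPool);
    }

    // Shuts down the default pool and every registered pool.
    void shutdown() {
        defaultPool.shutdown();
        perTransition.values().forEach(ExecutorService::shutdown);
    }
}
```

A separate pool for a slow transition such as OFFLINE to SLAVE keeps long bootstraps from occupying all 40 default threads while faster transitions wait.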

Re: Does participant create new thread to handle state transition messages?

2018-07-17 Thread Lei Xia
s a little suspicious, as it is a cached > thread pool, which could terminate and create threads on the fly. However, > I am not sure if it is used for state transitioning for OnlineOffline model? > > > > On Mon, Jul 16, 2018 at 12:33 PM Lei Xia wrote: > >> Hi, Bo >>

Re: Helix and routing pattern to a server that has a particular partition

2018-10-05 Thread Lei Xia
ure requests are routed to the correct node, in this case > a node that is the master of that particular partition? > > Regards, > > Rob > -- Lei Xia

Re: Helix and routing pattern to a server that has a particular partition

2018-10-05 Thread Lei Xia
list = routingTableProvider.getInstances("data2", "data2_0","MASTER"); On Fri, Oct 5, 2018 at 4:07 PM Rob McKinnon wrote: > Lei - I am using version 0.8.2 > > On Fri, Oct 5, 2018 at 7:02 PM Lei Xia wrote: > >> Hi, Rob >> >>Which
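To illustrate what a lookup like the `getInstances("data2", "data2_0", "MASTER")` call above does conceptually, here is a self-contained sketch of a routing table keyed by resource, partition, and state. The class `RoutingTableSketch` is a hypothetical simplification that stores plain instance names and is filled by hand; the real `RoutingTableProvider` is kept up to date by Helix as cluster state changes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a spectator-side routing table.
final class RoutingTableSketch {
    private final Map<String, List<String>> index = new HashMap<>();

    private static String key(String resource, String partition, String state) {
        return resource + "/" + partition + "/" + state;
    }

    // Records that an instance currently holds a partition in a given state.
    void record(String resource, String partition, String state, String instance) {
        index.computeIfAbsent(key(resource, partition, state), k -> new ArrayList<>())
             .add(instance);
    }

    // Returns the instances holding the partition in that state, or an
    // empty list if none do (for example while mastership is moving).
    List<String> getInstances(String resource, String partition, String state) {
        return index.getOrDefault(key(resource, partition, state), Collections.emptyList());
    }
}
```

A client would route a request for partition `data2_0` to whichever instance this lookup returns for the MASTER state, and should be prepared for an empty result during a transition.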

Re: User-Define Rebalancing Issue

2018-10-09 Thread Lei Xia
tate(Conf.CLUSTER_NAME, Conf.RESOURCE_NAME, > idealState); > > > admin.rebalance(Conf.CLUSTER_NAME, RESOURCE_NAME, NUM_REPLICAS); > } > > == > I was expecting that when calling the "admin.rebalance" method, it would > invoke "MyRebalance" code but when I run it "MyRebalance" code was not > invoked. > > > Thanks, > > Rob > -- Lei Xia

Re: who uses helix

2019-10-20 Thread Lei Xia
covery-for-rocksplicator-f1f8fd35c833 * * Airbnb’s Change Data Capture system: https://medium.com/airbnb-engineering/capturing-data-evolution-in-a-service-oriented-architecture-72f7c643ee6f Lei Xia Data Infra/Helix l...@linkedin.com www.linkedin.c

Re: who uses helix

2019-10-20 Thread Lei Xia
tps://github.com/apache/helix/wiki/Weight-aware-Globally-Evenly-distributed-Rebalancer> Lei Xia Data Infra/Helix l...@linkedin.com www.linkedin.com/in/

Re: Long running jobs and node drain

2020-05-11 Thread Lei Xia
partition (P) is also started on the new node (N2). >>>>> >>>>> 3. N1 can be put out of service only when all running jobs (J) on it >>>>> are over, at this point only N2 will serve P request. >>>>> >>>>> Questions: >>>>> 1. Can drain process be modeled using helix? >>>>> 2. If yes, is there any recipe / pointers for a helix state model? >>>>> 3. Is there any custom way to trigger state transitions? From >>>>> documentation, I gather that Helix controller in full auto mode triggers >>>>> state transitions only when number of partitions change or cluster changes >>>>> (node addition or deletion) >>>>> 4. I guess a spectator will be needed for custom routing logic in such >>>>> cases; any pointers for the same? >>>>> >>>>> Thank You >>>>> Santosh >>>>> >>>> -- Lei Xia

Re: Long running jobs and node drain

2020-05-13 Thread Lei Xia
's a >>> central storage for state of cluster which I can use for my routing logic. >>> 3. A job could be running for hours and thus drain can happen for a long >>> time. >>> >>> >>> " How long you would expect OFFLINE->UP take here, if i

Re: Long running jobs and node drain

2020-05-13 Thread Lei Xia
void offlineToSlave(Message message, NotificationContext context) { > //don't return until long long running job is running > } > > On Wed, May 13, 2020 at 10:40 PM Lei Xia wrote: > >> Hi, Santosh >> >> Thanks for explaining your case in detail. In this cas

Re: Question: How to transition from Master to Error

2020-06-15 Thread Lei Xia
Updating the state on the model just updates the local variable and doesn't > notify the controller. > > Any pointers or examples would be appreciated. > > Thanks, > Imran > -- Lei Xia