Re: [DISCUSS] Lifecycle of ShuffleMaster and its Relationship with JobMaster and PartitionTracker

2021-07-09 Thread Yingjie Cao
Hi Zhu, Thanks for the reply.
> One question which might be out of the scope is that whether we should do similar things for ShuffleEnvironment?
I agree we should also consider fatal error handling for ShuffleEnvironment eventually.
> I think the proposal does not conflict with this target.

Re: [DISCUSS] Lifecycle of ShuffleMaster and its Relationship with JobMaster and PartitionTracker

2021-07-08 Thread Zhu Zhu
Thanks for starting this discussion. Here are some of my thoughts regarding the proposal and discussions above. *+1 to enable ShuffleMaster to stop tracking partitions proactively* In production we have encountered cases where it takes *hours* to recover from the loss of a remote shuffle worker.
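To make the "stop tracking partitions proactively" idea concrete, here is a minimal sketch of one way it could look. It is not Flink's actual API: `PartitionLossListener`, `SketchPartitionTracker`, and `SketchShuffleMaster` are hypothetical names invented for illustration, and partitions are identified by plain strings. The assumption is that the shuffle master learns about a lost worker first and pushes that information to the tracker, rather than waiting for consumers to fail.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ProactiveStopTrackingSketch {

    /** Hypothetical callback through which a shuffle master reports lost partitions. */
    interface PartitionLossListener {
        void onPartitionsLost(Collection<String> partitionIds);
    }

    /** Hypothetical stand-in for a partition tracker: remembers which partitions are usable. */
    static final class SketchPartitionTracker implements PartitionLossListener {
        private final Set<String> tracked = new HashSet<>();

        void startTracking(String partitionId) {
            tracked.add(partitionId);
        }

        boolean isTracked(String partitionId) {
            return tracked.contains(partitionId);
        }

        @Override
        public void onPartitionsLost(Collection<String> partitionIds) {
            // Stop tracking immediately so the producers can be rescheduled
            // before any consumer tries to read the missing data.
            tracked.removeAll(partitionIds);
        }
    }

    /** Hypothetical stand-in for a shuffle master that detects a lost remote worker. */
    static final class SketchShuffleMaster {
        private final PartitionLossListener listener;

        SketchShuffleMaster(PartitionLossListener listener) {
            this.listener = listener;
        }

        /** Called when a remote shuffle worker holding the given partitions is lost. */
        void workerLost(Collection<String> partitionsOnWorker) {
            listener.onPartitionsLost(partitionsOnWorker);
        }
    }

    public static void main(String[] args) {
        SketchPartitionTracker tracker = new SketchPartitionTracker();
        tracker.startTracking("p1");
        tracker.startTracking("p2");

        SketchShuffleMaster master = new SketchShuffleMaster(tracker);
        master.workerLost(List.of("p1"));

        System.out.println("p1 tracked: " + tracker.isTracked("p1"));
        System.out.println("p2 tracked: " + tracker.isTracked("p2"));
    }
}
```

In this sketch the tracker drops lost partitions as soon as the master reports them, which is the kind of early notification that could shorten the recovery time described above.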

Re: [DISCUSS] Lifecycle of ShuffleMaster and its Relationship with JobMaster and PartitionTracker

2021-07-07 Thread Yingjie Cao
Hi, Thanks for the reply. @Guowei I agree that we can move forward step by step and start from the most important part. Apart from the two points mentioned in your reply, initializing and shutting down some external resources gracefully is also important, which is a reason for the open/close

Re: [DISCUSS] Lifecycle of ShuffleMaster and its Relationship with JobMaster and PartitionTracker

2021-07-07 Thread Till Rohrmann
One quick comment: When developing the ShuffleService abstraction we also thought that different jobs might want to use different ShuffleServices depending on their workload (e.g. batch vs. streaming workload). So ideally, the chosen solution here can also support this use case eventually.
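As a rough illustration of the per-job use case, the sketch below picks a shuffle service by workload type and falls back to a default otherwise. Everything here is an assumption made for illustration: `Workload`, `SketchShuffleService`, and `Registry` are hypothetical names, not part of Flink's ShuffleService abstraction, and the thread does not say how such a selection would actually be configured.

```java
import java.util.EnumMap;
import java.util.Map;

public class PerJobShuffleSelectionSketch {

    /** Hypothetical workload categories a job might declare. */
    enum Workload { BATCH, STREAMING }

    /** Hypothetical minimal view of a shuffle service; only a name, for illustration. */
    interface SketchShuffleService {
        String name();
    }

    /** Hypothetical registry that maps a job's workload type to a shuffle service. */
    static final class Registry {
        private final Map<Workload, SketchShuffleService> byWorkload = new EnumMap<>(Workload.class);
        private final SketchShuffleService fallback;

        Registry(SketchShuffleService fallback) {
            this.fallback = fallback;
        }

        void register(Workload workload, SketchShuffleService service) {
            byWorkload.put(workload, service);
        }

        /** Returns the service registered for the workload, or the fallback if none is. */
        SketchShuffleService select(Workload workload) {
            return byWorkload.getOrDefault(workload, fallback);
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry(() -> "default");
        registry.register(Workload.BATCH, () -> "remote-batch");

        System.out.println(registry.select(Workload.BATCH).name());
        System.out.println(registry.select(Workload.STREAMING).name());
    }
}
```

A cluster-level design would need something along these lines if several shuffle services are to coexist in one cluster, with the selection made per job rather than per cluster.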

Re: [DISCUSS] Lifecycle of ShuffleMaster and its Relationship with JobMaster and PartitionTracker

2021-07-07 Thread Guowei Ma
Hi, Thanks, Yingjie, for initiating this discussion. As I understand it, the document [1] mainly discusses two issues:
1. ShuffleMaster should be at the cluster level instead of the job level
2. ShuffleMaster should notify PartitionTracker that some data has been lost
Relatively speaking,

[DISCUSS] Lifecycle of ShuffleMaster and its Relationship with JobMaster and PartitionTracker

2021-06-11 Thread Yingjie Cao
Hi devs, I'd like to start a discussion about "Lifecycle of ShuffleMaster and its Relationship with JobMaster and PartitionTracker". (These are things we found when moving our external shuffle to the pluggable shuffle service framework.) The mail client may fail to display the right format. If