Hi Zhu,
Thanks for the reply.
> One question which might be out of the scope is that whether we should do
> similar things for ShuffleEnvironment?
I agree we should also consider fatal error handling for ShuffleEnvironment
eventually.
> I think the proposal does not conflict with this target.
Thanks for starting this discussion.
Here are some of my thoughts regarding the proposal and discussions
above.
*+1 to enabling ShuffleMaster to stop tracking partitions proactively*
In production, we have encountered cases where it takes *hours* to recover
from the loss of a remote shuffle worker.
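To illustrate why proactive notification helps: if the ShuffleMaster can push a loss event to the tracker, all partitions on a dead worker are dropped in one step instead of each consumer discovering the loss via a fetch failure. Below is a minimal Java sketch under stated assumptions; the names `ShuffleMasterListener`, `onPartitionsLost`, and the toy tracker are illustrative only and not part of the current Flink API.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical callback the ShuffleMaster could use to push loss events.
interface ShuffleMasterListener {
    void onPartitionsLost(Set<String> partitionIds);
}

// Toy partition tracker that stops tracking partitions as soon as it is notified.
class PartitionTracker implements ShuffleMasterListener {
    private final Set<String> tracked = new HashSet<>();

    void startTracking(String partitionId) { tracked.add(partitionId); }

    boolean isTracked(String partitionId) { return tracked.contains(partitionId); }

    @Override
    public void onPartitionsLost(Set<String> partitionIds) {
        // Proactive notification: drop every partition on the lost worker at
        // once, instead of waiting for consumers to hit fetch failures.
        tracked.removeAll(partitionIds);
    }
}

public class ProactiveLossDemo {
    public static void main(String[] args) {
        PartitionTracker tracker = new PartitionTracker();
        tracker.startTracking("p1");
        tracker.startTracking("p2");
        // Simulate the ShuffleMaster detecting that the worker holding p1/p2 died.
        tracker.onPartitionsLost(new HashSet<>(Arrays.asList("p1", "p2")));
        System.out.println(tracker.isTracked("p1")); // false
    }
}
```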
Hi,
Thanks for the reply.
@Guowei
I agree that we can move forward step by step and start from the most
important part. Apart from the two points mentioned in your reply,
initializing and shutting down external resources gracefully is also
important, which is one reason for the open/close methods.
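As a concrete sketch of the open/close lifecycle discussed here: `open()` would acquire external resources (e.g. a connection to a remote shuffle cluster) and `close()` would release them gracefully. The interface and class names below are illustrative assumptions, not the actual Flink API.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical lifecycle contract for a shuffle component.
interface LifecycleAware extends AutoCloseable {
    void open() throws Exception;   // initialize external resources

    @Override
    void close() throws Exception;  // release them gracefully
}

class RemoteShuffleMasterSketch implements LifecycleAware {
    private final AtomicBoolean running = new AtomicBoolean(false);

    @Override
    public void open() { running.set(true); /* e.g. connect to shuffle workers */ }

    @Override
    public void close() { running.set(false); /* e.g. release connections */ }

    boolean isRunning() { return running.get(); }
}

public class LifecycleDemo {
    public static void main(String[] args) throws Exception {
        // try-with-resources guarantees close() runs even if the body throws,
        // so external resources are not leaked on failure paths.
        try (RemoteShuffleMasterSketch master = new RemoteShuffleMasterSketch()) {
            master.open();
            System.out.println(master.isRunning()); // true
        }
    }
}
```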
One quick comment: when developing the ShuffleService abstraction, we also
thought that different jobs might want to use different ShuffleServices
depending on their workloads (e.g., batch vs. streaming). So ideally, the
chosen solution here should also support this use case eventually.
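Selecting a shuffle implementation per job could look roughly like the sketch below. Everything here is a hypothetical illustration (the enum, interface, and service names are made up, not Flink API); it only shows the shape of a per-job choice between a batch-oriented and a streaming-oriented shuffle.

```java
// Illustrative sketch: choosing a shuffle implementation per job by workload
// type. Names are hypothetical and do not correspond to real Flink classes.
enum WorkloadType { BATCH, STREAMING }

interface ShuffleServiceSketch {
    String name();
}

public class ShuffleServiceSelector {
    static ShuffleServiceSketch forJob(WorkloadType type) {
        switch (type) {
            case BATCH:
                // Batch jobs may prefer a spilling/remote blocking shuffle.
                return () -> "remote-batch-shuffle";
            default:
                // Streaming jobs typically use a pipelined network shuffle.
                return () -> "netty-pipelined-shuffle";
        }
    }

    public static void main(String[] args) {
        System.out.println(forJob(WorkloadType.BATCH).name()); // remote-batch-shuffle
    }
}
```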
Hi,
Thanks Yingjie for initiating this discussion. As I understand it, the
document [1] mainly discusses two issues:
1. The ShuffleMaster should be at the cluster level instead of the job level.
2. The ShuffleMaster should notify the PartitionTracker when some data has
been lost.
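The first issue can be sketched as a single ShuffleMaster instance that outlives individual jobs, with jobs registering and unregistering against it. This is only an illustration of the cluster-level lifecycle; the `registerJob`/`unregisterJob` method names are assumptions, not the current API.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of issue 1: one ShuffleMaster per cluster rather than one per
// JobMaster. Jobs come and go while the master instance survives.
public class ClusterShuffleMasterSketch {
    private final Set<String> registeredJobs = new HashSet<>();

    void registerJob(String jobId) { registeredJobs.add(jobId); }

    void unregisterJob(String jobId) { registeredJobs.remove(jobId); }

    int registeredJobCount() { return registeredJobs.size(); }

    public static void main(String[] args) {
        // A single instance serves multiple jobs over its lifetime.
        ClusterShuffleMasterSketch master = new ClusterShuffleMasterSketch();
        master.registerJob("job-1");
        master.registerJob("job-2");
        master.unregisterJob("job-1");
        System.out.println(master.registeredJobCount()); // 1
    }
}
```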
Relatively speaking,
Hi devs,
I'd like to start a discussion about "Lifecycle of ShuffleMaster and its
Relationship with JobMaster and PartitionTracker". (These are things we
found when moving our external shuffle to the pluggable shuffle service
framework.)
The mail client may fail to display the right format. If