Hi Spark Community, Thank you for all the awesome features of Apache Spark 3.X like
- GA of Spark On K8 - Push-based external shuffle - Easy mounting of volume claims in Spark K8s We run 1000s of spark jobs on K8s, Yarn (EMR), and Databricks. We run into shuffle block loss and RDD loss due to a variety of infrastructure reasons like spot loss, memory issues, etc. Not all resource managers that we use today support external shuffle service natively, While all the existing solution to this block loss revolves around using the ShuffleManager interface to manage shuffle data into more reliable locations like cloud storage or remotely managed storages (Uber's shuffle service). My understanding is shuffle manager only handles shuffle blocks whereas the external shuffle service (client) can handle any shuffle or RDD blocks and can also support push shuffle merge. I was wondering what would be the impact of implementing an alternate solution using shared mounted volumes of high-performance filesystems and forcing the host local read mechanism in spark to fetch these blocks in any executors that need them. This should ideally solve both shuffle and RDD blocks. The abstract of the idea is as follows - Mount shared volumes (eg: AWS FSx Lustre, EFS, etc, or other network filesystems) in spark driver and executors - The Network Block Store Client (extending the existing block store client ) retrieves executor info based on metadata stored in the shared volume. - override the getHostLocalDirs external block tore client to return all the registered executor local directories (to trigger host local read in spark codebase) - update ShuffleBlockFetcherIterator's partitionBlockByFetchMode to consider Shared volume mount blocks as host local - Deploy a small number of external shuffle servers to perform Shuffle block merge based on existing push merge logic thereby adding support for the same in non-yarn resource managers. - Implement Push Based Merge functions in Network Block Store Client to work with this external shuffle servers Thanks in advance for all the feedback and suggestions. Arun Ravi M V B.Tech (Batch: 2010-2014) Computer Science and Engineering Govt. Model Engineering College Cochin University Of Science And Technology Kochi arunrav...@gmail.com +91 9995354581 Skype : arunravimv