A colleague and I were having a discussion and disagreeing about something in Spark/Mesos that perhaps someone can shed some light on.
We have a Mesos cluster that runs Spark via a sparkHome, rather than downloading an executable and such. My colleague says that if we have Parquet files in S3, the slaves should know what data is in their partition and pull from S3 only the Parquet partitions they need. This seems inherently wrong to me, since I don't see how Spark or Mesos could know, on the slave side, which partitions to pull. It makes much more sense to me for the partitioning to be done on the driver and then distributed to the slaves, so the slaves don't have to worry about these details. If that were the case, some data loading would be done on the driver, correct? Or does Spark/Mesos do some magic to pass a reference so the slaves know what to pull, per se?

So I guess, in summation: where do partitioning and data loading happen, on the driver or on the executor?

Thanks,
Steve
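To make the distinction concrete, here is a minimal, hypothetical sketch (plain Python, not Spark's actual internals) of the pattern in question: the driver plans partitions using only file *metadata* (names and sizes), and each executor fetches only the byte range its split describes. All names here (`Split`, `plan_splits`, `read_split`) are illustrative, not real Spark APIs.

```python
from dataclasses import dataclass

@dataclass
class Split:
    path: str    # object key, e.g. an S3 key
    start: int   # byte offset where this split begins
    length: int  # number of bytes in this split

def plan_splits(file_sizes: dict, split_bytes: int) -> list:
    """Driver side: turn file metadata (names + sizes) into splits.
    No file contents are read here -- only a listing is needed."""
    splits = []
    for path, size in sorted(file_sizes.items()):
        offset = 0
        while offset < size:
            length = min(split_bytes, size - offset)
            splits.append(Split(path, offset, length))
            offset += length
    return splits

def read_split(storage: dict, split: Split) -> bytes:
    """Executor side: fetch only the bytes this split covers
    (analogous to a ranged GET against S3)."""
    return storage[split.path][split.start:split.start + split.length]

# Toy "S3": two objects of 10 and 5 bytes.
storage = {"a.parquet": b"0123456789", "b.parquet": b"abcde"}
sizes = {k: len(v) for k, v in storage.items()}

splits = plan_splits(sizes, split_bytes=4)            # driver: metadata only
parts = [read_split(storage, s) for s in splits]      # executors: the actual I/O
print(len(splits), parts)
```

The point of the sketch: the driver never loads the data itself; it ships small split descriptions to the executors, and each executor pulls only its own range. That is (roughly) how the "slaves know what to pull" without the driver doing the bulk loading.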