A colleague and I were disagreeing about something in Spark/Mesos, and perhaps 
someone can shed some light on it.

We have a Mesos cluster that runs Spark via a sparkHome, rather than 
downloading an executable.

My colleague says that if we have parquet files in S3, each slave should know 
which partitions it is responsible for and pull only those partitions of 
parquet data from S3. This seems inherently wrong to me, as I have no idea how 
Spark or Mesos could know, on the slave, which partitions to pull. It makes 
much more sense to me for the partitioning to be done on the driver and then 
distributed to the slaves, so the slaves don’t necessarily have to worry about 
these details. If that were the case, some data loading would be done on the 
driver, correct? Or does Spark/Mesos do some magic to pass a reference so the 
slaves know what to pull?
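
To make the "pass a reference" possibility concrete, here is a toy Python 
sketch of what I have in mind. It is only an illustration under my own 
assumptions, not Spark or Mesos code: Split, plan_splits, read_split and the 
in-memory dict standing in for S3 are all hypothetical names.

```python
# Toy model only: none of these names are Spark or Mesos APIs.
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Split:
    """A lightweight reference to a slice of a file: coordinates, no data."""
    path: str
    start: int
    length: int


def plan_splits(file_sizes: Dict[str, int], split_size: int) -> List[Split]:
    """Driver side: uses only file sizes (metadata) to describe the splits."""
    splits = []
    for path, size in sorted(file_sizes.items()):
        for start in range(0, size, split_size):
            splits.append(Split(path, start, min(split_size, size - start)))
    return splits


def read_split(store: Dict[str, bytes], split: Split) -> bytes:
    """Executor side: fetches only the byte range named by its split."""
    return store[split.path][split.start:split.start + split.length]


# Hypothetical stand-in for an S3 bucket: object key -> contents.
store = {"part-0.parquet": b"a" * 10, "part-1.parquet": b"b" * 4}
sizes = {key: len(value) for key, value in store.items()}

splits = plan_splits(sizes, split_size=4)            # planned from metadata only
chunks = [read_split(store, s) for s in splits]      # each "executor" reads its slice
```

In this model the driver never touches the file contents, only their sizes, 
and each slave receives a small descriptor telling it what to pull.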

So, in summary: where do partitioning and data loading happen? On the driver 
or on the executor?

Thanks,
Steve