Venkat, correct, though to be clear, I'm referring to I/O related to loading/saving data from/to its persistence location, and not I/O related to local operations like RDD caching or shuffling.
Sent while mobile. Pls excuse typos etc.

On Apr 5, 2014 11:11 AM, "Venkat Krishnamurthy" <ven...@yarcdata.com> wrote:

> Christopher
>
> Just to clarify - by 'load ops' do you mean RDD actions that result in
> IO?
>
> Venkat
>
> From: Christopher Nguyen <c...@adatao.com>
> Reply-To: "user@spark.apache.org" <user@spark.apache.org>
> Date: Saturday, April 5, 2014 at 8:49 AM
> To: "user@spark.apache.org" <user@spark.apache.org>
> Subject: Re: Spark on other parallel filesystems
>
> Avati, depending on your specific deployment config, there can be up to
> a 10X difference in data loading time. For example, we routinely load
> 10+GB data files in parallel across small 8-node clusters in 10-20
> seconds, which would take about 100s if bottlenecked over a 1GigE
> network. That's roughly the maximum difference for that config. If you
> use multiple local SSDs, the difference can be correspondingly greater,
> and likewise about 10x smaller for 10GigE networks.
>
> Lastly, an interesting dimension to consider is that the difference
> diminishes as your data size gets much larger relative to your cluster
> size, since the load ops have to be serialized in time anyway.
>
> There is no difference after loading.
>
> Sent while mobile. Pls excuse typos etc.
>
> On Apr 4, 2014 10:45 PM, "Anand Avati" <av...@gluster.org> wrote:
>
>> On Fri, Apr 4, 2014 at 5:12 PM, Matei Zaharia
>> <matei.zaha...@gmail.com> wrote:
>>
>>> As long as the filesystem is mounted at the same path on every node,
>>> you should be able to just run Spark and use a file:// URL for your
>>> files.
>>>
>>> The only downside with running it this way is that Lustre won't
>>> expose data locality info to Spark the way HDFS does. That may not
>>> matter if it's a network-mounted file system, though.
>>
>> Is the locality-querying mechanism specific to HDFS, or is it possible
>> to implement plugins in Spark to query location in other ways on other
>> filesystems? I ask because glusterfs can expose the data location of a
>> file through virtual extended attributes, and I would be interested in
>> making Spark exploit that locality when the file location is specified
>> as glusterfs:// (or by querying the xattr blindly for file://). How
>> much of a difference does data locality make for Spark use cases
>> anyway, since most of the computation happens in memory? Any sort of
>> numbers?
>>
>> Thanks!
>> Avati
>>
>>> Matei
>>>
>>> On Apr 4, 2014, at 4:56 PM, Venkat Krishnamurthy
>>> <ven...@yarcdata.com> wrote:
>>>
>>> All
>>>
>>> Are there any drawbacks or technical challenges (or any information,
>>> really) related to using Spark directly on a global parallel
>>> filesystem like Lustre/GPFS?
>>>
>>> Any idea of what would be involved in doing a minimal proof of
>>> concept? Is it possible to just run Spark unmodified (without the
>>> HDFS substrate) for a start, or will that not work at all? I do know
>>> that it's possible to implement Tachyon on Lustre and get the HDFS
>>> interface - just looking at other options.
>>>
>>> Venkat
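
To make Matei's suggestion above concrete: with a parallel filesystem
mounted at the same path on every node, Spark needs no modification at
all. A minimal sketch - the mount point /mnt/lustre and the file name
are illustrative assumptions, not anything from the thread:

    import org.apache.spark.{SparkConf, SparkContext}

    // Point Spark at a shared POSIX mount via a file:// URL. The path
    // /mnt/lustre is an assumed mount point; it must be identical on
    // every node in the cluster.
    val conf = new SparkConf().setAppName("lustre-poc")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("file:///mnt/lustre/data/input.txt")
    println(lines.count()) // triggers the load; no HDFS substrate involved

The trade-off is exactly the one Matei names: with no block-location
metadata, every partition is scheduled with no locality preference.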
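
On Avati's locality question: Spark's HadoopRDD takes its preferred
locations from the Hadoop InputFormat's splits, and for file inputs
those come from FileSystem.getFileBlockLocations(). So one plugin point
is a Hadoop FileSystem subclass that answers that call from glusterfs's
virtual extended attribute. A rough, untested sketch in that direction;
the xattr name trusted.glusterfs.pathinfo is real glusterfs behavior,
but the class name, the regex for its output format, and shelling out
to getfattr are all illustrative assumptions:

    import org.apache.hadoop.fs.{BlockLocation, FileStatus, RawLocalFileSystem}
    import scala.sys.process._

    // Sketch: answer Spark's locality queries for files on a glusterfs
    // mount by parsing the virtual xattr trusted.glusterfs.pathinfo.
    class GlusterLocalFileSystem extends RawLocalFileSystem {

      // Assumed pathinfo format, e.g.
      //   (<DISTRIBUTE:vol-dht> <POSIX(/brick):server1:/brick/dir/file>)
      private val HostPattern = """POSIX\([^)]*\):([^:]+):""".r

      private def hostsFor(localPath: String): Array[String] =
        try {
          val out = Seq("getfattr", "--only-values",
                        "-n", "trusted.glusterfs.pathinfo", localPath).!!
          HostPattern.findAllMatchIn(out).map(_.group(1)).toArray.distinct
        } catch {
          case _: Exception => Array.empty[String] // fall back: no locality
        }

      override def getFileBlockLocations(file: FileStatus,
                                         start: Long,
                                         len: Long): Array[BlockLocation] = {
        val hosts = hostsFor(file.getPath.toUri.getPath)
        if (hosts.isEmpty) super.getFileBlockLocations(file, start, len)
        else Array(new BlockLocation(hosts.map(_ + ":0"), hosts, start, len))
      }
    }

Registered through Hadoop's fs.<scheme>.impl configuration, something
like this would let the scheduler prefer the bricks that actually hold
each file. Whether that pays off depends on network bandwidth, per
Christopher's numbers above.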