Venkat, correct, though to be clear, I'm referring to I/O related to loading/saving data from/to its persistence location, and not I/O related to local operations like RDD caching or shuffling.
Sent while mobile. Pls excuse typos etc.

On Apr 5, 2014 11:11 AM, "Venkat Krishnamurthy" <ven...@yarcdata.com> wrote:

> Christopher
>
> Just to clarify - by 'load ops' do you mean RDD actions that result in
> IO?
>
> Venkat
>
> From: Christopher Nguyen <c...@adatao.com>
> Reply-To: "user@spark.apache.org" <user@spark.apache.org>
> Date: Saturday, April 5, 2014 at 8:49 AM
> To: "user@spark.apache.org" <user@spark.apache.org>
> Subject: Re: Spark on other parallel filesystems
>
> Avati, depending on your specific deployment config, there can be up to
> a 10X difference in data loading time. For example, we routinely load
> 10+GB data files in parallel across small 8-node clusters in 10-20
> seconds, which would take about 100s if bottlenecked over a 1GigE
> network. That's roughly the maximum difference for that config. If you
> use multiple local SSDs, the difference can be correspondingly greater,
> and likewise about 10x smaller for 10GigE networks.
>
> Lastly, an interesting dimension to consider is that the difference
> diminishes as your data size gets much larger relative to your cluster
> size, since the load ops have to be serialized in time anyway.
>
> There is no difference after loading.
>
> Sent while mobile. Pls excuse typos etc.
>
> On Apr 4, 2014 10:45 PM, "Anand Avati" <av...@gluster.org> wrote:
>
>> On Fri, Apr 4, 2014 at 5:12 PM, Matei Zaharia
>> <matei.zaha...@gmail.com> wrote:
>>
>>> As long as the filesystem is mounted at the same path on every node,
>>> you should be able to just run Spark and use a file:// URL for your
>>> files.
>>>
>>> The only downside with running it this way is that Lustre won't
>>> expose data locality info to Spark the way HDFS does. That may not
>>> matter if it's a network-mounted file system, though.
>>
>> Is the locality-querying mechanism specific to HDFS, or is it possible
>> to implement plugins in Spark to query location in other ways on other
>> filesystems? I ask because glusterfs can expose the data location of a
>> file through virtual extended attributes, and I would be interested in
>> making Spark exploit that locality when the file location is specified
>> as glusterfs:// (or by querying the xattr blindly for file://). How
>> much of a difference does data locality make for Spark use cases
>> anyway, since most of the computation happens in memory? Any sort of
>> numbers?
>>
>> Thanks!
>> Avati
>>
>>> Matei
>>>
>>> On Apr 4, 2014, at 4:56 PM, Venkat Krishnamurthy
>>> <ven...@yarcdata.com> wrote:
>>>
>>> All
>>>
>>> Are there any drawbacks or technical challenges (or any information,
>>> really) related to using Spark directly on a global parallel
>>> filesystem like Lustre/GPFS?
>>>
>>> Any idea of what would be involved in doing a minimal proof of
>>> concept? Is it possible to just run Spark unmodified (without the
>>> HDFS substrate) for a start, or will that not work at all? I do know
>>> that it's possible to implement Tachyon on Lustre and get the HDFS
>>> interface - just looking at other options.
>>>
>>> Venkat
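
To make Matei's suggestion above concrete: with a parallel filesystem
mounted at the same path on every node, Spark needs no modification at
all. A minimal sketch - the mount point /mnt/lustre and the file name
are illustrative assumptions, not anything from the thread:

    import org.apache.spark.{SparkConf, SparkContext}

    // Point Spark at a shared POSIX mount via a file:// URL. The path
    // /mnt/lustre is an assumed mount point; it must be identical on
    // every node in the cluster.
    val conf = new SparkConf().setAppName("lustre-poc")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("file:///mnt/lustre/data/input.txt")
    println(lines.count()) // triggers the load; no HDFS substrate involved

The trade-off is exactly the one Matei names: with no block-location
metadata, every partition is scheduled with no locality preference.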
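
On Avati's locality question: Spark's HadoopRDD takes its preferred
locations from the Hadoop InputFormat's splits, and for file inputs
those come from FileSystem.getFileBlockLocations(). So one plugin point
is a Hadoop FileSystem subclass that answers that call from glusterfs's
virtual extended attribute. A rough, untested sketch in that direction;
the xattr name trusted.glusterfs.pathinfo is real glusterfs behavior,
but the class name, the regex for its output format, and shelling out
to getfattr are all illustrative assumptions:

    import org.apache.hadoop.fs.{BlockLocation, FileStatus, RawLocalFileSystem}
    import scala.sys.process._

    // Sketch: answer Spark's locality queries for files on a glusterfs
    // mount by parsing the virtual xattr trusted.glusterfs.pathinfo.
    class GlusterLocalFileSystem extends RawLocalFileSystem {

      // Assumed pathinfo format, e.g.
      //   (<DISTRIBUTE:vol-dht> <POSIX(/brick):server1:/brick/dir/file>)
      private val HostPattern = """POSIX\([^)]*\):([^:]+):""".r

      private def hostsFor(localPath: String): Array[String] =
        try {
          val out = Seq("getfattr", "--only-values",
                        "-n", "trusted.glusterfs.pathinfo", localPath).!!
          HostPattern.findAllMatchIn(out).map(_.group(1)).toArray.distinct
        } catch {
          case _: Exception => Array.empty[String] // fall back: no locality
        }

      override def getFileBlockLocations(file: FileStatus,
                                         start: Long,
                                         len: Long): Array[BlockLocation] = {
        val hosts = hostsFor(file.getPath.toUri.getPath)
        if (hosts.isEmpty) super.getFileBlockLocations(file, start, len)
        else Array(new BlockLocation(hosts.map(_ + ":0"), hosts, start, len))
      }
    }

Registered through Hadoop's fs.<scheme>.impl configuration, something
like this would let the scheduler prefer the bricks that actually hold
each file. Whether that pays off depends on network bandwidth, per
Christopher's numbers above.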