We run Spark (in standalone mode) on top of a network-mounted file system (NFS), rather than HDFS, and find that it works great. Setting this up required no modification or special configuration; as Matei says, we just point Spark at the data using its file location.
-- Jeremy

On Apr 4, 2014, at 8:12 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> As long as the filesystem is mounted at the same path on every node, you
> should be able to just run Spark and use a file:// URL for your files.
>
> The only downside with running it this way is that Lustre won’t expose data
> locality info to Spark, the way HDFS does. That may not matter if it’s a
> network-mounted file system though.
>
> Matei
>
> On Apr 4, 2014, at 4:56 PM, Venkat Krishnamurthy <ven...@yarcdata.com> wrote:
>
>> All
>>
>> Are there any drawbacks or technical challenges (or any information, really)
>> related to using Spark directly on a global parallel filesystem like
>> Lustre/GPFS?
>>
>> Any idea of what would be involved in doing a minimal proof of concept? Is
>> it just possible to run Spark unmodified (without the HDFS substrate) for a
>> start, or will that not work at all? I do know that it’s possible to
>> implement Tachyon on Lustre and get the HDFS interface – just looking at
>> other options.
>>
>> Venkat
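For anyone reading the archive: a minimal sketch of the proof of concept discussed above. It assumes a standalone cluster whose workers all mount the shared filesystem at the same path; the master host `master:7077`, the mount point `/mnt/lustre`, and the file names are hypothetical placeholders, not anything from this thread.

```scala
// Minimal Spark job reading from a shared POSIX filesystem (no HDFS).
// Works only if every node sees the filesystem at the same mount path.
import org.apache.spark.{SparkConf, SparkContext}

object SharedFsPoc {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("SharedFsPoc")
      .setMaster("spark://master:7077") // standalone master (hypothetical host)
    val sc = new SparkContext(conf)

    // Point Spark at the data with a file:// URL instead of hdfs://.
    val lines = sc.textFile("file:///mnt/lustre/data/input.txt")
    println(s"line count: ${lines.count()}")

    // Writing works the same way, as long as the output path is on the
    // shared mount so the driver and all executors can reach it.
    lines.filter(_.nonEmpty).saveAsTextFile("file:///mnt/lustre/data/output")

    sc.stop()
  }
}
```

As Matei notes, the trade-off is that Spark gets no data-locality hints this way, which typically matters little when the storage is network-mounted anyway.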