Alluxio lets applications share data through a file system API: the native
Java Alluxio client, the Hadoop FileSystem interface, or POSIX through FUSE.
If your MPI applications can use any of these interfaces, you should be
able to use Alluxio for data sharing out of the box.
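
To make the POSIX route concrete, here is a minimal sketch in Python. It
assumes Alluxio is already mounted through FUSE; the mount path, the
environment variable name, and the helper names are made up for illustration
and are not part of Alluxio. Once the mount exists, the applications only use
ordinary file I/O, so the same pattern should carry over to an MPI program in
any language.

```python
import os

# Hypothetical mount point: the real path depends on where Alluxio FUSE
# was mounted on your nodes. The environment variable name is also made up.
ALLUXIO_MOUNT = os.environ.get("ALLUXIO_FUSE_MOUNT", "/mnt/alluxio-fuse")


def write_shared(mount_dir, name, payload):
    """Producer side: write bytes to a file under the mount directory."""
    path = os.path.join(mount_dir, name)
    with open(path, "wb") as f:
        f.write(payload)
    return path


def read_shared(mount_dir, name):
    """Consumer side: read the same file back with plain file I/O."""
    with open(os.path.join(mount_dir, name), "rb") as f:
        return f.read()
```

One application would call something like write_shared(ALLUXIO_MOUNT,
"result.bin", data) and another would call read_shared(ALLUXIO_MOUNT,
"result.bin"); neither needs an Alluxio-specific library.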

As for duplicating in-memory data, you should need only one copy, the one in
Alluxio, provided you are able to stream your dataset. As for the performance
of backing your data with Alluxio compared to using Spark's native in-memory
representation, this blog post
<http://www.alluxio.com/2016/08/effective-spark-rdds-with-alluxio/>
details the pros and cons of the two approaches. At a high level, Alluxio
performs better with larger datasets or if you plan to use your dataset in
more than one Spark job.
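
On the single-copy point, one reading of "stream" is that the consumer reads
the file in bounded chunks instead of loading it whole, so the full dataset
lives only in Alluxio. A rough sketch of that idea follows; the function
names and the 1 MiB default chunk size are my own choices for illustration,
and the path would be a file under your Alluxio FUSE mount.

```python
def iter_chunks(path, chunk_size=1 << 20):
    """Yield the file's contents in pieces of at most chunk_size bytes."""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            yield chunk


def count_bytes(path, chunk_size=1 << 20):
    """Example consumer: holds at most one chunk in memory at a time."""
    return sum(len(chunk) for chunk in iter_chunks(path, chunk_size))
```

Any per-chunk processing can replace the byte count, as long as it does not
accumulate the chunks into a second full copy.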

Hope this helps,
