Ankur -
To answer your specific question re:
Q: Is an S3 path considered non-HDFS?
A: At this time, no; it uses the HDFS layer to resolve it (for better or worse).
---------------------------------------------------------------------
// Grab the resource using the hadoop client if it's one of the known schemes
// TODO(tarnfeld): This isn't very scalable with hadoop's pluggable
// filesystem implementations.
// TODO(matei): Enforce some size limits on files we get from HDFS
if (strings::startsWith(uri, "hdfs://") ||
    strings::startsWith(uri, "hftp://") ||
    strings::startsWith(uri, "s3://") ||
    strings::startsWith(uri, "s3n://")) {
  Try<string> base = os::basename(uri);
  if (base.isError()) {
    LOG(ERROR) << "Invalid basename for URI: " << base.error();
    return Error("Invalid basename for URI");
  }

  string path = path::join(directory, base.get());

  HDFS hdfs;

  LOG(INFO) << "Downloading resource from '" << uri
            << "' to '" << path << "'";

  Try<Nothing> result = hdfs.copyToLocal(uri, path);
  if (result.isError()) {
    LOG(ERROR) << "HDFS copyToLocal failed: " << result.error();
    return Error(result.error());
  }
---------------------------------------------------------------------
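
Re: your follow-up about credentials: the HDFS helper above essentially just shells out to the hadoop client (hadoop fs -copyToLocal <uri> <path>), so the S3 credentials come from whatever hadoop configuration is present on the slave, not from Mesos itself. As a rough sketch (property names are the stock Hadoop ones for the s3n filesystem; the values are obviously placeholders), the slave's core-site.xml would carry something like:
---------------------------------------------------------------------
<!-- core-site.xml on each slave, read by the hadoop client that the
     fetcher shells out to. Placeholder values for illustration only. -->
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_AWS_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
</property>
---------------------------------------------------------------------
For plain s3:// URIs the corresponding properties are fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey.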
----- Original Message -----
> From: "Ankur Chauhan" <[email protected]>
> To: [email protected]
> Sent: Tuesday, October 21, 2014 10:28:50 AM
> Subject: Re: Do i really need HDFS?
> This is what I also intend to do. Is a s3 path considered non-hdfs? If so,
> how does it know the credentials to use to fetch the file.
> Sent from my iPhone
> On Oct 21, 2014, at 5:16 AM, David Greenberg <[email protected]> wrote:
>
> > We use spark without HDFS--in our case, we just use ansible to copy the
> > spark executors onto all hosts at the same path. We also load and store
> > our spark data from non-HDFS sources.
> >
> > On Tue, Oct 21, 2014 at 4:57 AM, Dick Davies <[email protected]> wrote:
> >
> > > I think Spark needs a way to send jobs to/from the workers - the Spark
> > > distro itself will pull down the executor ok, but in my (very basic)
> > > tests I got stuck without HDFS.
> > >
> > > So basically it depends on the framework. I think in Sparks case they
> > > assume most users are migrating from an existing Hadoop deployment, so
> > > HDFS is sort of assumed.
> > >
> > > On 20 October 2014 23:18, CCAAT <[email protected]> wrote:
> > >
> > > > On 10/20/14 11:46, Steven Schlansker wrote:
> > > >
> > > >> We are running Mesos entirely without HDFS with no problems. We use
> > > >> Docker to distribute our application to slave nodes, and keep no
> > > >> state on individual nodes.
> > > >
> > > > Background: I'm building up a 3 node cluster to run mesos and spark.
> > > > No legacy Hadoop needed or wanted. I am using btrfs for the local
> > > > file system, with (2) drives set up for raid1 on each system.
> > > >
> > > > So you are suggesting that I can install mesos + spark + docker
> > > > and not a DFS on these (3) machines?
> > > >
> > > > Will I need any other softwares? My application is a geophysical
> > > > fluid simulator, so scala, R, and all sorts of advanced math will
> > > > be required on the cluster for the Finite Element Methods.
> > > >
> > > > James
--
Cheers,
Timothy St. Clair
Red Hat Inc.