Hi Ognen,

It’s true that the documentation partly targets Hadoop users, and that’s 
something we need to fix. Perhaps the best solution would be some kind of 
tutorial on “here’s how to set up Spark by hand on EC2”. However, it also 
sounds like you ran into some issues with S3 that would be good to report 
separately.

To answer the specific questions:

> For example, the thing supports using S3 to get files but when you actually 
> try to read a large file, it just sits there and sits there and eventually 
> comes back with an error that really does not tell me anything (so the task 
> was killed - why? there is nothing in the logs). So, do I actually need an 
> HDFS setup over S3 so it can support block access? Who knows, I can't find 
> anything.

This sounds like either a bug or the S3 library somehow requiring lots of 
memory to read a block. There isn’t a separate way to run HDFS over S3; Hadoop 
just has different implementations of “file systems”, one of which is S3. 
There’s a pointer to these S3 file system URIs at the bottom of 
http://spark.incubator.apache.org/docs/latest/ec2-scripts.html#accessing-data-in-s3
but it is indeed pretty hidden in the docs.
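
For reference, here’s a minimal sketch of reading a file from S3 in Scala. The 
bucket name, path, and credentials are placeholders; you can also supply the 
keys through the AWS environment variables mentioned on that page:

    import org.apache.spark.SparkContext

    val sc = new SparkContext("local[4]", "S3ReadExample")
    // Hadoop's native S3 file system handles s3n:// paths; no HDFS involved.
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
    val lines = sc.textFile("s3n://my-bucket/path/to/file.txt")
    println(lines.count())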

> Even basic questions I have to ask on this list - does Spark support parallel 
> reads from files in a shared filesystem? Someone answered - yes. Does this 
> extend to S3? Who knows? Nowhere to be found. Does it extend to S3 only if 
> used through HDFS? Who knows.

Everything in Hadoop and Spark is read in parallel, including S3.
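
To make that concrete: a file RDD is split into partitions, and each partition 
is read by a separate task. A rough sketch, reusing the sc from the earlier 
example (the path is again a placeholder):

    // Ask for at least 8 input splits; each split becomes one read task.
    val lines = sc.textFile("s3n://my-bucket/big-file.txt", 8)
    // The number of partitions is the number of parallel read tasks.
    println(lines.partitions.length)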

> Does Spark need a running Hadoop cluster to realize its full potential? Who 
> knows, it is not stated explicitly anywhere but any time I google stuff 
> people mention Hadoop.

Not unless you want to use HDFS.
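
For instance, this runs with no Hadoop services at all, just a plain local 
text file (the path is a placeholder):

    import org.apache.spark.SparkContext
    import org.apache.spark.SparkContext._

    val sc = new SparkContext("local[2]", "NoHadoopDemo")
    // Reads from the local file system; no HDFS or Hadoop cluster involved.
    val counts = sc.textFile("/tmp/data.txt")
                   .flatMap(_.split(" "))
                   .map(word => (word, 1))
                   .reduceByKey(_ + _)
    counts.take(10).foreach(println)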

> Can Spark do EVERYTHING in standalone mode? The documentation is not explicit 
> but it leads you to believe it can (or maybe I am overly optimistic?).

Yes, there’s no difference in what you can run on Spark across the different 
deployment modes. They’re just different ways to get tasks onto a cluster.
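
Concretely, the only thing that changes between modes is the master URL you 
pass when creating the SparkContext (the host names here are placeholders):

    import org.apache.spark.SparkContext

    // Local mode, 4 threads in one JVM:
    val sc = new SparkContext("local[4]", "MyApp")
    // Standalone cluster:
    //   val sc = new SparkContext("spark://master-host:7077", "MyApp")
    // Mesos:
    //   val sc = new SparkContext("mesos://master-host:5050", "MyApp")
    // The program itself is identical in every mode.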

Anyway, as I said, these are really good questions, since the docs kind of 
target a Hadoop audience. We can improve this both in the online docs and by 
having some kind of walk-through or tutorial. Do you have any suggestions on 
how you’d like the docs structured to cover this? E.g. should there be a 
separate section on S3, or on different input sources?

One final thing — as someone mentioned, using Spark’s EC2 scripts to launch a 
cluster is not a bad idea. We’ve supported those scripts pretty much since 
Spark was released and they do a lot of the configuration for you. You can even 
pause/restart the cluster if you want, etc.
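
For example, something like the following (the key pair and cluster names are 
placeholders; the ec2-scripts page linked above lists the full set of options):

    # Launch a cluster with 2 slaves:
    ./spark-ec2 -k my-keypair -i ~/my-keypair.pem -s 2 launch my-cluster
    # Pause it (stops the instances):
    ./spark-ec2 stop my-cluster
    # Bring it back later:
    ./spark-ec2 -i ~/my-keypair.pem start my-cluster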

Matei
