Hi Matei, thanks for replying!

On Mon, Jan 20, 2014 at 8:08 PM, Matei Zaharia <[email protected]> wrote:
> It’s true that the documentation is partly targeting Hadoop users, and
> that’s something we need to fix. Perhaps the best solution would be some
> kind of tutorial on “here’s how to set up Spark by hand on EC2”. However it
> also sounds like you ran into some issues with S3 that it would be good to
> report separately.
>
> To answer the specific questions:
>
> > For example, the thing supports using S3 to get files but when you
> > actually try to read a large file, it just sits there and sits there and
> > eventually comes back with an error that really does not tell me anything
> > (so the task was killed - why? there is nothing in the logs). So, do I
> > actually need an HDFS setup over S3 so it can support block access? Who
> > knows, I can't find anything.
>
> This sounds like either a bug or somehow the S3 library requiring lots of
> memory to read a block. There isn’t a separate way to run HDFS over S3.
> Hadoop just has different implementations of “file systems”, one of which
> is S3. There’s a pointer to these versions at the bottom of
> http://spark.incubator.apache.org/docs/latest/ec2-scripts.html#accessing-data-in-s3
> but it is indeed pretty hidden in the docs.

Hmmm. Maybe a bug then. If I read a small 600 byte file via the s3n:// URI,
it works on a Spark cluster. If I try a 20GB file it just sits and sits and
sits, frozen. Is there anything I can do to instrument this and figure out
what is going on?
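For reference, the test I am running is essentially the following, from the
Spark shell on the master (the bucket and file names here are made up, and I
am assuming the S3 credentials are already in the cluster's Hadoop
configuration):

    // reading a tiny (~600 byte) object via s3n:// works and returns right away
    val small = sc.textFile("s3n://my-bucket/small-600-byte-file.txt")
    small.count()

    // the same two lines against a ~20GB object just sit there; eventually
    // the tasks get killed with nothing useful in the logs
    val big = sc.textFile("s3n://my-bucket/20gb-file.log")
    big.count()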
> > Even basic questions I have to ask on this list - does Spark support
> > parallel reads from files in a shared filesystem? Someone answered - yes.
> > Does this extend to S3? Who knows? Nowhere to be found. Does it extend to
> > S3 only if used through HDFS? Who knows.
>
> Everything in Hadoop and Spark is read in parallel, including S3.

OK, good to know!

> > Does Spark need a running Hadoop cluster to realize its full potential?
> > Who knows, it is not stated explicitly anywhere but any time I google
> > stuff people mention Hadoop.
>
> Not unless you want to use HDFS.

Ahh, OK. I don't particularly want HDFS, but I suspect I will need it since
it seems to be the only "free" distributed parallel FS. I suspect running it
over EBS volumes is probably as slow as molasses though. Right now the
s3:// freezing bug is a show stopper for me, and I am considering putting
the ephemeral storage on all the nodes in the Spark cluster into some kind
of distributed file system like GPFS or Lustre or
https://code.google.com/p/mogilefs/ to provide a "shared" file system for
all the nodes. It is next to impossible to find online what the standard
practices in the industry are for this kind of setup, so I guess I am going
to set my own industry standards ;)

> Anyway, these are really good questions as I said, since the docs kind of
> target a Hadoop audience. We can improve these both in the online docs and
> by having some kind of walk-throughs or tutorial. Do you have any
> suggestions on how you’d like the docs structured to show this stuff? E.g.
> should there be a separate section on S3, or different input sources?

Not sure. For starters it would be nice to document the real use cases. I am
more than happy (and I think the people I work for are happy too) to
document the pipeline I am setting up. In the process I have found that the
industry is remarkably "tight-lipped" about how to do these things in
practice. For example, what if you want to expose a point on the internet
where you can send millions of data points into your "firehose"? What do you
use? How? I have people recommending Kafka, but even those people don't
exactly say HOW. I have gone the route of elastic load balancing with
autoscaling, exposing a bunch of mongrel2 instances running zeromq handlers
that ingest data and then bounce it into S3 for persistence and into a Spark
cluster for real-time analytics, but also for after-the-fact analytics.
While I have demonstrated the whole pipeline on a toy example, I am now
trying to test it in "real life" with historic data that we have from our
"previous" data provider - about 1-2 TB of data so far, in 20-30GB files.
Unfortunately I have not been able to get past the basic
f = textFile("s3://something"), f.count test on a 20GB file on Amazon S3. I
have a test cluster of about 16 m1.xlarge instances that is just sitting
there spinning :)

> One final thing -- as someone mentioned, using Spark’s EC2 scripts to
> launch a cluster is not a bad idea. We’ve supported those scripts pretty
> much since Spark was released and they do a lot of the configuration for
> you. You can even pause/restart the cluster if you want, etc.

Yes, but things get complicated in people's setups. I run mine in a VPC that
exposes only one point of entry - the elastic load balancer that takes the
traffic from the "outside" and sends it to the "inside" of the VPC, where
the analytics/Spark live. I imagine this would be a common use scenario for
a company that has millions of devices hitting their data entry point(s) and
where the data is important in terms of privacy, for example - the VPC
offers much more than EC2 with security groups (and is easier to manage).

Thanks!
Ognen
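P.S. On the docs question: one concrete thing that would have helped me is a
short section on pointing a hand-built (non spark-ec2) cluster at S3 in the
first place. What I have been doing is roughly the following sketch (I am not
sure it is the canonical way; the property names are the standard Hadoop s3n
ones, and the values are obviously placeholders):

    // set the S3 credentials on the SparkContext's Hadoop configuration
    // before creating any RDDs; the same two properties can alternatively
    // go into core-site.xml on every node
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
    val f = sc.textFile("s3n://my-bucket/some-file")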
