Hi Matei, thanks for replying!

On Mon, Jan 20, 2014 at 8:08 PM, Matei Zaharia <[email protected]> wrote:
> It’s true that the documentation is partly targeting Hadoop users, and
> that’s something we need to fix. Perhaps the best solution would be some
> kind of tutorial on “here’s how to set up Spark by hand on EC2”. However it
> also sounds like you ran into some issues with S3 that it would be good to
> report separately.
>
> To answer the specific questions:
>
> > For example, the thing supports using S3 to get files but when you
> > actually try to read a large file, it just sits there and sits there and
> > eventually comes back with an error that really does not tell me anything
> > (so the task was killed - why? there is nothing in the logs). So, do I
> > actually need an HDFS setup over S3 so it can support block access? Who
> > knows, I can't find anything.
>
> This sounds like either a bug or somehow the S3 library requiring lots of
> memory to read a block. There isn’t a separate way to run HDFS over S3.
> Hadoop just has different implementations of “file systems”, one of which
> is S3. There’s a pointer to these versions at the bottom of
> http://spark.incubator.apache.org/docs/latest/ec2-scripts.html#accessing-data-in-s3
> but it is indeed pretty hidden in the docs.

Hmmm. Maybe a bug then. If I read a small 600 byte file via the s3n:// URI,
it works on a Spark cluster. If I try a 20GB file it just sits and sits and
sits, frozen. Is there anything I can do to instrument this and figure out
what is going on?
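For reference, the test I am running is essentially the following, from the
Spark shell on the master (the bucket and file names here are made up, and I
am assuming the S3 credentials are already in the cluster's Hadoop
configuration):

    // reading a tiny (~600 byte) object via s3n:// works and returns right away
    val small = sc.textFile("s3n://my-bucket/small-600-byte-file.txt")
    small.count()

    // the same two lines against a ~20GB object just sit there; eventually
    // the tasks get killed with nothing useful in the logs
    val big = sc.textFile("s3n://my-bucket/20gb-file.log")
    big.count()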
> > Even basic questions I have to ask on this list - does Spark support
> > parallel reads from files in a shared filesystem? Someone answered - yes.
> > Does this extend to S3? Who knows? Nowhere to be found. Does it extend to
> > S3 only if used through HDFS? Who knows.
>
> Everything in Hadoop and Spark is read in parallel, including S3.

OK, good to know!

> > Does Spark need a running Hadoop cluster to realize its full potential?
> > Who knows, it is not stated explicitly anywhere but any time I google
> > stuff people mention Hadoop.
>
> Not unless you want to use HDFS.

Ahh, OK. I don't particularly want HDFS, but I suspect I will need it since
it seems to be the only "free" distributed parallel FS. I suspect running it
over EBS volumes is probably as slow as molasses though. Right now the
s3:// freezing bug is a show stopper for me, and I am considering putting
the ephemeral storage on all the nodes in the Spark cluster into some kind
of distributed file system like GPFS or Lustre or
https://code.google.com/p/mogilefs/ to provide a "shared" file system for
all the nodes. It is next to impossible to find online what the standard
practices in the industry are for this kind of setup, so I guess I am going
to set my own industry standards ;)

> Anyway, these are really good questions as I said, since the docs kind of
> target a Hadoop audience. We can improve these both in the online docs and
> by having some kind of walk-throughs or tutorial. Do you have any
> suggestions on how you’d like the docs structured to show this stuff? E.g.
> should there be a separate section on S3, or different input sources?

Not sure. For starters it would be nice to document the real use cases. I am
more than happy (and I think the people I work for are happy too) to
document the pipeline I am setting up. In the process I have found that the
industry is remarkably "tight-lipped" about how to do these things in
practice. For example, what if you want to expose a point on the internet
where you can send millions of data points into your "firehose"? What do you
use? How? I have people recommending Kafka, but even those people don't
exactly say HOW. I have gone the route of elastic load balancing with
autoscaling, exposing a bunch of mongrel2 instances running zeromq handlers
that ingest data and then bounce it into S3 for persistence and into a Spark
cluster for real-time analytics, but also for after-the-fact analytics.
While I have demonstrated the whole pipeline on a toy example, I am now
trying to test it in "real life" with historic data that we have from our
"previous" data provider - about 1-2 TB of data so far, in 20-30GB files.
Unfortunately I have not been able to get past the basic
f = textFile("s3://something"), f.count test on a 20GB file on Amazon S3. I
have a test cluster of about 16 m1.xlarge instances that is just sitting
there spinning :)

> One final thing -- as someone mentioned, using Spark’s EC2 scripts to
> launch a cluster is not a bad idea. We’ve supported those scripts pretty
> much since Spark was released and they do a lot of the configuration for
> you. You can even pause/restart the cluster if you want, etc.

Yes, but things get complicated in people's setups. I run mine in a VPC that
exposes only one point of entry - the elastic load balancer that takes the
traffic from the "outside" and sends it to the "inside" of the VPC, where
the analytics/Spark live. I imagine this would be a common use scenario for
a company that has millions of devices hitting their data entry point(s) and
where the data is important in terms of privacy, for example - the VPC
offers much more than EC2 with security groups (and is easier to manage).

Thanks!
Ognen
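P.S. On the docs question: one concrete thing that would have helped me is a
short section on pointing a hand-built (non spark-ec2) cluster at S3 in the
first place. What I have been doing is roughly the following sketch (I am not
sure it is the canonical way; the property names are the standard Hadoop s3n
ones, and the values are obviously placeholders):

    // set the S3 credentials on the SparkContext's Hadoop configuration
    // before creating any RDDs; the same two properties can alternatively
    // go into core-site.xml on every node
    sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
    sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")
    val f = sc.textFile("s3n://my-bucket/some-file")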
