I have to second this. Spark's documentation makes a lot of non-obvious assumptions. On top of that, when you ask a question on the mailing list, the developers often refer you right back to that same documentation.
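For what it's worth, the S3 question below cost me time too. Here is a rough sketch of how a plain textFile read over s3n:// is supposed to work without a running HDFS cluster, going through the Hadoop filesystem client that ships with Spark's Hadoop dependency. This is not authoritative; the master URL, bucket, path, and credential values are placeholders you would swap for your own:

    import org.apache.spark.SparkContext

    object S3ReadSketch {
      def main(args: Array[String]): Unit = {
        // "local[4]" runs Spark locally with 4 threads; for a standalone
        // cluster you would use its master URL (e.g. spark://host:7077).
        val sc = new SparkContext("local[4]", "S3ReadSketch")

        // AWS credentials go into the Hadoop configuration; these are the
        // standard keys for the s3n filesystem. No HDFS daemon is needed.
        sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY")
        sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY")

        // textFile returns an RDD whose partitions the workers read in parallel.
        val lines = sc.textFile("s3n://your-bucket/path/to/large-file.txt")
        println("Line count: " + lines.count())

        sc.stop()
      }
    }

That, at least, is how I read the docs; whether it holds up for very large objects, or whether you really need HDFS layered over S3 for block access, is exactly the kind of thing the documentation never spells out.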
On Sun, Jan 19, 2014 at 12:52 PM, Ognen Duzlevski <[email protected]> wrote:

> Hello,
>
> I have been trying to set up a running Spark cluster for a while now.
> Being new to all this, I have tried to rely on the documentation, but I
> find it sorely lacking on a few fronts.
>
> For example, I think it makes a number of built-in assumptions about a
> person's knowledge of Hadoop or Mesos. I have been using and programming
> computers for almost two decades, so I don't think I am a total idiot when
> it comes to these things, yet I am left staring at the console wondering
> what the hell is going on.
>
> For example, Spark supports using S3 to get files, but when you actually
> try to read a large file, it just sits there and sits there and eventually
> comes back with an error that really does not tell me anything (so the
> task was killed - why? There is nothing in the logs). So, do I actually
> need an HDFS setup over S3 so it can support block access? Who knows; I
> can't find anything.
>
> Even basic questions I have to ask on this list - does Spark support
> parallel reads from files in a shared filesystem? Someone answered: yes.
> Does this extend to S3? Who knows? It is nowhere to be found. Does it
> extend to S3 only if used through HDFS? Who knows.
>
> Does Spark need a running Hadoop cluster to realize its full potential?
> Who knows; it is not stated explicitly anywhere, but any time I google
> this stuff, people mention Hadoop.
>
> Can Spark do EVERYTHING in standalone mode? The documentation is not
> explicit, but it leads you to believe it can (or maybe I am overly
> optimistic?).
>
> So what does one do when they have a problem? How do they instrument stuff?
>
> I do not want to just rant - I am willing to put work into writing proper
> documentation for something that is advertised to work but in practice
> ends up costing you weeks of hunting for crap left and right and feeling
> lost. I am going through this process and would be happy to document the
> whole story of setting up a data analysis pipeline: from aggregating data
> via HTTPS exposed over an ELB, to sending it to a Spark cluster via
> ZeroMQ collectors, to the actual Spark cluster setup, to... Is anyone
> willing to help answer my questions so we can all benefit from this
> "hair-greying" experience? ;)
>
> Thanks!
> Ognen
