Did you try it with a smaller subset of the data first?

On Jan 23, 2015, at 05:54, "Kane Kim" <kane.ist...@gmail.com> wrote:
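To make the "smaller subset" suggestion concrete: since the job is just map/filter/reduceByKey, its logic can be sanity-checked locally on a tiny sample before touching the full 5TB. A rough pure-Python sketch of that pipeline (plain Python, not the Spark API; the word-count-style transform and the filter predicate are stand-ins for whatever the real job does):

```python
from collections import defaultdict

# Stand-in records; in practice this would be a small sample of the real
# input, e.g. one file or one partition out of the 5TB dataset.
records = ["a b a", "b c", "a", "", "c c c"]

# map: emit (key, 1) pairs, word-count style
mapped = [(w, 1) for line in records for w in line.split()]

# filter: drop keys we don't care about (hypothetical predicate)
filtered = [(k, v) for (k, v) in mapped if k != "c"]

# reduceByKey: sum values per key, as Spark's reduceByKey(add) would
reduced = defaultdict(int)
for k, v in filtered:
    reduced[k] += v

print(dict(reduced))
```

In Spark itself the equivalent quick check is to run the same job over `sc.textFile(path).sample(False, 0.01)` (or a single input file) so that logic bugs surface in minutes rather than hours.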
> I'm trying to process 5TB of data, not doing anything fancy, just
> map/filter and reduceByKey. I spent the whole day today trying to get it
> processed, but never succeeded. I tried to deploy to EC2 with the script
> provided with Spark, on pretty beefy machines (100 r3.2xlarge nodes).
> I'm really frustrated that Spark doesn't work out of the box for
> anything bigger than the word-count sample. One big problem is that the
> defaults are not suitable for processing big datasets; the provided EC2
> script could do a better job, since it knows the instance type
> requested. Second, it takes hours to figure out what is wrong when a
> Spark job fails after almost finishing processing. Even after raising
> all the limits as per https://spark.apache.org/docs/latest/tuning.html
> it still fails (now with: error communicating with MapOutputTracker).
>
> After all this, I have only one question: how do I tune Spark for
> processing terabytes of data, and is there a way to make this
> configuration easier and more transparent?
>
> Thanks.
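On the MapOutputTracker error specifically: on a shuffle this large it is often a symptom of driver/executor communication timing out, or of the map output status message exceeding the default Akka frame size. Settings along these lines in spark-defaults.conf are the usual starting point (Spark 1.x-era keys; the values are illustrative guesses for r3.2xlarge nodes, not recommendations):

```
# spark-defaults.conf -- illustrative values only, tune for your cluster
spark.executor.memory                   48g    # r3.2xlarge has ~61 GB RAM
spark.akka.frameSize                    128    # MB; map output statuses can blow past the 10 MB default on huge shuffles
spark.akka.askTimeout                   120    # seconds; the default can be too short under GC pressure
spark.core.connection.ack.wait.timeout  600    # seconds; avoids acks timing out during long GC pauses
spark.shuffle.consolidateFiles          true   # fewer shuffle files when there are many map tasks
```

Raising the number of partitions on the reduceByKey (so each task's shuffle block stays small) tends to matter as much as any of these knobs.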