Re: processing large dataset
This is kind of a how-long-is-a-piece-of-string question; there is no one tuning for "terabytes of data". You can easily run a Spark job that processes hundreds of terabytes with the defaults and no problem, if it's doing something trivial like counting. You can also write Spark jobs that will never complete, like trying to pull the entire data set into one worker. You haven't said exactly what you're doing, although it sounds simple, and you haven't said how it fails. Is it out of memory? That would be essential to know before anyone can say what, if anything, you need to change in your program or cluster. A sketch contrasting the two shapes is below the quoted message.

On Fri, Jan 23, 2015 at 4:52 AM, Kane Kim wrote:
> I'm trying to process 5TB of data, not doing anything fancy, just
> map/filter and reduceByKey. I spent the whole day today trying to get it
> processed, but never succeeded. I've tried to deploy to EC2 with the
> script provided with Spark, on pretty beefy machines (100 r3.2xlarge
> nodes). I'm really frustrated that Spark doesn't work out of the box for
> anything bigger than the word count sample. One big problem is that the
> defaults are not suitable for processing big datasets; the provided EC2
> script could do a better job, knowing the instance type requested. Second,
> it takes hours to figure out what is wrong when a Spark job fails
> after having almost finished processing. Even after raising all the limits per
> https://spark.apache.org/docs/latest/tuning.html it still fails (now
> with: error communicating with MapOutputTracker).
>
> After all this I have only one question: how do I get Spark tuned for
> processing terabytes of data, and is there a way to make this
> configuration easier and more transparent?
>
> Thanks.
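To make the distinction concrete, here is an untested spark-shell sketch ("sc" is the SparkContext the shell provides; the paths, the tab-delimited format, and the field positions are made up for illustration):

    // Good shape: map/filter/reduceByKey all stay distributed, so this
    // can scale to terabytes even with default settings.
    val counts = sc.textFile("hdfs:///data/input")
      .map(_.split("\t"))
      .filter(_.length >= 2)
      .map(fields => (fields(0), 1L))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs:///data/counts")

    // Bad shape: funneling every record under a single key forces one task
    // to buffer the entire data set's values. This will never finish at 5TB.
    // val broken = sc.textFile("hdfs:///data/input")
    //   .map(line => ("all", line))
    //   .groupByKey()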
Re: processing large dataset
Often when this happens to me, it is actually an exception thrown while parsing a few malformed records. It's easy to miss, because the error messages aren't always informative. I was once blaming Spark when in reality the problem was missing fields in a CSV file. As has been said, make a file with a few records and see if your job works on that; a sketch of how to surface bad records follows below.

On Thursday, January 22, 2015, Jörn Franke wrote:
> Did you try it with a smaller subset of the data first?

--
Russell Jurney
twitter.com/rjurney
russell.jur...@gmail.com
datasyndrome.com
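A cheap way to surface bad records like that (an untested spark-shell sketch; the path, the comma delimiter, and the three-field layout are assumptions to adapt to your data):

    val lines = sc.textFile("hdfs:///data/input.csv")
    // Split with limit -1 so trailing empty fields are kept.
    val fields = lines.map(_.split(",", -1))
    val good = fields.filter(_.length == 3)
    val bad  = fields.filter(_.length != 3)
    // Count the offenders and inspect a handful, instead of letting one of
    // them blow up a job that is almost done.
    println("bad records: " + bad.count())
    bad.take(5).foreach(arr => println(arr.mkString(",")))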
Re: processing large dataset
Did you try it with a smaller subset of the data first? One cheap way to do that is sketched below.
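An untested spark-shell sketch (the path, the 0.1% fraction, and the tab-split pipeline stand in for your real job):

    // Run the exact same pipeline over a small random sample first.
    // If it fails here too, the problem is the code, not the scale.
    val full  = sc.textFile("hdfs:///data/input")
    val small = full.sample(withReplacement = false, fraction = 0.001, seed = 42L)
    small.map(line => (line.split("\t")(0), 1L))
      .reduceByKey(_ + _)
      .take(10)
      .foreach(println)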