Did you try it with a smaller subset of the data first?

On Jan 23, 2015, at 05:54, "Kane Kim" <kane.ist...@gmail.com> wrote:
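To make the "smaller subset" suggestion concrete: since the job is just map/filter/reduceByKey, its logic can be sanity-checked locally on a tiny sample before touching the full 5TB. A rough pure-Python sketch of that pipeline (plain Python, not the Spark API; the word-count-style transform and the filter predicate are stand-ins for whatever the real job does):

```python
from collections import defaultdict

# Stand-in records; in practice this would be a small sample of the real
# input, e.g. one file or one partition out of the 5TB dataset.
records = ["a b a", "b c", "a", "", "c c c"]

# map: emit (key, 1) pairs, word-count style
mapped = [(w, 1) for line in records for w in line.split()]

# filter: drop keys we don't care about (hypothetical predicate)
filtered = [(k, v) for (k, v) in mapped if k != "c"]

# reduceByKey: sum values per key, as Spark's reduceByKey(add) would
reduced = defaultdict(int)
for k, v in filtered:
    reduced[k] += v

print(dict(reduced))
```

In Spark itself the equivalent quick check is to run the same job over `sc.textFile(path).sample(False, 0.01)` (or a single input file) so that logic bugs surface in minutes rather than hours.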
> I'm trying to process 5TB of data, not doing anything fancy, just
> map/filter and reduceByKey. I spent the whole day today trying to get it
> processed, but never succeeded. I tried to deploy to EC2 with the script
> provided with Spark, on pretty beefy machines (100 r3.2xlarge nodes).
> I'm really frustrated that Spark doesn't work out of the box for
> anything bigger than the word-count sample. One big problem is that the
> defaults are not suitable for processing big datasets; the provided EC2
> script could do a better job, since it knows the instance type
> requested. Second, it takes hours to figure out what is wrong when a
> Spark job fails after almost finishing processing. Even after raising
> all the limits as per https://spark.apache.org/docs/latest/tuning.html
> it still fails (now with: error communicating with MapOutputTracker).
>
> After all this, I have only one question: how do I tune Spark for
> processing terabytes of data, and is there a way to make this
> configuration easier and more transparent?
>
> Thanks.
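On the MapOutputTracker error specifically: on a shuffle this large it is often a symptom of driver/executor communication timing out, or of the map output status message exceeding the default Akka frame size. Settings along these lines in spark-defaults.conf are the usual starting point (Spark 1.x-era keys; the values are illustrative guesses for r3.2xlarge nodes, not recommendations):

```
# spark-defaults.conf -- illustrative values only, tune for your cluster
spark.executor.memory                   48g    # r3.2xlarge has ~61 GB RAM
spark.akka.frameSize                    128    # MB; map output statuses can blow past the 10 MB default on huge shuffles
spark.akka.askTimeout                   120    # seconds; the default can be too short under GC pressure
spark.core.connection.ack.wait.timeout  600    # seconds; avoids acks timing out during long GC pauses
spark.shuffle.consolidateFiles          true   # fewer shuffle files when there are many map tasks
```

Raising the number of partitions on the reduceByKey (so each task's shuffle block stays small) tends to matter as much as any of these knobs.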