Re: processing large dataset
This is kind of a how-long-is-a-piece-of-string question; there is no one tuning for "terabytes of data". You can easily run a Spark job that processes hundreds of terabytes with the defaults and no problem, if it's doing something trivial like counting. You can also write Spark jobs that will never complete, like trying to pull the entire data set into one worker. You haven't said exactly what you're doing, although it sounds simple, and you haven't said how it fails. Is it out of memory? That would be essential to know before anyone can say what, if anything, you need to change in your program or cluster. A sketch contrasting the two shapes is below the quoted message.

On Fri, Jan 23, 2015 at 4:52 AM, Kane Kim wrote:
> I'm trying to process 5TB of data, not doing anything fancy, just
> map/filter and reduceByKey. I spent the whole day today trying to get it
> processed, but never succeeded. I've tried to deploy to EC2 with the
> script provided with Spark, on pretty beefy machines (100 r3.2xlarge
> nodes). I'm really frustrated that Spark doesn't work out of the box for
> anything bigger than the word count sample. One big problem is that the
> defaults are not suitable for processing big datasets; the provided EC2
> script could do a better job, knowing the instance type requested. Second,
> it takes hours to figure out what is wrong when a Spark job fails
> after having almost finished processing. Even after raising all the limits per
> https://spark.apache.org/docs/latest/tuning.html it still fails (now
> with: error communicating with MapOutputTracker).
>
> After all this I have only one question: how do I get Spark tuned for
> processing terabytes of data, and is there a way to make this
> configuration easier and more transparent?
>
> Thanks.
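To make the distinction concrete, here is an untested spark-shell sketch ("sc" is the SparkContext the shell provides; the paths, the tab-delimited format, and the field positions are made up for illustration):

    // Good shape: map/filter/reduceByKey all stay distributed, so this
    // can scale to terabytes even with default settings.
    val counts = sc.textFile("hdfs:///data/input")
      .map(_.split("\t"))
      .filter(_.length >= 2)
      .map(fields => (fields(0), 1L))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs:///data/counts")

    // Bad shape: funneling every record under a single key forces one task
    // to buffer the entire data set's values. This will never finish at 5TB.
    // val broken = sc.textFile("hdfs:///data/input")
    //   .map(line => ("all", line))
    //   .groupByKey()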
Re: processing large dataset
Often when this happens to me, it is actually an exception thrown while parsing a few malformed records. It's easy to miss, because the error messages aren't always informative. I was once blaming Spark when in reality the problem was missing fields in a CSV file. As has been said, make a file with a few records and see if your job works on that; a sketch of how to surface bad records follows below.

On Thursday, January 22, 2015, Jörn Franke wrote:
> Did you try it with a smaller subset of the data first?

--
Russell Jurney
twitter.com/rjurney
russell.jur...@gmail.com
datasyndrome.com
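A cheap way to surface bad records like that (an untested spark-shell sketch; the path, the comma delimiter, and the three-field layout are assumptions to adapt to your data):

    val lines = sc.textFile("hdfs:///data/input.csv")
    // Split with limit -1 so trailing empty fields are kept.
    val fields = lines.map(_.split(",", -1))
    val good = fields.filter(_.length == 3)
    val bad  = fields.filter(_.length != 3)
    // Count the offenders and inspect a handful, instead of letting one of
    // them blow up a job that is almost done.
    println("bad records: " + bad.count())
    bad.take(5).foreach(arr => println(arr.mkString(",")))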
Re: processing large dataset
Did you try it with a smaller subset of the data first? One cheap way to do that is sketched below.
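An untested spark-shell sketch (the path, the 0.1% fraction, and the tab-split pipeline stand in for your real job):

    // Run the exact same pipeline over a small random sample first.
    // If it fails here too, the problem is the code, not the scale.
    val full  = sc.textFile("hdfs:///data/input")
    val small = full.sample(withReplacement = false, fraction = 0.001, seed = 42L)
    small.map(line => (line.split("\t")(0), 1L))
      .reduceByKey(_ + _)
      .take(10)
      .foreach(println)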