Re: join operation is taking too much time

2014-06-18 Thread MEETHU MATHEW
Hi,
Thanks Andrew and Daniel for the responses.

Setting spark.shuffle.spill to false didn't make any difference: the 5-day run completed in 6 minutes, and the 10-day run was stuck after around 1 hour.


Daniel, in my current use case I can't read all the files into a single RDD. But I have another use case where I did it that way, i.e. I read all the files into a single RDD and joined it with the RDD of 9 million rows, and it worked fine and took only 3 minutes.
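
For reference, a minimal sketch of that single-RDD variant, assuming a spark-shell sc, files: Seq[String] naming the 30 files, base for the 9-million-row pair RDD, and a "key,value" line format (the actual code was not posted):

import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x

def parse(line: String): (String, String) = {
  val Array(k, v) = line.split(",", 2)   // assumed "key,value" layout
  (k, v)
}

val all    = sc.union(files.map(sc.textFile(_))).map(parse)
val joined = base.leftOuterJoin(all)     // one join instead of 30
joined.count()
// Note: one join over the union is not equivalent to 30 sequential joins
// when a key appears in more than one file, so check the semantics first.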
 
Thanks & Regards,
Meethu M


join operation is taking too much time

2014-06-17 Thread MEETHU MATHEW


Hi all,

I want to do a recursive leftOuterJoin between an RDD (created from a file) with 9 million rows (the file is 100MB) and 30 other RDDs (created from 30 different files, one in each iteration of a loop) varying from 1 to 6 million rows.
When I run it for 5 RDDs, it runs successfully in 5 minutes. But when I increase it to 10 or 30 RDDs, it gradually slows down and finally gets stuck without showing any warning or error.
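
For concreteness, a minimal sketch of the loop being described; the file names, line format, and the way joined values are merged are all assumptions, since the actual code was not posted:

// Sketch only: assumes a spark-shell style sc and "key,value" lines.
import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x
import org.apache.spark.rdd.RDD

def keyed(path: String): RDD[(String, String)] =
  sc.textFile(path).map { line =>
    val Array(k, v) = line.split(",", 2)        // assumed line layout
    (k, v)
  }

val files = (1 to 30).map(i => s"part_$i.csv")  // placeholder names

var result = keyed("base.csv")                  // the ~9-million-row RDD
for (f <- files) {
  result = result
    .leftOuterJoin(keyed(f))                    // lineage grows every pass
    .mapValues { case (v, opt) => v + "," + opt.getOrElse("") }
}
result.count()                                  // forces the whole chain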

I am running in standalone mode with 2 workers of 4GB each and a total of 16 cores.

Are any of you facing similar problems with join, or is it a problem with my configuration?

Thanks & Regards,
Meethu M

Re: join operation is taking too much time

2014-06-17 Thread Andrew Or
How long does it get stuck for? This is a common sign of the OS thrashing due to running out of memory. If you keep it running longer, does it throw an error?

Depending on how large your other RDD is (and your join operation), memory pressure may or may not be the problem at all. It could be that spilling your shuffles to disk is slowing you down (but that probably shouldn't hang your application). For the 5-RDD case, what happens if you set spark.shuffle.spill to false?
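
One way to try that in Spark 1.x, where spark.shuffle.spill defaults to true (the app name is a placeholder; the property can also go through spark-submit's --conf flag on versions that support it):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("join-test")               // placeholder name
  .set("spark.shuffle.spill", "false")   // keep shuffle data in memory
val sc = new SparkContext(conf)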




Re: join operation is taking too much time

2014-06-17 Thread Daniel Darabos
I've been wondering about this. Is there a difference in performance
between these two?

val rdd1 = sc.textFile(files.mkString(","))
val rdd2 = sc.union(files.map(sc.textFile(_)))

I don't know about your use-case, Meethu, but it may be worth trying to see
if reading all the files into one RDD (like rdd1) would perform better in
the join. (If this is possible in your situation.)
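
A rough way to check, assuming files: Seq[String] and using count() to force a full read; a single run is not a rigorous benchmark, since OS caches will skew it:

def time[A](label: String)(body: => A): A = {
  val t0 = System.nanoTime()
  val result = body
  println(s"$label took ${(System.nanoTime() - t0) / 1e9} s")
  result
}

time("mkString variant") { sc.textFile(files.mkString(",")).count() }
time("union variant")    { sc.union(files.map(sc.textFile(_))).count() }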


