Re: number limit of map for spark
Thanks a lot for Zhan's comment, it really offered much help.

On Tuesday, December 22, 2015 5:11 AM, Zhan Zhang wrote:

What I mean is to combine multiple map functions into one. I don't know exactly how your algorithm works. Does one iteration's result depend on the last iteration? If so, how does it depend on it? I think either you can optimize your implementation, or Spark is not the right tool for your specific application.

Thanks.
Zhan Zhang

On Dec 21, 2015, at 10:43 AM, Zhiliang Zhu wrote:

What is the difference between repartition / collect and collapse ... Is collapse as costly as collect or repartition?

Thanks in advance ~

On Tuesday, December 22, 2015 2:24 AM, Zhan Zhang wrote:

In what situation do you have such cases? If there is no shuffle, you can collapse all these functions into one, right? In the meantime, it is not recommended to collect all data to the driver.

Thanks.
Zhan Zhang

On Dec 21, 2015, at 3:44 AM, Zhiliang Zhu wrote:

Dear All,

I need to iterate some job / RDD quite a lot of times, but I am stuck on the problem that Spark only accepts around 350 successive map calls before it meets one action function; besides, dozens of actions will obviously increase the run time. Is there any proper way ...

As tested, there is a piece of code as follows:
..
int count = 0;
JavaRDD<Integer> dataSet = jsc.parallelize(list, 1).cache();  // with only 1 partition
int m = 350;
JavaRDD<Integer> r = dataSet.cache();
JavaRDD<Integer> t = null;

// outer loop: carry the RDD r over as t between rounds
for (int j = 0; j < m; ++j) {
    if (null != t) {
        r = t;
    }
    // inner loop: call map 350 times; if m is much more than 350 (for instance,
    // around 400), the job throws:
    // "15/12/21 19:36:17 ERROR yarn.ApplicationMaster: User class threw exception:
    //  java.lang.StackOverflowError"
    for (int i = 0; i < m; ++i) {
        r = r.map(new Function<Integer, Integer>() {
            @Override
            public Integer call(Integer integer) {
                double x = Math.random() * 2 - 1;
                double y = Math.random() * 2 - 1;
                return (x * x + y * y < 1) ? 1 : 0;
            }
        });
    }

    // collect this RDD to build another one; however, dozens of action functions
    // such as collect are VERY costly
    List<Integer> lt = r.collect();
    t = jsc.parallelize(lt, 1).cache();
}
..

Thanks very much in advance!
Zhiliang
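Zhan's suggestion above, combining multiple map functions into one, can be sketched without Spark at all. Rather than chaining m separate map transformations, each of which adds a layer to the RDD lineage, a single function can apply the per-element step m times internally; the per-element result is identical. A minimal sketch in plain Java, where the `step` operator is a hypothetical stand-in for the map body:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.IntUnaryOperator;
import java.util.stream.Collectors;

public class FusedMap {
    // One function that applies the per-iteration step m times,
    // instead of m chained map() transformations.
    static int applyNTimes(IntUnaryOperator step, int m, int x) {
        for (int i = 0; i < m; i++) {
            x = step.applyAsInt(x);
        }
        return x;
    }

    public static void main(String[] args) {
        IntUnaryOperator step = v -> v + 1;       // stand-in for the map body
        List<Integer> data = Arrays.asList(1, 2, 3);

        // Equivalent of: r = r.map(step) repeated 350 times, then collect()
        List<Integer> result = data.stream()
                .map(v -> applyNTimes(step, 350, v))
                .collect(Collectors.toList());

        System.out.println(result);               // [351, 352, 353]
    }
}
```

In Spark terms, this replaces the inner loop of 350 `r.map(...)` calls with one `r.map(...)` whose body loops 350 times, so the lineage grows by a single step per outer iteration instead of 350.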
Re: number limit of map for spark
What I mean is to combine multiple map functions into one. I don't know exactly how your algorithm works. Does one iteration's result depend on the last iteration? If so, how does it depend on it? I think either you can optimize your implementation, or Spark is not the right tool for your specific application.

Thanks.
Zhan Zhang

On Dec 21, 2015, at 10:43 AM, Zhiliang Zhu <zchl.j...@yahoo.com.INVALID> wrote:

What is the difference between repartition / collect and collapse ... Is collapse as costly as collect or repartition?

Thanks in advance ~

On Tuesday, December 22, 2015 2:24 AM, Zhan Zhang <zzh...@hortonworks.com> wrote:

In what situation do you have such cases? If there is no shuffle, you can collapse all these functions into one, right? In the meantime, it is not recommended to collect all data to the driver.

Thanks.
Zhan Zhang

On Dec 21, 2015, at 3:44 AM, Zhiliang Zhu <zchl.j...@yahoo.com.INVALID> wrote:

Dear All,

I need to iterate some job / RDD quite a lot of times, but I am stuck on the problem that Spark only accepts around 350 successive map calls before it meets one action function; besides, dozens of actions will obviously increase the run time. Is there any proper way ...

As tested, there is a piece of code as follows: ..
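For context on the failure itself: the StackOverflowError near roughly 350 chained maps is not a documented Spark limit on map calls; it comes from the driver recursively traversing and serializing an ever-growing RDD lineage. When the maps genuinely cannot be fused, the usual remedy is to truncate the lineage periodically with checkpoint() rather than round-tripping through collect() / parallelize(). A rough sketch, assuming a JavaSparkContext jsc, a Function step, and an HDFS-accessible checkpoint directory (the path and the interval k are illustrative guesses, and this fragment runs only with Spark on the classpath):

```java
// Truncate lineage every k iterations instead of collect() + parallelize().
jsc.setCheckpointDir("hdfs:///tmp/checkpoints");   // hypothetical path

JavaRDD<Integer> r = dataSet;
int k = 50;                                        // checkpoint interval (tuning assumption)
for (int i = 0; i < m; i++) {
    r = r.map(step);                               // step: Function<Integer, Integer>
    if (i % k == k - 1) {
        r = r.cache();                             // keep the data so checkpointing
        r.checkpoint();                            //   does not recompute the chain
        r.count();                                 // an action forces the checkpoint now
    }
}
List<Integer> result = r.collect();                // single action at the end
```

Unlike collect() + parallelize(), this never moves the data through the driver, and it preserves whatever partitioning the RDD already has.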
Re: number limit of map for spark
What is the difference between repartition / collect and collapse ... Is collapse as costly as collect or repartition?

Thanks in advance ~

On Tuesday, December 22, 2015 2:24 AM, Zhan Zhang wrote:

In what situation do you have such cases? If there is no shuffle, you can collapse all these functions into one, right? In the meantime, it is not recommended to collect all data to the driver.

Thanks.
Zhan Zhang

On Dec 21, 2015, at 3:44 AM, Zhiliang Zhu wrote:

Dear All,

I need to iterate some job / RDD quite a lot of times, but I am stuck on the problem that Spark only accepts around 350 successive map calls before it meets one action function; besides, dozens of actions will obviously increase the run time. Is there any proper way ...

As tested, there is a piece of code as follows: ..

Thanks very much in advance!
Zhiliang
Re: number limit of map for spark
Dear Zhan,

Thanks very much for your kind reply! You may refer to my other letter with the title: [Beg for help] spark job with very low efficiency

I just need to apply Spark to mathematical optimization by a genetic algorithm, and theoretically the algorithm would iterate lots of times. Then I ran into these problems:
1) a Spark job will only allow a limited number of successive map calls before it meets one action;
2) action functions such as collect / reduce increase the run time VERY MUCH;
3) as for parallelism, I understand an RDD with only one partition loses all the parallelism provided by Spark, is it ... but if it has many partitions, then it is difficult to randomly combine all its rows to generate another RDD.

Thank you,
Zhiliang

On Tuesday, December 22, 2015 2:24 AM, Zhan Zhang wrote:

In what situation do you have such cases? If there is no shuffle, you can collapse all these functions into one, right? In the meantime, it is not recommended to collect all data to the driver.

Thanks.
Zhan Zhang

On Dec 21, 2015, at 3:44 AM, Zhiliang Zhu wrote:

Dear All,

I need to iterate some job / RDD quite a lot of times, but I am stuck on the problem that Spark only accepts around 350 successive map calls before it meets one action function; besides, dozens of actions will obviously increase the run time. Is there any proper way ...

As tested, there is a piece of code as follows: ..
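Regarding point 1) above: the roughly-350-map ceiling is a stack-depth effect rather than a hard Spark limit. Each map() wraps the previous RDD, and evaluating or serializing the resulting chain recurses once per layer, so a deep enough chain overflows the stack. The same mechanism can be reproduced in plain Java by composing function wrappers (the depths here are illustrative; the exact threshold depends on the JVM stack size):

```java
import java.util.function.IntUnaryOperator;

public class DeepChain {
    // Compose `depth` wrappers, one per simulated map() call;
    // evaluating the result recurses `depth` frames deep.
    static IntUnaryOperator buildChain(int depth) {
        IntUnaryOperator f = v -> v;
        for (int i = 0; i < depth; i++) {
            IntUnaryOperator prev = f;
            f = v -> prev.applyAsInt(v) + 1;
        }
        return f;
    }

    static String evalOrOverflow(IntUnaryOperator f, int x) {
        try {
            return String.valueOf(f.applyAsInt(x));
        } catch (StackOverflowError e) {
            return "overflow";
        }
    }

    public static void main(String[] args) {
        // A chain of 350 layers still evaluates fine.
        System.out.println(evalOrOverflow(buildChain(350), 0));        // 350
        // A very deep chain blows the stack, like a very long RDD lineage.
        System.out.println(evalOrOverflow(buildChain(1_000_000), 0));  // overflow
    }
}
```

A fused loop, by contrast, runs at constant stack depth regardless of the iteration count, which is why collapsing the maps (or checkpointing to cut the lineage) removes the ceiling.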
Re: number limit of map for spark
In what situation do you have such cases? If there is no shuffle, you can collapse all these functions into one, right? In the meantime, it is not recommended to collect all data to the driver.

Thanks.
Zhan Zhang

On Dec 21, 2015, at 3:44 AM, Zhiliang Zhu <zchl.j...@yahoo.com.INVALID> wrote:

Dear All,

I need to iterate some job / RDD quite a lot of times, but I am stuck on the problem that Spark only accepts around 350 successive map calls before it meets one action function; besides, dozens of actions will obviously increase the run time. Is there any proper way ...

As tested, there is a piece of code as follows: ..

Thanks very much in advance!
Zhiliang