Re: number limit of map for spark
Thanks a lot for Zhan's comment, it really offered much help.

On Tuesday, December 22, 2015 5:11 AM, Zhan Zhang wrote:

What I mean is to combine multiple map functions into one. I don't know exactly how your algorithm works. Does one iteration's result depend on the last iteration? If so, how does it depend on it? I think either you can optimize your implementation, or Spark is not the right tool for your specific application.

Thanks.
Zhan Zhang

On Dec 21, 2015, at 10:43 AM, Zhiliang Zhu wrote:

What is the difference between repartition / collect and collapse ... Is collapse as costly as collect or repartition?

Thanks in advance ~

On Tuesday, December 22, 2015 2:24 AM, Zhan Zhang wrote:

In what situation do you have such cases? If there is no shuffle, you can collapse all these functions into one, right? In the meantime, it is not recommended to collect all data to the driver.

Thanks.
Zhan Zhang

On Dec 21, 2015, at 3:44 AM, Zhiliang Zhu wrote:

Dear All,

I need to iterate some job / RDD quite a lot of times, but I am stuck on the problem that Spark only accepts around 350 successive map calls before it meets one action function; besides, dozens of actions will obviously increase the run time. Is there any proper way ...

As tested, there is a piece of code as follows:
..
int count = 0;
JavaRDD<Integer> dataSet = jsc.parallelize(list, 1).cache();  // with only 1 partition
int m = 350;
JavaRDD<Integer> r = dataSet.cache();
JavaRDD<Integer> t = null;

// outer loop: carry the RDD r over as t between rounds
for (int j = 0; j < m; ++j) {
    if (null != t) {
        r = t;
    }
    // inner loop: call map 350 times; if m is much more than 350 (for instance,
    // around 400), the job throws:
    // "15/12/21 19:36:17 ERROR yarn.ApplicationMaster: User class threw exception:
    //  java.lang.StackOverflowError"
    for (int i = 0; i < m; ++i) {
        r = r.map(new Function<Integer, Integer>() {
            @Override
            public Integer call(Integer integer) {
                double x = Math.random() * 2 - 1;
                double y = Math.random() * 2 - 1;
                return (x * x + y * y < 1) ? 1 : 0;
            }
        });
    }

    // collect this RDD to build another one; however, dozens of action functions
    // such as collect are VERY costly
    List<Integer> lt = r.collect();
    t = jsc.parallelize(lt, 1).cache();
}
..

Thanks very much in advance!
Zhiliang
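Zhan's suggestion above, combining multiple map functions into one, can be sketched without Spark at all. Rather than chaining m separate map transformations, each of which adds a layer to the RDD lineage, a single function can apply the per-element step m times internally; the per-element result is identical. A minimal sketch in plain Java, where the `step` operator is a hypothetical stand-in for the map body:

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.IntUnaryOperator;
import java.util.stream.Collectors;

public class FusedMap {
    // One function that applies the per-iteration step m times,
    // instead of m chained map() transformations.
    static int applyNTimes(IntUnaryOperator step, int m, int x) {
        for (int i = 0; i < m; i++) {
            x = step.applyAsInt(x);
        }
        return x;
    }

    public static void main(String[] args) {
        IntUnaryOperator step = v -> v + 1;       // stand-in for the map body
        List<Integer> data = Arrays.asList(1, 2, 3);

        // Equivalent of: r = r.map(step) repeated 350 times, then collect()
        List<Integer> result = data.stream()
                .map(v -> applyNTimes(step, 350, v))
                .collect(Collectors.toList());

        System.out.println(result);               // [351, 352, 353]
    }
}
```

In Spark terms, this replaces the inner loop of 350 `r.map(...)` calls with one `r.map(...)` whose body loops 350 times, so the lineage grows by a single step per outer iteration instead of 350.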
Re: number limit of map for spark
What I mean is to combine multiple map functions into one. I don't know exactly how your algorithm works. Does one iteration's result depend on the last iteration? If so, how does it depend on it? I think either you can optimize your implementation, or Spark is not the right tool for your specific application.

Thanks.
Zhan Zhang

On Dec 21, 2015, at 10:43 AM, Zhiliang Zhu <zchl.j...@yahoo.com.INVALID> wrote:

What is the difference between repartition / collect and collapse ... Is collapse as costly as collect or repartition?

Thanks in advance ~

On Tuesday, December 22, 2015 2:24 AM, Zhan Zhang <zzh...@hortonworks.com> wrote:

In what situation do you have such cases? If there is no shuffle, you can collapse all these functions into one, right? In the meantime, it is not recommended to collect all data to the driver.

Thanks.
Zhan Zhang

On Dec 21, 2015, at 3:44 AM, Zhiliang Zhu <zchl.j...@yahoo.com.INVALID> wrote:

Dear All,

I need to iterate some job / RDD quite a lot of times, but I am stuck on the problem that Spark only accepts around 350 successive map calls before it meets one action function; besides, dozens of actions will obviously increase the run time. Is there any proper way ...

As tested, there is a piece of code as follows: ..
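For context on the failure itself: the StackOverflowError near roughly 350 chained maps is not a documented Spark limit on map calls; it comes from the driver recursively traversing and serializing an ever-growing RDD lineage. When the maps genuinely cannot be fused, the usual remedy is to truncate the lineage periodically with checkpoint() rather than round-tripping through collect() / parallelize(). A rough sketch, assuming a JavaSparkContext jsc, a Function step, and an HDFS-accessible checkpoint directory (the path and the interval k are illustrative guesses, and this fragment runs only with Spark on the classpath):

```java
// Truncate lineage every k iterations instead of collect() + parallelize().
jsc.setCheckpointDir("hdfs:///tmp/checkpoints");   // hypothetical path

JavaRDD<Integer> r = dataSet;
int k = 50;                                        // checkpoint interval (tuning assumption)
for (int i = 0; i < m; i++) {
    r = r.map(step);                               // step: Function<Integer, Integer>
    if (i % k == k - 1) {
        r = r.cache();                             // keep the data so checkpointing
        r.checkpoint();                            //   does not recompute the chain
        r.count();                                 // an action forces the checkpoint now
    }
}
List<Integer> result = r.collect();                // single action at the end
```

Unlike collect() + parallelize(), this never moves the data through the driver, and it preserves whatever partitioning the RDD already has.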
Re: number limit of map for spark
What is the difference between repartition / collect and collapse ... Is collapse as costly as collect or repartition?

Thanks in advance ~

On Tuesday, December 22, 2015 2:24 AM, Zhan Zhang wrote:

In what situation do you have such cases? If there is no shuffle, you can collapse all these functions into one, right? In the meantime, it is not recommended to collect all data to the driver.

Thanks.
Zhan Zhang

On Dec 21, 2015, at 3:44 AM, Zhiliang Zhu wrote:

Dear All,

I need to iterate some job / RDD quite a lot of times, but I am stuck on the problem that Spark only accepts around 350 successive map calls before it meets one action function; besides, dozens of actions will obviously increase the run time. Is there any proper way ...

As tested, there is a piece of code as follows: ..

Thanks very much in advance!
Zhiliang
Re: number limit of map for spark
Dear Zhan,

Thanks very much for your kind reply! You may refer to my other letter with the title: [Beg for help] spark job with very low efficiency

I just need to apply Spark to mathematical optimization by a genetic algorithm, and theoretically the algorithm would iterate lots of times. Then I ran into these problems:
1) a Spark job will only allow a limited number of successive map calls before it meets one action;
2) action functions such as collect / reduce increase the run time VERY MUCH;
3) as for parallelism, I understand an RDD with only one partition loses all the parallelism provided by Spark, is it ... but if it has many partitions, then it is difficult to randomly combine all its rows to generate another RDD.

Thank you,
Zhiliang

On Tuesday, December 22, 2015 2:24 AM, Zhan Zhang wrote:

In what situation do you have such cases? If there is no shuffle, you can collapse all these functions into one, right? In the meantime, it is not recommended to collect all data to the driver.

Thanks.
Zhan Zhang

On Dec 21, 2015, at 3:44 AM, Zhiliang Zhu wrote:

Dear All,

I need to iterate some job / RDD quite a lot of times, but I am stuck on the problem that Spark only accepts around 350 successive map calls before it meets one action function; besides, dozens of actions will obviously increase the run time. Is there any proper way ...

As tested, there is a piece of code as follows: ..
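Regarding point 1) above: the roughly-350-map ceiling is a stack-depth effect rather than a hard Spark limit. Each map() wraps the previous RDD, and evaluating or serializing the resulting chain recurses once per layer, so a deep enough chain overflows the stack. The same mechanism can be reproduced in plain Java by composing function wrappers (the depths here are illustrative; the exact threshold depends on the JVM stack size):

```java
import java.util.function.IntUnaryOperator;

public class DeepChain {
    // Compose `depth` wrappers, one per simulated map() call;
    // evaluating the result recurses `depth` frames deep.
    static IntUnaryOperator buildChain(int depth) {
        IntUnaryOperator f = v -> v;
        for (int i = 0; i < depth; i++) {
            IntUnaryOperator prev = f;
            f = v -> prev.applyAsInt(v) + 1;
        }
        return f;
    }

    static String evalOrOverflow(IntUnaryOperator f, int x) {
        try {
            return String.valueOf(f.applyAsInt(x));
        } catch (StackOverflowError e) {
            return "overflow";
        }
    }

    public static void main(String[] args) {
        // A chain of 350 layers still evaluates fine.
        System.out.println(evalOrOverflow(buildChain(350), 0));        // 350
        // A very deep chain blows the stack, like a very long RDD lineage.
        System.out.println(evalOrOverflow(buildChain(1_000_000), 0));  // overflow
    }
}
```

A fused loop, by contrast, runs at constant stack depth regardless of the iteration count, which is why collapsing the maps (or checkpointing to cut the lineage) removes the ceiling.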
Re: number limit of map for spark
In what situation do you have such cases? If there is no shuffle, you can collapse all these functions into one, right? In the meantime, it is not recommended to collect all data to the driver.

Thanks.
Zhan Zhang

On Dec 21, 2015, at 3:44 AM, Zhiliang Zhu <zchl.j...@yahoo.com.INVALID> wrote:

Dear All,

I need to iterate some job / RDD quite a lot of times, but I am stuck on the problem that Spark only accepts around 350 successive map calls before it meets one action function; besides, dozens of actions will obviously increase the run time. Is there any proper way ...

As tested, there is a piece of code as follows: ..

Thanks very much in advance!
Zhiliang