Re: the way to compare any two adjacent elements in one rdd
On Monday, December 7, 2015 10:37 AM, DB Tsai wrote:

> Only the beginning and ending parts of the data need to be shuffled; the
> rest of each partition can be compared without a shuffle.
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D

Would you help write a little pseudo-code for it? It seems there is no
shuffle-related API for this, apart from repartition. Thanks a lot in advance!

Zhiliang

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
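As a rough sketch of the pseudo-code asked for above (plain Java standing in for the Spark calls; all names here are illustrative, not from the thread): a cheap first pass collects the first element of every partition, then a second pass compares each partition's own adjacent pairs plus one boundary pair against the next partition's head. In Spark this would be a small mapPartitionsWithIndex plus collect for pass 1, and a mapPartitionsWithIndex over the full data (with the heads broadcast) for pass 2, so only one element per partition ever moves.

```java
import java.util.*;

public class AdjacentCompare {
    // Simulates the two-pass scheme on local "partitions": collect each
    // partition's head element, then emit every adjacent pair inside each
    // partition plus one boundary pair with the next partition's head.
    static List<int[]> comparePairs(List<List<Integer>> partitions) {
        // Pass 1 (in Spark: a small mapPartitionsWithIndex + collect):
        // the first element of every partition.
        List<Integer> heads = new ArrayList<>();
        for (List<Integer> p : partitions) heads.add(p.get(0));

        // Pass 2 (in Spark: mapPartitionsWithIndex over the original data,
        // with `heads` available on every executor): each partition works
        // independently, so parallelism is preserved.
        List<int[]> pairs = new ArrayList<>();
        for (int pi = 0; pi < partitions.size(); pi++) {
            List<Integer> p = partitions.get(pi);
            for (int i = 0; i < p.size() - 1; i++) {
                pairs.add(new int[]{p.get(i), p.get(i + 1)});
            }
            if (pi + 1 < partitions.size()) {
                // Boundary pair: last element here, next partition's head.
                pairs.add(new int[]{p.get(p.size() - 1), heads.get(pi + 1)});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<List<Integer>> parts = Arrays.asList(
            Arrays.asList(1, 2, 3), Arrays.asList(4, 5), Arrays.asList(6, 7));
        for (int[] pr : comparePairs(parts)) {
            System.out.println(pr[0] + "," + pr[1]);
        }
    }
}
```

Run locally this prints every adjacent pair, including the cross-partition ones (3,4 and 5,6), without ever merging the data into one partition.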
Re: the way to compare any two adjacent elements in one rdd
Only the beginning and ending parts of the data need to be shuffled. The rest
of each partition can be compared without a shuffle.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D
Re: the way to compare any two adjacent elements in one rdd
On Saturday, December 5, 2015 3:00 PM, DB Tsai wrote:

> This is tricky. You need to shuffle the ending and beginning elements
> using mapPartitionsWithIndex.
>
> Sincerely,
>
> DB Tsai
> --
> Web: https://www.dbtsai.com
> PGP Key ID: 0xAF08DF8D

Does this mean that I need to shuffle all the elements from the different
partitions into one partition, and then compare each pair of adjacent
elements? That would seem to work, if that is what you mean.

One more issue: won't it lose parallelism, since there would then be only one
partition?

Thanks very much in advance!

Zhiliang
Re: the way to compare any two adjacent elements in one rdd
For this, mapPartitionsWithIndex would also work properly for a filter. Here
is code copied from Stack Overflow, used to remove the first line (the header)
of a CSV file:

    JavaRDD<String> rawInputRdd = sparkContext.textFile(dataFile);

    Function2<Integer, Iterator<String>, Iterator<String>> removeHeader =
            new Function2<Integer, Iterator<String>, Iterator<String>>() {
        @Override
        public Iterator<String> call(Integer index, Iterator<String> iterator)
                throws Exception {
            if (index == 0 && iterator.hasNext()) {
                iterator.next();  // skip the header line in partition 0
            }
            // For my usage -- comparing adjacent elements, or doing a filter --
            // the index parameter is not needed; it is fine to view each
            // iterator as one logical partition.
            return iterator;
        }
    };

    JavaRDD<String> inputRdd =
            rawInputRdd.mapPartitionsWithIndex(removeHeader, false);

Zhiliang
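On the filter question in this thread: filter keeps the element type, while a mapPartitions-style function can drop elements and change the type in one step, and a partition that comes out empty is harmless, since an empty iterator simply contributes nothing downstream. A plain-Java sketch of such a per-partition function (illustrative names, no Spark dependency):

```java
import java.util.*;

public class PartitionFilterMap {
    // A mapPartitions-style per-partition function: it both filters (drops
    // odd numbers) and changes the element type (Integer -> String), which
    // a plain filter cannot do. Returning an empty iterator is fine.
    static Iterator<String> evensAsStrings(Iterator<Integer> it) {
        List<String> out = new ArrayList<>();
        while (it.hasNext()) {
            int v = it.next();
            if (v % 2 == 0) {
                out.add("even:" + v);
            }
        }
        return out.iterator();
    }

    public static void main(String[] args) {
        Iterator<String> r =
            evensAsStrings(Arrays.asList(1, 2, 3, 4, 5).iterator());
        while (r.hasNext()) {
            System.out.println(r.next());
        }
    }
}
```

In Spark, the same body would sit inside the `call` of a `Function2` passed to mapPartitionsWithIndex (or a `FlatMapFunction` passed to mapPartitions), with the input and output element types declared independently.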
Re: the way to compare any two adjacent elements in one rdd
Hi DB Tsai,

Thanks very much for your kind reply!

Sorry, one more issue: as tested, it seems that filter can only return a
JavaRDD of the same element type, not an arbitrary JavaRDD. Is that right?
Then it is not very convenient to do a general filter on an RDD.
mapPartitions could work to some extent, but if some partition is left with
no elements after filtering via mapPartitions, there may be a problem.

Best wishes!
Zhiliang
Re: the way to compare any two adjacent elements in one rdd
This is tricky. You need to shuffle the ending and beginning elements using
mapPartitionsWithIndex.

Sincerely,

DB Tsai
--
Web: https://www.dbtsai.com
PGP Key ID: 0xAF08DF8D
the way to compare any two adjacent elements in one rdd
Hi All,

I would like to compare any two adjacent elements in one given RDD, just as
in this single-machine code:

    int a[N] = {...};
    for (int i = 0; i < N - 1; ++i) {
        compareFun(a[i], a[i+1]);
    }
    ...

mapPartitions may work for some situations; however, it cannot compare
elements that fall in different partitions. foreach also does not seem to
work.

Thanks,
Zhiliang