Re: spark rdd grouping

2015-12-25 Thread Shushant Arora
Hi, I have created a JIRA for this feature: https://issues.apache.org/jira/browse/SPARK-12524 Please vote for this feature if you think it is necessary. I would like to implement it. Thanks, Shushant On Wed, Dec 2, 2015 at 1:14 PM, Rajat Kumar wrote: > What if I don't have to use aggregate function onl

Re: spark rdd grouping

2015-12-01 Thread Rajat Kumar
What if I don't need an aggregate function, only a groupByKeyLocally() followed by a map transformation? Will reduceByKeyLocally help here? Or is there any workaround, given that groupByKey is not local and works globally across all partitions? Thanks On Tue, Dec 1, 2015 at 5:20 PM, ayan guha wrote: > I bel
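On the reduceByKeyLocally question, a small hedged sketch in Scala: as far as I know, `reduceByKeyLocally` on a pair RDD merges the values per key and hands the result back to the driver as a `Map` rather than as another RDD. The object name `ReduceLocallySketch`, the helper `sumsByKey`, and the sample data are made up for illustration, and the snippet assumes spark-core is on the classpath and that the caller supplies a `SparkContext`.

```scala
import org.apache.spark.SparkContext

object ReduceLocallySketch {
  // Hypothetical helper: sums the values per key for a tiny sample data set.
  def sumsByKey(sc: SparkContext): scala.collection.Map[String, Int] = {
    val pairs = sc.parallelize(Seq("a" -> 1, "b" -> 2, "a" -> 3), 2)
    // The merged result comes back to the driver as a Map, not as an RDD,
    // so no further RDD transformation can be chained onto it.
    pairs.reduceByKeyLocally(_ + _)
  }
}
```

Because the whole result has to fit in driver memory and is no longer distributed, this probably does not cover the "group locally, then map" case asked about here.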

Re: spark rdd grouping

2015-12-01 Thread ayan guha
I believe reduceByKeyLocally was introduced for this purpose. On Tue, Dec 1, 2015 at 10:21 PM, Jacek Laskowski wrote: > Hi Rajat, > > My quick test showed that groupBy will preserve the partitions: > > scala> > sc.parallelize(Seq(0,0,0,0,1,1,1,1),2).map((_,1)).mapPartitionsWithIndex > { case

Re: spark rdd grouping

2015-12-01 Thread Jacek Laskowski
Hi Rajat, My quick test showed that groupBy will preserve the partitions: scala> sc.parallelize(Seq(0,0,0,0,1,1,1,1),2).map((_,1)).mapPartitionsWithIndex { case (idx, iter) => val s = iter.toSeq; println(idx + " with " + s.size + " elements: " + s); s.toIterator }.groupBy(_._1).mapPartitionsW
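One hedged way to double-check an observation like this is to look at the lineage of the grouped RDD. Seeing the same elements in the same partitions before and after `groupBy` does not by itself rule out a shuffle: with two partitions, a hash partitioner would likely send key 0 to partition 0 and key 1 to partition 1, which happens to match the input layout. The sketch below reuses the data from the test above; the object name `GroupByLineageCheck`, the helper `lineage`, and the `local[2]` master are assumptions for illustration, and spark-core is assumed to be on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GroupByLineageCheck {
  // Returns the lineage of the grouped RDD as text (hypothetical helper).
  def lineage(sc: SparkContext): String = {
    val pairs = sc.parallelize(Seq(0, 0, 0, 0, 1, 1, 1, 1), 2).map((_, 1))
    val grouped = pairs.groupBy(_._1)
    // A parallelized collection carries no partitioner; groupBy assigns one.
    println(s"partitioner before: ${pairs.partitioner}")
    println(s"partitioner after:  ${grouped.partitioner}")
    grouped.toDebugString
  }

  def main(args: Array[String]): Unit = {
    // Local two-thread master, assumed only for this illustration.
    val conf = new SparkConf().setMaster("local[2]").setAppName("groupBy-lineage-check")
    val sc = new SparkContext(conf)
    try println(lineage(sc)) finally sc.stop()
  }
}
```

If the printed lineage contains a ShuffledRDD entry, the groupBy introduced a shuffle stage even though the partition contents look unchanged.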

spark rdd grouping

2015-11-30 Thread Rajat Kumar
Hi, I have a JavaPairRDD, rdd1. I want to group rdd1 by key but preserve the partitions of the original RDD to avoid a shuffle, since I know all records with the same key are already in the same partition. The pair RDD is constructed using the Kafka streaming low-level consumer, which has all records with the same key
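A minimal sketch of the idea in this question, written in Scala rather than the Java API the poster uses: if every key already lives in exactly one partition, the grouping can be done partition by partition with no data movement. The names `LocalGrouping` and `groupWithinPartition` and the sample data are hypothetical, and the partitions are modeled as plain Scala sequences so the snippet runs without a cluster.

```scala
object LocalGrouping {
  // Groups the records of ONE partition by key, keeping first-seen key order.
  // Hypothetical helper; it holds the whole partition in memory while grouping.
  def groupWithinPartition[K, V](records: Iterator[(K, V)]): Iterator[(K, Seq[V])] = {
    val groups = scala.collection.mutable.LinkedHashMap.empty[K, Vector[V]]
    records.foreach { case (k, v) =>
      groups.update(k, groups.getOrElse(k, Vector.empty[V]) :+ v)
    }
    groups.iterator
  }

  // Two stand-in partitions; each key appears in only one of them (assumed).
  val partitions: Seq[Seq[(String, Int)]] = Seq(
    Seq("a" -> 1, "b" -> 2, "a" -> 3),
    Seq("c" -> 4, "c" -> 5)
  )

  // Apply the function to each partition separately; nothing crosses partitions.
  val grouped: Seq[List[(String, Seq[Int])]] =
    partitions.map(p => groupWithinPartition(p.iterator).toList)

  // With Spark, the equivalent wiring for a pair RDD named rdd1 would be:
  //   rdd1.mapPartitions(it => groupWithinPartition(it), preservesPartitioning = true)
}
```

The result is only a correct global grouping if the assumption about key placement really holds; a key split across two partitions would yield two separate groups.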

Re: RDD Grouping

2014-08-19 Thread TJ Klein
Thanks a lot. Yes, mapPartitions seems a better way of dealing with this problem, since for groupBy() I need to collect() the data before applying parallelize(), which is expensive. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/RDD-Grouping-tp12407p12424

Re: RDD Grouping

2014-08-19 Thread Sean Owen
something like that but the return-type > is PipelinedRDD, which is not iterable. > Anybody an idea? > Thanks in advance, > Tassilo

RDD Grouping

2014-08-19 Thread TJ Klein