Strangely, this works only for a very small set of rows; for very large datasets the partitioning apparently does not work. Is there a limit on the number of columns or rows when repartitioning by multiple columns?
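For context, repartition(new Column("a"), new Column("b")) hash-partitions rows on the key columns into a fixed number of buckets (spark.sql.shuffle.partitions, 200 by default), so the behavior is not a column- or row-count limit: distinct key combinations can land in the same bucket, and on a large dataset each partition will typically hold many groups. Below is a minimal plain-Java sketch of that rule, no Spark required. Spark uses Murmur3 internally; Objects.hash is only an illustrative stand-in here, so the bucket numbers differ from Spark's, but the mechanics are the same.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Sketch of hash partitioning as done by repartition(col("a"), col("b")):
// bucket = nonNegativeMod(hash(a, b), numPartitions).
// Spark uses Murmur3 on the key columns; Objects.hash is a stand-in.
public class HashPartitionSketch {

    static int partitionFor(long a, long b, int numPartitions) {
        int h = Objects.hash(a, b) % numPartitions;
        // pmod: keep the bucket index non-negative
        return h < 0 ? h + numPartitions : h;
    }

    public static void main(String[] args) {
        int numPartitions = 200; // spark.sql.shuffle.partitions default
        List<long[]> keys = Arrays.asList(
                new long[]{11L, 21L}, new long[]{11L, 22L},
                new long[]{12L, 23L}, new long[]{12L, 24L});
        for (long[] k : keys) {
            // Equal (a, b) keys always map to the same bucket; distinct
            // keys may collide, so one partition can hold several groups.
            System.out.println(Arrays.toString(k) + " -> partition "
                    + partitionFor(k[0], k[1], numPartitions));
        }
    }
}
```

The takeaway for mapPartitions is that each call may see rows from several (a, b) groups, so per-group logic has to re-group within the iterator rather than assume one group per partition.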
regards,
Imran

On Wed, Oct 18, 2017 at 11:00 AM, Imran Rajjad <raj...@gmail.com> wrote:
> yes..I think I figured out something like below
>
> Serialized Java Class
> -----------------
> public class MyMapPartition implements Serializable, MapPartitionsFunction<Row, Row> {
>     @Override
>     public Iterator<Row> call(Iterator<Row> iter) throws Exception {
>         ArrayList<Row> list = new ArrayList<Row>();
>         // ArrayNode array = mapper.createArrayNode();
>         Row row = null;
>         System.out.println("--------");
>         while (iter.hasNext()) {
>             row = iter.next();
>             System.out.println(row);
>             list.add(row);
>         }
>         System.out.println(">>>>");
>         return list.iterator();
>     }
> }
>
> Unit Test
> -----------
> JavaRDD<Row> rdd = jsc.parallelize(Arrays.asList(
>         RowFactory.create(11L, 21L, 1L),
>         RowFactory.create(11L, 22L, 2L),
>         RowFactory.create(11L, 22L, 1L),
>         RowFactory.create(12L, 23L, 3L),
>         RowFactory.create(12L, 24L, 3L),
>         RowFactory.create(12L, 22L, 4L),
>         RowFactory.create(13L, 22L, 4L),
>         RowFactory.create(14L, 22L, 4L)
> ));
> StructType structType = new StructType();
> structType = structType.add("a", DataTypes.LongType, false)
>         .add("b", DataTypes.LongType, false)
>         .add("c", DataTypes.LongType, false);
> ExpressionEncoder<Row> encoder = RowEncoder.apply(structType);
>
> Dataset<Row> ds = spark.createDataFrame(rdd, encoder.schema());
> ds.show();
>
> MyMapPartition mp = new MyMapPartition();
> Dataset<Row> grouped = ds.groupBy("a", "b", "c")
>         .count()
>         .repartition(new Column("a"), new Column("b"))
>         .mapPartitions(mp, encoder);
>
> grouped.count();
>
> ---------------
>
> output
> --------
> --------
> [12,23,3,1]
> >>>>
> --------
> [14,22,4,1]
> >>>>
> --------
> [12,24,3,1]
> >>>>
> --------
> [12,22,4,1]
> >>>>
> --------
> [11,22,1,1]
> [11,22,2,1]
> >>>>
> --------
> [11,21,1,1]
> >>>>
> --------
> [13,22,4,1]
> >>>>
>
>
> On Wed, Oct 18, 2017 at 10:29 AM, ayan guha <guha.a...@gmail.com> wrote:
>
>> How or what do you want to achieve?
>> I.e., are you planning to do some aggregation on group by c1, c2?
>>
>> On Wed, 18 Oct 2017 at 4:13 pm, Imran Rajjad <raj...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I have a set of rows that are the result of a
>>> groupBy(col1, col2, col3).count().
>>>
>>> Is it possible to map rows belonging to a unique combination inside an
>>> iterator?
>>>
>>> e.g.
>>>
>>> col1  col2  col3
>>> a     1     a1
>>> a     1     a2
>>> b     2     b1
>>> b     2     b2
>>>
>>> how can I separate rows with (col1, col2) = (a, 1) and (b, 2)?
>>>
>>> regards,
>>> Imran
>>>
>>> --
>>> I.R
>>>
>> --
>> Best Regards,
>> Ayan Guha
>
>
> --
> I.R

--
I.R