The most efficient way to determine the number of columns would be to do a
take(1) and split the result in the driver.
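
A minimal PySpark sketch of that idea, assuming the RDD holds
(id, "TX, NY, FL") pairs. The names split_states, pad, column_names and
widen are hypothetical helpers made up for illustration, not Spark API.
Note that take(1) only reports the width of the first row; if rows can have
different lengths, as in the example below, one extra map + max pass finds
the widest row instead.

```python
def split_states(row):
    """Turn (id, "TX, NY, FL") into (id, ["TX", "NY", "FL"])."""
    row_id, states = row
    return row_id, [s.strip() for s in states.split(",") if s.strip()]


def pad(row, width):
    """Pad the split values with None so the row has `width` state columns."""
    row_id, states = row
    return (row_id, *states, *([None] * (width - len(states))))


def column_names(width):
    """Generate STATE_1 .. STATE_<width>."""
    return ["STATE_%d" % i for i in range(1, width + 1)]


def widen(rdd, uniform=False):
    """rdd is assumed to be an RDD of (id, comma-separated string) pairs.

    Returns the widened RDD together with the list of new column names,
    which is kept in the driver next to the RDD.
    """
    split = rdd.map(split_states).cache()
    if uniform:
        # Cheap: inspect a single row in the driver. Only valid when every
        # row has the same number of values (and the RDD is not empty).
        width = len(split.take(1)[0][1])
    else:
        # One extra pass over the data: the widest row decides the width.
        width = split.map(lambda r: len(r[1])).max()
    return split.map(lambda r: pad(r, width)), column_names(width)
```

With an existing SparkContext sc, usage would look something like
wide, names = widen(sc.parallelize([(1, "TX, NY, FL"), (2, "CA, OH")])).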

Regards
Sab
On 19-Jan-2016 8:48 pm, "Richard Siebeling" <rsiebel...@gmail.com> wrote:

> Hi,
>
> what is the most efficient way to split columns and know how many columns
> are created?
>
> Here is the current RDD
> -----------------
> ID   STATE
> -----------------
> 1       TX, NY, FL
> 2       CA, OH
> -----------------
>
> This is the preferred output:
> -------------------------
> ID    STATE_1     STATE_2      STATE_3
> -------------------------
> 1     TX              NY              FL
> 2     CA              OH
> -------------------------
>
> With a separate list of the new columns: STATE_1, STATE_2, STATE_3
>
>
> It looks like the following output is feasible using a ReduceBy operator
> -------------------------
> ID    STATE_1     STATE_2      STATE_3       NEW_COLUMNS
> -------------------------
> 1     TX                NY               FL            STATE_1, STATE_2, STATE_3
> 2     CA                OH                             STATE_1, STATE_2
> -------------------------
>
> Then in the reduce step, the distinct new columns can be calculated.
> Is it possible to get the second output where next to the RDD the
> new_columns are saved somewhere?
> Or is it required to use the second approach?
>
> thanks in advance,
> Richard
>
>
