Re: Split columns in RDD

Richard Siebeling Tue, 19 Jan 2016 12:49:07 -0800

Thanks Daniel, this will certainly help.
Regards, Richard

On Tue, Jan 19, 2016 at 6:35 PM, Daniel Imberman <daniel.imber...@gmail.com>
wrote:


> edit 2: filter should be map
>
> val numColumns = separatedInputStrings.map{ case(id, (stateList,
> numStates)) => numStates}.reduce(math.max)
>
> On Tue, Jan 19, 2016 at 8:19 AM Daniel Imberman <daniel.imber...@gmail.com>
> wrote:
>
>> edit: Mistake in the second code example
>>
>> val numColumns = separatedInputStrings.filter{ case(id, (stateList,
>> numStates)) => numStates}.reduce(math.max)
>>
>>
>> On Tue, Jan 19, 2016 at 8:17 AM Daniel Imberman <
>> daniel.imber...@gmail.com> wrote:
>>
>>> Hi Richard,
>>>
>>> If I understand the question correctly it sounds like you could probably
>>> do this using mapValues (I'm assuming that you want two pieces of
>>> information out of all rows, the states as individual items, and the number
>>> of states in the row)
>>>
>>>
>>> // input: RDD[(Int, String)], e.g. (1, "TX,NV,WY")
>>> val separatedInputStrings = input.mapValues { inputString =>
>>>     val stringList = inputString.split(",")
>>>     (stringList, stringList.size)
>>> }
>>>
>>> If you then wanted to find out how many state columns you should have in
>>> your table you could use a normal reduce (with a filter beforehand to
>>> reduce how much data you are shuffling)
>>>
>>> val numColumns = separatedInputStrings.filter(_._2).reduce(math.max)
>>>
>>> I hope this helps!
>>>
>>>
>>>
>>> On Tue, Jan 19, 2016 at 8:05 AM Richard Siebeling <rsiebel...@gmail.com>
>>> wrote:
>>>
>>>> That's true, and that's the way we're doing it now, but then we're only
>>>> using the first row to determine the number of split columns.
>>>> It could be that in the second (or last) row there are 10 new columns
>>>> and we'd like to know that too.
>>>>
>>>> Probably a reduceby operator can be used to do that, but I'm hoping
>>>> that there is a better or another way,
>>>>
>>>> thanks,
>>>> Richard
>>>>
>>>> On Tue, Jan 19, 2016 at 4:22 PM, Sabarish Sasidharan <
>>>> sabarish.sasidha...@manthan.com> wrote:
>>>>
>>>>> The most efficient way to determine the number of columns would be to do a
>>>>> take(1) and split in the driver.
>>>>>
>>>>> Regards
>>>>> Sab
>>>>> On 19-Jan-2016 8:48 pm, "Richard Siebeling" <rsiebel...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> What is the most efficient way to split columns and know how many
>>>>>> columns are created?
>>>>>>
>>>>>> Here is the current RDD
>>>>>> -----------------
>>>>>> ID    STATE
>>>>>> -----------------
>>>>>> 1     TX, NY, FL
>>>>>> 2     CA, OH
>>>>>> -----------------
>>>>>>
>>>>>> This is the preferred output:
>>>>>> ---------------------------------------
>>>>>> ID    STATE_1    STATE_2    STATE_3
>>>>>> ---------------------------------------
>>>>>> 1     TX         NY         FL
>>>>>> 2     CA         OH
>>>>>> ---------------------------------------
>>>>>>
>>>>>> With a separate list of the new columns: STATE_1, STATE_2, STATE_3
>>>>>>
>>>>>>
>>>>>> It looks like the following output is feasible using a ReduceBy
>>>>>> operator
>>>>>> ------------------------------------------------------------------
>>>>>> ID    STATE_1    STATE_2    STATE_3    NEW_COLUMNS
>>>>>> ------------------------------------------------------------------
>>>>>> 1     TX         NY         FL         STATE_1, STATE_2, STATE_3
>>>>>> 2     CA         OH                    STATE_1, STATE_2
>>>>>> ------------------------------------------------------------------
>>>>>>
>>>>>> Then in the reduce step, the distinct new columns can be calculated.
>>>>>> Is it possible to get the first output, where the new_columns are
>>>>>> saved somewhere next to the RDD?
>>>>>> Or is it required to use the second approach?
>>>>>>
>>>>>> thanks in advance,
>>>>>> Richard
>>>>>>
>>>>>>
>>>>
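
Putting the suggestions in this thread together, the whole flow (split each
STATE value, take the maximum number of states over all rows, then pad every
row to that width) might look roughly like the sketch below. It is only an
illustration: the object and helper names (SplitStates, splitStates, padRow,
columnNames) are made up for this example, and a plain Scala Seq stands in
for the RDD so the snippet runs without a Spark dependency. On an
RDD[(Int, String)] the same functions would presumably be passed to
mapValues, map and reduce, as noted in the comments.

```scala
// Hypothetical helper names; a local Seq stands in for the RDD[(Int, String)].
object SplitStates {

  // Split a comma-separated STATE value into trimmed tokens.
  def splitStates(states: String): Array[String] =
    states.split(",").map(_.trim)

  // Pad a row with empty strings so that it has at least `width` state columns.
  def padRow(states: Array[String], width: Int): Seq[String] =
    states.toSeq.padTo(width, "")

  // Generate the new column names STATE_1 .. STATE_width.
  def columnNames(width: Int): Seq[String] =
    (1 to width).map(i => s"STATE_$i")

  def main(args: Array[String]): Unit = {
    val input = Seq((1, "TX, NY, FL"), (2, "CA, OH"))

    // With Spark: input.mapValues(splitStates)
    val separated = input.map { case (id, states) => (id, splitStates(states)) }

    // With Spark: separated.map { case (_, s) => s.length }.reduce(math.max)
    // This looks at every row, not only the first one.
    val numColumns = separated
      .map { case (_, states) => states.length }
      .reduce((a, b) => math.max(a, b))

    // The list of new columns is kept next to the data, on the driver.
    val header = "ID" +: columnNames(numColumns)

    // With Spark: separated.map { case (id, s) => id.toString +: padRow(s, numColumns) }
    val rows = separated.map { case (id, states) =>
      id.toString +: padRow(states, numColumns)
    }

    println(header.mkString("\t"))
    rows.foreach(row => println(row.mkString("\t")))
  }
}
```

Computing numColumns with a reduce costs one extra pass over the data compared
with take(1), but it does not assume the first row is the widest one. If the
RDD is reused for the padding step, caching the split result first may avoid
splitting every row twice.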
