Thanks Daniel, this will certainly help,

regards,
Richard

On Tue, Jan 19, 2016 at 6:35 PM, Daniel Imberman <daniel.imber...@gmail.com> wrote:
> edit 2: filter should be map
>
> val numColumns = separatedInputStrings.map{ case(id, (stateList, numStates)) => numStates}.reduce(math.max)
>
> On Tue, Jan 19, 2016 at 8:19 AM Daniel Imberman <daniel.imber...@gmail.com> wrote:
>
>> edit: Mistake in the second code example
>>
>> val numColumns = separatedInputStrings.filter{ case(id, (stateList, numStates)) => numStates}.reduce(math.max)
>>
>> On Tue, Jan 19, 2016 at 8:17 AM Daniel Imberman <daniel.imber...@gmail.com> wrote:
>>
>>> Hi Richard,
>>>
>>> If I understand the question correctly it sounds like you could probably
>>> do this using mapValues (I'm assuming that you want two pieces of
>>> information out of all rows, the states as individual items, and the number
>>> of states in the row)
>>>
>>> // input: RDD[(Int, String)], e.g. (1, "TX,NV,WY")
>>> val separatedInputStrings = input.mapValues { inputString =>
>>>   // trim because the sample data has a space after each comma
>>>   val stringList = inputString.split(",").map(_.trim)
>>>   (stringList, stringList.size)
>>> }
>>>
>>> If you then wanted to find out how many state columns you should have in
>>> your table you could use a normal reduce (with a filter beforehand to
>>> reduce how much data you are shuffling)
>>>
>>> val numColumns = separatedInputStrings.filter(_._2).reduce(math.max)
>>>
>>> I hope this helps!
>>>
>>> On Tue, Jan 19, 2016 at 8:05 AM Richard Siebeling <rsiebel...@gmail.com>
>>> wrote:
>>>
>>>> that's true and that's the way we're doing it now but then we're only
>>>> using the first row to determine the number of split columns.
>>>> It could be that in the second (or last) row there are 10 new columns
>>>> and we'd like to know that too.
>>>>
>>>> Probably a reduceby operator can be used to do that, but I'm hoping
>>>> that there is a better or another way,
>>>>
>>>> thanks,
>>>> Richard
>>>>
>>>> On Tue, Jan 19, 2016 at 4:22 PM, Sabarish Sasidharan <
>>>> sabarish.sasidha...@manthan.com> wrote:
>>>>
>>>>> The most efficient way to determine the number of columns would be to
>>>>> do a take(1) and split in the driver.
>>>>>
>>>>> Regards
>>>>> Sab
>>>>> On 19-Jan-2016 8:48 pm, "Richard Siebeling" <rsiebel...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> what is the most efficient way to split columns and know how many
>>>>>> columns are created?
>>>>>>
>>>>>> Here is the current RDD
>>>>>> -----------------
>>>>>> ID  STATE
>>>>>> -----------------
>>>>>> 1   TX, NY, FL
>>>>>> 2   CA, OH
>>>>>> -----------------
>>>>>>
>>>>>> This is the preferred output:
>>>>>> -------------------------
>>>>>> ID  STATE_1  STATE_2  STATE_3
>>>>>> -------------------------
>>>>>> 1   TX       NY       FL
>>>>>> 2   CA       OH
>>>>>> -------------------------
>>>>>>
>>>>>> Together with a separate list of the new columns: STATE_1, STATE_2,
>>>>>> STATE_3
>>>>>>
>>>>>> It looks like the following output is feasible using a ReduceBy
>>>>>> operator
>>>>>> ------------------------------------------------------------
>>>>>> ID  STATE_1  STATE_2  STATE_3  NEW_COLUMNS
>>>>>> ------------------------------------------------------------
>>>>>> 1   TX       NY       FL       STATE_1, STATE_2, STATE_3
>>>>>> 2   CA       OH                STATE_1, STATE_2
>>>>>> ------------------------------------------------------------
>>>>>>
>>>>>> Then in the reduce step, the distinct new columns can be calculated.
>>>>>> Is it possible to get the second output where next to the RDD the
>>>>>> new_columns are saved somewhere?
>>>>>> Or is it required to use the second approach?
>>>>>>
>>>>>> thanks in advance,
>>>>>> Richard
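
For reference, here is a minimal sketch of the approach the thread converges on: split every row, take the maximum number of parts over all rows, then pad shorter rows to that width. It uses a plain Scala Seq in place of the RDD so that it runs without Spark, and the names SplitStates, splitStates, numColumns and toTable are hypothetical, not from the thread. On an actual RDD[(Int, String)] the same steps would presumably be written with mapValues, map and reduce, as Daniel describes.

```scala
// Sketch only: a plain Seq stands in for the RDD (assumption: no Spark on the classpath).
object SplitStates {
  // Split a comma-separated value into trimmed parts, e.g. "TX, NY" -> Array("TX", "NY").
  def splitStates(s: String): Array[String] = s.split(",").map(_.trim)

  // The widest row decides how many STATE_n columns are needed.
  // foldLeft with 0 as the start value keeps this safe for an empty input.
  def numColumns(rows: Seq[(Int, String)]): Int =
    rows.map { case (_, s) => splitStates(s).length }.foldLeft(0)(_ max _)

  // Build the header (STATE_1 .. STATE_n) and pad every row to n values
  // with empty strings so all rows have the same width.
  def toTable(rows: Seq[(Int, String)]): (Seq[String], Seq[(Int, Seq[String])]) = {
    val n = numColumns(rows)
    val header = (1 to n).map(i => s"STATE_$i")
    val body = rows.map { case (id, s) => (id, splitStates(s).toSeq.padTo(n, "")) }
    (header, body)
  }
}
```

With the two rows from the original question, (1, "TX, NY, FL") and (2, "CA, OH"), numColumns is 3 and row 2 is padded with one empty value. Unlike take(1), this scans every row, which is the cost of catching a wider row later in the data.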