Re: DataSet is not able to handle 50,000 columns to sum

2016-11-11 Thread Anil Langote
All right, thanks for the inputs. Is there any way Spark can process all combinations in parallel in one job? Would it be OK to load the input CSV file into a DataFrame and use flatMap to create key pairs, then use reduceByKey to sum the double arrays? I believe that will work the same as the agg function which you
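A minimal sketch of the flatMap/reduceByKey idea described above, in Scala. The CSV path, the attribute column positions, the "|"-delimited packing of the double array, and the particular key combinations are assumptions for illustration only.

import org.apache.spark.sql.SparkSession

object ArraySumByKey {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ArraySumByKey").getOrCreate()

    // Load the input CSV; the first two fields are assumed to be grouping
    // attributes and the third a "|"-delimited string holding the doubles.
    val rows = spark.read.option("header", "true").csv("/path/to/input.csv").rdd

    val summed = rows
      // Emit one (groupingKey, Array[Double]) pair per attribute combination,
      // so every combination is aggregated in the same job.
      .flatMap { row =>
        val values = row.getString(2).split('|').map(_.toDouble)
        Seq(
          (Seq(row.getString(0), row.getString(1)), values), // (attr1, attr2)
          (Seq(row.getString(0)), values)                    // (attr1) only
        )
      }
      // Element-wise sum of the arrays for each key; combined map-side and
      // reduced in parallel, much like an agg over the grouped columns.
      .reduceByKey { (a, b) =>
        val out = new Array[Double](a.length)
        var i = 0
        while (i < a.length) { out(i) = a(i) + b(i); i += 1 }
        out
      }

    summed.take(3).foreach { case (k, v) => println(s"$k -> ${v.take(5).mkString(",")} ...") }
    spark.stop()
  }
}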

Re: DataSet is not able to handle 50,000 columns to sum

2016-11-11 Thread ayan guha
You can explore grouping sets in SQL and write an aggregate function to do the array-wise sum. It will boil down to something like: SELECT attr1, attr2, ..., yourAgg(val) FROM t GROUP BY attr1, attr2, ... GROUPING SETS ((attr1, attr2), (attr1)) On 12 Nov 2016 04:57, "Anil Langote"
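A sketch of what such an aggregate function could look like in Spark 2.x, using a UserDefinedAggregateFunction for the element-wise array sum. The class name, the table name t, and the column names (attr1, attr2, vals) are hypothetical.

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Element-wise sum of a fixed-length array-of-double column.
class ArraySumUDAF(size: Int) extends UserDefinedAggregateFunction {
  override def inputSchema: StructType  = StructType(Seq(StructField("vals", ArrayType(DoubleType))))
  override def bufferSchema: StructType = StructType(Seq(StructField("acc", ArrayType(DoubleType))))
  override def dataType: DataType       = ArrayType(DoubleType)
  override def deterministic: Boolean   = true

  override def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Array.fill(size)(0.0)

  override def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    val acc = buffer.getSeq[Double](0).toArray
    val in  = input.getSeq[Double](0)
    var i = 0
    while (i < size) { acc(i) += in(i); i += 1 }
    buffer(0) = acc
  }

  override def merge(b1: MutableAggregationBuffer, b2: Row): Unit = {
    val acc   = b1.getSeq[Double](0).toArray
    val other = b2.getSeq[Double](0)
    var i = 0
    while (i < size) { acc(i) += other(i); i += 1 }
    b1(0) = acc
  }

  override def evaluate(buffer: Row): Any = buffer.getSeq[Double](0)
}

// Registered and used with grouping sets, roughly as in the SQL above:
// spark.udf.register("arraySum", new ArraySumUDAF(50000))
// spark.sql("""
//   SELECT attr1, attr2, arraySum(vals)
//   FROM t
//   GROUP BY attr1, attr2
//   GROUPING SETS ((attr1, attr2), (attr1))
// """)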

DataSet is not able to handle 50,000 columns to sum

2016-11-11 Thread Anil Langote
Hi All, I have been working on one use case and couldn't come up with a better solution. I have seen you are very active on the Spark user list, so please share your thoughts on the implementation. Below is the requirement. I have tried using Dataset by splitting the double array column, but it fails
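For context, a sketch of the kind of approach the message describes: splitting the packed double array into one column per element and summing each. The column names, the "|" delimiter, and the input path are assumptions; with tens of thousands of generated columns, building and executing the resulting plan is typically where this breaks down.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("WideColumnSum").getOrCreate()

val n  = 50000
val df = spark.read.option("header", "true").csv("/path/to/input.csv")

// Split the packed string into an array column, then project one double
// column per element (v0 ... v49999) alongside the grouping attributes.
val wide = df
  .withColumn("arr", split(col("values"), "\\|"))
  .select(
    (Seq(col("attr1"), col("attr2")) ++
      (0 until n).map(i => col("arr").getItem(i).cast("double").as(s"v$i"))): _*)

// One sum() expression per generated column.
val sums   = (0 until n).map(i => sum(col(s"v$i")).as(s"v$i"))
val result = wide.groupBy("attr1", "attr2").agg(sums.head, sums.tail: _*)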