Here is a more generic way of doing this:

from pyspark.sql import Row
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

df = sc.parallelize([[1, 2, 3, 4], [10, 20, 30]]).map(lambda x: Row(numbers=x)).toDF()
df.show()

u = udf(lambda c: sum(c), IntegerType())
df1 = df.withColumn("s", u(df.numbers))
df1.show()
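The UDF above simply applies Python's built-in sum to the array in each row, so it works for arrays of any length. A minimal plain-Python sketch (outside Spark, using Samir's sample data) of what the UDF computes per row:

```python
# What udf(lambda c: sum(c), IntegerType()) does to each row's array:
# Python's built-in sum over the list of numbers.
rows = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
totals = [sum(r) for r in rows]
print(totals)  # [60, 150, 240]
```

Note this is only an illustration of the per-row logic; in Spark the lambda runs inside the executor for each row of the DataFrame.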
On Tue, Aug 16, 2016 at 11:50 AM, Mike Metzger <m...@flexiblecreations.com> wrote:
> Assuming you know the number of elements in the list, this should work:
>
> df.withColumn('total', df["_1"].getItem(0) + df["_1"].getItem(1) +
>     df["_1"].getItem(2))
>
> Mike
>
> On Mon, Aug 15, 2016 at 12:02 PM, Javier Rey <jre...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> I have one dataframe with one column; this column is an array of numbers.
>> How can I sum each array by row to obtain a new column with the sum, in
>> PySpark?
>>
>> Example:
>>
>> +------------+
>> |     numbers|
>> +------------+
>> |[10, 20, 30]|
>> |[40, 50, 60]|
>> |[70, 80, 90]|
>> +------------+
>>
>> The idea is to obtain the same df with a new column with the totals:
>>
>> +------------+-----+
>> |     numbers|total|
>> +------------+-----+
>> |[10, 20, 30]|   60|
>> |[40, 50, 60]|  150|
>> |[70, 80, 90]|  240|
>> +------------+-----+
>>
>> Regards,
>>
>> Samir

-- 
Best Regards,
Ayan Guha
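For comparison, Mike's getItem approach hard-codes three positional lookups, so it only works when every row's array has exactly three elements. A plain-Python sketch (not Spark code) of the equivalent fixed-index sum:

```python
# Fixed-index sum, mirroring getItem(0) + getItem(1) + getItem(2):
row = [10, 20, 30]
total = row[0] + row[1] + row[2]
print(total)  # 60

# A shorter row would fail here (in Spark, getItem on a missing index
# yields null instead), which is why the udf(sum) approach generalizes
# better to arrays of varying length.
```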