Re: How to add a new column with date duration from 2 date columns in a dataframe
Thanks Davies. HiveContext seems neat to use :)

On Thu, Aug 20, 2015 at 3:02 PM, Davies Liu dav...@databricks.com wrote: [...]
Re: How to add a new column with date duration from 2 date columns in a dataframe
Apologies, sent too early accidentally. Actual message is below.

A dataframe has two date columns (datetime type) and I would like to add another column with the difference between these two dates. A dataframe snippet is below.

```
new_df.show(5)
+-----------+----------+--------------+
|      PATID|   SVCDATE|next_diag_date|
+-----------+----------+--------------+
|12345655545|2012-02-13|    2012-02-13|
|12345655545|2012-02-13|    2012-02-13|
|12345655545|2012-02-13|    2012-02-27|
+-----------+----------+--------------+
```

Here is what I have tried so far:

- new_df.withColumn('SVCDATE2', (new_df.next_diag_date - new_df.SVCDATE)).show()
  Error: DateType does not support numeric operations

- new_df.withColumn('SVCDATE2', (new_df.next_diag_date - new_df.SVCDATE).days).show()
  Error: Can't extract value from (next_diag_date#927 - SVCDATE#377);

However, this simple Python code works fine with PySpark:

```
from datetime import date
d0 = date(2008, 8, 18)
d1 = date(2008, 9, 26)
delta = d0 - d1
print (d0 - d1).days  # -39
```

Any suggestions would be appreciated! Also, is there a way to add a new column to a dataframe without using a column expression (e.g. like in pandas or R: df$new_col = 'new col value')?

Thanks,
Dhaval

On Thu, Aug 20, 2015 at 8:18 AM, Dhaval Patel dhaval1...@gmail.com wrote: [...]
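On the side question (adding a column without a column expression): Spark DataFrames are immutable, so there is no pandas-style in-place assignment; the usual route is withColumn with a literal wrapped by pyspark.sql.functions.lit. A minimal sketch, reusing new_df from the snippet above:

```
from pyspark.sql.functions import lit

# withColumn returns a new DataFrame; lit() wraps a Python literal
# into a Column expression, so no column arithmetic is needed.
new_df2 = new_df.withColumn('new_col', lit('new col value'))
new_df2.show(5)
```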
Re: How to add a new column with date duration from 2 date columns in a dataframe
As Aram said, there are two options in Spark 1.4:

1) Use the HiveContext; then you get datediff from Hive:

df.selectExpr("datediff(d2, d1)")

2) Use a Python UDF:

```
from datetime import date
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

df = sqlContext.createDataFrame([(date(2008, 8, 18), date(2008, 9, 26))],
                                ['d1', 'd2'])
diff = udf(lambda a, b: (a - b).days, IntegerType())
df.select(diff(df.d1, df.d2)).show()
+-----------------------+
|PythonUDF#lambda(d1,d2)|
+-----------------------+
|                    -39|
+-----------------------+
```

On Thu, Aug 20, 2015 at 7:45 AM, Aram Mkrtchyan aram.mkrtchyan...@gmail.com wrote: [...]
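For completeness, a minimal end-to-end sketch of option 1, assuming a Spark 1.4 build with Hive support and an existing SparkContext sc:

```
from datetime import date
from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)  # requires Spark built with Hive support
df = sqlContext.createDataFrame([(date(2008, 8, 18), date(2008, 9, 26))],
                                ['d1', 'd2'])
# Hive's datediff(end, start) counts days from start to end
df.selectExpr("datediff(d2, d1) AS diff").show()  # diff = 39
```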
Re: How to add a new column with date duration from 2 date columns in a dataframe
One more update on this question: I am using Spark 1.4.1. I was just reading the documentation for Spark 1.5 (still in development) and I think there will be a new function, *datediff*, that will solve the issue. So please let me know if there is any work-around until Spark 1.5 is out :).

pyspark.sql.functions.datediff(end, start)
http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-preview-20150812-docs/api/python/pyspark.sql.html#pyspark.sql.functions.datediff

Returns the number of days from start to end.

```
>>> df = sqlContext.createDataFrame([('2015-04-08', '2015-05-10')], ['d1', 'd2'])
>>> df.select(datediff(df.d2, df.d1).alias('diff')).collect()
[Row(diff=32)]
```

New in version 1.5.

On Thu, Aug 20, 2015 at 8:26 AM, Dhaval Patel dhaval1...@gmail.com wrote: [...]
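Applied to the columns from the original question, the 1.5 function would reduce the whole thing to one line; a sketch, assuming the new_df from earlier in the thread:

```
from pyspark.sql.functions import datediff  # available from Spark 1.5

# datediff(end, start) yields end - start in days as an integer column
new_df.withColumn('SVCDATE2',
                  datediff(new_df.next_diag_date, new_df.SVCDATE)).show(5)
```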
Re: How to add a new column with date duration from 2 date columns in a dataframe
Hi, hope this will help you:

```
import org.apache.spark.sql.functions._
import sqlContext.implicits._
import java.sql.Timestamp
import org.joda.time.{DateTime, Days}

val df = sc.parallelize(Array((date1, date2))).toDF("day1", "day2")
val dateDiff = udf[Int, Timestamp, Timestamp]((value1, value2) =>
  Days.daysBetween(new DateTime(value2.getTime), new DateTime(value1.getTime)).getDays)
df.withColumn("diff", dateDiff(df("day2"), df("day1"))).show()
```

Or you can write a SQL query using HiveQL's datediff function:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

On Thu, Aug 20, 2015 at 4:57 PM, Dhaval Patel dhaval1...@gmail.com wrote: [...]