Re: How to add a new column with date duration from 2 date columns in a dataframe

2015-08-26 Thread Dhaval Patel
Thanks Davies. HiveContext seems neat to use :)

On Thu, Aug 20, 2015 at 3:02 PM, Davies Liu dav...@databricks.com wrote:

 As Aram said, there are two options in Spark 1.4:

 1) Use the HiveContext; then you get datediff from Hive:
 df.selectExpr("datediff(d2, d1)")
 2) Use a Python UDF:
 ```
 from datetime import date
 from pyspark.sql.functions import udf
 from pyspark.sql.types import IntegerType

 df = sqlContext.createDataFrame([(date(2008, 8, 18), date(2008, 9, 26))],
                                 ['d1', 'd2'])
 diff = udf(lambda a, b: (a - b).days, IntegerType())
 df.select(diff(df.d1, df.d2)).show()
 +-----------------------+
 |PythonUDF#lambda(d1,d2)|
 +-----------------------+
 |                    -39|
 +-----------------------+
 ```

 On Thu, Aug 20, 2015 at 7:45 AM, Aram Mkrtchyan
 aram.mkrtchyan...@gmail.com wrote:
  Hi,
 
  hope this will help you
 
  import org.apache.spark.sql.functions._
  import sqlContext.implicits._
  import java.sql.Timestamp
  import org.joda.time.{DateTime, Days}

  val df = sc.parallelize(Array((date1, date2))).toDF("day1", "day2")

  val dateDiff = udf[Long, Timestamp, Timestamp]((value1, value2) =>
    Days.daysBetween(new DateTime(value2.getTime),
      new DateTime(value1.getTime)).getDays)
  df.withColumn("diff", dateDiff(df("day2"), df("day1"))).show()
 
  Or you can write a SQL query using HiveQL's datediff function:
  https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
 
  On Thu, Aug 20, 2015 at 4:57 PM, Dhaval Patel dhaval1...@gmail.com
 wrote:
 
  More updates on this question... I am using Spark 1.4.1.

  I was just reading the documentation for Spark 1.5 (still in development),
  and I think there will be a new function *datediff* that will solve the
  issue. So please let me know if there is any workaround until Spark 1.5
  is out :).
 
  pyspark.sql.functions.datediff(end, start)[source]

  Returns the number of days from start to end.

  >>> df = sqlContext.createDataFrame([('2015-04-08','2015-05-10')], ['d1', 'd2'])
  >>> df.select(datediff(df.d2, df.d1).alias('diff')).collect()
  [Row(diff=32)]

  New in version 1.5.
 
 
  On Thu, Aug 20, 2015 at 8:26 AM, Dhaval Patel dhaval1...@gmail.com
  wrote:
 
  Apologies, sent too early accidentally. Actual message is below
  
 
   A dataframe has 2 date columns (datetime type), and I would like to add
   another column containing the difference between these two dates. A
   dataframe snippet is below.

   new_df.show(5)
   +-----------+----------+--------------+
   |      PATID|   SVCDATE|next_diag_date|
   +-----------+----------+--------------+
   |12345655545|2012-02-13|    2012-02-13|
   |12345655545|2012-02-13|    2012-02-13|
   |12345655545|2012-02-13|    2012-02-27|
   +-----------+----------+--------------+
 
 
 
  Here is what I have tried so far:
 
  - new_df.withColumn('SVCDATE2',
  (new_df.next_diag_date-new_df.SVCDATE)).show()
  Error: DateType does not support numeric operations
 
  - new_df.withColumn('SVCDATE2',
  (new_df.next_diag_date-new_df.SVCDATE).days).show()
  Error: Can't extract value from (next_diag_date#927 - SVCDATE#377);
 
 
   However, this simple Python code works fine with PySpark:
 
  from datetime import date
  d0 = date(2008, 8, 18)
  d1 = date(2008, 9, 26)
  delta = d0 - d1
  print (d0 - d1).days
 
  # -39
 
 
   Any suggestions would be appreciated! Also, is there a way to add a new
   column to a dataframe without using a column expression (e.g. like in
   pandas or R: df$new_col = 'new col value')?
 
 
  Thanks,
  Dhaval
 
 
 
  On Thu, Aug 20, 2015 at 8:18 AM, Dhaval Patel dhaval1...@gmail.com
  wrote:
 
  new_df.withColumn('SVCDATE2',
  (new_df.next_diag_date-new_df.SVCDATE).days).show()
 
   +-----------+----------+--------------+
   |      PATID|   SVCDATE|next_diag_date|
   +-----------+----------+--------------+
   |12345655545|2012-02-13|    2012-02-13|
   |12345655545|2012-02-13|    2012-02-13|
   |12345655545|2012-02-13|    2012-02-27|
   +-----------+----------+--------------+
 
 
 
 



Re: How to add a new column with date duration from 2 date columns in a dataframe

2015-08-20 Thread Dhaval Patel
Apologies, sent too early accidentally. Actual message is below


A dataframe has 2 date columns (datetime type), and I would like to add
another column containing the difference between these two dates. A
dataframe snippet is below.

new_df.show(5)
+-----------+----------+--------------+
|      PATID|   SVCDATE|next_diag_date|
+-----------+----------+--------------+
|12345655545|2012-02-13|    2012-02-13|
|12345655545|2012-02-13|    2012-02-13|
|12345655545|2012-02-13|    2012-02-27|
+-----------+----------+--------------+



Here is what I have tried so far:

- new_df.withColumn('SVCDATE2',
(new_df.next_diag_date-new_df.SVCDATE)).show()
Error: DateType does not support numeric operations

- new_df.withColumn('SVCDATE2',
(new_df.next_diag_date-new_df.SVCDATE).days).show()
Error: Can't extract value from (next_diag_date#927 - SVCDATE#377);


However, this simple Python code works fine with PySpark:

from datetime import date
d0 = date(2008, 8, 18)
d1 = date(2008, 9, 26)
delta = d0 - d1
print (d0 - d1).days

# -39


Any suggestions would be appreciated! Also, is there a way to add a new
column to a dataframe without using a column expression (e.g. like in
pandas or R: df$new_col = 'new col value')?
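
The closest I have found so far is wrapping a constant with lit from
pyspark.sql.functions. A minimal, untested sketch, assuming lit turns a
constant into a Column that withColumn accepts:

```
from pyspark.sql.functions import lit

# lit() wraps a Python constant into a Column expression, so withColumn
# can attach the same value to every row
new_df = new_df.withColumn('new_col', lit('new col value'))
```

That still goes through a column expression, though, so a more direct
pandas-style assignment would be nice.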


Thanks,
Dhaval



On Thu, Aug 20, 2015 at 8:18 AM, Dhaval Patel dhaval1...@gmail.com wrote:

 new_df.withColumn('SVCDATE2',
 (new_df.next_diag_date-new_df.SVCDATE).days).show()

 +-----------+----------+--------------+
 |      PATID|   SVCDATE|next_diag_date|
 +-----------+----------+--------------+
 |12345655545|2012-02-13|    2012-02-13|
 |12345655545|2012-02-13|    2012-02-13|
 |12345655545|2012-02-13|    2012-02-27|
 +-----------+----------+--------------+



Re: How to add a new column with date duration from 2 date columns in a dataframe

2015-08-20 Thread Davies Liu
As Aram said, there are two options in Spark 1.4:

1) Use the HiveContext; then you get datediff from Hive:
df.selectExpr("datediff(d2, d1)")
2) Use a Python UDF:
```
from datetime import date
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

df = sqlContext.createDataFrame([(date(2008, 8, 18), date(2008, 9, 26))],
                                ['d1', 'd2'])
diff = udf(lambda a, b: (a - b).days, IntegerType())
df.select(diff(df.d1, df.d2)).show()
+-----------------------+
|PythonUDF#lambda(d1,d2)|
+-----------------------+
|                    -39|
+-----------------------+
```
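
For option 1, a minimal end-to-end sketch (assuming a Hive-enabled build,
where sc is the running SparkContext; in 1.4 the HiveContext is created
from it):

```
from datetime import date
from pyspark.sql import HiveContext

sqlContext = HiveContext(sc)  # exposes Hive UDFs such as datediff

df = sqlContext.createDataFrame([(date(2008, 8, 18), date(2008, 9, 26))],
                                ['d1', 'd2'])
# Hive's datediff(end, start) returns the number of days from start to end
df.selectExpr("datediff(d2, d1)").show()
```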

On Thu, Aug 20, 2015 at 7:45 AM, Aram Mkrtchyan
aram.mkrtchyan...@gmail.com wrote:
 Hi,

 hope this will help you

 import org.apache.spark.sql.functions._
 import sqlContext.implicits._
 import java.sql.Timestamp
 import org.joda.time.{DateTime, Days}

 val df = sc.parallelize(Array((date1, date2))).toDF("day1", "day2")

 val dateDiff = udf[Long, Timestamp, Timestamp]((value1, value2) =>
   Days.daysBetween(new DateTime(value2.getTime),
     new DateTime(value1.getTime)).getDays)
 df.withColumn("diff", dateDiff(df("day2"), df("day1"))).show()

 Or you can write a SQL query using HiveQL's datediff function:
 https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

 On Thu, Aug 20, 2015 at 4:57 PM, Dhaval Patel dhaval1...@gmail.com wrote:

 More updates on this question... I am using Spark 1.4.1.

 I was just reading the documentation for Spark 1.5 (still in development),
 and I think there will be a new function *datediff* that will solve the
 issue. So please let me know if there is any workaround until Spark 1.5
 is out :).

 pyspark.sql.functions.datediff(end, start)[source]

 Returns the number of days from start to end.

 >>> df = sqlContext.createDataFrame([('2015-04-08','2015-05-10')], ['d1', 'd2'])
 >>> df.select(datediff(df.d2, df.d1).alias('diff')).collect()
 [Row(diff=32)]

 New in version 1.5.


 On Thu, Aug 20, 2015 at 8:26 AM, Dhaval Patel dhaval1...@gmail.com
 wrote:

 Apologies, sent too early accidentally. Actual message is below
 

 A dataframe has 2 date columns (datetime type), and I would like to add
 another column containing the difference between these two dates. A
 dataframe snippet is below.

 new_df.show(5)
 +-----------+----------+--------------+
 |      PATID|   SVCDATE|next_diag_date|
 +-----------+----------+--------------+
 |12345655545|2012-02-13|    2012-02-13|
 |12345655545|2012-02-13|    2012-02-13|
 |12345655545|2012-02-13|    2012-02-27|
 +-----------+----------+--------------+



 Here is what I have tried so far:

 - new_df.withColumn('SVCDATE2',
 (new_df.next_diag_date-new_df.SVCDATE)).show()
 Error: DateType does not support numeric operations

 - new_df.withColumn('SVCDATE2',
 (new_df.next_diag_date-new_df.SVCDATE).days).show()
 Error: Can't extract value from (next_diag_date#927 - SVCDATE#377);


 However, this simple Python code works fine with PySpark:

 from datetime import date
 d0 = date(2008, 8, 18)
 d1 = date(2008, 9, 26)
 delta = d0 - d1
 print (d0 - d1).days

 # -39


 Any suggestions would be appreciated! Also, is there a way to add a new
 column to a dataframe without using a column expression (e.g. like in
 pandas or R: df$new_col = 'new col value')?


 Thanks,
 Dhaval



 On Thu, Aug 20, 2015 at 8:18 AM, Dhaval Patel dhaval1...@gmail.com
 wrote:

 new_df.withColumn('SVCDATE2',
 (new_df.next_diag_date-new_df.SVCDATE).days).show()

 +-----------+----------+--------------+
 |      PATID|   SVCDATE|next_diag_date|
 +-----------+----------+--------------+
 |12345655545|2012-02-13|    2012-02-13|
 |12345655545|2012-02-13|    2012-02-13|
 |12345655545|2012-02-13|    2012-02-27|
 +-----------+----------+--------------+








Re: How to add a new column with date duration from 2 date columns in a dataframe

2015-08-20 Thread Dhaval Patel
More updates on this question... I am using Spark 1.4.1.

I was just reading the documentation for Spark 1.5 (still in development),
and I think there will be a new function *datediff* that will solve the
issue. So please let me know if there is any workaround until Spark 1.5 is
out :).

pyspark.sql.functions.datediff(end, start)[source]
http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-preview-20150812-docs/api/python/pyspark.sql.html#pyspark.sql.functions.datediff

Returns the number of days from start to end.

>>> df = sqlContext.createDataFrame([('2015-04-08','2015-05-10')], ['d1', 'd2'])
>>> df.select(datediff(df.d2, df.d1).alias('diff')).collect()
[Row(diff=32)]

New in version 1.5.

On Thu, Aug 20, 2015 at 8:26 AM, Dhaval Patel dhaval1...@gmail.com wrote:

 Apologies, sent too early accidentally. Actual message is below
 

 A dataframe has 2 date columns (datetime type), and I would like to add
 another column containing the difference between these two dates. A
 dataframe snippet is below.

 new_df.show(5)
 +-----------+----------+--------------+
 |      PATID|   SVCDATE|next_diag_date|
 +-----------+----------+--------------+
 |12345655545|2012-02-13|    2012-02-13|
 |12345655545|2012-02-13|    2012-02-13|
 |12345655545|2012-02-13|    2012-02-27|
 +-----------+----------+--------------+



 Here is what I have tried so far:

 - new_df.withColumn('SVCDATE2',
 (new_df.next_diag_date-new_df.SVCDATE)).show()
 Error: DateType does not support numeric operations

 - new_df.withColumn('SVCDATE2',
 (new_df.next_diag_date-new_df.SVCDATE).days).show()
 Error: Can't extract value from (next_diag_date#927 - SVCDATE#377);


 However, this simple Python code works fine with PySpark:

 from datetime import date
 d0 = date(2008, 8, 18)
 d1 = date(2008, 9, 26)
 delta = d0 - d1
 print (d0 - d1).days

 # -39


 Any suggestions would be appreciated! Also, is there a way to add a new
 column to a dataframe without using a column expression (e.g. like in
 pandas or R: df$new_col = 'new col value')?


 Thanks,
 Dhaval



 On Thu, Aug 20, 2015 at 8:18 AM, Dhaval Patel dhaval1...@gmail.com
 wrote:

 new_df.withColumn('SVCDATE2',
 (new_df.next_diag_date-new_df.SVCDATE).days).show()

 +-----------+----------+--------------+
 |      PATID|   SVCDATE|next_diag_date|
 +-----------+----------+--------------+
 |12345655545|2012-02-13|    2012-02-13|
 |12345655545|2012-02-13|    2012-02-13|
 |12345655545|2012-02-13|    2012-02-27|
 +-----------+----------+--------------+





Re: How to add a new column with date duration from 2 date columns in a dataframe

2015-08-20 Thread Aram Mkrtchyan
Hi,

Hope this will help you:

import org.apache.spark.sql.functions._
import sqlContext.implicits._
import java.sql.Timestamp
import org.joda.time.{DateTime, Days}

// date1 and date2 are existing java.sql.Timestamp values
val df = sc.parallelize(Array((date1, date2))).toDF("day1", "day2")

val dateDiff = udf[Long, Timestamp, Timestamp]((value1, value2) =>
  Days.daysBetween(new DateTime(value2.getTime),
    new DateTime(value1.getTime)).getDays)
df.withColumn("diff", dateDiff(df("day2"), df("day1"))).show()

Or you can write a SQL query using HiveQL's datediff function:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
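
A minimal PySpark sketch of that route (assuming a HiveContext bound to
sqlContext and the column names from the original question;
registerTempTable is the 1.4 API for exposing a DataFrame to SQL):

```
# make the DataFrame visible to SQL, then use Hive's datediff(end, start)
new_df.registerTempTable("visits")
sqlContext.sql("""
    SELECT PATID, SVCDATE, next_diag_date,
           datediff(next_diag_date, SVCDATE) AS diff_days
    FROM visits
""").show()
```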

On Thu, Aug 20, 2015 at 4:57 PM, Dhaval Patel dhaval1...@gmail.com wrote:

 More updates on this question... I am using Spark 1.4.1.

 I was just reading the documentation for Spark 1.5 (still in development),
 and I think there will be a new function *datediff* that will solve the
 issue. So please let me know if there is any workaround until Spark 1.5
 is out :).

 pyspark.sql.functions.datediff(end, start)[source]
 http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-preview-20150812-docs/api/python/pyspark.sql.html#pyspark.sql.functions.datediff

 Returns the number of days from start to end.

 >>> df = sqlContext.createDataFrame([('2015-04-08','2015-05-10')], ['d1', 'd2'])
 >>> df.select(datediff(df.d2, df.d1).alias('diff')).collect()
 [Row(diff=32)]

 New in version 1.5.

 On Thu, Aug 20, 2015 at 8:26 AM, Dhaval Patel dhaval1...@gmail.com
 wrote:

 Apologies, sent too early accidentally. Actual message is below
 

 A dataframe has 2 date columns (datetime type), and I would like to add
 another column containing the difference between these two dates. A
 dataframe snippet is below.

 new_df.show(5)
 +-----------+----------+--------------+
 |      PATID|   SVCDATE|next_diag_date|
 +-----------+----------+--------------+
 |12345655545|2012-02-13|    2012-02-13|
 |12345655545|2012-02-13|    2012-02-13|
 |12345655545|2012-02-13|    2012-02-27|
 +-----------+----------+--------------+



 Here is what I have tried so far:

 - new_df.withColumn('SVCDATE2',
 (new_df.next_diag_date-new_df.SVCDATE)).show()
 Error: DateType does not support numeric operations

 - new_df.withColumn('SVCDATE2',
 (new_df.next_diag_date-new_df.SVCDATE).days).show()
 Error: Can't extract value from (next_diag_date#927 - SVCDATE#377);


 However, this simple Python code works fine with PySpark:

 from datetime import date
 d0 = date(2008, 8, 18)
 d1 = date(2008, 9, 26)
 delta = d0 - d1
 print (d0 - d1).days

 # -39


 Any suggestions would be appreciated! Also, is there a way to add a new
 column to a dataframe without using a column expression (e.g. like in
 pandas or R: df$new_col = 'new col value')?


 Thanks,
 Dhaval



 On Thu, Aug 20, 2015 at 8:18 AM, Dhaval Patel dhaval1...@gmail.com
 wrote:

 new_df.withColumn('SVCDATE2',
 (new_df.next_diag_date-new_df.SVCDATE).days).show()

 +-----------+----------+--------------+
 |      PATID|   SVCDATE|next_diag_date|
 +-----------+----------+--------------+
 |12345655545|2012-02-13|    2012-02-13|
 |12345655545|2012-02-13|    2012-02-13|
 |12345655545|2012-02-13|    2012-02-27|
 +-----------+----------+--------------+