Hi,

For test purposes I am reading in a simple CSV file as follows:

scala> val df = sqlContext.read.format("com.databricks.spark.csv").
     | option("inferSchema", "true").
     | option("header", "true").
     | load("/data/stg/table2")
df: org.apache.spark.sql.DataFrame = [Invoice Number: string, Payment date: string, Net: string, VAT: string, Total: string]

For this work I am interested in rows whose "Payment date" is more than 6 months older than today. The data in that column is stored as 'dd/MM/yyyy':

scala> df.select("Payment date").take(2)
res40: Array[org.apache.spark.sql.Row] = Array([10/02/2014], [17/02/2014])

I get the current date as:

scala> val today = sqlContext.sql("SELECT from_unixtime(unix_timestamp(), 'dd/MM/yyyy')").collect.apply(0).getString(0)
today: String = 21/03/2016

So I want to filter the CSV file:

scala> df.filter(col("Payment date") < lit(today)).show(2)
+--------------+------------+---------+-----+---------+
|Invoice Number|Payment date|      Net|  VAT|    Total|
+--------------+------------+---------+-----+---------+
|           360|  10/02/2014|£2,500.00|£0.00|£2,500.00|
|           361|  17/02/2014|£2,500.00|£0.00|£2,500.00|
+--------------+------------+---------+-----+---------+

However, I want to use the datediff() function here, not just `< today` (which, with the dates kept as 'dd/MM/yyyy' strings, is only a lexicographic comparison anyway). Obviously one can register the DataFrame as a table and use SQL on it. However, I want to see if there are other ways using the functional (DataFrame) API.

Thanks,

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com
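P.S. A sketch of the sort of thing I have in mind, assuming Spark 1.5+ where datediff(), to_date(), unix_timestamp() and current_date() are all available in org.apache.spark.sql.functions, and approximating "6 months" as 183 days:

```scala
import org.apache.spark.sql.functions._

// datediff() wants date/timestamp columns, not strings, so first parse the
// 'dd/MM/yyyy' string column into a proper DateType column.
val withDate = df.withColumn(
  "payment_dt",
  to_date(from_unixtime(unix_timestamp(col("Payment date"), "dd/MM/yyyy")))
)

// Keep rows whose payment date is more than ~6 months (183 days) before today.
val overSixMonths =
  withDate.filter(datediff(current_date(), col("payment_dt")) > 183)

overSixMonths.show(2)
```

The column name "payment_dt" and the 183-day cutoff are just placeholders for illustration.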