Thanks Silvio. The problem I have is that the string comparison does not work as expected.
Case in point:

val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2")

val current_date = sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy')").collect.apply(0).getString(0)

df.filter(lit(current_date) < col("Payment date")).select(lit(current_date).alias("current_date"), col("Payment date").alias("PaymentDate")).show(5)

It selects all the rows that are less than today's date (they are old):

+------------+-----------+
|current_date|PaymentDate|
+------------+-----------+
|  22/03/2016| 24/02/2014|
|  22/03/2016| 24/03/2014|
|  22/03/2016| 31/03/2015|
|  22/03/2016| 28/04/2014|
|  22/03/2016| 26/05/2014|
+------------+-----------+

I don't know why this comparison is failing. Maybe it is only comparing the leftmost characters?

Thanks

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

http://talebzadehmich.wordpress.com


On 22 March 2016 at 00:26, Silvio Fiorito <silvio.fior...@granturing.com> wrote:

> There's a months_between function you could use, as well:
>
> df.filter(months_between(current_date, $"Payment Date") > 6).show
>
> From: Mich Talebzadeh <mich.talebza...@gmail.com>
> Date: Monday, March 21, 2016 at 5:53 PM
> To: "user @spark" <user@spark.apache.org>
> Subject: Work out date column in CSV more than 6 months old (datediff or something)
>
> Hi,
>
> For test purposes I am reading in a simple csv file as follows:
>
> val df = sqlContext.read.format("com.databricks.spark.csv").option("inferSchema", "true").option("header", "true").load("/data/stg/table2")
> df: org.apache.spark.sql.DataFrame = [Invoice Number: string, Payment date: string, Net: string, VAT: string, Total: string]
>
> For this work I am interested in the column "Payment date" being more than 6 months old from today.
>
> Data is stored in the following format
> for that column:
>
> scala> df.select("Payment date").take(2)
> res40: Array[org.apache.spark.sql.Row] = Array([10/02/2014], [17/02/2014])
>
> stored as 'dd/MM/yyyy'.
>
> The current time I get as:
>
> scala> val today = sqlContext.sql("SELECT FROM_unixtime(unix_timestamp(), 'dd/MM/yyyy')").collect.apply(0).getString(0)
> today: String = 21/03/2016
>
> So I want to filter the csv file:
>
> scala> df.filter(col("Payment date") < lit(today)).show(2)
> +--------------+------------+---------+-----+---------+
> |Invoice Number|Payment date|      Net|  VAT|    Total|
> +--------------+------------+---------+-----+---------+
> |           360|  10/02/2014|?2,500.00|?0.00|?2,500.00|
> |           361|  17/02/2014|?2,500.00|?0.00|?2,500.00|
> +--------------+------------+---------+-----+---------+
>
> However, I want to use the datediff() function here, not just < today!
>
> Obviously one can store the file as a table and use SQL on it. However, I want to see if there are other ways using fp.
>
> Thanks
>
> Dr Mich Talebzadeh
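
[Editor's note] The comparison above is a plain lexicographic String comparison, so with 'dd/MM/yyyy' strings the day field dominates: "22/03/2016" < "24/02/2014" holds simply because '2' < '4' at the second character, regardless of the years. The following is a minimal plain-Scala sketch (no Spark, using java.time from Java 8) illustrating the pitfall and what happens once the strings are parsed into real dates; names like DateCompareDemo are illustrative only:

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

object DateCompareDemo extends App {
  // String comparison is character by character, so with dd/MM/yyyy
  // strings the *day* is compared first, not the year.
  val today   = "22/03/2016"
  val payment = "24/02/2014"
  println(today < payment)  // true: '2' < '4' in the second character

  // Parsing to real dates fixes the ordering.
  val fmt = DateTimeFormatter.ofPattern("dd/MM/yyyy")
  val todayDate   = LocalDate.parse(today, fmt)
  val paymentDate = LocalDate.parse(payment, fmt)
  println(todayDate.isAfter(paymentDate))  // true: 2016-03-22 is later

  // Whole months between the two dates, the analogue of Spark's
  // months_between() that Silvio suggested.
  println(ChronoUnit.MONTHS.between(paymentDate, todayDate))  // 24
}
```

In Spark itself the analogous fix would be to convert the column before comparing, e.g. (assuming Spark 1.5+, where unix_timestamp, to_date, current_date and months_between are all available): df.filter(months_between(current_date(), to_date(unix_timestamp(col("Payment date"), "dd/MM/yyyy").cast("timestamp"))) > 6).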