Hi,
For test purposes I am reading a simple CSV file as follows:
val df = sqlContext.read.format("com.databricks.spark.csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .load("/data/stg/table2")
df: org.apache.spark.sql.DataFrame = [Invoice Number: string, Payment date: string, Net: string, VAT: string, Total: string]
For this work I am interested in rows where the "Payment date" column is more than 6 months old as of today.
The data in that column is stored in the following format:
scala> df.select("Payment date").take(2)
res40: Array[org.apache.spark.sql.Row] = Array([10/02/2014], [17/02/2014])
i.e. stored as 'dd/MM/yyyy' strings.
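To do any date arithmetic on it, I assume the column first needs parsing into a proper DateType; a minimal sketch of that, assuming the Spark 1.5+ built-in functions (dfWithDate and payment_dt are names I have made up here):

import org.apache.spark.sql.functions.{col, from_unixtime, to_date, unix_timestamp}

// Parse the 'dd/MM/yyyy' strings into a DateType column.
// (A plain string comparison on dd/MM/yyyy strings does not order
// chronologically, so date functions need a real date type.)
val dfWithDate = df.withColumn(
  "payment_dt",
  to_date(from_unixtime(unix_timestamp(col("Payment date"), "dd/MM/yyyy"))))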
I get the current date as follows:
scala> val today = sqlContext.sql("SELECT from_unixtime(unix_timestamp(), 'dd/MM/yyyy')").collect.apply(0).getString(0)
today: String = 21/03/2016
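As an aside, I believe the same thing is available without the SQL round-trip via the built-in current_date() (a sketch, assuming Spark 1.5+):

import org.apache.spark.sql.functions.current_date

// current_date() yields a DateType column holding today's date,
// usable directly inside filter expressions (no collect needed).
val todayCol = current_date()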
So I want to filter the CSV file:
scala> df.filter(col("Payment date") < lit(today)).show(2)
+--------------+------------+---------+-----+---------+
|Invoice Number|Payment date|      Net|  VAT|    Total|
+--------------+------------+---------+-----+---------+
|           360|  10/02/2014|£2,500.00|£0.00|£2,500.00|
|           361|  17/02/2014|£2,500.00|£0.00|£2,500.00|
+--------------+------------+---------+-----+---------+
However, I want to use the datediff() function here, not just < today!
Obviously one can register the file as a table and use SQL on it. However, I want to see if there are other ways of doing this functionally, with the DataFrame API.
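Something along these lines is the shape I have in mind (a sketch only, assuming Spark 1.5+ built-ins and reusing the hypothetical dfWithDate/payment_dt names from the parsing sketch above):

import org.apache.spark.sql.functions.{add_months, col, current_date, datediff}

// Keep rows whose payment date is more than ~6 months (183 days) old:
dfWithDate.filter(datediff(current_date(), col("payment_dt")) > 183).show(2)

// Or, for a calendar-exact six months, compare against today's date
// shifted back by six months:
dfWithDate.filter(col("payment_dt") < add_months(current_date(), -6)).show(2)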
Thanks
Dr Mich Talebzadeh
LinkedIn
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com