Hi, I'm trying to implement a folding function in Spark. It takes an input k and a data frame of ids and dates. For k=1 the result is just the data frame; for k=2 it consists of the min and max date for each id once and the rest twice; for k=3, the min and max once, min+1 and max-1 twice, and the rest three times; and so on.
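To make the intended multiplicities concrete, here is a toy pandas example (column names "id"/"thedate" taken from the post, the data itself is made up) showing that for k=2 the min and max dates of a group appear once and the interior dates twice:

```python
import pandas as pd

# One id with four sorted dates; fold with k=2.
df = pd.DataFrame({"id": [1, 1, 1, 1], "thedate": [1, 2, 3, 4]})
k = 2

# Same sliding-window concat as the pandas one-liner in the post.
folded = df.groupby("id", group_keys=False).apply(
    lambda x: pd.concat([x.iloc[i:i + k] for i in range(len(x.index) - k + 1)])
)

# Min and max dates appear once, interior dates twice.
print(folded["thedate"].value_counts().sort_index().tolist())  # → [1, 2, 2, 1]
```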
Code in Scala, with variable names changed:

val acctMinDates = df.groupBy("id").agg(min("thedate"))
val acctMaxDates = df.groupBy("id").agg(max("thedate"))
val acctDates = acctMinDates.join(acctMaxDates, "id").collect()

var filterString = ""
for (i <- 1 to k - 1) {
  if (i == 1) {
    for (aDate <- acctDates) {
      filterString = filterString + "(id = " + aDate(0) + " and thedate > " + aDate(1) + " and thedate < " + aDate(2) + ") or "
    }
    // drop the trailing " or "
    filterString = filterString.substring(0, filterString.size - 4)
  }
  df = df.unionAll(df.where(filterString))
}

The pandas/Python code this is attempting to translate:

df = pd.concat([df.groupby('id').apply(lambda x: pd.concat([x.iloc[i: i + k] for i in range(len(x.index) - k + 1)]))])

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Large-where-clause-StackOverflow-1-5-2-tp27544.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
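One way to sidestep building a giant where clause (the likely source of the StackOverflowError in the subject line) is to compute each row's repeat count directly instead of repeatedly unioning filtered copies: if a group has n rows sorted by date, row i (0-based) falls in min(i + 1, n - i, k, n - k + 1) of the k-length windows. A minimal sketch of that formula in plain Python, assuming the dates within each id are already sorted (in Spark this count could be derived from a row_number window function, but that is not shown here):

```python
# Sketch: replicate each row by its window-membership count rather than
# unioning the data frame with filtered copies of itself k-1 times.
# Assumes `dates` holds one id's dates in sorted order.
def fold(dates, k):
    n = len(dates)
    out = []
    for i, d in enumerate(dates):
        # Row i lies in windows starting at max(0, i-k+1) .. min(i, n-k),
        # which simplifies to min(i + 1, n - i, k, n - k + 1) windows.
        out.extend([d] * min(i + 1, n - i, k, n - k + 1))
    return out

print(fold([1, 2, 3, 4, 5], 3))  # → [1, 2, 2, 3, 3, 3, 4, 4, 5]
```

This reproduces the description for k=3: min and max once, min+1 and max-1 twice, the rest three times, and the output length is always k * (n - k + 1), matching the pandas sliding-window concat.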