Re: Using Spark SQL for temporal data

Michael Armbrust Thu, 12 Feb 2015 22:56:37 -0800

>
> I haven't been paying close attention to the JIRA tickets for
> PrunedFilteredScan but I noticed some weird behavior around the filters
> being applied when OR expressions were used in the WHERE clause. From what
> I was seeing, it looks like it could be possible that the "start" and "end"
> ranges you are proposing to place in the WHERE clause could actually never
> be pushed down to the PrunedFilteredScan if there's an OR expression in
> there, like: (start > "2014-12-01" and end < "2015-02-12") or (....). I
> haven't done a unit test for this case yet, but I did file SPARK-5296
> because of the behavior I was seeing. I'm requiring a time range in the
> services I'm writing because without it, the full Accumulo table would be
> scanned, and that's no good.



Ah, I see.  Right now we only split up and push down conjunctive (AND)
predicates that can be expressed in the limited set of filters we support
so far.  We can easily add OR if it works for your use case.  It'll be up
to the data source, however, to recurse down the ORs and either pass
multiple time ranges to Accumulo or union multiple RDDs together to return
them.  Let's discuss more on the JIRA.
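
As a rough illustration of what "recursing down the ORs" could look like on
the data source side, here is a minimal sketch. The filter classes and the
time_ranges helper are hypothetical local stand-ins written for this
example, not Spark's API, and the "start" / "end" attribute names are taken
from the example query above. It assumes the bounds are ISO date strings,
which compare correctly as plain strings.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union


# Hypothetical stand-ins for pushed-down filter nodes (not Spark classes).
@dataclass(frozen=True)
class GreaterThan:
    attribute: str
    value: str


@dataclass(frozen=True)
class LessThan:
    attribute: str
    value: str


@dataclass(frozen=True)
class And:
    left: "Filter"
    right: "Filter"


@dataclass(frozen=True)
class Or:
    left: "Filter"
    right: "Filter"


Filter = Union[GreaterThan, LessThan, And, Or]

# A time range is (lower, upper); None means unbounded on that side.
Range = Tuple[Optional[str], Optional[str]]


def _tighter(a, b, pick):
    """Combine two optional bounds, keeping the more restrictive one."""
    if a is None:
        return b
    if b is None:
        return a
    return pick(a, b)


def time_ranges(f: Filter) -> List[Range]:
    """Return one scan range per OR branch of the filter tree."""
    if isinstance(f, GreaterThan) and f.attribute == "start":
        return [(f.value, None)]
    if isinstance(f, LessThan) and f.attribute == "end":
        return [(None, f.value)]
    if isinstance(f, Or):
        # Each side of an OR contributes its own ranges.
        return time_ranges(f.left) + time_ranges(f.right)
    if isinstance(f, And):
        # AND intersects every range on the left with every range on the right.
        return [
            (_tighter(lo1, lo2, max), _tighter(hi1, hi2, min))
            for lo1, hi1 in time_ranges(f.left)
            for lo2, hi2 in time_ranges(f.right)
        ]
    raise ValueError(f"unsupported filter: {f!r}")


where = Or(
    And(GreaterThan("start", "2014-12-01"), LessThan("end", "2015-02-12")),
    And(GreaterThan("start", "2013-01-01"), LessThan("end", "2013-06-30")),
)
print(time_ranges(where))
```

Each resulting range could then be handed to Accumulo as a separate scan
range, or turned into its own RDD and unioned with the others. Rejecting
any filter the sketch does not understand is a deliberate choice here:
silently ignoring it could lead to scanning the full table.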

> Are there any plans on making the CatalystScan public in the near future
> (possibly once SparkSQL reaches the proposed stability in 1.3?)


No, it'll remain public so people can experiment with it, but it is
unlikely it'll ever have the same stability guarantees that the Spark
public API does.  This is primarily due to its dependence on the whole
catalyst expression hierarchy.  Instead I'd like to add to the other scan
filters / interfaces that can provide useful information to the data
sources.
