I think you can create a new issue and send a pull request for it.
+ dev list
Thanks
Best Regards
On Tue, Aug 11, 2015 at 8:32 AM, Hyukjin Kwon gurwls...@gmail.com wrote:
Dear Sir / Madam,
I plan to contribute some code for passing filters down to a data source
during physical planning.
In more detail, as I understand it, when we want filtering to happen at the
data source, such as Parquet (that is, filtering while the HDFS blocks are
actually read, rather than filtering in memory with Spark operations
afterwards), we need to implement PrunedFilteredScan, PrunedScan or
CatalystScan in package
org.apache.spark.sql.sources.
For PrunedFilteredScan, Spark passes filter objects from package
org.apache.spark.sql.sources (PrunedScan receives only the required columns).
These objects do not give direct access to the query parser; they are built
by selectFilters() in
org.apache.spark.sql.sources.DataSourceStrategy.
It looks like not all of the filters (or rather, raw expressions) are passed
to the function below in PrunedFilteredScan.
def buildScan(requiredColumns: Array[String], filters: Array[Filter]):
RDD[Row]
The filters passed here are defined in package
org.apache.spark.sql.sources.
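As a rough sketch of what a relation could do with these filter objects
(assuming Spark SQL is on the classpath; FilterSketch and toSourcePredicates
are hypothetical names used only for illustration, not Spark code), buildScan
might translate the filters it understands and skip the rest:

```scala
import org.apache.spark.sql.sources.{EqualTo, Filter, GreaterThan, LessThan}

object FilterSketch {
  // Translate the subset of filters this hypothetical source understands
  // into predicate strings. Filters that match no case are dropped here,
  // so the source would return those rows unfiltered.
  def toSourcePredicates(filters: Array[Filter]): Seq[String] =
    filters.toSeq.collect {
      case EqualTo(attr, value)     => s"$attr = $value"
      case GreaterThan(attr, value) => s"$attr > $value"
      case LessThan(attr, value)    => s"$attr < $value"
    }
}
```

A predicate that never arrives in the filters array cannot be handled this
way at all, which is the case I am concerned about.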
However, the EqualNullSafe filter in package
org.apache.spark.sql.catalyst.expressions is not passed, even though it looks
possible to pass it to data sources such as Parquet and JSON.
I understand that CatalystScan can receive all the raw expressions, with
access to the query planner. However, it is experimental and needs a
different interface (and it is unstable for reasons such as binary
compatibility).
As far as I know, Parquet also does not use this.
In general, this can be an issue when a user sends queries such as
1.
SELECT *
FROM table
WHERE field = 1;
2.
SELECT *
FROM table
WHERE field <=> 1;
The second query can be much slower because of heavy network traffic from
unfiltered data coming from the source RDD.
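To make the difference concrete, here is a minimal sketch of the two equality
semantics, plain SQL equality (EqualTo) and null-safe equality
(EqualNullSafe). It is plain Scala rather than Spark code, it models NULL as
None, and all names in it are hypothetical:

```scala
object NullSafeEqualitySketch {
  // Plain SQL equality: the result is unknown (None) when either side is NULL.
  def sqlEq(a: Option[Int], b: Option[Int]): Option[Boolean] =
    for (x <- a; y <- b) yield x == y

  // Null-safe equality: always true or false; two NULLs compare as equal.
  def sqlNullSafeEq(a: Option[Int], b: Option[Int]): Boolean = a == b

  def main(args: Array[String]): Unit = {
    val field: Seq[Option[Int]] = Seq(Some(1), Some(2), None)
    // A WHERE clause keeps a row only when its predicate evaluates to true.
    val keptByEq = field.filter(v => sqlEq(v, Some(1)) == Some(true))
    val keptByNullSafeEq = field.filter(v => sqlNullSafeEq(v, Some(1)))
    println(keptByEq)
    println(keptByNullSafeEq)
  }
}
```

Against a non-null literal both predicates keep the same rows, so the cost
difference comes only from where the filtering happens: at the source, or in
Spark after every row has been transferred.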
Also, I could not find a proper issue for this (except for
https://issues.apache.org/jira/browse/SPARK-8747), which says it now
supports binary capability.
Accordingly, I would like to file this issue and open a pull request with my
code.
Could you please share any comments on this?
Thanks.