[ 
https://issues.apache.org/jira/browse/SPARK-9814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-9814:
--------------------------------
    Summary: EqualNullSafe not passing to data sources  (was: EqualNotNull not 
passing to data sources)

> EqualNullSafe not passing to data sources
> -----------------------------------------
>
>                 Key: SPARK-9814
>                 URL: https://issues.apache.org/jira/browse/SPARK-9814
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Hyukjin Kwon
>            Assignee: Hyukjin Kwon
>            Priority: Minor
>             Fix For: 1.5.0
>
>
> When a data source (such as Parquet) filters data while reading from 
> HDFS (not from memory), the physical planning phase passes it filter objects 
> from {{org.apache.spark.sql.sources}}, which are built and selected 
> by {{selectFilters()}} in {{org.apache.spark.sql.sources.DataSourceStrategy}}.
> However, it does not pass the {{EqualNullSafe}} filter from 
> {{org.apache.spark.sql.catalyst.expressions}}, even though data sources 
> such as Parquet and JSON appear able to handle it. In more detail, 
> {{EqualNullSafe}} is not passed to {{buildScan()}} in 
> {{PrunedFilteredScan}} (shown below), 
> {code}
> def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]
> {code}
> even though the binary compatibility issue has been resolved 
> (https://issues.apache.org/jira/browse/SPARK-8747).
> I understand that {{CatalystScan}} can receive all the raw expressions by 
> accessing the query planner. However, it is experimental, requires a 
> different interface, and is unstable for reasons such as binary compatibility.
> In general, the following problem can occur.
> 1.
> {code:sql}
> SELECT * FROM table WHERE field = 1;
> {code}
>  
> 2. 
> {code:sql}
> SELECT * FROM table WHERE field <=> 1;
> {code}
> The second query can be far slower even though the two are functionally 
> almost identical, because unfiltered data from the source RDD can cause 
> heavy network traffic (among other costs).
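
The effect described above can be illustrated with a small standalone sketch. It does not use Spark: every type and function name below (PushdownSketch, translateWithoutNullSafe, translateWithNullSafe, scan, and so on) is a hypothetical stand-in for the Catalyst expressions, the filters in org.apache.spark.sql.sources, and a data source's buildScan(), and the real selectFilters() logic may differ. It assumes only that a predicate with no source-side counterpart is silently dropped from the pushed-down filters.

```scala
// Standalone sketch; none of these types are Spark's own classes.
object PushdownSketch {
  // A row maps column names to values; null stands in for SQL NULL.
  type Row = Map[String, Any]

  // Hypothetical, simplified Catalyst-side predicates.
  sealed trait Predicate
  final case class EqualTo(attribute: String, value: Any) extends Predicate
  final case class EqualNullSafe(attribute: String, value: Any) extends Predicate

  // Hypothetical stand-ins for the filters handed to a data source.
  sealed trait SourceFilter
  final case class SourceEqualTo(attribute: String, value: Any) extends SourceFilter
  final case class SourceEqualNullSafe(attribute: String, value: Any) extends SourceFilter

  // Translation that knows nothing about <=>: the predicate is dropped.
  def translateWithoutNullSafe(p: Predicate): Option[SourceFilter] = p match {
    case EqualTo(a, v) => Some(SourceEqualTo(a, v))
    case _             => None
  }

  // Translation that also forwards <=> to the source.
  def translateWithNullSafe(p: Predicate): Option[SourceFilter] = p match {
    case EqualTo(a, v)       => Some(SourceEqualTo(a, v))
    case EqualNullSafe(a, v) => Some(SourceEqualNullSafe(a, v))
  }

  // Source-side evaluation. "=" never keeps a row when either side is NULL;
  // "<=>" treats NULL as an ordinary value that equals only NULL.
  def matches(f: SourceFilter, row: Row): Boolean = f match {
    case SourceEqualTo(a, v) =>
      val actual = row.getOrElse(a, null)
      actual != null && v != null && actual == v
    case SourceEqualNullSafe(a, v) =>
      row.getOrElse(a, null) == v
  }

  // Stand-in for buildScan(): returns only rows passing every pushed filter.
  def scan(rows: Seq[Row], pushed: Seq[SourceFilter]): Seq[Row] =
    rows.filter(row => pushed.forall(f => matches(f, row)))

  val rows: Seq[Row] =
    Seq(Map("field" -> 1), Map("field" -> 2), Map("field" -> null))

  // SELECT * FROM table WHERE field <=> 1
  val query: Seq[Predicate] = Seq(EqualNullSafe("field", 1))

  // Number of rows that leave the source for a given translation.
  def rowsRead(translate: Predicate => Option[SourceFilter]): Int =
    scan(rows, query.flatMap(p => translate(p))).size

  def main(args: Array[String]): Unit = {
    println(s"rows read, <=> not pushed: ${rowsRead(translateWithoutNullSafe)}")
    println(s"rows read, <=> pushed:     ${rowsRead(translateWithNullSafe)}")
  }
}
```

In this toy model the source returns all three rows when the predicate is dropped and a single row when it is forwarded. The final query result is the same either way, since the engine re-applies the predicate, but the amount of data leaving the source is not.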



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
