Re: Inquiry about contributing code

2015-08-11 Thread Akhil Das
You can create a new issue and send a pull request for it, I think.

+ dev list

Thanks
Best Regards

On Tue, Aug 11, 2015 at 8:32 AM, Hyukjin Kwon gurwls...@gmail.com wrote:



Inquiry about contributing code

2015-08-10 Thread Hyukjin Kwon
Dear Sir / Madam,

I plan to contribute some code for passing filters down to a data source
during physical planning.

In more detail, I understand that when we want to push filter operations down
to a data source such as Parquet (so that HDFS blocks are filtered while they
are read, rather than being filtered in memory with Spark operations
afterwards), we need to implement

PrunedFilteredScan, PrunedScan or CatalystScan in the package
org.apache.spark.sql.sources.
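For reference, a minimal relation implementing PrunedFilteredScan might look
roughly like the sketch below (the class name, schema and behaviour are purely
illustrative, assuming the Spark 1.x data sources API):

  import org.apache.spark.rdd.RDD
  import org.apache.spark.sql.{Row, SQLContext}
  import org.apache.spark.sql.sources.{BaseRelation, Filter, PrunedFilteredScan}
  import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

  // Hypothetical relation, shown only to illustrate where pushed-down filters arrive.
  class MyRelation(override val sqlContext: SQLContext)
    extends BaseRelation with PrunedFilteredScan {

    // A fixed single-column schema, for illustration only.
    override def schema: StructType =
      StructType(StructField("field", IntegerType) :: Nil)

    // Spark calls this with the required columns and the predicates it managed to
    // translate into org.apache.spark.sql.sources.Filter objects; the source can use
    // them to skip data (e.g. Parquet row groups / HDFS blocks) at read time.
    override def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row] = {
      // A real implementation would push `filters` into the underlying reader here.
      sqlContext.sparkContext.emptyRDD[Row]
    }
  }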



For PrunedFilteredScan and PrunedScan, Spark passes filter objects defined in the
package org.apache.spark.sql.sources. These do not come directly from the query
parser; they are built by selectFilters() in
org.apache.spark.sql.sources.DataSourceStrategy.

It looks like not all of the filters (or rather, the raw Catalyst expressions)
are passed to the function below for PrunedFilteredScan and PrunedScan:

def buildScan(requiredColumns: Array[String], filters: Array[Filter]): RDD[Row]

The filters passed here are the ones defined in the package
org.apache.spark.sql.sources.
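For example, for a query with WHERE field = 1 AND field IS NOT NULL, the filters
arriving in buildScan would be simple case classes from that package, roughly
like this (values are illustrative):

  import org.apache.spark.sql.sources.{EqualTo, Filter, IsNotNull}

  // Roughly what `filters` contains for WHERE field = 1 AND field IS NOT NULL;
  // each top-level conjunct is translated into its own Filter.
  val pushed: Array[Filter] = Array(EqualTo("field", 1), IsNotNull("field"))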

On the other hand, the EqualNullSafe filter in the package
org.apache.spark.sql.catalyst.expressions is not passed down, even though it
looks possible to push it to data sources such as Parquet and JSON.
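To illustrate the gap, the translation performed by selectFilters() is
conceptually something like the sketch below (this is only my illustration, not
the actual code, and it covers just two cases):

  import org.apache.spark.sql.catalyst.expressions
  import org.apache.spark.sql.sources

  // Conceptual sketch of translating a Catalyst predicate into a source filter.
  def translate(predicate: expressions.Expression): Option[sources.Filter] = predicate match {
    case expressions.EqualTo(a: expressions.Attribute, expressions.Literal(value, _)) =>
      Some(sources.EqualTo(a.name, value))
    case expressions.EqualNullSafe(_: expressions.Attribute, _: expressions.Literal) =>
      None // no counterpart in org.apache.spark.sql.sources, so nothing is pushed down
    case _ => None
  }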



I understand that CatalystScan can take all the raw expressions coming from
the query planner. However, it is experimental, it requires a different
interface, and it is considered unstable (for reasons such as binary
compatibility).

As far as I know, Parquet also does not use this.
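
For comparison, CatalystScan exposes the raw Catalyst expressions directly; as
far as I understand, its scan method has the following shape:

  // org.apache.spark.sql.sources.CatalystScan (experimental)
  def buildScan(requiredColumns: Seq[Attribute], filters: Seq[Expression]): RDD[Row]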



In general, this can be an issue when a user sends queries such as

1.

SELECT *
FROM table
WHERE field = 1;


2.

SELECT *
FROM table
WHERE field <=> 1;


The second query, which uses the null-safe equality operator <=>, can be hugely
slow because the data from the source RDD is not filtered at the source, causing
large network traffic.
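
Concretely, with a PrunedFilteredScan source the two queries would reach
buildScan roughly as follows (illustrative only):

  // Query 1: WHERE field = 1 -> the predicate is translated and pushed down:
  //   buildScan(Array("field"), Array(EqualTo("field", 1)))
  //
  // Query 2: WHERE field <=> 1 -> EqualNullSafe has no sources.Filter counterpart,
  //   so nothing is pushed down and the source has to return all of its rows:
  //   buildScan(Array("field"), Array.empty)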



Also, I could not find an appropriate issue for this (except for
https://issues.apache.org/jira/browse/SPARK-8747, which says that the binary
type is now supported).

Accordingly, I would like to file an issue for this and open a pull request with my code.


Could you please comment on this?

Thanks.