Hi,
This is my first message to the Apache Spark digest.
In a custom data source reader I am implementing, I noticed that I do not
receive pushdown filters for columns of ShortType, ByteType, or BooleanType.
I do receive filters for IntegerType, LongType, FloatType, DoubleType,
DateType, and TimestampType.
The filters arrive via the pushFilters override, which is part of the
SupportsPushDownFilters trait.
The following is a session to demonstrate the behaviour. Note that I placed:
filters.foreach { println }
at the top of the pushFilters override.
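For context, the contract I am implementing splits the incoming filters into
those the source can evaluate and a residual that Spark re-applies itself. A
toy sketch of that split (these are stand-in classes for illustration only,
not Spark's org.apache.spark.sql.sources.Filter hierarchy or my actual
reader):

```scala
// Stand-in filter types, mimicking Spark's source filter case classes.
sealed trait Filter
case class LessThan(attribute: String, value: Any) extends Filter

// Toy model of the SupportsPushDownFilters contract: pushFilters keeps
// what the source can handle and returns the rest for Spark to evaluate;
// pushedFilters reports what was accepted.
class ToyReader(supportedAttrs: Set[String]) {
  private var pushed: Array[Filter] = Array.empty

  def pushFilters(filters: Array[Filter]): Array[Filter] = {
    val (supported, residual) = filters.partition {
      case LessThan(a, _) => supportedAttrs(a)
      case _              => false
    }
    pushed = supported
    residual // filters Spark must still apply on top of the scan
  }

  def pushedFilters(): Array[Filter] = pushed
}
```

So a filter Spark never passes to pushFilters at all (as with col1 below)
cannot be pushed, regardless of what the reader supports.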
Spark context Web UI available at http://192.168.20.114:4040
Spark context available as 'sc' (master = local[4], app id =
local-1533638354996).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.3.1
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val df = spark.read.format("myds").
option("port", 5000).
option("host", "localhost").
option("numPartitions", "1").
option("function","spartanQuery").load()
df: org.apache.spark.sql.DataFrame = [col1: smallint, col2: bigint ... 1 more
field]
scala> df.schema
res0: org.apache.spark.sql.types.StructType =
StructType(StructField(col1,ShortType,false),
StructField(col2,LongType,false), StructField(col3,DoubleType,false))
scala> df.createOrReplaceTempView("dft")
scala> spark.sql("select * from dft").show
+----+----+-------------------+
|col1|col2| col3|
+----+----+-------------------+
| 0| 0|0.36884380085393786|
| 1| 100| 0.5903338296338916|
| 2| 200|0.19292618241161108|
| 3| 300|0.27678317041136324|
| 4| 400|0.35814784304238856|
| 5| 500|0.19823945895768702|
| 6| 600| 0.4533605803735554|
| 7| 700|0.15147616644389927|
| 8| 800|0.09802114544436336|
| 9| 900|0.31370880361646414|
+----+----+-------------------+
scala> spark.sql("select * from dft where col1<5 and col2<500 and col3<.5").show
LessThan(col2,500)
LessThan(col3,0.5)
+----+----+-------------------+
|col1|col2| col3|
+----+----+-------------------+
| 0| 0|0.36884380085393786|
| 2| 200|0.19292618241161108|
| 3| 300|0.27678317041136324|
| 4| 400|0.35814784304238856|
+----+----+-------------------+
Note the absence of a filter for col1 (ShortType). Is this a shortcoming in
Spark, or am I doing something wrong? If it is a shortcoming, could the Spark
team consider adding the remaining data types to the set eligible for filter
pushdown?
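One guess on my part: since the literal 5 is an IntegerType, the analyzer may
wrap col1 in a cast to int, and a cast-wrapped column may no longer match the
patterns that translate catalyst predicates into source filters. Comparing the
extended plans of the two queries below (against the dft view from the session
above) should show whether a cast is involved:

```
scala> spark.sql("select * from dft where col1 < 5").explain(true)
scala> spark.sql("select * from dft where col1 < cast(5 as smallint)").explain(true)
```

If the cast is the culprit, typing the literal to smallint might restore the
pushdown for col1, though that would only be a workaround.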
Thank you