Hi,
This is my first message to the Apache Spark digest.
In a custom data source reader I am implementing, I noticed that I do not
receive pushdown filters for columns of ShortType, ByteType, or BooleanType.
I do receive filters for IntegerType, LongType, FloatType, DoubleType,
DateType, and TimestampType.
The filters arrive via the pushFilters override, which is part of the
SupportsPushDownFilters trait.
The following is a session to demonstrate the behaviour. Note that I placed:
filters.foreach { println }
at the top of the pushFilters override.
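For context, the contract I am implementing splits the incoming filters into
those the source can evaluate and a residual that Spark re-applies itself. A
toy sketch of that split (these are stand-in classes for illustration only,
not Spark's org.apache.spark.sql.sources.Filter hierarchy or my actual
reader):

```scala
// Stand-in filter types, mimicking Spark's source filter case classes.
sealed trait Filter
case class LessThan(attribute: String, value: Any) extends Filter

// Toy model of the SupportsPushDownFilters contract: pushFilters keeps
// what the source can handle and returns the rest for Spark to evaluate;
// pushedFilters reports what was accepted.
class ToyReader(supportedAttrs: Set[String]) {
  private var pushed: Array[Filter] = Array.empty

  def pushFilters(filters: Array[Filter]): Array[Filter] = {
    val (supported, residual) = filters.partition {
      case LessThan(a, _) => supportedAttrs(a)
      case _              => false
    }
    pushed = supported
    residual // filters Spark must still apply on top of the scan
  }

  def pushedFilters(): Array[Filter] = pushed
}
```

So a filter Spark never passes to pushFilters at all (as with col1 below)
cannot be pushed, regardless of what the reader supports.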
Spark context Web UI available at http://192.168.20.114:4040
Spark context available as 'sc' (master = local[4], app id =
local-1533638354996).
Spark session available as 'spark'.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 2.3.1
/_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_91)
Type in expressions to have them evaluated.
Type :help for more information.
scala> val df = spark.read.format("myds").
option("port", 5000).
option("host", "localhost").
option("numPartitions", "1").
option("function","spartanQuery").load()
df: org.apache.spark.sql.DataFrame = [col1: smallint, col2: bigint ... 1 more
field]
scala> df.schema
res0: org.apache.spark.sql.types.StructType =
StructType(StructField(col1,ShortType,false),
StructField(col2,LongType,false), StructField(col3,DoubleType,false))
scala> df.createOrReplaceTempView("dft")
scala> spark.sql("select * from dft").show
+----+----+-------------------+
|col1|col2| col3|
+----+----+-------------------+
| 0| 0|0.36884380085393786|
| 1| 100| 0.5903338296338916|
| 2| 200|0.19292618241161108|
| 3| 300|0.27678317041136324|
| 4| 400|0.35814784304238856|
| 5| 500|0.19823945895768702|
| 6| 600| 0.4533605803735554|
| 7| 700|0.15147616644389927|
| 8| 800|0.09802114544436336|
| 9| 900|0.31370880361646414|
+----+----+-------------------+
scala> spark.sql("select * from dft where col1<5 and col2<500 and col3<.5").show
LessThan(col2,500)
LessThan(col3,0.5)
+----+----+-------------------+
|col1|col2| col3|
+----+----+-------------------+
| 0| 0|0.36884380085393786|
| 2| 200|0.19292618241161108|
| 3| 300|0.27678317041136324|
| 4| 400|0.35814784304238856|
+----+----+-------------------+
Note the absence of a filter for col1 (ShortType). Is this a shortcoming in
Spark, or am I doing something wrong? If it is a shortcoming, could the Spark
team consider adding the remaining data types to the set eligible for filter
pushdown?
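One guess on my part: since the literal 5 is an IntegerType, the analyzer may
wrap col1 in a cast to int, and a cast-wrapped column may no longer match the
patterns that translate catalyst predicates into source filters. Comparing the
extended plans of the two queries below (against the dft view from the session
above) should show whether a cast is involved:

```
scala> spark.sql("select * from dft where col1 < 5").explain(true)
scala> spark.sql("select * from dft where col1 < cast(5 as smallint)").explain(true)
```

If the cast is the culprit, typing the literal to smallint might restore the
pushdown for col1, though that would only be a workaround.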
Thank you