Andrew Duffy created SPARK-17213:
------------------------------------

             Summary: Parquet String Pushdown for Non-Eq Comparisons Broken
                 Key: SPARK-17213
                 URL: https://issues.apache.org/jira/browse/SPARK-17213
             Project: Spark
          Issue Type: Bug
    Affects Versions: 2.0.0
            Reporter: Andrew Duffy


Spark orders strings by comparing their UTF-8 byte arrays, treating each 
byte as an unsigned integer. Parquet does not currently respect this 
ordering. A fix is in progress in Parquet (JIRA and PR link below), but 
until it lands, all filters pushed down over strings are broken, and 
{{>}} and {{<}} have an actual correctness issue.

*Repro:*
Querying directly from an in-memory DataFrame:
{code}
    > Seq("a", "é").toDF("name").where("name > 'a'").count
    1
{code}

Querying from a parquet dataset:
{code}
    > Seq("a", "é").toDF("name").write.parquet("/tmp/bad")
    > spark.read.parquet("/tmp/bad").where("name > 'a'").count
    0
{code}
This happens because Spark sorts the rows as {{[a, é]}}, but Parquet 
compares strings as signed byte arrays. The UTF-8 encoding of {{é}} is 
{{0xC3 0xA9}}, and both bytes are negative when read as signed, so Parquet 
orders {{é}} before {{a}}. It therefore creates 1 row group with statistics 
{{min=é,max=a}}, and the filter {{name > 'a'}} drops that row group because 
its recorded max is not greater than {{a}}.
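
The ordering mismatch can be reproduced without Spark or Parquet. The following is a minimal sketch, not the code either project uses: {{ByteOrderDemo}}, {{compareWith}}, {{compareSigned}} and {{compareUnsigned}} are hypothetical names introduced for illustration. It compares the UTF-8 bytes of the two strings once as signed and once as unsigned values.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical helper object, for illustration only.
object ByteOrderDemo {
  // Lexicographic comparison of two byte arrays; `toInt` decides how a
  // single byte is interpreted before it is compared.
  private def compareWith(x: Array[Byte], y: Array[Byte])(toInt: Byte => Int): Int =
    x.zip(y)
      .map { case (a, b) => Integer.compare(toInt(a), toInt(b)) }
      .find(_ != 0)
      .getOrElse(Integer.compare(x.length, y.length))

  // Bytes read as signed values in -128..127.
  def compareSigned(x: Array[Byte], y: Array[Byte]): Int =
    compareWith(x, y)(_.toInt)

  // Bytes read as unsigned values in 0..255.
  def compareUnsigned(x: Array[Byte], y: Array[Byte]): Int =
    compareWith(x, y)(_ & 0xff)

  val a: Array[Byte] = "a".getBytes(UTF_8)      // 0x61
  val e: Array[Byte] = "\u00e9".getBytes(UTF_8) // 0xC3 0xA9, i.e. "é"
}
```

With these definitions, {{ByteOrderDemo.compareUnsigned(a, e)}} is negative ({{a}} sorts before {{é}}, matching the in-memory result), while {{ByteOrderDemo.compareSigned(a, e)}} is positive ({{é}} sorts before {{a}}, matching the {{min=é,max=a}} statistics).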



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
