[jira] [Updated] (SPARK-17213) Parquet String Pushdown for Non-Eq Comparisons Broken

2016-12-08 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-17213:
--
Component/s: SQL

> Parquet String Pushdown for Non-Eq Comparisons Broken
> -
>
> Key: SPARK-17213
> URL: https://issues.apache.org/jira/browse/SPARK-17213
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Andrew Duffy
>Assignee: Cheng Lian
> Fix For: 2.1.0
>
>
> Spark defines ordering over strings based on comparison of UTF8 byte arrays, 
> which compare bytes as unsigned integers. Currently however Parquet does not 
> respect this ordering. This is currently in the process of being fixed in 
> Parquet, JIRA and PR link below, but currently all filters are broken over 
> strings, with there actually being a correctness issue for {{>}} and {{<}}.
> *Repro:*
> Querying directly from in-memory DataFrame:
> {code}
> > Seq("a", "é").toDF("name").where("name > 'a'").count
> 1
> {code}
> Querying from a parquet dataset:
> {code}
> > Seq("a", "é").toDF("name").write.parquet("/tmp/bad")
> > spark.read.parquet("/tmp/bad").where("name > 'a'").count
> 0
> {code}
> This happens because Spark sorts the rows to be {{[a, é]}}, but Parquet's 
> implementation of comparison of strings is based on signed byte array 
> comparison, so it will actually create 1 row group with statistics 
> {{min=é,max=a}}, and so the row group will be dropped by the query.
> Based on the way Parquet pushes down Eq, it will not be affecting correctness 
> but it will force you to read row groups you should be able to skip.
> Link to PARQUET issue: https://issues.apache.org/jira/browse/PARQUET-686
> Link to PR: https://github.com/apache/parquet-mr/pull/362



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17213) Parquet String Pushdown for Non-Eq Comparisons Broken

2016-12-01 Thread Robert Kruszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Kruszewski updated SPARK-17213:
--
Affects Version/s: (was: 2.0.0)
   2.1.0

> Parquet String Pushdown for Non-Eq Comparisons Broken
> -
>
> Key: SPARK-17213
> URL: https://issues.apache.org/jira/browse/SPARK-17213
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.1.0
>Reporter: Andrew Duffy
>
> Spark defines ordering over strings based on comparison of UTF8 byte arrays, 
> which compare bytes as unsigned integers. Currently however Parquet does not 
> respect this ordering. This is currently in the process of being fixed in 
> Parquet, JIRA and PR link below, but currently all filters are broken over 
> strings, with there actually being a correctness issue for {{>}} and {{<}}.
> *Repro:*
> Querying directly from in-memory DataFrame:
> {code}
> > Seq("a", "é").toDF("name").where("name > 'a'").count
> 1
> {code}
> Querying from a parquet dataset:
> {code}
> > Seq("a", "é").toDF("name").write.parquet("/tmp/bad")
> > spark.read.parquet("/tmp/bad").where("name > 'a'").count
> 0
> {code}
> This happens because Spark sorts the rows to be {{[a, é]}}, but Parquet's 
> implementation of comparison of strings is based on signed byte array 
> comparison, so it will actually create 1 row group with statistics 
> {{min=é,max=a}}, and so the row group will be dropped by the query.
> Based on the way Parquet pushes down Eq, it will not be affecting correctness 
> but it will force you to read row groups you should be able to skip.
> Link to PARQUET issue: https://issues.apache.org/jira/browse/PARQUET-686
> Link to PR: https://github.com/apache/parquet-mr/pull/362



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17213) Parquet String Pushdown for Non-Eq Comparisons Broken

2016-08-24 Thread Andrew Duffy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Duffy updated SPARK-17213:
-
Description: 
Spark defines ordering over strings based on comparison of UTF8 byte arrays, 
which compare bytes as unsigned integers. Currently however Parquet does not 
respect this ordering. This is currently in the process of being fixed in 
Parquet, JIRA and PR link below, but currently all filters are broken over 
strings, with there actually being a correctness issue for {{>}} and {{<}}.

*Repro:*
Querying directly from in-memory DataFrame:
{code}
> Seq("a", "é").toDF("name").where("name > 'a'").count
1
{code}

Querying from a parquet dataset:
{code}
> Seq("a", "é").toDF("name").write.parquet("/tmp/bad")
> spark.read.parquet("/tmp/bad").where("name > 'a'").count
0
{code}
This happens because Spark sorts the rows to be {{[a, é]}}, but Parquet's 
implementation of comparison of strings is based on signed byte array 
comparison, so it will actually create 1 row group with statistics 
{{min=é,max=a}}, and so the row group will be dropped by the query.

Based on the way Parquet pushes down Eq, it will not be affecting correctness 
but it will force you to read row groups you should be able to skip.

Link to PARQUET issue: https://issues.apache.org/jira/browse/PARQUET-686
Link to PR: https://github.com/apache/parquet-mr/pull/362

  was:
Spark defines ordering over strings based on comparison of UTF8 byte arrays, 
which compare bytes as unsigned integers. Currently however Parquet does not 
respect this ordering. This is currently in the process of being fixed in 
Parquet, JIRA and PR link below, but currently all filters are broken over 
strings, with there actually being a correctness issue for {{>}} and {{<}}.

*Repro:*
Querying directly from in-memory DataFrame:
{code}
> Seq("a", "é").toDF("name").where("name > 'a'").count
1
{code}

Querying from a parquet dataset:
{code}
> Seq("a", "é").toDF("name").write.parquet("/tmp/bad")
> spark.read.parquet("/tmp/bad").where("name > 'a'").count
0
{code}
This happens because Spark sorts the rows to be {{[a, é]}}, but Parquet's 
implementation of comparison of strings is based on signed byte array 
comparison, so it will actually create 1 row group with statistics 
{{min=é,max=a}}, and so the row group will be dropped by the query.

Link to PARQUET issue: https://issues.apache.org/jira/browse/PARQUET-686
Link to PR: https://github.com/apache/parquet-mr/pull/362


> Parquet String Pushdown for Non-Eq Comparisons Broken
> -
>
> Key: SPARK-17213
> URL: https://issues.apache.org/jira/browse/SPARK-17213
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Andrew Duffy
>
> Spark defines ordering over strings based on comparison of UTF8 byte arrays, 
> which compare bytes as unsigned integers. Currently however Parquet does not 
> respect this ordering. This is currently in the process of being fixed in 
> Parquet, JIRA and PR link below, but currently all filters are broken over 
> strings, with there actually being a correctness issue for {{>}} and {{<}}.
> *Repro:*
> Querying directly from in-memory DataFrame:
> {code}
> > Seq("a", "é").toDF("name").where("name > 'a'").count
> 1
> {code}
> Querying from a parquet dataset:
> {code}
> > Seq("a", "é").toDF("name").write.parquet("/tmp/bad")
> > spark.read.parquet("/tmp/bad").where("name > 'a'").count
> 0
> {code}
> This happens because Spark sorts the rows to be {{[a, é]}}, but Parquet's 
> implementation of comparison of strings is based on signed byte array 
> comparison, so it will actually create 1 row group with statistics 
> {{min=é,max=a}}, and so the row group will be dropped by the query.
> Based on the way Parquet pushes down Eq, it will not be affecting correctness 
> but it will force you to read row groups you should be able to skip.
> Link to PARQUET issue: https://issues.apache.org/jira/browse/PARQUET-686
> Link to PR: https://github.com/apache/parquet-mr/pull/362



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-17213) Parquet String Pushdown for Non-Eq Comparisons Broken

2016-08-24 Thread Andrew Duffy (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-17213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrew Duffy updated SPARK-17213:
-
Description: 
Spark defines ordering over strings based on comparison of UTF8 byte arrays, 
which compare bytes as unsigned integers. Currently however Parquet does not 
respect this ordering. This is currently in the process of being fixed in 
Parquet, JIRA and PR link below, but currently all filters are broken over 
strings, with there actually being a correctness issue for {{>}} and {{<}}.

*Repro:*
Querying directly from in-memory DataFrame:
{code}
> Seq("a", "é").toDF("name").where("name > 'a'").count
1
{code}

Querying from a parquet dataset:
{code}
> Seq("a", "é").toDF("name").write.parquet("/tmp/bad")
> spark.read.parquet("/tmp/bad").where("name > 'a'").count
0
{code}
This happens because Spark sorts the rows to be {{[a, é]}}, but Parquet's 
implementation of comparison of strings is based on signed byte array 
comparison, so it will actually create 1 row group with statistics 
{{min=é,max=a}}, and so the row group will be dropped by the query.

Link to PARQUET issue: https://issues.apache.org/jira/browse/PARQUET-686
Link to PR: https://github.com/apache/parquet-mr/pull/362

  was:
Spark defines ordering over strings based on comparison of UTF8 byte arrays, 
which compare bytes as unsigned integers. Currently however Parquet does not 
respect this ordering. This is currently in the process of being fixed in 
Parquet, JIRA and PR link below, but currently all filters are broken over 
strings, with there actually being a correctness issue for {{>}} and {{<}}.

*Repro:*
Querying directly from in-memory DataFrame:
{code}
> Seq("a", "é").toDF("name").where("name > 'a'").count
1
{code}

Querying from a parquet dataset:
{code}
> Seq("a", "é").toDF("name").write.parquet("/tmp/bad")
> spark.read.parquet("/tmp/bad").where("name > 'a'").count
0
{code}
This happens because Spark sorts the rows to be {{[a, é]}}, but Parquet's 
implementation of comparison of strings is based on signed byte array 
comparison, so it will actually create 1 row group with statistics 
{{min=é,max=a}}, and so the row group will be dropped by the query.


> Parquet String Pushdown for Non-Eq Comparisons Broken
> -
>
> Key: SPARK-17213
> URL: https://issues.apache.org/jira/browse/SPARK-17213
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 2.0.0
>Reporter: Andrew Duffy
>
> Spark defines ordering over strings based on comparison of UTF8 byte arrays, 
> which compare bytes as unsigned integers. Currently however Parquet does not 
> respect this ordering. This is currently in the process of being fixed in 
> Parquet, JIRA and PR link below, but currently all filters are broken over 
> strings, with there actually being a correctness issue for {{>}} and {{<}}.
> *Repro:*
> Querying directly from in-memory DataFrame:
> {code}
> > Seq("a", "é").toDF("name").where("name > 'a'").count
> 1
> {code}
> Querying from a parquet dataset:
> {code}
> > Seq("a", "é").toDF("name").write.parquet("/tmp/bad")
> > spark.read.parquet("/tmp/bad").where("name > 'a'").count
> 0
> {code}
> This happens because Spark sorts the rows to be {{[a, é]}}, but Parquet's 
> implementation of comparison of strings is based on signed byte array 
> comparison, so it will actually create 1 row group with statistics 
> {{min=é,max=a}}, and so the row group will be dropped by the query.
> Link to PARQUET issue: https://issues.apache.org/jira/browse/PARQUET-686
> Link to PR: https://github.com/apache/parquet-mr/pull/362



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org