[jira] [Commented] (SPARK-20364) Parquet predicate pushdown on columns with dots return empty results
[ https://issues.apache.org/jira/browse/SPARK-20364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16012422#comment-16012422 ] Apache Spark commented on SPARK-20364: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/18000 > Parquet predicate pushdown on columns with dots return empty results > > > Key: SPARK-20364 > URL: https://issues.apache.org/jira/browse/SPARK-20364 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon >Priority: Critical > > Currently, if there are dots in the column name, predicate pushdown seems > being failed in Parquet. > **With dots** > {code} > val path = "/tmp/abcde" > Seq(Some(1), None).toDF("col.dots").write.parquet(path) > spark.read.parquet(path).where("`col.dots` IS NOT NULL").show() > {code} > {code} > ++ > |col.dots| > ++ > ++ > {code} > **Without dots** > {code} > val path = "/tmp/abcde2" > Seq(Some(1), None).toDF("coldots").write.parquet(path) > spark.read.parquet(path).where("`coldots` IS NOT NULL").show() > {code} > {code} > +---+ > |coldots| > +---+ > | 1| > +---+ > {code} > It seems dot in the column names via {{FilterApi}} tries to separate the > field name with dot ({{ColumnPath}} with multiple column paths) whereas the > actual column name is {{col.dots}}. (See [FilterApi.java#L71 > |https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java#L71] > and it calls > [ColumnPath.java#L44|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java#L44]. > I just tried to come up with ways to resolve it and I came up with two as > below: > One is simply to don't push down filters when there are dots in column names > so that it reads all and filters in Spark-side. > The other way creates Spark's {{FilterApi}} for those columns (it seems > final) to get always use single column path it in Spark-side (this seems > hacky) as we are not pushing down nested columns currently. So, it looks we > can get a field name via {{ColumnPath.get}} not {{ColumnPath.fromDotString}} > in this way. > I just made a rough version of the latter. > {code} > private[parquet] object ParquetColumns { > def intColumn(columnPath: String): Column[Integer] with SupportsLtGt = { > new Column[Integer] (ColumnPath.get(columnPath), classOf[Integer]) with > SupportsLtGt > } > def longColumn(columnPath: String): Column[java.lang.Long] with > SupportsLtGt = { > new Column[java.lang.Long] ( > ColumnPath.get(columnPath), classOf[java.lang.Long]) with SupportsLtGt > } > def floatColumn(columnPath: String): Column[java.lang.Float] with > SupportsLtGt = { > new Column[java.lang.Float] ( > ColumnPath.get(columnPath), classOf[java.lang.Float]) with SupportsLtGt > } > def doubleColumn(columnPath: String): Column[java.lang.Double] with > SupportsLtGt = { > new Column[java.lang.Double] ( > ColumnPath.get(columnPath), classOf[java.lang.Double]) with SupportsLtGt > } > def booleanColumn(columnPath: String): Column[java.lang.Boolean] with > SupportsEqNotEq = { > new Column[java.lang.Boolean] ( > ColumnPath.get(columnPath), classOf[java.lang.Boolean]) with > SupportsEqNotEq > } > def binaryColumn(columnPath: String): Column[Binary] with SupportsLtGt = { > new Column[Binary] (ColumnPath.get(columnPath), classOf[Binary]) with > SupportsLtGt > } > } > {code} > Both sound not the best. Please let me know if anyone has a better idea. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20364) Parquet predicate pushdown on columns with dots return empty results
[ https://issues.apache.org/jira/browse/SPARK-20364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15974215#comment-15974215 ] Apache Spark commented on SPARK-20364: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/17680 > Parquet predicate pushdown on columns with dots return empty results > > > Key: SPARK-20364 > URL: https://issues.apache.org/jira/browse/SPARK-20364 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon > > Currently, if there are dots in the column name, predicate pushdown seems > being failed in Parquet. > **With dots** > {code} > val path = "/tmp/abcde" > Seq(Some(1), None).toDF("col.dots").write.parquet(path) > spark.read.parquet(path).where("`col.dots` IS NOT NULL").show() > {code} > {code} > ++ > |col.dots| > ++ > ++ > {code} > **Without dots** > {code} > val path = "/tmp/abcde2" > Seq(Some(1), None).toDF("coldots").write.parquet(path) > spark.read.parquet(path).where("`coldots` IS NOT NULL").show() > {code} > {code} > +---+ > |coldots| > +---+ > | 1| > +---+ > {code} > It seems dot in the column names via {{FilterApi}} tries to separate the > field name with dot ({{ColumnPath}} with multiple column paths) whereas the > actual column name is {{col.dots}}. (See [FilterApi.java#L71 > |https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java#L71] > and it calls > [ColumnPath.java#L44|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java#L44]. > I just tried to come up with ways to resolve it and I came up with two as > below: > One is simply to don't push down filters when there are dots in column names > so that it reads all and filters in Spark-side. > The other way creates Spark's {{FilterApi}} for those columns (it seems > final) to get always use single column path it in Spark-side (this seems > hacky) as we are not pushing down nested columns currently. So, it looks we > can get a field name via {{ColumnPath.get}} not {{ColumnPath.fromDotString}} > in this way. > I just made a rough version of the latter. > {code} > private[parquet] object ParquetColumns { > def intColumn(columnPath: String): Column[Integer] with SupportsLtGt = { > new Column[Integer] (ColumnPath.get(columnPath), classOf[Integer]) with > SupportsLtGt > } > def longColumn(columnPath: String): Column[java.lang.Long] with > SupportsLtGt = { > new Column[java.lang.Long] ( > ColumnPath.get(columnPath), classOf[java.lang.Long]) with SupportsLtGt > } > def floatColumn(columnPath: String): Column[java.lang.Float] with > SupportsLtGt = { > new Column[java.lang.Float] ( > ColumnPath.get(columnPath), classOf[java.lang.Float]) with SupportsLtGt > } > def doubleColumn(columnPath: String): Column[java.lang.Double] with > SupportsLtGt = { > new Column[java.lang.Double] ( > ColumnPath.get(columnPath), classOf[java.lang.Double]) with SupportsLtGt > } > def booleanColumn(columnPath: String): Column[java.lang.Boolean] with > SupportsEqNotEq = { > new Column[java.lang.Boolean] ( > ColumnPath.get(columnPath), classOf[java.lang.Boolean]) with > SupportsEqNotEq > } > def binaryColumn(columnPath: String): Column[Binary] with SupportsLtGt = { > new Column[Binary] (ColumnPath.get(columnPath), classOf[Binary]) with > SupportsLtGt > } > } > {code} > Both sound not the best. Please let me know if anyone has a better idea. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20364) Parquet predicate pushdown on columns with dots return empty results
[ https://issues.apache.org/jira/browse/SPARK-20364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15974214#comment-15974214 ] Hyukjin Kwon commented on SPARK-20364: -- Probably, let me open a PR first at least to show/check if the latter suggestion works. > Parquet predicate pushdown on columns with dots return empty results > > > Key: SPARK-20364 > URL: https://issues.apache.org/jira/browse/SPARK-20364 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon > > Currently, if there are dots in the column name, predicate pushdown seems > being failed in Parquet. > **With dots** > {code} > val path = "/tmp/abcde" > Seq(Some(1), None).toDF("col.dots").write.parquet(path) > spark.read.parquet(path).where("`col.dots` IS NOT NULL").show() > {code} > {code} > ++ > |col.dots| > ++ > ++ > {code} > **Without dots** > {code} > val path = "/tmp/abcde2" > Seq(Some(1), None).toDF("coldots").write.parquet(path) > spark.read.parquet(path).where("`coldots` IS NOT NULL").show() > {code} > {code} > +---+ > |coldots| > +---+ > | 1| > +---+ > {code} > It seems dot in the column names via {{FilterApi}} tries to separate the > field name with dot ({{ColumnPath}} with multiple column paths) whereas the > actual column name is {{col.dots}}. (See [FilterApi.java#L71 > |https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java#L71] > and it calls > [ColumnPath.java#L44|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java#L44]. > I just tried to come up with ways to resolve it and I came up with two as > below: > One is simply to don't push down filters when there are dots in column names > so that it reads all and filters in Spark-side. > The other way creates Spark's {{FilterApi}} for those columns (it seems > final) to get always use single column path it in Spark-side (this seems > hacky) as we are not pushing down nested columns currently. So, it looks we > can get a field name via {{ColumnPath.get}} not {{ColumnPath.fromDotString}} > in this way. > I just made a rough version of the latter. > {code} > private[parquet] object ParquetColumns { > def intColumn(columnPath: String): Column[Integer] with SupportsLtGt = { > new Column[Integer] (ColumnPath.get(columnPath), classOf[Integer]) with > SupportsLtGt > } > def longColumn(columnPath: String): Column[java.lang.Long] with > SupportsLtGt = { > new Column[java.lang.Long] ( > ColumnPath.get(columnPath), classOf[java.lang.Long]) with SupportsLtGt > } > def floatColumn(columnPath: String): Column[java.lang.Float] with > SupportsLtGt = { > new Column[java.lang.Float] ( > ColumnPath.get(columnPath), classOf[java.lang.Float]) with SupportsLtGt > } > def doubleColumn(columnPath: String): Column[java.lang.Double] with > SupportsLtGt = { > new Column[java.lang.Double] ( > ColumnPath.get(columnPath), classOf[java.lang.Double]) with SupportsLtGt > } > def booleanColumn(columnPath: String): Column[java.lang.Boolean] with > SupportsEqNotEq = { > new Column[java.lang.Boolean] ( > ColumnPath.get(columnPath), classOf[java.lang.Boolean]) with > SupportsEqNotEq > } > def binaryColumn(columnPath: String): Column[Binary] with SupportsLtGt = { > new Column[Binary] (ColumnPath.get(columnPath), classOf[Binary]) with > SupportsLtGt > } > } > {code} > Both sound not the best. Please let me know if anyone has a better idea. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20364) Parquet predicate pushdown on columns with dots return empty results
[ https://issues.apache.org/jira/browse/SPARK-20364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15973777#comment-15973777 ] Hyukjin Kwon commented on SPARK-20364: -- Ah, actually, I ran some of tests already with the latter proposal (of course :)) and took a look related code. In terms of whether it works or not, I am positive. Yes, I think that makes sense in a way but I guess that is an option too as I guess calling protected ones sounds not the best too and rather an workaround similarly. We could adds it in the JIRA as the third option. Actually, I think both this way and the way I suggested as the latter use Parquet APIs as working around. So, I guess none of both is particually better. The issue here to me is about setting dot in the filter API. I wonder if there is no way to set this in the column path via filter API. I will investigate a bit more, try both ways and open a PR after picking up one if no one has an aswer about this for few days. I actually searched a bit ahead and I failed to find a JIRA in Parquet. > Parquet predicate pushdown on columns with dots return empty results > > > Key: SPARK-20364 > URL: https://issues.apache.org/jira/browse/SPARK-20364 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon > > Currently, if there are dots in the column name, predicate pushdown seems > being failed in Parquet. > **With dots** > {code} > val path = "/tmp/abcde" > Seq(Some(1), None).toDF("col.dots").write.parquet(path) > spark.read.parquet(path).where("`col.dots` IS NOT NULL").show() > {code} > {code} > ++ > |col.dots| > ++ > ++ > {code} > **Without dots** > {code} > val path = "/tmp/abcde2" > Seq(Some(1), None).toDF("coldots").write.parquet(path) > spark.read.parquet(path).where("`coldots` IS NOT NULL").show() > {code} > {code} > +---+ > |coldots| > +---+ > | 1| > +---+ > {code} > It seems dot in the column names via {{FilterApi}} tries to separate the > field name with dot ({{ColumnPath}} with multiple column paths) whereas the > actual column name is {{col.dots}}. (See [FilterApi.java#L71 > |https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java#L71] > and it calls > [ColumnPath.java#L44|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java#L44]. > I just tried to come up with ways to resolve it and I came up with two as > below: > One is simply to don't push down filters when there are dots in column names > so that it reads all and filters in Spark-side. > The other way creates Spark's {{FilterApi}} for those columns (it seems > final) to get always use single column path it in Spark-side (this seems > hacky) as we are not pushing down nested columns currently. So, it looks we > can get a field name via {{ColumnPath.get}} not {{ColumnPath.fromDotString}} > in this way. > I just made a rough version of the latter. > {code} > private[parquet] object ParquetColumns { > def intColumn(columnPath: String): Column[Integer] with SupportsLtGt = { > new Column[Integer] (ColumnPath.get(columnPath), classOf[Integer]) with > SupportsLtGt > } > def longColumn(columnPath: String): Column[java.lang.Long] with > SupportsLtGt = { > new Column[java.lang.Long] ( > ColumnPath.get(columnPath), classOf[java.lang.Long]) with SupportsLtGt > } > def floatColumn(columnPath: String): Column[java.lang.Float] with > SupportsLtGt = { > new Column[java.lang.Float] ( > ColumnPath.get(columnPath), classOf[java.lang.Float]) with SupportsLtGt > } > def doubleColumn(columnPath: String): Column[java.lang.Double] with > SupportsLtGt = { > new Column[java.lang.Double] ( > ColumnPath.get(columnPath), classOf[java.lang.Double]) with SupportsLtGt > } > def booleanColumn(columnPath: String): Column[java.lang.Boolean] with > SupportsEqNotEq = { > new Column[java.lang.Boolean] ( > ColumnPath.get(columnPath), classOf[java.lang.Boolean]) with > SupportsEqNotEq > } > def binaryColumn(columnPath: String): Column[Binary] with SupportsLtGt = { > new Column[Binary] (ColumnPath.get(columnPath), classOf[Binary]) with > SupportsLtGt > } > } > {code} > Both sound not the best. Please let me know if anyone has a better idea. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20364) Parquet predicate pushdown on columns with dots return empty results
[ https://issues.apache.org/jira/browse/SPARK-20364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15973763#comment-15973763 ] Andrew Ash commented on SPARK-20364: Thanks for the investigation [~hyukjin.kwon]! The proposal you put up (sorry Rob, wasn't me) looks like a good direction. I'm not sure it will work as written though, since I'd expect the IntColumn/LongColumn/etc classes from Parquet to be expected in other places, an instanceof for example. As an alternate suggestion, I'm wondering if we could call the package-protected constructors of the {{IntColumn}} class from Spark-owned class in that package with the result from the {{ColumnPath.get()}} method, rather than its {{.fromDotString()}} method. Then Spark can use its better knowledge of whether the field is a column-with-dot or a field-in-struct and assemble the ColumnPath correctly. And I do think that this should eventually live in Parquet. Maybe there's a ticket already open there for this? Does that make sense? > Parquet predicate pushdown on columns with dots return empty results > > > Key: SPARK-20364 > URL: https://issues.apache.org/jira/browse/SPARK-20364 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon > > Currently, if there are dots in the column name, predicate pushdown seems > being failed in Parquet. > **With dots** > {code} > val path = "/tmp/abcde" > Seq(Some(1), None).toDF("col.dots").write.parquet(path) > spark.read.parquet(path).where("`col.dots` IS NOT NULL").show() > {code} > {code} > ++ > |col.dots| > ++ > ++ > {code} > **Without dots** > {code} > val path = "/tmp/abcde2" > Seq(Some(1), None).toDF("coldots").write.parquet(path) > spark.read.parquet(path).where("`coldots` IS NOT NULL").show() > {code} > {code} > +---+ > |coldots| > +---+ > | 1| > +---+ > {code} > It seems dot in the column names via {{FilterApi}} tries to separate the > field name with dot ({{ColumnPath}} with multiple column paths) whereas the > actual column name is {{col.dots}}. (See [FilterApi.java#L71 > |https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java#L71] > and it calls > [ColumnPath.java#L44|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java#L44]. > I just tried to come up with ways to resolve it and I came up with two as > below: > One is simply to don't push down filters when there are dots in column names > so that it reads all and filters in Spark-side. > The other way creates Spark's {{FilterApi}} for those columns (it seems > final) to get always use single column path it in Spark-side (this seems > hacky) as we are not pushing down nested columns currently. So, it looks we > can get a field name via {{ColumnPath.get}} not {{ColumnPath.fromDotString}} > in this way. > I just made a rough version of the latter. > {code} > private[parquet] object ParquetColumns { > def intColumn(columnPath: String): Column[Integer] with SupportsLtGt = { > new Column[Integer] (ColumnPath.get(columnPath), classOf[Integer]) with > SupportsLtGt > } > def longColumn(columnPath: String): Column[java.lang.Long] with > SupportsLtGt = { > new Column[java.lang.Long] ( > ColumnPath.get(columnPath), classOf[java.lang.Long]) with SupportsLtGt > } > def floatColumn(columnPath: String): Column[java.lang.Float] with > SupportsLtGt = { > new Column[java.lang.Float] ( > ColumnPath.get(columnPath), classOf[java.lang.Float]) with SupportsLtGt > } > def doubleColumn(columnPath: String): Column[java.lang.Double] with > SupportsLtGt = { > new Column[java.lang.Double] ( > ColumnPath.get(columnPath), classOf[java.lang.Double]) with SupportsLtGt > } > def booleanColumn(columnPath: String): Column[java.lang.Boolean] with > SupportsEqNotEq = { > new Column[java.lang.Boolean] ( > ColumnPath.get(columnPath), classOf[java.lang.Boolean]) with > SupportsEqNotEq > } > def binaryColumn(columnPath: String): Column[Binary] with SupportsLtGt = { > new Column[Binary] (ColumnPath.get(columnPath), classOf[Binary]) with > SupportsLtGt > } > } > {code} > Both sound not the best. Please let me know if anyone has a better idea. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20364) Parquet predicate pushdown on columns with dots return empty results
[ https://issues.apache.org/jira/browse/SPARK-20364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15972400#comment-15972400 ] Robert Kruszewski commented on SPARK-20364: --- Looks like parquet doesn't differentiate between column named "a.b" and field b in column "a". We could always treat them the same but that might lead to some surprises down the road. Ideally parquet would let us construct filter given exact column name or dot separated accessors. I think what [~aash] proposes is roughly how we should fix it with caveat that eventually this should live in parquet imho. > Parquet predicate pushdown on columns with dots return empty results > > > Key: SPARK-20364 > URL: https://issues.apache.org/jira/browse/SPARK-20364 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon > > Currently, if there are dots in the column name, predicate pushdown seems > being failed in Parquet. > **With dots** > {code} > val path = "/tmp/abcde" > Seq(Some(1), None).toDF("col.dots").write.parquet(path) > spark.read.parquet(path).where("`col.dots` IS NOT NULL").show() > {code} > {code} > ++ > |col.dots| > ++ > ++ > {code} > **Without dots** > {code} > val path = "/tmp/abcde2" > Seq(Some(1), None).toDF("coldots").write.parquet(path) > spark.read.parquet(path).where("`coldots` IS NOT NULL").show() > {code} > {code} > +---+ > |coldots| > +---+ > | 1| > +---+ > {code} > It seems dot in the column names via {{FilterApi}} tries to separate the > field name with dot ({{ColumnPath}} with multiple column paths) whereas the > actual column name is {{col.dots}}. (See [FilterApi.java#L71 > |https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java#L71] > and it calls > [ColumnPath.java#L44|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java#L44]. > I just tried to come up with ways to resolve it and I came up with two as > below: > One is simply to don't push down filters when there are dots in column names > so that it reads all and filters in Spark-side. > The other way creates Spark's {{FilterApi}} for those columns (it seems > final) to get always use single column path it in Spark-side (this seems > hacky) as we are not pushing down nested columns currently. So, it looks we > can get a field name via {{ColumnPath.get}} not {{ColumnPath.fromDotString}} > in this way. > I just made a rough version of the latter. > {code} > private[parquet] object ParquetColumns { > def intColumn(columnPath: String): Column[Integer] with SupportsLtGt = { > new Column[Integer] (ColumnPath.get(columnPath), classOf[Integer]) with > SupportsLtGt > } > def longColumn(columnPath: String): Column[java.lang.Long] with > SupportsLtGt = { > new Column[java.lang.Long] ( > ColumnPath.get(columnPath), classOf[java.lang.Long]) with SupportsLtGt > } > def floatColumn(columnPath: String): Column[java.lang.Float] with > SupportsLtGt = { > new Column[java.lang.Float] ( > ColumnPath.get(columnPath), classOf[java.lang.Float]) with SupportsLtGt > } > def doubleColumn(columnPath: String): Column[java.lang.Double] with > SupportsLtGt = { > new Column[java.lang.Double] ( > ColumnPath.get(columnPath), classOf[java.lang.Double]) with SupportsLtGt > } > def booleanColumn(columnPath: String): Column[java.lang.Boolean] with > SupportsEqNotEq = { > new Column[java.lang.Boolean] ( > ColumnPath.get(columnPath), classOf[java.lang.Boolean]) with > SupportsEqNotEq > } > def binaryColumn(columnPath: String): Column[Binary] with SupportsLtGt = { > new Column[Binary] (ColumnPath.get(columnPath), classOf[Binary]) with > SupportsLtGt > } > } > {code} > Both sound not the best. Please let me know if anyone has a better idea. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20364) Parquet predicate pushdown on columns with dots return empty results
[ https://issues.apache.org/jira/browse/SPARK-20364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15972171#comment-15972171 ] Hyukjin Kwon commented on SPARK-20364: -- [~aash], [~robert3005] who found this issue in https://github.com/apache/spark/pull/17667 and [~lian cheng] who might have a better idea and I think can confirm if the investigation here is correct and decide the way to resolve it. > Parquet predicate pushdown on columns with dots return empty results > > > Key: SPARK-20364 > URL: https://issues.apache.org/jira/browse/SPARK-20364 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.0 >Reporter: Hyukjin Kwon > > Currently, if there are dots in the column name, predicate pushdown seems > being failed in Parquet. > **With dots** > {code} > val path = "/tmp/abcde" > Seq(Some(1), None).toDF("col.dots").write.parquet(path) > spark.read.parquet(path).where("`col.dots` IS NOT NULL").show() > {code} > {code} > ++ > |col.dots| > ++ > ++ > {code} > **Without dots** > {code} > val path = "/tmp/abcde2" > Seq(Some(1), None).toDF("coldots").write.parquet(path) > spark.read.parquet(path).where("`coldots` IS NOT NULL").show() > {code} > {code} > +---+ > |coldots| > +---+ > | 1| > +---+ > {code} > It seems dot in the column names via {{FilterApi}} tries to separate the > field name with dot ({{ColumnPath}} with multiple column paths) whereas the > actual column name is {{col.dots}}. (See [FilterApi.java#L71 > |https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-column/src/main/java/org/apache/parquet/filter2/predicate/FilterApi.java#L71] > and it calls > [ColumnPath.java#L44|https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.2/parquet-common/src/main/java/org/apache/parquet/hadoop/metadata/ColumnPath.java#L44]. > I just tried to come up with ways to resolve it and I came up with two as > below: > One is simply to don't push down filters when there are dots in column names > so that it reads all and filters in Spark-side. > The other way creates Spark's {{FilterApi}} for those columns (it seems > final) to get always use single column path it in Spark-side (this seems > hacky) as we are not pushing down nested columns currently. So, it looks we > can get a field name via {{ColumnPath.get}} not {{ColumnPath.fromDotString}} > in this way. > I just made a rough version of the latter. > {code} > private[parquet] object ParquetColumns { > def intColumn(columnPath: String): Column[Integer] with SupportsLtGt = { > new Column[Integer] (ColumnPath.get(columnPath), classOf[Integer]) with > SupportsLtGt > } > def longColumn(columnPath: String): Column[java.lang.Long] with > SupportsLtGt = { > new Column[java.lang.Long] ( > ColumnPath.get(columnPath), classOf[java.lang.Long]) with SupportsLtGt > } > def floatColumn(columnPath: String): Column[java.lang.Float] with > SupportsLtGt = { > new Column[java.lang.Float] ( > ColumnPath.get(columnPath), classOf[java.lang.Float]) with SupportsLtGt > } > def doubleColumn(columnPath: String): Column[java.lang.Double] with > SupportsLtGt = { > new Column[java.lang.Double] ( > ColumnPath.get(columnPath), classOf[java.lang.Double]) with SupportsLtGt > } > def booleanColumn(columnPath: String): Column[java.lang.Boolean] with > SupportsEqNotEq = { > new Column[java.lang.Boolean] ( > ColumnPath.get(columnPath), classOf[java.lang.Boolean]) with > SupportsEqNotEq > } > def binaryColumn(columnPath: String): Column[Binary] with SupportsLtGt = { > new Column[Binary] (ColumnPath.get(columnPath), classOf[Binary]) with > SupportsLtGt > } > } > {code} > Both sound not the best. Please let me know if anyone has a better idea. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org