[jira] [Resolved] (SPARK-31615) Pretty string output for sql method of RuntimeReplaceable expressions
[ https://issues.apache.org/jira/browse/SPARK-31615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-31615. -- Fix Version/s: 3.1.0 Assignee: Kent Yao Resolution: Fixed Resolved by https://github.com/apache/spark/pull/28420 > Pretty string output for sql method of RuntimeReplaceable expressions > - > > Key: SPARK-31615 > URL: https://issues.apache.org/jira/browse/SPARK-31615 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Minor > Fix For: 3.1.0 > > > RuntimeReplaceable expressions are replaced at runtime, so their original > parameters are not resolved to PrettyAttribute and remain debug-style > strings if we implement their `sql` methods directly in terms of their > parameters' `sql` methods, e.g. the `sql` method of `to_timestamp`. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
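The debug-vs-pretty distinction can be sketched in a few lines of plain Python (a toy model with illustrative names only, not Spark's actual Catalyst classes): delegating `sql` to the runtime replacement leaks internals, while the pretty override renders the user-facing call.

```python
class Expression:
    def __init__(self, name, children):
        self.name = name
        self.children = children

    def sql(self):
        args = ", ".join(c.sql() for c in self.children)
        return f"{self.name}({args})"

class Literal(Expression):
    def __init__(self, value):
        self.value = value

    def sql(self):
        return repr(self.value)

# A runtime-replaceable expression: `to_timestamp(x)` is rewritten at
# analysis time into something like `cast(gettimestamp(x))`.
class ToTimestamp(Expression):
    def __init__(self, child):
        self.child = child
        # the replacement that is actually executed at runtime
        self.replacement = Expression("cast", [Expression("gettimestamp", [child])])

    # Naive: delegating to the replacement leaks debug-style internals
    def sql_debug(self):
        return self.replacement.sql()

    # Pretty: render the original user-facing call instead
    def sql(self):
        return f"to_timestamp({self.child.sql()})"

e = ToTimestamp(Literal("2020-01-01"))
print(e.sql_debug())  # cast(gettimestamp('2020-01-01'))
print(e.sql())        # to_timestamp('2020-01-01')
```

The real fix is analogous: the expression's `sql`/`prettyName` render the original call rather than the replacement's tree.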
[jira] [Resolved] (SPARK-31631) Fix test flakiness caused by MiniKdc which throws "address in use" BindException
[ https://issues.apache.org/jira/browse/SPARK-31631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-31631. -- Fix Version/s: 3.1.0 Assignee: Kent Yao Resolution: Fixed Resolved by https://github.com/apache/spark/pull/28442 > Fix test flakiness caused by MiniKdc which throws "address in use" > BindException > > > Key: SPARK-31631 > URL: https://issues.apache.org/jira/browse/SPARK-31631 > Project: Spark > Issue Type: Bug > Components: Tests >Affects Versions: 3.0.0, 3.1.0 >Reporter: Kent Yao >Assignee: Kent Yao >Priority: Major > Fix For: 3.1.0 > > > {code:java} > [info] org.apache.spark.sql.kafka010.KafkaDelegationTokenSuite *** ABORTED > *** (15 seconds, 426 milliseconds) > [info] java.net.BindException: Address already in use > [info] at sun.nio.ch.Net.bind0(Native Method) > [info] at sun.nio.ch.Net.bind(Net.java:433) > [info] at sun.nio.ch.Net.bind(Net.java:425) > [info] at > sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:223) > [info] at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74) > [info] at > org.apache.mina.transport.socket.nio.NioSocketAcceptor.open(NioSocketAcceptor.java:198) > [info] at > org.apache.mina.transport.socket.nio.NioSocketAcceptor.open(NioSocketAcceptor.java:51) > [info] at > org.apache.mina.core.polling.AbstractPollingIoAcceptor.registerHandles(AbstractPollingIoAcceptor.java:547) > [info] at > org.apache.mina.core.polling.AbstractPollingIoAcceptor.access$400(AbstractPollingIoAcceptor.java:68) > [info] at > org.apache.mina.core.polling.AbstractPollingIoAcceptor$Acceptor.run(AbstractPollingIoAcceptor.java:422) > [info] at > org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:64) > [info] at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > [info] at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > [info] at java.lang.Thread.run(Thread.java:748) > {code} > This 
is an issue fixed in Hadoop 2.8.0 > https://issues.apache.org/jira/browse/HADOOP-12656 > We may need to apply the approach from HBase first, before we drop Hadoop 2.7.x > https://issues.apache.org/jira/browse/HBASE-14734
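The HBASE-14734 approach amounts to retrying server startup on a fresh port when the bind fails. A minimal sketch of that retry loop (a hypothetical helper using raw sockets, not Spark's or HBase's actual test code):

```python
import errno
import socket

def start_on_free_port(start_fn, candidate_ports):
    """Retry `start_fn(port)` with the next candidate port whenever the
    chosen one is already taken (the HBASE-14734 workaround, sketched)."""
    last_err = None
    for port in candidate_ports:
        try:
            return start_fn(port)
        except OSError as e:
            if e.errno != errno.EADDRINUSE:
                raise          # a different failure: don't mask it
            last_err = e       # port taken; try the next candidate
    raise last_err

def bind(port):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", port))
    return s

blocker = bind(0)                        # occupy a random free port
taken = blocker.getsockname()[1]
srv = start_on_free_port(bind, [taken, 0])  # first attempt collides, retry succeeds
new_port = srv.getsockname()[1]
blocker.close(); srv.close()
print(new_port != taken)  # True
```

Binding to port 0 lets the OS pick a free port, which is why the retry with `0` as the fallback candidate succeeds.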
[jira] [Assigned] (SPARK-31656) AFT blockify input vectors
[ https://issues.apache.org/jira/browse/SPARK-31656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-31656: Assignee: zhengruifeng > AFT blockify input vectors > -- > > Key: SPARK-31656 > URL: https://issues.apache.org/jira/browse/SPARK-31656 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor >
[jira] [Assigned] (SPARK-31656) AFT blockify input vectors
[ https://issues.apache.org/jira/browse/SPARK-31656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31656: Assignee: Apache Spark > AFT blockify input vectors > -- > > Key: SPARK-31656 > URL: https://issues.apache.org/jira/browse/SPARK-31656 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Minor >
[jira] [Commented] (SPARK-31656) AFT blockify input vectors
[ https://issues.apache.org/jira/browse/SPARK-31656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101349#comment-17101349 ] Apache Spark commented on SPARK-31656: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/28473 > AFT blockify input vectors > -- > > Key: SPARK-31656 > URL: https://issues.apache.org/jira/browse/SPARK-31656 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Minor >
[jira] [Assigned] (SPARK-31656) AFT blockify input vectors
[ https://issues.apache.org/jira/browse/SPARK-31656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31656: Assignee: (was: Apache Spark) > AFT blockify input vectors > -- > > Key: SPARK-31656 > URL: https://issues.apache.org/jira/browse/SPARK-31656 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.1.0 >Reporter: zhengruifeng >Priority: Minor >
[jira] [Commented] (SPARK-31655) Upgrade snappy to version 1.1.7.5
[ https://issues.apache.org/jira/browse/SPARK-31655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101348#comment-17101348 ] Apache Spark commented on SPARK-31655: -- User 'AngersZh' has created a pull request for this issue: https://github.com/apache/spark/pull/28472 > Upgrade snappy to version 1.1.7.5 > - > > Key: SPARK-31655 > URL: https://issues.apache.org/jira/browse/SPARK-31655 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Minor > > Upgrade snappy to version 1.1.7.5
[jira] [Assigned] (SPARK-31655) Upgrade snappy to version 1.1.7.5
[ https://issues.apache.org/jira/browse/SPARK-31655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31655: Assignee: Apache Spark > Upgrade snappy to version 1.1.7.5 > - > > Key: SPARK-31655 > URL: https://issues.apache.org/jira/browse/SPARK-31655 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: angerszhu >Assignee: Apache Spark >Priority: Minor > > Upgrade snappy to version 1.1.7.5
[jira] [Assigned] (SPARK-31655) Upgrade snappy to version 1.1.7.5
[ https://issues.apache.org/jira/browse/SPARK-31655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31655: Assignee: (was: Apache Spark) > Upgrade snappy to version 1.1.7.5 > - > > Key: SPARK-31655 > URL: https://issues.apache.org/jira/browse/SPARK-31655 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: angerszhu >Priority: Minor > > Upgrade snappy to version 1.1.7.5
[jira] [Created] (SPARK-31656) AFT blockify input vectors
zhengruifeng created SPARK-31656: Summary: AFT blockify input vectors Key: SPARK-31656 URL: https://issues.apache.org/jira/browse/SPARK-31656 Project: Spark Issue Type: Sub-task Components: ML Affects Versions: 3.1.0 Reporter: zhengruifeng
[jira] [Assigned] (SPARK-30660) LinearRegression blockify input vectors
[ https://issues.apache.org/jira/browse/SPARK-30660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-30660: Assignee: zhengruifeng (was: Apache Spark) > LinearRegression blockify input vectors > --- > > Key: SPARK-30660 > URL: https://issues.apache.org/jira/browse/SPARK-30660 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor >
[jira] [Commented] (SPARK-30660) LinearRegression blockify input vectors
[ https://issues.apache.org/jira/browse/SPARK-30660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101340#comment-17101340 ] Apache Spark commented on SPARK-30660: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/28471 > LinearRegression blockify input vectors > --- > > Key: SPARK-30660 > URL: https://issues.apache.org/jira/browse/SPARK-30660 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor >
[jira] [Assigned] (SPARK-30660) LinearRegression blockify input vectors
[ https://issues.apache.org/jira/browse/SPARK-30660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-30660: Assignee: Apache Spark (was: zhengruifeng) > LinearRegression blockify input vectors > --- > > Key: SPARK-30660 > URL: https://issues.apache.org/jira/browse/SPARK-30660 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: Apache Spark >Priority: Minor >
[jira] [Closed] (SPARK-5195) when a Hive table is queried with an alias, the cached data loses effectiveness
[ https://issues.apache.org/jira/browse/SPARK-5195?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] yixiaohua closed SPARK-5195. > when a Hive table is queried with an alias, the cached data loses effectiveness > > > Key: SPARK-5195 > URL: https://issues.apache.org/jira/browse/SPARK-5195 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.2.0 >Reporter: yixiaohua >Assignee: yixiaohua >Priority: Major > Fix For: 1.3.0 > > > Override MetastoreRelation's sameResult method to compare only the database > name and table name. Previously, after `cache table t1;`, `select count() from t1;` > read data from memory, but `select count() from t1 t;` read from HDFS instead. > Cached data is keyed by the logical plan and compared with sameResult, so when > a table is queried with an alias, its logical plan is not the same as the plan > without the alias. The fix is to change sameResult to compare only the database > name and table name.
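The fix described above can be illustrated with a toy plan-keyed cache (plain Python with illustrative names, not Spark's actual MetastoreRelation or cache manager):

```python
# Toy model: cached plans are looked up via same_result, which ignores
# the alias and compares only database and table name.
class TableScan:
    def __init__(self, db, table, alias=None):
        self.db, self.table, self.alias = db, table, alias

    def same_result(self, other):
        # An alias does not change the rows produced, so it is ignored.
        return (self.db, self.table) == (other.db, other.table)

cache = [TableScan("default", "t1")]   # effect of `cache table t1`

def lookup(plan):
    return any(c.same_result(plan) for c in cache)

print(lookup(TableScan("default", "t1", alias="t")))  # True: cache hit despite alias
print(lookup(TableScan("default", "t2")))             # False: different table
```

Before the fix, the lookup compared whole plans, so the aliased scan missed the cache and fell back to HDFS.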
[jira] [Created] (SPARK-31655) Upgrade snappy to version 1.1.7.5
angerszhu created SPARK-31655: - Summary: Upgrade snappy to version 1.1.7.5 Key: SPARK-31655 URL: https://issues.apache.org/jira/browse/SPARK-31655 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.0.0 Reporter: angerszhu Upgrade snappy to version 1.1.7.5
[jira] [Commented] (SPARK-29803) remove all instances of 'from __future__ import print_function'
[ https://issues.apache.org/jira/browse/SPARK-29803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101315#comment-17101315 ] Apache Spark commented on SPARK-29803: -- User 'tianshizz' has created a pull request for this issue: https://github.com/apache/spark/pull/28470 > remove all instances of 'from __future__ import print_function' > > > Key: SPARK-29803 > URL: https://issues.apache.org/jira/browse/SPARK-29803 > Project: Spark > Issue Type: Sub-task > Components: Build, PySpark, Tests >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Priority: Major > Attachments: print_function_list.txt > > > there are 135 python files in the spark repo that need to have `from > __future__ import print_function` removed (see attached file > 'print_function_list.txt'). >
[jira] [Assigned] (SPARK-29803) remove all instances of 'from __future__ import print_function'
[ https://issues.apache.org/jira/browse/SPARK-29803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-29803: Assignee: Apache Spark > remove all instances of 'from __future__ import print_function' > > > Key: SPARK-29803 > URL: https://issues.apache.org/jira/browse/SPARK-29803 > Project: Spark > Issue Type: Sub-task > Components: Build, PySpark, Tests >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Assignee: Apache Spark >Priority: Major > Attachments: print_function_list.txt > > > there are 135 python files in the spark repo that need to have `from > __future__ import print_function` removed (see attached file > 'print_function_list.txt'). >
[jira] [Assigned] (SPARK-29803) remove all instances of 'from __future__ import print_function'
[ https://issues.apache.org/jira/browse/SPARK-29803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-29803: Assignee: (was: Apache Spark) > remove all instances of 'from __future__ import print_function' > > > Key: SPARK-29803 > URL: https://issues.apache.org/jira/browse/SPARK-29803 > Project: Spark > Issue Type: Sub-task > Components: Build, PySpark, Tests >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Priority: Major > Attachments: print_function_list.txt > > > there are 135 python files in the spark repo that need to have `from > __future__ import print_function` removed (see attached file > 'print_function_list.txt'). >
[jira] [Assigned] (SPARK-29802) update remaining python scripts in repo to python3 shebang
[ https://issues.apache.org/jira/browse/SPARK-29802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-29802: Assignee: (was: Apache Spark) > update remaining python scripts in repo to python3 shebang > -- > > Key: SPARK-29802 > URL: https://issues.apache.org/jira/browse/SPARK-29802 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Priority: Major > > there are a bunch of scripts in the repo that need to have their shebang > updated to python3: > {noformat} > dev/create-release/releaseutils.py:#!/usr/bin/env python > dev/create-release/generate-contributors.py:#!/usr/bin/env python > dev/create-release/translate-contributors.py:#!/usr/bin/env python > dev/github_jira_sync.py:#!/usr/bin/env python > dev/merge_spark_pr.py:#!/usr/bin/env python > python/pyspark/version.py:#!/usr/bin/env python > python/pyspark/find_spark_home.py:#!/usr/bin/env python > python/setup.py:#!/usr/bin/env python{noformat}
[jira] [Commented] (SPARK-29802) update remaining python scripts in repo to python3 shebang
[ https://issues.apache.org/jira/browse/SPARK-29802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101310#comment-17101310 ] Apache Spark commented on SPARK-29802: -- User 'tianshizz' has created a pull request for this issue: https://github.com/apache/spark/pull/28469 > update remaining python scripts in repo to python3 shebang > -- > > Key: SPARK-29802 > URL: https://issues.apache.org/jira/browse/SPARK-29802 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Priority: Major > > there are a bunch of scripts in the repo that need to have their shebang > updated to python3: > {noformat} > dev/create-release/releaseutils.py:#!/usr/bin/env python > dev/create-release/generate-contributors.py:#!/usr/bin/env python > dev/create-release/translate-contributors.py:#!/usr/bin/env python > dev/github_jira_sync.py:#!/usr/bin/env python > dev/merge_spark_pr.py:#!/usr/bin/env python > python/pyspark/version.py:#!/usr/bin/env python > python/pyspark/find_spark_home.py:#!/usr/bin/env python > python/setup.py:#!/usr/bin/env python{noformat}
[jira] [Assigned] (SPARK-29802) update remaining python scripts in repo to python3 shebang
[ https://issues.apache.org/jira/browse/SPARK-29802?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-29802: Assignee: Apache Spark > update remaining python scripts in repo to python3 shebang > -- > > Key: SPARK-29802 > URL: https://issues.apache.org/jira/browse/SPARK-29802 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.0.0 >Reporter: Shane Knapp >Assignee: Apache Spark >Priority: Major > > there are a bunch of scripts in the repo that need to have their shebang > updated to python3: > {noformat} > dev/create-release/releaseutils.py:#!/usr/bin/env python > dev/create-release/generate-contributors.py:#!/usr/bin/env python > dev/create-release/translate-contributors.py:#!/usr/bin/env python > dev/github_jira_sync.py:#!/usr/bin/env python > dev/merge_spark_pr.py:#!/usr/bin/env python > python/pyspark/version.py:#!/usr/bin/env python > python/pyspark/find_spark_home.py:#!/usr/bin/env python > python/setup.py:#!/usr/bin/env python{noformat}
[jira] [Resolved] (SPARK-30659) LogisticRegression blockify input vectors
[ https://issues.apache.org/jira/browse/SPARK-30659?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-30659. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 28458 [https://github.com/apache/spark/pull/28458] > LogisticRegression blockify input vectors > - > > Key: SPARK-30659 > URL: https://issues.apache.org/jira/browse/SPARK-30659 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.0.0 >Reporter: zhengruifeng >Assignee: zhengruifeng >Priority: Minor > Fix For: 3.1.0 > >
[jira] [Resolved] (SPARK-31212) Failure of casting the '1000-02-29' string to the date type
[ https://issues.apache.org/jira/browse/SPARK-31212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-31212. -- Resolution: Won't Fix Won't fix in Spark 2.4.x. See also https://github.com/apache/spark/pull/28445#issuecomment-624455200 > Failure of casting the '1000-02-29' string to the date type > --- > > Key: SPARK-31212 > URL: https://issues.apache.org/jira/browse/SPARK-31212 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.5 >Reporter: Maxim Gekk >Priority: Major > > '1000-02-29' is a valid date in the Julian calendar, which Spark 2.4.5 uses for > dates before 1582-10-15, but casting the string to the date type fails: > {code:scala} > scala> val df = > Seq("1000-02-29").toDF("dateS").select($"dateS".cast("date").as("date")) > df: org.apache.spark.sql.DataFrame = [date: date] > scala> df.show > +----+ > |date| > +----+ > |null| > +----+ > {code} > Creating a dataset from java.sql.Date with the same input string works > correctly: > {code:scala} > scala> val df2 = > Seq(java.sql.Date.valueOf("1000-02-29")).toDF("dateS").select($"dateS".as("date")) > df2: org.apache.spark.sql.DataFrame = [date: date] > scala> df2.show > +----------+ > |      date| > +----------+ > |1000-02-29| > +----------+ > {code}
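The discrepancy comes down to leap-year rules: the year 1000 is a leap year under the Julian calendar but not under the proleptic Gregorian calendar, so whether 1000-02-29 exists depends on which calendar the parsing path uses. A quick check in plain Python:

```python
def is_leap_julian(year):
    # Julian rule: every 4th year is a leap year
    return year % 4 == 0

def is_leap_gregorian(year):
    # Proleptic Gregorian rule: century years must be divisible by 400
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

print(is_leap_julian(1000))     # True  -> 1000-02-29 exists
print(is_leap_gregorian(1000))  # False -> 1000-02-29 rejected
```

The string-cast path that returns null is applying the Gregorian rule, while java.sql.Date accepts the Julian-valid date.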
[jira] [Resolved] (SPARK-31647) Deprecate 'spark.sql.optimizer.metadataOnly'
[ https://issues.apache.org/jira/browse/SPARK-31647?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Takeshi Yamamuro resolved SPARK-31647. -- Fix Version/s: 3.0.0 Resolution: Fixed Resolved by https://github.com/apache/spark/pull/28459 > Deprecate 'spark.sql.optimizer.metadataOnly' > > > Key: SPARK-31647 > URL: https://issues.apache.org/jira/browse/SPARK-31647 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 3.0.0 > > > This optimization can cause a potential correctness issue, see also > SPARK-26709. > Also, it seems difficult to extend the optimization. Basically you should > whitelist all available functions. > Seems we should rather deprecate and remove this.
[jira] [Created] (SPARK-31654) sequence producing inconsistent intervals for month step
Roman Yalki created SPARK-31654: --- Summary: sequence producing inconsistent intervals for month step Key: SPARK-31654 URL: https://issues.apache.org/jira/browse/SPARK-31654 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.4.4 Reporter: Roman Yalki Taking an example from [https://spark.apache.org/docs/latest/api/sql/] {code:java} > SELECT sequence(to_date('2018-01-01'), to_date('2018-03-01'), interval 1 > month);{code} [2018-01-01,2018-02-01,2018-03-01] If one expands `stop` to the end of the year, some intervals are returned as the last day of the month, whereas the first day of the month is expected: {code:java} > SELECT sequence(to_date('2018-01-01'), to_date('2019-01-01'), interval 1 > month){code} [2018-01-01, 2018-02-01, 2018-03-01, *2018-03-31, 2018-04-30, 2018-05-31, 2018-06-30, 2018-07-31, 2018-08-31, 2018-09-30, 2018-10-31*, 2018-12-01, 2019-01-01]
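Under the expected semantics, every element keeps the start date's day-of-month. A small sketch of that expected behavior (hypothetical reference code in plain Python, not Spark's implementation):

```python
import calendar
from datetime import date

def month_sequence(start, stop):
    """Step one calendar month at a time, keeping the start's day-of-month
    and clamping it to the length of the target month."""
    out, y, m = [], start.year, start.month
    d = start
    while d <= stop:
        out.append(d)
        m += 1
        if m > 12:
            m, y = 1, y + 1
        # clamp the day so e.g. Jan 31 -> Feb 28 rather than overflowing
        d = date(y, m, min(start.day, calendar.monthrange(y, m)[1]))
    return out

seq = month_sequence(date(2018, 1, 1), date(2019, 1, 1))
print(len(seq))                      # 13
print(all(d.day == 1 for d in seq))  # True
```

For the reported input this yields the first of every month from 2018-01-01 through 2019-01-01, with no drift to month-end dates.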
[jira] [Commented] (SPARK-31365) Enable nested predicate pushdown per data sources
[ https://issues.apache.org/jira/browse/SPARK-31365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101237#comment-17101237 ] Apache Spark commented on SPARK-31365: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/28468 > Enable nested predicate pushdown per data sources > - > > Key: SPARK-31365 > URL: https://issues.apache.org/jira/browse/SPARK-31365 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: DB Tsai >Assignee: L. C. Hsieh >Priority: Blocker > Fix For: 3.0.0 > > > Currently, nested predicate pushdown is on or off for all data sources. We > should create configuration for each supported data source.
[jira] [Assigned] (SPARK-31653) setuptools needs to be installed before anything else
[ https://issues.apache.org/jira/browse/SPARK-31653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31653: Assignee: Holden Karau (was: Apache Spark) > setuptools needs to be installed before anything else > - > > Key: SPARK-31653 > URL: https://issues.apache.org/jira/browse/SPARK-31653 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.6 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Blocker > > One of the early packages we install as part of the release build in the > Dockerfile now requires setuptools to be pre-installed.
[jira] [Assigned] (SPARK-31653) setuptools needs to be installed before anything else
[ https://issues.apache.org/jira/browse/SPARK-31653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31653: Assignee: Apache Spark (was: Holden Karau) > setuptools needs to be installed before anything else > - > > Key: SPARK-31653 > URL: https://issues.apache.org/jira/browse/SPARK-31653 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.6 >Reporter: Holden Karau >Assignee: Apache Spark >Priority: Blocker > > One of the early packages we install as part of the release build in the > Dockerfile now requires setuptools to be pre-installed.
[jira] [Commented] (SPARK-31653) setuptools needs to be installed before anything else
[ https://issues.apache.org/jira/browse/SPARK-31653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101219#comment-17101219 ] Apache Spark commented on SPARK-31653: -- User 'holdenk' has created a pull request for this issue: https://github.com/apache/spark/pull/28467 > setuptools needs to be installed before anything else > - > > Key: SPARK-31653 > URL: https://issues.apache.org/jira/browse/SPARK-31653 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.6 >Reporter: Holden Karau >Priority: Blocker > > One of the early packages we install as part of the release build in the > Dockerfile now requires setuptools to be pre-installed.
[jira] [Assigned] (SPARK-31653) setuptools needs to be installed before anything else
[ https://issues.apache.org/jira/browse/SPARK-31653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau reassigned SPARK-31653: Assignee: Holden Karau > setuptools needs to be installed before anything else > - > > Key: SPARK-31653 > URL: https://issues.apache.org/jira/browse/SPARK-31653 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.6 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Blocker > > One of the early packages we install as part of the release build in the > Dockerfile now requires setuptools to be pre-installed.
[jira] [Resolved] (SPARK-31653) setuptools needs to be installed before anything else
[ https://issues.apache.org/jira/browse/SPARK-31653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau resolved SPARK-31653. -- Fix Version/s: 2.4.6 Resolution: Fixed > setuptools needs to be installed before anything else > - > > Key: SPARK-31653 > URL: https://issues.apache.org/jira/browse/SPARK-31653 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.6 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Blocker > Fix For: 2.4.6 > > > One of the early packages we install as part of the release build in the > Dockerfile now requires setuptools to be pre-installed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31653) setuptools needs to be installed before anything else
[ https://issues.apache.org/jira/browse/SPARK-31653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Holden Karau updated SPARK-31653: - Priority: Blocker (was: Major) > setuptools needs to be installed before anything else > - > > Key: SPARK-31653 > URL: https://issues.apache.org/jira/browse/SPARK-31653 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.4.6 >Reporter: Holden Karau >Priority: Blocker > > One of the early packages we install as part of the release build in the > Dockerfile now requires setuptools to be pre-installed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31653) setuptools needs to be installed before anything else
Holden Karau created SPARK-31653: Summary: setuptools needs to be installed before anything else Key: SPARK-31653 URL: https://issues.apache.org/jira/browse/SPARK-31653 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.4.6 Reporter: Holden Karau One of the early packages we install as part of the release build in the Dockerfile now requires setuptools to be pre-installed. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-29250) Upgrade to Hadoop 3.2.1
[ https://issues.apache.org/jira/browse/SPARK-29250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101110#comment-17101110 ] Dongjoon Hyun commented on SPARK-29250: --- Thanks for the update. For now, we are not aiming to downgrade Hadoop 3.2.x to Hadoop 3.1.x. If we find a way how to handle Apache Hive 1.2 and Hive 2.3's guava dependency, it will be with Hadoop 3.2+. Since Apache Hive community doesn't have any plan for Guava update in `branch-2.3`, Apache Spark community needs another way to handle them. For Apache Hive 1.2, I want to drop it as soon as possible after Apache Spark 3.0.0 release. It will narrow down this issue to Apache Hive 2.3's guava dependency. > Upgrade to Hadoop 3.2.1 > --- > > Key: SPARK-29250 > URL: https://issues.apache.org/jira/browse/SPARK-29250 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-29250) Upgrade to Hadoop 3.2.1
[ https://issues.apache.org/jira/browse/SPARK-29250?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101110#comment-17101110 ] Dongjoon Hyun edited comment on SPARK-29250 at 5/6/20, 7:25 PM: Thanks for the update. For now, we are not aiming to downgrade from Hadoop 3.2.x to Hadoop 3.1.x. If we find a way how to handle Apache Hive 1.2 and Hive 2.3's guava dependency, it will be with Hadoop 3.2+. Since Apache Hive community doesn't have any plan for Guava update in `branch-2.3`, Apache Spark community needs another way to handle them. For Apache Hive 1.2, I want to drop it as soon as possible after Apache Spark 3.0.0 release. It will narrow down this issue to Apache Hive 2.3's guava dependency. was (Author: dongjoon): Thanks for the update. For now, we are not aiming to downgrade Hadoop 3.2.x to Hadoop 3.1.x. If we find a way how to handle Apache Hive 1.2 and Hive 2.3's guava dependency, it will be with Hadoop 3.2+. Since Apache Hive community doesn't have any plan for Guava update in `branch-2.3`, Apache Spark community needs another way to handle them. For Apache Hive 1.2, I want to drop it as soon as possible after Apache Spark 3.0.0 release. It will narrow down this issue to Apache Hive 2.3's guava dependency. > Upgrade to Hadoop 3.2.1 > --- > > Key: SPARK-29250 > URL: https://issues.apache.org/jira/browse/SPARK-29250 > Project: Spark > Issue Type: Improvement > Components: Build >Affects Versions: 3.1.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31361) Rebase datetime in parquet/avro according to file metadata
[ https://issues.apache.org/jira/browse/SPARK-31361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101089#comment-17101089 ] Apache Spark commented on SPARK-31361: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/28466 > Rebase datetime in parquet/avro according to file metadata > -- > > Key: SPARK-31361 > URL: https://issues.apache.org/jira/browse/SPARK-31361 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Blocker > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5300) Spark loads file partitions in inconsistent order on native filesystems
[ https://issues.apache.org/jira/browse/SPARK-5300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100976#comment-17100976 ] Apache Spark commented on SPARK-5300: - User 'wetneb' has created a pull request for this issue: https://github.com/apache/spark/pull/28465 > Spark loads file partitions in inconsistent order on native filesystems > --- > > Key: SPARK-5300 > URL: https://issues.apache.org/jira/browse/SPARK-5300 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 1.1.0, 1.2.0 > Environment: Linux, EXT4, for example. >Reporter: Ewan Higgs >Priority: Major > > Discussed on user list in April 2014: > http://apache-spark-user-list.1001560.n3.nabble.com/Spark-reads-partitions-in-a-wrong-order-td4818.html > And on dev list January 2015: > http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-order-guarantees-td10142.html > When using a file system which isn't HDFS, file partitions ('part-0, > part-1', etc.) are not guaranteed to load in the same order. This means > previously sorted RDDs will be loaded out of order. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
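[Editor's note: the ordering problem in SPARK-5300 can be sketched in a few lines. This is a standalone illustration of the workaround people apply (explicitly sorting the part files before reading them), not Spark's own code; `ordered_part_files` is a hypothetical helper name.]

```python
# Illustration of the non-determinism described above: on non-HDFS
# filesystems (e.g. ext4), a directory listing need not come back in
# name order, so "part-00000", "part-00001", ... may be visited out of
# sequence and a previously sorted RDD is loaded out of order.
# A common workaround is to sort the part files explicitly.

def ordered_part_files(listing):
    """Return Spark part files from a raw directory listing in index order."""
    parts = [f for f in listing if f.startswith("part-")]
    return sorted(parts)

# A listing as ext4 might return it: arbitrary order, plus a marker file.
raw = ["part-00002", "_SUCCESS", "part-00000", "part-00001"]
print(ordered_part_files(raw))  # ['part-00000', 'part-00001', 'part-00002']
```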
[jira] [Commented] (SPARK-21770) ProbabilisticClassificationModel: Improve normalization of all-zero raw predictions
[ https://issues.apache.org/jira/browse/SPARK-21770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100948#comment-17100948 ] David Mavashev commented on SPARK-21770: Hi, I'm hitting the above issue, in which the whole job is failing because of a single row that gets a 0 vector probabilities: {code:java} class: SparkException, cause: Failed to execute user defined function($anonfun$2: (struct,values:array>) => struct,values:array>) org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 10251.0 failed 1 times, most recent failure: Lost task 5.0 in stage 10251.0 (TID 128916, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$2: (struct,values:array>) => struct,values:array>) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:972) at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:972) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:123) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at 
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.IllegalArgumentException: requirement failed: Can't normalize the 0-vector. at scala.Predef$.require(Predef.scala:224) at org.apache.spark.ml.classification.ProbabilisticClassificationModel$.normalizeToProbabilitiesInPlace(ProbabilisticClassifier.scala:244) at org.apache.spark.ml.classification.DecisionTreeClassificationModel.raw2probabilityInPlace(DecisionTreeClassifier.scala:198) at org.apache.spark.ml.classification.ProbabilisticClassificationModel.raw2probability(ProbabilisticClassifier.scala:172) at org.apache.spark.ml.classification.ProbabilisticClassificationModel$$anonfun$2.apply(ProbabilisticClassifier.scala:124) at org.apache.spark.ml.classification.ProbabilisticClassificationModel$$anonfun$2.apply(ProbabilisticClassifier.scala:124) ... 19 more {code} What should be the correct handling to make this work, this is randomly happening on models we generate with Random Forest Classifier. > ProbabilisticClassificationModel: Improve normalization of all-zero raw > predictions > --- > > Key: SPARK-21770 > URL: https://issues.apache.org/jira/browse/SPARK-21770 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.3.0 >Reporter: Siddharth Murching >Assignee: Weichen Xu >Priority: Minor > Fix For: 2.3.0 > > > Given an n-element raw prediction vector of all-zeros, > ProbabilisticClassifierModel.normalizeToProbabilitiesInPlace() should output > a probability vector of all-equal 1/n entries -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
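[Editor's note: the behavior SPARK-21770 asks for can be sketched as below. This is a plain-Python illustration of the requested normalization, not Spark's actual `normalizeToProbabilitiesInPlace` implementation (which, per the stack trace above, currently fails the `require` on an all-zero vector).]

```python
# Sketch of the requested fix: an all-zero raw prediction vector should
# normalize to a uniform probability vector of 1/n entries instead of
# raising "Can't normalize the 0-vector." and aborting the whole job.

def normalize_to_probabilities(raw):
    total = sum(raw)
    if total == 0.0:
        # Degenerate case: no class received any weight; fall back to a
        # uniform distribution rather than failing on this single row.
        return [1.0 / len(raw)] * len(raw)
    return [v / total for v in raw]

print(normalize_to_probabilities([2.0, 6.0]))        # [0.25, 0.75]
print(normalize_to_probabilities([0.0, 0.0]))        # [0.5, 0.5]
```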
[jira] [Comment Edited] (SPARK-21770) ProbabilisticClassificationModel: Improve normalization of all-zero raw predictions
[ https://issues.apache.org/jira/browse/SPARK-21770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100948#comment-17100948 ] David Mavashev edited comment on SPARK-21770 at 5/6/20, 4:24 PM: - Hi, Im using version 2.4.5, I'm hitting the above issue, in which the whole job is failing because of a single row that gets a 0 vector probabilities: {code:java} class: SparkException, cause: Failed to execute user defined function($anonfun$2: (struct,values:array>) => struct,values:array>) org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 10251.0 failed 1 times, most recent failure: Lost task 5.0 in stage 10251.0 (TID 128916, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$2: (struct,values:array>) => struct,values:array>) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:972) at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:972) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2101) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:123) at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408) at 
org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.IllegalArgumentException: requirement failed: Can't normalize the 0-vector. at scala.Predef$.require(Predef.scala:224) at org.apache.spark.ml.classification.ProbabilisticClassificationModel$.normalizeToProbabilitiesInPlace(ProbabilisticClassifier.scala:244) at org.apache.spark.ml.classification.DecisionTreeClassificationModel.raw2probabilityInPlace(DecisionTreeClassifier.scala:198) at org.apache.spark.ml.classification.ProbabilisticClassificationModel.raw2probability(ProbabilisticClassifier.scala:172) at org.apache.spark.ml.classification.ProbabilisticClassificationModel$$anonfun$2.apply(ProbabilisticClassifier.scala:124) at org.apache.spark.ml.classification.ProbabilisticClassificationModel$$anonfun$2.apply(ProbabilisticClassifier.scala:124) ... 19 more {code} What should be the correct handling to make this work, this is randomly happening on models we generate with Random Forest Classifier. 
was (Author: davidmav86): Hi, I'm hitting the above issue, in which the whole job is failing because of a single row that gets a 0 vector probabilities: {code:java} class: SparkException, cause: Failed to execute user defined function($anonfun$2: (struct,values:array>) => struct,values:array>) org.apache.spark.SparkException: Job aborted due to stage failure: Task 5 in stage 10251.0 failed 1 times, most recent failure: Lost task 5.0 in stage 10251.0 (TID 128916, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$2: (struct,values:array>) => struct,values:array>) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source) at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408) at scala.collection.Iterator$class.foreach(Iterator.scala:893) at scala.collection.AbstractIterator.foreach(Iterator.scala:1336) at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$27.apply(RDD.scala:972) at
[jira] [Assigned] (SPARK-31652) Add ANOVASelector and FValueSelector to PySpark
[ https://issues.apache.org/jira/browse/SPARK-31652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31652: Assignee: Apache Spark > Add ANOVASelector and FValueSelector to PySpark > --- > > Key: SPARK-31652 > URL: https://issues.apache.org/jira/browse/SPARK-31652 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Major > > Add ANOVASelector and FValueSelector to PySpark -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31652) Add ANOVASelector and FValueSelector to PySpark
[ https://issues.apache.org/jira/browse/SPARK-31652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31652: Assignee: (was: Apache Spark) > Add ANOVASelector and FValueSelector to PySpark > --- > > Key: SPARK-31652 > URL: https://issues.apache.org/jira/browse/SPARK-31652 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Major > > Add ANOVASelector and FValueSelector to PySpark -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31652) Add ANOVASelector and FValueSelector to PySpark
[ https://issues.apache.org/jira/browse/SPARK-31652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100940#comment-17100940 ] Apache Spark commented on SPARK-31652: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/28464 > Add ANOVASelector and FValueSelector to PySpark > --- > > Key: SPARK-31652 > URL: https://issues.apache.org/jira/browse/SPARK-31652 > Project: Spark > Issue Type: Sub-task > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Priority: Major > > Add ANOVASelector and FValueSelector to PySpark -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31652) Add ANOVASelector and FValueSelector to PySpark
Huaxin Gao created SPARK-31652: -- Summary: Add ANOVASelector and FValueSelector to PySpark Key: SPARK-31652 URL: https://issues.apache.org/jira/browse/SPARK-31652 Project: Spark Issue Type: Sub-task Components: ML, PySpark Affects Versions: 3.1.0 Reporter: Huaxin Gao Add ANOVASelector and FValueSelector to PySpark -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
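[Editor's note: for context on SPARK-31652, the statistic an ANOVA-based selector ranks features by can be sketched in pure Python. This illustrates the one-way ANOVA F-statistic itself; it is not the PySpark `ANOVASelector`/`FValueSelector` API this ticket adds.]

```python
# One-way ANOVA F-statistic for a single feature, split by class label.
# A selector ranks features by this value (larger F = feature means
# differ more across classes relative to within-class variation).

def anova_f(groups):
    """groups: list of per-class lists of feature values."""
    k = len(groups)                          # number of classes
    n = sum(len(g) for g in groups)          # total number of samples
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares (explained), k-1 degrees of freedom.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    # Within-group sum of squares (residual), n-k degrees of freedom.
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Two well-separated classes give a large F value.
print(anova_f([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]))  # 13.5
```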
[jira] [Commented] (SPARK-19371) Cannot spread cached partitions evenly across executors
[ https://issues.apache.org/jira/browse/SPARK-19371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100878#comment-17100878 ] Thunder Stumpges commented on SPARK-19371: -- Thank you for the comments [~honor], [~danmeany], and [~lebedev] ! I am glad to see there are others with this issue. We have had to "just live with it" for these years. And this job is STILL running in production, every 10 seconds, wasting computing resources and time due to imbalanced cached partitions across the executors. > Cannot spread cached partitions evenly across executors > --- > > Key: SPARK-19371 > URL: https://issues.apache.org/jira/browse/SPARK-19371 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1 >Reporter: Thunder Stumpges >Priority: Major > Labels: bulk-closed > Attachments: RDD Block Distribution on two executors.png, Unbalanced > RDD Blocks, and resulting task imbalance.png, Unbalanced RDD Blocks, and > resulting task imbalance.png, execution timeline.png > > > Before running an intensive iterative job (in this case a distributed topic > model training), we need to load a dataset and persist it across executors. > After loading from HDFS and persisting, the partitions are spread unevenly > across executors (based on the initial scheduling of the reads which are not > data locale sensitive). The partition sizes are even, just not their > distribution over executors. We currently have no way to force the partitions > to spread evenly, and as the iterative algorithm begins, tasks are > distributed to executors based on this initial load, forcing some very > unbalanced work. 
> This has been mentioned a > [number|http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-Partitions-not-distributed-evenly-to-executors-tt16988.html#a17059] > of > [times|http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tt26502.html] > in > [various|http://apache-spark-user-list.1001560.n3.nabble.com/Partitions-are-get-placed-on-the-single-node-tt26597.html] > user/dev group threads. > None of the discussions I could find had solutions that worked for me. Here > are examples of things I have tried. All resulted in partitions in memory > that were NOT evenly distributed to executors, causing future tasks to be > imbalanced across executors as well. > *Reduce Locality* > {code}spark.shuffle.reduceLocality.enabled=false/true{code} > *"Legacy" memory mode* > {code}spark.memory.useLegacyMode = true/false{code} > *Basic load and repartition* > {code} > val numPartitions = 48*16 > val df = sqlContext.read. > parquet("/data/folder_to_load"). > repartition(numPartitions). > persist > df.count > {code} > *Load and repartition to 2x partitions, then shuffle repartition down to > desired partitions* > {code} > val numPartitions = 48*16 > val df2 = sqlContext.read. > parquet("/data/folder_to_load"). > repartition(numPartitions*2) > val df = df2.repartition(numPartitions). > persist > df.count > {code} > It would be great if when persisting an RDD/DataFrame, if we could request > that those partitions be stored evenly across executors in preparation for > future tasks. > I'm not sure if this is a more general issue (I.E. not just involving > persisting RDDs), but for the persisted in-memory case, it can make a HUGE > difference in the over-all running time of the remaining work. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31609) Add VarianceThresholdSelector to PySpark
[ https://issues.apache.org/jira/browse/SPARK-31609?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-31609. -- Fix Version/s: 3.1.0 Assignee: Huaxin Gao Resolution: Fixed Resolved by https://github.com/apache/spark/pull/28409 > Add VarianceThresholdSelector to PySpark > > > Key: SPARK-31609 > URL: https://issues.apache.org/jira/browse/SPARK-31609 > Project: Spark > Issue Type: New Feature > Components: ML, PySpark >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Minor > Fix For: 3.1.0 > > > Add VarianceThresholdSelector to PySpark -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
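[Editor's note: the idea behind the `VarianceThresholdSelector` added in SPARK-31609 can be sketched conceptually: drop feature columns whose variance does not exceed a threshold. The snippet below is a plain-Python illustration using population variance, not the PySpark API.]

```python
# Keep only the feature columns whose variance exceeds a threshold;
# near-constant columns carry little information and are dropped.

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)   # population variance

def select_by_variance(rows, threshold):
    """Return indices of feature columns with variance above threshold."""
    cols = list(zip(*rows))                          # column-major view
    return [i for i, c in enumerate(cols) if variance(c) > threshold]

rows = [[6.0, 7.0, 0.0],
        [0.0, 9.0, 6.0],
        [0.0, 9.0, 3.0],
        [0.0, 9.0, 8.0]]
print(select_by_variance(rows, 1.0))  # [0, 2]  (column 1 is nearly constant)
```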
[jira] [Comment Edited] (SPARK-31648) Filtering is supported only on partition keys of type string Issue
[ https://issues.apache.org/jira/browse/SPARK-31648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100820#comment-17100820 ] angerszhu edited comment on SPARK-31648 at 5/6/20, 1:51 PM: [~Rajesh Tadi] Can you show your reproduce detail code process? thanks I can't reproduce this in 2.4.0 and master branch was (Author: angerszhuuu): [~Rajesh Tadi] Can you show your reproduce detail code process? I can't reproduce this in 2.4.0 and master branch > Filtering is supported only on partition keys of type string Issue > -- > > Key: SPARK-31648 > URL: https://issues.apache.org/jira/browse/SPARK-31648 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Rajesh Tadi >Priority: Major > Attachments: Spark Bug.txt > > > When I submit a SQL with partition filter I see the below error. I tried > setting Spark Configuration spark.sql.hive.manageFilesourcePartitions to > false but I still see the same issue. > java.lang.RuntimeException: Caught Hive MetaException attempting to get > partition metadata by filter from Hive. > java.lang.reflect.InvocationTargetException: > org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported > only on partition keys of type string > org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported > only on partition keys of type string > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31648) Filtering is supported only on partition keys of type string Issue
[ https://issues.apache.org/jira/browse/SPARK-31648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100820#comment-17100820 ] angerszhu commented on SPARK-31648: --- [~Rajesh Tadi] Can you show your reproduce detail code process? I can't reproduce this in 2.4.0 and master branch > Filtering is supported only on partition keys of type string Issue > -- > > Key: SPARK-31648 > URL: https://issues.apache.org/jira/browse/SPARK-31648 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Rajesh Tadi >Priority: Major > Attachments: Spark Bug.txt > > > When I submit a SQL with partition filter I see the below error. I tried > setting Spark Configuration spark.sql.hive.manageFilesourcePartitions to > false but I still see the same issue. > java.lang.RuntimeException: Caught Hive MetaException attempting to get > partition metadata by filter from Hive. > java.lang.reflect.InvocationTargetException: > org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported > only on partition keys of type string > org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported > only on partition keys of type string > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20628) Keep track of nodes which are going to be shut down & avoid scheduling new tasks
[ https://issues.apache.org/jira/browse/SPARK-20628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100793#comment-17100793 ] wuyi commented on SPARK-20628: -- Hi [~holden] is this ticket resolved by [https://github.com/apache/spark/pull/26440]? > Keep track of nodes which are going to be shut down & avoid scheduling new > tasks > > > Key: SPARK-20628 > URL: https://issues.apache.org/jira/browse/SPARK-20628 > Project: Spark > Issue Type: Sub-task > Components: Spark Core >Affects Versions: 2.2.0, 2.3.0 >Reporter: Holden Karau >Assignee: Holden Karau >Priority: Major > Fix For: 3.1.0 > > > Keep track of nodes which are going to be shut down. We considered adding > this for YARN but took a different approach, for instances where we can't > control instance termination though (EC2, GCE, etc.) this may make more sense. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31648) Filtering is supported only on partition keys of type string Issue
[ https://issues.apache.org/jira/browse/SPARK-31648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100789#comment-17100789 ] Rajesh Tadi commented on SPARK-31648: - [~angerszhuuu] I have tried creating a table using Spark-SQL and Dataframes in Scala as well. Both the ways I see the same issue. > Filtering is supported only on partition keys of type string Issue > -- > > Key: SPARK-31648 > URL: https://issues.apache.org/jira/browse/SPARK-31648 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Rajesh Tadi >Priority: Major > Attachments: Spark Bug.txt > > > When I submit a SQL with partition filter I see the below error. I tried > setting Spark Configuration spark.sql.hive.manageFilesourcePartitions to > false but I still see the same issue. > java.lang.RuntimeException: Caught Hive MetaException attempting to get > partition metadata by filter from Hive. > java.lang.reflect.InvocationTargetException: > org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported > only on partition keys of type string > org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported > only on partition keys of type string > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31650) SQL UI doesn't show metrics and whole stage codegen in AQE
[ https://issues.apache.org/jira/browse/SPARK-31650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-31650. - Fix Version/s: 3.0.0 Resolution: Fixed Issue resolved by pull request 28460 [https://github.com/apache/spark/pull/28460] > SQL UI doesn't show metrics and whole stage codegen in AQE > -- > > Key: SPARK-31650 > URL: https://issues.apache.org/jira/browse/SPARK-31650 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Fix For: 3.0.0 > > Attachments: before_aqe_ui.png > > > When enabling AQE with subqueries within a query, the SQL UI may not show > metrics and whole stage codegen. > Here's a demo to reproduce it: > > {code:java} > spark.range(1).toDF("value").write.parquet("/tmp/p1") > spark.range(1).toDF("value").write.parquet("/tmp/p2") > spark.read.parquet("/tmp/p1").createOrReplaceTempView("t1") > spark.read.parquet("/tmp/p2").createOrReplaceTempView("t2") > spark.sql("select * from t1 where value=(select Max(value) from t2)").show() > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31650) SQL UI doesn't show metrics and whole stage codegen in AQE
[ https://issues.apache.org/jira/browse/SPARK-31650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-31650: --- Assignee: wuyi > SQL UI doesn't show metrics and whole stage codegen in AQE > -- > > Key: SPARK-31650 > URL: https://issues.apache.org/jira/browse/SPARK-31650 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: wuyi >Priority: Major > Attachments: before_aqe_ui.png > > > When enabling AQE with subqueries within a query, the SQL UI may not show > metrics and whole stage codegen. > Here's a demo to reproduce it: > > {code:java} > spark.range(1).toDF("value").write.parquet("/tmp/p1") > spark.range(1).toDF("value").write.parquet("/tmp/p2") > spark.read.parquet("/tmp/p1").createOrReplaceTempView("t1") > spark.read.parquet("/tmp/p2").createOrReplaceTempView("t2") > spark.sql("select * from t1 where value=(select Max(value) from t2)").show() > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31399) Closure cleaner broken in Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100721#comment-17100721 ] Apache Spark commented on SPARK-31399: -- User 'rednaxelafx' has created a pull request for this issue: https://github.com/apache/spark/pull/28463 > Closure cleaner broken in Scala 2.12 > > > Key: SPARK-31399 > URL: https://issues.apache.org/jira/browse/SPARK-31399 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.5, 3.0.0 >Reporter: Wenchen Fan >Assignee: Kris Mok >Priority: Blocker > > The `ClosureCleaner` only supports Scala functions, and it uses the following > check to detect closures: > {code} > // Check whether a class represents a Scala closure > private def isClosure(cls: Class[_]): Boolean = { > cls.getName.contains("$anonfun$") > } > {code} > This no longer works in 3.0, as we upgraded to Scala 2.12, where most Scala > functions become Java lambdas. > As an example, the following code works well in the Spark 2.4 shell: > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. > import org.apache.spark.sql.functions.lit > defined class Foo > col: org.apache.spark.sql.Column = 123 > df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20 > {code} > But it fails in 3.0: > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. 
> org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2371) > at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:422) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) > at org.apache.spark.rdd.RDD.map(RDD.scala:421) > ... 39 elided > Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column > Serialization stack: > - object not serializable (class: org.apache.spark.sql.Column, value: > 123) > - field (class: $iw, name: col, type: class org.apache.spark.sql.Column) > - object (class $iw, $iw@2d87ac2b) > - element of array (index: 0) > - array (class [Ljava.lang.Object;, size 1) > - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, > type: class [Ljava.lang.Object;) > - object (class java.lang.invoke.SerializedLambda, > SerializedLambda[capturingClass=class $iw, > functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, > implementation=invokeStatic > $anonfun$df$1$adapted:(L$iw;Ljava/lang/Object;)LFoo;, > instantiatedMethodType=(Ljava/lang/Object;)LFoo;, numCaptured=1]) > - writeReplace data (class: java.lang.invoke.SerializedLambda) > - object (class $Lambda$2438/170049100, $Lambda$2438/170049100@d6b8c43) > at > org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101) > at > 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:393) > ... 47 more > {code} > **Apache Spark 2.4.5 with Scala 2.12** > {code} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.4.5 > /_/ > Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242) > Type in expressions to have them evaluated. > Type :help for more information. > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:393) > at
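The serialization failure quoted above is easy to reproduce outside Spark: any closure object that captures a non-serializable value fails the same way as soon as the serializer tries to round-trip it. A minimal plain-Python sketch of the `ensureSerializable` idea (the `Closure` class and `pickle` here stand in for Spark's lambda capture and `JavaSerializer`; these names are illustrative, not Spark's API):

```python
import pickle
import threading

class Closure:
    """Stands in for a Java lambda whose captured environment gets serialized."""
    def __init__(self, captured):
        self.captured = captured  # analogous to SerializedLambda.capturedArgs

    def __call__(self, x):
        return (x, self.captured)

def ensure_serializable(fn):
    """Mimics ClosureCleaner.ensureSerializable: try to round-trip the closure."""
    try:
        pickle.dumps(fn)
    except TypeError as exc:
        raise RuntimeError("Task not serializable") from exc

ensure_serializable(Closure("123"))  # a captured string is fine

try:
    # a lock is unpicklable, playing the role of the captured sql.Column
    ensure_serializable(Closure(threading.Lock()))
except RuntimeError as err:
    print(err)
```

As in the stack trace above, the useful debugging signal is which captured field is unserializable, which is exactly what Spark's `SerializationDebugger` output enumerates.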
[jira] [Assigned] (SPARK-31399) Closure cleaner broken in Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31399: Assignee: Apache Spark (was: Kris Mok) > Closure cleaner broken in Scala 2.12 > > > Key: SPARK-31399 > URL: https://issues.apache.org/jira/browse/SPARK-31399 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.5, 3.0.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Blocker > > The `ClosureCleaner` only supports Scala functions, and it uses the following > check to detect closures: > {code} > // Check whether a class represents a Scala closure > private def isClosure(cls: Class[_]): Boolean = { > cls.getName.contains("$anonfun$") > } > {code} > This no longer works in 3.0, as we upgraded to Scala 2.12, where most Scala > functions become Java lambdas. > As an example, the following code works well in the Spark 2.4 shell: > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. > import org.apache.spark.sql.functions.lit > defined class Foo > col: org.apache.spark.sql.Column = 123 > df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20 > {code} > But it fails in 3.0: > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. 
> org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2371) > at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:422) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) > at org.apache.spark.rdd.RDD.map(RDD.scala:421) > ... 39 elided > Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column > Serialization stack: > - object not serializable (class: org.apache.spark.sql.Column, value: > 123) > - field (class: $iw, name: col, type: class org.apache.spark.sql.Column) > - object (class $iw, $iw@2d87ac2b) > - element of array (index: 0) > - array (class [Ljava.lang.Object;, size 1) > - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, > type: class [Ljava.lang.Object;) > - object (class java.lang.invoke.SerializedLambda, > SerializedLambda[capturingClass=class $iw, > functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, > implementation=invokeStatic > $anonfun$df$1$adapted:(L$iw;Ljava/lang/Object;)LFoo;, > instantiatedMethodType=(Ljava/lang/Object;)LFoo;, numCaptured=1]) > - writeReplace data (class: java.lang.invoke.SerializedLambda) > - object (class $Lambda$2438/170049100, $Lambda$2438/170049100@d6b8c43) > at > org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101) > at > 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:393) > ... 47 more > {code} > **Apache Spark 2.4.5 with Scala 2.12** > {code} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.4.5 > /_/ > Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242) > Type in expressions to have them evaluated. > Type :help for more information. > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:393) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) > at
[jira] [Assigned] (SPARK-31399) Closure cleaner broken in Scala 2.12
[ https://issues.apache.org/jira/browse/SPARK-31399?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31399: Assignee: Kris Mok (was: Apache Spark) > Closure cleaner broken in Scala 2.12 > > > Key: SPARK-31399 > URL: https://issues.apache.org/jira/browse/SPARK-31399 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.4.5, 3.0.0 >Reporter: Wenchen Fan >Assignee: Kris Mok >Priority: Blocker > > The `ClosureCleaner` only supports Scala functions, and it uses the following > check to detect closures: > {code} > // Check whether a class represents a Scala closure > private def isClosure(cls: Class[_]): Boolean = { > cls.getName.contains("$anonfun$") > } > {code} > This no longer works in 3.0, as we upgraded to Scala 2.12, where most Scala > functions become Java lambdas. > As an example, the following code works well in the Spark 2.4 shell: > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. > import org.apache.spark.sql.functions.lit > defined class Foo > col: org.apache.spark.sql.Column = 123 > df: org.apache.spark.rdd.RDD[Foo] = MapPartitionsRDD[5] at map at :20 > {code} > But it fails in 3.0: > {code} > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. 
> org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:396) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:386) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159) > at org.apache.spark.SparkContext.clean(SparkContext.scala:2371) > at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:422) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112) > at org.apache.spark.rdd.RDD.withScope(RDD.scala:414) > at org.apache.spark.rdd.RDD.map(RDD.scala:421) > ... 39 elided > Caused by: java.io.NotSerializableException: org.apache.spark.sql.Column > Serialization stack: > - object not serializable (class: org.apache.spark.sql.Column, value: > 123) > - field (class: $iw, name: col, type: class org.apache.spark.sql.Column) > - object (class $iw, $iw@2d87ac2b) > - element of array (index: 0) > - array (class [Ljava.lang.Object;, size 1) > - field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, > type: class [Ljava.lang.Object;) > - object (class java.lang.invoke.SerializedLambda, > SerializedLambda[capturingClass=class $iw, > functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, > implementation=invokeStatic > $anonfun$df$1$adapted:(L$iw;Ljava/lang/Object;)LFoo;, > instantiatedMethodType=(Ljava/lang/Object;)LFoo;, numCaptured=1]) > - writeReplace data (class: java.lang.invoke.SerializedLambda) > - object (class $Lambda$2438/170049100, $Lambda$2438/170049100@d6b8c43) > at > org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41) > at > org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47) > at > org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101) > at > 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:393) > ... 47 more > {code} > **Apache Spark 2.4.5 with Scala 2.12** > {code} > Welcome to > __ > / __/__ ___ _/ /__ > _\ \/ _ \/ _ `/ __/ '_/ >/___/ .__/\_,_/_/ /_/\_\ version 2.4.5 > /_/ > Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_242) > Type in expressions to have them evaluated. > Type :help for more information. > scala> :pa > // Entering paste mode (ctrl-D to finish) > import org.apache.spark.sql.functions.lit > case class Foo(id: String) > val col = lit("123") > val df = sc.range(0,10,1,1).map { _ => Foo("") } > // Exiting paste mode, now interpreting. > org.apache.spark.SparkException: Task not serializable > at > org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:403) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:393) > at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162) > at
[jira] [Commented] (SPARK-31648) Filtering is supported only on partition keys of type string Issue
[ https://issues.apache.org/jira/browse/SPARK-31648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100639#comment-17100639 ] angerszhu commented on SPARK-31648: --- [~Rajesh Tadi] Is there any way to reproduce this bug? > Filtering is supported only on partition keys of type string Issue > -- > > Key: SPARK-31648 > URL: https://issues.apache.org/jira/browse/SPARK-31648 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Rajesh Tadi >Priority: Major > Attachments: Spark Bug.txt > > > When I submit a SQL with partition filter I see the below error. I tried > setting Spark Configuration spark.sql.hive.manageFilesourcePartitions to > false but I still see the same issue. > java.lang.RuntimeException: Caught Hive MetaException attempting to get > partition metadata by filter from Hive. > java.lang.reflect.InvocationTargetException: > org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported > only on partition keys of type string > org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported > only on partition keys of type string > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31648) Filtering is supported only on partition keys of type string Issue
[ https://issues.apache.org/jira/browse/SPARK-31648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100629#comment-17100629 ] Rajesh Tadi edited comment on SPARK-31648 at 5/6/20, 9:52 AM: -- [~angerszhuuu] Below is the SQL I have used. select * from testdb.partbuck_test where country_cd='India'; My table structure will look similar as below. Schema: ||col_name||data_type||comment|| |ID|bigint|null| |NAME|string|null| |COUNTRY_CD|string|null| |# Partition Information| | | |# col_name|data_type|comment| |COUNTRY_CD|string|null| was (Author: rajesh tadi): [~angerszhuuu] Below is the SQL I have used. select * from testdb.partbuck_test where country_cd='India'; My table structure will look similar as below. Schema: ++--+--+ |col_name |data_type |comment| ++--+--+ |ID |bigint |null | |NAME |string |null | |. |... |null | |. |... |null | |. |... |null | |COUNTRY_CD |string |null | |# Partition Information | | | |# col_name |data_type |comment| |COUNTRY_CD |string |null | ++--+--+ > Filtering is supported only on partition keys of type string Issue > -- > > Key: SPARK-31648 > URL: https://issues.apache.org/jira/browse/SPARK-31648 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Rajesh Tadi >Priority: Major > Attachments: Spark Bug.txt > > > When I submit a SQL with partition filter I see the below error. I tried > setting Spark Configuration spark.sql.hive.manageFilesourcePartitions to > false but I still see the same issue. > java.lang.RuntimeException: Caught Hive MetaException attempting to get > partition metadata by filter from Hive. 
> java.lang.reflect.InvocationTargetException: > org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported > only on partition keys of type string > org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported > only on partition keys of type string > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
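For context on the MetaException above: the Hive metastore can evaluate pushed-down partition predicates only for string-typed partition keys, so filters on other key types bounce back as this error. A hedged sketch (plain Python, hypothetical data, not Spark's actual code path) of the fallback idea of fetching all partitions and filtering client-side:

```python
# Hypothetical partition metadata; Hive stores partition values as strings
# regardless of the declared column type.
partitions = [
    {"year": "2019", "country_cd": "India"},
    {"year": "2020", "country_cd": "India"},
    {"year": "2020", "country_cd": "US"},
]

def coerce(raw, key_type):
    """Convert the stored string value back to the declared partition key type."""
    return int(raw) if key_type == "int" else raw

def get_partitions_by_filter(parts, key, value, key_type="string"):
    """Only string keys can be filtered by the metastore; for other types,
    fetch everything and filter locally after coercing the stored strings."""
    if key_type == "string":
        # pretend the metastore evaluated this predicate for us
        return [p for p in parts if p[key] == value]
    # fallback path: all partitions were fetched; filter client-side
    return [p for p in parts if coerce(p[key], key_type) == value]

print(get_partitions_by_filter(partitions, "country_cd", "India"))
print(get_partitions_by_filter(partitions, "year", 2020, key_type="int"))
```

Fetching every partition avoids the MetaException at the cost of listing all partition metadata, which is why pushdown on string keys is preferred when the schema allows it.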
[jira] [Assigned] (SPARK-31651) Improve handling the case where different barrier sync types in a single sync
[ https://issues.apache.org/jira/browse/SPARK-31651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31651: Assignee: Apache Spark > Improve handling the case where different barrier sync types in a single sync > - > > Key: SPARK-31651 > URL: https://issues.apache.org/jira/browse/SPARK-31651 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > > Currently, we use cleanupBarrierStage when detecting different barrier sync > types in a single sync. This causes a problem: a new `ContextBarrierState` > can be created again if there are requesters still on the way, and those > corresponding tasks will fail because they were killed, not because different > barrier sync types were detected. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31651) Improve handling the case where different barrier sync types in a single sync
[ https://issues.apache.org/jira/browse/SPARK-31651?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100635#comment-17100635 ] Apache Spark commented on SPARK-31651: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/28462 > Improve handling the case where different barrier sync types in a single sync > - > > Key: SPARK-31651 > URL: https://issues.apache.org/jira/browse/SPARK-31651 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: wuyi >Priority: Major > > Currently, we use cleanupBarrierStage when detecting different barrier sync > types in a single sync. This causes a problem: a new `ContextBarrierState` > can be created again if there are requesters still on the way, and those > corresponding tasks will fail because they were killed, not because different > barrier sync types were detected. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31651) Improve handling the case where different barrier sync types in a single sync
[ https://issues.apache.org/jira/browse/SPARK-31651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31651: Assignee: (was: Apache Spark) > Improve handling the case where different barrier sync types in a single sync > - > > Key: SPARK-31651 > URL: https://issues.apache.org/jira/browse/SPARK-31651 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: wuyi >Priority: Major > > Currently, we use cleanupBarrierStage when detecting different barrier sync > types in a single sync. This causes a problem: a new `ContextBarrierState` > can be created again if there are requesters still on the way, and those > corresponding tasks will fail because they were killed, not because different > barrier sync types were detected. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
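The improvement described above amounts to failing only the requesters of the mismatched sync, with a clear error, instead of cleaning up the whole barrier stage and letting a fresh `ContextBarrierState` be recreated for late requesters. A rough Python sketch of that behavior (class and sync-type names are illustrative, not Spark's actual API):

```python
class MixedSyncTypesError(Exception):
    pass

class BarrierState:
    """Collects sync requests for one barrier; all must use the same sync type."""
    def __init__(self, num_tasks):
        self.num_tasks = num_tasks
        self.requests = []  # list of (task_id, sync_type)

    def request_sync(self, task_id, sync_type):
        self.requests.append((task_id, sync_type))
        types = {t for _, t in self.requests}
        if len(types) > 1:
            # Fail every collected requester with the real reason, rather than
            # killing the stage and leaving room for the state to be recreated
            # by requesters that are still on the way.
            failed = [tid for tid, _ in self.requests]
            self.requests.clear()
            raise MixedSyncTypesError(
                f"different barrier sync types in a single sync: {sorted(types)}; "
                f"failing requesters {failed}")
        if len(self.requests) == self.num_tasks:
            done = [tid for tid, _ in self.requests]
            self.requests.clear()
            return done  # sync completed
        return None  # still waiting for more requesters

barrier = BarrierState(num_tasks=2)
assert barrier.request_sync(0, "BARRIER") is None
try:
    barrier.request_sync(1, "ALL_GATHER")
except MixedSyncTypesError as err:
    print(err)
```

The key design point is that the error names the detected condition, so the failed tasks report "mixed sync types" instead of an opaque kill reason.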
[jira] [Commented] (SPARK-31648) Filtering is supported only on partition keys of type string Issue
[ https://issues.apache.org/jira/browse/SPARK-31648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100629#comment-17100629 ] Rajesh Tadi commented on SPARK-31648: - [~angerszhuuu] Below is the SQL I have used. select * from testdb.partbuck_test where country_cd='India'; My table structure will look similar as below. Schema: ++--+--+ |col_name |data_type |comment| ++--+--+ |ID |bigint |null | |NAME |string |null | |. |... |null | |. |... |null | |. |... |null | |COUNTRY_CD |string |null | |# Partition Information | | | |# col_name |data_type |comment| |COUNTRY_CD |string |null | ++--+--+ > Filtering is supported only on partition keys of type string Issue > -- > > Key: SPARK-31648 > URL: https://issues.apache.org/jira/browse/SPARK-31648 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Rajesh Tadi >Priority: Major > Attachments: Spark Bug.txt > > > When I submit a SQL with partition filter I see the below error. I tried > setting Spark Configuration spark.sql.hive.manageFilesourcePartitions to > false but I still see the same issue. > java.lang.RuntimeException: Caught Hive MetaException attempting to get > partition metadata by filter from Hive. > java.lang.reflect.InvocationTargetException: > org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported > only on partition keys of type string > org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported > only on partition keys of type string > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-31470) Introduce SORTED BY clause in CREATE TABLE statement
[ https://issues.apache.org/jira/browse/SPARK-31470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100620#comment-17100620 ] Vikas Reddy Aravabhumi edited comment on SPARK-31470 at 5/6/20, 9:42 AM: - [~yumwang] Could you please let us know the ETA for this fix? was (Author: vikasreddy): [~yumwang] Could you please let us know the ETA of this fix? > Introduce SORTED BY clause in CREATE TABLE statement > > > Key: SPARK-31470 > URL: https://issues.apache.org/jira/browse/SPARK-31470 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > We usually sort on frequently filtered columns when writing data to improve > query performance. But this info is missing from the table information. > > {code:sql} > CREATE TABLE t(day INT, hour INT, year INT, month INT) > USING parquet > PARTITIONED BY (year, month) > SORTED BY (day, hour); > {code} > > Impala, Oracle and redshift support this clause: > https://issues.apache.org/jira/browse/IMPALA-4166 > https://docs.oracle.com/database/121/DWHSG/attcluster.htm#GUID-DAECFBC5-FD1A-45A5-8C2C-DC9884D0857B > https://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data-compare-sort-styles.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31470) Introduce SORTED BY clause in CREATE TABLE statement
[ https://issues.apache.org/jira/browse/SPARK-31470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yuming Wang updated SPARK-31470: Description: We usually sort on frequently filtered columns when writing data to improve query performance. But this info is missing from the table information. {code:sql} CREATE TABLE t(day INT, hour INT, year INT, month INT) USING parquet PARTITIONED BY (year, month) SORTED BY (day, hour); {code} Impala, Oracle and redshift support this clause: https://issues.apache.org/jira/browse/IMPALA-4166 https://docs.oracle.com/database/121/DWHSG/attcluster.htm#GUID-DAECFBC5-FD1A-45A5-8C2C-DC9884D0857B https://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data-compare-sort-styles.html was: We usually sort on frequently filtered columns when writing data to improve query performance. But this info is missing from the table information. {code:sql} CREATE TABLE t(day INT, hour INT, year INT, month INT) USING parquet PARTITIONED BY (year, month) SORTED BY (day, hour); {code} Impala and redshift support this clause: https://issues.apache.org/jira/browse/IMPALA-4166 https://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data-compare-sort-styles.html > Introduce SORTED BY clause in CREATE TABLE statement > > > Key: SPARK-31470 > URL: https://issues.apache.org/jira/browse/SPARK-31470 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > We usually sort on frequently filtered columns when writing data to improve > query performance. But this info is missing from the table information. 
> > {code:sql} > CREATE TABLE t(day INT, hour INT, year INT, month INT) > USING parquet > PARTITIONED BY (year, month) > SORTED BY (day, hour); > {code} > > Impala, Oracle and redshift support this clause: > https://issues.apache.org/jira/browse/IMPALA-4166 > https://docs.oracle.com/database/121/DWHSG/attcluster.htm#GUID-DAECFBC5-FD1A-45A5-8C2C-DC9884D0857B > https://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data-compare-sort-styles.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31651) Improve handling the case where different barrier sync types in a single sync
[ https://issues.apache.org/jira/browse/SPARK-31651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-31651: - Summary: Improve handling the case where different barrier sync types in a single sync (was: Improve handling for the case of different barrier sync types in a single sync) > Improve handling the case where different barrier sync types in a single sync > - > > Key: SPARK-31651 > URL: https://issues.apache.org/jira/browse/SPARK-31651 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: wuyi >Priority: Major > > Currently, we use cleanupBarrierStage when detecting different barrier sync > types in a single sync. This causes a problem: a new `ContextBarrierState` > can be created again if there are requesters still on the way, and those > corresponding tasks will fail because they were killed, not because different > barrier sync types were detected. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31651) Improve handling for the case of different barrier sync types in a single sync
[ https://issues.apache.org/jira/browse/SPARK-31651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-31651: - Description: Currently, we use cleanupBarrierStage when detecting different barrier sync types in a single sync. This cause a problem that a new `ContextBarrierState` can be created again if there's following requesters on the way. And those corresponding tasks will fail because of killing instead of different barrier sync types detected. was: Currently, we use cleanupBarrierStage when detecting different barrier sync types in a single sync. This cause a problem that a new `ContextBarrierState` can be created again if there's following requesters on the way. And those corresponding tasks will fail because of killing instead of different barrier sync types deteced. > Improve handling for the case of different barrier sync types in a single sync > -- > > Key: SPARK-31651 > URL: https://issues.apache.org/jira/browse/SPARK-31651 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.0.0 >Reporter: wuyi >Priority: Major > > Currently, we use cleanupBarrierStage when detecting different barrier sync > types in a single sync. This cause a problem that a new `ContextBarrierState` > can be created again if there's following requesters on the way. And those > corresponding tasks will fail because of killing instead of different barrier > sync types detected. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31651) Improve handling for the case of different barrier sync types in a single sync
wuyi created SPARK-31651: Summary: Improve handling for the case of different barrier sync types in a single sync Key: SPARK-31651 URL: https://issues.apache.org/jira/browse/SPARK-31651 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.0.0 Reporter: wuyi Currently, we use cleanupBarrierStage when detecting different barrier sync types in a single sync. This cause a problem that a new `ContextBarrierState` can be created again if there's following requesters on the way. And those corresponding tasks will fail because of killing instead of different barrier sync types deteced. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31470) Introduce SORTED BY clause in CREATE TABLE statement
[ https://issues.apache.org/jira/browse/SPARK-31470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100620#comment-17100620 ] Vikas Reddy Aravabhumi commented on SPARK-31470: [~yumwang] Could you please let us know the ETA of this fix? > Introduce SORTED BY clause in CREATE TABLE statement > > > Key: SPARK-31470 > URL: https://issues.apache.org/jira/browse/SPARK-31470 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Yuming Wang >Priority: Major > > We usually sort on frequently filtered columns when writing data to improve > query performance. But this info is missing from the table information. > > {code:sql} > CREATE TABLE t(day INT, hour INT, year INT, month INT) > USING parquet > PARTITIONED BY (year, month) > SORTED BY (day, hour); > {code} > > Impala and redshift support this clause: > https://issues.apache.org/jira/browse/IMPALA-4166 > https://docs.aws.amazon.com/redshift/latest/dg/t_Sorting_data-compare-sort-styles.html -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31650) SQL UI doesn't show metrics and whole stage codegen in AQE
[ https://issues.apache.org/jira/browse/SPARK-31650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-31650: - Issue Type: Bug (was: Test) > SQL UI doesn't show metrics and whole stage codegen in AQE > -- > > Key: SPARK-31650 > URL: https://issues.apache.org/jira/browse/SPARK-31650 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Priority: Major > Attachments: before_aqe_ui.png > > > When enabling AQE with subqueries within a query, the SQL UI may not show > metrics and whole stage codegen. > Here's a demo to reproduce it: > > {code:java} > spark.range(1).toDF("value").write.parquet("/tmp/p1") > spark.range(1).toDF("value").write.parquet("/tmp/p2") > spark.read.parquet("/tmp/p1").createOrReplaceTempView("t1") > spark.read.parquet("/tmp/p2").createOrReplaceTempView("t2") > spark.sql("select * from t1 where value=(select Max(value) from t2)").show() > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31361) Rebase datetime in parquet/avro according to file metadata
[ https://issues.apache.org/jira/browse/SPARK-31361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100600#comment-17100600 ] Apache Spark commented on SPARK-31361: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/28461 > Rebase datetime in parquet/avro according to file metadata > -- > > Key: SPARK-31361 > URL: https://issues.apache.org/jira/browse/SPARK-31361 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.0.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Blocker > Fix For: 3.0.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31127) Add abstract Selector
[ https://issues.apache.org/jira/browse/SPARK-31127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng reassigned SPARK-31127: Assignee: Huaxin Gao > Add abstract Selector > - > > Key: SPARK-31127 > URL: https://issues.apache.org/jira/browse/SPARK-31127 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > > Add abstract Selector. Move the common code between ChisqSelector and > FValueSelector into Selector. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31127) Add abstract Selector
[ https://issues.apache.org/jira/browse/SPARK-31127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] zhengruifeng resolved SPARK-31127. -- Fix Version/s: 3.1.0 Resolution: Fixed Issue resolved by pull request 27978 [https://github.com/apache/spark/pull/27978] > Add abstract Selector > - > > Key: SPARK-31127 > URL: https://issues.apache.org/jira/browse/SPARK-31127 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 3.1.0 >Reporter: Huaxin Gao >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.1.0 > > > Add abstract Selector. Move the common code between ChisqSelector and > FValueSelector into Selector. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31650) SQL UI doesn't show metrics and whole stage codegen in AQE
[ https://issues.apache.org/jira/browse/SPARK-31650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31650: Assignee: Apache Spark > SQL UI doesn't show metrics and whole stage codegen in AQE > -- > > Key: SPARK-31650 > URL: https://issues.apache.org/jira/browse/SPARK-31650 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Assignee: Apache Spark >Priority: Major > Attachments: before_aqe_ui.png > > > When enabling AQE with subqueries within a query, the SQL UI may not show > metrics and whole stage codegen. > Here's a demo to reproduce it: > > {code:java} > spark.range(1).toDF("value").write.parquet("/tmp/p1") > spark.range(1).toDF("value").write.parquet("/tmp/p2") > spark.read.parquet("/tmp/p1").createOrReplaceTempView("t1") > spark.read.parquet("/tmp/p2").createOrReplaceTempView("t2") > spark.sql("select * from t1 where value=(select Max(value) from t2)").show() > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31650) SQL UI doesn't show metrics and whole stage codegen in AQE
[ https://issues.apache.org/jira/browse/SPARK-31650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100555#comment-17100555 ] Apache Spark commented on SPARK-31650: -- User 'Ngone51' has created a pull request for this issue: https://github.com/apache/spark/pull/28460 > SQL UI doesn't show metrics and whole stage codegen in AQE > -- > > Key: SPARK-31650 > URL: https://issues.apache.org/jira/browse/SPARK-31650 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Priority: Major > Attachments: before_aqe_ui.png > > > When enabling AQE with subqueries within a query, the SQL UI may not show > metrics and whole stage codegen. > Here's a demo to reproduce it: > > {code:java} > spark.range(1).toDF("value").write.parquet("/tmp/p1") > spark.range(1).toDF("value").write.parquet("/tmp/p2") > spark.read.parquet("/tmp/p1").createOrReplaceTempView("t1") > spark.read.parquet("/tmp/p2").createOrReplaceTempView("t2") > spark.sql("select * from t1 where value=(select Max(value) from t2)").show() > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-31650) SQL UI doesn't show metrics and whole stage codegen in AQE
[ https://issues.apache.org/jira/browse/SPARK-31650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-31650: Assignee: (was: Apache Spark) > SQL UI doesn't show metrics and whole stage codegen in AQE > -- > > Key: SPARK-31650 > URL: https://issues.apache.org/jira/browse/SPARK-31650 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Priority: Major > Attachments: before_aqe_ui.png > > > When enabling AQE with subqueries within a query, the SQL UI may not show > metrics and whole stage codegen. > Here's a demo to reproduce it: > > {code:java} > spark.range(1).toDF("value").write.parquet("/tmp/p1") > spark.range(1).toDF("value").write.parquet("/tmp/p2") > spark.read.parquet("/tmp/p1").createOrReplaceTempView("t1") > spark.read.parquet("/tmp/p2").createOrReplaceTempView("t2") > spark.sql("select * from t1 where value=(select Max(value) from t2)").show() > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-31648) Filtering is supported only on partition keys of type string Issue
[ https://issues.apache.org/jira/browse/SPARK-31648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100551#comment-17100551 ] angerszhu commented on SPARK-31648: --- [~Rajesh Tadi] Could you share your SQL and the table schema details? > Filtering is supported only on partition keys of type string Issue > -- > > Key: SPARK-31648 > URL: https://issues.apache.org/jira/browse/SPARK-31648 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Rajesh Tadi >Priority: Major > Attachments: Spark Bug.txt > > > When I submit a SQL with partition filter I see the below error. I tried > setting Spark Configuration spark.sql.hive.manageFilesourcePartitions to > false but I still see the same issue. > java.lang.RuntimeException: Caught Hive MetaException attempting to get > partition metadata by filter from Hive. > java.lang.reflect.InvocationTargetException: > org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported > only on partition keys of type string > org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported > only on partition keys of type string > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-31460) spark-sql-kafka source in spark 2.4.4 causes reading stream failure frequently
[ https://issues.apache.org/jira/browse/SPARK-31460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gabor Somogyi resolved SPARK-31460. --- Resolution: Information Provided > spark-sql-kafka source in spark 2.4.4 causes reading stream failure frequently > -- > > Key: SPARK-31460 > URL: https://issues.apache.org/jira/browse/SPARK-31460 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.4 >Reporter: vinay >Priority: Major > Original Estimate: 24h > Remaining Estimate: 24h > > Spark 2.4.4 provides the source "spark-sql-kafka-0-10_2.11". > > When I read from my kafka-0.10.2.11 cluster, it frequently throws the error > "*java.util.concurrent.TimeoutException: Cannot fetch record for offset x > in 1000 milliseconds*", and the job thus fails. > > This issue was seen before in 2.3 according to ticket 23829, and an > upgrade to Spark 2.4 was supposed to solve it. > > {code:java} > compile group: 'org.apache.spark', name: 'spark-sql-kafka-0-10_2.11', > version: '2.4.4'{code} > Here is the error stack. > {code:java} > org.apache.spark.SparkException: Writing job aborted. 
> > org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2Exec.scala:92) > > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296) > > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3389) > org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2788) > org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2788) > org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370) > > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369) > org.apache.spark.sql.Dataset.collect(Dataset.scala:2788) > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:540) > > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > > 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:535) > > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351) > > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:534) > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198) > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166) > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166) > > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351) > > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166) > > org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56) > >
[jira] [Commented] (SPARK-31460) spark-sql-kafka source in spark 2.4.4 causes reading stream failure frequently
[ https://issues.apache.org/jira/browse/SPARK-31460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100549#comment-17100549 ] Gabor Somogyi commented on SPARK-31460: --- Please re-open if the suggestion didn't help. > spark-sql-kafka source in spark 2.4.4 causes reading stream failure frequently > -- > > Key: SPARK-31460 > URL: https://issues.apache.org/jira/browse/SPARK-31460 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 2.4.4 >Reporter: vinay >Priority: Major > Original Estimate: 24h > Remaining Estimate: 24h > > Spark 2.4.4 provides the source "spark-sql-kafka-0-10_2.11". > > When I read from my kafka-0.10.2.11 cluster, it frequently throws the error > "*java.util.concurrent.TimeoutException: Cannot fetch record for offset x > in 1000 milliseconds*", and the job thus fails. > > This issue was seen before in 2.3 according to ticket 23829, and an > upgrade to Spark 2.4 was supposed to solve it. > > {code:java} > compile group: 'org.apache.spark', name: 'spark-sql-kafka-0-10_2.11', > version: '2.4.4'{code} > Here is the error stack. > {code:java} > org.apache.spark.SparkException: Writing job aborted. 
> > org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2Exec.scala:92) > > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247) > org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296) > > org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3389) > org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2788) > org.apache.spark.sql.Dataset$$anonfun$collect$1.apply(Dataset.scala:2788) > org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370) > > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369) > org.apache.spark.sql.Dataset.collect(Dataset.scala:2788) > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5$$anonfun$apply$17.apply(MicroBatchExecution.scala:540) > > org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78) > > org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125) > > 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73) > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch$5.apply(MicroBatchExecution.scala:535) > > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351) > > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > org.apache.spark.sql.execution.streaming.MicroBatchExecution.org$apache$spark$sql$execution$streaming$MicroBatchExecution$$runBatch(MicroBatchExecution.scala:534) > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply$mcV$sp(MicroBatchExecution.scala:198) > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166) > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1$$anonfun$apply$mcZ$sp$1.apply(MicroBatchExecution.scala:166) > > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:351) > > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > org.apache.spark.sql.execution.streaming.MicroBatchExecution$$anonfun$runActivatedStream$1.apply$mcZ$sp(MicroBatchExecution.scala:166) > > org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:56) > >
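For the "Cannot fetch record for offset x in N milliseconds" timeout discussed in this thread, one commonly suggested mitigation, assuming the source's poll timeout is the limiting factor, is to raise it via a reader option (`kafkaConsumer.pollTimeoutMs`, per Spark's Kafka integration guide). This is a hedged configuration sketch, not a confirmed fix for this report; the broker address and topic name are placeholders:

```scala
// Config sketch only: raises the per-poll timeout the Kafka source uses
// when fetching records on executors. Broker and topic are placeholders.
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "some-topic")
  .option("kafkaConsumer.pollTimeoutMs", "10000") // raised poll timeout; tune as needed
  .load()
```

Whether this helps depends on why the broker is slow to serve the fetch; it only widens the window before the TimeoutException fires.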
[jira] [Created] (SPARK-31650) SQL UI doesn't show metrics and whole stage codegen in AQE
wuyi created SPARK-31650: Summary: SQL UI doesn't show metrics and whole stage codegen in AQE Key: SPARK-31650 URL: https://issues.apache.org/jira/browse/SPARK-31650 Project: Spark Issue Type: Test Components: SQL Affects Versions: 3.0.0 Reporter: wuyi Attachments: before_aqe_ui.png When enabling AQE with subqueries within a query, the SQL UI may not show metrics and whole stage codegen. Here's a demo to reproduce it: {code:java} spark.range(1).toDF("value").write.parquet("/tmp/p1") spark.range(1).toDF("value").write.parquet("/tmp/p2") spark.read.parquet("/tmp/p1").createOrReplaceTempView("t1") spark.read.parquet("/tmp/p2").createOrReplaceTempView("t2") spark.sql("select * from t1 where value=(select Max(value) from t2)").show() {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31650) SQL UI doesn't show metrics and whole stage codegen in AQE
[ https://issues.apache.org/jira/browse/SPARK-31650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi updated SPARK-31650: - Attachment: before_aqe_ui.png > SQL UI doesn't show metrics and whole stage codegen in AQE > -- > > Key: SPARK-31650 > URL: https://issues.apache.org/jira/browse/SPARK-31650 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 3.0.0 >Reporter: wuyi >Priority: Major > Attachments: before_aqe_ui.png > > > When enabling AQE with subqueries within a query, the SQL UI may not show > metrics and whole stage codegen. > Here's a demo to reproduce it: > > {code:java} > spark.range(1).toDF("value").write.parquet("/tmp/p1") > spark.range(1).toDF("value").write.parquet("/tmp/p2") > spark.read.parquet("/tmp/p1").createOrReplaceTempView("t1") > spark.read.parquet("/tmp/p2").createOrReplaceTempView("t2") > spark.sql("select * from t1 where value=(select Max(value) from t2)").show() > {code} > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-31649) Spread partitions evenly to spark executors
serdar onur created SPARK-31649: --- Summary: Spread partitions evenly to spark executors Key: SPARK-31649 URL: https://issues.apache.org/jira/browse/SPARK-31649 Project: Spark Issue Type: New Feature Components: Spark Core Affects Versions: 2.4.4 Reporter: serdar onur The year is 2020 and I am still trying to find a solution to this. I totally understand what [~thunderstumpges] was trying to achieve and I am trying to achieve the same. For a tool like Spark, it is unacceptable not to be able to distribute the created partitions to the executors evenly. You know, we can create a custom partitioner to distribute the data to the partitions evenly by creating our own partition index. I was under the impression that a similar approach could be applied to spread these partitions to the executors evenly (using some sort of executor index for selection of executors during partition distribution). I have been googling this for a day now and I am very disappointed to say that so far this seems not to be possible. Note: I am disappointed that the issue below was put into resolved state without actually doing anything about it. https://issues.apache.org/jira/browse/SPARK-19371 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19371) Cannot spread cached partitions evenly across executors
[ https://issues.apache.org/jira/browse/SPARK-19371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17100517#comment-17100517 ] serdar onur commented on SPARK-19371: - The year is 2020 and I am still trying to find a solution to this. I totally understand what [~thunderstumpges] was trying to achieve and I am trying to achieve the same. For a tool like Spark, it is unacceptable not to be able to distribute the created partitions to the executors evenly. You know, we can create a custom partitioner to distribute the data to the partitions evenly by creating our own partition index. I was under the impression that a similar approach could be applied to spread these partitions to the executors evenly (using some sort of executor index for selection of executors during partition distribution). I have been googling this for a day now and I am very disappointed to say that so far this seems not to be possible. > Cannot spread cached partitions evenly across executors > --- > > Key: SPARK-19371 > URL: https://issues.apache.org/jira/browse/SPARK-19371 > Project: Spark > Issue Type: Bug >Affects Versions: 1.6.1 >Reporter: Thunder Stumpges >Priority: Major > Labels: bulk-closed > Attachments: RDD Block Distribution on two executors.png, Unbalanced > RDD Blocks, and resulting task imbalance.png, Unbalanced RDD Blocks, and > resulting task imbalance.png, execution timeline.png > > > Before running an intensive iterative job (in this case a distributed topic > model training), we need to load a dataset and persist it across executors. > After loading from HDFS and persisting, the partitions are spread unevenly > across executors (based on the initial scheduling of the reads which are not > data locale sensitive). The partition sizes are even, just not their > distribution over executors. 
We currently have no way to force the partitions > to spread evenly, and as the iterative algorithm begins, tasks are > distributed to executors based on this initial load, forcing some very > unbalanced work. > This has been mentioned a > [number|http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-Partitions-not-distributed-evenly-to-executors-tt16988.html#a17059] > of > [times|http://apache-spark-user-list.1001560.n3.nabble.com/Spark-work-distribution-among-execs-tt26502.html] > in > [various|http://apache-spark-user-list.1001560.n3.nabble.com/Partitions-are-get-placed-on-the-single-node-tt26597.html] > user/dev group threads. > None of the discussions I could find had solutions that worked for me. Here > are examples of things I have tried. All resulted in partitions in memory > that were NOT evenly distributed to executors, causing future tasks to be > imbalanced across executors as well. > *Reduce Locality* > {code}spark.shuffle.reduceLocality.enabled=false/true{code} > *"Legacy" memory mode* > {code}spark.memory.useLegacyMode = true/false{code} > *Basic load and repartition* > {code} > val numPartitions = 48*16 > val df = sqlContext.read. > parquet("/data/folder_to_load"). > repartition(numPartitions). > persist > df.count > {code} > *Load and repartition to 2x partitions, then shuffle repartition down to > desired partitions* > {code} > val numPartitions = 48*16 > val df2 = sqlContext.read. > parquet("/data/folder_to_load"). > repartition(numPartitions*2) > val df = df2.repartition(numPartitions). > persist > df.count > {code} > It would be great if when persisting an RDD/DataFrame, if we could request > that those partitions be stored evenly across executors in preparation for > future tasks. > I'm not sure if this is a more general issue (I.E. not just involving > persisting RDDs), but for the persisted in-memory case, it can make a HUGE > difference in the over-all running time of the remaining work. 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
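The executor-indexed placement the thread above asks for can be sketched in a few lines of plain Scala (conceptual only: Spark exposes no public hook for pinning partitions to executors, and `roundRobinPlacement` is an invented helper, not a Spark API):

```scala
// Conceptual round-robin placement of partition indices onto executor ids,
// showing the even distribution the commenters would like Spark to guarantee.
def roundRobinPlacement(numPartitions: Int, executors: Seq[String]): Map[Int, String] =
  (0 until numPartitions).map(p => p -> executors(p % executors.size)).toMap
```

With 6 partitions and 3 executors, each executor receives exactly 2 partitions; Spark's actual placement instead falls out of task scheduling and data locality, which is why the distribution can end up skewed.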
[jira] [Updated] (SPARK-31648) Filtering is supported only on partition keys of type string Issue
[ https://issues.apache.org/jira/browse/SPARK-31648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Tadi updated SPARK-31648: Description: When I submit a SQL with partition filter I see the below error. I tried setting Spark Configuration spark.sql.hive.manageFilesourcePartitions to false but I still see the same issue. java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive. java.lang.reflect.InvocationTargetException: org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported only on partition keys of type string org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported only on partition keys of type string was: When I submit a SQL with partition filter I see the below error. I tried setting Spark Configuration spark.sql.hive.manageFilesourcePartitions to false but I still see the same issue. java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive. java.lang.reflect.InvocationTargetException: org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported only on partition keys of type string org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported only on partition keys of type string > Filtering is supported only on partition keys of type string Issue > -- > > Key: SPARK-31648 > URL: https://issues.apache.org/jira/browse/SPARK-31648 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Rajesh Tadi >Priority: Major > Attachments: Spark Bug.txt > > > When I submit a SQL with partition filter I see the below error. I tried > setting Spark Configuration spark.sql.hive.manageFilesourcePartitions to > false but I still see the same issue. > java.lang.RuntimeException: Caught Hive MetaException attempting to get > partition metadata by filter from Hive. 
> java.lang.reflect.InvocationTargetException: > org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported > only on partition keys of type string > org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported > only on partition keys of type string > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31648) Filtering is supported only on partition keys of type string Issue
[ https://issues.apache.org/jira/browse/SPARK-31648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Tadi updated SPARK-31648: Description: When I submit a SQL with partition filter I see the below error. I tried setting Spark Configuration spark.sql.hive.manageFilesourcePartitions to false but I still see the same issue. java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive. java.lang.reflect.InvocationTargetException: org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported only on partition keys of type string org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported only on partition keys of type string was: When I submit a SQL with partition filter I see the below error. I tried setting Spark Configuration spark.sql.hive.manageFilesourcePartitions to false but I still see the same issue. java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive. java.lang.reflect.InvocationTargetException: org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported only on partition keys of type string org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported only on partition keys of type string [^Spark Bug.txt] > Filtering is supported only on partition keys of type string Issue > -- > > Key: SPARK-31648 > URL: https://issues.apache.org/jira/browse/SPARK-31648 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.4 >Reporter: Rajesh Tadi >Priority: Major > Attachments: Spark Bug.txt > > > When I submit a SQL with partition filter I see the below error. I tried > setting Spark Configuration spark.sql.hive.manageFilesourcePartitions to > false but I still see the same issue. > java.lang.RuntimeException: Caught Hive MetaException attempting to get > partition metadata by filter from Hive. 
> java.lang.reflect.InvocationTargetException: > org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported > only on partition keys of type string > org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported > only on partition keys of type string -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31648) Filtering is supported only on partition keys of type string Issue
[ https://issues.apache.org/jira/browse/SPARK-31648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Tadi updated SPARK-31648:

Description: 
When I submit a SQL with partition filter I see the below error. I tried setting Spark Configuration spark.sql.hive.manageFilesourcePartitions to false but I still see the same issue.
java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive.
java.lang.reflect.InvocationTargetException: org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported only on partition keys of type string
org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported only on partition keys of type string
[^Spark Bug.txt]

was:
When I submit a SQL with partition filter I see the below error. I tried setting Spark Configuration spark.sql.hive.manageFilesourcePartitions to false but I still see the same issue.
java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive.
java.lang.reflect.InvocationTargetException: org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported only on partition keys of type string
org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported only on partition keys of type string

Summary: Filtering is supported only on partition keys of type string Issue  (was: Filtering is supported only on partition keys of type string)

> Filtering is supported only on partition keys of type string Issue
> ------------------------------------------------------------------
>
> Key: SPARK-31648
> URL: https://issues.apache.org/jira/browse/SPARK-31648
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.4
> Reporter: Rajesh Tadi
> Priority: Major
> Attachments: Spark Bug.txt
>
> When I submit a SQL with partition filter I see the below error. I tried setting Spark Configuration spark.sql.hive.manageFilesourcePartitions to false but I still see the same issue.
> java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive.
> java.lang.reflect.InvocationTargetException: org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported only on partition keys of type string
> org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported only on partition keys of type string
>
> [^Spark Bug.txt]

--
This message was sent by Atlassian Jira (v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-31648) Filtering is supported only on partition keys of type string
[ https://issues.apache.org/jira/browse/SPARK-31648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Tadi updated SPARK-31648:

Description: 
When I submit a SQL with partition filter I see the below error. I tried setting Spark Configuration spark.sql.hive.manageFilesourcePartitions to false but I still see the same issue.
java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive.
java.lang.reflect.InvocationTargetException: org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported only on partition keys of type string
org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported only on partition keys of type string

was:
When I submit a SQL with partition filter I see the below error. I tried setting Spark Configuration spark.sql.hive.manageFilesourcePartitions to false but no luck.
java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive.
java.lang.reflect.InvocationTargetException: org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported only on partition keys of type string
org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported only on partition keys of type string

> Filtering is supported only on partition keys of type string
> ------------------------------------------------------------
>
> Key: SPARK-31648
> URL: https://issues.apache.org/jira/browse/SPARK-31648
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.4
> Reporter: Rajesh Tadi
> Priority: Major
> Attachments: Spark Bug.txt
>
> When I submit a SQL with partition filter I see the below error. I tried setting Spark Configuration spark.sql.hive.manageFilesourcePartitions to false but I still see the same issue.
> java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive.
> java.lang.reflect.InvocationTargetException: org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported only on partition keys of type string
> org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported only on partition keys of type string
[jira] [Updated] (SPARK-31648) Filtering is supported only on partition keys of type string
[ https://issues.apache.org/jira/browse/SPARK-31648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Tadi updated SPARK-31648:

Attachment: Spark Bug.txt

> Filtering is supported only on partition keys of type string
> ------------------------------------------------------------
>
> Key: SPARK-31648
> URL: https://issues.apache.org/jira/browse/SPARK-31648
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.4
> Reporter: Rajesh Tadi
> Priority: Major
> Attachments: Spark Bug.txt
>
> When I submit a SQL with partition filter I see the below error. I tried setting Spark Configuration spark.sql.hive.manageFilesourcePartitions to false but no luck.
> java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive.
> java.lang.reflect.InvocationTargetException: org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported only on partition keys of type string
> org.apache.hadoop.hive.metastore.api.MetaException: Filtering is supported only on partition keys of type string
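[Editor's note] The MetaException in this report typically appears when a table's partition column has a non-string type and Spark asks the Hive metastore to filter partitions by it. A minimal reproduction sketch follows; the table and column names are hypothetical, and a SparkSession with Hive support backed by a Hive metastore is assumed:

```scala
// Hypothetical reproduction sketch: assumes `spark` is a SparkSession created
// with .enableHiveSupport() against a Hive metastore. The partition key is an
// INT rather than a STRING, which is the condition under which some Hive
// metastore configurations reject filter pushdown.
spark.sql("CREATE TABLE events (name STRING) PARTITIONED BY (year INT) STORED AS PARQUET")
spark.sql("INSERT INTO events PARTITION (year = 2020) VALUES ('a')")

// Pruning by the non-string partition key triggers getPartitionsByFilter on
// the metastore, which can fail with:
//   MetaException: Filtering is supported only on partition keys of type string
spark.sql("SELECT name FROM events WHERE year = 2020").show()
```

Whether the query fails depends on the metastore version and its configuration, which may explain why the issue is environment-specific.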
[jira] [Updated] (SPARK-31648) Filtering is supported only on partition keys of type string
[ https://issues.apache.org/jira/browse/SPARK-31648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Tadi updated SPARK-31648:

Docs Text: 
java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive. You can set the Spark configuration setting spark.sql.hive.manageFilesourcePartitions to false to work around this problem, however this will result in degraded performance. Please report a bug: https://issues.apache.org/jira/browse/SPARK
at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:775)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:679)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:677)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:677)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1221)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1214)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1214)
at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:254)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:962)
at org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:63)
at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:259)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:259)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:258)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:264)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:264)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:329)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:264)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:264)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:264)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:329)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:264)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
at
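[Editor's note] The RuntimeException message embedded in the stack trace above itself suggests a session-level workaround. A sketch of applying it at session construction (the application name is hypothetical; note that the message warns of degraded performance, and the reporter of SPARK-31648 states that this setting did not resolve the issue in their environment):

```scala
import org.apache.spark.sql.SparkSession

// Workaround sketch based on the advice embedded in the error message above:
// disable Hive file-source partition management so Spark lists files itself
// instead of pushing the partition filter down to the Hive metastore.
// Trade-off: metastore-side partition pruning is lost (degraded performance).
val spark = SparkSession.builder()
  .appName("partition-filter-workaround")  // hypothetical app name
  .config("spark.sql.hive.manageFilesourcePartitions", "false")
  .enableHiveSupport()
  .getOrCreate()
```

On the metastore side, enabling hive.metastore.integral.jdo.pushdown is sometimes suggested for integral partition keys; whether it applies depends on the Hive version in use, so treat it as an option to verify rather than a confirmed fix for this report.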
[jira] [Updated] (SPARK-31648) Filtering is supported only on partition keys of type string
[ https://issues.apache.org/jira/browse/SPARK-31648?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rajesh Tadi updated SPARK-31648:

Docs Text:   (was: 
java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive. You can set the Spark configuration setting spark.sql.hive.manageFilesourcePartitions to false to work around this problem, however this will result in degraded performance. Please report a bug: https://issues.apache.org/jira/browse/SPARK
at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:775)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:679)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$getPartitionsByFilter$1.apply(HiveClientImpl.scala:677)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:275)
at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:213)
at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:212)
at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:258)
at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:677)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1221)
at org.apache.spark.sql.hive.HiveExternalCatalog$$anonfun$listPartitionsByFilter$1.apply(HiveExternalCatalog.scala:1214)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:97)
at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitionsByFilter(HiveExternalCatalog.scala:1214)
at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.listPartitionsByFilter(ExternalCatalogWithListener.scala:254)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.listPartitionsByFilter(SessionCatalog.scala:962)
at org.apache.spark.sql.execution.datasources.CatalogFileIndex.filterPartitions(CatalogFileIndex.scala:73)
at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:63)
at org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions$$anonfun$apply$1.applyOrElse(PruneFileSourcePartitions.scala:27)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:259)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$2.apply(TreeNode.scala:259)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:258)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:264)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:264)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:329)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:264)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$class.transformDown(AnalysisHelper.scala:149)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:264)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformDown$1.apply(TreeNode.scala:264)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:329)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:187)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:264)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDown(LogicalPlan.scala:29)
at