[jira] [Created] (SPARK-42535) The HA support of Spark Thrift Server
WangHL created SPARK-42535: -- Summary: The HA support of Spark Thrift Server Key: SPARK-42535 URL: https://issues.apache.org/jira/browse/SPARK-42535 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.2 Reporter: WangHL When there are many Spark SQL connections on the Spark Thrift Server and the Thrift Server goes down, those connections lose service. We therefore need to consider High Availability (HA) support for the Spark Thrift Server. We want to adopt the HiveServer2 HA pattern to provide Thrift Server HA: register instances the way HiveServer2's 'addServerInstanceToZooKeeper' method does, and use ZooKeeper to choose the active Thrift Server when a client connects. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
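The HiveServer2-style HA described in this ticket works by having each server register an ephemeral sequential znode under a shared ZooKeeper namespace; clients then list that namespace and pick an instance. The following is only a hedged illustration of the selection convention (not Spark or Hive code; the znode names and the "smallest sequence number wins" rule are illustrative assumptions):

```python
# Illustrative sketch of ZooKeeper-style instance selection. ZooKeeper
# appends a zero-padded, monotonically increasing counter to ephemeral
# sequential znodes; picking the smallest counter gives every client a
# stable, agreed-upon "active" instance.

def pick_active_instance(znodes):
    """Return the znode with the smallest ephemeral-sequential counter."""
    return min(znodes, key=lambda node: int(node.rsplit("-", 1)[-1]))

servers = [
    "thriftserver-0000000012",
    "thriftserver-0000000003",
    "thriftserver-0000000007",
]
print(pick_active_instance(servers))  # thriftserver-0000000003
```

In a real deployment the znode list would come from a ZooKeeper client watching the registration namespace, so a crashed server's ephemeral node disappears and clients automatically fail over to the next instance.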
[jira] [Commented] (SPARK-42286) Fix internal error for valid CASE WHEN expression with CAST when inserting into a table
[ https://issues.apache.org/jira/browse/SPARK-42286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692512#comment-17692512 ] Apache Spark commented on SPARK-42286: -- User 'RunyaoChen' has created a pull request for this issue: https://github.com/apache/spark/pull/40140 > Fix internal error for valid CASE WHEN expression with CAST when inserting > into a table > --- > > Key: SPARK-42286 > URL: https://issues.apache.org/jira/browse/SPARK-42286 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Runyao.Chen >Assignee: Runyao.Chen >Priority: Major > Fix For: 3.4.0 > > > ``` > spark-sql> create or replace table es570639t1 as select x FROM values (1), > (2), (3) as tab(x); > spark-sql> create or replace table es570639t2 (x Decimal(9, 0)); > spark-sql> insert into es570639t2 select 0 - (case when x = 1 then 1 else x > end) from es570639t1 where x = 1; > ``` > hits the following internal error > org.apache.spark.SparkException: [INTERNAL_ERROR] Child is not Cast or > ExpressionProxy of Cast > > Stack trace: > org.apache.spark.SparkException: [INTERNAL_ERROR] Child is not Cast or > ExpressionProxy of Cast at > org.apache.spark.SparkException$.internalError(SparkException.scala:78) at > org.apache.spark.SparkException$.internalError(SparkException.scala:82) at > org.apache.spark.sql.catalyst.expressions.CheckOverflowInTableInsert.checkChild(Cast.scala:2693) > at > org.apache.spark.sql.catalyst.expressions.CheckOverflowInTableInsert.withNewChildInternal(Cast.scala:2697) > at > org.apache.spark.sql.catalyst.expressions.CheckOverflowInTableInsert.withNewChildInternal(Cast.scala:2683) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.$anonfun$mapChildren$5(TreeNode.scala:1315) > at > org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:106) > at > org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren(TreeNode.scala:1314) > at > 
org.apache.spark.sql.catalyst.trees.UnaryLike.mapChildren$(TreeNode.scala:1309) > at > org.apache.spark.sql.catalyst.expressions.UnaryExpression.mapChildren(Expression.scala:636) > at > org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:570) > at > org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:570) > > This internal error comes from `CheckOverflowInTableInsert`'s `checkChild`, > which covered only the `Cast` expr and the `ExpressionProxy` expr, but not the > `CaseWhen` expr.
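The root cause described in the issue above is an exhaustiveness gap: the check handled `Cast` and an `ExpressionProxy` wrapping a `Cast`, but raised an internal error for any other child such as `CaseWhen`. A minimal sketch of that dispatch pattern (class and function names are hypothetical stand-ins for the Catalyst expressions, not Spark's actual API):

```python
# Hypothetical mirror of the checkChild dispatch: accepting only two node
# kinds and erroring on everything else is what turned a valid CaseWhen
# child into an INTERNAL_ERROR.

class Cast:
    pass

class ExpressionProxy:
    def __init__(self, child):
        self.child = child

class CaseWhen:
    def __init__(self, branch_values):
        self.branch_values = branch_values

def check_child(expr):
    if isinstance(expr, Cast):
        return "ok"
    if isinstance(expr, ExpressionProxy) and isinstance(expr.child, Cast):
        return "ok"
    # the previously missing branch: a CaseWhen whose branch values are
    # Casts is just as valid as a bare Cast child
    if isinstance(expr, CaseWhen) and all(
        isinstance(v, Cast) for v in expr.branch_values
    ):
        return "ok"
    raise RuntimeError(
        "[INTERNAL_ERROR] Child is not Cast or ExpressionProxy of Cast"
    )

print(check_child(CaseWhen([Cast(), Cast()])))  # ok
```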
[jira] [Commented] (SPARK-39859) Support v2 `DESCRIBE TABLE EXTENDED` for columns
[ https://issues.apache.org/jira/browse/SPARK-39859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692503#comment-17692503 ] Apache Spark commented on SPARK-39859: -- User 'huaxingao' has created a pull request for this issue: https://github.com/apache/spark/pull/40139 > Support v2 `DESCRIBE TABLE EXTENDED` for columns > > > Key: SPARK-39859 > URL: https://issues.apache.org/jira/browse/SPARK-39859 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.4.0 >Reporter: Max Gekk >Assignee: Huaxin Gao >Priority: Major > Fix For: 3.4.0 > >
[jira] [Resolved] (SPARK-42484) Better logging for UnsafeRowUtils
[ https://issues.apache.org/jira/browse/SPARK-42484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-42484. - Fix Version/s: 3.5.0 Resolution: Fixed > Better logging for UnsafeRowUtils > - > > Key: SPARK-42484 > URL: https://issues.apache.org/jira/browse/SPARK-42484 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.3.3 >Reporter: Wei Liu >Assignee: Wei Liu >Priority: Major > Fix For: 3.5.0 > > > Currently, `UnsafeRowUtils.validateStructuralIntegrity` returns only a boolean, > making it hard to track exactly where the problem is.
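The improvement requested above — returning diagnostic detail rather than a bare boolean — can be sketched as a validator that returns `None` on success and an error message on failure. This is illustrative only; the function name and parameters are hypothetical and do not reflect the actual `UnsafeRowUtils` signature:

```python
from typing import Optional

def validate_structural_integrity(
    num_fields: int, expected_fields: int
) -> Optional[str]:
    """Return None if the row looks valid, otherwise a message naming the
    failed check, so the caller can log exactly where the problem is."""
    if num_fields != expected_fields:
        return (
            f"field count mismatch: got {num_fields}, "
            f"expected {expected_fields}"
        )
    return None

print(validate_structural_integrity(3, 5))
# field count mismatch: got 3, expected 5
```

Compared with a boolean, the `Optional[str]` shape costs callers nothing on the success path (`None` is still falsy) while carrying the failure reason when it matters.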
[jira] [Assigned] (SPARK-42484) Better logging for UnsafeRowUtils
[ https://issues.apache.org/jira/browse/SPARK-42484?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-42484: --- Assignee: Wei Liu > Better logging for UnsafeRowUtils > - > > Key: SPARK-42484 > URL: https://issues.apache.org/jira/browse/SPARK-42484 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.3.3 >Reporter: Wei Liu >Assignee: Wei Liu >Priority: Major > > > Currently, `UnsafeRowUtils.validateStructuralIntegrity` returns only a boolean, > making it hard to track exactly where the problem is.
[jira] [Commented] (SPARK-42049) Improve AliasAwareOutputExpression
[ https://issues.apache.org/jira/browse/SPARK-42049?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692481#comment-17692481 ] Apache Spark commented on SPARK-42049: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/40137 > Improve AliasAwareOutputExpression > -- > > Key: SPARK-42049 > URL: https://issues.apache.org/jira/browse/SPARK-42049 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.4.0 >Reporter: XiDuo You >Assignee: Peter Toth >Priority: Major > Fix For: 3.4.0 > > > AliasAwareOutputExpression currently does not support attributes that have more than > one alias. > AliasAwareOutputExpression should also work for LogicalPlan.
[jira] [Assigned] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals
[ https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41793: Assignee: (was: Apache Spark) > Incorrect result for window frames defined by a range clause on large > decimals > --- > > Key: SPARK-41793 > URL: https://issues.apache.org/jira/browse/SPARK-41793 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gera Shegalov >Priority: Blocker > Labels: correctness > > Context > https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686 > The following windowing query on a simple two-row input should produce two > non-empty windows as a result > {code} > from pprint import pprint > data = [ > ('9223372036854775807', '11342371013783243717493546650944543.47'), > ('9223372036854775807', '.99') > ] > df1 = spark.createDataFrame(data, 'a STRING, b STRING') > df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)')) > df2.createOrReplaceTempView('test_table') > df = sql(''' > SELECT > COUNT(1) OVER ( > PARTITION BY a > ORDER BY b ASC > RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING > ) AS CNT_1 > FROM > test_table > ''') > res = df.collect() > df.explain(True) > pprint(res) > {code} > Spark 3.4.0-SNAPSHOT output: > {code} > [Row(CNT_1=1), Row(CNT_1=0)] > {code} > Spark 3.3.1 output as expected: > {code} > [Row(CNT_1=1), Row(CNT_1=1)] > {code}
[jira] [Commented] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals
[ https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692480#comment-17692480 ] Apache Spark commented on SPARK-41793: -- User 'ulysses-you' has created a pull request for this issue: https://github.com/apache/spark/pull/40138 > Incorrect result for window frames defined by a range clause on large > decimals > --- > > Key: SPARK-41793 > URL: https://issues.apache.org/jira/browse/SPARK-41793 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gera Shegalov >Priority: Blocker > Labels: correctness > > Context > https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686 > The following windowing query on a simple two-row input should produce two > non-empty windows as a result > {code} > from pprint import pprint > data = [ > ('9223372036854775807', '11342371013783243717493546650944543.47'), > ('9223372036854775807', '.99') > ] > df1 = spark.createDataFrame(data, 'a STRING, b STRING') > df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)')) > df2.createOrReplaceTempView('test_table') > df = sql(''' > SELECT > COUNT(1) OVER ( > PARTITION BY a > ORDER BY b ASC > RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING > ) AS CNT_1 > FROM > test_table > ''') > res = df.collect() > df.explain(True) > pprint(res) > {code} > Spark 3.4.0-SNAPSHOT output: > {code} > [Row(CNT_1=1), Row(CNT_1=0)] > {code} > Spark 3.3.1 output as expected: > {code} > [Row(CNT_1=1), Row(CNT_1=1)] > {code}
[jira] [Assigned] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals
[ https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-41793: Assignee: Apache Spark > Incorrect result for window frames defined by a range clause on large > decimals > --- > > Key: SPARK-41793 > URL: https://issues.apache.org/jira/browse/SPARK-41793 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gera Shegalov >Assignee: Apache Spark >Priority: Blocker > Labels: correctness > > Context > https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686 > The following windowing query on a simple two-row input should produce two > non-empty windows as a result > {code} > from pprint import pprint > data = [ > ('9223372036854775807', '11342371013783243717493546650944543.47'), > ('9223372036854775807', '.99') > ] > df1 = spark.createDataFrame(data, 'a STRING, b STRING') > df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)')) > df2.createOrReplaceTempView('test_table') > df = sql(''' > SELECT > COUNT(1) OVER ( > PARTITION BY a > ORDER BY b ASC > RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING > ) AS CNT_1 > FROM > test_table > ''') > res = df.collect() > df.explain(True) > pprint(res) > {code} > Spark 3.4.0-SNAPSHOT output: > {code} > [Row(CNT_1=1), Row(CNT_1=0)] > {code} > Spark 3.3.1 output as expected: > {code} > [Row(CNT_1=1), Row(CNT_1=1)] > {code}
[jira] [Resolved] (SPARK-42529) Support Cube and Rollup
[ https://issues.apache.org/jira/browse/SPARK-42529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-42529. --- Fix Version/s: 3.4.1 Resolution: Fixed > Support Cube and Rollup > --- > > Key: SPARK-42529 > URL: https://issues.apache.org/jira/browse/SPARK-42529 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.4.1 > >
[jira] [Commented] (SPARK-42515) ClientE2ETestSuite local test failed
[ https://issues.apache.org/jira/browse/SPARK-42515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692469#comment-17692469 ] Apache Spark commented on SPARK-42515: -- User 'LuciferYang' has created a pull request for this issue: https://github.com/apache/spark/pull/40136 > ClientE2ETestSuite local test failed > > > Key: SPARK-42515 > URL: https://issues.apache.org/jira/browse/SPARK-42515 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.4.0, 3.5.0 >Reporter: Yang Jie >Priority: Minor > > > local run `build/sbt clean "connect-client-jvm/test"`, > `ClientE2ETestSuite#write table` failed, GA not failed. > > {code:java} > [info] - rite table *** FAILED *** (41 milliseconds) > [info] io.grpc.StatusRuntimeException: UNKNOWN: > org/apache/parquet/hadoop/api/ReadSupport > [info] at io.grpc.Status.asRuntimeException(Status.java:535) > [info] at > io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660) > [info] at > scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45) > [info] at scala.collection.Iterator.foreach(Iterator.scala:943) > [info] at scala.collection.Iterator.foreach$(Iterator.scala:943) > [info] at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > [info] at org.apache.spark.sql.SparkSession.execute(SparkSession.scala:169) > [info] at > org.apache.spark.sql.DataFrameWriter.executeWriteOperation(DataFrameWriter.scala:255) > [info] at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:338) > [info] at > org.apache.spark.sql.ClientE2ETestSuite.$anonfun$new$12(ClientE2ETestSuite.scala:145) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) 
> [info] at > org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) > [info] at org.scalatest.TestSuite.withFixture(TestSuite.scala:196) > [info] at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195) > [info] at > org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218) > [info] at org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269) > [info] at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) > [info] at scala.collection.immutable.List.foreach(List.scala:431) > [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) > [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268) > [info] at > org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564) > [info] at org.scalatest.Suite.run(Suite.scala:1114) > [info] at org.scalatest.Suite.run$(Suite.scala:1096) > [info] at > org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273) > [info] at org.scalatest.SuperEngine.runImpl(Engine.scala:535) > [info] at > 
org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272) > [info] at > org.apache.spark.sql.ClientE2ETestSuite.org$scalatest$BeforeAndAfterAll$$super$run(ClientE2ETestSuite.scala:33) > [info] at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213) > [info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > [info] at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) > [info] at > org.apache.spark.sql.ClientE2ETestSuite.run(ClientE2ETestSuite.scala:33) > [info] at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321) > [i
[jira] [Assigned] (SPARK-42515) ClientE2ETestSuite local test failed
[ https://issues.apache.org/jira/browse/SPARK-42515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42515: Assignee: (was: Apache Spark) > ClientE2ETestSuite local test failed > > > Key: SPARK-42515 > URL: https://issues.apache.org/jira/browse/SPARK-42515 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.4.0, 3.5.0 >Reporter: Yang Jie >Priority: Minor > > > local run `build/sbt clean "connect-client-jvm/test"`, > `ClientE2ETestSuite#write table` failed, GA not failed. > > {code:java} > [info] - rite table *** FAILED *** (41 milliseconds) > [info] io.grpc.StatusRuntimeException: UNKNOWN: > org/apache/parquet/hadoop/api/ReadSupport > [info] at io.grpc.Status.asRuntimeException(Status.java:535) > [info] at > io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660) > [info] at > scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45) > [info] at scala.collection.Iterator.foreach(Iterator.scala:943) > [info] at scala.collection.Iterator.foreach$(Iterator.scala:943) > [info] at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > [info] at org.apache.spark.sql.SparkSession.execute(SparkSession.scala:169) > [info] at > org.apache.spark.sql.DataFrameWriter.executeWriteOperation(DataFrameWriter.scala:255) > [info] at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:338) > [info] at > org.apache.spark.sql.ClientE2ETestSuite.$anonfun$new$12(ClientE2ETestSuite.scala:145) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) > [info] at 
org.scalatest.TestSuite.withFixture(TestSuite.scala:196) > [info] at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195) > [info] at > org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218) > [info] at org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269) > [info] at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) > [info] at scala.collection.immutable.List.foreach(List.scala:431) > [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) > [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268) > [info] at > org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564) > [info] at org.scalatest.Suite.run(Suite.scala:1114) > [info] at org.scalatest.Suite.run$(Suite.scala:1096) > [info] at > org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273) > [info] at org.scalatest.SuperEngine.runImpl(Engine.scala:535) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273) > [info] at > 
org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272) > [info] at > org.apache.spark.sql.ClientE2ETestSuite.org$scalatest$BeforeAndAfterAll$$super$run(ClientE2ETestSuite.scala:33) > [info] at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213) > [info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > [info] at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) > [info] at > org.apache.spark.sql.ClientE2ETestSuite.run(ClientE2ETestSuite.scala:33) > [info] at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321) > [info] at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:517) > [info] at sbt.ForkMain$Run.l
[jira] [Assigned] (SPARK-42515) ClientE2ETestSuite local test failed
[ https://issues.apache.org/jira/browse/SPARK-42515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42515: Assignee: Apache Spark > ClientE2ETestSuite local test failed > > > Key: SPARK-42515 > URL: https://issues.apache.org/jira/browse/SPARK-42515 > Project: Spark > Issue Type: Bug > Components: Connect >Affects Versions: 3.4.0, 3.5.0 >Reporter: Yang Jie >Assignee: Apache Spark >Priority: Minor > > > local run `build/sbt clean "connect-client-jvm/test"`, > `ClientE2ETestSuite#write table` failed, GA not failed. > > {code:java} > [info] - rite table *** FAILED *** (41 milliseconds) > [info] io.grpc.StatusRuntimeException: UNKNOWN: > org/apache/parquet/hadoop/api/ReadSupport > [info] at io.grpc.Status.asRuntimeException(Status.java:535) > [info] at > io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:660) > [info] at > scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45) > [info] at scala.collection.Iterator.foreach(Iterator.scala:943) > [info] at scala.collection.Iterator.foreach$(Iterator.scala:943) > [info] at scala.collection.AbstractIterator.foreach(Iterator.scala:1431) > [info] at org.apache.spark.sql.SparkSession.execute(SparkSession.scala:169) > [info] at > org.apache.spark.sql.DataFrameWriter.executeWriteOperation(DataFrameWriter.scala:255) > [info] at > org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:338) > [info] at > org.apache.spark.sql.ClientE2ETestSuite.$anonfun$new$12(ClientE2ETestSuite.scala:145) > [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) > [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) > [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) > [info] at org.scalatest.Transformer.apply(Transformer.scala:22) > [info] at org.scalatest.Transformer.apply(Transformer.scala:20) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226) > [info] at 
org.scalatest.TestSuite.withFixture(TestSuite.scala:196) > [info] at org.scalatest.TestSuite.withFixture$(TestSuite.scala:195) > [info] at > org.scalatest.funsuite.AnyFunSuite.withFixture(AnyFunSuite.scala:1564) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236) > [info] at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218) > [info] at org.scalatest.funsuite.AnyFunSuite.runTest(AnyFunSuite.scala:1564) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269) > [info] at > org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413) > [info] at scala.collection.immutable.List.foreach(List.scala:431) > [info] at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401) > [info] at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396) > [info] at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268) > [info] at > org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1564) > [info] at org.scalatest.Suite.run(Suite.scala:1114) > [info] at org.scalatest.Suite.run$(Suite.scala:1096) > [info] at > org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1564) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273) > [info] at org.scalatest.SuperEngine.runImpl(Engine.scala:535) > [info] at > org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273) > [info] at > 
org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272) > [info] at > org.apache.spark.sql.ClientE2ETestSuite.org$scalatest$BeforeAndAfterAll$$super$run(ClientE2ETestSuite.scala:33) > [info] at > org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213) > [info] at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) > [info] at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) > [info] at > org.apache.spark.sql.ClientE2ETestSuite.run(ClientE2ETestSuite.scala:33) > [info] at > org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:321) > [info] at > org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:517) > [info
[jira] [Assigned] (SPARK-42444) DataFrame.drop should handle multi columns properly
[ https://issues.apache.org/jira/browse/SPARK-42444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42444: Assignee: (was: Apache Spark) > DataFrame.drop should handle multi columns properly > --- > > Key: SPARK-42444 > URL: https://issues.apache.org/jira/browse/SPARK-42444 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Blocker > > {code:java} > from pyspark.sql import Row > df1 = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], > ["age", "name"]) > df2 = spark.createDataFrame([Row(height=80, name="Tom"), Row(height=85, > name="Bob")]) > df1.join(df2, df1.name == df2.name, 'inner').drop('name', 'age').show() > {code} > This works in 3.3 > {code:java} > +--+ > |height| > +--+ > |85| > |80| > +--+ > {code} > but fails in 3.4 > {code:java} > --- > AnalysisException Traceback (most recent call last) > Cell In[1], line 4 > 2 df1 = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, > "Bob")], ["age", "name"]) > 3 df2 = spark.createDataFrame([Row(height=80, name="Tom"), > Row(height=85, name="Bob")]) > > 4 df1.join(df2, df1.name == df2.name, 'inner').drop('name', > 'age').show() > File ~/Dev/spark/python/pyspark/sql/dataframe.py:4913, in > DataFrame.drop(self, *cols) >4911 jcols = [_to_java_column(c) for c in cols] >4912 first_column, *remaining_columns = jcols > -> 4913 jdf = self._jdf.drop(first_column, self._jseq(remaining_columns)) >4915 return DataFrame(jdf, self.sparkSession) > File ~/Dev/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1322, > in JavaMember.__call__(self, *args) >1316 command = proto.CALL_COMMAND_NAME +\ >1317 self.command_header +\ >1318 args_command +\ >1319 proto.END_COMMAND_PART >1321 answer = self.gateway_client.send_command(command) > -> 1322 return_value = get_return_value( >1323 answer, self.gateway_client, self.target_id, self.name) >1325 for temp_arg in temp_args: >1326 if 
hasattr(temp_arg, "_detach"): > File ~/Dev/spark/python/pyspark/errors/exceptions/captured.py:159, in > capture_sql_exception..deco(*a, **kw) > 155 converted = convert_exception(e.java_exception) > 156 if not isinstance(converted, UnknownException): > 157 # Hide where the exception came from that shows a non-Pythonic > 158 # JVM exception message. > --> 159 raise converted from None > 160 else: > 161 raise > AnalysisException: [AMBIGUOUS_REFERENCE] Reference `name` is ambiguous, could > be: [`name`, `name`]. > {code}
[jira] [Commented] (SPARK-42444) DataFrame.drop should handle multi columns properly
[ https://issues.apache.org/jira/browse/SPARK-42444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692467#comment-17692467 ] Apache Spark commented on SPARK-42444: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/40135 > DataFrame.drop should handle multi columns properly > --- > > Key: SPARK-42444 > URL: https://issues.apache.org/jira/browse/SPARK-42444 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Blocker > > {code:java} > from pyspark.sql import Row > df1 = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], > ["age", "name"]) > df2 = spark.createDataFrame([Row(height=80, name="Tom"), Row(height=85, > name="Bob")]) > df1.join(df2, df1.name == df2.name, 'inner').drop('name', 'age').show() > {code} > This works in 3.3 > {code:java} > +--+ > |height| > +--+ > |85| > |80| > +--+ > {code} > but fails in 3.4 > {code:java} > --- > AnalysisException Traceback (most recent call last) > Cell In[1], line 4 > 2 df1 = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, > "Bob")], ["age", "name"]) > 3 df2 = spark.createDataFrame([Row(height=80, name="Tom"), > Row(height=85, name="Bob")]) > > 4 df1.join(df2, df1.name == df2.name, 'inner').drop('name', > 'age').show() > File ~/Dev/spark/python/pyspark/sql/dataframe.py:4913, in > DataFrame.drop(self, *cols) >4911 jcols = [_to_java_column(c) for c in cols] >4912 first_column, *remaining_columns = jcols > -> 4913 jdf = self._jdf.drop(first_column, self._jseq(remaining_columns)) >4915 return DataFrame(jdf, self.sparkSession) > File ~/Dev/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1322, > in JavaMember.__call__(self, *args) >1316 command = proto.CALL_COMMAND_NAME +\ >1317 self.command_header +\ >1318 args_command +\ >1319 proto.END_COMMAND_PART >1321 answer = self.gateway_client.send_command(command) > -> 1322 return_value = 
get_return_value( >1323 answer, self.gateway_client, self.target_id, self.name) >1325 for temp_arg in temp_args: >1326 if hasattr(temp_arg, "_detach"): > File ~/Dev/spark/python/pyspark/errors/exceptions/captured.py:159, in > capture_sql_exception..deco(*a, **kw) > 155 converted = convert_exception(e.java_exception) > 156 if not isinstance(converted, UnknownException): > 157 # Hide where the exception came from that shows a non-Pythonic > 158 # JVM exception message. > --> 159 raise converted from None > 160 else: > 161 raise > AnalysisException: [AMBIGUOUS_REFERENCE] Reference `name` is ambiguous, could > be: [`name`, `name`]. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42444) DataFrame.drop should handle multi columns properly
[ https://issues.apache.org/jira/browse/SPARK-42444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692466#comment-17692466 ] Apache Spark commented on SPARK-42444: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/40135 > DataFrame.drop should handle multi columns properly > --- > > Key: SPARK-42444 > URL: https://issues.apache.org/jira/browse/SPARK-42444 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Blocker > > {code:java} > from pyspark.sql import Row > df1 = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], > ["age", "name"]) > df2 = spark.createDataFrame([Row(height=80, name="Tom"), Row(height=85, > name="Bob")]) > df1.join(df2, df1.name == df2.name, 'inner').drop('name', 'age').show() > {code} > This works in 3.3 > {code:java} > +--+ > |height| > +--+ > |85| > |80| > +--+ > {code} > but fails in 3.4 > {code:java} > --- > AnalysisException Traceback (most recent call last) > Cell In[1], line 4 > 2 df1 = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, > "Bob")], ["age", "name"]) > 3 df2 = spark.createDataFrame([Row(height=80, name="Tom"), > Row(height=85, name="Bob")]) > > 4 df1.join(df2, df1.name == df2.name, 'inner').drop('name', > 'age').show() > File ~/Dev/spark/python/pyspark/sql/dataframe.py:4913, in > DataFrame.drop(self, *cols) >4911 jcols = [_to_java_column(c) for c in cols] >4912 first_column, *remaining_columns = jcols > -> 4913 jdf = self._jdf.drop(first_column, self._jseq(remaining_columns)) >4915 return DataFrame(jdf, self.sparkSession) > File ~/Dev/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1322, > in JavaMember.__call__(self, *args) >1316 command = proto.CALL_COMMAND_NAME +\ >1317 self.command_header +\ >1318 args_command +\ >1319 proto.END_COMMAND_PART >1321 answer = self.gateway_client.send_command(command) > -> 1322 return_value = 
get_return_value( >1323 answer, self.gateway_client, self.target_id, self.name) >1325 for temp_arg in temp_args: >1326 if hasattr(temp_arg, "_detach"): > File ~/Dev/spark/python/pyspark/errors/exceptions/captured.py:159, in > capture_sql_exception..deco(*a, **kw) > 155 converted = convert_exception(e.java_exception) > 156 if not isinstance(converted, UnknownException): > 157 # Hide where the exception came from that shows a non-Pythonic > 158 # JVM exception message. > --> 159 raise converted from None > 160 else: > 161 raise > AnalysisException: [AMBIGUOUS_REFERENCE] Reference `name` is ambiguous, could > be: [`name`, `name`]. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42444) DataFrame.drop should handle multi columns properly
[ https://issues.apache.org/jira/browse/SPARK-42444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42444: Assignee: Apache Spark > DataFrame.drop should handle multi columns properly > --- > > Key: SPARK-42444 > URL: https://issues.apache.org/jira/browse/SPARK-42444 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Assignee: Apache Spark >Priority: Blocker > > {code:java} > from pyspark.sql import Row > df1 = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], > ["age", "name"]) > df2 = spark.createDataFrame([Row(height=80, name="Tom"), Row(height=85, > name="Bob")]) > df1.join(df2, df1.name == df2.name, 'inner').drop('name', 'age').show() > {code} > This works in 3.3 > {code:java} > +--+ > |height| > +--+ > |85| > |80| > +--+ > {code} > but fails in 3.4 > {code:java} > --- > AnalysisException Traceback (most recent call last) > Cell In[1], line 4 > 2 df1 = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, > "Bob")], ["age", "name"]) > 3 df2 = spark.createDataFrame([Row(height=80, name="Tom"), > Row(height=85, name="Bob")]) > > 4 df1.join(df2, df1.name == df2.name, 'inner').drop('name', > 'age').show() > File ~/Dev/spark/python/pyspark/sql/dataframe.py:4913, in > DataFrame.drop(self, *cols) >4911 jcols = [_to_java_column(c) for c in cols] >4912 first_column, *remaining_columns = jcols > -> 4913 jdf = self._jdf.drop(first_column, self._jseq(remaining_columns)) >4915 return DataFrame(jdf, self.sparkSession) > File ~/Dev/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1322, > in JavaMember.__call__(self, *args) >1316 command = proto.CALL_COMMAND_NAME +\ >1317 self.command_header +\ >1318 args_command +\ >1319 proto.END_COMMAND_PART >1321 answer = self.gateway_client.send_command(command) > -> 1322 return_value = get_return_value( >1323 answer, self.gateway_client, self.target_id, self.name) >1325 for temp_arg in temp_args: >1326 
if hasattr(temp_arg, "_detach"): > File ~/Dev/spark/python/pyspark/errors/exceptions/captured.py:159, in > capture_sql_exception..deco(*a, **kw) > 155 converted = convert_exception(e.java_exception) > 156 if not isinstance(converted, UnknownException): > 157 # Hide where the exception came from that shows a non-Pythonic > 158 # JVM exception message. > --> 159 raise converted from None > 160 else: > 161 raise > AnalysisException: [AMBIGUOUS_REFERENCE] Reference `name` is ambiguous, could > be: [`name`, `name`]. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
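[Editor's note] The AMBIGUOUS_REFERENCE failure in SPARK-42444 arises because the 3.4 `drop` path resolved each string argument as a column expression against a join output carrying two `name` columns, whereas name-based drop is lenient and removes every match without resolving. A minimal pure-Python model of the two behaviors — this is an illustration, not Spark's actual analyzer:

```python
# Toy model of column lookup in a joined DataFrame. Columns are
# (source, name) pairs; names below are illustrative, not Spark internals.

class AmbiguousReference(Exception):
    pass

def resolve(columns, name):
    """Expression-style resolution: an unqualified name must match exactly one column."""
    matches = [c for c in columns if c[1] == name]
    if len(matches) > 1:
        raise AmbiguousReference(f"Reference `{name}` is ambiguous")
    return matches[0] if matches else None

def drop_by_name(columns, *names):
    """Lenient name-based drop: remove every matching column, never raise."""
    return [c for c in columns if c[1] not in names]

joined = [("df1", "age"), ("df1", "name"), ("df2", "height"), ("df2", "name")]

# The 3.4 regression path: each string was resolved as an expression first.
try:
    resolve(joined, "name")
except AmbiguousReference:
    pass  # AMBIGUOUS_REFERENCE, as in the traceback above

# The 3.3 behavior the fix restores: both `name` columns and `age` drop away.
print(drop_by_name(joined, "name", "age"))  # [('df2', 'height')]
```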
[jira] [Commented] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals
[ https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692451#comment-17692451 ] Wenchen Fan commented on SPARK-41793: - When we are doing window operator computing per partition, it's local decimal calculations and we can temporarily go beyond the decimal precision limitation, because `Decimal` is backed by `java.math.BigDecimal`. We should only check overflow before writing out decimal values. There is an expression `DecimalAddNoOverflowCheck` and we should use it in the window operator. [~ulysses] can you help to fix it? This is the same idea we use in Sum/Average. > Incorrect result for window frames defined by a range clause on large > decimals > --- > > Key: SPARK-41793 > URL: https://issues.apache.org/jira/browse/SPARK-41793 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gera Shegalov >Priority: Blocker > Labels: correctness > > Context > https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686 > The following windowing query on a simple two-row input should produce two > non-empty windows as a result > {code} > from pprint import pprint > data = [ > ('9223372036854775807', '11342371013783243717493546650944543.47'), > ('9223372036854775807', '.99') > ] > df1 = spark.createDataFrame(data, 'a STRING, b STRING') > df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)')) > df2.createOrReplaceTempView('test_table') > df = sql(''' > SELECT > COUNT(1) OVER ( > PARTITION BY a > ORDER BY b ASC > RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING > ) AS CNT_1 > FROM > test_table > ''') > res = df.collect() > df.explain(True) > pprint(res) > {code} > Spark 3.4.0-SNAPSHOT output: > {code} > [Row(CNT_1=1), Row(CNT_1=0)] > {code} > Spark 3.3.1 output as expected: > {code} > Row(CNT_1=1), Row(CNT_1=1)] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: 
issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42444) DataFrame.drop should handle multi columns properly
[ https://issues.apache.org/jira/browse/SPARK-42444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692442#comment-17692442 ] Ruifeng Zheng commented on SPARK-42444: --- I am going to fix this one > DataFrame.drop should handle multi columns properly > --- > > Key: SPARK-42444 > URL: https://issues.apache.org/jira/browse/SPARK-42444 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Blocker > > {code:java} > from pyspark.sql import Row > df1 = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], > ["age", "name"]) > df2 = spark.createDataFrame([Row(height=80, name="Tom"), Row(height=85, > name="Bob")]) > df1.join(df2, df1.name == df2.name, 'inner').drop('name', 'age').show() > {code} > This works in 3.3 > {code:java} > +--+ > |height| > +--+ > |85| > |80| > +--+ > {code} > but fails in 3.4 > {code:java} > --- > AnalysisException Traceback (most recent call last) > Cell In[1], line 4 > 2 df1 = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, > "Bob")], ["age", "name"]) > 3 df2 = spark.createDataFrame([Row(height=80, name="Tom"), > Row(height=85, name="Bob")]) > > 4 df1.join(df2, df1.name == df2.name, 'inner').drop('name', > 'age').show() > File ~/Dev/spark/python/pyspark/sql/dataframe.py:4913, in > DataFrame.drop(self, *cols) >4911 jcols = [_to_java_column(c) for c in cols] >4912 first_column, *remaining_columns = jcols > -> 4913 jdf = self._jdf.drop(first_column, self._jseq(remaining_columns)) >4915 return DataFrame(jdf, self.sparkSession) > File ~/Dev/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1322, > in JavaMember.__call__(self, *args) >1316 command = proto.CALL_COMMAND_NAME +\ >1317 self.command_header +\ >1318 args_command +\ >1319 proto.END_COMMAND_PART >1321 answer = self.gateway_client.send_command(command) > -> 1322 return_value = get_return_value( >1323 answer, self.gateway_client, self.target_id, self.name) >1325 
for temp_arg in temp_args: >1326 if hasattr(temp_arg, "_detach"): > File ~/Dev/spark/python/pyspark/errors/exceptions/captured.py:159, in > capture_sql_exception..deco(*a, **kw) > 155 converted = convert_exception(e.java_exception) > 156 if not isinstance(converted, UnknownException): > 157 # Hide where the exception came from that shows a non-Pythonic > 158 # JVM exception message. > --> 159 raise converted from None > 160 else: > 161 raise > AnalysisException: [AMBIGUOUS_REFERENCE] Reference `name` is ambiguous, could > be: [`name`, `name`]. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42534) Fix DB2 Limit clause
[ https://issues.apache.org/jira/browse/SPARK-42534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692440#comment-17692440 ] Apache Spark commented on SPARK-42534: -- User 'sadikovi' has created a pull request for this issue: https://github.com/apache/spark/pull/40134 > Fix DB2 Limit clause > > > Key: SPARK-42534 > URL: https://issues.apache.org/jira/browse/SPARK-42534 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Ivan Sadikov >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42534) Fix DB2 Limit clause
[ https://issues.apache.org/jira/browse/SPARK-42534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42534: Assignee: (was: Apache Spark) > Fix DB2 Limit clause > > > Key: SPARK-42534 > URL: https://issues.apache.org/jira/browse/SPARK-42534 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Ivan Sadikov >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42534) Fix DB2 Limit clause
[ https://issues.apache.org/jira/browse/SPARK-42534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692439#comment-17692439 ] Apache Spark commented on SPARK-42534: -- User 'sadikovi' has created a pull request for this issue: https://github.com/apache/spark/pull/40134 > Fix DB2 Limit clause > > > Key: SPARK-42534 > URL: https://issues.apache.org/jira/browse/SPARK-42534 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Ivan Sadikov >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42534) Fix DB2 Limit clause
[ https://issues.apache.org/jira/browse/SPARK-42534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42534: Assignee: Apache Spark > Fix DB2 Limit clause > > > Key: SPARK-42534 > URL: https://issues.apache.org/jira/browse/SPARK-42534 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Ivan Sadikov >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42534) Fix DB2 Limit clause
[ https://issues.apache.org/jira/browse/SPARK-42534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692438#comment-17692438 ] Ivan Sadikov commented on SPARK-42534: -- I am going to open a PR to fix this. > Fix DB2 Limit clause > > > Key: SPARK-42534 > URL: https://issues.apache.org/jira/browse/SPARK-42534 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Ivan Sadikov >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42534) Fix DB2 Limit clause
Ivan Sadikov created SPARK-42534: Summary: Fix DB2 Limit clause Key: SPARK-42534 URL: https://issues.apache.org/jira/browse/SPARK-42534 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Ivan Sadikov -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
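[Editor's note] The SPARK-42534 tickets above don't quote the patch, but the underlying incompatibility is that DB2 rejects MySQL-style `LIMIT n` and expects the standard `FETCH FIRST n ROWS ONLY` clause. A hedged sketch of dialect-specific limit rendering — names here are illustrative, not Spark's `JdbcDialect` API:

```python
def limit_clause(dialect: str, limit: int) -> str:
    """Render a row-limit clause for a few SQL dialects.

    Dialect coverage is illustrative, not Spark's dialect registry.
    A negative limit means "no limit requested".
    """
    if limit < 0:
        return ""
    if dialect == "db2":
        # DB2 does not accept MySQL-style LIMIT; use the standard fetch clause.
        return f"FETCH FIRST {limit} ROWS ONLY"
    return f"LIMIT {limit}"

print(limit_clause("db2", 10))    # FETCH FIRST 10 ROWS ONLY
print(limit_clause("mysql", 10))  # LIMIT 10
```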
[jira] [Resolved] (SPARK-42530) Remove Hadoop 2 from PySpark installation guide
[ https://issues.apache.org/jira/browse/SPARK-42530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42530. --- Fix Version/s: 3.4.0 Resolution: Fixed Issue resolved by pull request 40127 [https://github.com/apache/spark/pull/40127] > Remove Hadoop 2 from PySpark installation guide > --- > > Key: SPARK-42530 > URL: https://issues.apache.org/jira/browse/SPARK-42530 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42533) SSL support for Scala Client
[ https://issues.apache.org/jira/browse/SPARK-42533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692420#comment-17692420 ] Apache Spark commented on SPARK-42533: -- User 'zhenlineo' has created a pull request for this issue: https://github.com/apache/spark/pull/40133 > SSL support for Scala Client > > > Key: SPARK-42533 > URL: https://issues.apache.org/jira/browse/SPARK-42533 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Zhen Li >Priority: Major > > Add the basic encryption support for scala client. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42533) SSL support for Scala Client
[ https://issues.apache.org/jira/browse/SPARK-42533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42533: Assignee: (was: Apache Spark) > SSL support for Scala Client > > > Key: SPARK-42533 > URL: https://issues.apache.org/jira/browse/SPARK-42533 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Zhen Li >Priority: Major > > Add the basic encryption support for scala client. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42533) SSL support for Scala Client
[ https://issues.apache.org/jira/browse/SPARK-42533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42533: Assignee: Apache Spark > SSL support for Scala Client > > > Key: SPARK-42533 > URL: https://issues.apache.org/jira/browse/SPARK-42533 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Zhen Li >Assignee: Apache Spark >Priority: Major > > Add the basic encryption support for scala client. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42533) SSL support for Scala Client
[ https://issues.apache.org/jira/browse/SPARK-42533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692417#comment-17692417 ] Apache Spark commented on SPARK-42533: -- User 'zhenlineo' has created a pull request for this issue: https://github.com/apache/spark/pull/40133 > SSL support for Scala Client > > > Key: SPARK-42533 > URL: https://issues.apache.org/jira/browse/SPARK-42533 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Zhen Li >Priority: Major > > Add the basic encryption support for scala client. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42533) SSL support for Scala Client
Zhen Li created SPARK-42533: --- Summary: SSL support for Scala Client Key: SPARK-42533 URL: https://issues.apache.org/jira/browse/SPARK-42533 Project: Spark Issue Type: Improvement Components: Connect Affects Versions: 3.4.0 Reporter: Zhen Li Add the basic encryption support for scala client. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
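[Editor's note] Client-side encryption of the kind SPARK-42533 proposes usually reduces to a connect-time choice between a plaintext channel and a TLS-backed one with certificate verification on. A minimal stdlib sketch of that toggle — illustrative only; the Scala Connect client configures TLS through gRPC channel builders, not Python's `ssl` module:

```python
import ssl

def make_client_context(use_ssl: bool, ca_file=None):
    """Return None for a plaintext connection, or a verifying TLS context."""
    if not use_ssl:
        return None
    # The default context enables certificate and hostname verification;
    # ca_file optionally pins a custom trust root.
    return ssl.create_default_context(cafile=ca_file)

print(make_client_context(False))  # None (plaintext)
ctx = make_client_context(True)
print(ctx.verify_mode == ssl.CERT_REQUIRED)  # True
```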
[jira] [Closed] (SPARK-42518) Scala client Write API V2
[ https://issues.apache.org/jira/browse/SPARK-42518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zhen Li closed SPARK-42518. --- > Scala client Write API V2 > - > > Key: SPARK-42518 > URL: https://issues.apache.org/jira/browse/SPARK-42518 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Zhen Li >Assignee: Zhen Li >Priority: Major > Fix For: 3.4.0 > > > Impl the Dataset#writeTo method. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Deleted] (SPARK-40378) What is React Native.
[ https://issues.apache.org/jira/browse/SPARK-40378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen deleted SPARK-40378: - > What is React Native. > - > > Key: SPARK-40378 > URL: https://issues.apache.org/jira/browse/SPARK-40378 > Project: Spark > Issue Type: Bug >Reporter: Nikhil Sharma >Priority: Major > > Content deleted as spam -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-40378) What is React Native.
[ https://issues.apache.org/jira/browse/SPARK-40378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-40378: - > What is React Native. > - > > Key: SPARK-40378 > URL: https://issues.apache.org/jira/browse/SPARK-40378 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.0.3 >Reporter: Nikhil Sharma >Priority: Major > > Content deleted as spam -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-35563) [SQL] Window operations with over Int.MaxValue + 1 rows can silently drop rows
[ https://issues.apache.org/jira/browse/SPARK-35563 ] Sean R. Owen deleted comment on SPARK-35563: -- was (Author: JIRAUSER295436): Thank you for sharing such good information. Very informative and effective post. [Rails Course|https://www.igmguru.com/digital-marketing-programming/ruby-on-rails-certification-training/] > [SQL] Window operations with over Int.MaxValue + 1 rows can silently drop rows > -- > > Key: SPARK-35563 > URL: https://issues.apache.org/jira/browse/SPARK-35563 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.2 >Reporter: Robert Joseph Evans >Priority: Major > Labels: data-loss > > I think this impacts a lot more versions of Spark, but I don't know for sure > because it takes a long time to test. As a part of doing corner case > validation testing for spark rapids I found that if a window function has > more than {{Int.MaxValue + 1}} rows the result is silently truncated to that > many rows. I have only tested this on 3.0.2 with {{row_number}}, but I > suspect it will impact others as well. This is a really rare corner case, but > because it is silent data corruption I personally think it is quite serious. 
> {code:scala} > import org.apache.spark.sql.expressions.Window > val windowSpec = Window.partitionBy("a").orderBy("b") > val df = spark.range(Int.MaxValue.toLong + 100).selectExpr(s"1 as a", "id as > b") > spark.time(df.select(col("a"), col("b"), > row_number().over(windowSpec).alias("rn")).orderBy(desc("a"), > desc("b")).select((col("rn") < 0).alias("dir")).groupBy("dir").count.show(20)) > +-+--+ > > | dir| count| > +-+--+ > |false|2147483647| > | true| 1| > +-+--+ > Time taken: 1139089 ms > Int.MaxValue.toLong + 100 > res15: Long = 2147483747 > 2147483647L + 1 > res16: Long = 2147483648 > {code} > I had to make sure that I ran the above with at least 64GiB of heap for the > executor (I did it in local mode and it worked, but took forever to run) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
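[Editor's note] The silent truncation reported in SPARK-35563 is consistent with a row counter held in a 32-bit signed int: one row past `Int.MaxValue` wraps negative, which is exactly what the `rn < 0` probe in the repro detects (one `true` row). A pure-Python sketch of the wraparound — not Spark's implementation, just the JVM `Int` arithmetic:

```python
def to_int32(n: int) -> int:
    """Interpret an arbitrary integer as a 32-bit signed value, as a JVM Int would."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n

INT_MAX = 2147483647

print(to_int32(INT_MAX))      # 2147483647  (last representable row number)
print(to_int32(INT_MAX + 1))  # -2147483648 (wraps negative, so `rn < 0` holds)
```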
[jira] [Updated] (SPARK-40378) What is React Native.
[ https://issues.apache.org/jira/browse/SPARK-40378?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-40378: - > What is React Native. > - > > Key: SPARK-40378 > URL: https://issues.apache.org/jira/browse/SPARK-40378 > Project: Spark > Issue Type: Bug > Components: Java API >Affects Versions: 3.0.3 >Reporter: Nikhil Sharma >Priority: Major > > Content deleted as spam -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-40819) Parquet INT64 (TIMESTAMP(NANOS,true)) now throwing Illegal Parquet type instead of automatically converting to LongType
[ https://issues.apache.org/jira/browse/SPARK-40819 ] Sean R. Owen deleted comment on SPARK-40819: -- was (Author: JIRAUSER295436): Thank you for sharing such good information. Very informative and effective post. [https://www.igmguru.com/digital-marketing-programming/react-native-training/] > Parquet INT64 (TIMESTAMP(NANOS,true)) now throwing Illegal Parquet type > instead of automatically converting to LongType > > > Key: SPARK-40819 > URL: https://issues.apache.org/jira/browse/SPARK-40819 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2, 3.3.1, 3.2.3, 3.3.2, 3.4.0 >Reporter: Alfred Davidson >Assignee: Alfred Davidson >Priority: Critical > Labels: regression > Fix For: 3.2.4, 3.3.2, 3.4.0 > > > Since 3.2 parquet files containing attributes with type "INT64 > (TIMESTAMP(NANOS, true))" are no longer readable and attempting to read > throws: > > {code:java} > Caused by: org.apache.spark.sql.AnalysisException: Illegal Parquet type: > INT64 (TIMESTAMP(NANOS,true)) > at > org.apache.spark.sql.errors.QueryCompilationErrors$.illegalParquetTypeError(QueryCompilationErrors.scala:1284) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.illegalType$1(ParquetSchemaConverter.scala:105) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:174) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertField(ParquetSchemaConverter.scala:90) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.$anonfun$convert$1(ParquetSchemaConverter.scala:72) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) > at scala.collection.Iterator.foreach(Iterator.scala:941) > at scala.collection.Iterator.foreach$(Iterator.scala:941) > at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) > at 
scala.collection.IterableLike.foreach(IterableLike.scala:74) > at scala.collection.IterableLike.foreach$(IterableLike.scala:73) > at scala.collection.AbstractIterable.foreach(Iterable.scala:56) > at scala.collection.TraversableLike.map(TraversableLike.scala:238) > at scala.collection.TraversableLike.map$(TraversableLike.scala:231) > at scala.collection.AbstractTraversable.map(Traversable.scala:108) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:66) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convert(ParquetSchemaConverter.scala:63) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$readSchemaFromFooter$2(ParquetFileFormat.scala:548) > at scala.Option.getOrElse(Option.scala:189) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.readSchemaFromFooter(ParquetFileFormat.scala:548) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$2(ParquetFileFormat.scala:528) > at scala.collection.immutable.Stream.map(Stream.scala:418) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1(ParquetFileFormat.scala:528) > at > org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$.$anonfun$mergeSchemasInParallel$1$adapted(ParquetFileFormat.scala:521) > at > org.apache.spark.sql.execution.datasources.SchemaMergeUtils$.$anonfun$mergeSchemasInParallel$2(SchemaMergeUtils.scala:76) > {code} > Prior to 3.2 successfully reads the parquet automatically converting to a > LongType. 
> I believe work part of https://issues.apache.org/jira/browse/SPARK-34661 > introduced the change in behaviour, more specifically here: > [https://github.com/apache/spark/pull/31776/files#diff-3730a913c4b95edf09fb78f8739c538bae53f7269555b6226efe7ccee1901b39R154] > which throws the QueryCompilationErrors.illegalParquetTypeError -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
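[Editor's note] The SPARK-40819 regression comes down to the converter's primitive-field branch: before the SPARK-34661 change, an unrecognized INT64 annotation such as `TIMESTAMP(NANOS,true)` fell through and was read as a long, while afterwards it raises `illegalParquetTypeError`. A sketch of the two behaviors behind a flag — names are illustrative, not Spark's `ParquetToSparkSchemaConverter`:

```python
def convert_int64(logical_type, fallback_to_long=False):
    """Map a Parquet INT64 logical annotation to a Spark SQL type name (toy)."""
    known = {
        None: "LongType",
        "TIMESTAMP(MILLIS,true)": "TimestampType",
        "TIMESTAMP(MICROS,true)": "TimestampType",
    }
    if logical_type in known:
        return known[logical_type]
    if fallback_to_long:
        # Pre-3.2 behavior: unknown INT64 annotations were read as plain longs.
        return "LongType"
    raise ValueError(f"Illegal Parquet type: INT64 ({logical_type})")

print(convert_int64("TIMESTAMP(NANOS,true)", fallback_to_long=True))  # LongType
# convert_int64("TIMESTAMP(NANOS,true)")  # raises, as in the stack trace above
```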
[jira] [Updated] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
[ https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-22588: - > SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values > - > > Key: SPARK-22588 > URL: https://issues.apache.org/jira/browse/SPARK-22588 > Project: Spark > Issue Type: Question > Components: Deploy >Affects Versions: 2.1.1 >Reporter: Saanvi Sharma >Priority: Minor > Labels: dynamodb, spark > Original Estimate: 24h > Remaining Estimate: 24h > > Content deleted as spam -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Deleted] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
[ https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen deleted SPARK-22588: - > SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values > - > > Key: SPARK-22588 > URL: https://issues.apache.org/jira/browse/SPARK-22588 > Project: Spark > Issue Type: Question >Reporter: Saanvi Sharma >Priority: Minor > Labels: dynamodb, spark > Original Estimate: 24h > Remaining Estimate: 24h > > Content deleted as spam -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
[ https://issues.apache.org/jira/browse/SPARK-22588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-22588:
- External issue URL: (was: https://mindmajix.com/scala-training)

> SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
> -
>
> Key: SPARK-22588
> URL: https://issues.apache.org/jira/browse/SPARK-22588
> Project: Spark
> Issue Type: Question
> Components: Deploy
> Affects Versions: 2.1.1
> Reporter: Saanvi Sharma
> Priority: Minor
> Labels: dynamodb, spark
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> I am using Spark 2.1 on EMR and I have a dataframe like this:
>
> ClientNum | Value_1 | Value_2 | Value_3 | Value_4
> 14        | A       | B       | C       | null
> 19        | X       | Y       | null    | null
> 21        | R       | null    | null    | null
>
> I want to load this data into a DynamoDB table with ClientNum as the key, following
> "Analyze Your Data on Amazon DynamoDB with Apache Spark" and "Using Spark SQL for ETL".
>
> Here is the code I tried:
>
> var jobConf = new JobConf(sc.hadoopConfiguration)
> jobConf.set("dynamodb.servicename", "dynamodb")
> jobConf.set("dynamodb.input.tableName", "table_name")
> jobConf.set("dynamodb.output.tableName", "table_name")
> jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com")
> jobConf.set("dynamodb.regionid", "eu-west-1")
> jobConf.set("dynamodb.throughput.read", "1")
> jobConf.set("dynamodb.throughput.read.percent", "1")
> jobConf.set("dynamodb.throughput.write", "1")
> jobConf.set("dynamodb.throughput.write.percent", "1")
> jobConf.set("mapred.output.format.class", "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat")
> jobConf.set("mapred.input.format.class", "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")
>
> // Import data
> val df = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "true").load(path)
>
> I performed a transformation to get an RDD that matches the types the DynamoDB custom output format knows how to write. The custom output format expects a tuple containing the Text and DynamoDBItemWritable types. I create a new RDD with those types in the following map call:
>
> // Convert the dataframe to an RDD
> val df_rdd = df.rdd
> df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[10] at rdd at :41
>
> // Print the first row
> df_rdd.take(1)
> res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null])
>
> var ddbInsertFormattedRDD = df_rdd.map(a => {
>   var ddbMap = new HashMap[String, AttributeValue]()
>   var ClientNum = new AttributeValue()
>   ClientNum.setN(a.get(0).toString)
>   ddbMap.put("ClientNum", ClientNum)
>   var Value_1 = new AttributeValue()
>   Value_1.setS(a.get(1).toString)
>   ddbMap.put("Value_1", Value_1)
>   var Value_2 = new AttributeValue()
>   Value_2.setS(a.get(2).toString)
>   ddbMap.put("Value_2", Value_2)
>   var Value_3 = new AttributeValue()
>   Value_3.setS(a.get(3).toString)
>   ddbMap.put("Value_3", Value_3)
>   var Value_4 = new AttributeValue()
>   Value_4.setS(a.get(4).toString)
>   ddbMap.put("Value_4", Value_4)
>   var item = new DynamoDBItemWritable()
>   item.setItem(ddbMap)
>   (new Text(""), item)
> })
>
> The last call uses the job configuration that defines the EMR-DDB connector to write out the new RDD in the expected format:
>
> ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf)
>
> This fails with the following error:
>
> Caused by: java.lang.NullPointerException
>
> The null values cause the error; if I try with only ClientNum and Value_1 it works and the data is correctly inserted into the DynamoDB table.
> Thanks for your help!!

-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
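[Editorial note: the NullPointerException above is consistent with calling `setS(null)` for the null columns — a DynamoDB string AttributeValue cannot hold null, and DynamoDB has no concept of an empty attribute; the usual fix is to skip null cells entirely. A minimal, untested sketch of that null-safe map, assuming the same EMR-DDB connector and AWS SDK classes as the snippet above (the `columns` list and index offsets are illustrative):]

```scala
import java.util.HashMap
import org.apache.hadoop.io.Text
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import com.amazonaws.services.dynamodbv2.model.AttributeValue

// Non-key string columns, in the same order as in the dataframe.
val columns = Seq("Value_1", "Value_2", "Value_3", "Value_4")

val ddbInsertFormattedRDD = df.rdd.map { row =>
  val ddbMap = new HashMap[String, AttributeValue]()
  // The key column is numeric and assumed never null.
  ddbMap.put("ClientNum", new AttributeValue().withN(row.get(0).toString))
  // Only add an attribute when the cell is non-null; putting a null
  // string into AttributeValue is what triggered the NPE.
  columns.zipWithIndex.foreach { case (name, i) =>
    if (!row.isNullAt(i + 1)) {
      ddbMap.put(name, new AttributeValue().withS(row.getString(i + 1)))
    }
  }
  val item = new DynamoDBItemWritable()
  item.setItem(ddbMap)
  (new Text(""), item)
}
```

[Rows then simply omit their null attributes, which matches DynamoDB's sparse-item model and lets `saveAsHadoopDataset(jobConf)` run over the same RDD as before.]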
[jira] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
[ https://issues.apache.org/jira/browse/SPARK-22588 ] Sean R. Owen deleted comment on SPARK-22588:
-- was (Author: JIRAUSER294516): We offer comprehensive [Splunk online training|https://www.igmguru.com/big-data/splunk-training/] that also covers a variety of administrative and support options.

> SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
[jira] [Updated] (SPARK-40847) SPARK: Load Data from Dataframe or RDD to DynamoDB
[ https://issues.apache.org/jira/browse/SPARK-40847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-40847:
-
> SPARK: Load Data from Dataframe or RDD to DynamoDB
> -
>
> Key: SPARK-40847
> URL: https://issues.apache.org/jira/browse/SPARK-40847
> Project: Spark
> Issue Type: Question
> Components: Deploy
> Affects Versions: 2.1.1
> Reporter: Vivek Garg
> Priority: Major
> Labels: spark
[jira] [Updated] (SPARK-40847) SPARK: Load Data from Dataframe or RDD to DynamoDB
[ https://issues.apache.org/jira/browse/SPARK-40847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-40847:
-
> SPARK: Load Data from Dataframe or RDD to DynamoDB
> -
>
> Key: SPARK-40847
> URL: https://issues.apache.org/jira/browse/SPARK-40847
> Project: Spark
> Issue Type: Question
> Components: Deploy
> Affects Versions: 2.1.1
> Reporter: Vivek Garg
> Priority: Major
> Labels: spark
>
> Content deleted as spam
[jira] (SPARK-23521) SPIP: Standardize SQL logical plans with DataSourceV2
[ https://issues.apache.org/jira/browse/SPARK-23521 ] Sean R. Owen deleted comment on SPARK-23521: -- was (Author: JIRAUSER294516): IgmGuru [Mulesoft Online Training|https://www.igmguru.com/digital-marketing-programming/mulesoft-training/] is created with the Mulesoft certification exam in mind to ensure that the applicant passes the test on their first try. > SPIP: Standardize SQL logical plans with DataSourceV2 > - > > Key: SPARK-23521 > URL: https://issues.apache.org/jira/browse/SPARK-23521 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > Labels: SPIP > Attachments: SPIP_ Standardize logical plans.pdf > > > Executive Summary: This SPIP is based on [discussion about the DataSourceV2 > implementation|https://lists.apache.org/thread.html/55676ec1f5039d3deaf347d391cf82fe8574b8fa4eeab70110ed5b2b@%3Cdev.spark.apache.org%3E] > on the dev list. The proposal is to standardize the logical plans used for > write operations to make the planner more maintainable and to make Spark's > write behavior predictable and reliable. It proposes the following principles: > # Use well-defined logical plan nodes for all high-level operations: insert, > create, CTAS, overwrite table, etc. > # Use planner rules that match on these high-level nodes, so that it isn’t > necessary to create rules to match each eventual code path individually. > # Clearly define Spark’s behavior for these logical plan nodes. Physical > nodes should implement that behavior so that all code paths eventually make > the same guarantees. > # Specialize implementation when creating a physical plan, not logical > plans. This will avoid behavior drift and ensure planner code is shared > across physical implementations. > The SPIP doc presents a small but complete set of those high-level logical > operations, most of which are already defined in SQL or implemented by some > write path in Spark. 
[jira] [Deleted] (SPARK-40847) SPARK: Load Data from Dataframe or RDD to DynamoDB
[ https://issues.apache.org/jira/browse/SPARK-40847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen deleted SPARK-40847:
-
> SPARK: Load Data from Dataframe or RDD to DynamoDB
> -
>
> Key: SPARK-40847
> URL: https://issues.apache.org/jira/browse/SPARK-40847
> Project: Spark
> Issue Type: Question
> Reporter: Vivek Garg
> Priority: Major
> Labels: spark
>
> Content deleted as spam
[jira] (SPARK-40993) Migrate markdown style README to python/docs/development/testing.rst
[ https://issues.apache.org/jira/browse/SPARK-40993 ] Sean R. Owen deleted comment on SPARK-40993:
-- was (Author: JIRAUSER294516): Hii, I think you got the answer. [Salesforce Marketing Cloud Certification|https://www.igmguru.com/salesforce/salesforce-marketing-cloud-training/]

> Migrate markdown style README to python/docs/development/testing.rst
> -
>
> Key: SPARK-40993
> URL: https://issues.apache.org/jira/browse/SPARK-40993
> Project: Spark
> Issue Type: Sub-task
> Components: Connect
> Affects Versions: 3.4.0
> Reporter: Rui Wang
> Assignee: Hyukjin Kwon
> Priority: Major
[jira] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
[ https://issues.apache.org/jira/browse/SPARK-22588 ] Sean R. Owen deleted comment on SPARK-22588:
-- was (Author: JIRAUSER295436): Thank you for sharing the information. [Best Machine Learning Course|https://www.igmguru.com/machine-learning-ai/machine-learning-certification-training/] worldwide. Machine Learning Training Online program is designed after consulting people from the industry and academia.

> SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
[jira] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
[ https://issues.apache.org/jira/browse/SPARK-22588 ] Sean R. Owen deleted comment on SPARK-22588:
-- was (Author: JIRAUSER295436): Thank you for sharing the information. [Rails training|https://www.igmguru.com/digital-marketing-programming/ruby-on-rails-certification-training/] {*}provides in-depth knowledge on all the core fundamentals of Ruby and MVC design patterns through real-time use cases and projects{*}.

> SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
[jira] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
[ https://issues.apache.org/jira/browse/SPARK-22588 ] Sean R. Owen deleted comment on SPARK-22588:
-- was (Author: JIRAUSER295436): Thank you for sharing the information. The [AWS DevOps Professional certification|https://www.igmguru.com/cloud-computing/aws-devops-training/] is a professional-level certification offered by Amazon Web Services (AWS) that validates a candidate's ability to design, implement, and maintain a software development process on the AWS platform using DevOps practices. The certification is intended for individuals with at least one year of experience working with AWS and at least two years of experience in a DevOps role.

> SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
[jira] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
[ https://issues.apache.org/jira/browse/SPARK-22588 ] Sean R. Owen deleted comment on SPARK-22588:
-- was (Author: JIRAUSER294516): The industry standard-setting Machine Learning Operations or MLOps training offered by IgmGuru. The carefully selected training module includes the most recent syllabus to meet the demands of numerous sectors throughout the world. The [MLOps Certification|https://www.igmguru.com/machine-learning-ai/mlops-course-certification/] course was developed using the extensive knowledge and skills of industry leaders. The MLOps training gives people an advantage over the competition because it makes a wide range of profitable career prospects available to them.

> SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
[jira] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
[ https://issues.apache.org/jira/browse/SPARK-22588 ] Sean R. Owen deleted comment on SPARK-22588: -- was (Author: JIRAUSER295111): Thank you for sharing the information. [Vlocity Salesforce Certification|https://www.igmguru.com/salesforce/salesforce-vlocity-training-certification/] enhances CPQ and guided selling as well. Salesforce Vlocity is the pioneer assisting many tops and arising companies obtain their wanted progress utilizing its Omnichannel procedures. > SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values > - > > Key: SPARK-22588 > URL: https://issues.apache.org/jira/browse/SPARK-22588 > Project: Spark > Issue Type: Question > Components: Deploy >Affects Versions: 2.1.1 >Reporter: Saanvi Sharma >Priority: Minor > Labels: dynamodb, spark > Original Estimate: 24h > Remaining Estimate: 24h > > I am using spark 2.1 on EMR and i have a dataframe like this: > ClientNum | Value_1 | Value_2 | Value_3 | Value_4 > 14 |A |B| C | null > 19 |X |Y| null| null > 21 |R | null | null| null > I want to load data into DynamoDB table with ClientNum as key fetching: > Analyze Your Data on Amazon DynamoDB with apche Spark11 > Using Spark SQL for ETL3 > here is my code that I tried to solve: > var jobConf = new JobConf(sc.hadoopConfiguration) > jobConf.set("dynamodb.servicename", "dynamodb") > jobConf.set("dynamodb.input.tableName", "table_name") > jobConf.set("dynamodb.output.tableName", "table_name") > jobConf.set("dynamodb.endpoint", "dynamodb.eu-west-1.amazonaws.com") > jobConf.set("dynamodb.regionid", "eu-west-1") > jobConf.set("dynamodb.throughput.read", "1") > jobConf.set("dynamodb.throughput.read.percent", "1") > jobConf.set("dynamodb.throughput.write", "1") > jobConf.set("dynamodb.throughput.write.percent", "1") > > jobConf.set("mapred.output.format.class", > "org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat") > jobConf.set("mapred.input.format.class", > "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat") > #Import Data > val df = > 
sqlContext.read.format("com.databricks.spark.csv").option("header", > "true").option("inferSchema", "true").load(path) > I performed a transformation to have an RDD that matches the types that the > DynamoDB custom output format knows how to write. The custom output format > expects a tuple containing the Text and DynamoDBItemWritable types. > Create a new RDD with those types in it, in the following map call: > #Convert the dataframe to rdd > val df_rdd = df.rdd > > df_rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = > MapPartitionsRDD[10] at rdd at :41 > > #Print first rdd > df_rdd.take(1) > > res12: Array[org.apache.spark.sql.Row] = Array([14,A,B,C,null]) > var ddbInsertFormattedRDD = df_rdd.map(a => { > var ddbMap = new HashMap[String, AttributeValue]() > var ClientNum = new AttributeValue() > ClientNum.setN(a.get(0).toString) > ddbMap.put("ClientNum", ClientNum) > var Value_1 = new AttributeValue() > Value_1.setS(a.get(1).toString) > ddbMap.put("Value_1", Value_1) > var Value_2 = new AttributeValue() > Value_2.setS(a.get(2).toString) > ddbMap.put("Value_2", Value_2) > var Value_3 = new AttributeValue() > Value_3.setS(a.get(3).toString) > ddbMap.put("Value_3", Value_3) > var Value_4 = new AttributeValue() > Value_4.setS(a.get(4).toString) > ddbMap.put("Value_4", Value_4) > var item = new DynamoDBItemWritable() > item.setItem(ddbMap) > (new Text(""), item) > }) > This last call uses the job configuration that defines the EMR-DDB connector > to write out the new RDD you created in the expected format: > ddbInsertFormattedRDD.saveAsHadoopDataset(jobConf) > It fails with the following error: > Caused by: java.lang.NullPointerException > The null values caused the error; if I try with only ClientNum and Value_1, it works and the > data is correctly inserted into the DynamoDB table. > Thanks for your help !! 
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
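[Editor's note] The NullPointerException in the question above almost certainly comes from calling `setS(a.get(i).toString)` on a null column: `toString` on a null reference throws, and this API has no notion of a null string attribute anyway. The usual fix is to skip null columns when building each item, since a DynamoDB attribute is simply absent rather than null. A minimal, language-neutral sketch of the idea in plain Python, with a hypothetical `build_item` helper (not the EMR-DDB connector API; the original code is Scala):

```python
def build_item(row, columns):
    """Build a DynamoDB-style attribute map, omitting columns whose value is None.

    This mirrors the ddbMap-building loop in the question, but guards each
    column instead of unconditionally calling setS on possibly-null values.
    All attributes use the string ("S") type here purely for illustration.
    """
    item = {}
    for name, value in zip(columns, row):
        if value is not None:  # skip nulls instead of setS(null)
            item[name] = {"S": str(value)}
    return item

# First sample row from the question: Value_4 is null, so it is omitted.
print(build_item((14, "A", "B", "C", None),
                 ["ClientNum", "Value_1", "Value_2", "Value_3", "Value_4"]))
```

Applied inside the `df_rdd.map` call, the same guard (only `put` an attribute when `a.get(i)` is non-null) makes the `saveAsHadoopDataset` write succeed for rows with missing values.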
[jira] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
[ https://issues.apache.org/jira/browse/SPARK-22588 ] Sean R. Owen deleted comment on SPARK-22588: -- was (Author: JIRAUSER294516): According to the [Salesforce CPQ Certification|https://www.igmguru.com/salesforce/salesforce-cpq-training/] Exam, our Salesforce CPQ Certification Training program has been created. The core abilities needed for effectively implementing Salesforce CPQ solutions are developed in this course on Salesforce CPQ. Through instruction using practical examples, this course will go deeper into developing a quoting process, pricing strategies, configuration, CPQ object data model, and more. This online Salesforce CPQ training course includes practical projects that will aid you in passing the Salesforce CPQ Certification test. > SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values > - -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-22588) SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values
[ https://issues.apache.org/jira/browse/SPARK-22588 ] Sean R. Owen deleted comment on SPARK-22588: -- was (Author: JIRAUSER295436): Thank you for sharing the information. [React Native Online Course|https://www.igmguru.com/digital-marketing-programming/react-native-training/] is an integrated professional course aimed at providing learners with the skills and knowledge of React Native, a mobile application framework used for the development of mobile applications for Android, iOS, UWP (Universal Windows Platform), and the web. > SPARK: Load Data from Dataframe or RDD to DynamoDB / dealing with null values > - -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-34827) Support fetching shuffle blocks in batch with i/o encryption
[ https://issues.apache.org/jira/browse/SPARK-34827 ] Sean R. Owen deleted comment on SPARK-34827: -- was (Author: JIRAUSER295436): Thank you for sharing such good information. Very informative and effective post. +[https://www.igmguru.com/digital-marketing-programming/golang-training/]+ > Support fetching shuffle blocks in batch with i/o encryption > > > Key: SPARK-34827 > URL: https://issues.apache.org/jira/browse/SPARK-34827 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0, 3.3.0 >Reporter: Dongjoon Hyun >Priority: Blocker > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-2258) Worker UI displays zombie executors
[ https://issues.apache.org/jira/browse/SPARK-2258 ] Sean R. Owen deleted comment on SPARK-2258: - was (Author: JIRAUSER295111): The website is so easy to use – I am impressed with it. Thank you for Sharing. Salesforce Vlocity Training focuses on producing experts who aren't just able to handle the platform but build solutions to keep their respective companies as well as their careers way ahead of the competition. Go through this link:- [https://www.igmguru.com/salesforce/salesforce-vlocity-training-certification/] > Worker UI displays zombie executors > --- > > Key: SPARK-2258 > URL: https://issues.apache.org/jira/browse/SPARK-2258 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 1.0.0 >Reporter: Andrew Or >Assignee: Andrew Or >Priority: Major > Fix For: 1.1.0 > > Attachments: Screen Shot 2014-06-24 at 9.23.18 AM.png > > > See attached. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-34827) Support fetching shuffle blocks in batch with i/o encryption
[ https://issues.apache.org/jira/browse/SPARK-34827 ] Sean R. Owen deleted comment on SPARK-34827: -- was (Author: JIRAUSER297361): I like your content. If anyone wants to learn a new course like Vlocity platform developer certification focuses on producing experts who aren't just ready to handle the platform but build solutions to keep their respective companies and their careers ahead of the competition. Go through this link:[https://www.igmguru.com/salesforce/salesforce-vlocity-training-certification|https://www.igmguru.com/salesforce/salesforce-vlocity-training-certification/] > Support fetching shuffle blocks in batch with i/o encryption > > > Key: SPARK-34827 > URL: https://issues.apache.org/jira/browse/SPARK-34827 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0, 3.3.0 >Reporter: Dongjoon Hyun >Priority: Blocker > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-34827) Support fetching shuffle blocks in batch with i/o encryption
[ https://issues.apache.org/jira/browse/SPARK-34827 ] Sean R. Owen deleted comment on SPARK-34827: -- was (Author: JIRAUSER295111): Thank you for sharing such good information. Very informative and effective post. [Msbi Training|https://www.igmguru.com/data-science-bi/msbi-certification-training/] offers the best solutions for Business Intelligence and data mining. MSBI uses Visual Studio data tools and SQL servers to make great decisions in our business activities. > Support fetching shuffle blocks in batch with i/o encryption > > > Key: SPARK-34827 > URL: https://issues.apache.org/jira/browse/SPARK-34827 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0, 3.3.0 >Reporter: Dongjoon Hyun >Priority: Blocker > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-34827) Support fetching shuffle blocks in batch with i/o encryption
[ https://issues.apache.org/jira/browse/SPARK-34827 ] Sean R. Owen deleted comment on SPARK-34827: -- was (Author: JIRAUSER294516): I appreciate you sharing this useful information. Very useful and interesting post. [Uipath training|https://www.igmguru.com/machine-learning-ai/rpa-uipath-certification-training/]. > Support fetching shuffle blocks in batch with i/o encryption > > > Key: SPARK-34827 > URL: https://issues.apache.org/jira/browse/SPARK-34827 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0, 3.3.0 >Reporter: Dongjoon Hyun >Priority: Blocker > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-42444) DataFrame.drop should handle multi columns properly
[ https://issues.apache.org/jira/browse/SPARK-42444 ] Sean R. Owen deleted comment on SPARK-42444: -- was (Author: JIRAUSER295111): Thank you for sharing. [Azure Solution Architect Training |https://www.igmguru.com/cloud-computing/microsoft-azure-solution-architect-az-300-training/]has been designed for software developers who are keen on developing best-in-class applications using this open and advanced platform of Windows Azure. > DataFrame.drop should handle multi columns properly > --- > > Key: SPARK-42444 > URL: https://issues.apache.org/jira/browse/SPARK-42444 > Project: Spark > Issue Type: Bug > Components: PySpark >Affects Versions: 3.4.0 >Reporter: Ruifeng Zheng >Priority: Blocker > > {code:java} > from pyspark.sql import Row > df1 = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], > ["age", "name"]) > df2 = spark.createDataFrame([Row(height=80, name="Tom"), Row(height=85, > name="Bob")]) > df1.join(df2, df1.name == df2.name, 'inner').drop('name', 'age').show() > {code} > This works in 3.3 > {code:java} > +--+ > |height| > +--+ > |85| > |80| > +--+ > {code} > but fails in 3.4 > {code:java} > --- > AnalysisException Traceback (most recent call last) > Cell In[1], line 4 > 2 df1 = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, > "Bob")], ["age", "name"]) > 3 df2 = spark.createDataFrame([Row(height=80, name="Tom"), > Row(height=85, name="Bob")]) > > 4 df1.join(df2, df1.name == df2.name, 'inner').drop('name', > 'age').show() > File ~/Dev/spark/python/pyspark/sql/dataframe.py:4913, in > DataFrame.drop(self, *cols) >4911 jcols = [_to_java_column(c) for c in cols] >4912 first_column, *remaining_columns = jcols > -> 4913 jdf = self._jdf.drop(first_column, self._jseq(remaining_columns)) >4915 return DataFrame(jdf, self.sparkSession) > File ~/Dev/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py:1322, > in JavaMember.__call__(self, *args) >1316 command = proto.CALL_COMMAND_NAME +\ >1317 self.command_header +\ >1318 args_command +\ 
>1319 proto.END_COMMAND_PART >1321 answer = self.gateway_client.send_command(command) > -> 1322 return_value = get_return_value( >1323 answer, self.gateway_client, self.target_id, self.name) >1325 for temp_arg in temp_args: >1326 if hasattr(temp_arg, "_detach"): > File ~/Dev/spark/python/pyspark/errors/exceptions/captured.py:159, in > capture_sql_exception..deco(*a, **kw) > 155 converted = convert_exception(e.java_exception) > 156 if not isinstance(converted, UnknownException): > 157 # Hide where the exception came from that shows a non-Pythonic > 158 # JVM exception message. > --> 159 raise converted from None > 160 else: > 161 raise > AnalysisException: [AMBIGUOUS_REFERENCE] Reference `name` is ambiguous, could > be: [`name`, `name`]. > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
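[Editor's note] Until the multi-column `drop` regression is fixed, one possible workaround sketch (untested here; behavior may vary by version) is to drop the ambiguous columns one at a time via DataFrame-qualified `Column` objects, instead of passing both names to a single `drop()` call:

```python
# Hypothetical workaround for the AMBIGUOUS_REFERENCE error above: each
# drop() receives either a fully qualified Column or a name that is already
# unambiguous after the previous drops. Assumes a running SparkSession `spark`.
from pyspark.sql import Row

df1 = spark.createDataFrame([(14, "Tom"), (23, "Alice"), (16, "Bob")], ["age", "name"])
df2 = spark.createDataFrame([Row(height=80, name="Tom"), Row(height=85, name="Bob")])

joined = df1.join(df2, df1.name == df2.name, "inner")
result = joined.drop(df1.name).drop(df2.name).drop("age")  # only `height` remains
result.show()
```

`DataFrame.drop` accepts a single `Column`, and a qualified reference like `df2.name` sidesteps the by-name ambiguity that the 3.4 code path hits.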
[jira] [Updated] (SPARK-42033) Docker Tag Error 25 on gitlab-ci.yml trying to start GitLab Pipeline
[ https://issues.apache.org/jira/browse/SPARK-42033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-42033: - > Docker Tag Error 25 on gitlab-ci.yml trying to start GitLab Pipeline > > > Key: SPARK-42033 > URL: https://issues.apache.org/jira/browse/SPARK-42033 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.1.1 >Reporter: Pankaj Nagla >Priority: Major > > Content deleted as spam -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Deleted] (SPARK-42033) Docker Tag Error 25 on gitlab-ci.yml trying to start GitLab Pipeline
[ https://issues.apache.org/jira/browse/SPARK-42033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen deleted SPARK-42033: - > Docker Tag Error 25 on gitlab-ci.yml trying to start GitLab Pipeline > > > Key: SPARK-42033 > URL: https://issues.apache.org/jira/browse/SPARK-42033 > Project: Spark > Issue Type: Bug >Reporter: Pankaj Nagla >Priority: Major > > Content deleted as spam -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42033) Docker Tag Error 25 on gitlab-ci.yml trying to start GitLab Pipeline
[ https://issues.apache.org/jira/browse/SPARK-42033?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692409#comment-17692409 ] Sean R. Owen commented on SPARK-42033: -- It's spam. This guy is injecting links to some course. I'm deleting this and spam comments > Docker Tag Error 25 on gitlab-ci.yml trying to start GitLab Pipeline > > > Key: SPARK-42033 > URL: https://issues.apache.org/jira/browse/SPARK-42033 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 2.1.1 >Reporter: Pankaj Nagla >Priority: Major > > I'm going through the "Scalable FastAPI Application on AWS" course. My > gitlab-ci.yml file is below. > stages: > - docker > variables: > DOCKER_DRIVER: overlay2 > DOCKER_TLS_CERTDIR: "/certs" > cache: > key: ${CI_JOB_NAME} > paths: > - ${CI_PROJECT_DIR}/services/talk_booking/.venv/ > build-python-ci-image: > image: docker:19.03.0 > services: > - docker:19.03.0-dind > stage: docker > before_script: > - cd ci_cd/python/ > script: > - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" > $CI_REGISTRY > - docker build -t > registry.gitlab.com/chris_/talk-booking:cicd-python3.9-slim . > - docker push registry.gitlab.com/chris_/talk-booking:cicd-python3.9-slim > My Pipeline fails with this error: > See > [https://docs.docker.com/engine/reference/commandline/login/#credentials-store] > Login Succeeded > $ docker build -t registry.gitlab.com/chris_/talk-booking:cicd-python3.9-slim > . > invalid argument > "registry.gitlab.com/chris_/talk-booking:cicd-python3.9-slim" for "-t, --tag" > flag: invalid reference format > See 'docker build --help'. > Cleaning up project directory and file based variables > ERROR: Job failed: exit code 125 > It may or may not be relevant but the Container Registry for the GitLab > project says there's a Docker connection error. 
All these problems have been > discussed in this [Aws Sysops Training > |https://www.igmguru.com/cloud-computing/aws-sysops-certification-training/]follow > the page. > Thanks > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
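[Editor's note] The `invalid reference format` error quoted above is most likely caused by the trailing underscore in the `chris_` namespace: Docker repository path components must begin and end with an alphanumeric character. A simplified, hypothetical sketch of that rule (condensed from the distribution/reference grammar, not the actual Docker source; registry host names follow a different rule):

```python
import re

# One path component of a Docker repository name: lowercase alphanumerics,
# with '.', '_', '__', or runs of '-' allowed only *between* alphanumerics.
COMPONENT = re.compile(r"^[a-z0-9]+(?:(?:\.|__|_|-+)[a-z0-9]+)*$")

def valid_component(component: str) -> bool:
    """Return True if a repository path component is syntactically valid."""
    return COMPONENT.match(component) is not None

print(valid_component("chris_"))        # False: ends with '_'
print(valid_component("chris"))         # True
print(valid_component("talk-booking"))  # True
```

Renaming the GitLab namespace, or overriding the image path so it does not end a component with `_`, should make `docker build -t ...` accept the tag.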
[jira] [Resolved] (SPARK-42033) Docker Tag Error 25 on gitlab-ci.yml trying to start GitLab Pipeline
[ https://issues.apache.org/jira/browse/SPARK-42033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-42033. -- Resolution: Invalid > Docker Tag Error 25 on gitlab-ci.yml trying to start GitLab Pipeline > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42033) Docker Tag Error 25 on gitlab-ci.yml trying to start GitLab Pipeline
[ https://issues.apache.org/jira/browse/SPARK-42033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen updated SPARK-42033: - External issue URL: (was: https://www.igmguru.com/cloud-computing/aws-sysops-certification-training/) > Docker Tag Error 25 on gitlab-ci.yml trying to start GitLab Pipeline > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] (SPARK-40149) Star expansion after outer join asymmetrically includes joining key
[ https://issues.apache.org/jira/browse/SPARK-40149 ] Sean R. Owen deleted comment on SPARK-40149: -- was (Author: JIRAUSER295111): Thank you for sharing the information. [Vlocity Training|https://www.igmguru.com/salesforce/salesforce-vlocity-training-certification/] enhances CPQ and guided selling as well. Salesforce Vlocity is the pioneer assisting many tops and arising companies obtain their wanted progress utilizing its Omnichannel procedures. > Star expansion after outer join asymmetrically includes joining key > --- > > Key: SPARK-40149 > URL: https://issues.apache.org/jira/browse/SPARK-40149 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2 >Reporter: Otakar Truněček >Assignee: Wenchen Fan >Priority: Blocker > Fix For: 3.3.1, 3.2.3, 3.4.0 > > > When star expansion is used on left side of a join, the result will include > joining key, while on the right side of join it doesn't. I would expect the > behaviour to be symmetric (either include on both sides or on neither). > Example: > {code:python} > from pyspark.sql import SparkSession > import pyspark.sql.functions as f > spark = SparkSession.builder.getOrCreate() > df_left = spark.range(5).withColumn('val', f.lit('left')) > df_right = spark.range(3, 7).withColumn('val', f.lit('right')) > df_merged = ( > df_left > .alias('left') > .join(df_right.alias('right'), on='id', how='full_outer') > .withColumn('left_all', f.struct('left.*')) > .withColumn('right_all', f.struct('right.*')) > ) > df_merged.show() > {code} > result: > {code:java} > +---++-++-+ > | id| val| val|left_all|right_all| > +---++-++-+ > | 0|left| null| {0, left}| {null}| > | 1|left| null| {1, left}| {null}| > | 2|left| null| {2, left}| {null}| > | 3|left|right| {3, left}| {right}| > | 4|left|right| {4, left}| {right}| > | 5|null|right|{null, null}| {right}| > | 6|null|right|{null, null}| {right}| > +---++-++-+ > {code} > This behaviour started with release 3.2.0. 
Previously the key was not > included on either side. > Result from Spark 3.1.3 > {code:java} > +---++-++-+ > | id| val| val|left_all|right_all| > +---++-++-+ > | 0|left| null| {left}| {null}| > | 6|null|right| {null}| {right}| > | 5|null|right| {null}| {right}| > | 1|left| null| {left}| {null}| > | 3|left|right| {left}| {right}| > | 2|left| null| {left}| {null}| > | 4|left|right| {left}| {right}| > +---++-++-+ {code} > I have a gut feeling this is related to these issues: > https://issues.apache.org/jira/browse/SPARK-39376 > https://issues.apache.org/jira/browse/SPARK-34527 > https://issues.apache.org/jira/browse/SPARK-38603 > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals
[ https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692408#comment-17692408 ] Thomas Graves commented on SPARK-41793: --- [~ulysses] [~cloud_fan] [~xinrong] We need to decide what we are doing with this for 3.4 before doing any release. > Incorrect result for window frames defined by a range clause on large > decimals > --- > > Key: SPARK-41793 > URL: https://issues.apache.org/jira/browse/SPARK-41793 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gera Shegalov >Priority: Blocker > Labels: correctness > > Context > https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686 > The following windowing query on a simple two-row input should produce two > non-empty windows as a result > {code} > from pprint import pprint > data = [ > ('9223372036854775807', '11342371013783243717493546650944543.47'), > ('9223372036854775807', '.99') > ] > df1 = spark.createDataFrame(data, 'a STRING, b STRING') > df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)')) > df2.createOrReplaceTempView('test_table') > df = sql(''' > SELECT > COUNT(1) OVER ( > PARTITION BY a > ORDER BY b ASC > RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING > ) AS CNT_1 > FROM > test_table > ''') > res = df.collect() > df.explain(True) > pprint(res) > {code} > Spark 3.4.0-SNAPSHOT output: > {code} > [Row(CNT_1=1), Row(CNT_1=0)] > {code} > Spark 3.3.1 output as expected: > {code} > [Row(CNT_1=1), Row(CNT_1=1)] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42532) Update YuniKorn documentation with v1.2
[ https://issues.apache.org/jira/browse/SPARK-42532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-42532. --- Fix Version/s: 3.4.0 Assignee: Dongjoon Hyun Resolution: Fixed > Update YuniKorn documentation with v1.2 > --- > > Key: SPARK-42532 > URL: https://issues.apache.org/jira/browse/SPARK-42532 > Project: Spark > Issue Type: Documentation > Components: Documentation, Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42532) Update YuniKorn documentation with v1.2
[ https://issues.apache.org/jira/browse/SPARK-42532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42532: Assignee: Apache Spark > Update YuniKorn documentation with v1.2 > --- > > Key: SPARK-42532 > URL: https://issues.apache.org/jira/browse/SPARK-42532 > Project: Spark > Issue Type: Documentation > Components: Documentation, Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42532) Update YuniKorn documentation with v1.2
[ https://issues.apache.org/jira/browse/SPARK-42532?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42532: Assignee: (was: Apache Spark) > Update YuniKorn documentation with v1.2 > --- > > Key: SPARK-42532 > URL: https://issues.apache.org/jira/browse/SPARK-42532 > Project: Spark > Issue Type: Documentation > Components: Documentation, Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42532) Update YuniKorn documentation with v1.2
[ https://issues.apache.org/jira/browse/SPARK-42532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692401#comment-17692401 ] Apache Spark commented on SPARK-42532: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/40132 > Update YuniKorn documentation with v1.2 > --- > > Key: SPARK-42532 > URL: https://issues.apache.org/jira/browse/SPARK-42532 > Project: Spark > Issue Type: Documentation > Components: Documentation, Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42532) Update YuniKorn documentation with v1.2
Dongjoon Hyun created SPARK-42532: - Summary: Update YuniKorn documentation with v1.2 Key: SPARK-42532 URL: https://issues.apache.org/jira/browse/SPARK-42532 Project: Spark Issue Type: Documentation Components: Documentation, Kubernetes Affects Versions: 3.4.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42150) Upgrade Volcano to 1.7.0
[ https://issues.apache.org/jira/browse/SPARK-42150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692384#comment-17692384 ] Apache Spark commented on SPARK-42150: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/40131 > Upgrade Volcano to 1.7.0 > > > Key: SPARK-42150 > URL: https://issues.apache.org/jira/browse/SPARK-42150 > Project: Spark > Issue Type: Improvement > Components: Documentation, Kubernetes >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Minor > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42522) Fix DataFrameWriterV2 to find the default source
[ https://issues.apache.org/jira/browse/SPARK-42522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-42522. --- Fix Version/s: 3.4.0 Assignee: Takuya Ueshin Resolution: Fixed > Fix DataFrameWriterV2 to find the default source > > > Key: SPARK-42522 > URL: https://issues.apache.org/jira/browse/SPARK-42522 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.4.0 > > > {code:python} > df.writeTo("test_table").create() > {code} > throws: > {noformat} > pyspark.errors.exceptions.connect.SparkConnectGrpcException: > (org.apache.spark.SparkClassNotFoundException) [DATA_SOURCE_NOT_FOUND] Failed > to find the data source: . Please find packages at > `https://spark.apache.org/third-party-projects.html`. > {noformat} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42518) Scala client Write API V2
[ https://issues.apache.org/jira/browse/SPARK-42518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-42518. --- Fix Version/s: 3.4.0 Assignee: Zhen Li Resolution: Fixed > Scala client Write API V2 > - > > Key: SPARK-42518 > URL: https://issues.apache.org/jira/browse/SPARK-42518 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.4.0 >Reporter: Zhen Li >Assignee: Zhen Li >Priority: Major > Fix For: 3.4.0 > > > Implement the Dataset#writeTo method. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42531) Scala Client Add Collection Functions
[ https://issues.apache.org/jira/browse/SPARK-42531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42531: Assignee: (was: Apache Spark) > Scala Client Add Collection Functions > - > > Key: SPARK-42531 > URL: https://issues.apache.org/jira/browse/SPARK-42531 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42531) Scala Client Add Collection Functions
[ https://issues.apache.org/jira/browse/SPARK-42531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42531: Assignee: Apache Spark > Scala Client Add Collection Functions > - > > Key: SPARK-42531 > URL: https://issues.apache.org/jira/browse/SPARK-42531 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42531) Scala Client Add Collection Functions
[ https://issues.apache.org/jira/browse/SPARK-42531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692367#comment-17692367 ] Apache Spark commented on SPARK-42531: -- User 'hvanhovell' has created a pull request for this issue: https://github.com/apache/spark/pull/40130 > Scala Client Add Collection Functions > - > > Key: SPARK-42531 > URL: https://issues.apache.org/jira/browse/SPARK-42531 > Project: Spark > Issue Type: Task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Herman van Hövell >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42531) Scala Client Add Collection Functions
Herman van Hövell created SPARK-42531: - Summary: Scala Client Add Collection Functions Key: SPARK-42531 URL: https://issues.apache.org/jira/browse/SPARK-42531 Project: Spark Issue Type: Task Components: Connect Affects Versions: 3.4.0 Reporter: Herman van Hövell -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42529) Support Cube and Rollup
[ https://issues.apache.org/jira/browse/SPARK-42529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42529: Assignee: Apache Spark (was: Rui Wang) > Support Cube and Rollup > --- > > Key: SPARK-42529 > URL: https://issues.apache.org/jira/browse/SPARK-42529 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42529) Support Cube and Rollup
[ https://issues.apache.org/jira/browse/SPARK-42529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692349#comment-17692349 ] Apache Spark commented on SPARK-42529: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/40129 > Support Cube and Rollup > --- > > Key: SPARK-42529 > URL: https://issues.apache.org/jira/browse/SPARK-42529 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42529) Support Cube and Rollup
[ https://issues.apache.org/jira/browse/SPARK-42529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42529: Assignee: Rui Wang (was: Apache Spark) > Support Cube and Rollup > --- > > Key: SPARK-42529 > URL: https://issues.apache.org/jira/browse/SPARK-42529 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42529) Support Cube and Rollup
[ https://issues.apache.org/jira/browse/SPARK-42529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Rui Wang updated SPARK-42529: - Summary: Support Cube and Rollup (was: Support Cube,Rollup,Pivot) > Support Cube and Rollup > --- > > Key: SPARK-42529 > URL: https://issues.apache.org/jira/browse/SPARK-42529 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42466) spark.kubernetes.file.upload.path not deleting files under HDFS after job completes
[ https://issues.apache.org/jira/browse/SPARK-42466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692340#comment-17692340 ] Apache Spark commented on SPARK-42466: -- User 'shrprasa' has created a pull request for this issue: https://github.com/apache/spark/pull/40128 > spark.kubernetes.file.upload.path not deleting files under HDFS after job > completes > --- > > Key: SPARK-42466 > URL: https://issues.apache.org/jira/browse/SPARK-42466 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Jagadeeswara Rao >Priority: Major > > In cluster mode, files uploaded to the HDFS location set via the > spark.kubernetes.file.upload.path property are not cleared after the job completes. > Each file is uploaded to an HDFS directory of the form > spark-upload-[randomUUID] when {{KubernetesUtils}} is requested to > uploadFileUri. > [https://github.com/apache/spark/blob/76a134ade60a9f354aca01eaca0b2e2477c6bd43/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesUtils.scala#L310] > The following is the driver log: the driver completed successfully, but the > shutdown hook did not clear the HDFS files. > {code:java} > 23/02/16 18:06:56 INFO KubernetesClusterSchedulerBackend: Shutting down all > executors > 23/02/16 18:06:56 INFO > KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each > executor to shut down > 23/02/16 18:06:56 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has > been closed. > 23/02/16 18:06:57 INFO MapOutputTrackerMasterEndpoint: > MapOutputTrackerMasterEndpoint stopped! > 23/02/16 18:06:57 INFO MemoryStore: MemoryStore cleared > 23/02/16 18:06:57 INFO BlockManager: BlockManager stopped > 23/02/16 18:06:57 INFO BlockManagerMaster: BlockManagerMaster stopped > 23/02/16 18:06:57 INFO > OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: > OutputCommitCoordinator stopped! 
> 23/02/16 18:06:57 INFO SparkContext: Successfully stopped SparkContext > 23/02/16 18:06:57 INFO ShutdownHookManager: Shutdown hook called > 23/02/16 18:06:57 INFO ShutdownHookManager: Deleting directory > /tmp/spark-efb8f725-4ead-4729-a8e0-f478280121b7 > 23/02/16 18:06:57 INFO ShutdownHookManager: Deleting directory > /spark-local2/spark-66dbf7e6-fe7e-4655-8724-69d76d93fc1f > 23/02/16 18:06:57 INFO ShutdownHookManager: Deleting directory > /spark-local1/spark-53aefaee-58a5-4fce-b5b0-5e29f42e337f{code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
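The missing cleanup described in SPARK-42466 can be illustrated with a local-filesystem sketch (hypothetical helper names; the actual fix would need to delete the HDFS `spark-upload-*` directory, not a local temp dir): create a staging directory and register its removal to run at process exit, the way the driver's shutdown hook already handles its local `/tmp/spark-*` directories.

```python
import atexit
import os
import shutil
import tempfile

def make_staging_dir(prefix="spark-upload-"):
    """Create a staging directory (named like the spark-upload-[randomUUID]
    directories from the report) and register a cleanup hook so it is
    removed on normal process exit -- the step the report says is missing
    for the HDFS upload path."""
    path = tempfile.mkdtemp(prefix=prefix)
    atexit.register(shutil.rmtree, path, ignore_errors=True)
    return path

staging = make_staging_dir()
print(os.path.isdir(staging))  # True until the process exits
```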
[jira] [Assigned] (SPARK-42466) spark.kubernetes.file.upload.path not deleting files under HDFS after job completes
[ https://issues.apache.org/jira/browse/SPARK-42466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42466: Assignee: (was: Apache Spark) > spark.kubernetes.file.upload.path not deleting files under HDFS after job > completes > --- > > Key: SPARK-42466 > URL: https://issues.apache.org/jira/browse/SPARK-42466 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Jagadeeswara Rao >Priority: Major > > In cluster mode after uploading files to HDFS location using > spark.kubernetes.file.upload.path property files are not getting cleared . > File is successfully uploaded to hdfs location in this format > spark-upload-[randomUUID] using {{KubernetesUtils}} is requested to > uploadFileUri . > [https://github.com/apache/spark/blob/76a134ade60a9f354aca01eaca0b2e2477c6bd43/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesUtils.scala#L310] > following is driver log , driver is completed successfully and shutdownhook > is not cleared the hdfs files. > {code:java} > 23/02/16 18:06:56 INFO KubernetesClusterSchedulerBackend: Shutting down all > executors > 23/02/16 18:06:56 INFO > KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each > executor to shut down > 23/02/16 18:06:56 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has > been closed. > 23/02/16 18:06:57 INFO MapOutputTrackerMasterEndpoint: > MapOutputTrackerMasterEndpoint stopped! > 23/02/16 18:06:57 INFO MemoryStore: MemoryStore cleared > 23/02/16 18:06:57 INFO BlockManager: BlockManager stopped > 23/02/16 18:06:57 INFO BlockManagerMaster: BlockManagerMaster stopped > 23/02/16 18:06:57 INFO > OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: > OutputCommitCoordinator stopped! 
> 23/02/16 18:06:57 INFO SparkContext: Successfully stopped SparkContext > 23/02/16 18:06:57 INFO ShutdownHookManager: Shutdown hook called > 23/02/16 18:06:57 INFO ShutdownHookManager: Deleting directory > /tmp/spark-efb8f725-4ead-4729-a8e0-f478280121b7 > 23/02/16 18:06:57 INFO ShutdownHookManager: Deleting directory > /spark-local2/spark-66dbf7e6-fe7e-4655-8724-69d76d93fc1f > 23/02/16 18:06:57 INFO ShutdownHookManager: Deleting directory > /spark-local1/spark-53aefaee-58a5-4fce-b5b0-5e29f42e337f{code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42466) spark.kubernetes.file.upload.path not deleting files under HDFS after job completes
[ https://issues.apache.org/jira/browse/SPARK-42466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42466: Assignee: Apache Spark > spark.kubernetes.file.upload.path not deleting files under HDFS after job > completes > --- > > Key: SPARK-42466 > URL: https://issues.apache.org/jira/browse/SPARK-42466 > Project: Spark > Issue Type: Bug > Components: Kubernetes >Affects Versions: 3.2.0 >Reporter: Jagadeeswara Rao >Assignee: Apache Spark >Priority: Major > > In cluster mode after uploading files to HDFS location using > spark.kubernetes.file.upload.path property files are not getting cleared . > File is successfully uploaded to hdfs location in this format > spark-upload-[randomUUID] using {{KubernetesUtils}} is requested to > uploadFileUri . > [https://github.com/apache/spark/blob/76a134ade60a9f354aca01eaca0b2e2477c6bd43/resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesUtils.scala#L310] > following is driver log , driver is completed successfully and shutdownhook > is not cleared the hdfs files. > {code:java} > 23/02/16 18:06:56 INFO KubernetesClusterSchedulerBackend: Shutting down all > executors > 23/02/16 18:06:56 INFO > KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each > executor to shut down > 23/02/16 18:06:56 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has > been closed. > 23/02/16 18:06:57 INFO MapOutputTrackerMasterEndpoint: > MapOutputTrackerMasterEndpoint stopped! > 23/02/16 18:06:57 INFO MemoryStore: MemoryStore cleared > 23/02/16 18:06:57 INFO BlockManager: BlockManager stopped > 23/02/16 18:06:57 INFO BlockManagerMaster: BlockManagerMaster stopped > 23/02/16 18:06:57 INFO > OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: > OutputCommitCoordinator stopped! 
> 23/02/16 18:06:57 INFO SparkContext: Successfully stopped SparkContext > 23/02/16 18:06:57 INFO ShutdownHookManager: Shutdown hook called > 23/02/16 18:06:57 INFO ShutdownHookManager: Deleting directory > /tmp/spark-efb8f725-4ead-4729-a8e0-f478280121b7 > 23/02/16 18:06:57 INFO ShutdownHookManager: Deleting directory > /spark-local2/spark-66dbf7e6-fe7e-4655-8724-69d76d93fc1f > 23/02/16 18:06:57 INFO ShutdownHookManager: Deleting directory > /spark-local1/spark-53aefaee-58a5-4fce-b5b0-5e29f42e337f{code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42530) Remove Hadoop 2 from PySpark installation guide
[ https://issues.apache.org/jira/browse/SPARK-42530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reassigned SPARK-42530: - Assignee: Dongjoon Hyun > Remove Hadoop 2 from PySpark installation guide > --- > > Key: SPARK-42530 > URL: https://issues.apache.org/jira/browse/SPARK-42530 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42530) Remove Hadoop 2 from PySpark installation guide
[ https://issues.apache.org/jira/browse/SPARK-42530?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692333#comment-17692333 ] Apache Spark commented on SPARK-42530: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/40127 > Remove Hadoop 2 from PySpark installation guide > --- > > Key: SPARK-42530 > URL: https://issues.apache.org/jira/browse/SPARK-42530 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42530) Remove Hadoop 2 from PySpark installation guide
[ https://issues.apache.org/jira/browse/SPARK-42530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42530: Assignee: (was: Apache Spark) > Remove Hadoop 2 from PySpark installation guide > --- > > Key: SPARK-42530 > URL: https://issues.apache.org/jira/browse/SPARK-42530 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-42530) Remove Hadoop 2 from PySpark installation guide
[ https://issues.apache.org/jira/browse/SPARK-42530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-42530: Assignee: Apache Spark > Remove Hadoop 2 from PySpark installation guide > --- > > Key: SPARK-42530 > URL: https://issues.apache.org/jira/browse/SPARK-42530 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42530) Remove Hadoop 2 from PySpark installation guide
[ https://issues.apache.org/jira/browse/SPARK-42530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-42530: -- Summary: Remove Hadoop 2 from PySpark installation guide (was: Update PySpark installation guide by hiding Hadoop 2) > Remove Hadoop 2 from PySpark installation guide > --- > > Key: SPARK-42530 > URL: https://issues.apache.org/jira/browse/SPARK-42530 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42530) Update PySpark installation guide by hiding Hadoop 2
[ https://issues.apache.org/jira/browse/SPARK-42530?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-42530: -- Summary: Update PySpark installation guide by hiding Hadoop 2 (was: Update PySpark installation by hiding Hadoop 2) > Update PySpark installation guide by hiding Hadoop 2 > > > Key: SPARK-42530 > URL: https://issues.apache.org/jira/browse/SPARK-42530 > Project: Spark > Issue Type: Documentation > Components: Documentation, PySpark >Affects Versions: 3.4.0 >Reporter: Dongjoon Hyun >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42530) Update PySpark installation by hiding Hadoop 2
Dongjoon Hyun created SPARK-42530: - Summary: Update PySpark installation by hiding Hadoop 2 Key: SPARK-42530 URL: https://issues.apache.org/jira/browse/SPARK-42530 Project: Spark Issue Type: Documentation Components: Documentation, PySpark Affects Versions: 3.4.0 Reporter: Dongjoon Hyun -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-40822) Use stable derived-column-alias algorithm, suitable for CREATE VIEW
[ https://issues.apache.org/jira/browse/SPARK-40822?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692308#comment-17692308 ] Apache Spark commented on SPARK-40822: -- User 'MaxGekk' has created a pull request for this issue: https://github.com/apache/spark/pull/40126 > Use stable derived-column-alias algorithm, suitable for CREATE VIEW > > > Key: SPARK-40822 > URL: https://issues.apache.org/jira/browse/SPARK-40822 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.4.0 >Reporter: Serge Rielau >Priority: Major > > Spark has the ability to derive column aliases for expressions if no alias was > provided by the user. > E.g. > CREATE TABLE T(c1 INT, c2 INT); > SELECT c1, `(c1 + 1)`, c3 FROM (SELECT c1, c1 + 1, c1 * c2 AS c3 FROM T); > This is a valuable feature. However, the current implementation works by > pretty printing the expression from the logical plan. This has multiple > downsides: > * The derived names can be unintuitive. For example the brackets in `(c1 + > 1)`, or outright ugly names, such as: > SELECT `substr(hello, 1, 2147483647)` FROM (SELECT substr('hello', 1)) AS T; > * We cannot guarantee stability across versions, since the logical plan of an > expression may change. > The latter is a major reason why we cannot allow CREATE VIEW without a column > list except in "trivial" cases. > CREATE VIEW v AS SELECT c1, c1 + 1, c1 * c2 AS c3 FROM T; > Not allowed to create a permanent view `spark_catalog`.`default`.`v` without > explicitly assigning an alias for expression (c1 + 1). > There are two ways we can go about fixing this: > # Stop deriving column aliases from the expression. Instead, generate unique > names such as `_col_1` based on their position in the select list. This is > ugly and takes away the "nice" headers on result sets. > # Move the derivation of the name upstream. That is, instead of pretty > printing the logical plan, we pretty print the lexer output, or a sanitized > version of the expression as typed. > The statement as typed is stable by definition. The lexer is stable because it > has no reason to change. And if it ever did, we would have a better chance to > manage the change. > In this feature we propose the following semantics: > # If the column alias can be trivially derived (some of these rules can stack), do > so: > ** a (qualified) column reference => the unqualified column identifier > cat.sch.tab.col => col > ** A field reference => the field name > struct.field1.field2 => field2 > ** A cast(column AS type) => column > cast(col1 AS INT) => col1 > ** A map lookup with a literal key => the key name > map.key => key > map['key'] => key > ** A parameterless function => the unqualified function name > current_schema() => current_schema > # Otherwise, take the lexer tokens of the expression, eliminate comments, and > concatenate them: > foo(tab1.c1 + /* this is a plus*/ > 1) => `foo(tab1.c1+1)` > > Of course we want this change under a config. > If the config is set, we can allow CREATE VIEW to exploit this and use the > derived names. > PS: The exact mechanics of formatting the name are very much debatable, > e.g. spaces between tokens, squeezing out comments, upper casing, preserving > quotes or double quotes... -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
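The trivial-derivation rules proposed in SPARK-40822 can be sketched in pure Python (a hypothetical illustration of the proposal, not Spark's lexer or implementation; `derive_alias` and its regexes are assumptions): try the trivial rules first, then fall back to the comment-stripped, whitespace-squeezed expression text.

```python
import re

def derive_alias(expr: str) -> str:
    """Sketch of the proposed alias derivation: trivial rules first,
    then the sanitized expression text as typed."""
    e = expr.strip()
    # cast(column AS type) => column (rules may stack, so keep going)
    m = re.fullmatch(r"cast\(\s*([A-Za-z_][\w.]*)\s+as\s+\w+\s*\)", e, re.I)
    if m:
        e = m.group(1)
    # map['key'] => key
    m = re.fullmatch(r"[A-Za-z_][\w.]*\['([^']+)'\]", e)
    if m:
        return m.group(1)
    # parameterless function: current_schema() => current_schema
    m = re.fullmatch(r"([A-Za-z_]\w*)\(\s*\)", e)
    if m:
        return m.group(1)
    # (qualified) column or field reference => last identifier
    if re.fullmatch(r"[A-Za-z_][\w.]*", e):
        return e.split(".")[-1]
    # fallback: drop comments, squeeze out whitespace
    e = re.sub(r"/\*.*?\*/", "", e, flags=re.S)
    return re.sub(r"\s+", "", e)

print(derive_alias("cat.sch.tab.col"))               # col
print(derive_alias("cast(col1 AS INT)"))             # col1
print(derive_alias("map['key']"))                    # key
print(derive_alias("current_schema()"))              # current_schema
print(derive_alias("foo(tab1.c1 + /* a plus */ 1)")) # foo(tab1.c1+1)
```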
[jira] [Commented] (SPARK-42468) Implement agg by (String, String)*
[ https://issues.apache.org/jira/browse/SPARK-42468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692307#comment-17692307 ] Apache Spark commented on SPARK-42468: -- User 'amaliujia' has created a pull request for this issue: https://github.com/apache/spark/pull/40125 > Implement agg by (String, String)* > -- > > Key: SPARK-42468 > URL: https://issues.apache.org/jira/browse/SPARK-42468 > Project: Spark > Issue Type: Sub-task > Components: Connect >Affects Versions: 3.4.0 >Reporter: Rui Wang >Assignee: Rui Wang >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-42529) Support Cube,Rollup,Pivot
Rui Wang created SPARK-42529: Summary: Support Cube,Rollup,Pivot Key: SPARK-42529 URL: https://issues.apache.org/jira/browse/SPARK-42529 Project: Spark Issue Type: Sub-task Components: Connect Affects Versions: 3.4.0 Reporter: Rui Wang Assignee: Rui Wang -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-42527) Scala Client add Window functions
[ https://issues.apache.org/jira/browse/SPARK-42527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Herman van Hövell resolved SPARK-42527. --- Fix Version/s: 3.4.0 Assignee: Yang Jie Resolution: Fixed > Scala Client add Window functions > - > > Key: SPARK-42527 > URL: https://issues.apache.org/jira/browse/SPARK-42527 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-42527) Scala Client add Window functions
[ https://issues.apache.org/jira/browse/SPARK-42527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692226#comment-17692226 ] Herman van Hövell commented on SPARK-42527: --- I have merged this to 3.4. It might be in 3.4.0 if RC fails, or 3.4.1 if it passes. > Scala Client add Window functions > - > > Key: SPARK-42527 > URL: https://issues.apache.org/jira/browse/SPARK-42527 > Project: Spark > Issue Type: Improvement > Components: Connect >Affects Versions: 3.5.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Major > Fix For: 3.4.0 > > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-37980) Extend METADATA column to support row indices for file based data sources
[ https://issues.apache.org/jira/browse/SPARK-37980?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17692170#comment-17692170 ] Apache Spark commented on SPARK-37980: -- User 'olaky' has created a pull request for this issue: https://github.com/apache/spark/pull/40124 > Extend METADATA column to support row indices for file based data sources > - > > Key: SPARK-37980 > URL: https://issues.apache.org/jira/browse/SPARK-37980 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Prakhar Jain >Assignee: Ala Luszczak >Priority: Major > Fix For: 3.4.0 > > > Spark recently added hidden metadata column support for file-based > data sources as part of SPARK-37273. > We should extend it to support ROW_INDEX/ROW_POSITION also. > > Meaning of ROW_POSITION: > ROW_INDEX/ROW_POSITION is basically an index of a row within a file. E.g. the 5th > row in the file will have ROW_INDEX 5. > > Use cases: > Row indexes can be used in a variety of ways. A (fileName, rowIndex) tuple > uniquely identifies a row in a table. This information can be used to mark rows, > e.g. by an indexer. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
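The (fileName, rowIndex) keying described in the issue can be modeled in plain Python. This is a conceptual sketch of why the tuple is a unique row key, not Spark's metadata-column implementation; the `with_row_index` helper and the 0-based numbering are assumptions for illustration.

```python
def with_row_index(files):
    """Model the proposed ROW_INDEX metadata column: each row gets its
    position within its own file, so the (file_name, row_index) pair
    uniquely identifies a row across the whole table even though
    row_index alone repeats between files."""
    keyed = []
    for file_name, rows in files.items():
        for row_index, row in enumerate(rows):  # index restarts per file
            keyed.append(((file_name, row_index), row))
    return keyed

# A toy two-file "table"
table = {
    "part-0000.parquet": ["a", "b"],
    "part-0001.parquet": ["c"],
}
rows = with_row_index(table)
keys = [key for key, _ in rows]
assert len(keys) == len(set(keys))  # (file, index) pairs are unique
```

In Spark itself this surfaces through the hidden `_metadata` struct column added by SPARK-37273, which this issue extends with a row-index field for file-based sources.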