[jira] [Updated] (SPARK-44973) Fix ArrayIndexOutOfBoundsException in conv()
[ https://issues.apache.org/jira/browse/SPARK-44973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated SPARK-44973: -- Affects Version/s: 3.0.3 > Fix ArrayIndexOutOfBoundsException in conv() > > > Key: SPARK-44973 > URL: https://issues.apache.org/jira/browse/SPARK-44973 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.3, 3.3.3, 3.4.1, 3.5.0 >Reporter: Gera Shegalov >Assignee: Mark Jarvin >Priority: Major > Labels: pull-request-available > Fix For: 3.4.2, 4.0.0, 3.5.1, 3.3.4 > > > {code:scala} > scala> sql(s"SELECT CONV('${Long.MinValue}', 10, -2)").show(false) > java.lang.ArrayIndexOutOfBoundsException: -1 > at > org.apache.spark.sql.catalyst.util.NumberConverter$.convert(NumberConverter.scala:183) > at > org.apache.spark.sql.catalyst.expressions.Conv.nullSafeEval(mathExpressions.scala:463) > at > org.apache.spark.sql.catalyst.expressions.TernaryExpression.eval(Expression.scala:821) > at > org.apache.spark.sql.catalyst.expressions.ToPrettyString.eval(ToPrettyString.scala:57) > at > org.apache.spark.sql.catalyst.optimizer.ConstantFolding$.org$apache$spark$sql$catalyst$optimizer$ConstantFolding$$constantFolding(expressions.scala:81) > at > org.apache.spark.sql.catalyst.optimizer.ConstantFolding$.$anonfun$constantFolding$4(expressions.scala:91) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
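Editor's note on the arithmetic behind the negative index: negating Long.MinValue in two's-complement wraps back to Long.MinValue, so any sign-normalization step that assumes {{-v}} or {{math.abs(v)}} is non-negative breaks at exactly this input. A minimal Scala sketch of a plausible trigger (an illustration only, not Spark's NumberConverter code):
{code:scala}
// Two's-complement wrap-around: Long.MinValue has no positive counterpart.
val v = Long.MinValue
assert(-v == Long.MinValue)          // negation overflows back to Long.MinValue
assert(math.abs(v) == Long.MinValue) // math.abs cannot return +2^63 either
// Any array index derived from the assumed-positive "absolute value" of v can
// therefore stay negative, which matches the -1 index in the trace above.
{code}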
[jira] [Commented] (SPARK-44973) CONV('-9223372036854775808', 10, -2) throws ArrayIndexOutOfBoundsException
[ https://issues.apache.org/jira/browse/SPARK-44973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17788157#comment-17788157 ] Gera Shegalov commented on SPARK-44973: --- 3.0.3 is the oldest version on my box and it exhibits the same bug: {code:java} scala> spark.version res1: String = 3.0.3 scala> sql(s"SELECT CONV('${Long.MinValue}', 10, -2)").show(false) java.lang.ArrayIndexOutOfBoundsException: -1 at org.apache.spark.sql.catalyst.util.NumberConverter$.convert(NumberConverter.scala:148) at org.apache.spark.sql.catalyst.expressions.Conv.nullSafeEval(mathExpressions.scala:338) at org.apache.spark.sql.catalyst.expressions.TernaryExpression.eval(Expression.scala:690) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:457) {code} > CONV('-9223372036854775808', 10, -2) throws ArrayIndexOutOfBoundsException > -- > > Key: SPARK-44973 > URL: https://issues.apache.org/jira/browse/SPARK-44973 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.3, 3.4.1, 3.5.0 >Reporter: Gera Shegalov >Priority: Major > Labels: pull-request-available > > {code:scala} > scala> sql(s"SELECT CONV('${Long.MinValue}', 10, -2)").show(false) > java.lang.ArrayIndexOutOfBoundsException: -1 > at > org.apache.spark.sql.catalyst.util.NumberConverter$.convert(NumberConverter.scala:183) > at > org.apache.spark.sql.catalyst.expressions.Conv.nullSafeEval(mathExpressions.scala:463) > at > org.apache.spark.sql.catalyst.expressions.TernaryExpression.eval(Expression.scala:821) > at > org.apache.spark.sql.catalyst.expressions.ToPrettyString.eval(ToPrettyString.scala:57) > at > org.apache.spark.sql.catalyst.optimizer.ConstantFolding$.org$apache$spark$sql$catalyst$optimizer$ConstantFolding$$constantFolding(expressions.scala:81) > at > org.apache.spark.sql.catalyst.optimizer.ConstantFolding$.$anonfun$constantFolding$4(expressions.scala:91) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20075) Support classifier, packaging in Maven coordinates
[ https://issues.apache.org/jira/browse/SPARK-20075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781550#comment-17781550 ] Gera Shegalov commented on SPARK-20075: --- This would be a great feature that can help spark-rapids plugin users that require a non-default classifier such as cuda12 > Support classifier, packaging in Maven coordinates > -- > > Key: SPARK-20075 > URL: https://issues.apache.org/jira/browse/SPARK-20075 > Project: Spark > Issue Type: Improvement > Components: Spark Shell, Spark Submit >Affects Versions: 2.1.0 >Reporter: Sean R. Owen >Priority: Minor > Labels: bulk-closed > > Currently, it's possible to add dependencies to an app using its Maven > coordinates on the command line: {{group:artifact:version}}. However, really > Maven coordinates are 5-dimensional: > {{group:artifact:packaging:classifier:version}}. In some rare but real cases > it's important to be able to specify the classifier. And while we're at it > why not try to support packaging? > I have a WIP PR that I'll post soon. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-43389) spark.read.csv throws NullPointerException when lineSep is set to None
[ https://issues.apache.org/jira/browse/SPARK-43389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17771698#comment-17771698 ] Gera Shegalov commented on SPARK-43389: --- There is a symmetrical issue on the DataFrameWriter side: {code:python} >>> spark.createDataFrame([('some value',),]).write.option('someOpt', >>> None).saveAsTable("hive_csv_t21") {code} {code:java} 23/10/03 21:39:12 WARN HiveExternalCatalog: Could not persist `spark_catalog`.`default`.`hive_csv_t21` in a Hive compatible way. Persisting it into Hive metastore in Spark SQL specific format. org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:java.lang.NullPointerException: Null values not allowed in persistent maps.) at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:869) at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:874) at org.apache.spark.sql.hive.client.Shim_v0_12.createTable(HiveShim.scala:614) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:573) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:303) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283) at org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:571) at org.apache.spark.sql.hive.HiveExternalCatalog.saveTableIntoHive(HiveExternalCatalog.scala:526) at org.apache.spark.sql.hive.HiveExternalCatalog.createDataSourceTable(HiveExternalCatalog.scala:415) at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:274) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99) at org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:245) at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:402) at org.apache.spark.sql.rapids.shims.GpuCreateDataSourceTableAsSelectCommand.run(GpuCreateDataSourceTableAsSelectCommandShims.scala:91) at com.nvidia.spark.rapids.GpuExecutedCommandExec.sideEffectResult$lzycompute(GpuExecutedCommandExec.scala:52) at com.nvidia.spark.rapids.GpuExecutedCommandExec.sideEffectResult(GpuExecutedCommandExec.scala:50) at com.nvidia.spark.rapids.GpuExecutedCommandExec.executeCollect(GpuExecutedCommandExec.scala:61) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:107) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:107) at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:461) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:437) at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:98) at
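Until the writer path validates option values, a caller-side workaround is to drop unset options before handing them to the {{DataFrameWriter}}. A minimal Scala sketch (assumes an existing {{df}}; the option names and table name mirror the repro above but are otherwise arbitrary):
{code:scala}
// Caller-side workaround sketch: keep only options that actually have a value,
// so no null ever reaches the Hive table-property map.
val rawOptions: Map[String, Option[String]] = Map(
  "lineSep" -> None,          // unset: would otherwise surface as null
  "header"  -> Some("true"))

val effectiveOptions: Map[String, String] =
  rawOptions.collect { case (k, Some(v)) => k -> v }

df.write.options(effectiveOptions).saveAsTable("hive_csv_t21")
{code}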
[jira] [Created] (SPARK-44973) CONV('-9223372036854775808', 10, -2) throws ArrayIndexOutOfBoundsException
Gera Shegalov created SPARK-44973: - Summary: CONV('-9223372036854775808', 10, -2) throws ArrayIndexOutOfBoundsException Key: SPARK-44973 URL: https://issues.apache.org/jira/browse/SPARK-44973 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.1, 3.5.0 Reporter: Gera Shegalov {code:scala} scala> sql(s"SELECT CONV('${Long.MinValue}', 10, -2)").show(false) java.lang.ArrayIndexOutOfBoundsException: -1 at org.apache.spark.sql.catalyst.util.NumberConverter$.convert(NumberConverter.scala:183) at org.apache.spark.sql.catalyst.expressions.Conv.nullSafeEval(mathExpressions.scala:463) at org.apache.spark.sql.catalyst.expressions.TernaryExpression.eval(Expression.scala:821) at org.apache.spark.sql.catalyst.expressions.ToPrettyString.eval(ToPrettyString.scala:57) at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$.org$apache$spark$sql$catalyst$optimizer$ConstantFolding$$constantFolding(expressions.scala:81) at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$.$anonfun$constantFolding$4(expressions.scala:91) {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44943) CONV produces incorrect result near Long.MIN_VALUE, fails to detect overflow
[ https://issues.apache.org/jira/browse/SPARK-44943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated SPARK-44943: -- Description: Signed conversion does not detect overflow {code:java} >>> spark.conf.set('spark.sql.ansi.enabled', True) >>> sql("SELECT conv('-9223372036854775809', 10, -10)").show(truncate=False) +---+ |conv(-9223372036854775809, 10, -10)| +---+ |-9223372036854775807 | +---+ {code} Unsigned conversion produces -1 but does not throw in the ANSI mode {code} >>> sql("SELECT conv('-9223372036854775809', 10, 10)").show(truncate=False) +--+ |conv(-9223372036854775809, 10, 10)| +--+ |18446744073709551615 | +--+ {code} was: {{>>> spark.conf.set('spark.sql.ansi.enabled', True)}} {{>>> sql("SELECT conv('-9223372036854775809', 10, -10)").show(truncate=False)}} {{+---+}} {{|conv(-9223372036854775809, 10, -10)|}} {{+---+}} {{|-9223372036854775807 |}} {{+---+}} > CONV produces incorrect result near Long.MIN_VALUE, fails to detect overflow > > > Key: SPARK-44943 > URL: https://issues.apache.org/jira/browse/SPARK-44943 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 3.5.0, 4.0.0 >Reporter: Gera Shegalov >Priority: Major > > Signed conversion does not detect overflow > {code:java} > >>> spark.conf.set('spark.sql.ansi.enabled', True) > >>> sql("SELECT conv('-9223372036854775809', 10, -10)").show(truncate=False) > +---+ > |conv(-9223372036854775809, 10, -10)| > +---+ > |-9223372036854775807 | > +---+ > {code} > Unsigned conversion produces -1 but does not throw in the ANSI mode > {code} > >>> sql("SELECT conv('-9223372036854775809', 10, 10)").show(truncate=False) > +--+ > |conv(-9223372036854775809, 10, 10)| > +--+ > |18446744073709551615 | > +--+ > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
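A hedged sketch of the kind of range check an ANSI-mode {{conv}} could perform before narrowing the decimal input to 64 bits; the helper name and approach are assumptions for illustration, not Spark's actual NumberConverter logic:
{code:scala}
// Hypothetical helper: detect that a decimal string does not fit into a signed
// Long, so ANSI mode can raise an overflow error instead of wrapping silently.
def fitsInSignedLong(s: String): Boolean = {
  val v = BigInt(s.trim)
  v >= BigInt(Long.MinValue) && v <= BigInt(Long.MaxValue)
}

fitsInSignedLong("-9223372036854775808") // true: exactly Long.MinValue
fitsInSignedLong("-9223372036854775809") // false: should trigger an ANSI error
{code}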
[jira] [Updated] (SPARK-44943) CONV produces incorrect result near Long.MIN_VALUE, fails to detect overflow
[ https://issues.apache.org/jira/browse/SPARK-44943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated SPARK-44943: -- Affects Version/s: 3.4.1 > CONV produces incorrect result near Long.MIN_VALUE, fails to detect overflow > > > Key: SPARK-44943 > URL: https://issues.apache.org/jira/browse/SPARK-44943 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.1, 3.5.0, 4.0.0 >Reporter: Gera Shegalov >Priority: Major > > {{>>> spark.conf.set('spark.sql.ansi.enabled', True)}} > {{>>> sql("SELECT conv('-9223372036854775809', 10, > -10)").show(truncate=False)}} > {{+---+}} > {{|conv(-9223372036854775809, 10, -10)|}} > {{+---+}} > {{|-9223372036854775807 |}} > {{+---+}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44943) CONV produces incorrect result near Long.MIN_VALUE, fails to detect overflow
[ https://issues.apache.org/jira/browse/SPARK-44943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated SPARK-44943: -- Affects Version/s: 3.5.0 > CONV produces incorrect result near Long.MIN_VALUE, fails to detect overflow > > > Key: SPARK-44943 > URL: https://issues.apache.org/jira/browse/SPARK-44943 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.5.0, 4.0.0 >Reporter: Gera Shegalov >Priority: Major > > {{>>> spark.conf.set('spark.sql.ansi.enabled', True)}} > {{>>> sql("SELECT conv('-9223372036854775809', 10, > -10)").show(truncate=False)}} > {{+---+}} > {{|conv(-9223372036854775809, 10, -10)|}} > {{+---+}} > {{|-9223372036854775807 |}} > {{+---+}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-44943) CONV produces incorrect result near Long.MIN_VALUE, fails to detect overflow
[ https://issues.apache.org/jira/browse/SPARK-44943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated SPARK-44943: -- Affects Version/s: 4.0.0 (was: 3.4.1) > CONV produces incorrect result near Long.MIN_VALUE, fails to detect overflow > > > Key: SPARK-44943 > URL: https://issues.apache.org/jira/browse/SPARK-44943 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 4.0.0 >Reporter: Gera Shegalov >Priority: Major > > {{>>> spark.conf.set('spark.sql.ansi.enabled', True)}} > {{>>> sql("SELECT conv('-9223372036854775809', 10, > -10)").show(truncate=False)}} > {{+---+}} > {{|conv(-9223372036854775809, 10, -10)|}} > {{+---+}} > {{|-9223372036854775807 |}} > {{+---+}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-44943) CONV produces incorrect result near Long.MIN_VALUE, fails to detect overflow
Gera Shegalov created SPARK-44943: - Summary: CONV produces incorrect result near Long.MIN_VALUE, fails to detect overflow Key: SPARK-44943 URL: https://issues.apache.org/jira/browse/SPARK-44943 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.1 Reporter: Gera Shegalov {{>>> spark.conf.set('spark.sql.ansi.enabled', True)}} {{>>> sql("SELECT conv('-9223372036854775809', 10, -10)").show(truncate=False)}} {{+---+}} {{|conv(-9223372036854775809, 10, -10)|}} {{+---+}} {{|-9223372036854775807 |}} {{+---+}} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-42752) Unprintable IllegalArgumentException with Hive catalog enabled in "Hadoop Free" distribution
[ https://issues.apache.org/jira/browse/SPARK-42752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated SPARK-42752: -- Description: Reproduction steps: 1. download a standard "Hadoop Free" build 2. Start pyspark REPL with Hive support {code:java} SPARK_DIST_CLASSPATH=$(~/dist/hadoop-3.4.0-SNAPSHOT/bin/hadoop classpath) ~/dist/spark-3.2.3-bin-without-hadoop/bin/pyspark --conf spark.sql.catalogImplementation=hive {code} 3. Execute any simple dataframe operation {code:java} >>> spark.range(100).show() Traceback (most recent call last): File "", line 1, in File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py", line 416, in range jdf = self._jsparkSession.range(0, int(start), int(step), int(numPartitions)) File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__ File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/utils.py", line 117, in deco raise converted from None pyspark.sql.utils.IllegalArgumentException: {code} 4. In fact you can just call spark.conf to trigger this issue {code:java} >>> spark.conf Traceback (most recent call last): File "", line 1, in ... {code} There are probably two issues here: 1) that Hive support should be gracefully disabled if it the dependency not on the classpath as claimed by https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html 2) but at the very least the user should be able to see the exception to understand the issue, and take an action was: Reproduction steps: 1. download a standard "Hadoop Free" build 2. Start pyspark REPL with Hive support {code:java} SPARK_DIST_CLASSPATH=$(~/dist/hadoop-3.4.0-SNAPSHOT/bin/hadoop classpath) ~/dist/spark-3.2.3-bin-without-hadoop/bin/pyspark --conf spark.sql.catalogImplementation=hive {code} 3. Execute any simple dataframe operation {code:java} >>> spark.range(100).show() Traceback (most recent call last): File "", line 1, in File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py", line 416, in range jdf = self._jsparkSession.range(0, int(start), int(step), int(numPartitions)) File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__ File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/utils.py", line 117, in deco raise converted from None pyspark.sql.utils.IllegalArgumentException: >>> spark.conf Traceback (most recent call last): File "", line 1, in File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py", line 347, in conf self._conf = RuntimeConfig(self._jsparkSession.conf()) File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__ File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/utils.py", line 117, in deco raise converted from None pyspark.sql.utils.IllegalArgumentException: {code} 4. In fact you can just call spark.conf to trigger this issue {code:java} >>> spark.conf Traceback (most recent call last): File "", line 1, in ... 
{code} There are probably two issues here: 1) that Hive support should be gracefully disabled if it the dependency not on the classpath as claimed by https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html 2) but at the very least the user should be able to see the exception to understand the issue, and take an action > Unprintable IllegalArgumentException with Hive catalog enabled in "Hadoop > Free" distibution > --- > > Key: SPARK-42752 > URL: https://issues.apache.org/jira/browse/SPARK-42752 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 3.1.3, 3.2.4, 3.3.3, 3.4.1, 3.5.0 > Environment: local >Reporter: Gera Shegalov >Priority: Major > > Reproduction steps: > 1. download a standard "Hadoop Free" build > 2. Start pyspark REPL with Hive support > {code:java} > SPARK_DIST_CLASSPATH=$(~/dist/hadoop-3.4.0-SNAPSHOT/bin/hadoop classpath) > ~/dist/spark-3.2.3-bin-without-hadoop/bin/pyspark --conf > spark.sql.catalogImplementation=hive > {code} > 3. Execute any simple dataframe operation > {code:java} > >>> spark.range(100).show() > Traceback (most recent call last): > File "", line 1, in > File > "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py", > line 416, in range > jdf = self._jsparkSession.range(0, int(start), int(step), > int(numPartitions)) > File > "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", > line 1321,
[jira] [Created] (SPARK-42752) Unprintable IllegalArgumentException with Hive catalog enabled in "Hadoop Free" distribution
Gera Shegalov created SPARK-42752: - Summary: Unprintable IllegalArgumentException with Hive catalog enabled in "Hadoop Free" distibution Key: SPARK-42752 URL: https://issues.apache.org/jira/browse/SPARK-42752 Project: Spark Issue Type: Bug Components: PySpark, SQL Affects Versions: 3.1.3, 3.2.4, 3.3.3, 3.4.1, 3.5.0 Environment: local Reporter: Gera Shegalov Reproduction steps: 1. download a standard "Hadoop Free" build 2. Start pyspark REPL with Hive support {code:java} SPARK_DIST_CLASSPATH=$(~/dist/hadoop-3.4.0-SNAPSHOT/bin/hadoop classpath) ~/dist/spark-3.2.3-bin-without-hadoop/bin/pyspark --conf spark.sql.catalogImplementation=hive {code} 3. Execute any simple dataframe operation {code:java} >>> spark.range(100).show() Traceback (most recent call last): File "", line 1, in File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py", line 416, in range jdf = self._jsparkSession.range(0, int(start), int(step), int(numPartitions)) File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__ File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/utils.py", line 117, in deco raise converted from None pyspark.sql.utils.IllegalArgumentException: >>> spark.conf Traceback (most recent call last): File "", line 1, in File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py", line 347, in conf self._conf = RuntimeConfig(self._jsparkSession.conf()) File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__ File "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/utils.py", line 117, in deco raise converted from None pyspark.sql.utils.IllegalArgumentException: {code} 4. In fact you can just call spark.conf to trigger this issue {code:java} >>> spark.conf Traceback (most recent call last): File "", line 1, in ... {code} There are probably two issues here: 1) that Hive support should be gracefully disabled if it the dependency not on the classpath as claimed by https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html 2) but at the very least the user should be able to see the exception to understand the issue, and take an action -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
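A quick way to confirm whether issue 1) applies — i.e. whether the Hive classes are actually visible on a "Hadoop Free" build's classpath — is to probe for HiveConf from spark-shell. A hedged Scala sketch (a diagnostic aid only, not a fix; the class name is one of the classes Spark's Hive support requires):
{code:scala}
// Diagnostic sketch: check whether Hive classes are visible before requesting
// spark.sql.catalogImplementation=hive on a "Hadoop Free" build.
val hiveOnClasspath =
  try { Class.forName("org.apache.hadoop.hive.conf.HiveConf"); true }
  catch { case _: ClassNotFoundException => false }

println(s"Hive classes on classpath: $hiveOnClasspath")
{code}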
[jira] [Reopened] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals
[ https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov reopened SPARK-41793: --- > Incorrect result for window frames defined by a range clause on large > decimals > --- > > Key: SPARK-41793 > URL: https://issues.apache.org/jira/browse/SPARK-41793 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gera Shegalov >Priority: Blocker > Labels: correctness > > Context > https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686 > The following windowing query on a simple two-row input should produce two > non-empty windows as a result > {code} > from pprint import pprint > data = [ > ('9223372036854775807', '11342371013783243717493546650944543.47'), > ('9223372036854775807', '.99') > ] > df1 = spark.createDataFrame(data, 'a STRING, b STRING') > df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)')) > df2.createOrReplaceTempView('test_table') > df = sql(''' > SELECT > COUNT(1) OVER ( > PARTITION BY a > ORDER BY b ASC > RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING > ) AS CNT_1 > FROM > test_table > ''') > res = df.collect() > df.explain(True) > pprint(res) > {code} > Spark 3.4.0-SNAPSHOT output: > {code} > [Row(CNT_1=1), Row(CNT_1=0)] > {code} > Spark 3.3.1 output as expected: > {code} > Row(CNT_1=1), Row(CNT_1=1)] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals
[ https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685384#comment-17685384 ] Gera Shegalov edited comment on SPARK-41793 at 2/9/23 6:26 PM: --- Another interpretation of why the pre-3.4 count of 1 may be actually correct could be that regardless of whether the window frame bound values overflow or not the current row is always part of the window it defines. Whether or not it should be the case can be clarified in the doc. UPDATE: if we consider the range of this query plan above {{RangeFrame, -10.23, 6.79)}}. {code} spark-sql> select b - 10.23 as lower, b, b + 6.79 as upper from test_table; 11342371013783243717493546650944533.24 11342371013783243717493546650944543.47 11342371013783243717493546650944550.26 9989.76 .99 NULL {code} as consisting of the union of [lower; b) b (b; upper] only the (b; upper] is undefined the first [lower; b) and b are defined and have at least one row. either way not counting the current row is a counterintuitive result was (Author: jira.shegalov): Another interpretation of why the pre-3.4 count of 1 may be actually correct could be that regardless of whether the window frame bound values overflow or not the current row is always part of the window it defines. Whether or not it should be the case can be clarified in the doc. > Incorrect result for window frames defined by a range clause on large > decimals > --- > > Key: SPARK-41793 > URL: https://issues.apache.org/jira/browse/SPARK-41793 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gera Shegalov >Priority: Blocker > Labels: correctness > > Context > https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686 > The following windowing query on a simple two-row input should produce two > non-empty windows as a result > {code} > from pprint import pprint > data = [ > ('9223372036854775807', '11342371013783243717493546650944543.47'), > ('9223372036854775807', '.99') > ] > df1 = spark.createDataFrame(data, 'a STRING, b STRING') > df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)')) > df2.createOrReplaceTempView('test_table') > df = sql(''' > SELECT > COUNT(1) OVER ( > PARTITION BY a > ORDER BY b ASC > RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING > ) AS CNT_1 > FROM > test_table > ''') > res = df.collect() > df.explain(True) > pprint(res) > {code} > Spark 3.4.0-SNAPSHOT output: > {code} > [Row(CNT_1=1), Row(CNT_1=0)] > {code} > Spark 3.3.1 output as expected: > {code} > Row(CNT_1=1), Row(CNT_1=1)] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
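The boundary arithmetic in that comment can be recomputed directly. A Scala sketch against the {{test_table}} from the description (non-ANSI mode; as shown in the comment above, one of the upper bounds overflows DECIMAL(38,2) and comes back NULL):
{code:scala}
// Recompute the frame bounds for each row of test_table (created as in the
// description). The question raised above is whether a NULL bound should
// exclude the current row from its own frame.
spark.sql(
  """SELECT b,
    |       b - 10.2345 AS lower_bound,
    |       b + 6.7890  AS upper_bound
    |FROM test_table""".stripMargin).show(truncate = false)
{code}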
[jira] [Commented] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals
[ https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685384#comment-17685384 ] Gera Shegalov commented on SPARK-41793: --- Another interpretation of why the pre-3.4 count of 1 may be actually correct could be that regardless of whether the window frame bound values overflow or not the current row is always part of the window it defines. Whether or not it should be the case can be clarified in the doc. > Incorrect result for window frames defined by a range clause on large > decimals > --- > > Key: SPARK-41793 > URL: https://issues.apache.org/jira/browse/SPARK-41793 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gera Shegalov >Priority: Blocker > Labels: correctness > > Context > https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686 > The following windowing query on a simple two-row input should produce two > non-empty windows as a result > {code} > from pprint import pprint > data = [ > ('9223372036854775807', '11342371013783243717493546650944543.47'), > ('9223372036854775807', '.99') > ] > df1 = spark.createDataFrame(data, 'a STRING, b STRING') > df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)')) > df2.createOrReplaceTempView('test_table') > df = sql(''' > SELECT > COUNT(1) OVER ( > PARTITION BY a > ORDER BY b ASC > RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING > ) AS CNT_1 > FROM > test_table > ''') > res = df.collect() > df.explain(True) > pprint(res) > {code} > Spark 3.4.0-SNAPSHOT output: > {code} > [Row(CNT_1=1), Row(CNT_1=0)] > {code} > Spark 3.3.1 output as expected: > {code} > Row(CNT_1=1), Row(CNT_1=1)] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals
[ https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684903#comment-17684903 ] Gera Shegalov edited comment on SPARK-41793 at 2/6/23 7:38 PM: --- if the consensus is that it's not a correctness bug in 3.4, then this fix should probably be documented and backported to maintenance branches? was (Author: jira.shegalov): if the consensus is that it's not a correctness bug in 3.4, then this fix should probably be documented and probably backported to maintenance branches? > Incorrect result for window frames defined by a range clause on large > decimals > --- > > Key: SPARK-41793 > URL: https://issues.apache.org/jira/browse/SPARK-41793 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gera Shegalov >Priority: Blocker > Labels: correctness > > Context > https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686 > The following windowing query on a simple two-row input should produce two > non-empty windows as a result > {code} > from pprint import pprint > data = [ > ('9223372036854775807', '11342371013783243717493546650944543.47'), > ('9223372036854775807', '.99') > ] > df1 = spark.createDataFrame(data, 'a STRING, b STRING') > df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)')) > df2.createOrReplaceTempView('test_table') > df = sql(''' > SELECT > COUNT(1) OVER ( > PARTITION BY a > ORDER BY b ASC > RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING > ) AS CNT_1 > FROM > test_table > ''') > res = df.collect() > df.explain(True) > pprint(res) > {code} > Spark 3.4.0-SNAPSHOT output: > {code} > [Row(CNT_1=1), Row(CNT_1=0)] > {code} > Spark 3.3.1 output as expected: > {code} > Row(CNT_1=1), Row(CNT_1=1)] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals
[ https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684903#comment-17684903 ] Gera Shegalov commented on SPARK-41793: --- if the consensus is that it's not a correctness bug in 3.4, then this fix should probably be documented and probably backported to maintenance branches? > Incorrect result for window frames defined by a range clause on large > decimals > --- > > Key: SPARK-41793 > URL: https://issues.apache.org/jira/browse/SPARK-41793 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gera Shegalov >Priority: Blocker > Labels: correctness > > Context > https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686 > The following windowing query on a simple two-row input should produce two > non-empty windows as a result > {code} > from pprint import pprint > data = [ > ('9223372036854775807', '11342371013783243717493546650944543.47'), > ('9223372036854775807', '.99') > ] > df1 = spark.createDataFrame(data, 'a STRING, b STRING') > df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)')) > df2.createOrReplaceTempView('test_table') > df = sql(''' > SELECT > COUNT(1) OVER ( > PARTITION BY a > ORDER BY b ASC > RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING > ) AS CNT_1 > FROM > test_table > ''') > res = df.collect() > df.explain(True) > pprint(res) > {code} > Spark 3.4.0-SNAPSHOT output: > {code} > [Row(CNT_1=1), Row(CNT_1=0)] > {code} > Spark 3.3.1 output as expected: > {code} > Row(CNT_1=1), Row(CNT_1=1)] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals
[ https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653238#comment-17653238 ] Gera Shegalov commented on SPARK-41793: --- Similarly in SQLite {code} .header on create table test_table(a long, b decimal(38,2)); insert into test_table values ('9223372036854775807', '11342371013783243717493546650944543.47'), ('9223372036854775807', '.99'); select * from test_table; select count(1) over( partition by a order by b asc range between 10.2345 preceding and 6.7890 following) as cnt_1 from test_table; {code} yields {code} a|b 9223372036854775807|1.13423710137832e+34 9223372036854775807|1.0e+36 cnt_1 1 1 {code} > Incorrect result for window frames defined by a range clause on large > decimals > --- > > Key: SPARK-41793 > URL: https://issues.apache.org/jira/browse/SPARK-41793 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gera Shegalov >Priority: Major > > Context > https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686 > The following windowing query on a simple two-row input should produce two > non-empty windows as a result > {code} > from pprint import pprint > data = [ > ('9223372036854775807', '11342371013783243717493546650944543.47'), > ('9223372036854775807', '.99') > ] > df1 = spark.createDataFrame(data, 'a STRING, b STRING') > df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)')) > df2.createOrReplaceTempView('test_table') > df = sql(''' > SELECT > COUNT(1) OVER ( > PARTITION BY a > ORDER BY b ASC > RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING > ) AS CNT_1 > FROM > test_table > ''') > res = df.collect() > df.explain(True) > pprint(res) > {code} > Spark 3.4.0-SNAPSHOT output: > {code} > [Row(CNT_1=1), Row(CNT_1=0)] > {code} > Spark 3.3.1 output as expected: > {code} > Row(CNT_1=1), Row(CNT_1=1)] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals
[ https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated SPARK-41793: -- Summary: Incorrect result for window frames defined by a range clause on large decimals (was: Incorrect result for window frames defined as ranges on large decimals ) > Incorrect result for window frames defined by a range clause on large > decimals > --- > > Key: SPARK-41793 > URL: https://issues.apache.org/jira/browse/SPARK-41793 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.4.0 >Reporter: Gera Shegalov >Priority: Major > > Context > https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686 > The following windowing query on a simple two-row input should produce two > non-empty windows as a result > {code} > from pprint import pprint > data = [ > ('9223372036854775807', '11342371013783243717493546650944543.47'), > ('9223372036854775807', '.99') > ] > df1 = spark.createDataFrame(data, 'a STRING, b STRING') > df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)')) > df2.createOrReplaceTempView('test_table') > df = sql(''' > SELECT > COUNT(1) OVER ( > PARTITION BY a > ORDER BY b ASC > RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING > ) AS CNT_1 > FROM > test_table > ''') > res = df.collect() > df.explain(True) > pprint(res) > {code} > Spark 3.4.0-SNAPSHOT output: > {code} > [Row(CNT_1=1), Row(CNT_1=0)] > {code} > Spark 3.3.1 output as expected: > {code} > Row(CNT_1=1), Row(CNT_1=1)] > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-41793) Incorrect result for window frames defined as ranges on large decimals
Gera Shegalov created SPARK-41793: - Summary: Incorrect result for window frames defined as ranges on large decimals Key: SPARK-41793 URL: https://issues.apache.org/jira/browse/SPARK-41793 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.4.0 Reporter: Gera Shegalov Context https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686 The following windowing query on a simple two-row input should produce two non-empty windows as a result {code} from pprint import pprint data = [ ('9223372036854775807', '11342371013783243717493546650944543.47'), ('9223372036854775807', '.99') ] df1 = spark.createDataFrame(data, 'a STRING, b STRING') df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)')) df2.createOrReplaceTempView('test_table') df = sql(''' SELECT COUNT(1) OVER ( PARTITION BY a ORDER BY b ASC RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING ) AS CNT_1 FROM test_table ''') res = df.collect() df.explain(True) pprint(res) {code} Spark 3.4.0-SNAPSHOT output: {code} [Row(CNT_1=1), Row(CNT_1=0)] {code} Spark 3.3.1 output as expected: {code} Row(CNT_1=1), Row(CNT_1=1)] {code} -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-35408) Improve parameter validation in DataFrame.show
Gera Shegalov created SPARK-35408: - Summary: Improve parameter validation in DataFrame.show Key: SPARK-35408 URL: https://issues.apache.org/jira/browse/SPARK-35408 Project: Spark Issue Type: Improvement Components: PySpark Affects Versions: 3.1.1 Reporter: Gera Shegalov Being more used to the Scala API, a user may be tempted to call {code:python} df.show(False) {code} in PySpark and will receive an error message that does not easily map back to the user code {noformat} py4j.Py4JException: Method showString([class java.lang.Boolean, class java.lang.Integer, class java.lang.Boolean]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) at py4j.Gateway.invoke(Gateway.java:274) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748) {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
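For comparison, the Scala API does accept a bare Boolean, which is where the habit comes from. A short Scala sketch (assumes an existing {{df}}):
{code:scala}
// Scala's DataFrame.show has Boolean-taking overloads, so this is valid Scala:
df.show(false)                  // 20 rows, no truncation
df.show(20, truncate = false)   // equivalent, with an explicit row count
// PySpark's show(n, truncate, vertical) expects the row count first, which is
// why df.show(False) fails there with the Py4J signature mismatch above.
{code}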
[jira] [Created] (SPARK-31900) Client memory passed unvalidated to the JVM Xmx
Gera Shegalov created SPARK-31900: - Summary: Client memory passed unvalidated to the JVM Xmx Key: SPARK-31900 URL: https://issues.apache.org/jira/browse/SPARK-31900 Project: Spark Issue Type: Bug Components: Spark Submit Affects Versions: 2.0.0, 3.0.0 Reporter: Gera Shegalov When launching in client mode, the Spark launcher uses the spark.driver.memory config (among other settings, in precedence order). Unlike in cluster mode, the client memory value is neither trimmed nor validated to have a valid suffix for the [Xmx option|https://docs.oracle.com/en/java/javase/14/docs/specs/man/java.html#extra-options-for-java]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
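A hedged sketch of the kind of check the launcher could apply before emitting {{-Xmx}} (a hypothetical helper, not the actual Spark launcher code):
{code:scala}
// Hypothetical validation: trim the configured value and require a plain byte
// count or a k/m/g-suffixed size, mirroring what -Xmx itself accepts.
val xmxPattern = "(?i)[0-9]+[kmg]?"

def toXmx(configured: String): String = {
  val trimmed = configured.trim
  require(trimmed.matches(xmxPattern),
    s"Invalid JVM heap size for -Xmx: '$configured'")
  s"-Xmx$trimmed"
}

toXmx(" 4g ")   // "-Xmx4g"
toXmx("4 GB")   // throws IllegalArgumentException instead of failing at JVM start
{code}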
[jira] [Commented] (SPARK-23155) YARN-aggregated executor/driver logs appear unavailable when NM is down
[ https://issues.apache.org/jira/browse/SPARK-23155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758102#comment-16758102 ] Gera Shegalov commented on SPARK-23155: --- [~kabhwan], [~vanzin] I would still be interested to be able to use the new mechanism with the old logs. [https://github.com/apache/spark/pull/23720] is a quick draft to demo how we could achieve this flexibly with named capture groups. > YARN-aggregated executor/driver logs appear unavailable when NM is down > --- > > Key: SPARK-23155 > URL: https://issues.apache.org/jira/browse/SPARK-23155 > Project: Spark > Issue Type: Improvement > Components: Deploy >Affects Versions: 2.2.1 >Reporter: Gera Shegalov >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.0.0 > > > Unlike MapReduce JobHistory Server, Spark history server isn't rewriting > container log URL's to point to the aggregated yarn.log.server.url location > and relies on the NodeManager webUI to trigger a redirect. This fails when > the NM is down. Note that NM may be down permanently after decommissioning in > traditional environments or when used in a cloud environment such as AWS EMR > where either worker nodes are taken away with autoscale, the whole cluster is > used to run a single job. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
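The named-capture-group idea is roughly: match the pieces of the original NodeManager log URL and reassemble them into a yarn.log.server.url link. A hedged Scala sketch with a made-up URL layout (the real NM and log-server URL formats, and the config syntax proposed in the PR, differ):
{code:scala}
// Illustration only: extract host, container id and user from an NM-style log
// URL with named groups, then rebuild an aggregated-log-server URL from them.
val nmLogUrl =
  raw"https?://(?<host>[^:/]+):\d+/node/containerlogs/(?<container>[^/]+)/(?<user>[^/]+)".r

"http://nm-host:8042/node/containerlogs/container_1_0001_01_000001/alice" match {
  case nmLogUrl(host, container, user) =>
    println(s"http://log-server:19888/jobhistory/logs/$host/$container/$container/$user")
  case other =>
    println(s"not an NM log URL: $other")
}
{code}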
[jira] [Commented] (SPARK-26792) Apply custom log URL to Spark UI
[ https://issues.apache.org/jira/browse/SPARK-26792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758095#comment-16758095 ] Gera Shegalov commented on SPARK-26792: --- [~kabhwan] thanks for doing this work. I verified that I can configure SHS so it satisfies our use case. Changing the default in Spark is a nice-to-have but not a high priority from my perspective. > Apply custom log URL to Spark UI > > > Key: SPARK-26792 > URL: https://issues.apache.org/jira/browse/SPARK-26792 > Project: Spark > Issue Type: Improvement > Components: Web UI >Affects Versions: 3.0.0 >Reporter: Jungtaek Lim >Priority: Major > > SPARK-23155 enables SHS to set up custom log URLs for incompleted / completed > apps. > While getting reviews from SPARK-23155, I've got two comments which applying > custom log URLs to UI would help achieving it. Quoting these comments here: > https://github.com/apache/spark/pull/23260#issuecomment-456827963 > {quote} > Sorry I haven't had time to look through all the code so this might be a > separate jira, but one thing I thought of here is it would be really nice not > to have specifically stderr/stdout. users can specify any log4j.properties > and some tools like oozie by default end up using hadoop log4j rather then > spark log4j, so files aren't necessarily the same. Also users can put in > other logs files so it would be nice to have links to those from the UI. It > seems simpler if we just had a link to the directory and it read the files > within there. Other things in Hadoop do it this way, but I'm not sure if that > works well for other resource managers, any thoughts on that? As long as this > doesn't prevent the above I can file a separate jira for it. > {quote} > https://github.com/apache/spark/pull/23260#issuecomment-456904716 > {quote} > Hi Tom, +1: singling out stdout and stderr is definitely an annoyance. We > typically configure Spark jobs to write the GC log and dump heap on OOM > using , and/or we use the rolling file appender to deal with > large logs during debugging. So linking the YARN container log overview > page would make much more sense for us. We work it around with a custom > submit process that logs all important URLs on the submit side log. > {quote} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-25221) [DEPLOY] Consistent trailing whitespace treatment of conf values
Gera Shegalov created SPARK-25221: - Summary: [DEPLOY] Consistent trailing whitespace treatment of conf values Key: SPARK-25221 URL: https://issues.apache.org/jira/browse/SPARK-25221 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 2.3.1 Reporter: Gera Shegalov Using a custom line delimiter {{spark.hadoop.textinputformat.record.delimiter}} that has a leading or trailing whitespace character is only possible when it is specified via {{--conf}}. Our pipeline consists of highly customized, generated jobs. Storing all the config in a properties file is not only better for readability but even necessary to avoid dealing with {{ARGS_MAX}} on different OSes. Spark should uniformly avoid trimming conf values in both cases. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
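A minimal Scala illustration of the asymmetry (a sketch only; whether the whitespace survives depends on how the value was supplied):
{code:scala}
import org.apache.spark.SparkConf

// Setting the delimiter programmatically (or via --conf) keeps the trailing
// whitespace character intact; the ask is that a value loaded from a
// properties file is treated the same way instead of being trimmed.
val conf = new SparkConf()
  .set("spark.hadoop.textinputformat.record.delimiter", "|\n")

assert(conf.get("spark.hadoop.textinputformat.record.delimiter") == "|\n")
{code}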
[jira] [Updated] (SPARK-23956) Use effective RPC port in AM registration
[ https://issues.apache.org/jira/browse/SPARK-23956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated SPARK-23956: -- Priority: Minor (was: Major) > Use effective RPC port in AM registration > -- > > Key: SPARK-23956 > URL: https://issues.apache.org/jira/browse/SPARK-23956 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 2.3.0 >Reporter: Gera Shegalov >Priority: Minor > > AM's should use their real rpc port in the AM registration for better > diagnostics in Application Report. > {code} > 18/04/10 14:56:21 INFO Client: > client token: N/A > diagnostics: N/A > ApplicationMaster host: localhost > ApplicationMaster RPC port: 58338 > queue: default > start time: 1523397373659 > final status: UNDEFINED > tracking URL: http://localhost:8088/proxy/application_1523370127531_0016/ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23956) Use effective RPC port in AM registration
Gera Shegalov created SPARK-23956: - Summary: Use effective RPC port in AM registration Key: SPARK-23956 URL: https://issues.apache.org/jira/browse/SPARK-23956 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 2.3.0 Reporter: Gera Shegalov AM's should use their real rpc port in the AM registration for better diagnostics in Application Report. {code} 18/04/10 14:56:21 INFO Client: client token: N/A diagnostics: N/A ApplicationMaster host: localhost ApplicationMaster RPC port: 58338 queue: default start time: 1523397373659 final status: UNDEFINED tracking URL: http://localhost:8088/proxy/application_1523370127531_0016/ {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23386) Enable direct application links before replay
Gera Shegalov created SPARK-23386: - Summary: Enable direct application links before replay Key: SPARK-23386 URL: https://issues.apache.org/jira/browse/SPARK-23386 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 2.2.1 Reporter: Gera Shegalov In a deployment with tens of thousands of large event logs it may take *many hours* until all logs are replayed. Most of our users reach SHS by clicking on a link in a client log in case of an error. Direct links currently don't work until the event log has been processed by a replay thread. This Jira proposes linking the appId to its event log already during the scan, without a full replay. This makes on-demand retrieval accessible almost immediately upon SHS start. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23296) Diagnostics message for user code exceptions should include the stacktrace
Gera Shegalov created SPARK-23296: - Summary: Diagnostics message for user code exceptions should include the stacktrace Key: SPARK-23296 URL: https://issues.apache.org/jira/browse/SPARK-23296 Project: Spark Issue Type: Improvement Components: YARN Affects Versions: 2.2.1 Reporter: Gera Shegalov When a Spark job fails on a user exception, only {{Throwable#toString}} is included in the diagnostics. Including the full stack trace would make it reachable with fewer clicks, since the diagnostics appear on both the YARN web UI and in the client log. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-12963) In cluster mode,spark_local_ip will cause driver exception:Service 'Driver' failed after 16 retries!
[ https://issues.apache.org/jira/browse/SPARK-12963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated SPARK-12963: -- Shepherd: Sean Owen > In cluster mode,spark_local_ip will cause driver exception:Service 'Driver' > failed after 16 retries! > - > > Key: SPARK-12963 > URL: https://issues.apache.org/jira/browse/SPARK-12963 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.6.0 >Reporter: lichenglin >Priority: Critical > > I have 3 node cluster:namenode second and data1; > I use this shell to submit job on namenode: > bin/spark-submit --deploy-mode cluster --class com.bjdv.spark.job.Abc > --total-executor-cores 5 --master spark://namenode:6066 > hdfs://namenode:9000/sparkjars/spark.jar > The Driver may be started on the other node such as data1. > The problem is : > when I set SPARK_LOCAL_IP in conf/spark-env.sh on namenode > the driver will be started with this param such as > SPARK_LOCAL_IP=namenode > but the driver will start at data1, > the dirver will try to binding the ip 'namenode' on data1. > so driver will throw exception like this: > Service 'Driver' failed after 16 retries! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-12963) In cluster mode,spark_local_ip will cause driver exception:Service 'Driver' failed after 16 retries!
[ https://issues.apache.org/jira/browse/SPARK-12963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331815#comment-16331815 ] Gera Shegalov commented on SPARK-12963: --- We hit the same issue on nodes where a process is not allowed to listen on all NIC. An easy fix is to make sure that the Driver in ApplicationMaster inherits an explicitly configured public hostname of the NodeManager. > In cluster mode,spark_local_ip will cause driver exception:Service 'Driver' > failed after 16 retries! > - > > Key: SPARK-12963 > URL: https://issues.apache.org/jira/browse/SPARK-12963 > Project: Spark > Issue Type: Bug > Components: Deploy >Affects Versions: 1.6.0 >Reporter: lichenglin >Priority: Critical > > I have 3 node cluster:namenode second and data1; > I use this shell to submit job on namenode: > bin/spark-submit --deploy-mode cluster --class com.bjdv.spark.job.Abc > --total-executor-cores 5 --master spark://namenode:6066 > hdfs://namenode:9000/sparkjars/spark.jar > The Driver may be started on the other node such as data1. > The problem is : > when I set SPARK_LOCAL_IP in conf/spark-env.sh on namenode > the driver will be started with this param such as > SPARK_LOCAL_IP=namenode > but the driver will start at data1, > the dirver will try to binding the ip 'namenode' on data1. > so driver will throw exception like this: > Service 'Driver' failed after 16 retries! -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23155) YARN-aggregated executor/driver logs appear unavailable when NM is down
Gera Shegalov created SPARK-23155: - Summary: YARN-aggregated executor/driver logs appear unavailable when NM is down Key: SPARK-23155 URL: https://issues.apache.org/jira/browse/SPARK-23155 Project: Spark Issue Type: Improvement Components: Deploy Affects Versions: 2.2.1 Reporter: Gera Shegalov Unlike the MapReduce JobHistory Server, the Spark history server does not rewrite container log URLs to point to the aggregated yarn.log.server.url location and instead relies on the NodeManager web UI to trigger a redirect. This fails when the NM is down. Note that the NM may be down permanently after decommissioning in traditional environments, or in a cloud environment such as AWS EMR where worker nodes are either taken away by autoscaling or the whole cluster is used to run a single job and then torn down. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22914) Subbing for spark.history.ui.port does not resolve by default
Gera Shegalov created SPARK-22914: - Summary: Subbing for spark.history.ui.port does not resolve by default Key: SPARK-22914 URL: https://issues.apache.org/jira/browse/SPARK-22914 Project: Spark Issue Type: Bug Components: Deploy Affects Versions: 2.2.1 Reporter: Gera Shegalov In order not to hardcode the SHS web UI port and not to duplicate information that is already configured, we might be inclined to define {{spark.yarn.historyServer.address}} as {code}http://${hadoopconf-yarn.resourcemanager.hostname}:${spark.history.ui.port}{code} However, since spark.history.ui.port is not a registered config entry, its resolution fails when it is not explicitly set in the deployed Spark conf. -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22875) Assembly build fails for a high user id
[ https://issues.apache.org/jira/browse/SPARK-22875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated SPARK-22875: -- Flags: Patch > Assembly build fails for a high user id > --- > > Key: SPARK-22875 > URL: https://issues.apache.org/jira/browse/SPARK-22875 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.2.1 >Reporter: Gera Shegalov > > {code} > ./build/mvn package -Pbigtop-dist -DskipTests > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single (dist) on project > spark-assembly_2.11: Execution dist of goal > org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single failed: user id > '123456789' is too big ( > 2097151 ). -> [Help 1] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22875) Assembly build fails for a high user id
[ https://issues.apache.org/jira/browse/SPARK-22875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gera Shegalov updated SPARK-22875: -- Shepherd: Steve Loughran > Assembly build fails for a high user id > --- > > Key: SPARK-22875 > URL: https://issues.apache.org/jira/browse/SPARK-22875 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 2.2.1 >Reporter: Gera Shegalov > > {code} > ./build/mvn package -Pbigtop-dist -DskipTests > [ERROR] Failed to execute goal > org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single (dist) on project > spark-assembly_2.11: Execution dist of goal > org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single failed: user id > '123456789' is too big ( > 2097151 ). -> [Help 1] > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-22875) Assembly build fails for a high user id
Gera Shegalov created SPARK-22875: - Summary: Assembly build fails for a high user id Key: SPARK-22875 URL: https://issues.apache.org/jira/browse/SPARK-22875 Project: Spark Issue Type: Bug Components: Build Affects Versions: 2.2.1 Reporter: Gera Shegalov {code} ./build/mvn package -Pbigtop-dist -DskipTests [ERROR] Failed to execute goal org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single (dist) on project spark-assembly_2.11: Execution dist of goal org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single failed: user id '123456789' is too big ( > 2097151 ). -> [Help 1] {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
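The 2097151 limit in the error comes from the classic ustar tar header, which stores uid/gid as a 7-digit octal field; a quick check in plain Scala, unrelated to the build itself:
{code:scala}
// 7 octal digits is the largest id a classic ustar tar header can store.
val maxUstarId = Integer.parseInt("7777777", 8) // == 2097151, the limit in the error above
{code}
Archive formats with POSIX/PAX extensions can encode larger ids, which is the usual way around this limit.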
[jira] [Commented] (SPARK-2602) sbt/sbt test steals window focus on OS X
[ https://issues.apache.org/jira/browse/SPARK-2602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14068044#comment-14068044 ] Gera Shegalov commented on SPARK-2602: -- Take a look at the thread on HADOOP-10290 sbt/sbt test steals window focus on OS X Key: SPARK-2602 URL: https://issues.apache.org/jira/browse/SPARK-2602 Project: Spark Issue Type: Improvement Components: Build Reporter: Nicholas Chammas Priority: Minor On OS X, I run {{sbt/sbt test}} from Terminal and then go off and do something else with my computer. It appears that there are several things in the test suite that launch Java programs that, for some reason, steal window focus. It can get very annoying, especially if you happen to be typing something in a different window, to be suddenly teleported to a random Java application and have your finely crafted keystrokes be sent where they weren't intended. It would be nice if {{sbt/sbt test}} didn't do that. -- This message was sent by Atlassian JIRA (v6.2#6252)
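The HADOOP-10290 thread referenced above deals with the same focus stealing during tests; the usual remedy in such threads is to run the forked test JVMs headless so they never create AWT windows. A hedged sbt sketch of that idea (the exact keys and scoping in Spark's build may differ):
{code:scala}
// build.sbt fragment (assumes tests run in a forked JVM, where javaOptions apply):
// headless mode stops test JVMs from opening AWT windows that grab focus on OS X.
Test / fork := true
Test / javaOptions += "-Djava.awt.headless=true"
{code}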
[jira] [Created] (SPARK-2577) File upload to viewfs is broken due to mount point resolution
Gera Shegalov created SPARK-2577: Summary: File upload to viewfs is broken due to mount point resolution Key: SPARK-2577 URL: https://issues.apache.org/jira/browse/SPARK-2577 Project: Spark Issue Type: Bug Components: YARN Reporter: Gera Shegalov Priority: Blocker YARN client resolves paths of uploaded artifacts. When a viewfs path is resolved, the filesystem changes to the target file system. However, the original fs is passed to {{ClientDistributedCacheManager#addResource}}. {code} 14/07/18 01:30:31 INFO yarn.Client: Uploading file:/Users/gshegalov/workspace/spark-tw/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop3.0.0-SNAPSHOT.jar to viewfs:/user/gshegalov/.sparkStaging/application_1405479201490_0049/spark-assembly-1.1.0-SNAPSHOT-hadoop3.0.0-SNAPSHOT.jar Exception in thread main java.lang.IllegalArgumentException: Wrong FS: hdfs://ns1:8020/user/gshegalov/.sparkStaging/application_1405479201490_0049/spark-assembly-1.1.0-SNAPSHOT-hadoop3.0.0-SNAPSHOT.jar, expected: viewfs:/ at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643) at org.apache.hadoop.fs.viewfs.ViewFileSystem.getUriPath(ViewFileSystem.java:116) at org.apache.hadoop.fs.viewfs.ViewFileSystem.getFileStatus(ViewFileSystem.java:345) at org.apache.spark.deploy.yarn.ClientDistributedCacheManager.addResource(ClientDistributedCacheManager.scala:72) at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:236) at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:229) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:229) at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:37) at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:74) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:81) at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:136) at org.apache.spark.SparkContext.init(SparkContext.scala:320) at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:28) at org.apache.spark.examples.SparkPi.main(SparkPi.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} There are two options: # do not resolve path because symlinks are currently disabled in Hadoop # pass the correct filesystem object -- This message was sent by Atlassian JIRA (v6.2#6252)
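A minimal sketch of the second option, assuming a Hadoop Configuration is available at the call site (variable names are illustrative, the path is taken from the log above): derive the FileSystem from the already-resolved destination path instead of reusing the original viewfs handle.
{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val hadoopConf = new Configuration()
// The destination path after viewfs mount-point resolution, as seen in the exception.
val resolvedDest = new Path("hdfs://ns1:8020/user/gshegalov/.sparkStaging/application_1405479201490_0049/spark-assembly-1.1.0-SNAPSHOT-hadoop3.0.0-SNAPSHOT.jar")
// Option 2: ask the resolved path for its own filesystem so FileSystem.checkPath
// sees a matching scheme/authority (hdfs rather than the viewfs wrapper).
val destFs: FileSystem = resolvedDest.getFileSystem(hadoopConf)
{code}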
[jira] [Commented] (SPARK-2577) File upload to viewfs is broken due to mount point resolution
[ https://issues.apache.org/jira/browse/SPARK-2577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14066243#comment-14066243 ] Gera Shegalov commented on SPARK-2577: -- https://github.com/apache/spark/pull/1483 File upload to viewfs is broken due to mount point resolution - Key: SPARK-2577 URL: https://issues.apache.org/jira/browse/SPARK-2577 Project: Spark Issue Type: Bug Components: YARN Reporter: Gera Shegalov Priority: Blocker YARN client resolves paths of uploaded artifacts. When a viewfs path is resolved, the filesystem changes to the target file system. However, the original fs is passed to {{ClientDistributedCacheManager#addResource}}. {code} 14/07/18 01:30:31 INFO yarn.Client: Uploading file:/Users/gshegalov/workspace/spark-tw/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop3.0.0-SNAPSHOT.jar to viewfs:/user/gshegalov/.sparkStaging/application_1405479201490_0049/spark-assembly-1.1.0-SNAPSHOT-hadoop3.0.0-SNAPSHOT.jar Exception in thread main java.lang.IllegalArgumentException: Wrong FS: hdfs://ns1:8020/user/gshegalov/.sparkStaging/application_1405479201490_0049/spark-assembly-1.1.0-SNAPSHOT-hadoop3.0.0-SNAPSHOT.jar, expected: viewfs:/ at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643) at org.apache.hadoop.fs.viewfs.ViewFileSystem.getUriPath(ViewFileSystem.java:116) at org.apache.hadoop.fs.viewfs.ViewFileSystem.getFileStatus(ViewFileSystem.java:345) at org.apache.spark.deploy.yarn.ClientDistributedCacheManager.addResource(ClientDistributedCacheManager.scala:72) at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:236) at org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:229) at scala.collection.immutable.List.foreach(List.scala:318) at org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:229) at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:37) at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:74) at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:81) at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:136) at org.apache.spark.SparkContext.init(SparkContext.scala:320) at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:28) at org.apache.spark.examples.SparkPi.main(SparkPi.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) {code} There are two options: # do not resolve path because symlinks are currently disabled in Hadoop # pass the correct filesystem object -- This message was sent by Atlassian JIRA (v6.2#6252)