[jira] [Updated] (SPARK-44973) Fix ArrayIndexOutOfBoundsException in conv()

2023-11-21 Thread Gera Shegalov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated SPARK-44973:
--
Affects Version/s: 3.0.3

> Fix ArrayIndexOutOfBoundsException in conv()
> 
>
> Key: SPARK-44973
> URL: https://issues.apache.org/jira/browse/SPARK-44973
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.3, 3.3.3, 3.4.1, 3.5.0
>Reporter: Gera Shegalov
>Assignee: Mark Jarvin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.2, 4.0.0, 3.5.1, 3.3.4
>
>
> {code:scala}
> scala> sql(s"SELECT CONV('${Long.MinValue}', 10, -2)").show(false)
> java.lang.ArrayIndexOutOfBoundsException: -1
>   at 
> org.apache.spark.sql.catalyst.util.NumberConverter$.convert(NumberConverter.scala:183)
>   at 
> org.apache.spark.sql.catalyst.expressions.Conv.nullSafeEval(mathExpressions.scala:463)
>   at 
> org.apache.spark.sql.catalyst.expressions.TernaryExpression.eval(Expression.scala:821)
>   at 
> org.apache.spark.sql.catalyst.expressions.ToPrettyString.eval(ToPrettyString.scala:57)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConstantFolding$.org$apache$spark$sql$catalyst$optimizer$ConstantFolding$$constantFolding(expressions.scala:81)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConstantFolding$.$anonfun$constantFolding$4(expressions.scala:91)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-44973) CONV('-9223372036854775808', 10, -2) throws ArrayIndexOutOfBoundsException

2023-11-20 Thread Gera Shegalov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-44973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17788157#comment-17788157
 ] 

Gera Shegalov commented on SPARK-44973:
---

3.0.3 is the oldest version on my box and it exhibits the same bug:

 
{code:java}
scala> spark.version
res1: String = 3.0.3
scala> sql(s"SELECT CONV('${Long.MinValue}', 10, -2)").show(false)
java.lang.ArrayIndexOutOfBoundsException: -1
  at 
org.apache.spark.sql.catalyst.util.NumberConverter$.convert(NumberConverter.scala:148)
  at 
org.apache.spark.sql.catalyst.expressions.Conv.nullSafeEval(mathExpressions.scala:338)
  at 
org.apache.spark.sql.catalyst.expressions.TernaryExpression.eval(Expression.scala:690)
  at 
org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:457)
{code}
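For context, a minimal standalone sketch of the arithmetic that can drive an index negative here. This is an assumption about the failure mode (negating Long.MinValue wraps around in two's complement), not a confirmed walk-through of NumberConverter.convert:

{code:scala}
// In two's complement, -Long.MinValue overflows back to Long.MinValue,
// so the "absolute value" of Long.MinValue is still negative.
val v = Long.MinValue
println(-v == Long.MinValue)   // true
println(math.abs(v) < 0)       // true
// Any digit count derived from such a value can come out as 0,
// turning an index such as `digits - 1` into -1.
val digits = if (-v > 0) java.lang.Long.toString(-v, 2).length else 0
println(digits - 1)            // -1
{code}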
 

> CONV('-9223372036854775808', 10, -2) throws ArrayIndexOutOfBoundsException
> --
>
> Key: SPARK-44973
> URL: https://issues.apache.org/jira/browse/SPARK-44973
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.3.3, 3.4.1, 3.5.0
>Reporter: Gera Shegalov
>Priority: Major
>  Labels: pull-request-available
>
> {code:scala}
> scala> sql(s"SELECT CONV('${Long.MinValue}', 10, -2)").show(false)
> java.lang.ArrayIndexOutOfBoundsException: -1
>   at 
> org.apache.spark.sql.catalyst.util.NumberConverter$.convert(NumberConverter.scala:183)
>   at 
> org.apache.spark.sql.catalyst.expressions.Conv.nullSafeEval(mathExpressions.scala:463)
>   at 
> org.apache.spark.sql.catalyst.expressions.TernaryExpression.eval(Expression.scala:821)
>   at 
> org.apache.spark.sql.catalyst.expressions.ToPrettyString.eval(ToPrettyString.scala:57)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConstantFolding$.org$apache$spark$sql$catalyst$optimizer$ConstantFolding$$constantFolding(expressions.scala:81)
>   at 
> org.apache.spark.sql.catalyst.optimizer.ConstantFolding$.$anonfun$constantFolding$4(expressions.scala:91)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20075) Support classifier, packaging in Maven coordinates

2023-10-31 Thread Gera Shegalov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-20075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17781550#comment-17781550
 ] 

Gera Shegalov commented on SPARK-20075:
---

This would be a great feature: it would help spark-rapids plugin users who 
require a non-default classifier such as cuda12.
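For illustration, a minimal sketch (not Spark's actual {{--packages}} parser) of splitting a coordinate in the {{group:artifact[:packaging[:classifier]]:version}} form described in this issue; the sample coordinate below is hypothetical:

{code:scala}
// Hypothetical helper: split a Maven coordinate into its five dimensions.
def splitCoordinate(coord: String): (String, String, Option[String], Option[String], String) =
  coord.split(":") match {
    case Array(g, a, v)       => (g, a, None, None, v)
    case Array(g, a, p, v)    => (g, a, Some(p), None, v)
    case Array(g, a, p, c, v) => (g, a, Some(p), Some(c), v)
    case _ => throw new IllegalArgumentException(s"Bad Maven coordinate: $coord")
  }

// e.g. a cuda12-classified artifact (group, artifact and version are made up):
println(splitCoordinate("com.example:some-plugin_2.12:jar:cuda12:1.0.0"))
{code}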

> Support classifier, packaging in Maven coordinates
> --
>
> Key: SPARK-20075
> URL: https://issues.apache.org/jira/browse/SPARK-20075
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Shell, Spark Submit
>Affects Versions: 2.1.0
>Reporter: Sean R. Owen
>Priority: Minor
>  Labels: bulk-closed
>
> Currently, it's possible to add dependencies to an app using its Maven 
> coordinates on the command line: {{group:artifact:version}}. However, really 
> Maven coordinates are 5-dimensional: 
> {{group:artifact:packaging:classifier:version}}. In some rare but real cases 
> it's important to be able to specify the classifier. And while we're at it 
> why not try to support packaging?
> I have a WIP PR that I'll post soon.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-43389) spark.read.csv throws NullPointerException when lineSep is set to None

2023-10-03 Thread Gera Shegalov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-43389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17771698#comment-17771698
 ] 

Gera Shegalov commented on SPARK-43389:
---

There is a symmetrical issue on the DataFrameWriter side:
{code:python}
>>> spark.createDataFrame([('some value',),]).write.option('someOpt', 
>>> None).saveAsTable("hive_csv_t21")
{code}
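The write then fails while persisting the table to the Hive metastore (stack trace below), because the metastore rejects null values in its persistent maps. A minimal sketch of the defensive handling one might expect, namely dropping null-valued options before they reach the metastore; this is an assumption about a possible fix, not Spark's actual behaviour:

{code:scala}
// Sketch only: filter out options whose value is null before persisting them,
// since the Hive metastore does not allow null values in persistent maps.
def sanitizeStorageProperties(options: Map[String, String]): Map[String, String] =
  options.filter { case (_, value) => value != null }
{code}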
 
{code:java}
23/10/03 21:39:12 WARN HiveExternalCatalog: Could not persist 
`spark_catalog`.`default`.`hive_csv_t21` in a Hive compatible way. Persisting 
it into Hive metastore in Spark SQL specific format.
org.apache.hadoop.hive.ql.metadata.HiveException: 
MetaException(message:java.lang.NullPointerException: Null values not allowed 
in persistent maps.)
    at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:869)
    at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:874)
    at 
org.apache.spark.sql.hive.client.Shim_v0_12.createTable(HiveShim.scala:614)
    at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$createTable$1(HiveClientImpl.scala:573)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at 
org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:303)
    at 
org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:234)
    at 
org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:233)
    at 
org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:283)
    at 
org.apache.spark.sql.hive.client.HiveClientImpl.createTable(HiveClientImpl.scala:571)
    at 
org.apache.spark.sql.hive.HiveExternalCatalog.saveTableIntoHive(HiveExternalCatalog.scala:526)
    at 
org.apache.spark.sql.hive.HiveExternalCatalog.createDataSourceTable(HiveExternalCatalog.scala:415)
    at 
org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$createTable$1(HiveExternalCatalog.scala:274)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
    at 
org.apache.spark.sql.hive.HiveExternalCatalog.createTable(HiveExternalCatalog.scala:245)
    at 
org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.createTable(ExternalCatalogWithListener.scala:94)
    at 
org.apache.spark.sql.catalyst.catalog.SessionCatalog.createTable(SessionCatalog.scala:402)
    at 
org.apache.spark.sql.rapids.shims.GpuCreateDataSourceTableAsSelectCommand.run(GpuCreateDataSourceTableAsSelectCommandShims.scala:91)
    at 
com.nvidia.spark.rapids.GpuExecutedCommandExec.sideEffectResult$lzycompute(GpuExecutedCommandExec.scala:52)
    at 
com.nvidia.spark.rapids.GpuExecutedCommandExec.sideEffectResult(GpuExecutedCommandExec.scala:50)
    at 
com.nvidia.spark.rapids.GpuExecutedCommandExec.executeCollect(GpuExecutedCommandExec.scala:61)
    at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:107)
    at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125)
    at 
org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201)
    at 
org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900)
    at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66)
    at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:107)
    at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98)
    at 
org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:461)
    at 
org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76)
    at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:461)
    at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32)
    at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
    at 
org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
    at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
    at 
org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32)
    at 
org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:437)
    at 
org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:98)
    at 

[jira] [Created] (SPARK-44973) CONV('-9223372036854775808', 10, -2) throws ArrayIndexOutOfBoundsException

2023-08-25 Thread Gera Shegalov (Jira)
Gera Shegalov created SPARK-44973:
-

 Summary: CONV('-9223372036854775808', 10, -2) throws 
ArrayIndexOutOfBoundsException
 Key: SPARK-44973
 URL: https://issues.apache.org/jira/browse/SPARK-44973
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.1, 3.5.0
Reporter: Gera Shegalov



{code:scala}
scala> sql(s"SELECT CONV('${Long.MinValue}', 10, -2)").show(false)
java.lang.ArrayIndexOutOfBoundsException: -1
  at 
org.apache.spark.sql.catalyst.util.NumberConverter$.convert(NumberConverter.scala:183)
  at 
org.apache.spark.sql.catalyst.expressions.Conv.nullSafeEval(mathExpressions.scala:463)
  at 
org.apache.spark.sql.catalyst.expressions.TernaryExpression.eval(Expression.scala:821)
  at 
org.apache.spark.sql.catalyst.expressions.ToPrettyString.eval(ToPrettyString.scala:57)
  at 
org.apache.spark.sql.catalyst.optimizer.ConstantFolding$.org$apache$spark$sql$catalyst$optimizer$ConstantFolding$$constantFolding(expressions.scala:81)
  at 
org.apache.spark.sql.catalyst.optimizer.ConstantFolding$.$anonfun$constantFolding$4(expressions.scala:91)
{code}




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44943) CONV produces incorrect result near Long.MIN_VALUE, fails to detect overflow

2023-08-24 Thread Gera Shegalov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated SPARK-44943:
--
Description: 
Signed conversion does not detect overflow 
{code:java}
>>> spark.conf.set('spark.sql.ansi.enabled', True)
>>> sql("SELECT conv('-9223372036854775809', 10, -10)").show(truncate=False)
+---+
|conv(-9223372036854775809, 10, -10)|
+---+
|-9223372036854775807               |
+---+
{code}

Unsigned conversion produces 18446744073709551615 (i.e., -1 as an unsigned 64-bit value) but does not throw in ANSI mode
{code}
>>> sql("SELECT conv('-9223372036854775809', 10, 10)").show(truncate=False)
+--+
|conv(-9223372036854775809, 10, 10)|
+--+
|18446744073709551615  |
+--+
{code}
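A minimal sketch of the overflow check one would expect under ANSI semantics; this is an assumption about the desired behaviour, not the current NumberConverter implementation:

{code:scala}
// The input lies outside the 64-bit signed range, so under ANSI mode conv()
// should arguably raise an overflow error instead of returning a wrapped value.
val input = BigInt("-9223372036854775809")
val overflows = input < BigInt(Long.MinValue) || input > BigInt(Long.MaxValue)
println(overflows) // true
{code}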


  was:
{{>>> spark.conf.set('spark.sql.ansi.enabled', True)}}
{{>>> sql("SELECT conv('-9223372036854775809', 10, -10)").show(truncate=False)}}
{{+---+}}
{{|conv(-9223372036854775809, 10, -10)|}}
{{+---+}}
{{|-9223372036854775807               |}}
{{+---+}}


> CONV produces incorrect result near Long.MIN_VALUE, fails to detect overflow
> 
>
> Key: SPARK-44943
> URL: https://issues.apache.org/jira/browse/SPARK-44943
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0, 4.0.0
>Reporter: Gera Shegalov
>Priority: Major
>
> Signed conversion does not detect overflow 
> {code:java}
> >>> spark.conf.set('spark.sql.ansi.enabled', True)
> >>> sql("SELECT conv('-9223372036854775809', 10, -10)").show(truncate=False)
> +---+
> |conv(-9223372036854775809, 10, -10)|
> +---+
> |-9223372036854775807               |
> +---+
> {code}
> Unsigned conversion produces 18446744073709551615 (i.e., -1 as an unsigned 64-bit value) but does not throw in ANSI mode
> {code}
> >>> sql("SELECT conv('-9223372036854775809', 10, 10)").show(truncate=False)
> +--+
> |conv(-9223372036854775809, 10, 10)|
> +--+
> |18446744073709551615  |
> +--+
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44943) CONV produces incorrect result near Long.MIN_VALUE, fails to detect overflow

2023-08-24 Thread Gera Shegalov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated SPARK-44943:
--
Affects Version/s: 3.4.1

> CONV produces incorrect result near Long.MIN_VALUE, fails to detect overflow
> 
>
> Key: SPARK-44943
> URL: https://issues.apache.org/jira/browse/SPARK-44943
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.1, 3.5.0, 4.0.0
>Reporter: Gera Shegalov
>Priority: Major
>
> {{>>> spark.conf.set('spark.sql.ansi.enabled', True)}}
> {{>>> sql("SELECT conv('-9223372036854775809', 10, 
> -10)").show(truncate=False)}}
> {{+---+}}
> {{|conv(-9223372036854775809, 10, -10)|}}
> {{+---+}}
> {{|-9223372036854775807               |}}
> {{+---+}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44943) CONV produces incorrect result near Long.MIN_VALUE, fails to detect overflow

2023-08-24 Thread Gera Shegalov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated SPARK-44943:
--
Affects Version/s: 3.5.0

> CONV produces incorrect result near Long.MIN_VALUE, fails to detect overflow
> 
>
> Key: SPARK-44943
> URL: https://issues.apache.org/jira/browse/SPARK-44943
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.5.0, 4.0.0
>Reporter: Gera Shegalov
>Priority: Major
>
> {{>>> spark.conf.set('spark.sql.ansi.enabled', True)}}
> {{>>> sql("SELECT conv('-9223372036854775809', 10, 
> -10)").show(truncate=False)}}
> {{+---+}}
> {{|conv(-9223372036854775809, 10, -10)|}}
> {{+---+}}
> {{|-9223372036854775807               |}}
> {{+---+}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-44943) CONV produces incorrect result near Long.MIN_VALUE, fails to detect overflow

2023-08-24 Thread Gera Shegalov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-44943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated SPARK-44943:
--
Affects Version/s: 4.0.0
   (was: 3.4.1)

> CONV produces incorrect result near Long.MIN_VALUE, fails to detect overflow
> 
>
> Key: SPARK-44943
> URL: https://issues.apache.org/jira/browse/SPARK-44943
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 4.0.0
>Reporter: Gera Shegalov
>Priority: Major
>
> {{>>> spark.conf.set('spark.sql.ansi.enabled', True)}}
> {{>>> sql("SELECT conv('-9223372036854775809', 10, 
> -10)").show(truncate=False)}}
> {{+---+}}
> {{|conv(-9223372036854775809, 10, -10)|}}
> {{+---+}}
> {{|-9223372036854775807               |}}
> {{+---+}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-44943) CONV produces incorrect result near Long.MIN_VALUE, fails to detect overflow

2023-08-24 Thread Gera Shegalov (Jira)
Gera Shegalov created SPARK-44943:
-

 Summary: CONV produces incorrect result near Long.MIN_VALUE, fails 
to detect overflow
 Key: SPARK-44943
 URL: https://issues.apache.org/jira/browse/SPARK-44943
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.1
Reporter: Gera Shegalov


{{>>> spark.conf.set('spark.sql.ansi.enabled', True)}}
{{>>> sql("SELECT conv('-9223372036854775809', 10, -10)").show(truncate=False)}}
{{+---+}}
{{|conv(-9223372036854775809, 10, -10)|}}
{{+---+}}
{{|-9223372036854775807               |}}
{{+---+}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-42752) Unprintable IllegalArgumentException with Hive catalog enabled in "Hadoop Free" distribution

2023-03-10 Thread Gera Shegalov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-42752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated SPARK-42752:
--
Description: 
Reproduction steps:
1. download a standard "Hadoop Free" build
2. Start pyspark REPL with Hive support
{code:java}
SPARK_DIST_CLASSPATH=$(~/dist/hadoop-3.4.0-SNAPSHOT/bin/hadoop classpath) 
~/dist/spark-3.2.3-bin-without-hadoop/bin/pyspark --conf 
spark.sql.catalogImplementation=hive
{code}
3. Execute any simple dataframe operation
{code:java}
>>> spark.range(100).show()
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py", 
line 416, in range
jdf = self._jsparkSession.range(0, int(start), int(step), 
int(numPartitions))
  File 
"/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py",
 line 1321, in __call__
  File 
"/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/utils.py", 
line 117, in deco
raise converted from None
pyspark.sql.utils.IllegalArgumentException: 
{code}
4. In fact you can just call spark.conf to trigger this issue
{code:java}
>>> spark.conf
Traceback (most recent call last):
  File "", line 1, in 
...
{code}

There are probably two issues here:
1) Hive support should be gracefully disabled if the dependency is not on the 
classpath, as claimed by 
https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html (a sketch of this follows below)
2) at the very least, the user should be able to see the exception so they can 
understand the issue and take action
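A minimal sketch of the graceful fallback described in 1); the class probed for and the fallback logic are assumptions for illustration, not Spark's actual catalog selection code:

{code:scala}
// Probe for Hive support on the classpath and fall back to the in-memory
// catalog if it is missing, instead of failing with an opaque exception.
def effectiveCatalogImplementation(requested: String): String = {
  val hiveOnClasspath =
    try { Class.forName("org.apache.spark.sql.hive.HiveSessionStateBuilder"); true }
    catch { case _: ClassNotFoundException => false }
  if (requested == "hive" && !hiveOnClasspath) "in-memory" else requested
}
{code}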

 

  was:
Reproduction steps:
1. download a standard "Hadoop Free" build
2. Start pyspark REPL with Hive support
{code:java}
SPARK_DIST_CLASSPATH=$(~/dist/hadoop-3.4.0-SNAPSHOT/bin/hadoop classpath) 
~/dist/spark-3.2.3-bin-without-hadoop/bin/pyspark --conf 
spark.sql.catalogImplementation=hive
{code}
3. Execute any simple dataframe operation
{code:java}
>>> spark.range(100).show()
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py", 
line 416, in range
jdf = self._jsparkSession.range(0, int(start), int(step), 
int(numPartitions))
  File 
"/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py",
 line 1321, in __call__
  File 
"/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/utils.py", 
line 117, in deco
raise converted from None
pyspark.sql.utils.IllegalArgumentException: 
>>> spark.conf
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py", 
line 347, in conf
self._conf = RuntimeConfig(self._jsparkSession.conf())
  File 
"/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py",
 line 1321, in __call__
  File 
"/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/utils.py", 
line 117, in deco
raise converted from None
pyspark.sql.utils.IllegalArgumentException: 
{code}
4. In fact you can just call spark.conf to trigger this issue
{code:java}
>>> spark.conf
Traceback (most recent call last):
  File "", line 1, in 
...
{code}

There are probably two issues here:
1) that Hive support should be gracefully disabled if it the dependency not on 
the classpath as claimed by 
https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html
2) but at the very least the user should be able to see the exception to 
understand the issue, and take an action

 


> Unprintable IllegalArgumentException with Hive catalog enabled in "Hadoop 
> Free" distribution
> ---
>
> Key: SPARK-42752
> URL: https://issues.apache.org/jira/browse/SPARK-42752
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 3.1.3, 3.2.4, 3.3.3, 3.4.1, 3.5.0
> Environment: local
>Reporter: Gera Shegalov
>Priority: Major
>
> Reproduction steps:
> 1. download a standard "Hadoop Free" build
> 2. Start pyspark REPL with Hive support
> {code:java}
> SPARK_DIST_CLASSPATH=$(~/dist/hadoop-3.4.0-SNAPSHOT/bin/hadoop classpath) 
> ~/dist/spark-3.2.3-bin-without-hadoop/bin/pyspark --conf 
> spark.sql.catalogImplementation=hive
> {code}
> 3. Execute any simple dataframe operation
> {code:java}
> >>> spark.range(100).show()
> Traceback (most recent call last):
>   File "", line 1, in 
>   File 
> "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py",
>  line 416, in range
> jdf = self._jsparkSession.range(0, int(start), int(step), 
> int(numPartitions))
>   File 
> "/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py",
>  line 1321, 

[jira] [Created] (SPARK-42752) Unprintable IllegalArgumentException with Hive catalog enabled in "Hadoop Free" distribution

2023-03-10 Thread Gera Shegalov (Jira)
Gera Shegalov created SPARK-42752:
-

 Summary: Unprintable IllegalArgumentException with Hive catalog 
enabled in "Hadoop Free" distribution
 Key: SPARK-42752
 URL: https://issues.apache.org/jira/browse/SPARK-42752
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 3.1.3, 3.2.4, 3.3.3, 3.4.1, 3.5.0
 Environment: local
Reporter: Gera Shegalov


Reproduction steps:
1. download a standard "Hadoop Free" build
2. Start pyspark REPL with Hive support
{code:java}
SPARK_DIST_CLASSPATH=$(~/dist/hadoop-3.4.0-SNAPSHOT/bin/hadoop classpath) 
~/dist/spark-3.2.3-bin-without-hadoop/bin/pyspark --conf 
spark.sql.catalogImplementation=hive
{code}
3. Execute any simple dataframe operation
{code:java}
>>> spark.range(100).show()
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py", 
line 416, in range
jdf = self._jsparkSession.range(0, int(start), int(step), 
int(numPartitions))
  File 
"/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py",
 line 1321, in __call__
  File 
"/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/utils.py", 
line 117, in deco
raise converted from None
pyspark.sql.utils.IllegalArgumentException: 
>>> spark.conf
Traceback (most recent call last):
  File "", line 1, in 
  File 
"/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/session.py", 
line 347, in conf
self._conf = RuntimeConfig(self._jsparkSession.conf())
  File 
"/home/user/dist/spark-3.2.3-bin-without-hadoop/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py",
 line 1321, in __call__
  File 
"/home/user/dist/spark-3.2.3-bin-without-hadoop/python/pyspark/sql/utils.py", 
line 117, in deco
raise converted from None
pyspark.sql.utils.IllegalArgumentException: 
{code}
4. In fact you can just call spark.conf to trigger this issue
{code:java}
>>> spark.conf
Traceback (most recent call last):
  File "", line 1, in 
...
{code}

There are probably two issues here:
1) Hive support should be gracefully disabled if the dependency is not on the 
classpath, as claimed by 
https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html
2) at the very least, the user should be able to see the exception so they can 
understand the issue and take action

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals

2023-02-09 Thread Gera Shegalov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov reopened SPARK-41793:
---

> Incorrect result for window frames defined by a range clause on large 
> decimals 
> ---
>
> Key: SPARK-41793
> URL: https://issues.apache.org/jira/browse/SPARK-41793
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gera Shegalov
>Priority: Blocker
>  Labels: correctness
>
> Context 
> https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686
> The following windowing query on a simple two-row input should produce two 
> non-empty windows as a result
> {code}
> from pprint import pprint
> data = [
>   ('9223372036854775807', '11342371013783243717493546650944543.47'),
>   ('9223372036854775807', '.99')
> ]
> df1 = spark.createDataFrame(data, 'a STRING, b STRING')
> df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)'))
> df2.createOrReplaceTempView('test_table')
> df = sql('''
>   SELECT 
> COUNT(1) OVER (
>   PARTITION BY a 
>   ORDER BY b ASC 
>   RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING
> ) AS CNT_1 
>   FROM 
> test_table
>   ''')
> res = df.collect()
> df.explain(True)
> pprint(res)
> {code}
> Spark 3.4.0-SNAPSHOT output:
> {code}
> [Row(CNT_1=1), Row(CNT_1=0)]
> {code}
> Spark 3.3.1 output as expected:
> {code}
> Row(CNT_1=1), Row(CNT_1=1)]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals

2023-02-09 Thread Gera Shegalov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685384#comment-17685384
 ] 

Gera Shegalov edited comment on SPARK-41793 at 2/9/23 6:26 PM:
---

Another interpretation of why the pre-3.4 count of 1 may actually be correct is that, 
regardless of whether the window frame bound values overflow, the current row is 
always part of the window frame it defines. Whether or not that should be the case 
can be clarified in the docs.

UPDATE: consider the range from the query plan above, {{(RangeFrame, -10.23, 6.79)}}:
{code}
spark-sql> select b - 10.23 as lower, b, b + 6.79 as upper from test_table;
11342371013783243717493546650944533.24  11342371013783243717493546650944543.47  
11342371013783243717493546650944550.26
9989.76 .99 
NULL
{code}

as consisting of the union of
[lower; b)
b
(b; upper]

only (b; upper] is undefined; [lower; b) and b are defined and contain at least one row.

Either way, not counting the current row is a counterintuitive result.
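A minimal sketch of that argument, using the offsets from the query above; this only illustrates range-frame membership for non-negative offsets, not Spark's evaluation code:

{code:scala}
// For non-negative PRECEDING/FOLLOWING offsets, a row's own ordering value
// always satisfies lower <= b <= upper, so the current row should be counted
// in its own frame regardless of how the bounds are represented.
def inOwnFrame(b: BigDecimal, preceding: BigDecimal, following: BigDecimal): Boolean =
  (b - preceding) <= b && b <= (b + following)

println(inOwnFrame(BigDecimal("11342371013783243717493546650944543.47"),
                   BigDecimal("10.2345"), BigDecimal("6.7890"))) // true
{code}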



was (Author: jira.shegalov):
Another interpretation of why the pre-3.4 count of 1 may be actually correct 
could be that regardless of whether the window frame bound values overflow or 
not  the current row is always part of the window it defines. Whether or not it 
should be the case can be clarified in the doc.

> Incorrect result for window frames defined by a range clause on large 
> decimals 
> ---
>
> Key: SPARK-41793
> URL: https://issues.apache.org/jira/browse/SPARK-41793
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gera Shegalov
>Priority: Blocker
>  Labels: correctness
>
> Context 
> https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686
> The following windowing query on a simple two-row input should produce two 
> non-empty windows as a result
> {code}
> from pprint import pprint
> data = [
>   ('9223372036854775807', '11342371013783243717493546650944543.47'),
>   ('9223372036854775807', '.99')
> ]
> df1 = spark.createDataFrame(data, 'a STRING, b STRING')
> df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)'))
> df2.createOrReplaceTempView('test_table')
> df = sql('''
>   SELECT 
> COUNT(1) OVER (
>   PARTITION BY a 
>   ORDER BY b ASC 
>   RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING
> ) AS CNT_1 
>   FROM 
> test_table
>   ''')
> res = df.collect()
> df.explain(True)
> pprint(res)
> {code}
> Spark 3.4.0-SNAPSHOT output:
> {code}
> [Row(CNT_1=1), Row(CNT_1=0)]
> {code}
> Spark 3.3.1 output as expected:
> {code}
> Row(CNT_1=1), Row(CNT_1=1)]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals

2023-02-07 Thread Gera Shegalov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17685384#comment-17685384
 ] 

Gera Shegalov commented on SPARK-41793:
---

Another interpretation of why the pre-3.4 count of 1 may actually be correct is that, 
regardless of whether the window frame bound values overflow, the current row is 
always part of the window frame it defines. Whether or not that should be the case 
can be clarified in the docs.

> Incorrect result for window frames defined by a range clause on large 
> decimals 
> ---
>
> Key: SPARK-41793
> URL: https://issues.apache.org/jira/browse/SPARK-41793
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gera Shegalov
>Priority: Blocker
>  Labels: correctness
>
> Context 
> https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686
> The following windowing query on a simple two-row input should produce two 
> non-empty windows as a result
> {code}
> from pprint import pprint
> data = [
>   ('9223372036854775807', '11342371013783243717493546650944543.47'),
>   ('9223372036854775807', '.99')
> ]
> df1 = spark.createDataFrame(data, 'a STRING, b STRING')
> df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)'))
> df2.createOrReplaceTempView('test_table')
> df = sql('''
>   SELECT 
> COUNT(1) OVER (
>   PARTITION BY a 
>   ORDER BY b ASC 
>   RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING
> ) AS CNT_1 
>   FROM 
> test_table
>   ''')
> res = df.collect()
> df.explain(True)
> pprint(res)
> {code}
> Spark 3.4.0-SNAPSHOT output:
> {code}
> [Row(CNT_1=1), Row(CNT_1=0)]
> {code}
> Spark 3.3.1 output as expected:
> {code}
> Row(CNT_1=1), Row(CNT_1=1)]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals

2023-02-06 Thread Gera Shegalov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684903#comment-17684903
 ] 

Gera Shegalov edited comment on SPARK-41793 at 2/6/23 7:38 PM:
---

If the consensus is that it's not a correctness bug in 3.4, then this fix 
should probably be documented and backported to maintenance branches?


was (Author: jira.shegalov):
if the consensus is that it's not a correctness bug in 3.4,  then this fix 
should probably be documented and probably backported to maintenance branches?

> Incorrect result for window frames defined by a range clause on large 
> decimals 
> ---
>
> Key: SPARK-41793
> URL: https://issues.apache.org/jira/browse/SPARK-41793
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gera Shegalov
>Priority: Blocker
>  Labels: correctness
>
> Context 
> https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686
> The following windowing query on a simple two-row input should produce two 
> non-empty windows as a result
> {code}
> from pprint import pprint
> data = [
>   ('9223372036854775807', '11342371013783243717493546650944543.47'),
>   ('9223372036854775807', '.99')
> ]
> df1 = spark.createDataFrame(data, 'a STRING, b STRING')
> df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)'))
> df2.createOrReplaceTempView('test_table')
> df = sql('''
>   SELECT 
> COUNT(1) OVER (
>   PARTITION BY a 
>   ORDER BY b ASC 
>   RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING
> ) AS CNT_1 
>   FROM 
> test_table
>   ''')
> res = df.collect()
> df.explain(True)
> pprint(res)
> {code}
> Spark 3.4.0-SNAPSHOT output:
> {code}
> [Row(CNT_1=1), Row(CNT_1=0)]
> {code}
> Spark 3.3.1 output as expected:
> {code}
> Row(CNT_1=1), Row(CNT_1=1)]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals

2023-02-06 Thread Gera Shegalov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17684903#comment-17684903
 ] 

Gera Shegalov commented on SPARK-41793:
---

if the consensus is that it's not a correctness bug in 3.4,  then this fix 
should probably be documented and probably backported to maintenance branches?

> Incorrect result for window frames defined by a range clause on large 
> decimals 
> ---
>
> Key: SPARK-41793
> URL: https://issues.apache.org/jira/browse/SPARK-41793
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gera Shegalov
>Priority: Blocker
>  Labels: correctness
>
> Context 
> https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686
> The following windowing query on a simple two-row input should produce two 
> non-empty windows as a result
> {code}
> from pprint import pprint
> data = [
>   ('9223372036854775807', '11342371013783243717493546650944543.47'),
>   ('9223372036854775807', '.99')
> ]
> df1 = spark.createDataFrame(data, 'a STRING, b STRING')
> df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)'))
> df2.createOrReplaceTempView('test_table')
> df = sql('''
>   SELECT 
> COUNT(1) OVER (
>   PARTITION BY a 
>   ORDER BY b ASC 
>   RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING
> ) AS CNT_1 
>   FROM 
> test_table
>   ''')
> res = df.collect()
> df.explain(True)
> pprint(res)
> {code}
> Spark 3.4.0-SNAPSHOT output:
> {code}
> [Row(CNT_1=1), Row(CNT_1=0)]
> {code}
> Spark 3.3.1 output as expected:
> {code}
> Row(CNT_1=1), Row(CNT_1=1)]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals

2022-12-30 Thread Gera Shegalov (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17653238#comment-17653238
 ] 

Gera Shegalov commented on SPARK-41793:
---

Similarly, in SQLite:
{code}
.header on

create table test_table(a long, b decimal(38,2));
insert into test_table 
values
  ('9223372036854775807', '11342371013783243717493546650944543.47'),
  ('9223372036854775807', '.99');

select * from test_table;

select 
  count(1) over(
partition by a 
order by b asc
range between 10.2345 preceding and 6.7890 following) as cnt_1 
  from 
test_table;
{code}

yields

{code}
a|b
9223372036854775807|1.13423710137832e+34
9223372036854775807|1.0e+36
cnt_1
1
1
{code}

> Incorrect result for window frames defined by a range clause on large 
> decimals 
> ---
>
> Key: SPARK-41793
> URL: https://issues.apache.org/jira/browse/SPARK-41793
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gera Shegalov
>Priority: Major
>
> Context 
> https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686
> The following windowing query on a simple two-row input should produce two 
> non-empty windows as a result
> {code}
> from pprint import pprint
> data = [
>   ('9223372036854775807', '11342371013783243717493546650944543.47'),
>   ('9223372036854775807', '.99')
> ]
> df1 = spark.createDataFrame(data, 'a STRING, b STRING')
> df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)'))
> df2.createOrReplaceTempView('test_table')
> df = sql('''
>   SELECT 
> COUNT(1) OVER (
>   PARTITION BY a 
>   ORDER BY b ASC 
>   RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING
> ) AS CNT_1 
>   FROM 
> test_table
>   ''')
> res = df.collect()
> df.explain(True)
> pprint(res)
> {code}
> Spark 3.4.0-SNAPSHOT output:
> {code}
> [Row(CNT_1=1), Row(CNT_1=0)]
> {code}
> Spark 3.3.1 output as expected:
> {code}
> Row(CNT_1=1), Row(CNT_1=1)]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-41793) Incorrect result for window frames defined by a range clause on large decimals

2022-12-30 Thread Gera Shegalov (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-41793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated SPARK-41793:
--
Summary: Incorrect result for window frames defined by a range clause on 
large decimals   (was: Incorrect result for window frames defined as ranges on 
large decimals )

> Incorrect result for window frames defined by a range clause on large 
> decimals 
> ---
>
> Key: SPARK-41793
> URL: https://issues.apache.org/jira/browse/SPARK-41793
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: Gera Shegalov
>Priority: Major
>
> Context 
> https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686
> The following windowing query on a simple two-row input should produce two 
> non-empty windows as a result
> {code}
> from pprint import pprint
> data = [
>   ('9223372036854775807', '11342371013783243717493546650944543.47'),
>   ('9223372036854775807', '.99')
> ]
> df1 = spark.createDataFrame(data, 'a STRING, b STRING')
> df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)'))
> df2.createOrReplaceTempView('test_table')
> df = sql('''
>   SELECT 
> COUNT(1) OVER (
>   PARTITION BY a 
>   ORDER BY b ASC 
>   RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING
> ) AS CNT_1 
>   FROM 
> test_table
>   ''')
> res = df.collect()
> df.explain(True)
> pprint(res)
> {code}
> Spark 3.4.0-SNAPSHOT output:
> {code}
> [Row(CNT_1=1), Row(CNT_1=0)]
> {code}
> Spark 3.3.1 output as expected:
> {code}
> Row(CNT_1=1), Row(CNT_1=1)]
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-41793) Incorrect result for window frames defined as ranges on large decimals

2022-12-30 Thread Gera Shegalov (Jira)
Gera Shegalov created SPARK-41793:
-

 Summary: Incorrect result for window frames defined as ranges on 
large decimals 
 Key: SPARK-41793
 URL: https://issues.apache.org/jira/browse/SPARK-41793
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.4.0
Reporter: Gera Shegalov


Context 
https://github.com/NVIDIA/spark-rapids/issues/7429#issuecomment-1368040686

The following windowing query on a simple two-row input should produce two 
non-empty windows as a result

{code}
from pprint import pprint
data = [
  ('9223372036854775807', '11342371013783243717493546650944543.47'),
  ('9223372036854775807', '.99')
]
df1 = spark.createDataFrame(data, 'a STRING, b STRING')
df2 = df1.select(df1.a.cast('LONG'), df1.b.cast('DECIMAL(38,2)'))
df2.createOrReplaceTempView('test_table')
df = sql('''
  SELECT 
COUNT(1) OVER (
  PARTITION BY a 
  ORDER BY b ASC 
  RANGE BETWEEN 10.2345 PRECEDING AND 6.7890 FOLLOWING
) AS CNT_1 
  FROM 
test_table
  ''')
res = df.collect()
df.explain(True)
pprint(res)
{code}

Spark 3.4.0-SNAPSHOT output:
{code}
[Row(CNT_1=1), Row(CNT_1=0)]
{code}

Spark 3.3.1 output as expected:
{code}
Row(CNT_1=1), Row(CNT_1=1)]
{code}







--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-35408) Improve parameter validation in DataFrame.show

2021-05-14 Thread Gera Shegalov (Jira)
Gera Shegalov created SPARK-35408:
-

 Summary: Improve parameter validation in DataFrame.show
 Key: SPARK-35408
 URL: https://issues.apache.org/jira/browse/SPARK-35408
 Project: Spark
  Issue Type: Improvement
  Components: PySpark
Affects Versions: 3.1.1
Reporter: Gera Shegalov


Being more used to the Scala API, a user may be tempted to call
{code:python}
df.show(False)
{code}
in PySpark and will receive an error message that does not easily map to 
the user code:
{noformat}
py4j.Py4JException: Method showString([class java.lang.Boolean, class 
java.lang.Integer, class java.lang.Boolean]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748
{noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-31900) Client memory passed unvalidated to the JVM Xmx

2020-06-03 Thread Gera Shegalov (Jira)
Gera Shegalov created SPARK-31900:
-

 Summary: Client memory passed unvalidated to the JVM Xmx
 Key: SPARK-31900
 URL: https://issues.apache.org/jira/browse/SPARK-31900
 Project: Spark
  Issue Type: Bug
  Components: Spark Submit
Affects Versions: 2.0.0, 3.0.0
Reporter: Gera Shegalov


When launching in client mode, the Spark launcher uses the spark.driver.memory 
config (among other settings, in precedence order). Unlike in cluster mode, 
the client memory config is neither trimmed nor validated to be a valid value 
for the [Xmx 
option|https://docs.oracle.com/en/java/javase/14/docs/specs/man/java.html#extra-options-for-java].
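A minimal sketch of the missing validation, assuming a plain number with an optional k/m/g suffix is what -Xmx accepts; this is not the launcher's actual code:

{code:scala}
// Trim the configured value and fail fast if it is not a valid -Xmx value,
// instead of passing it through and letting the JVM refuse to start.
def validateDriverMemory(raw: String): String = {
  val mem = raw.trim
  require(mem.matches("(?i)[0-9]+[kmg]?"), s"Invalid spark.driver.memory for -Xmx: '$raw'")
  mem
}
{code}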



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23155) YARN-aggregated executor/driver logs appear unavailable when NM is down

2019-02-01 Thread Gera Shegalov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-23155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758102#comment-16758102
 ] 

Gera Shegalov commented on SPARK-23155:
---

[~kabhwan], [~vanzin] I would still be interested in being able to use the new 
mechanism with the old logs. [https://github.com/apache/spark/pull/23720] is a 
quick draft demonstrating how we could achieve this flexibly with named capture 
groups.
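For illustration only (the linked PR is the authoritative proposal), a sketch of rewriting a NodeManager container-log URL against a log server using named capture groups; the URL shapes, host names, and ports below are assumptions:

{code:scala}
import java.util.regex.Pattern

// Pull the pieces out of a NodeManager log URL with named groups and rebuild
// the link against an aggregated log server instead.
val nmUrl   = "http://nm-host:8042/node/containerlogs/container_1523370127531_0016_01_000001/someuser"
val pattern = Pattern.compile(
  "https?://(?<nmhost>[^:/]+):\\d+/node/containerlogs/(?<container>[^/]+)/(?<user>[^/]+)")

val m = pattern.matcher(nmUrl)
if (m.find()) {
  val rewritten = s"http://log-server-host:19888/jobhistory/logs/${m.group("nmhost")}:8041/" +
    s"${m.group("container")}/${m.group("container")}/${m.group("user")}"
  println(rewritten)
}
{code}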

> YARN-aggregated executor/driver logs appear unavailable when NM is down
> ---
>
> Key: SPARK-23155
> URL: https://issues.apache.org/jira/browse/SPARK-23155
> Project: Spark
>  Issue Type: Improvement
>  Components: Deploy
>Affects Versions: 2.2.1
>Reporter: Gera Shegalov
>Assignee: Jungtaek Lim
>Priority: Major
> Fix For: 3.0.0
>
>
> Unlike the MapReduce JobHistory Server, the Spark history server doesn't rewrite 
> container log URLs to point to the aggregated yarn.log.server.url location; it 
> relies on the NodeManager webUI to trigger a redirect. This fails when the NM is 
> down. Note that the NM may be down permanently after decommissioning in 
> traditional environments, or in a cloud environment such as AWS EMR where either 
> worker nodes are taken away by autoscaling or the whole cluster is used to run a 
> single job.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-26792) Apply custom log URL to Spark UI

2019-02-01 Thread Gera Shegalov (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-26792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16758095#comment-16758095
 ] 

Gera Shegalov commented on SPARK-26792:
---

[~kabhwan] thanks for doing this work. I verified that I can configure SHS so 
it satisfies our use case. Changing the default in Spark is a nice-to-have but 
not a high priority from my perspective.

> Apply custom log URL to Spark UI
> 
>
> Key: SPARK-26792
> URL: https://issues.apache.org/jira/browse/SPARK-26792
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 3.0.0
>Reporter: Jungtaek Lim
>Priority: Major
>
> SPARK-23155 enables SHS to set up custom log URLs for incompleted / completed 
> apps.
> While getting reviews from SPARK-23155, I've got two comments which applying 
> custom log URLs to UI would help achieving it. Quoting these comments here:
> https://github.com/apache/spark/pull/23260#issuecomment-456827963
> {quote}
> Sorry I haven't had time to look through all the code so this might be a 
> separate jira, but one thing I thought of here is it would be really nice not 
> to have specifically stderr/stdout. users can specify any log4j.properties 
> and some tools like oozie by default end up using hadoop log4j rather then 
> spark log4j, so files aren't necessarily the same. Also users can put in 
> other logs files so it would be nice to have links to those from the UI. It 
> seems simpler if we just had a link to the directory and it read the files 
> within there. Other things in Hadoop do it this way, but I'm not sure if that 
> works well for other resource managers, any thoughts on that? As long as this 
> doesn't prevent the above I can file a separate jira for it.
> {quote}
> https://github.com/apache/spark/pull/23260#issuecomment-456904716
> {quote}
> Hi Tom, +1: singling out stdout and stderr is definitely an annoyance. We
> typically configure Spark jobs to write the GC log and dump heap on OOM
> using ,  and/or we use the rolling file appender to deal with
> large logs during debugging. So linking the YARN container log overview
> page would make much more sense for us. We work it around with a custom
> submit process that logs all important URLs on the submit side log.
> {quote}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-25221) [DEPLOY] Consistent trailing whitespace treatment of conf values

2018-08-23 Thread Gera Shegalov (JIRA)
Gera Shegalov created SPARK-25221:
-

 Summary: [DEPLOY] Consistent trailing whitespace treatment of conf 
values
 Key: SPARK-25221
 URL: https://issues.apache.org/jira/browse/SPARK-25221
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 2.3.1
Reporter: Gera Shegalov


Using a custom line delimiter 
{{spark.hadoop.textinputformat.record.delimiter}} that has a leading or 
trailing whitespace character is only possible when it is specified via {{--conf}}. 
Our pipeline consists of highly customized generated jobs. Storing all the 
config in a properties file is not only better for readability but even 
necessary to avoid running into {{ARGS_MAX}} limits on different OSes. Spark should 
uniformly avoid trimming conf values in both cases. 
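A minimal sketch of the asymmetry, assuming the properties-file load path trims values while {{--conf}} does not (which is the behaviour this issue describes):

{code:scala}
import java.io.StringReader
import java.util.Properties

// A record delimiter that ends in a newline survives Properties.load itself,
// but a subsequent trim of the value silently drops the trailing whitespace.
val props = new Properties()
props.load(new StringReader("""spark.hadoop.textinputformat.record.delimiter=|\n"""))
val raw = props.getProperty("spark.hadoop.textinputformat.record.delimiter")
println(raw.length)       // 2: '|' followed by an actual newline
println(raw.trim.length)  // 1: the newline is lost if the value is trimmed
{code}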



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23956) Use effective RPC port in AM registration

2018-04-11 Thread Gera Shegalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated SPARK-23956:
--
Priority: Minor  (was: Major)

> Use effective RPC port in AM registration 
> --
>
> Key: SPARK-23956
> URL: https://issues.apache.org/jira/browse/SPARK-23956
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 2.3.0
>Reporter: Gera Shegalov
>Priority: Minor
>
> AM's should use their real rpc port in the AM registration for better 
> diagnostics in Application Report.
> {code}
> 18/04/10 14:56:21 INFO Client:
> client token: N/A
> diagnostics: N/A
> ApplicationMaster host: localhost
> ApplicationMaster RPC port: 58338
> queue: default
> start time: 1523397373659
> final status: UNDEFINED
> tracking URL: http://localhost:8088/proxy/application_1523370127531_0016/
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23956) Use effective RPC port in AM registration

2018-04-10 Thread Gera Shegalov (JIRA)
Gera Shegalov created SPARK-23956:
-

 Summary: Use effective RPC port in AM registration 
 Key: SPARK-23956
 URL: https://issues.apache.org/jira/browse/SPARK-23956
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 2.3.0
Reporter: Gera Shegalov


AM's should use their real rpc port in the AM registration for better 
diagnostics in Application Report.

{code}
18/04/10 14:56:21 INFO Client:
client token: N/A
diagnostics: N/A
ApplicationMaster host: localhost
ApplicationMaster RPC port: 58338
queue: default
start time: 1523397373659
final status: UNDEFINED
tracking URL: http://localhost:8088/proxy/application_1523370127531_0016/
{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23386) Enable direct application links before replay

2018-02-10 Thread Gera Shegalov (JIRA)
Gera Shegalov created SPARK-23386:
-

 Summary: Enable direct application links before replay
 Key: SPARK-23386
 URL: https://issues.apache.org/jira/browse/SPARK-23386
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 2.2.1
Reporter: Gera Shegalov


In a deployment with tens of thousands of large event logs it may take *many hours* 
until all logs are replayed. Most of our users reach the SHS by clicking on a link 
in a client log in case of an error. Direct links currently don't work until the 
event log has been processed by a replay thread. This Jira proposes to link the 
appId to its event log already during the scan, without a full replay. This makes 
on-demand retrievals accessible almost immediately upon SHS start.
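A minimal sketch of the idea, assuming event log file names encode the application id (which is how the SHS event log directory is laid out); this is not the actual history provider code:

{code:scala}
import java.nio.file.{Files, Paths}
import scala.jdk.CollectionConverters._

// Build an appId -> event log path index from a plain directory listing,
// so direct links can resolve before any log has been replayed.
def indexByAppId(eventLogDir: String): Map[String, String] =
  Files.list(Paths.get(eventLogDir)).iterator().asScala
    .map(p => p.getFileName.toString.stripSuffix(".inprogress") -> p.toString)
    .toMap
{code}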



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23296) Diagnostics message for user code exceptions should include the stacktrace

2018-02-01 Thread Gera Shegalov (JIRA)
Gera Shegalov created SPARK-23296:
-

 Summary: Diagnostics message for user code exceptions should 
include the stacktrace
 Key: SPARK-23296
 URL: https://issues.apache.org/jira/browse/SPARK-23296
 Project: Spark
  Issue Type: Improvement
  Components: YARN
Affects Versions: 2.2.1
Reporter: Gera Shegalov


When a Spark job fails on a user exception, only {{Throwable#toString}} is 
included in the diagnostics. It would take fewer clicks to get to the stack trace 
if it appeared on both the YARN webUI and in the client log. 
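A minimal sketch of what including the stack trace could look like, assuming the diagnostics string is built from the caught throwable; this is not the actual ApplicationMaster code:

{code:scala}
import java.io.{PrintWriter, StringWriter}

// Render the full stack trace for the diagnostics message instead of relying
// on Throwable#toString, which only yields the class name and message.
def diagnosticsMessage(t: Throwable): String = {
  val sw = new StringWriter()
  t.printStackTrace(new PrintWriter(sw))
  sw.toString
}
{code}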



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12963) In cluster mode, spark_local_ip will cause driver exception: Service 'Driver' failed after 16 retries!

2018-01-19 Thread Gera Shegalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12963?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated SPARK-12963:
--
Shepherd: Sean Owen

> In cluster mode, spark_local_ip will cause driver exception: Service 'Driver' 
> failed after 16 retries!
> -
>
> Key: SPARK-12963
> URL: https://issues.apache.org/jira/browse/SPARK-12963
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.0
>Reporter: lichenglin
>Priority: Critical
>
> I have a 3 node cluster: namenode, second and data1;
> I use this shell to submit a job on namenode:
> bin/spark-submit   --deploy-mode cluster --class com.bjdv.spark.job.Abc  
> --total-executor-cores 5  --master spark://namenode:6066
> hdfs://namenode:9000/sparkjars/spark.jar
> The Driver may be started on another node such as data1.
> The problem is:
> when I set SPARK_LOCAL_IP in conf/spark-env.sh on namenode
> the driver will be started with this param such as 
> SPARK_LOCAL_IP=namenode
> but the driver will start at data1,
> the driver will try to bind the ip 'namenode' on data1,
> so the driver will throw an exception like this:
>  Service 'Driver' failed after 16 retries!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12963) In cluster mode, spark_local_ip will cause driver exception: Service 'Driver' failed after 16 retries!

2018-01-18 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16331815#comment-16331815
 ] 

Gera Shegalov commented on SPARK-12963:
---

We hit the same issue on nodes where a process is not allowed to listen on all 
NICs. An easy fix is to make sure that the Driver in the ApplicationMaster inherits 
an explicitly configured public hostname from the NodeManager.

> In cluster mode, spark_local_ip will cause driver exception: Service 'Driver' 
> failed after 16 retries!
> -
>
> Key: SPARK-12963
> URL: https://issues.apache.org/jira/browse/SPARK-12963
> Project: Spark
>  Issue Type: Bug
>  Components: Deploy
>Affects Versions: 1.6.0
>Reporter: lichenglin
>Priority: Critical
>
> I have a 3 node cluster: namenode, second and data1;
> I use this shell to submit a job on namenode:
> bin/spark-submit   --deploy-mode cluster --class com.bjdv.spark.job.Abc  
> --total-executor-cores 5  --master spark://namenode:6066
> hdfs://namenode:9000/sparkjars/spark.jar
> The Driver may be started on another node such as data1.
> The problem is:
> when I set SPARK_LOCAL_IP in conf/spark-env.sh on namenode
> the driver will be started with this param such as 
> SPARK_LOCAL_IP=namenode
> but the driver will start at data1,
> the driver will try to bind the ip 'namenode' on data1,
> so the driver will throw an exception like this:
>  Service 'Driver' failed after 16 retries!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23155) YARN-aggregated executor/driver logs appear unavailable when NM is down

2018-01-18 Thread Gera Shegalov (JIRA)
Gera Shegalov created SPARK-23155:
-

 Summary: YARN-aggregated executor/driver logs appear unavailable 
when NM is down
 Key: SPARK-23155
 URL: https://issues.apache.org/jira/browse/SPARK-23155
 Project: Spark
  Issue Type: Improvement
  Components: Deploy
Affects Versions: 2.2.1
Reporter: Gera Shegalov


Unlike the MapReduce JobHistory Server, the Spark history server does not rewrite 
container log URLs to point to the aggregated yarn.log.server.url location; it 
relies on the NodeManager webUI to trigger a redirect. This fails when the NM 
is down. Note that the NM may be down permanently: after decommissioning in 
traditional environments, or in a cloud environment such as AWS EMR where 
worker nodes are taken away by autoscaling or the whole cluster is used to run 
a single job and then terminated.
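
A minimal sketch of the rewrite, assuming the MapReduce JobHistory URL layout 
({{yarn.log.server.url}}/<nm-address>/<container-id>/<container-id>/<user>); 
this is not the Spark implementation.

{code:scala}
// Hypothetical URL rewrite: point directly at the aggregated-log server so
// the link keeps working after the NodeManager is gone. The path layout
// mirrors the MapReduce JobHistory convention and is an assumption here.
def aggregatedLogUrl(
    logServerUrl: String,   // value of yarn.log.server.url
    nmAddress: String,      // original NodeManager host:port
    containerId: String,
    user: String): String = {
  s"${logServerUrl.stripSuffix("/")}/$nmAddress/$containerId/$containerId/$user"
}
{code}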



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22914) Subbing for spark.history.ui.port does not resolve by default

2017-12-27 Thread Gera Shegalov (JIRA)
Gera Shegalov created SPARK-22914:
-

 Summary: Subbing for spark.history.ui.port does not resolve by 
default
 Key: SPARK-22914
 URL: https://issues.apache.org/jira/browse/SPARK-22914
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 2.2.1
Reporter: Gera Shegalov


In order not to hardcode the SHS web UI port and not to duplicate information that is 
already configured, we might be inclined to define 
{{spark.yarn.historyServer.address}} as 
{code}http://${hadoopconf-yarn.resourcemanager.hostname}:${spark.history.ui.port}{code}

However, since spark.history.ui.port is not registered as a config entry, its resolution 
fails when it is not explicitly set in the deployed Spark conf. 
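
One possible direction, sketched with Spark's internal config builder (the 
entry name and default below are assumptions, not the committed fix): 
registering the key with a default would let the substitution resolve.

{code:scala}
// Hypothetical registration of spark.history.ui.port as a typed config entry
// with a default, so ${spark.history.ui.port} can be substituted even when
// the key is absent from the deployed spark conf.
import org.apache.spark.internal.config.ConfigBuilder

private[spark] val HISTORY_UI_PORT = ConfigBuilder("spark.history.ui.port")
  .doc("Web UI port to bind the Spark History Server")
  .intConf
  .createWithDefault(18080)
{code}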



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22875) Assembly build fails for a high user id

2017-12-22 Thread Gera Shegalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated SPARK-22875:
--
Flags: Patch

> Assembly build fails for a high user id
> ---
>
> Key: SPARK-22875
> URL: https://issues.apache.org/jira/browse/SPARK-22875
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.1
>Reporter: Gera Shegalov
>
> {code}
> ./build/mvn package -Pbigtop-dist -DskipTests
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single (dist) on project 
> spark-assembly_2.11: Execution dist of goal 
> org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single failed: user id 
> '123456789' is too big ( > 2097151 ). -> [Help 1]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22875) Assembly build fails for a high user id

2017-12-22 Thread Gera Shegalov (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gera Shegalov updated SPARK-22875:
--
Shepherd: Steve Loughran

> Assembly build fails for a high user id
> ---
>
> Key: SPARK-22875
> URL: https://issues.apache.org/jira/browse/SPARK-22875
> Project: Spark
>  Issue Type: Bug
>  Components: Build
>Affects Versions: 2.2.1
>Reporter: Gera Shegalov
>
> {code}
> ./build/mvn package -Pbigtop-dist -DskipTests
> [ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single (dist) on project 
> spark-assembly_2.11: Execution dist of goal 
> org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single failed: user id 
> '123456789' is too big ( > 2097151 ). -> [Help 1]
> {code}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-22875) Assembly build fails for a high user id

2017-12-22 Thread Gera Shegalov (JIRA)
Gera Shegalov created SPARK-22875:
-

 Summary: Assembly build fails for a high user id
 Key: SPARK-22875
 URL: https://issues.apache.org/jira/browse/SPARK-22875
 Project: Spark
  Issue Type: Bug
  Components: Build
Affects Versions: 2.2.1
Reporter: Gera Shegalov


{code}
./build/mvn package -Pbigtop-dist -DskipTests
[ERROR] Failed to execute goal 
org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single (dist) on project 
spark-assembly_2.11: Execution dist of goal 
org.apache.maven.plugins:maven-assembly-plugin:3.1.0:single failed: user id 
'123456789' is too big ( > 2097151 ). -> [Help 1]
{code}





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2602) sbt/sbt test steals window focus on OS X

2014-07-20 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14068044#comment-14068044
 ] 

Gera Shegalov commented on SPARK-2602:
--

Take a look at the thread on HADOOP-10290
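
If the culprit is the same as in that thread (forked JVMs briefly initializing 
AWT), forcing headless mode for the test JVMs should avoid the focus stealing. 
A rough sbt sketch, not a committed change:

{code:scala}
// Hypothetical sbt settings: fork test JVMs and run them headless so they
// never initialize an AWT application that could steal window focus on OS X.
fork in Test := true
javaOptions in Test += "-Djava.awt.headless=true"
{code}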

 sbt/sbt test steals window focus on OS X
 

 Key: SPARK-2602
 URL: https://issues.apache.org/jira/browse/SPARK-2602
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Nicholas Chammas
Priority: Minor

 On OS X, I run {{sbt/sbt test}} from Terminal and then go off and do 
 something else with my computer. It appears that there are several things in 
 the test suite that launch Java programs that, for some reason, steal window 
 focus. 
 It can get very annoying, especially if you happen to be typing something in 
 a different window, to be suddenly teleported to a random Java application 
 and have your finely crafted keystrokes be sent where they weren't intended.
 It would be nice if {{sbt/sbt test}} didn't do that.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2577) File upload to viewfs is broken due to mount point resolution

2014-07-18 Thread Gera Shegalov (JIRA)
Gera Shegalov created SPARK-2577:


 Summary: File upload to viewfs is broken due to mount point 
resolution
 Key: SPARK-2577
 URL: https://issues.apache.org/jira/browse/SPARK-2577
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Gera Shegalov
Priority: Blocker


The YARN client resolves the paths of uploaded artifacts. When a viewfs path is 
resolved, the resulting path belongs to the target file system. However, the 
original viewfs object is still passed to {{ClientDistributedCacheManager#addResource}}. 

{code}
14/07/18 01:30:31 INFO yarn.Client: Uploading 
file:/Users/gshegalov/workspace/spark-tw/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop3.0.0-SNAPSHOT.jar
 to 
viewfs:/user/gshegalov/.sparkStaging/application_1405479201490_0049/spark-assembly-1.1.0-SNAPSHOT-hadoop3.0.0-SNAPSHOT.jar
Exception in thread main java.lang.IllegalArgumentException: Wrong FS: 
hdfs://ns1:8020/user/gshegalov/.sparkStaging/application_1405479201490_0049/spark-assembly-1.1.0-SNAPSHOT-hadoop3.0.0-SNAPSHOT.jar,
 expected: viewfs:/
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643)
at 
org.apache.hadoop.fs.viewfs.ViewFileSystem.getUriPath(ViewFileSystem.java:116)
at 
org.apache.hadoop.fs.viewfs.ViewFileSystem.getFileStatus(ViewFileSystem.java:345)
at 
org.apache.spark.deploy.yarn.ClientDistributedCacheManager.addResource(ClientDistributedCacheManager.scala:72)
at 
org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:236)
at 
org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:229)
at scala.collection.immutable.List.foreach(List.scala:318)
at 
org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:229)
at 
org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:37)
at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:74)
at 
org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:81)
at 
org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:136)
at org.apache.spark.SparkContext.init(SparkContext.scala:320)
at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:28)
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
{code}

There are two options:
# do not resolve the path, because symlinks are currently disabled in Hadoop
# pass the correct filesystem object (a sketch of this option follows below)
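
A minimal sketch of the second option, using the public Hadoop FileSystem API 
(this is not the actual patch): resolve the path first, then derive the 
filesystem from the resolved path, so the object handed to {{addResource}} 
matches the destination.

{code:scala}
// Hypothetical fix sketch: after resolving a viewfs path to its target
// (e.g. hdfs://ns1/...), obtain the filesystem from the resolved path and
// use that object for subsequent calls instead of the original viewfs one.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

def resolvedFsAndPath(fs: FileSystem, path: Path, conf: Configuration): (FileSystem, Path) = {
  val resolved = fs.resolvePath(path)      // may cross a viewfs mount point
  (resolved.getFileSystem(conf), resolved) // filesystem matching the resolved URI
}
{code}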



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2577) File upload to viewfs is broken due to mount point resolution

2014-07-18 Thread Gera Shegalov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2577?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14066243#comment-14066243
 ] 

Gera Shegalov commented on SPARK-2577:
--

https://github.com/apache/spark/pull/1483

 File upload to viewfs is broken due to mount point resolution
 -

 Key: SPARK-2577
 URL: https://issues.apache.org/jira/browse/SPARK-2577
 Project: Spark
  Issue Type: Bug
  Components: YARN
Reporter: Gera Shegalov
Priority: Blocker

 YARN client resolves paths of uploaded artifacts. When a viewfs path is 
 resolved, the filesystem changes to the target file system. However, the 
 original fs is passed to {{ClientDistributedCacheManager#addResource}}. 
 {code}
 14/07/18 01:30:31 INFO yarn.Client: Uploading 
 file:/Users/gshegalov/workspace/spark-tw/assembly/target/scala-2.10/spark-assembly-1.1.0-SNAPSHOT-hadoop3.0.0-SNAPSHOT.jar
  to 
 viewfs:/user/gshegalov/.sparkStaging/application_1405479201490_0049/spark-assembly-1.1.0-SNAPSHOT-hadoop3.0.0-SNAPSHOT.jar
 Exception in thread main java.lang.IllegalArgumentException: Wrong FS: 
 hdfs://ns1:8020/user/gshegalov/.sparkStaging/application_1405479201490_0049/spark-assembly-1.1.0-SNAPSHOT-hadoop3.0.0-SNAPSHOT.jar,
  expected: viewfs:/
   at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:643)
   at 
 org.apache.hadoop.fs.viewfs.ViewFileSystem.getUriPath(ViewFileSystem.java:116)
   at 
 org.apache.hadoop.fs.viewfs.ViewFileSystem.getFileStatus(ViewFileSystem.java:345)
   at 
 org.apache.spark.deploy.yarn.ClientDistributedCacheManager.addResource(ClientDistributedCacheManager.scala:72)
   at 
 org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:236)
   at 
 org.apache.spark.deploy.yarn.ClientBase$$anonfun$prepareLocalResources$5.apply(ClientBase.scala:229)
   at scala.collection.immutable.List.foreach(List.scala:318)
   at 
 org.apache.spark.deploy.yarn.ClientBase$class.prepareLocalResources(ClientBase.scala:229)
   at 
 org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:37)
   at org.apache.spark.deploy.yarn.Client.runApp(Client.scala:74)
   at 
 org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:81)
   at 
 org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:136)
   at org.apache.spark.SparkContext.init(SparkContext.scala:320)
   at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:28)
   at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
 {code}
 There are two options:
 # do not resolve path because symlinks are currently disabled in Hadoop
 # pass the correct filesystem object



--
This message was sent by Atlassian JIRA
(v6.2#6252)