[jira] [Commented] (SPARK-24043) InterpretedPredicate.eval fails if expression tree contains Nondeterministic expressions

2018-04-23 Thread Takeshi Yamamuro (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449326#comment-16449326
 ] 

Takeshi Yamamuro commented on SPARK-24043:
--

I tried this with codegen=off:
{code:java}
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/
 
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_31)
Type in expressions to have them evaluated.
Type :help for more information.

scala> sql("SET spark.sql.codegen.wholeStage=false")
res0: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> Seq((1)).toDF("a").filter('a > rand(7)).show 
+---+
|  a|
+---+
|  1|
+---+
{code}

> InterpretedPredicate.eval fails if expression tree contains Nondeterministic 
> expressions
> 
>
> Key: SPARK-24043
> URL: https://issues.apache.org/jira/browse/SPARK-24043
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> When whole-stage codegen and predicate codegen both fail, FilterExec falls 
> back to using InterpretedPredicate. If the predicate's expression contains 
> any non-deterministic expressions, the evaluation throws an error:
> {noformat}
> scala> val df = Seq((1)).toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: int]
> scala> df.filter('a > 0).show // this works fine
> 2018-04-21 20:39:26 WARN  FilterExec:66 - Codegen disabled for this 
> expression:
>  (value#1 > 0)
> +---+
> |  a|
> +---+
> |  1|
> +---+
> scala> df.filter('a > rand(7)).show // this will throw an error
> 2018-04-21 20:39:40 WARN  FilterExec:66 - Codegen disabled for this 
> expression:
>  (cast(value#1 as double) > rand(7))
> 2018-04-21 20:39:40 ERROR Executor:91 - Exception in task 0.0 in stage 1.0 
> (TID 1)
> java.lang.IllegalArgumentException: requirement failed: Nondeterministic 
> expression org.apache.spark.sql.catalyst.expressions.Rand should be 
> initialized before eval.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.sql.catalyst.expressions.Nondeterministic$class.eval(Expression.scala:326)
>   at 
> org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:34)
> {noformat}
> This is because no code initializes the Nondeterministic expressions before 
> eval is called on them.
> This is a low impact issue, since it would require both whole-stage codegen 
> and predicate codegen to fail before FilterExec would fall back to using 
> InterpretedPredicate.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24058) Default Params in ML should be saved separately: Python API

2018-04-23 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449313#comment-16449313
 ] 

Liang-Chi Hsieh commented on SPARK-24058:
-

OK. I will work on this. Thanks.

> Default Params in ML should be saved separately: Python API
> ---
>
> Key: SPARK-24058
> URL: https://issues.apache.org/jira/browse/SPARK-24058
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> See [SPARK-23455] for reference.  Since DefaultParamsReader has been changed 
> in Scala, we must change it for Python for Spark 2.4.0 as well in order to 
> keep the 2 in sync.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24002) Task not serializable caused by org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes

2018-04-23 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun resolved SPARK-24002.
---
   Resolution: Fixed
Fix Version/s: 2.4.0

This is resolved via [https://github.com/apache/spark/pull/21086].

> Task not serializable caused by 
> org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes
> 
>
> Key: SPARK-24002
> URL: https://issues.apache.org/jira/browse/SPARK-24002
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Xiao Li
>Priority: Major
> Fix For: 2.4.0
>
>
> We have two queries: one is a 1000-line SQL query and the other is a 3000-line SQL 
> query. Reproducing the issue once requires running for at least an hour with a heavy 
> write workload. 
> {code}
> Py4JJavaError: An error occurred while calling o153.sql.
> : org.apache.spark.SparkException: Job aborted.
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:223)
>   at 
> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:189)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
>   at 
> org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$$anonfun$6.apply(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$$anonfun$59.apply(Dataset.scala:3021)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withCustomExecutionEnv(SQLExecution.scala:89)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:127)
>   at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3020)
>   at org.apache.spark.sql.Dataset.<init>(Dataset.scala:190)
>   at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:74)
>   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:646)
>   at sun.reflect.GeneratedMethodAccessor153.invoke(Unknown Source)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
>   at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
>   at py4j.Gateway.invoke(Gateway.java:293)
>   at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)
>   at py4j.GatewayConnection.run(GatewayConnection.java:226)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.spark.SparkException: Exception thrown in Future.get: 
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:190)
>   at 
> org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:267)
>   at 
> org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec.doConsume(BroadcastNestedLoopJoinExec.scala:530)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)
>   at 
> org.apache.spark.sql.execution.ProjectExec.consume(basicPhysicalOperators.scala:37)
>   at 
> org.apache.spark.sql.execution.ProjectExec.doConsume(basicPhysicalOperators.scala:69)
>   at 
> org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)
>   at 
> org.apache.spark.sql.execution.FilterExec.consume(basicPhysicalOperators.scala:144)
>   ...
>   at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:190)
>   ... 23 more
> Caused by: java.util.concurrent.ExecutionException: 
> org.apache.spark.SparkException: Task not serializable
>   at java.util.concurrent.FutureTask.report(FutureTask.java:122)
>   at java.util.concurrent.FutureTask.get(FutureTask.java:206)
>   at 
> org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:179)
>   ... 276 more
> Caused by: org.apache.spark.SparkException: Task not serializable
>   at 
> org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:340)
>   at 
> org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:330)
>   at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:156)
> 

[jira] [Commented] (SPARK-23589) Add interpreted execution for ExternalMapToCatalyst expression

2018-04-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449290#comment-16449290
 ] 

Apache Spark commented on SPARK-23589:
--

User 'maropu' has created a pull request for this issue:
https://github.com/apache/spark/pull/21137

> Add interpreted execution for ExternalMapToCatalyst expression
> --
>
> Key: SPARK-23589
> URL: https://issues.apache.org/jira/browse/SPARK-23589
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24043) InterpretedPredicate.eval fails if expression tree contains Nondeterministic expressions

2018-04-23 Thread Bruce Robbins (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449281#comment-16449281
 ] 

Bruce Robbins commented on SPARK-24043:
---

[~maropu]

> Do I miss any precondition?

For this bug to materialize in spark-shell, Spark SQL needs to be in interpreted 
mode (with both whole-stage codegen and predicate codegen shut off).

I ran some of the DataFrame and Dataset test suites in interpreted mode and 
this bug popped out (during the run for the test "handle nondeterministic 
expressions correctly for set operations"). To put Spark SQL in interpreted 
mode, I manually shut off whole-stage codegen and predicate codegen. They were 
still off when I ran the above spark-shell demo.

Outside of manually tweaking Spark, it's difficult to get predicate codegen to 
fail (it's easy to get whole-stage codegen to fall back: just supply more than 
300 columns in your query; predicate codegen is more resilient). That's why 
this is a low impact bug. However, at some point we might want to test 
interpreted mode.

I will make a PR, but it's no emergency.

To see the bug in action with Spark as-is, add these test cases to 
PredicateSuite. The first should succeed (no Nondeterministic expressions). The 
second will fail with an exception ("Nondeterministic expression 
org.apache.spark.sql.catalyst.expressions.Rand should be initialized before 
eval"):
{code:java}
  test("Interpreted Predicate should work without nondeterministic expressions") {
    val interpreted = InterpretedPredicate.create(LessThan(Literal(0.2), Literal(1.0)))
    interpreted.initialize(0)
    assert(interpreted.eval(new UnsafeRow()))
  }

  test("Interpreted Predicate should initialize nondeterministic expressions") {
    val interpreted = InterpretedPredicate.create(LessThan(Rand(7), Literal(1.0)))
    interpreted.initialize(0)
    assert(interpreted.eval(new UnsafeRow()))
  }
{code}
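
For reference, a minimal sketch of the missing initialization step, kept as a 
free-standing helper because the exact predicate class hierarchy is version-specific; 
this is only an illustration, not the eventual patch:
{code:java}
import org.apache.spark.sql.catalyst.expressions.{Expression, Nondeterministic}

// Before the first eval on a partition, walk the predicate's expression tree and
// initialize every Nondeterministic expression with the partition index.
def initializeNondeterministic(predicate: Expression, partitionIndex: Int): Unit = {
  predicate.foreach {
    case n: Nondeterministic => n.initialize(partitionIndex)
    case _ =>
  }
}
{code}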

> InterpretedPredicate.eval fails if expression tree contains Nondeterministic 
> expressions
> 
>
> Key: SPARK-24043
> URL: https://issues.apache.org/jira/browse/SPARK-24043
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Bruce Robbins
>Priority: Minor
>
> When whole-stage codegen and predicate codegen both fail, FilterExec falls 
> back to using InterpretedPredicate. If the predicate's expression contains 
> any non-deterministic expressions, the evaluation throws an error:
> {noformat}
> scala> val df = Seq((1)).toDF("a")
> df: org.apache.spark.sql.DataFrame = [a: int]
> scala> df.filter('a > 0).show // this works fine
> 2018-04-21 20:39:26 WARN  FilterExec:66 - Codegen disabled for this 
> expression:
>  (value#1 > 0)
> +---+
> |  a|
> +---+
> |  1|
> +---+
> scala> df.filter('a > rand(7)).show // this will throw an error
> 2018-04-21 20:39:40 WARN  FilterExec:66 - Codegen disabled for this 
> expression:
>  (cast(value#1 as double) > rand(7))
> 2018-04-21 20:39:40 ERROR Executor:91 - Exception in task 0.0 in stage 1.0 
> (TID 1)
> java.lang.IllegalArgumentException: requirement failed: Nondeterministic 
> expression org.apache.spark.sql.catalyst.expressions.Rand should be 
> initialized before eval.
>   at scala.Predef$.require(Predef.scala:224)
>   at 
> org.apache.spark.sql.catalyst.expressions.Nondeterministic$class.eval(Expression.scala:326)
>   at 
> org.apache.spark.sql.catalyst.expressions.RDG.eval(randomExpressions.scala:34)
> {noformat}
> This is because no code initializes the Nondeterministic expressions before 
> eval is called on them.
> This is a low impact issue, since it would require both whole-stage codegen 
> and predicate codegen to fail before FilterExec would fall back to using 
> InterpretedPredicate.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24061) [SS]TypedFilter is not supported in Continuous Processing

2018-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24061:


Assignee: (was: Apache Spark)

> [SS]TypedFilter is not supported in Continuous Processing
> -
>
> Key: SPARK-24061
> URL: https://issues.apache.org/jira/browse/SPARK-24061
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wang Yanlin
>Priority: Major
> Attachments: TypedFilter_error.png
>
>
> Using filter with a filter function in a Continuous Processing application causes 
> an error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24061) [SS]TypedFilter is not supported in Continuous Processing

2018-04-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24061?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449274#comment-16449274
 ] 

Apache Spark commented on SPARK-24061:
--

User 'yanlin-Lynn' has created a pull request for this issue:
https://github.com/apache/spark/pull/21136

> [SS]TypedFilter is not supported in Continuous Processing
> -
>
> Key: SPARK-24061
> URL: https://issues.apache.org/jira/browse/SPARK-24061
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wang Yanlin
>Priority: Major
> Attachments: TypedFilter_error.png
>
>
> Using filter with a filter function in a Continuous Processing application causes 
> an error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24061) [SS]TypedFilter is not supported in Continuous Processing

2018-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24061:


Assignee: Apache Spark

> [SS]TypedFilter is not supported in Continuous Processing
> -
>
> Key: SPARK-24061
> URL: https://issues.apache.org/jira/browse/SPARK-24061
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wang Yanlin
>Assignee: Apache Spark
>Priority: Major
> Attachments: TypedFilter_error.png
>
>
> Using filter with a filter function in a Continuous Processing application causes 
> an error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24061) [SS]TypedFilter is not supported in Continuous Processing

2018-04-23 Thread Wang Yanlin (JIRA)
Wang Yanlin created SPARK-24061:
---

 Summary: [SS]TypedFilter is not supported in Continuous Processing
 Key: SPARK-24061
 URL: https://issues.apache.org/jira/browse/SPARK-24061
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.4.0
Reporter: Wang Yanlin
 Attachments: TypedFilter_error.png

Using filter with a filter function in a Continuous Processing application causes 
an error.
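
A hypothetical minimal repro of the shape described (a Scala lambda filter, which 
becomes a TypedFilter node in the plan, inside a continuous query); the rate source, 
console sink, and trigger interval below are illustrative assumptions, not taken from 
the attached screenshot:
{code:java}
import org.apache.spark.sql.Row
import org.apache.spark.sql.streaming.Trigger

// A lambda-based (typed) filter instead of a Column expression puts a TypedFilter node in the plan.
val stream = spark.readStream.format("rate").load()   // columns: timestamp, value

val query = stream
  .filter((r: Row) => r.getAs[Long]("value") % 2 == 0)
  .writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))             // continuous processing trigger
  .start()
{code}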



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24061) [SS]TypedFilter is not supported in Continuous Processing

2018-04-23 Thread Wang Yanlin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wang Yanlin updated SPARK-24061:

Attachment: TypedFilter_error.png

> [SS]TypedFilter is not supported in Continuous Processing
> -
>
> Key: SPARK-24061
> URL: https://issues.apache.org/jira/browse/SPARK-24061
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Wang Yanlin
>Priority: Major
> Attachments: TypedFilter_error.png
>
>
> Using filter with a filter function in a Continuous Processing application causes 
> an error.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24059) When blacklist disable always hash to a bad local directory may cause job failure

2018-04-23 Thread zhoukang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449254#comment-16449254
 ] 

zhoukang commented on SPARK-24059:
--

When we set multiple local dirs on different disks, a retry makes sense.
[~srowen]
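
A rough sketch of the retry idea under that assumption (several local dirs, one of 
them on a failed disk); the helper below is hypothetical and is not DiskBlockManager's 
actual API:
{code:java}
import java.io.{File, IOException}
import scala.util.{Success, Try}

// Try the hashed local dir first, then fall back to the remaining dirs instead of
// failing the task when one disk has turned read-only.
def createTempShuffleFileWithRetry(localDirs: IndexedSeq[File], blockId: String): File = {
  val start = (blockId.hashCode % localDirs.length + localDirs.length) % localDirs.length
  val candidates = localDirs.indices.map(i => localDirs((start + i) % localDirs.length))
  candidates.iterator
    .map(dir => Try(File.createTempFile("temp_shuffle_", null, dir)))
    .collectFirst { case Success(f) => f }
    .getOrElse(throw new IOException(s"Failed to create a temp shuffle file in all ${localDirs.length} local dirs"))
}
{code}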

> When blacklist disable always hash to a bad local directory may cause job 
> failure
> -
>
> Key: SPARK-24059
> URL: https://issues.apache.org/jira/browse/SPARK-24059
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: zhoukang
>Priority: Major
>
> When the blacklist is disabled, always hashing temp shuffle files to a bad local 
> directory on the same executor can cause job failure.
> Like below:
> {code:java}
> java.io.FileNotFoundException: 
> /home/work/hdd8/yarn/xxx/nodemanager/usercache/xxx/appcache/application_1520502842490_20813/blockmgr-3beeddbf-bb83-4a74-ad7a-e796fe592b7c/27/temp_shuffle_159e6886-b76f-4d96-9600-aee62ada0fa9
>  (Read-only file system)
> java.io.FileNotFoundException: 
> /home/work/hdd8/yarn/xxx/nodemanager/usercache/xxx/appcache/application_1520502842490_20813/blockmgr-3beeddbf-bb83-4a74-ad7a-e796fe592b7c/06/temp_shuffle_ba7f0a29-8e02-4ffa-94f7-01f72d214821
>  (Read-only file system)
> java.io.FileNotFoundException: 
> /home/work/hdd8/yarn/xxx/nodemanager/usercache/xxx/appcache/application_1520502842490_20813/blockmgr-3beeddbf-bb83-4a74-ad7a-e796fe592b7c/32/temp_shuffle_7030256c-fc24-4d45-a901-be23c2c3fbd6
>  (Read-only file system)
> java.io.FileNotFoundException: 
> /home/work/hdd8/yarn/zjyprc-hadoop/nodemanager/usercache/h_message_push/appcache/application_1520502842490_20813/blockmgr-3beeddbf-bb83-4a74-ad7a-e796fe592b7c/14/temp_shuffle_65816622-6217-43b9-bc9e-e2f67dc9a9de
>  (Read-only file system)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24060) StreamingSymmetricHashJoinHelperSuite should initialize after SparkSession creation

2018-04-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449237#comment-16449237
 ] 

Apache Spark commented on SPARK-24060:
--

User 'pwoody' has created a pull request for this issue:
https://github.com/apache/spark/pull/21135

> StreamingSymmetricHashJoinHelperSuite should initialize after SparkSession 
> creation
> ---
>
> Key: SPARK-24060
> URL: https://issues.apache.org/jira/browse/SPARK-24060
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Patrick Woody
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24060) StreamingSymmetricHashJoinHelperSuite should initialize after SparkSession creation

2018-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24060:


Assignee: (was: Apache Spark)

> StreamingSymmetricHashJoinHelperSuite should initialize after SparkSession 
> creation
> ---
>
> Key: SPARK-24060
> URL: https://issues.apache.org/jira/browse/SPARK-24060
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Patrick Woody
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24060) StreamingSymmetricHashJoinHelperSuite should initialize after SparkSession creation

2018-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24060:


Assignee: Apache Spark

> StreamingSymmetricHashJoinHelperSuite should initialize after SparkSession 
> creation
> ---
>
> Key: SPARK-24060
> URL: https://issues.apache.org/jira/browse/SPARK-24060
> Project: Spark
>  Issue Type: Bug
>  Components: Tests
>Affects Versions: 2.4.0
>Reporter: Patrick Woody
>Assignee: Apache Spark
>Priority: Minor
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24060) StreamingSymmetricHashJoinHelperSuite should initialize after SparkSession creation

2018-04-23 Thread Patrick Woody (JIRA)
Patrick Woody created SPARK-24060:
-

 Summary: StreamingSymmetricHashJoinHelperSuite should initialize 
after SparkSession creation
 Key: SPARK-24060
 URL: https://issues.apache.org/jira/browse/SPARK-24060
 Project: Spark
  Issue Type: Bug
  Components: Tests
Affects Versions: 2.4.0
Reporter: Patrick Woody






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24059) When blacklist disable always hash to a bad local directory may cause job failure

2018-04-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449227#comment-16449227
 ] 

Sean Owen commented on SPARK-24059:
---

If the disk has failed, retry doesn't make a lot of sense.

Why not enable the blacklist, then?

> When blacklist disable always hash to a bad local directory may cause job 
> failure
> -
>
> Key: SPARK-24059
> URL: https://issues.apache.org/jira/browse/SPARK-24059
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: zhoukang
>Priority: Major
>
> When the blacklist is disabled, always hashing temp shuffle files to a bad local 
> directory on the same executor can cause job failure.
> Like below:
> {code:java}
> java.io.FileNotFoundException: 
> /home/work/hdd8/yarn/xxx/nodemanager/usercache/xxx/appcache/application_1520502842490_20813/blockmgr-3beeddbf-bb83-4a74-ad7a-e796fe592b7c/27/temp_shuffle_159e6886-b76f-4d96-9600-aee62ada0fa9
>  (Read-only file system)
> java.io.FileNotFoundException: 
> /home/work/hdd8/yarn/xxx/nodemanager/usercache/xxx/appcache/application_1520502842490_20813/blockmgr-3beeddbf-bb83-4a74-ad7a-e796fe592b7c/06/temp_shuffle_ba7f0a29-8e02-4ffa-94f7-01f72d214821
>  (Read-only file system)
> java.io.FileNotFoundException: 
> /home/work/hdd8/yarn/xxx/nodemanager/usercache/xxx/appcache/application_1520502842490_20813/blockmgr-3beeddbf-bb83-4a74-ad7a-e796fe592b7c/32/temp_shuffle_7030256c-fc24-4d45-a901-be23c2c3fbd6
>  (Read-only file system)
> java.io.FileNotFoundException: 
> /home/work/hdd8/yarn/zjyprc-hadoop/nodemanager/usercache/h_message_push/appcache/application_1520502842490_20813/blockmgr-3beeddbf-bb83-4a74-ad7a-e796fe592b7c/14/temp_shuffle_65816622-6217-43b9-bc9e-e2f67dc9a9de
>  (Read-only file system)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23973) Remove consecutive sorts

2018-04-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23973?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23973.
-
   Resolution: Fixed
 Assignee: Marco Gaido
Fix Version/s: 2.4.0

> Remove consecutive sorts
> 
>
> Key: SPARK-23973
> URL: https://issues.apache.org/jira/browse/SPARK-23973
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: Henry Robinson
>Assignee: Marco Gaido
>Priority: Minor
> Fix For: 2.4.0
>
>
> As a follow-on from SPARK-23375, it would be easy to remove redundant sorts 
> in the following kind of query:
> {code}
> Seq((1), (3)).toDF("int").orderBy('int.asc).orderBy('int.desc).explain()
> == Physical Plan ==
> *(2) Sort [int#35 DESC NULLS LAST], true, 0
> +- Exchange rangepartitioning(int#35 DESC NULLS LAST, 200)
>+- *(1) Sort [int#35 ASC NULLS FIRST], true, 0
>   +- Exchange rangepartitioning(int#35 ASC NULLS FIRST, 200)
>  +- LocalTableScan [int#35]
> {code}
> There's no need to perform {{(1) Sort}}. Since the sort operator isn't 
> stable, AFAIK, it should be ok to remove a sort on any column that gets 
> 'overwritten' by a subsequent one in this way. 
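
For illustration, a minimal sketch of an optimizer rule in that spirit (collapse a 
Sort whose direct child is itself a Sort); this only sketches the core idea and is 
not the merged change:
{code:java}
import org.apache.spark.sql.catalyst.plans.logical.{LogicalPlan, Sort}
import org.apache.spark.sql.catalyst.rules.Rule

// The outer Sort fully determines the output order, so an immediately nested Sort
// can be dropped without changing the result.
object RemoveConsecutiveSorts extends Rule[LogicalPlan] {
  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case outer @ Sort(_, _, inner: Sort) => outer.copy(child = inner.child)
  }
}
{code}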



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24059) When blacklist disable always hash to a bad local directory may cause job failure

2018-04-23 Thread zhoukang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449175#comment-16449175
 ] 

zhoukang commented on SPARK-24059:
--

cc [~vanzin] [~srowen]
Does adding a retry make sense?
I will work on this if the fix makes sense!
Thanks

> When blacklist disable always hash to a bad local directory may cause job 
> failure
> -
>
> Key: SPARK-24059
> URL: https://issues.apache.org/jira/browse/SPARK-24059
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: zhoukang
>Priority: Major
>
> When the blacklist is disabled, always hashing temp shuffle files to a bad local 
> directory on the same executor can cause job failure.
> Like below:
> {code:java}
> java.io.FileNotFoundException: 
> /home/work/hdd8/yarn/xxx/nodemanager/usercache/xxx/appcache/application_1520502842490_20813/blockmgr-3beeddbf-bb83-4a74-ad7a-e796fe592b7c/27/temp_shuffle_159e6886-b76f-4d96-9600-aee62ada0fa9
>  (Read-only file system)
> java.io.FileNotFoundException: 
> /home/work/hdd8/yarn/xxx/nodemanager/usercache/xxx/appcache/application_1520502842490_20813/blockmgr-3beeddbf-bb83-4a74-ad7a-e796fe592b7c/06/temp_shuffle_ba7f0a29-8e02-4ffa-94f7-01f72d214821
>  (Read-only file system)
> java.io.FileNotFoundException: 
> /home/work/hdd8/yarn/xxx/nodemanager/usercache/xxx/appcache/application_1520502842490_20813/blockmgr-3beeddbf-bb83-4a74-ad7a-e796fe592b7c/32/temp_shuffle_7030256c-fc24-4d45-a901-be23c2c3fbd6
>  (Read-only file system)
> java.io.FileNotFoundException: 
> /home/work/hdd8/yarn/zjyprc-hadoop/nodemanager/usercache/h_message_push/appcache/application_1520502842490_20813/blockmgr-3beeddbf-bb83-4a74-ad7a-e796fe592b7c/14/temp_shuffle_65816622-6217-43b9-bc9e-e2f67dc9a9de
>  (Read-only file system)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24059) When blacklist disable always hash to a bad local directory may cause job failure

2018-04-23 Thread zhoukang (JIRA)
zhoukang created SPARK-24059:


 Summary: When blacklist disable always hash to a bad local 
directory may cause job failure
 Key: SPARK-24059
 URL: https://issues.apache.org/jira/browse/SPARK-24059
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: zhoukang


When the blacklist is disabled, always hashing temp shuffle files to a bad local 
directory on the same executor can cause job failure.
Like below:

{code:java}
java.io.FileNotFoundException: 
/home/work/hdd8/yarn/xxx/nodemanager/usercache/xxx/appcache/application_1520502842490_20813/blockmgr-3beeddbf-bb83-4a74-ad7a-e796fe592b7c/27/temp_shuffle_159e6886-b76f-4d96-9600-aee62ada0fa9
 (Read-only file system)

java.io.FileNotFoundException: 
/home/work/hdd8/yarn/xxx/nodemanager/usercache/xxx/appcache/application_1520502842490_20813/blockmgr-3beeddbf-bb83-4a74-ad7a-e796fe592b7c/06/temp_shuffle_ba7f0a29-8e02-4ffa-94f7-01f72d214821
 (Read-only file system)

java.io.FileNotFoundException: 
/home/work/hdd8/yarn/xxx/nodemanager/usercache/xxx/appcache/application_1520502842490_20813/blockmgr-3beeddbf-bb83-4a74-ad7a-e796fe592b7c/32/temp_shuffle_7030256c-fc24-4d45-a901-be23c2c3fbd6
 (Read-only file system)

java.io.FileNotFoundException: 
/home/work/hdd8/yarn/zjyprc-hadoop/nodemanager/usercache/h_message_push/appcache/application_1520502842490_20813/blockmgr-3beeddbf-bb83-4a74-ad7a-e796fe592b7c/14/temp_shuffle_65816622-6217-43b9-bc9e-e2f67dc9a9de
 (Read-only file system)
{code}





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24051) Incorrect results for certain queries using Java and Python APIs on Spark 2.3.0

2018-04-23 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449122#comment-16449122
 ] 

Hyukjin Kwon commented on SPARK-24051:
--

Thanks for elaborating. The details look very helpful. I will try to find some 
time to take a look as well.

> Incorrect results for certain queries using Java and Python APIs on Spark 
> 2.3.0
> ---
>
> Key: SPARK-24051
> URL: https://issues.apache.org/jira/browse/SPARK-24051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Emlyn Corrin
>Priority: Major
>
> I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) 
> query, demonstrated by the Java program below. It was simplified from a much 
> more complex query, but I'm having trouble simplifying it further without 
> removing the erroneous behaviour.
> {code:java}
> package sparktest;
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.*;
> import org.apache.spark.sql.expressions.Window;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.Metadata;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> import java.util.Arrays;
> public class Main {
> public static void main(String[] args) {
> SparkConf conf = new SparkConf()
> .setAppName("SparkTest")
> .setMaster("local[*]");
> SparkSession session = 
> SparkSession.builder().config(conf).getOrCreate();
> Row[] arr1 = new Row[]{
> RowFactory.create(1, 42),
> RowFactory.create(2, 99)};
> StructType sch1 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty()),
> new StructField("b", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset<Row> ds1 = session.createDataFrame(Arrays.asList(arr1), sch1);
> ds1.show();
> Row[] arr2 = new Row[]{
> RowFactory.create(3)};
> StructType sch2 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset<Row> ds2 = session.createDataFrame(Arrays.asList(arr2), sch2)
> .withColumn("b", functions.lit(0));
> ds2.show();
> Column[] cols = new Column[]{
> new Column("a"),
> new Column("b").as("b"),
> functions.count(functions.lit(1))
> .over(Window.partitionBy())
> .as("n")};
> Dataset<Row> ds = ds1
> .select(cols)
> .union(ds2.select(cols))
> .where(new Column("n").geq(1))
> .drop("n");
> ds.show();
> //ds.explain(true);
> }
> }
> {code}
> It just calculates the union of 2 datasets,
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> +---+---+
> {code}
> with
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  3|  0|
> +---+---+
> {code}
> The expected result is:
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> |  3|  0|
> +---+---+
> {code}
> but instead it prints:
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  0|
> |  2|  0|
> |  3|  0|
> +---+---+
> {code}
> notice how the value in column b is always zero, overriding the original 
> values in rows 1 and 2.
> Making seemingly trivial changes, such as replacing {{new Column("b").as("b"),}} 
> with just {{new Column("b"),}} or removing the {{where}} clause after the union, 
> makes it behave correctly again.
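
For comparison, a hypothetical Scala translation of the same shape (not from the 
reporter; column names and literal values copied from the Java program above), which 
may be handy for bisecting in spark-shell:
{code:java}
import spark.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val ds1 = Seq((1, 42), (2, 99)).toDF("a", "b")
val ds2 = Seq(3).toDF("a").withColumn("b", lit(0))

// Same projection for both sides: a, b aliased to itself, and a windowed count "n".
val cols = Seq(
  col("a"),
  col("b").as("b"),
  count(lit(1)).over(Window.partitionBy()).as("n"))

val ds = ds1.select(cols: _*)
  .union(ds2.select(cols: _*))
  .where(col("n") >= 1)
  .drop("n")

ds.show()
{code}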



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23850) We should not redact username|user|url from UI by default

2018-04-23 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449073#comment-16449073
 ] 

Marcelo Vanzin commented on SPARK-23850:


Anyway, I'll take a stab at cleaning this up tomorrow if I find the time.

> We should not redact username|user|url from UI by default
> -
>
> Key: SPARK-23850
> URL: https://issues.apache.org/jira/browse/SPARK-23850
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.1
>Reporter: Thomas Graves
>Priority: Major
>
> SPARK-22479 was filed to stop logging JDBC credentials, but it also added username 
> and url to the redaction list.  I'm not sure why these were added; by default they 
> do not raise security concerns, and being able to see them makes things more usable 
> by default.  Users with high security concerns can simply add them in their configs.
> Also, on YARN, just redacting url doesn't secure anything, because if you go to 
> the environment UI page you see all sorts of paths, and it's just confusing that 
> some of it is redacted and other parts aren't.  If this was specifically for JDBC, 
> I think it needs to be applied just there and not broadly.
> If we remove these, we need to test what the JDBC driver is going to log from 
> SPARK-22479.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23850) We should not redact username|user|url from UI by default

2018-04-23 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449072#comment-16449072
 ] 

Marcelo Vanzin commented on SPARK-23850:


Yeah, things like {{java.vendor.url}} end up redacted too...

But there's the issue of JDBC drivers allowing passwords in their URL. Though 
they generally allow the password to be provided separately in a properties 
object too, which is preferable.

Also, it turns out other parts of SQL use a different way of redacting things 
added in SPARK-22791. The way I read that, the code added to 
{{SaveIntoDataSourceCommand}} is now redundant, since paths that print plans 
will be automatically redacted by the code in {{QueryExecution}}.

That change added a separate config that defaults to the value of the config in 
core. If we change that to be a truly separate config instead of falling back to the 
core config, we could have different defaults (leaving the URL alone on the 
core side), but that changes behavior slightly.

> We should not redact username|user|url from UI by default
> -
>
> Key: SPARK-23850
> URL: https://issues.apache.org/jira/browse/SPARK-23850
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.1
>Reporter: Thomas Graves
>Priority: Major
>
> SPARK-22479 was filed to stop logging JDBC credentials, but it also added username 
> and url to the redaction list.  I'm not sure why these were added; by default they 
> do not raise security concerns, and being able to see them makes things more usable 
> by default.  Users with high security concerns can simply add them in their configs.
> Also, on YARN, just redacting url doesn't secure anything, because if you go to 
> the environment UI page you see all sorts of paths, and it's just confusing that 
> some of it is redacted and other parts aren't.  If this was specifically for JDBC, 
> I think it needs to be applied just there and not broadly.
> If we remove these, we need to test what the JDBC driver is going to log from 
> SPARK-22479.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23850) We should not redact username|user|url from UI by default

2018-04-23 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449054#comment-16449054
 ] 

Thomas Graves commented on SPARK-23850:
---

The url seems somewhat silly to me too; look at the environment page on YARN: at 
least in our environment it is redacted in a bunch of places that don't make 
sense.  If it's an issue with the thriftserver and certain urls, we should fix 
those separately.

> We should not redact username|user|url from UI by default
> -
>
> Key: SPARK-23850
> URL: https://issues.apache.org/jira/browse/SPARK-23850
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.1
>Reporter: Thomas Graves
>Priority: Major
>
> SPARK-22479 was filed to stop logging JDBC credentials, but it also added username 
> and url to the redaction list.  I'm not sure why these were added; by default they 
> do not raise security concerns, and being able to see them makes things more usable 
> by default.  Users with high security concerns can simply add them in their configs.
> Also, on YARN, just redacting url doesn't secure anything, because if you go to 
> the environment UI page you see all sorts of paths, and it's just confusing that 
> some of it is redacted and other parts aren't.  If this was specifically for JDBC, 
> I think it needs to be applied just there and not broadly.
> If we remove these, we need to test what the JDBC driver is going to log from 
> SPARK-22479.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23850) We should not redact username|user|url from UI by default

2018-04-23 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16449046#comment-16449046
 ] 

Marcelo Vanzin commented on SPARK-23850:


Unless I hear back from the people CC'ed above, I'm going to revert the 
offending part of the original change, which is this:

{noformat}
-  .createWithDefault("(?i)secret|password".r) 
+  .createWithDefault("(?i)secret|password|url|user|username".r)
{noformat}

I'm a little inclined to keep redacting the URL since at least Oracle allows 
the password to be defined in the URL; it would be useful to see the rest of 
the URL for debugging purposes, but that would require extra code.

But redacting the user name is pretty silly; if you look at the env page with 
this change, you see a lot of things like this:

{noformat}
user.home   *(redacted)
user.timezone   *(redacted)
user.country*(redacted)
{noformat}

Which is kinda pointless.
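
To make the effect concrete, here is a small standalone mock of key-based redaction 
with that default regex (this is only an illustration of the behavior, not Spark's 
actual redaction code path):
{code:java}
// Redact the value of any property whose *key* matches the regex.
val redactionRegex = "(?i)secret|password|url|user|username".r

def redact(props: Seq[(String, String)]): Seq[(String, String)] =
  props.map { case (k, v) =>
    if (redactionRegex.findFirstIn(k).isDefined) (k, "*(redacted)") else (k, v)
  }

redact(Seq(
  "user.home" -> "/home/someone",         // matches "user"    -> redacted, though harmless
  "spark.ssl.keyPassword" -> "hunter2"))  // matches "password" -> redacted, as intended
{code}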



> We should not redact username|user|url from UI by default
> -
>
> Key: SPARK-23850
> URL: https://issues.apache.org/jira/browse/SPARK-23850
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 2.2.1
>Reporter: Thomas Graves
>Priority: Major
>
> SPARK-22479 was filed to stop logging JDBC credentials, but it also added username 
> and url to the redaction list.  I'm not sure why these were added; by default they 
> do not raise security concerns, and being able to see them makes things more usable 
> by default.  Users with high security concerns can simply add them in their configs.
> Also, on YARN, just redacting url doesn't secure anything, because if you go to 
> the environment UI page you see all sorts of paths, and it's just confusing that 
> some of it is redacted and other parts aren't.  If this was specifically for JDBC, 
> I think it needs to be applied just there and not broadly.
> If we remove these, we need to test what the JDBC driver is going to log from 
> SPARK-22479.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23455) Default Params in ML should be saved separately

2018-04-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-23455:
--
Target Version/s: 2.4.0

> Default Params in ML should be saved separately
> ---
>
> Key: SPARK-23455
> URL: https://issues.apache.org/jira/browse/SPARK-23455
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
>
> We save ML's user-supplied params and default params as one entity in JSON. When 
> loading the saved models, we set all the loaded params on the created ML model 
> instances as user-supplied params.
> This causes some problems: e.g., if we strictly disallow some params from being set 
> at the same time, a default param can fail the param check because it is treated as 
> a user-supplied param after loading.
> The loaded default params should not be set as user-supplied params. We should save 
> ML default params separately in JSON.
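
As an illustration of the proposed layout (the field names here are assumptions, not 
the final format), the defaults could live in their own JSON object next to the 
user-supplied params:
{code:java}
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

// Keep explicitly-set params and defaults in separate maps, so a loader can restore
// defaults as defaults instead of treating them as user-supplied.
val userSuppliedParams = Map("maxIter" -> "10")
val defaultParams = Map("regParam" -> "0.0")

val metadata = ("paramMap" -> userSuppliedParams) ~ ("defaultParamMap" -> defaultParams)

println(compact(render(metadata)))
// {"paramMap":{"maxIter":"10"},"defaultParamMap":{"regParam":"0.0"}}
{code}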



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23455) Default Params in ML should be saved separately

2018-04-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-23455:
-

Assignee: Liang-Chi Hsieh

> Default Params in ML should be saved separately
> ---
>
> Key: SPARK-23455
> URL: https://issues.apache.org/jira/browse/SPARK-23455
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.4.0
>Reporter: Liang-Chi Hsieh
>Assignee: Liang-Chi Hsieh
>Priority: Major
>
> We save ML's user-supplied params and default params as one entity in JSON. When 
> loading the saved models, we set all the loaded params on the created ML model 
> instances as user-supplied params.
> This causes some problems: e.g., if we strictly disallow some params from being set 
> at the same time, a default param can fail the param check because it is treated as 
> a user-supplied param after loading.
> The loaded default params should not be set as user-supplied params. We should save 
> ML default params separately in JSON.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24058) Default Params in ML should be saved separately: Python API

2018-04-23 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448994#comment-16448994
 ] 

Joseph K. Bradley commented on SPARK-24058:
---

CCing [~viirya] since you're the natural one to take this.  Thanks!

> Default Params in ML should be saved separately: Python API
> ---
>
> Key: SPARK-24058
> URL: https://issues.apache.org/jira/browse/SPARK-24058
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Affects Versions: 2.4.0
>Reporter: Joseph K. Bradley
>Priority: Major
>
> See [SPARK-23455] for reference.  Since DefaultParamsReader has been changed 
> in Scala, we must change it for Python for Spark 2.4.0 as well in order to 
> keep the 2 in sync.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24058) Default Params in ML should be saved separately: Python API

2018-04-23 Thread Joseph K. Bradley (JIRA)
Joseph K. Bradley created SPARK-24058:
-

 Summary: Default Params in ML should be saved separately: Python 
API
 Key: SPARK-24058
 URL: https://issues.apache.org/jira/browse/SPARK-24058
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Affects Versions: 2.4.0
Reporter: Joseph K. Bradley


See [SPARK-23455] for reference.  Since DefaultParamsReader has been changed in 
Scala, we must change it for Python for Spark 2.4.0 as well in order to keep 
the 2 in sync.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24057) put the real data type in the AssertionError message

2018-04-23 Thread Huaxin Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao updated SPARK-24057:
---
Description: 
I had a wrong data type in one of my tests and got the following message: 

/spark/python/pyspark/sql/types.py", line 405, in __init__

assert isinstance(dataType, DataType), "dataType should be DataType"

AssertionError: dataType should be DataType

I checked types.py,  line 405, in __init__, it has

{{assert isinstance(dataType, DataType), "dataType should be DataType"}}

I think it meant to be 

{{assert isinstance(dataType, DataType), "%s should be %s"  % (dataType, 
DataType)}}

so the error message will be something like 

{{AssertionError: <the actual dataType that was passed> should be <class 'pyspark.sql.types.DataType'>}}

There are a couple of other places that have the same problem. 

  was:
I had a wrong data type in one of my tests and got the following message: 

/spark/python/pyspark/sql/types.py", line 405, in __init__

        assert isinstance(dataType, DataType), "dataType should be DataType"

    AssertionError: dataType should be DataType

 

I checked types.py,  line 405, in __init__, it has

assert isinstance(dataType, DataType), "dataType should be DataType"

I think it meant to be 

assert isinstance(dataType, DataType), "%s should be %s"  % (dataType, DataType)

so the error message will be something like 

AssertionError: <the actual dataType that was passed> should be <class 'pyspark.sql.types.DataType'>

There are a couple of other places that have the same problem. 


> put the real data type in the AssertionError message
> 
>
> Key: SPARK-24057
> URL: https://issues.apache.org/jira/browse/SPARK-24057
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> I had a wrong data type in one of my tests and got the following message: 
> /spark/python/pyspark/sql/types.py", line 405, in __init__
> assert isinstance(dataType, DataType), "dataType should be DataType"
> AssertionError: dataType should be DataType
> I checked types.py,  line 405, in __init__, it has
> {{assert isinstance(dataType, DataType), "dataType should be DataType"}}
> I think it meant to be 
> {{assert isinstance(dataType, DataType), "%s should be %s"  % (dataType, 
> DataType)}}
> so the error message will be something like 
> {{AssertionError: <the actual dataType that was passed> should be <class 'pyspark.sql.types.DataType'>}}
> There are a couple of other places that have the same problem. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24057) put the real data type in the AssertionError message

2018-04-23 Thread Huaxin Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448965#comment-16448965
 ] 

Huaxin Gao commented on SPARK-24057:


I will submit a PR to fix the problem. 

> put the real data type in the AssertionError message
> 
>
> Key: SPARK-24057
> URL: https://issues.apache.org/jira/browse/SPARK-24057
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Huaxin Gao
>Priority: Minor
>
> I had a wrong data type in one of my tests and got the following message: 
> /spark/python/pyspark/sql/types.py", line 405, in __init__
>         assert isinstance(dataType, DataType), "dataType should be DataType"
>     AssertionError: dataType should be DataType
>  
> I checked types.py,  line 405, in __init__, it has
> assert isinstance(dataType, DataType), "dataType should be DataType"
> I think it meant to be 
> assert isinstance(dataType, DataType), "%s should be %s"  % (dataType, 
> DataType)
> so the error message will be something like 
> AssertionError: <the actual dataType that was passed> should be <class 'pyspark.sql.types.DataType'>
> There are a couple of other places that have the same problem. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24057) put the real data type in the AssertionError message

2018-04-23 Thread Huaxin Gao (JIRA)
Huaxin Gao created SPARK-24057:
--

 Summary: put the real data type in the AssertionError message
 Key: SPARK-24057
 URL: https://issues.apache.org/jira/browse/SPARK-24057
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 2.3.0
Reporter: Huaxin Gao


I had a wrong data type in one of my tests and got the following message: 

/spark/python/pyspark/sql/types.py", line 405, in __init__

        assert isinstance(dataType, DataType), "dataType should be DataType"

    AssertionError: dataType should be DataType

 

I checked types.py,  line 405, in __init__, it has

assert isinstance(dataType, DataType), "dataType should be DataType"

I think it meant to be 

assert isinstance(dataType, DataType), "%s should be %s"  % (dataType, DataType)

so the error message will be something like 

AssertionError: <the actual dataType that was passed> should be <class 'pyspark.sql.types.DataType'>

There are a couple of other places that have the same problem. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24056) Make consumer creation lazy in Kafka source for Structured streaming

2018-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24056:


Assignee: Apache Spark  (was: Tathagata Das)

> Make consumer creation lazy in Kafka source for Structured streaming
> 
>
> Key: SPARK-24056
> URL: https://issues.apache.org/jira/browse/SPARK-24056
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Tathagata Das
>Assignee: Apache Spark
>Priority: Minor
>
> Currently, the driver side of the Kafka source (i.e. KafkaMicroBatchReader) 
> eagerly creates a consumer as soon as the KafkaMicroBatchReader is created. 
> However, we create a dummy KafkaMicroBatchReader just to get the schema and 
> immediately stop it. It's better to make the consumer creation lazy, so it will 
> be created on the first attempt to fetch offsets using the KafkaOffsetReader.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24056) Make consumer creation lazy in Kafka source for Structured streaming

2018-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24056?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24056:


Assignee: Tathagata Das  (was: Apache Spark)

> Make consumer creation lazy in Kafka source for Structured streaming
> 
>
> Key: SPARK-24056
> URL: https://issues.apache.org/jira/browse/SPARK-24056
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Minor
>
> Currently, the driver side of the Kafka source (i.e. KafkaMicroBatchReader) 
> eagerly creates a consumer as soon as the KafkaMicroBatchReader is created. 
> However, we create a dummy KafkaMicroBatchReader just to get the schema and 
> immediately stop it. It's better to make the consumer creation lazy, so it will 
> be created on the first attempt to fetch offsets using the KafkaOffsetReader.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24056) Make consumer creation lazy in Kafka source for Structured streaming

2018-04-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448925#comment-16448925
 ] 

Apache Spark commented on SPARK-24056:
--

User 'tdas' has created a pull request for this issue:
https://github.com/apache/spark/pull/21134

> Make consumer creation lazy in Kafka source for Structured streaming
> 
>
> Key: SPARK-24056
> URL: https://issues.apache.org/jira/browse/SPARK-24056
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.4.0
>Reporter: Tathagata Das
>Assignee: Tathagata Das
>Priority: Minor
>
> Currently, the driver side of the Kafka source (i.e. KafkaMicroBatchReader) 
> eagerly creates a consumer as soon as the KafkaMicroBatchReader is created. 
> However, we create a dummy KafkaMicroBatchReader just to get the schema and 
> immediately stop it. It's better to make the consumer creation lazy, so it will 
> be created on the first attempt to fetch offsets using the KafkaOffsetReader.
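
The core idiom being proposed is lazy initialization; a trimmed, generic sketch 
(hypothetical names, not the actual KafkaOffsetReader code):
{code:java}
// The expensive resource is not created when the reader is constructed,
// only when offsets are first fetched.
class LazyOffsetReader[C](createConsumer: () => C) {
  private lazy val consumer: C = createConsumer()   // deferred until first use

  def fetchOffsets(fetch: C => Map[Int, Long]): Map[Int, Long] =
    fetch(consumer)                                 // first call triggers consumer creation
}
{code}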



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23961) pyspark toLocalIterator throws an exception

2018-04-23 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448884#comment-16448884
 ] 

Dongjoon Hyun commented on SPARK-23961:
---

Yep. I'm also able to observe this for all Spark 2.X (2.0 ~ 2.3). 
`toLocalIterator` was introduced in Spark 2.0.

> pyspark toLocalIterator throws an exception
> ---
>
> Key: SPARK-23961
> URL: https://issues.apache.org/jira/browse/SPARK-23961
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Michel Lemay
>Priority: Minor
>  Labels: DataFrame, pyspark
>
> Given a DataFrame, use toLocalIterator. If we do not consume all records, 
> it will throw: 
> {quote}ERROR PythonRDD: Error while sending iterator
>  java.net.SocketException: Connection reset by peer: socket write error
>  at java.net.SocketOutputStream.socketWrite0(Native Method)
>  at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
>  at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
>  at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
>  at java.io.DataOutputStream.write(DataOutputStream.java:107)
>  at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
>  at 
> org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:497)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>  at 
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509)
>  at 
> org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:705)
>  at 
> org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705)
>  at 
> org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
>  at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:706)
> {quote}
>  
> To reproduce, here is a simple pyspark shell script that shows the error:
> {quote}import itertools
>  df = spark.read.parquet("large parquet folder").cache()
> print(df.count())
>  b = df.toLocalIterator()
>  print(len(list(itertools.islice(b, 20))))
>  b = None # Make the iterator go out of scope.  Throws here.
> {quote}
>  
> Observations:
>  * Consuming all records does not throw.  Taking only a subset of the 
> partitions creates the error.
>  * In another experiment, doing the same on a regular RDD works if we 
> cache/materialize it. If we do not cache the RDD, it throws similarly.
>  * It works in the Scala shell.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23961) pyspark toLocalIterator throws an exception

2018-04-23 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23961?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23961:
--
Affects Version/s: 2.0.2
   2.1.2
   2.3.0

> pyspark toLocalIterator throws an exception
> ---
>
> Key: SPARK-23961
> URL: https://issues.apache.org/jira/browse/SPARK-23961
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Michel Lemay
>Priority: Minor
>  Labels: DataFrame, pyspark
>
> Given a DataFrame, use toLocalIterator. If we do not consume all records, 
> it will throw: 
> {quote}ERROR PythonRDD: Error while sending iterator
>  java.net.SocketException: Connection reset by peer: socket write error
>  at java.net.SocketOutputStream.socketWrite0(Native Method)
>  at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
>  at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
>  at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
>  at java.io.DataOutputStream.write(DataOutputStream.java:107)
>  at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
>  at 
> org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:497)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509)
>  at 
> org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:509)
>  at scala.collection.Iterator$class.foreach(Iterator.scala:893)
>  at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
>  at 
> org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:509)
>  at 
> org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:705)
>  at 
> org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705)
>  at 
> org.apache.spark.api.python.PythonRDD$$anon$2$$anonfun$run$1.apply(PythonRDD.scala:705)
>  at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1337)
>  at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:706)
> {quote}
>  
> To reproduce, here is a simple pyspark shell script that shows the error:
> {quote}import itertools
>  df = spark.read.parquet("large parquet folder").cache()
> print(df.count())
>  b = df.toLocalIterator()
>  print(len(list(itertools.islice(b, 20))))
>  b = None # Make the iterator go out of scope.  Throws here.
> {quote}
>  
> Observations:
>  * Consuming all records does not throw.  Taking only a subset of the 
> partitions creates the error.
>  * In another experiment, doing the same on a regular RDD works if we 
> cache/materialize it. If we do not cache the RDD, it throws similarly.
>  * It works in the Scala shell.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24056) Make consumer creation lazy in Kafka source for Structured streaming

2018-04-23 Thread Tathagata Das (JIRA)
Tathagata Das created SPARK-24056:
-

 Summary: Make consumer creation lazy in Kafka source for 
Structured streaming
 Key: SPARK-24056
 URL: https://issues.apache.org/jira/browse/SPARK-24056
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.4.0
Reporter: Tathagata Das
Assignee: Tathagata Das


Currently, the driver side of the Kafka source (i.e. KafkaMicroBatchReader) 
eagerly creates a consumer as soon as the KafkaMicroBatchReader is created. 
However, we create a dummy KafkaMicroBatchReader just to get the schema and 
immediately stop it. It's better to make the consumer creation lazy; it will be 
created on the first attempt to fetch offsets using the KafkaOffsetReader.
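
As a rough illustration of the lazy-creation pattern being proposed (the class and method names below are made up for the sketch, not the actual KafkaOffsetReader API):

{code:scala}
import java.{util => ju}
import org.apache.kafka.clients.consumer.{Consumer, KafkaConsumer}

// Illustrative sketch only: the consumer is not created at construction time,
// so a dummy reader built just to obtain the schema never talks to Kafka.
class LazyOffsetReader(kafkaParams: ju.Map[String, Object]) {

  private var consumer: Consumer[Array[Byte], Array[Byte]] = _

  private def getOrCreateConsumer(): Consumer[Array[Byte], Array[Byte]] = synchronized {
    if (consumer == null) {
      consumer = new KafkaConsumer[Array[Byte], Array[Byte]](kafkaParams)
    }
    consumer
  }

  // The first call that actually needs offsets is what triggers creation.
  def fetchOffsets(topics: ju.Collection[String]): Unit = {
    val c = getOrCreateConsumer()
    c.subscribe(topics)
    c.poll(0)
  }

  def close(): Unit = synchronized {
    if (consumer != null) consumer.close()
  }
}
{code}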



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23799) [CBO] FilterEstimation.evaluateInSet produces division by zero in a case of empty table with analyzed statistics

2018-04-23 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li updated SPARK-23799:

Fix Version/s: (was: 2.3.1)

> [CBO] FilterEstimation.evaluateInSet produces division by zero in a case of 
> empty table with analyzed statistics
> 
>
> Key: SPARK-23799
> URL: https://issues.apache.org/jira/browse/SPARK-23799
> Project: Spark
>  Issue Type: Bug
>  Components: Optimizer
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Michael Shtelma
>Assignee: Michael Shtelma
>Priority: Major
> Fix For: 2.4.0
>
>
> Spark 2.2.1 and 2.3.0 can produce a NumberFormatException (see below) during 
> the analysis of queries that use previously analyzed Hive tables. 
> The NumberFormatException occurs because in [FilterEstimation.scala on lines 
> 50 and 
> 52|https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala?utf8=%E2%9C%93#L50-L52]
>  the method calculateFilterSelectivity returns NaN, which is caused by 
> division by zero. This leads to a NumberFormatException during conversion from 
> Double to BigDecimal. 
> NaN is caused by division by zero in the evaluateInSet method (a small guard 
> sketch follows the stack trace below). 
> Exception:
> java.lang.NumberFormatException
> at java.math.BigDecimal.(BigDecimal.java:494)
> at java.math.BigDecimal.(BigDecimal.java:824)
> at scala.math.BigDecimal$.decimal(BigDecimal.scala:52)
> at scala.math.BigDecimal$.decimal(BigDecimal.scala:55)
> at scala.math.BigDecimal$.double2bigDecimal(BigDecimal.scala:343)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.FilterEstimation.estimate(FilterEstimation.scala:52)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitFilter(BasicStatsPlanVisitor.scala:43)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitFilter(BasicStatsPlanVisitor.scala:25)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor$class.visit(LogicalPlanVisitor.scala:30)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visit(BasicStatsPlanVisitor.scala:25)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:35)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:33)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$class.stats(LogicalPlanStats.scala:33)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:30)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils$$anonfun$rowCountsExist$1.apply(EstimationUtils.scala:32)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils$$anonfun$rowCountsExist$1.apply(EstimationUtils.scala:32)
> at 
> scala.collection.IndexedSeqOptimized$class.prefixLengthImpl(IndexedSeqOptimized.scala:38)
> at 
> scala.collection.IndexedSeqOptimized$class.forall(IndexedSeqOptimized.scala:43)
> at scala.collection.mutable.WrappedArray.forall(WrappedArray.scala:35)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.EstimationUtils$.rowCountsExist(EstimationUtils.scala:32)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.ProjectEstimation$.estimate(ProjectEstimation.scala:27)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitProject(BasicStatsPlanVisitor.scala:63)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visitProject(BasicStatsPlanVisitor.scala:25)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlanVisitor$class.visit(LogicalPlanVisitor.scala:37)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.BasicStatsPlanVisitor$.visit(BasicStatsPlanVisitor.scala:25)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:35)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$$anonfun$stats$1.apply(LogicalPlanStats.scala:33)
> at scala.Option.getOrElse(Option.scala:121)
> at 
> org.apache.spark.sql.catalyst.plans.logical.statsEstimation.LogicalPlanStats$class.stats(LogicalPlanStats.scala:33)
> at 
> org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.stats(LogicalPlan.scala:30)
> at 
> org.apache.spark.sql.catalyst.optimizer.CostBasedJoinReorder$$anonfun$2.apply(CostBasedJoinReorder.scala:64)
> at 
> 
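
A hedged sketch of the zero-guard implied by the description above (the function name and signature are invented for illustration; this is not the actual FilterEstimation code):

{code:scala}
// When the analyzed table is empty, both counts are 0 and 0.0 / 0.0 is NaN,
// which later fails the Double -> BigDecimal conversion. Guarding the
// denominator keeps the selectivity a finite value in [0, 1].
def inSetSelectivity(matchedCount: Double, totalCount: Double): Double = {
  if (totalCount == 0.0) 0.0 // empty table: nothing can match
  else math.min(matchedCount / totalCount, 1.0)
}

assert(!inSetSelectivity(0.0, 0.0).isNaN) // no NaN for an empty table
{code}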

[jira] [Updated] (SPARK-24055) Add e2e test for using kubectl proxy for submitting spark jobs

2018-04-23 Thread Anirudh Ramanathan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anirudh Ramanathan updated SPARK-24055:
---
Summary: Add e2e test for using kubectl proxy for submitting spark jobs  
(was: Add e2e test for using kubectl proxy for submission)

> Add e2e test for using kubectl proxy for submitting spark jobs
> --
>
> Key: SPARK-24055
> URL: https://issues.apache.org/jira/browse/SPARK-24055
> Project: Spark
>  Issue Type: Test
>  Components: Kubernetes
>Affects Versions: 3.0.0
>Reporter: Anirudh Ramanathan
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24013) ApproximatePercentile grinds to a halt on sorted input.

2018-04-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448854#comment-16448854
 ] 

Apache Spark commented on SPARK-24013:
--

User 'mgaido91' has created a pull request for this issue:
https://github.com/apache/spark/pull/21133

> ApproximatePercentile grinds to a halt on sorted input.
> ---
>
> Key: SPARK-24013
> URL: https://issues.apache.org/jira/browse/SPARK-24013
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Juliusz Sompolski
>Priority: Major
> Attachments: screenshot-1.png
>
>
> Running
> {code}
> sql("select approx_percentile(rid, array(0.1)) from (select rand() as rid 
> from range(1000))").collect()
> {code}
> takes 7 seconds, while
> {code}
> sql("select approx_percentile(id, array(0.1)) from range(1000)").collect()
> {code}
> grinds to a halt - processes the first million rows quickly, and then slows 
> down to a few thousand rows / second (4m rows processed after 20 minutes).
> Thread dumps show that it spends time in QuantileSummary.compress.
> Seems it hits some edge case inefficiency when dealing with sorted data?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24055) Add e2e test for using kubectl proxy for submission

2018-04-23 Thread Anirudh Ramanathan (JIRA)
Anirudh Ramanathan created SPARK-24055:
--

 Summary: Add e2e test for using kubectl proxy for submission
 Key: SPARK-24055
 URL: https://issues.apache.org/jira/browse/SPARK-24055
 Project: Spark
  Issue Type: Test
  Components: Kubernetes
Affects Versions: 3.0.0
Reporter: Anirudh Ramanathan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24013) ApproximatePercentile grinds to a halt on sorted input.

2018-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24013:


Assignee: Apache Spark

> ApproximatePercentile grinds to a halt on sorted input.
> ---
>
> Key: SPARK-24013
> URL: https://issues.apache.org/jira/browse/SPARK-24013
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Juliusz Sompolski
>Assignee: Apache Spark
>Priority: Major
> Attachments: screenshot-1.png
>
>
> Running
> {code}
> sql("select approx_percentile(rid, array(0.1)) from (select rand() as rid 
> from range(1000))").collect()
> {code}
> takes 7 seconds, while
> {code}
> sql("select approx_percentile(id, array(0.1)) from range(1000)").collect()
> {code}
> grinds to a halt - processes the first million rows quickly, and then slows 
> down to a few thousand rows / second (4m rows processed after 20 minutes).
> Thread dumps show that it spends time in QuantileSummary.compress.
> Seems it hits some edge case inefficiency when dealing with sorted data?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24013) ApproximatePercentile grinds to a halt on sorted input.

2018-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24013:


Assignee: (was: Apache Spark)

> ApproximatePercentile grinds to a halt on sorted input.
> ---
>
> Key: SPARK-24013
> URL: https://issues.apache.org/jira/browse/SPARK-24013
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Juliusz Sompolski
>Priority: Major
> Attachments: screenshot-1.png
>
>
> Running
> {code}
> sql("select approx_percentile(rid, array(0.1)) from (select rand() as rid 
> from range(1000))").collect()
> {code}
> takes 7 seconds, while
> {code}
> sql("select approx_percentile(id, array(0.1)) from range(1000)").collect()
> {code}
> grinds to a halt - processes the first million rows quickly, and then slows 
> down to a few thousand rows / second (4m rows processed after 20 minutes).
> Thread dumps show that it spends time in QuantileSummary.compress.
> Seems it hits some edge case inefficiency when dealing with sorted data?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24029) Set "reuse address" flag on listen sockets

2018-04-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24029?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448838#comment-16448838
 ] 

Apache Spark commented on SPARK-24029:
--

User 'vanzin' has created a pull request for this issue:
https://github.com/apache/spark/pull/21132

> Set "reuse address" flag on listen sockets
> --
>
> Key: SPARK-24029
> URL: https://issues.apache.org/jira/browse/SPARK-24029
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Assignee: Marcelo Vanzin
>Priority: Minor
> Fix For: 2.4.0
>
>
> Setting the "reuse address" option allows a server socket to be bound to a 
> port that may still have connections in the "close wait" state. Without it, 
> for example, re-starting the history server could result in a BindException.
> This matters most for things that have explicit ports, but it doesn't hurt 
> in other places either, especially since Spark allows most servers to bind to a 
> port range by setting the root port + the retry count.
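
For reference, a minimal sketch of what the flag does at the plain java.net level (illustrative only; Spark's own servers go through Netty/Jetty, and the port below is just an example):

{code:scala}
import java.net.{InetSocketAddress, ServerSocket}

// SO_REUSEADDR must be enabled before bind(); it lets a restarted server
// rebind a port that still has lingering connections from the previous run.
val server = new ServerSocket()
server.setReuseAddress(true)
server.bind(new InetSocketAddress(18080)) // example port only
println(s"Listening on port ${server.getLocalPort}")
server.close()
{code}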



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-04-23 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448834#comment-16448834
 ] 

Xiao Li commented on SPARK-23874:
-

[~bryanc] Was the fix only merged into Arrow 0.10? Will Arrow also ship a 
0.9.1 release? I am just afraid Arrow 0.10 might also introduce some regressions. 

> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
>
> Version 0.9.0 of apache arrow comes with a bug fix related to array 
> serialization. 
> https://issues.apache.org/jira/browse/ARROW-1973



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24023) Built-in SQL Functions improvement in SparkR

2018-04-23 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448835#comment-16448835
 ] 

Shivaram Venkataraman commented on SPARK-24023:
---

Thanks [~hyukjin.kwon]. +1 to what Felix said. I think it's fine to have one 
sub-task be a collection of functions if they are all small and related.

> Built-in SQL Functions improvement in SparkR
> 
>
> Key: SPARK-24023
> URL: https://issues.apache.org/jira/browse/SPARK-24023
> Project: Spark
>  Issue Type: Umbrella
>  Components: SparkR
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This JIRA targets adding the R functions corresponding to SPARK-23899. We 
> usually add a function with Scala alone or with both Scala and Python 
> APIs.
> It could be messy if there are duplicates for the R side in SPARK-23899. 
> A follow-up for each JIRA might be possible, but that is again messy to manage.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23888) speculative task should not run on a given host where another attempt is already running on

2018-04-23 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid reassigned SPARK-23888:


Assignee: wuyi

> speculative task should not run on a given host where another attempt is 
> already running on
> ---
>
> Key: SPARK-23888
> URL: https://issues.apache.org/jira/browse/SPARK-23888
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.3.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>  Labels: speculation
> Fix For: 2.4.0
>
>
>  
> There's a bug in:
> {code:java}
> /** Check whether a task is currently running an attempt on a given host */
>  private def hasAttemptOnHost(taskIndex: Int, host: String): Boolean = {
>    taskAttempts(taskIndex).exists(_.host == host)
>  }
> {code}
> This will ignore hosts which have finished attempts, so we should check 
> whether an attempt is currently running on the given host. 
> And it is possible for a speculative task to run on a host where another 
> attempt failed before.
> Assume we have only two machines: host1 and host2. We first run task0.0 on 
> host1. Then, due to a long time waiting for task0.0, we launch a speculative 
> task0.1 on host2. Eventually task0.0 failed on host1, but it cannot re-run 
> since there's already a copy running on host2. After another long time, we 
> launch a new speculative task0.2. Now we can run task0.2 on host1 
> again, since there's no more running attempt on host1.
> After discussion in the PR, we simply make the comment consistent with the 
> method's behavior. See details in PR#20998.
>  
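
For context, a self-contained sketch of the stricter "running attempts only" check that was discussed (the case class stands in for Spark's TaskInfo; this is not the merged change, which kept the existing behavior and adjusted the comment instead):

{code:scala}
// Hypothetical sketch: only count attempts that are still running on the host,
// not attempts that already finished or failed there.
case class AttemptInfo(host: String, running: Boolean)

def hasRunningAttemptOnHost(attempts: Seq[AttemptInfo], host: String): Boolean =
  attempts.exists(a => a.running && a.host == host)

// A finished attempt on host1 no longer blocks a new speculative copy there:
val attempts = Seq(AttemptInfo("host1", running = false), AttemptInfo("host2", running = true))
assert(!hasRunningAttemptOnHost(attempts, "host1"))
assert(hasRunningAttemptOnHost(attempts, "host2"))
{code}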



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23888) speculative task should not run on a given host where another attempt is already running on

2018-04-23 Thread Imran Rashid (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Imran Rashid resolved SPARK-23888.
--
   Resolution: Fixed
Fix Version/s: (was: 2.3.0)
   2.4.0

Issue resolved by pull request 20998
[https://github.com/apache/spark/pull/20998]

> speculative task should not run on a given host where another attempt is 
> already running on
> ---
>
> Key: SPARK-23888
> URL: https://issues.apache.org/jira/browse/SPARK-23888
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler, Spark Core
>Affects Versions: 2.3.0
>Reporter: wuyi
>Assignee: wuyi
>Priority: Major
>  Labels: speculation
> Fix For: 2.4.0
>
>
>  
> There's a bug in:
> {code:java}
> /** Check whether a task is currently running an attempt on a given host */
>  private def hasAttemptOnHost(taskIndex: Int, host: String): Boolean = {
>    taskAttempts(taskIndex).exists(_.host == host)
>  }
> {code}
> This will ignore hosts which have finished attempts, so we should check 
> whether an attempt is currently running on the given host. 
> And it is possible for a speculative task to run on a host where another 
> attempt failed before.
> Assume we have only two machines: host1 and host2. We first run task0.0 on 
> host1. Then, due to a long time waiting for task0.0, we launch a speculative 
> task0.1 on host2. Eventually task0.0 failed on host1, but it cannot re-run 
> since there's already a copy running on host2. After another long time, we 
> launch a new speculative task0.2. Now we can run task0.2 on host1 
> again, since there's no more running attempt on host1.
> After discussion in the PR, we simply make the comment consistent with the 
> method's behavior. See details in PR#20998.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23433) java.lang.IllegalStateException: more than one active taskSet for stage

2018-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23433:


Assignee: Apache Spark

> java.lang.IllegalStateException: more than one active taskSet for stage
> ---
>
> Key: SPARK-23433
> URL: https://issues.apache.org/jira/browse/SPARK-23433
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Shixiong Zhu
>Assignee: Apache Spark
>Priority: Major
>
> The following error thrown by DAGScheduler stopped the cluster:
> {code}
> 18/02/11 13:22:27 ERROR DAGSchedulerEventProcessLoop: 
> DAGSchedulerEventProcessLoop failed; shutting down SparkContext
> java.lang.IllegalStateException: more than one active taskSet for stage 
> 7580621: 7580621.2,7580621.1
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:229)
>   at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1193)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:1059)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:900)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:899)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:899)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1427)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1929)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1880)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1868)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23433) java.lang.IllegalStateException: more than one active taskSet for stage

2018-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23433:


Assignee: (was: Apache Spark)

> java.lang.IllegalStateException: more than one active taskSet for stage
> ---
>
> Key: SPARK-23433
> URL: https://issues.apache.org/jira/browse/SPARK-23433
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Shixiong Zhu
>Priority: Major
>
> The following error thrown by DAGScheduler stopped the cluster:
> {code}
> 18/02/11 13:22:27 ERROR DAGSchedulerEventProcessLoop: 
> DAGSchedulerEventProcessLoop failed; shutting down SparkContext
> java.lang.IllegalStateException: more than one active taskSet for stage 
> 7580621: 7580621.2,7580621.1
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:229)
>   at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1193)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:1059)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:900)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:899)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:899)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1427)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1929)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1880)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1868)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23433) java.lang.IllegalStateException: more than one active taskSet for stage

2018-04-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448822#comment-16448822
 ] 

Apache Spark commented on SPARK-23433:
--

User 'squito' has created a pull request for this issue:
https://github.com/apache/spark/pull/21131

> java.lang.IllegalStateException: more than one active taskSet for stage
> ---
>
> Key: SPARK-23433
> URL: https://issues.apache.org/jira/browse/SPARK-23433
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Shixiong Zhu
>Priority: Major
>
> The following error thrown by DAGScheduler stopped the cluster:
> {code}
> 18/02/11 13:22:27 ERROR DAGSchedulerEventProcessLoop: 
> DAGSchedulerEventProcessLoop failed; shutting down SparkContext
> java.lang.IllegalStateException: more than one active taskSet for stage 
> 7580621: 7580621.2,7580621.1
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.submitTasks(TaskSchedulerImpl.scala:229)
>   at 
> org.apache.spark.scheduler.DAGScheduler.submitMissingTasks(DAGScheduler.scala:1193)
>   at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:1059)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:900)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$submitWaitingChildStages$6.apply(DAGScheduler.scala:899)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.scheduler.DAGScheduler.submitWaitingChildStages(DAGScheduler.scala:899)
>   at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskCompletion(DAGScheduler.scala:1427)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1929)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1880)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1868)
>   at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23874) Upgrade apache/arrow to 0.10.0

2018-04-23 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23874:
--
Summary: Upgrade apache/arrow to 0.10.0  (was: Upgrade apache/arrow to 
0.9.0)

> Upgrade apache/arrow to 0.10.0
> --
>
> Key: SPARK-23874
> URL: https://issues.apache.org/jira/browse/SPARK-23874
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Assignee: Bryan Cutler
>Priority: Major
>
> Version 0.9.0 of apache arrow comes with a bug fix related to array 
> serialization. 
> https://issues.apache.org/jira/browse/ARROW-1973



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23004) Structured Streaming raises "IllegalStateException: Cannot remove after already committed or aborted"

2018-04-23 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-23004.
---
   Resolution: Fixed
Fix Version/s: 2.3.1
   3.0.0

Issue resolved by pull request 21124
[https://github.com/apache/spark/pull/21124]

> Structured Streaming raises "IllegalStateException: Cannot remove after already 
> committed or aborted"
> ---
>
> Key: SPARK-23004
> URL: https://issues.apache.org/jira/browse/SPARK-23004
> Project: Spark
>  Issue Type: Bug
>  Components: Structured Streaming
>Affects Versions: 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1, 2.3.0
> Environment: Running on yarn or local both raises the exception.
>Reporter: secfree
>Assignee: Tathagata Das
>Priority: Major
> Fix For: 3.0.0, 2.3.1
>
>
> A structured streaming query with a streaming aggregation can throw the 
> following error in rare cases. 
> {code:java}
> java.lang.IllegalStateException: Cannot remove after already committed or 
> aborted at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$verify(HDFSBackedStateStoreProvider.scala:659
>  ) at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$HDFSBackedStateStore.remove(HDFSBackedStateStoreProvider.scala:143)
>  at 
> org.apache.spark.sql.execution.streaming.StateStoreSaveExec$$anonfun$doExecute$3$$anon$1.hasNext(statefulOperators.scala:233)
>  at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.processInputs(ObjectAggregationIterator.scala:191)
>  at 
> org.apache.spark.sql.execution.aggregate.ObjectAggregationIterator.(ObjectAggregationIterator.scala:80)
>  at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:111)
>  at 
> org.apache.spark.sql.execution.aggregate.ObjectHashAggregateExec$$anonfun$doExecute$1$$anonfun$2.apply(ObjectHashAggregateExec.scala:103)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>  at 
> org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
>  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) 
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at 
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at 
> org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at 
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at 
> org.apache.spark.scheduler.Task.run(Task.scala:108) at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745) 18-01-05 13:29:57 WARN 
> TaskSetManager:66 Lost task 68.0 in stage 1933.0 (TID 196171, localhost, 
> executor driver): java.lang.IllegalStateException: Cannot remove after 
> already committed or aborted at 
> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$verify(HDFSBackedStateStoreProvider.scala:659){code}
>  
> This can happen when the following conditions are accidentally hit. 
>  # Streaming aggregation with aggregation function that is a subset of 
> {{[TypedImperativeAggregation|https://github.com/apache/spark/blob/76b8b840ddc951ee6203f9cccd2c2b9671c1b5e8/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala#L473]}}
>  (for example, {{collect_set}}, {{collect_list}}, {{percentile}}, etc.). 
>  # Query running in {{update}} mode
>  # After the shuffle, a partition has exactly 128 records. 
> This happens because of the following. 
>  # The {{StateStoreSaveExec}} used in streaming aggregations has the 
> [following 
> logic|https://github.com/apache/spark/blob/76b8b840ddc951ee6203f9cccd2c2b9671c1b5e8/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/statefulOperators.scala#L359]
>  when used in {{update}} mode.
>  ## There is an iterator that reads data from its parent iterator and updates 
> the StateStore.
>  ## When the parent iterator is fully consumed (i.e. {{baseIterator.hasNext}} 
> returns false) then all state changes are committed by calling 
> {{StateStore.commit}}. 
>  ## The implementation of {{StateStore.commit()}} in {{HDFSBackedStateStore}} 
> does not 

[jira] [Resolved] (SPARK-24053) Support add subdirectory named as user name on staging directory

2018-04-23 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-24053.

Resolution: Not A Bug

Copy & paste of PR comments:

{quote}
> When we have multiple users on the same cluster

Then the staging directory would be created under the respective user already 
(it's created under the user's home directory). I have no idea why you need 
this.

Moreover, you can set spark.yarn.stagingDir to whatever you want, and may even 
reference env variables or system properties. e.g.

spark.yarn.stagingDir=/tmp/${system:user.name}

And the staging directory will be created under that location. So really you 
don't need to write any code for this, even if you have a weird deployment 
where users don't have their own home directory.
{quote}

>  Support add subdirectory named as user name on staging directory
> -
>
> Key: SPARK-24053
> URL: https://issues.apache.org/jira/browse/SPARK-24053
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: zhoukang
>Priority: Minor
>
> When we have multiple users on the same cluster, we can support adding a 
> subdirectory named after the login user.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-21168) KafkaRDD should always set kafka clientId.

2018-04-23 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger resolved SPARK-21168.

   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 19887
[https://github.com/apache/spark/pull/19887]

> KafkaRDD should always set kafka clientId.
> --
>
> Key: SPARK-21168
> URL: https://issues.apache.org/jira/browse/SPARK-21168
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
>Reporter: Xingxing Di
>Assignee: liuzhaokun
>Priority: Trivial
> Fix For: 2.4.0
>
>
> I found that KafkaRDD does not set the kafka client.id in the "fetchBatch" method 
> (FetchRequestBuilder sets clientId to empty by default). Normally this 
> affects nothing, but in our case we use the clientId on the kafka server side, 
> so we have to rebuild spark-streaming-kafka.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-21168) KafkaRDD should always set kafka clientId.

2018-04-23 Thread Cody Koeninger (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Cody Koeninger reassigned SPARK-21168:
--

Assignee: liuzhaokun

> KafkaRDD should always set kafka clientId.
> --
>
> Key: SPARK-21168
> URL: https://issues.apache.org/jira/browse/SPARK-21168
> Project: Spark
>  Issue Type: Bug
>  Components: DStreams
>Affects Versions: 2.0.2
>Reporter: Xingxing Di
>Assignee: liuzhaokun
>Priority: Trivial
> Fix For: 2.4.0
>
>
> I found that KafkaRDD does not set the kafka client.id in the "fetchBatch" method 
> (FetchRequestBuilder sets clientId to empty by default). Normally this 
> affects nothing, but in our case we use the clientId on the kafka server side, 
> so we have to rebuild spark-streaming-kafka.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23996) Implement the optimal KLL algorithms for quantiles in streams

2018-04-23 Thread Miao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448639#comment-16448639
 ] 

Miao Wang commented on SPARK-23996:
---

Thanks for your reply! I am learning the core part of the algorithm. 

> Implement the optimal KLL algorithms for quantiles in streams
> -
>
> Key: SPARK-23996
> URL: https://issues.apache.org/jira/browse/SPARK-23996
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, SQL
>Affects Versions: 2.3.0
>Reporter: Timothy Hunter
>Priority: Major
>
> The current implementation for approximate quantiles - a variant of 
> Greenwald-Khanna, which I implemented - is not the best in light of recent 
> papers:
>  - it is not exactly the one from the paper for performance reasons, but the 
> changes are not documented beyond comments on the code
>  - there are now more optimal algorithms with proven bounds (unlike q-digest, 
> the other contender at the time)
> I propose that we revisit the current implementation and look at the 
> Karnin-Lang-Liberty algorithm (KLL) for example:
> [https://arxiv.org/abs/1603.05346]
> [https://edoliberty.github.io//papers/streamingQuantiles.pdf]
> This algorithm seems to have favorable characteristics for streaming and a 
> distributed implementation, and there is a python implementation for 
> reference.
> It is a fairly standalone piece, and in that respect accessible to people who 
> don't know too much about Spark internals.
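
To make the idea concrete, here is a toy illustration of the compaction step at the heart of KLL-style sketches (this is not the algorithm from the paper and not Spark code, just a sketch of the mechanism):

{code:scala}
import scala.util.Random

// When a level's buffer fills up, sort it and randomly keep either the odd- or
// even-indexed half; the survivors are promoted with twice their previous weight.
def compact(buffer: Seq[Double], rng: Random): Seq[Double] = {
  val sorted = buffer.sorted
  val offset = if (rng.nextBoolean()) 0 else 1
  sorted.zipWithIndex.collect { case (v, i) if i % 2 == offset => v }
}

val rng = new Random(42)
println(compact(Seq(5.0, 1.0, 9.0, 3.0, 7.0, 2.0, 8.0, 4.0), rng))
// e.g. List(2.0, 4.0, 7.0, 9.0) -- half the items, each now representing weight 2
{code}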



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2018-04-23 Thread Tavis Barr (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448634#comment-16448634
 ] 

Tavis Barr commented on SPARK-18112:


Sorry, following up on my own comment... the next parameter, HIVE_STATS_RETRIES_WAIT, also 
needs to be removed.  A quick search of the code does not find these two 
parameters used anywhere else in Spark, so I think the two lines can just be 
deleted without causing any downstream issues.

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.2.0
>
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and 
> Hive 2.1.0 have also been out for a long time, but so far Spark only 
> supports reading Hive metastore data from Hive 1.2.1 and older versions. Since 
> Hive 2.x has many bug fixes and performance improvements, it is better and 
> urgent to upgrade to support Hive 2.x.
> Failed to load data from a Hive 2.x metastore:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-18112) Spark2.x does not support read data from Hive 2.x metastore

2018-04-23 Thread Tavis Barr (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448587#comment-16448587
 ] 

Tavis Barr commented on SPARK-18112:


It looks to me like this issue has actually not been fixed.  As seen in the 
stack trace, the offending code is in 

/src/main/scala/org/apache/spark/sql/hive/HiveUtils.scala

line 205, where the method attempts to fetch the value of configuration 
parameter HIVE_STATS_JDBC_TIMEOUT

which was a configuration parameter defined in import 
org.apache.hadoop.hive.conf.HiveConf, which is part of hive-common.  However, 
this configuration parameter was removed in Hive 2, therefore the above code 
will throw an exception when run with hive-common versions 2.x.  It is possible 
there are other configuration parameters requested in HiveUtils.scala that have 
been removed as well; I haven't checked.  In any event, the above line 205 is 
still present in the Master branch as of today, so Spark still does not work 
with Hive 2.x.
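
One way to check whether a given hive-common version still defines such a field, without triggering the NoSuchFieldError, is a reflective lookup; this is purely an illustration, not what Spark ultimately does:

{code:scala}
import scala.util.Try

// Reflectively look for an enum constant on HiveConf.ConfVars instead of
// referencing it directly (the direct reference is what raises NoSuchFieldError
// when the constant was removed in Hive 2.x).
def confVarExists(name: String): Boolean =
  Try(Class.forName("org.apache.hadoop.hive.conf.HiveConf$ConfVars").getField(name)).isSuccess

if (!confVarExists("HIVE_STATS_JDBC_TIMEOUT")) {
  println("hive-common 2.x detected: HIVE_STATS_JDBC_TIMEOUT is gone, skip it")
}
{code}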

> Spark2.x does not support read data from Hive 2.x metastore
> ---
>
> Key: SPARK-18112
> URL: https://issues.apache.org/jira/browse/SPARK-18112
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0, 2.0.1
>Reporter: KaiXu
>Assignee: Xiao Li
>Priority: Critical
> Fix For: 2.2.0
>
>
> Hive 2.0 was released in February 2016, and Hive 2.0.1 and 
> Hive 2.1.0 have also been out for a long time, but so far Spark only 
> supports reading Hive metastore data from Hive 1.2.1 and older versions. Since 
> Hive 2.x has many bug fixes and performance improvements, it is better and 
> urgent to upgrade to support Hive 2.x.
> Failed to load data from a Hive 2.x metastore:
> Exception in thread "main" java.lang.NoSuchFieldError: HIVE_STATS_JDBC_TIMEOUT
> at 
> org.apache.spark.sql.hive.HiveUtils$.hiveClientConfigurations(HiveUtils.scala:197)
> at 
> org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:262)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive$lzycompute(HiveSharedState.scala:39)
> at 
> org.apache.spark.sql.hive.HiveSharedState.metadataHive(HiveSharedState.scala:38)
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog$lzycompute(HiveSharedState.scala:4
> at 
> org.apache.spark.sql.hive.HiveSharedState.externalCatalog(HiveSharedState.scala:45)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog$lzycompute(HiveSessionState.scala:50)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:48)
> at 
> org.apache.spark.sql.hive.HiveSessionState.catalog(HiveSessionState.scala:31)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:568)
> at org.apache.spark.sql.SparkSession.table(SparkSession.scala:564)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-24024) Fix deviance calculations in GLM to handle corner cases

2018-04-23 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai resolved SPARK-24024.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

> Fix deviance calculations in GLM to handle corner cases
> ---
>
> Key: SPARK-24024
> URL: https://issues.apache.org/jira/browse/SPARK-24024
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Teng Peng
>Assignee: Teng Peng
>Priority: Minor
> Fix For: 2.4.0
>
>
> It is reported by Spark users that the deviance calculation does not handle 
> a corner case, so the correct model summary cannot be obtained. The user 
> has confirmed that the issue is in
> override def deviance(y: Double, mu: Double, weight: Double): Double = {
>  2.0 * weight * (y * math.log(y / mu) - (y - mu))
>  }
> when y = 0.
>  
> The user also mentioned there are many other places where he believes we should 
> check the same thing.
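
A sketch of how the y = 0 corner case can be handled, using the usual 0 * log(0) = 0 convention (illustrative only; not necessarily the exact patch that was merged):

{code:scala}
// When y == 0, y * math.log(y / mu) evaluates to 0 * -Infinity = NaN, so the
// term is defined to be 0 instead (the standard limit convention).
def deviance(y: Double, mu: Double, weight: Double): Double = {
  val ylogy = if (y == 0.0) 0.0 else y * math.log(y / mu)
  2.0 * weight * (ylogy - (y - mu))
}

assert(!deviance(0.0, 1.5, 1.0).isNaN) // finite deviance at y = 0
{code}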



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24024) Fix deviance calculations in GLM to handle corner cases

2018-04-23 Thread DB Tsai (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24024?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

DB Tsai reassigned SPARK-24024:
---

Assignee: Teng Peng

> Fix deviance calculations in GLM to handle corner cases
> ---
>
> Key: SPARK-24024
> URL: https://issues.apache.org/jira/browse/SPARK-24024
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Teng Peng
>Assignee: Teng Peng
>Priority: Minor
>
> It is reported by Spark users that the deviance calculation does not handle 
> a corner case, so the correct model summary cannot be obtained. The user 
> has confirmed that the issue is in
> override def deviance(y: Double, mu: Double, weight: Double): Double = {
>  2.0 * weight * (y * math.log(y / mu) - (y - mu))
>  }
> when y = 0.
>  
> The user also mentioned there are many other places where he believes we should 
> check the same thing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23638) Spark on k8s: spark.kubernetes.initContainer.image has no effect

2018-04-23 Thread Yinan Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23638?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yinan Li resolved SPARK-23638.
--
Resolution: Not A Problem

> Spark on k8s: spark.kubernetes.initContainer.image has no effect
> 
>
> Key: SPARK-23638
> URL: https://issues.apache.org/jira/browse/SPARK-23638
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
> Environment: K8 server: Ubuntu 16.04
> Submission client: macOS Sierra 10.12.x
> Client Version: version.Info\{Major:"1", Minor:"9", GitVersion:"v1.9.3", 
> GitCommit:"d2835416544f298c919e2ead3be3d0864b52323b", GitTreeState:"clean", 
> BuildDate:"2018-02-07T12:22:21Z", GoVersion:"go1.9.2", Compiler:"gc", 
> Platform:"darwin/amd64"}
> Server Version: version.Info\{Major:"1", Minor:"8", GitVersion:"v1.8.3", 
> GitCommit:"f0efb3cb883751c5ffdbe6d515f3cb4fbe7b7acd", GitTreeState:"clean", 
> BuildDate:"2017-11-08T18:27:48Z", GoVersion:"go1.8.3", Compiler:"gc", 
> Platform:"linux/amd64"}
>Reporter: maheshvra
>Priority: Major
>
> Hi all - I am trying to use an initContainer to download remote dependencies. To 
> begin with, I ran a test with an initContainer which basically runs "echo hello 
> world". However, when I triggered the pod deployment via spark-submit, I did 
> not see any trace of initContainer execution in my kubernetes cluster.
>  
> {code:java}
> SPARK_DRIVER_MEMORY: 1g 
> SPARK_DRIVER_CLASS: com.bigdata.App SPARK_DRIVER_ARGS: -c 
> /opt/spark/work-dir/app/main/environments/int -w 
> ./../../workflows/workflow_main.json -e prod -n features -v off 
> SPARK_DRIVER_BIND_ADDRESS:  
> SPARK_JAVA_OPT_0: -Dspark.submit.deployMode=cluster 
> SPARK_JAVA_OPT_1: -Dspark.driver.blockManager.port=7079 
> SPARK_JAVA_OPT_2: -Dspark.app.name=fg-am00-raw12 
> SPARK_JAVA_OPT_3: 
> -Dspark.kubernetes.container.image=docker.com/cmapp/fg-am00-raw:1.0.0 
> SPARK_JAVA_OPT_4: -Dspark.app.id=spark-4fa9a5ce1b1d401fa9c1e413ff030d44 
> SPARK_JAVA_OPT_5: 
> -Dspark.jars=/opt/spark/jars/aws-java-sdk-1.7.4.jar,/opt/spark/jars/hadoop-aws-2.7.3.jar,/opt/spark/jars/guava-14.0.1.jar,/opt/spark/jars/SparkApp.jar,/opt/spark/jars/datacleanup-component-1.0-SNAPSHOT.jar
>  
> SPARK_JAVA_OPT_6: -Dspark.driver.port=7078 
> SPARK_JAVA_OPT_7: 
> -Dspark.kubernetes.initContainer.image=docker.com/cmapp/custombusybox:1.0.0 
> SPARK_JAVA_OPT_8: 
> -Dspark.kubernetes.executor.podNamePrefix=fg-am00-raw12-b1c8112b8536304ab0fc64fcc41e0615
>  
> SPARK_JAVA_OPT_9: 
> -Dspark.kubernetes.driver.pod.name=fg-am00-raw12-b1c8112b8536304ab0fc64fcc41e0615-driver
>  
> SPARK_JAVA_OPT_10: 
> -Dspark.driver.host=fg-am00-raw12-b1c8112b8536304ab0fc64fcc41e0615-driver-svc.experimental.svc
>  SPARK_JAVA_OPT_11: -Dspark.executor.instances=5 
> SPARK_JAVA_OPT_12: 
> -Dspark.hadoop.fs.s3a.server-side-encryption-algorithm=AES256 
> SPARK_JAVA_OPT_13: -Dspark.kubernetes.namespace=experimental 
> SPARK_JAVA_OPT_14: 
> -Dspark.kubernetes.authenticate.driver.serviceAccountName=experimental-service-account
>  SPARK_JAVA_OPT_15: -Dspark.master=k8s://https://bigdata
> {code}
>  
> Further, I did not see a spec.initContainers section in the generated pod. 
> Please see the details below
>  
> {code:java}
>  
> {
> "kind": "Pod",
> "apiVersion": "v1",
> "metadata": {
> "name": "fg-am00-raw12-b1c8112b8536304ab0fc64fcc41e0615-driver",
> "namespace": "experimental",
> "selfLink": 
> "/api/v1/namespaces/experimental/pods/fg-am00-raw12-b1c8112b8536304ab0fc64fcc41e0615-driver",
> "uid": "adc5a50a-2342-11e8-87dc-12c5b3954044",
> "resourceVersion": "299054",
> "creationTimestamp": "2018-03-09T02:36:32Z",
> "labels": {
> "spark-app-selector": "spark-4fa9a5ce1b1d401fa9c1e413ff030d44",
> "spark-role": "driver"
> },
> "annotations": {
> "spark-app-name": "fg-am00-raw12"
> }
> },
> "spec": {
> "volumes": [
> {
> "name": "experimental-service-account-token-msmth",
> "secret": {
> "secretName": "experimental-service-account-token-msmth",
> "defaultMode": 420
> }
> }
> ],
> "containers": [
> {
> "name": "spark-kubernetes-driver",
> "image": "docker.com/cmapp/fg-am00-raw:1.0.0",
> "args": [
> "driver"
> ],
> "env": [
> {
> "name": "SPARK_DRIVER_MEMORY",
> "value": "1g"
> },
> {
> "name": "SPARK_DRIVER_CLASS",
> "value": "com.myapp.App"
> },
> {
> "name": "SPARK_DRIVER_ARGS",
> "value": "-c /opt/spark/work-dir/app/main/environments/int -w 
> ./../../workflows/workflow_main.json -e prod -n features -v off"
> },
> {
> "name": "SPARK_DRIVER_BIND_ADDRESS",
> "valueFrom": {
> "fieldRef": {
> "apiVersion": "v1",
> "fieldPath": "status.podIP"
> }
> }
> },
> {
> "name": "SPARK_MOUNTED_CLASSPATH",
> "value": 
> "/opt/spark/jars/aws-java-sdk-1.7.4.jar:/opt/spark/jars/hadoop-aws-2.7.3.jar:/opt/spark/jars/guava-14.0.1.jar:/opt/spark/jars/datacleanup-component-1.0-SNAPSHOT.jar:/opt/spark/jars/SparkApp.jar"
> },
> {

[jira] [Commented] (SPARK-24051) Incorrect results for certain queries using Java API on Spark 2.3.0

2018-04-23 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448390#comment-16448390
 ] 

Emlyn Corrin commented on SPARK-24051:
--

I've managed to reproduce this in {{pyspark}}:
{code}
from pyspark.sql import functions, Window
ds1 = spark.createDataFrame([[1,42],[1,99]], ["a","b"])
ds2 = spark.createDataFrame([[3]], ["a"]).withColumn("b", functions.lit(0))

cols = [functions.col("a"),
functions.col("b").alias("b"),
functions.count(functions.lit(1)).over(Window.partitionBy()).alias("n")]

ds = ds1.select(cols).union(ds2.select(cols)).where(functions.col("n") >= 
1).drop("n")
ds.show()
{code}
I've also found that (in both Java and Python) I can leave off the final 
{{where}} clause if I also leave off the following {{drop}}, so that the {{n}} 
column is included in the output (I suppose because, as long as it's actually 
observed, it can't be optimised away).
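For reference, a minimal sketch of that working variant (reusing {{ds1}}, {{ds2}} and {{cols}} from the snippet above; nothing new beyond dropping the {{where}}/{{drop}} pair):
{code}
# Same union as above, but without the trailing where()/drop("n"); keeping the
# "n" column in the output gives the expected values for column b.
ds_ok = ds1.select(cols).union(ds2.select(cols))
ds_ok.show()
{code}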

> Incorrect results for certain queries using Java API on Spark 2.3.0
> ---
>
> Key: SPARK-24051
> URL: https://issues.apache.org/jira/browse/SPARK-24051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Emlyn Corrin
>Priority: Major
>
> I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) 
> query, demonstrated by the Java program below. It was simplified from a much 
> more complex query, but I'm having trouble simplifying it further without 
> removing the erroneous behaviour.
> {code:java}
> package sparktest;
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.*;
> import org.apache.spark.sql.expressions.Window;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.Metadata;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> import java.util.Arrays;
> public class Main {
> public static void main(String[] args) {
> SparkConf conf = new SparkConf()
> .setAppName("SparkTest")
> .setMaster("local[*]");
> SparkSession session = 
> SparkSession.builder().config(conf).getOrCreate();
> Row[] arr1 = new Row[]{
> RowFactory.create(1, 42),
> RowFactory.create(2, 99)};
> StructType sch1 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty()),
> new StructField("b", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset ds1 = session.createDataFrame(Arrays.asList(arr1), sch1);
> ds1.show();
> Row[] arr2 = new Row[]{
> RowFactory.create(3)};
> StructType sch2 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset ds2 = session.createDataFrame(Arrays.asList(arr2), sch2)
> .withColumn("b", functions.lit(0));
> ds2.show();
> Column[] cols = new Column[]{
> new Column("a"),
> new Column("b").as("b"),
> functions.count(functions.lit(1))
> .over(Window.partitionBy())
> .as("n")};
> Dataset ds = ds1
> .select(cols)
> .union(ds2.select(cols))
> .where(new Column("n").geq(1))
> .drop("n");
> ds.show();
> //ds.explain(true);
> }
> }
> {code}
> It just calculates the union of 2 datasets,
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> +---+---+
> {code}
> with
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  3|  0|
> +---+---+
> {code}
> The expected result is:
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> |  3|  0|
> +---+---+
> {code}
> but instead it prints:
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  0|
> |  2|  0|
> |  3|  0|
> +---+---+
> {code}
> Notice how the value in column b is always zero, overriding the original 
> values in rows 1 and 2.
>  Making seemingly trivial changes, such as replacing {{new 
> Column("b").as("b"),}} with just {{new Column("b"),}} or removing the 
> {{where}} clause after the union, makes it behave correctly again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24051) Incorrect results for certain queries using Java and Python APIs on Spark 2.3.0

2018-04-23 Thread Emlyn Corrin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emlyn Corrin updated SPARK-24051:
-
Summary: Incorrect results for certain queries using Java and Python APIs 
on Spark 2.3.0  (was: Incorrect results for certain queries using Java API on 
Spark 2.3.0)

> Incorrect results for certain queries using Java and Python APIs on Spark 
> 2.3.0
> ---
>
> Key: SPARK-24051
> URL: https://issues.apache.org/jira/browse/SPARK-24051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Emlyn Corrin
>Priority: Major
>
> I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) 
> query, demonstrated by the Java program below. It was simplified from a much 
> more complex query, but I'm having trouble simplifying it further without 
> removing the erroneous behaviour.
> {code:java}
> package sparktest;
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.*;
> import org.apache.spark.sql.expressions.Window;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.Metadata;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> import java.util.Arrays;
> public class Main {
> public static void main(String[] args) {
> SparkConf conf = new SparkConf()
> .setAppName("SparkTest")
> .setMaster("local[*]");
> SparkSession session = 
> SparkSession.builder().config(conf).getOrCreate();
> Row[] arr1 = new Row[]{
> RowFactory.create(1, 42),
> RowFactory.create(2, 99)};
> StructType sch1 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty()),
> new StructField("b", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset ds1 = session.createDataFrame(Arrays.asList(arr1), sch1);
> ds1.show();
> Row[] arr2 = new Row[]{
> RowFactory.create(3)};
> StructType sch2 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset ds2 = session.createDataFrame(Arrays.asList(arr2), sch2)
> .withColumn("b", functions.lit(0));
> ds2.show();
> Column[] cols = new Column[]{
> new Column("a"),
> new Column("b").as("b"),
> functions.count(functions.lit(1))
> .over(Window.partitionBy())
> .as("n")};
> Dataset ds = ds1
> .select(cols)
> .union(ds2.select(cols))
> .where(new Column("n").geq(1))
> .drop("n");
> ds.show();
> //ds.explain(true);
> }
> }
> {code}
> It just calculates the union of 2 datasets,
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> +---+---+
> {code}
> with
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  3|  0|
> +---+---+
> {code}
> The expected result is:
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> |  3|  0|
> +---+---+
> {code}
> but instead it prints:
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  0|
> |  2|  0|
> |  3|  0|
> +---+---+
> {code}
> Notice how the value in column b is always zero, overriding the original 
> values in rows 1 and 2.
>  Making seemingly trivial changes, such as replacing {{new 
> Column("b").as("b"),}} with just {{new Column("b"),}} or removing the 
> {{where}} clause after the union, makes it behave correctly again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24013) ApproximatePercentile grinds to a halt on sorted input.

2018-04-23 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448365#comment-16448365
 ] 

Marco Gaido commented on SPARK-24013:
-

[~juliuszsompolski] I have been able to reproduce with 1000. SPARK-17439 is 
probably related. The problem is that the compress method is called too many 
times in this condition. The fix is easy and I'll submit a patch soon, but I am 
not so familiar with this algorithm and the real root cause of the problem, so I 
have to study it a bit in order to check whether there are other problems 
causing the performance issue.
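For anyone who wants to compare the two cases while this is being investigated, here is a rough pyspark timing sketch (it assumes an existing {{spark}} session; the range size is a placeholder, since the exact figure in the report appears truncated):
{code}
# Rough timing comparison of the random vs. sorted approx_percentile inputs.
# N is a placeholder; the second query is the one expected to be pathologically slow.
import time

def timed(query):
    start = time.time()
    spark.sql(query).collect()
    print("%.1f s" % (time.time() - start))

N = 10 * 1000 * 1000  # placeholder range size
timed("select approx_percentile(rid, array(0.1)) from (select rand() as rid from range(%d))" % N)
timed("select approx_percentile(id, array(0.1)) from range(%d)" % N)
{code}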

> ApproximatePercentile grinds to a halt on sorted input.
> ---
>
> Key: SPARK-24013
> URL: https://issues.apache.org/jira/browse/SPARK-24013
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Juliusz Sompolski
>Priority: Major
> Attachments: screenshot-1.png
>
>
> Running
> {code}
> sql("select approx_percentile(rid, array(0.1)) from (select rand() as rid 
> from range(1000))").collect()
> {code}
> takes 7 seconds, while
> {code}
> sql("select approx_percentile(id, array(0.1)) from range(1000)").collect()
> {code}
> grinds to a halt - it processes the first million rows quickly, and then slows 
> down to a few thousand rows / second (4m rows processed after 20 minutes).
> Thread dumps show that it spends time in QuantileSummary.compress.
> It seems to hit some edge-case inefficiency when dealing with sorted data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24054) Add array_position function / element_at functions

2018-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24054:


Assignee: (was: Apache Spark)

> Add array_position function /  element_at functions
> ---
>
> Key: SPARK-24054
> URL: https://issues.apache.org/jira/browse/SPARK-24054
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Add R versions of SPARK-23919 and SPARK-23924



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24054) Add array_position function / element_at functions

2018-04-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24054?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448276#comment-16448276
 ] 

Apache Spark commented on SPARK-24054:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/21130

> Add array_position function /  element_at functions
> ---
>
> Key: SPARK-24054
> URL: https://issues.apache.org/jira/browse/SPARK-24054
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Add R versions of SPARK-23919 and SPARK-23924



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24054) Add array_position function / element_at functions

2018-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24054:


Assignee: Apache Spark

> Add array_position function /  element_at functions
> ---
>
> Key: SPARK-24054
> URL: https://issues.apache.org/jira/browse/SPARK-24054
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 2.4.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Add R versions of SPARK-23919 and SPARK-23924



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24054) Add array_position function / element_at functions

2018-04-23 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-24054:


 Summary: Add array_position function /  element_at functions
 Key: SPARK-24054
 URL: https://issues.apache.org/jira/browse/SPARK-24054
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Affects Versions: 2.4.0
Reporter: Hyukjin Kwon


Add R versions of SPARK-23919 and SPARK-23924
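For context, a short pyspark illustration of the two functions whose R wrappers this ticket adds (both are available from Spark 2.4.0); the example data is made up:
{code}
from pyspark.sql import functions as F

df = spark.createDataFrame([([10, 20, 30],)], ["xs"])
df.select(
    F.array_position("xs", 20).alias("pos"),  # 1-based position -> 2 (0 if absent)
    F.element_at("xs", 3).alias("third")      # 1-based element   -> 30
).show()
{code}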



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24013) ApproximatePercentile grinds to a halt on sorted input.

2018-04-23 Thread Juliusz Sompolski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448093#comment-16448093
 ] 

Juliusz Sompolski edited comment on SPARK-24013 at 4/23/18 1:11 PM:


Hi [~mgaido]
I tested again on current master (afbdf427302aba858f95205ecef7667f412b2a6a) and 
I can reproduce it:
 !screenshot-1.png! 

Maybe you need to bump up 1000 to something higher when running on a bigger 
cluster that splits the range into more tasks?
For me it grinds to a halt after about 250 per task.


was (Author: juliuszsompolski):
Hi [~mgaido]
I tested again on current master (afbdf427302aba858f95205ecef7667f412b2a6a) and 
I can reproduce it:
 !screenshot-1.png! 

Maybe you need to bump up 100 to something higher when running on a bigger 
cluster that splits the range into more tasks?
For me it grinds to a halt after about 250 per task.

> ApproximatePercentile grinds to a halt on sorted input.
> ---
>
> Key: SPARK-24013
> URL: https://issues.apache.org/jira/browse/SPARK-24013
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Juliusz Sompolski
>Priority: Major
> Attachments: screenshot-1.png
>
>
> Running
> {code}
> sql("select approx_percentile(rid, array(0.1)) from (select rand() as rid 
> from range(1000))").collect()
> {code}
> takes 7 seconds, while
> {code}
> sql("select approx_percentile(id, array(0.1)) from range(1000)").collect()
> {code}
> grinds to a halt - it processes the first million rows quickly, and then slows 
> down to a few thousand rows / second (4m rows processed after 20 minutes).
> Thread dumps show that it spends time in QuantileSummary.compress.
> It seems to hit some edge-case inefficiency when dealing with sorted data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24013) ApproximatePercentile grinds to a halt on sorted input.

2018-04-23 Thread Juliusz Sompolski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448093#comment-16448093
 ] 

Juliusz Sompolski commented on SPARK-24013:
---

Hi [~mgaido]
I tested again on current master (afbdf427302aba858f95205ecef7667f412b2a6a) and 
I can reproduce it:
 !screenshot-1.png! 

Maybe you need to bump up 100 to something higher when running on a bigger 
cluster that splits the range into more tasks?
For me it grinds to a halt after about 250 per task.

> ApproximatePercentile grinds to a halt on sorted input.
> ---
>
> Key: SPARK-24013
> URL: https://issues.apache.org/jira/browse/SPARK-24013
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Juliusz Sompolski
>Priority: Major
> Attachments: screenshot-1.png
>
>
> Running
> {code}
> sql("select approx_percentile(rid, array(0.1)) from (select rand() as rid 
> from range(1000))").collect()
> {code}
> takes 7 seconds, while
> {code}
> sql("select approx_percentile(id, array(0.1)) from range(1000)").collect()
> {code}
> grinds to a halt - it processes the first million rows quickly, and then slows 
> down to a few thousand rows / second (4m rows processed after 20 minutes).
> Thread dumps show that it spends time in QuantileSummary.compress.
> It seems to hit some edge-case inefficiency when dealing with sorted data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24013) ApproximatePercentile grinds to a halt on sorted input.

2018-04-23 Thread Juliusz Sompolski (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24013?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Juliusz Sompolski updated SPARK-24013:
--
Attachment: screenshot-1.png

> ApproximatePercentile grinds to a halt on sorted input.
> ---
>
> Key: SPARK-24013
> URL: https://issues.apache.org/jira/browse/SPARK-24013
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Juliusz Sompolski
>Priority: Major
> Attachments: screenshot-1.png
>
>
> Running
> {code}
> sql("select approx_percentile(rid, array(0.1)) from (select rand() as rid 
> from range(1000))").collect()
> {code}
> takes 7 seconds, while
> {code}
> sql("select approx_percentile(id, array(0.1)) from range(1000)").collect()
> {code}
> grinds to a halt - it processes the first million rows quickly, and then slows 
> down to a few thousand rows / second (4m rows processed after 20 minutes).
> Thread dumps show that it spends time in QuantileSummary.compress.
> It seems to hit some edge-case inefficiency when dealing with sorted data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23589) Add interpreted execution for ExternalMapToCatalyst expression

2018-04-23 Thread Herman van Hovell (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Herman van Hovell resolved SPARK-23589.
---
   Resolution: Fixed
 Assignee: Takeshi Yamamuro
Fix Version/s: 2.4.0

> Add interpreted execution for ExternalMapToCatalyst expression
> --
>
> Key: SPARK-23589
> URL: https://issues.apache.org/jira/browse/SPARK-23589
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Herman van Hovell
>Assignee: Takeshi Yamamuro
>Priority: Major
> Fix For: 2.4.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24013) ApproximatePercentile grinds to a halt on sorted input.

2018-04-23 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24013?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448058#comment-16448058
 ] 

Marco Gaido commented on SPARK-24013:
-

I cannot reproduce this on current master. For me, the second query was very fast 
too.

> ApproximatePercentile grinds to a halt on sorted input.
> ---
>
> Key: SPARK-24013
> URL: https://issues.apache.org/jira/browse/SPARK-24013
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Juliusz Sompolski
>Priority: Major
>
> Running
> {code}
> sql("select approx_percentile(rid, array(0.1)) from (select rand() as rid 
> from range(1000))").collect()
> {code}
> takes 7 seconds, while
> {code}
> sql("select approx_percentile(id, array(0.1)) from range(1000)").collect()
> {code}
> grinds to a halt - it processes the first million rows quickly, and then slows 
> down to a few thousand rows / second (4m rows processed after 20 minutes).
> Thread dumps show that it spends time in QuantileSummary.compress.
> It seems to hit some edge-case inefficiency when dealing with sorted data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24009) spark2.3.0 INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/aaaaab'

2018-04-23 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448056#comment-16448056
 ] 

Marco Gaido commented on SPARK-24009:
-

I have not been able to reproduce.

> spark2.3.0 INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab' 
> -
>
> Key: SPARK-24009
> URL: https://issues.apache.org/jira/browse/SPARK-24009
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: chris_j
>Priority: Major
>
> In local mode, Spark executes "INSERT OVERWRITE LOCAL DIRECTORY" successfully.
> On YARN, Spark fails to execute "INSERT OVERWRITE LOCAL DIRECTORY", and it is 
> not a permission problem either.
>  
> 1. spark-sql -e "INSERT OVERWRITE LOCAL DIRECTORY '/home/spark/ab'row 
> format delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from 
> default.dim_date"  writes to the local directory successfully
> 2. spark-sql --master yarn -e "INSERT OVERWRITE DIRECTORY 'ab'row format 
> delimited FIELDS TERMINATED BY '\t' STORED AS TEXTFILE select * from 
> default.dim_date"  writes to HDFS successfully
> 3. spark-sql --master yarn -e "INSERT OVERWRITE LOCAL DIRECTORY 
> '/home/spark/ab'row format delimited FIELDS TERMINATED BY '\t' STORED AS 
> TEXTFILE select * from default.dim_date"  on YARN fails to write to the local 
> directory
>  
>  
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: 
> java.io.IOException: Mkdirs failed to create 
> [file:/home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0|file://home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0]
>  (exists=false, 
> cwd=[file:/data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02|file://data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02])
>  at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:249)
>  at 
> org.apache.spark.sql.hive.execution.HiveOutputWriter.(HiveFileFormat.scala:123)
>  at 
> org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
>  at 
> org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1411)
>  at 
> org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
>  ... 8 more
>  Caused by: java.io.IOException: Mkdirs failed to create 
> [file:/home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0|file://home/spark/ab/.hive-staging_hive_2018-04-18_14-14-37_208_1244164279218288723-1/-ext-1/_temporary/0/_temporary/attempt_20180418141439__m_00_0]
>  (exists=false, 
> cwd=[file:/data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02|file://data/hadoop/tmp/nm-local-dir/usercache/spark/appcache/application_1523246226712_0403/container_1523246226712_0403_01_02])
>  at 
> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:447)
>  at 
> org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:433)
>  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:908)
>  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:801)
>  at 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat.getHiveRecordWriter(HiveIgnoreKeyTextOutputFormat.java:80)
>  at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getRecordWriter(HiveFileFormatUtils.java:261)
>  at 
> org.apache.hadoop.hive.ql.io.HiveFileFormatUtils.getHiveRecordWriter(HiveFileFormatUtils.java:246)
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Assigned] (SPARK-23564) the optimized logical plan about Left anti join should be further optimization

2018-04-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan reassigned SPARK-23564:
---

Assignee: Marco Gaido

> the  optimized logical plan about Left anti join should be further 
> optimization
> ---
>
> Key: SPARK-23564
> URL: https://issues.apache.org/jira/browse/SPARK-23564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: KaiXinXIaoLei
>Assignee: Marco Gaido
>Priority: Major
> Fix For: 2.4.0
>
>
> The Optimized Logical Plan of the query '*select * from tt1 left anti join 
> tt2 on tt2.i = tt1.i*' is 
>  
> {code:java}
> == Optimized Logical Plan ==
> Join LeftAnti, (i#2 = i#0)
> :- HiveTableRelation `default`.`tt1`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#0, s#1]
> +- Project [i#2]
> +- HiveTableRelation `default`.`tt2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#2, s#3]
> {code}
>  
>  
> this plan can be further optimized by adding a 'Filter isnotnull' on the right 
> table, as follows:
> {code:java}
> == Optimized Logical Plan ==
> Join LeftAnti, (i#2 = i#0)
> :- HiveTableRelation `default`.`tt1`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#0, s#1]
> +- Project [i#2]
>   +- Filter isnotnull(i#3)
> +- HiveTableRelation `default`.`tt2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#2, s#3]
> {code}
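A quick way to check whether the extra IsNotNull filter appears, assuming the {{tt1}}/{{tt2}} Hive tables from this description already exist, is to look at the optimized plan directly:
{code}
# Print all plans for the query from the description; the
# "== Optimized Logical Plan ==" section shows whether a Filter isnotnull(...)
# has been pushed below the LeftAnti join.
spark.sql("select * from tt1 left anti join tt2 on tt2.i = tt1.i").explain(True)
{code}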



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23564) the optimized logical plan about Left anti join should be further optimization

2018-04-23 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan resolved SPARK-23564.
-
   Resolution: Fixed
Fix Version/s: 2.4.0

Issue resolved by pull request 21083
[https://github.com/apache/spark/pull/21083]

> the  optimized logical plan about Left anti join should be further 
> optimization
> ---
>
> Key: SPARK-23564
> URL: https://issues.apache.org/jira/browse/SPARK-23564
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: KaiXinXIaoLei
>Priority: Major
> Fix For: 2.4.0
>
>
> The Optimized Logical Plan of the query '*select * from tt1 left anti join 
> tt2 on tt2.i = tt1.i*' is 
>  
> {code:java}
> == Optimized Logical Plan ==
> Join LeftAnti, (i#2 = i#0)
> :- HiveTableRelation `default`.`tt1`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#0, s#1]
> +- Project [i#2]
> +- HiveTableRelation `default`.`tt2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#2, s#3]
> {code}
>  
>  
> this plan can be further optimized by adding a 'Filter isnotnull' on the right 
> table, as follows:
> {code:java}
> == Optimized Logical Plan ==
> Join LeftAnti, (i#2 = i#0)
> :- HiveTableRelation `default`.`tt1`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#0, s#1]
> +- Project [i#2]
>   +- Filter isnotnull(i#3)
> +- HiveTableRelation `default`.`tt2`, 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [i#2, s#3]
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7132) Add fit with validation set to spark.ml GBT

2018-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7132:
---

Assignee: Apache Spark

> Add fit with validation set to spark.ml GBT
> ---
>
> Key: SPARK-7132
> URL: https://issues.apache.org/jira/browse/SPARK-7132
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>Priority: Minor
>
> In spark.mllib GradientBoostedTrees, we have a method runWithValidation which 
> takes a validation set.  We should add that to the spark.ml API.
> This will require a bit of thinking about how the Pipelines API should handle 
> a validation set (since Transformers and Estimators only take 1 input 
> DataFrame).  The current plan is to include an extra column in the input 
> DataFrame which indicates whether the row is for training, validation, etc.
> Goals
> A  [P0] Support efficient validation during training
> B  [P1] Support early stopping based on validation metrics
> C  [P0] Ensure validation data are preprocessed identically to training data
> D  [P1] Support complex Pipelines with multiple models using validation data
> Proposal: column with indicator for train vs validation
> Include an extra column in the input DataFrame which indicates whether the 
> row is for training or validation.  Add a Param “validationFlagCol” used to 
> specify the extra column name.
> A, B, C are easy.
> D is doable.
> Each estimator would need to have its validationFlagCol Param set to the same 
> column.
> Complication: It would be ideal if we could prevent different estimators from 
> using different validation sets.  (Joseph: There is not an obvious way IMO.  
> Maybe we can address this later by, e.g., having Pipelines take a 
> validationFlagCol Param and pass that to the sub-models in the Pipeline.  
> Let’s not worry about this for now.)
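To make the proposal concrete, here is a hedged sketch of how the flag column might be produced and used; {{setValidationFlagCol}} is hypothetical (it is the Param proposed here, not an existing spark.ml API), and {{train_df}} stands in for any labeled DataFrame:
{code}
from pyspark.sql import functions as F
from pyspark.ml.classification import GBTClassifier

# Mark roughly 20% of the rows as validation data via an extra boolean column.
flagged = train_df.withColumn("isVal", F.rand(seed=42) < 0.2)

gbt = GBTClassifier(maxIter=50)
# gbt.setValidationFlagCol("isVal")   # hypothetical Param from this proposal
# model = gbt.fit(flagged)            # would validate on rows where isVal is true
{code}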



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-7132) Add fit with validation set to spark.ml GBT

2018-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-7132:
---

Assignee: (was: Apache Spark)

> Add fit with validation set to spark.ml GBT
> ---
>
> Key: SPARK-7132
> URL: https://issues.apache.org/jira/browse/SPARK-7132
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> In spark.mllib GradientBoostedTrees, we have a method runWithValidation which 
> takes a validation set.  We should add that to the spark.ml API.
> This will require a bit of thinking about how the Pipelines API should handle 
> a validation set (since Transformers and Estimators only take 1 input 
> DataFrame).  The current plan is to include an extra column in the input 
> DataFrame which indicates whether the row is for training, validation, etc.
> Goals
> A  [P0] Support efficient validation during training
> B  [P1] Support early stopping based on validation metrics
> C  [P0] Ensure validation data are preprocessed identically to training data
> D  [P1] Support complex Pipelines with multiple models using validation data
> Proposal: column with indicator for train vs validation
> Include an extra column in the input DataFrame which indicates whether the 
> row is for training or validation.  Add a Param “validationFlagCol” used to 
> specify the extra column name.
> A, B, C are easy.
> D is doable.
> Each estimator would need to have its validationFlagCol Param set to the same 
> column.
> Complication: It would be ideal if we could prevent different estimators from 
> using different validation sets.  (Joseph: There is not an obvious way IMO.  
> Maybe we can address this later by, e.g., having Pipelines take a 
> validationFlagCol Param and pass that to the sub-models in the Pipeline.  
> Let’s not worry about this for now.)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7132) Add fit with validation set to spark.ml GBT

2018-04-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16448039#comment-16448039
 ] 

Apache Spark commented on SPARK-7132:
-

User 'WeichenXu123' has created a pull request for this issue:
https://github.com/apache/spark/pull/21129

> Add fit with validation set to spark.ml GBT
> ---
>
> Key: SPARK-7132
> URL: https://issues.apache.org/jira/browse/SPARK-7132
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Minor
>
> In spark.mllib GradientBoostedTrees, we have a method runWithValidation which 
> takes a validation set.  We should add that to the spark.ml API.
> This will require a bit of thinking about how the Pipelines API should handle 
> a validation set (since Transformers and Estimators only take 1 input 
> DataFrame).  The current plan is to include an extra column in the input 
> DataFrame which indicates whether the row is for training, validation, etc.
> Goals
> A  [P0] Support efficient validation during training
> B  [P1] Support early stopping based on validation metrics
> C  [P0] Ensure validation data are preprocessed identically to training data
> D  [P1] Support complex Pipelines with multiple models using validation data
> Proposal: column with indicator for train vs validation
> Include an extra column in the input DataFrame which indicates whether the 
> row is for training or validation.  Add a Param “validationFlagCol” used to 
> specify the extra column name.
> A, B, C are easy.
> D is doable.
> Each estimator would need to have its validationFlagCol Param set to the same 
> column.
> Complication: It would be ideal if we could prevent different estimators from 
> using different validation sets.  (Joseph: There is not an obvious way IMO.  
> Maybe we can address this later by, e.g., having Pipelines take a 
> validationFlagCol Param and pass that to the sub-models in the Pipeline.  
> Let’s not worry about this for now.)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15257) Require CREATE EXTERNAL TABLE to specify LOCATION

2018-04-23 Thread ice bai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447949#comment-16447949
 ] 

ice bai commented on SPARK-15257:
-

“but its metadata won't be deleted even when the user drops the table”

Which metadata cannot be deleted? In the Hive Metastore or in the SparkSession?

> Require CREATE EXTERNAL TABLE to specify LOCATION
> -
>
> Key: SPARK-15257
> URL: https://issues.apache.org/jira/browse/SPARK-15257
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.0
>Reporter: Andrew Or
>Assignee: Andrew Or
>Priority: Major
> Fix For: 2.0.0
>
>
> Right now when the user runs `CREATE EXTERNAL TABLE` without specifying 
> `LOCATION`, the table will still be created in the warehouse directory, but 
> its metadata won't be deleted even when the user drops the table! This is a 
> problem. We should require the user to also specify `LOCATION`.
> Note: This does not apply to `CREATE EXTERNAL TABLE ... USING`, which is 
> not yet supported.
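For illustration, the form this change would require (hypothetical table name and path; needs a Hive-enabled SparkSession):
{code}
# With this change, CREATE EXTERNAL TABLE must carry an explicit LOCATION.
spark.sql("""
    CREATE EXTERNAL TABLE ext_logs (id INT, msg STRING)
    LOCATION '/tmp/external/ext_logs'
""")
{code}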



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24053) Support add subdirectory named as user name on staging directory

2018-04-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24053?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447943#comment-16447943
 ] 

Apache Spark commented on SPARK-24053:
--

User 'caneGuy' has created a pull request for this issue:
https://github.com/apache/spark/pull/21128

>  Support add subdirectory named as user name on staging directory
> -
>
> Key: SPARK-24053
> URL: https://issues.apache.org/jira/browse/SPARK-24053
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: zhoukang
>Priority: Minor
>
> When we have multiple users on the same cluster, we can add a subdirectory 
> named after the login user under the staging directory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24053) Support add subdirectory named as user name on staging directory

2018-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24053:


Assignee: Apache Spark

>  Support add subdirectory named as user name on staging directory
> -
>
> Key: SPARK-24053
> URL: https://issues.apache.org/jira/browse/SPARK-24053
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: zhoukang
>Assignee: Apache Spark
>Priority: Minor
>
> When we have multiple users on the same cluster, we can add a subdirectory 
> named after the login user under the staging directory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24053) Support add subdirectory named as user name on staging directory

2018-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24053:


Assignee: (was: Apache Spark)

>  Support add subdirectory named as user name on staging directory
> -
>
> Key: SPARK-24053
> URL: https://issues.apache.org/jira/browse/SPARK-24053
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: zhoukang
>Priority: Minor
>
> When we have multiple users on the same cluster, we can add a subdirectory 
> named after the login user under the staging directory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24053) Support add subdirectory named as user name on staging directory

2018-04-23 Thread zhoukang (JIRA)
zhoukang created SPARK-24053:


 Summary:  Support add subdirectory named as user name on staging 
directory
 Key: SPARK-24053
 URL: https://issues.apache.org/jira/browse/SPARK-24053
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.3.0
Reporter: zhoukang


When we have multiple users on the same cluster, we can add a subdirectory named 
after the login user under the staging directory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20426) OneForOneStreamManager occupies too much memory.

2018-04-23 Thread ice bai (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447935#comment-16447935
 ] 

ice bai commented on SPARK-20426:
-

Refer to the following issue:

https://issues.apache.org/jira/browse/SPARK-20994

> OneForOneStreamManager occupies too much memory.
> 
>
> Key: SPARK-20426
> URL: https://issues.apache.org/jira/browse/SPARK-20426
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle
>Affects Versions: 2.1.0
>Reporter: jin xing
>Assignee: jin xing
>Priority: Major
> Fix For: 2.2.0
>
> Attachments: screenshot-1.png, screenshot-2.png
>
>
> Spark jobs are running on a YARN cluster in my warehouse. We enabled the 
> external shuffle service (*--conf spark.shuffle.service.enabled=true*). 
> Recently the NodeManager runs OOM now and then. Dumping heap memory, we found 
> that *OneForOneStreamManager*'s footprint is huge. The NodeManager is configured 
> with 5G heap memory, while *OneForOneStreamManager* costs 2.5G and there are 
> 5503233 *FileSegmentManagedBuffer* objects. Are there any suggestions to avoid 
> this other than just increasing the NodeManager's memory? Is it possible to stop 
> *registerStream* in OneForOneStreamManager, so that we don't need to cache so 
> much metadata (i.e. StreamState)?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24052) Support spark version showing on environment page

2018-04-23 Thread zhoukang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447917#comment-16447917
 ] 

zhoukang commented on SPARK-24052:
--

[~jiangxb] Could you help review this? Thanks a lot!

> Support spark version showing on environment page
> -
>
> Key: SPARK-24052
> URL: https://issues.apache.org/jira/browse/SPARK-24052
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: zhoukang
>Priority: Major
> Attachments: environment-page.png
>
>
> Since we may have multiple versions in a production cluster, we can show this 
> information on the environment page, as below:
>  !environment-page.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24052) Support spark version showing on environment page

2018-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24052:


Assignee: (was: Apache Spark)

> Support spark version showing on environment page
> -
>
> Key: SPARK-24052
> URL: https://issues.apache.org/jira/browse/SPARK-24052
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: zhoukang
>Priority: Major
> Attachments: environment-page.png
>
>
> Since we may have multiple versions in a production cluster, we can show this 
> information on the environment page, as below:
>  !environment-page.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24052) Support spark version showing on environment page

2018-04-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447914#comment-16447914
 ] 

Apache Spark commented on SPARK-24052:
--

User 'caneGuy' has created a pull request for this issue:
https://github.com/apache/spark/pull/21127

> Support spark version showing on environment page
> -
>
> Key: SPARK-24052
> URL: https://issues.apache.org/jira/browse/SPARK-24052
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: zhoukang
>Priority: Major
> Attachments: environment-page.png
>
>
> Since we may have multiple versions in a production cluster, we can show this 
> information on the environment page, as below:
>  !environment-page.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-24052) Support spark version showing on environment page

2018-04-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-24052:


Assignee: Apache Spark

> Support spark version showing on environment page
> -
>
> Key: SPARK-24052
> URL: https://issues.apache.org/jira/browse/SPARK-24052
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: zhoukang
>Assignee: Apache Spark
>Priority: Major
> Attachments: environment-page.png
>
>
> Since we may have multiple versions in a production cluster, we can show this 
> information on the environment page, as below:
>  !environment-page.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23901) Data Masking Functions

2018-04-23 Thread Marco Gaido (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447916#comment-16447916
 ] 

Marco Gaido commented on SPARK-23901:
-

[~smilegator] [~ueshin] Any comments on my questions?

> Data Masking Functions
> --
>
> Key: SPARK-23901
> URL: https://issues.apache.org/jira/browse/SPARK-23901
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Major
>
> - mask()
>  - mask_first_n()
>  - mask_last_n()
>  - mask_hash()
>  - mask_show_first_n()
>  - mask_show_last_n()
> Reference:
> [1] 
> [https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-DataMaskingFunctions]
> [2] https://issues.apache.org/jira/browse/HIVE-13568
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24052) Support spark version showing on environment page

2018-04-23 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-24052:
-
Description: 
Since we may have multiple versions in a production cluster, we can show this 
information on the environment page, as below:
 !environment-page.png! 

> Support spark version showing on environment page
> -
>
> Key: SPARK-24052
> URL: https://issues.apache.org/jira/browse/SPARK-24052
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: zhoukang
>Priority: Major
> Attachments: environment-page.png
>
>
> Since we may have multiple versions in a production cluster, we can show this 
> information on the environment page, as below:
>  !environment-page.png! 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24052) Support spark version showing on environment page

2018-04-23 Thread zhoukang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zhoukang updated SPARK-24052:
-
Attachment: environment-page.png

> Support spark version showing on environment page
> -
>
> Key: SPARK-24052
> URL: https://issues.apache.org/jira/browse/SPARK-24052
> Project: Spark
>  Issue Type: Improvement
>  Components: Web UI
>Affects Versions: 2.3.0
>Reporter: zhoukang
>Priority: Major
> Attachments: environment-page.png
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-24052) Support spark version showing on environment page

2018-04-23 Thread zhoukang (JIRA)
zhoukang created SPARK-24052:


 Summary: Support spark version showing on environment page
 Key: SPARK-24052
 URL: https://issues.apache.org/jira/browse/SPARK-24052
 Project: Spark
  Issue Type: Improvement
  Components: Web UI
Affects Versions: 2.3.0
Reporter: zhoukang






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-10697) Lift Calculation in Association Rule mining

2018-04-23 Thread Mourits de Beer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447898#comment-16447898
 ] 

Mourits de Beer edited comment on SPARK-10697 at 4/23/18 10:16 AM:
---

Hello - is this still in progress?

We require the lift of an association rule for a production use case. We can 
solve it for our own end, but it would be more elegant to build the lift method 
into the AssociationRules class, and it feels like that should be fairly 
trivial. Surprising that it's been unresolved for 2.5 years, but, 
understandable.

If a pull request is still necessary, I'll be happy to figure it out.


was (Author: friendlyfire137):
Hello - is this still in progress?

We require the lift of an association rule for a production use case. We can 
solve it for our own end, but it would be more elegant to build the lift method 
into the AssociationRules class, and it feels like that should be fairly 
trivial. Surprising (yet understandable) that it's been unresolved for 2.5 
years.

If a pull request is still necessary, I'll be happy to figure it out.

> Lift Calculation in Association Rule mining
> ---
>
> Key: SPARK-10697
> URL: https://issues.apache.org/jira/browse/SPARK-10697
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Yashwanth Kumar
>Priority: Minor
>
> Lift is to be calculated for association rule mining in 
> AssociationRules.scala under FPM.
> Lift is a measure of the performance of an association rule.
> Adding lift will help to compare model efficiency.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10697) Lift Calculation in Association Rule mining

2018-04-23 Thread Mourits de Beer (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447898#comment-16447898
 ] 

Mourits de Beer commented on SPARK-10697:
-

Hello - is this still in progress?

We require the lift of an association rule for a production use case. We can 
solve it for our own end, but it would be more elegant to build the lift method 
into the AssociationRules class, and it feels like that should be fairly 
trivial. Surprising (yet understandable) that it's been unresolved for 2.5 
years.

If a pull request is still necessary, I'll be happy to figure it out.

> Lift Calculation in Association Rule mining
> ---
>
> Key: SPARK-10697
> URL: https://issues.apache.org/jira/browse/SPARK-10697
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Yashwanth Kumar
>Priority: Minor
>
> Lift is to be calculated for association rule mining in 
> AssociationRules.scala under FPM.
> Lift is a measure of the performance of an association rule.
> Adding lift will help to compare model efficiency.
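For reference, a minimal sketch of the lift definition itself, independent of the AssociationRules API (the names are illustrative only):
{code}
# lift(A => B) = confidence(A => B) / support(B)
#              = (freq(A and B) / freq(A)) / (freq(B) / num_transactions)
def lift(freq_ab, freq_a, freq_b, num_transactions):
    confidence = freq_ab / float(freq_a)
    support_b = freq_b / float(num_transactions)
    return confidence / support_b
{code}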



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2750) Add Https support for Web UI

2018-04-23 Thread palash gupta (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2750?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447880#comment-16447880
 ] 

palash gupta commented on SPARK-2750:
-

I am using Spark 1.6.3 for my current project. Is there any chance I can 
resolve this in version 1.6.3?

> Add Https support for Web UI
> 
>
> Key: SPARK-2750
> URL: https://issues.apache.org/jira/browse/SPARK-2750
> Project: Spark
>  Issue Type: New Feature
>  Components: Web UI
>Reporter: Tao Wang
>Assignee: Fei Wang
>Priority: Major
>  Labels: https, ssl, webui
> Fix For: 2.0.0
>
> Attachments: exception on yarn when https enabled.txt
>
>   Original Estimate: 96h
>  Remaining Estimate: 96h
>
> Now I am trying to add HTTPS support for the Web UI using the Jetty SSL 
> integration. Below is the plan:
> 1. The Web UI includes the Master UI, Worker UI, HistoryServer UI and Spark UI. 
> The user can switch between https and http by configuring "spark.http.policy" 
> as a JVM property for each process, with http chosen by default.
> 2. The web ports of the Master and Worker would be decided in order of launch 
> arguments, JVM property, system environment and default port.
> 3. Below are some other configuration items:
> {code}
> spark.ssl.server.keystore.location The file or URL of the SSL Key store
> spark.ssl.server.keystore.password  The password for the key store
> spark.ssl.server.keystore.keypassword The password (if any) for the specific 
> key within the key store
> spark.ssl.server.keystore.type The type of the key store (default "JKS")
> spark.client.https.need-auth True if SSL needs client authentication
> spark.ssl.server.truststore.location The file name or URL of the trust store 
> location
> spark.ssl.server.truststore.password The password for the trust store
> spark.ssl.server.truststore.type The type of the trust store (default "JKS")
> {code}
> Any feedback is welcome!
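Purely as an illustration of the items listed above (the property names are copied verbatim from this proposal and may not match what finally shipped; paths and passwords are placeholders):
{code}
from pyspark import SparkConf

# Sketch only: the proposal describes these as per-process JVM properties for the
# Master/Worker/HistoryServer UIs; they are collected in a SparkConf here just to
# list them in one place.
conf = (SparkConf()
        .set("spark.http.policy", "https")
        .set("spark.ssl.server.keystore.location", "/path/to/keystore.jks")
        .set("spark.ssl.server.keystore.password", "change-me")
        .set("spark.ssl.server.keystore.type", "JKS"))
{code}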



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24051) Incorrect results for certain queries using Java API on Spark 2.3.0

2018-04-23 Thread Emlyn Corrin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emlyn Corrin updated SPARK-24051:
-
Summary: Incorrect results for certain queries using Java API on Spark 
2.3.0  (was: Incorrect results for certain queries in Java API on Spark 2.3.0)

> Incorrect results for certain queries using Java API on Spark 2.3.0
> ---
>
> Key: SPARK-24051
> URL: https://issues.apache.org/jira/browse/SPARK-24051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Emlyn Corrin
>Priority: Major
>
> I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) 
> query, demonstrated by the Java program below. It was simplified from a much 
> more complex query, but I'm having trouble simplifying it further without 
> removing the erroneous behaviour.
> {code:java}
> package sparktest;
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.*;
> import org.apache.spark.sql.expressions.Window;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.Metadata;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> import java.util.Arrays;
> public class Main {
> public static void main(String[] args) {
> SparkConf conf = new SparkConf()
> .setAppName("SparkTest")
> .setMaster("local[*]");
> SparkSession session = 
> SparkSession.builder().config(conf).getOrCreate();
> Row[] arr1 = new Row[]{
> RowFactory.create(1, 42),
> RowFactory.create(2, 99)};
> StructType sch1 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty()),
> new StructField("b", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset<Row> ds1 = session.createDataFrame(Arrays.asList(arr1), sch1);
> ds1.show();
> Row[] arr2 = new Row[]{
> RowFactory.create(3)};
> StructType sch2 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset<Row> ds2 = session.createDataFrame(Arrays.asList(arr2), sch2)
> .withColumn("b", functions.lit(0));
> ds2.show();
> Column[] cols = new Column[]{
> new Column("a"),
> new Column("b").as("b"),
> functions.count(functions.lit(1))
> .over(Window.partitionBy())
> .as("n")};
> Dataset<Row> ds = ds1
> .select(cols)
> .union(ds2.select(cols))
> .where(new Column("n").geq(1))
> .drop("n");
> ds.show();
> //ds.explain(true);
> }
> }
> {code}
> It just calculates the union of 2 datasets,
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> +---+---+
> {code}
> with
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  3|  0|
> +---+---+
> {code}
> The expected result is:
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> |  3|  0|
> +---+---+
> {code}
> but instead it prints:
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  0|
> |  2|  0|
> |  3|  0|
> +---+---+
> {code}
> Notice how the value in column b is always zero, overriding the original 
> values in rows 1 and 2.
>  Making seemingly trivial changes, such as replacing {{new 
> Column("b").as("b"),}} with just {{new Column("b"),}} or removing the 
> {{where}} clause after the union, makes it behave correctly again.
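
As a hedged illustration of the workaround described above, the sketch below 
shows only the changed {{cols}} definition (dropping the redundant 
{{.as("b")}} alias); the rest of the program stays exactly as posted, and the 
reporter's observation suggests the union then returns the expected rows:
{code:java}
// Workaround sketch: same as the original program, except that column "b" is
// no longer re-aliased to itself.
Column[] cols = new Column[]{
        new Column("a"),
        new Column("b"),                 // was: new Column("b").as("b")
        functions.count(functions.lit(1))
                .over(Window.partitionBy())
                .as("n")};

Dataset<Row> ds = ds1
        .select(cols)
        .union(ds2.select(cols))
        .where(new Column("n").geq(1))
        .drop("n");
ds.show();  // expected: rows (1, 42), (2, 99) and (3, 0)
{code}
Alternatively, per the same observation, leaving the alias in place but 
dropping the {{where}} clause after the union is reported to avoid the problem 
as well.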



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-24051) Incorrect results for certain queries in Java API on Spark 2.3.0

2018-04-23 Thread Emlyn Corrin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447854#comment-16447854
 ] 

Emlyn Corrin commented on SPARK-24051:
--

[~hyukjin.kwon] I expanded the title a bit, but feel free to improve it further 
if you want.

> Incorrect results for certain queries in Java API on Spark 2.3.0
> 
>
> Key: SPARK-24051
> URL: https://issues.apache.org/jira/browse/SPARK-24051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Emlyn Corrin
>Priority: Major
>
> I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) 
> query, demonstrated by the Java program below. It was simplified from a much 
> more complex query, but I'm having trouble simplifying it further without 
> removing the erroneous behaviour.
> {code:java}
> package sparktest;
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.*;
> import org.apache.spark.sql.expressions.Window;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.Metadata;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> import java.util.Arrays;
> public class Main {
> public static void main(String[] args) {
> SparkConf conf = new SparkConf()
> .setAppName("SparkTest")
> .setMaster("local[*]");
> SparkSession session = 
> SparkSession.builder().config(conf).getOrCreate();
> Row[] arr1 = new Row[]{
> RowFactory.create(1, 42),
> RowFactory.create(2, 99)};
> StructType sch1 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty()),
> new StructField("b", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset<Row> ds1 = session.createDataFrame(Arrays.asList(arr1), sch1);
> ds1.show();
> Row[] arr2 = new Row[]{
> RowFactory.create(3)};
> StructType sch2 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset<Row> ds2 = session.createDataFrame(Arrays.asList(arr2), sch2)
> .withColumn("b", functions.lit(0));
> ds2.show();
> Column[] cols = new Column[]{
> new Column("a"),
> new Column("b").as("b"),
> functions.count(functions.lit(1))
> .over(Window.partitionBy())
> .as("n")};
> Dataset<Row> ds = ds1
> .select(cols)
> .union(ds2.select(cols))
> .where(new Column("n").geq(1))
> .drop("n");
> ds.show();
> //ds.explain(true);
> }
> }
> {code}
> It just calculates the union of 2 datasets,
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> +---+---+
> {code}
> with
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  3|  0|
> +---+---+
> {code}
> The expected result is:
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> |  3|  0|
> +---+---+
> {code}
> but instead it prints:
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  0|
> |  2|  0|
> |  3|  0|
> +---+---+
> {code}
> Notice how the value in column b is always zero, overriding the original 
> values in rows 1 and 2.
>  Making seemingly trivial changes, such as replacing {{new 
> Column("b").as("b"),}} with just {{new Column("b"),}} or removing the 
> {{where}} clause after the union, makes it behave correctly again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-24051) Incorrect results for certain queries in Java API on Spark 2.3.0

2018-04-23 Thread Emlyn Corrin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-24051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Emlyn Corrin updated SPARK-24051:
-
Summary: Incorrect results for certain queries in Java API on Spark 2.3.0  
(was: Incorrect results)

> Incorrect results for certain queries in Java API on Spark 2.3.0
> 
>
> Key: SPARK-24051
> URL: https://issues.apache.org/jira/browse/SPARK-24051
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Emlyn Corrin
>Priority: Major
>
> I'm seeing Spark 2.3.0 return incorrect results for a certain (very specific) 
> query, demonstrated by the Java program below. It was simplified from a much 
> more complex query, but I'm having trouble simplifying it further without 
> removing the erroneous behaviour.
> {code:java}
> package sparktest;
> import org.apache.spark.SparkConf;
> import org.apache.spark.sql.*;
> import org.apache.spark.sql.expressions.Window;
> import org.apache.spark.sql.types.DataTypes;
> import org.apache.spark.sql.types.Metadata;
> import org.apache.spark.sql.types.StructField;
> import org.apache.spark.sql.types.StructType;
> import java.util.Arrays;
> public class Main {
> public static void main(String[] args) {
> SparkConf conf = new SparkConf()
> .setAppName("SparkTest")
> .setMaster("local[*]");
> SparkSession session = 
> SparkSession.builder().config(conf).getOrCreate();
> Row[] arr1 = new Row[]{
> RowFactory.create(1, 42),
> RowFactory.create(2, 99)};
> StructType sch1 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty()),
> new StructField("b", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset<Row> ds1 = session.createDataFrame(Arrays.asList(arr1), sch1);
> ds1.show();
> Row[] arr2 = new Row[]{
> RowFactory.create(3)};
> StructType sch2 = new StructType(new StructField[]{
> new StructField("a", DataTypes.IntegerType, true, 
> Metadata.empty())});
> Dataset<Row> ds2 = session.createDataFrame(Arrays.asList(arr2), sch2)
> .withColumn("b", functions.lit(0));
> ds2.show();
> Column[] cols = new Column[]{
> new Column("a"),
> new Column("b").as("b"),
> functions.count(functions.lit(1))
> .over(Window.partitionBy())
> .as("n")};
> Dataset<Row> ds = ds1
> .select(cols)
> .union(ds2.select(cols))
> .where(new Column("n").geq(1))
> .drop("n");
> ds.show();
> //ds.explain(true);
> }
> }
> {code}
> It just calculates the union of 2 datasets,
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> +---+---+
> {code}
> with
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  3|  0|
> +---+---+
> {code}
> The expected result is:
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1| 42|
> |  2| 99|
> |  3|  0|
> +---+---+
> {code}
> but instead it prints:
> {code:java}
> +---+---+
> |  a|  b|
> +---+---+
> |  1|  0|
> |  2|  0|
> |  3|  0|
> +---+---+
> {code}
> Notice how the value in column b is always zero, overriding the original 
> values in rows 1 and 2.
>  Making seemingly trivial changes, such as replacing {{new 
> Column("b").as("b"),}} with just {{new Column("b"),}} or removing the 
> {{where}} clause after the union, makes it behave correctly again.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


