[jira] [Updated] (SPARK-23319) Skip PySpark tests for old Pandas and old PyArrow

2018-02-07 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-23319:
-
Fix Version/s: (was: 2.4.0)
   2.3.0

> Skip PySpark tests for old Pandas and old PyArrow
> -
>
> Key: SPARK-23319
> URL: https://issues.apache.org/jira/browse/SPARK-23319
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.3.0
>
>
> This is related to SPARK-23300. We currently require PyArrow >= 0.8.0 and
> Pandas >= 0.19.2. We should explicitly skip the tests when older versions are
> installed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23356) Pushes Project to both sides of Union when expression is non-deterministic

2018-02-07 Thread caoxuewen (JIRA)
caoxuewen created SPARK-23356:
-

 Summary: Pushes Project to both sides of Union when expression is 
non-deterministic
 Key: SPARK-23356
 URL: https://issues.apache.org/jira/browse/SPARK-23356
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: caoxuewen


Currently, the PushProjectionThroughUnion optimizer rule only pushes a Project
operator into both sides of a Union operator when its expressions are
deterministic. In fact, as with filter pushdown, we can also push the Project
into both sides of the Union when the expressions are non-deterministic. This
issue fixes that; with the change, the explain output looks like:

=== Applying Rule 
org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion ===
Input LogicalPlan:
Project [a#0, rand(10) AS rnd#9]
+- Union
   :- LocalRelation <empty>, [a#0, b#1, c#2]
   :- LocalRelation <empty>, [d#3, e#4, f#5]
   +- LocalRelation <empty>, [g#6, h#7, i#8]

Output LogicalPlan:
Project [a#0, rand(10) AS rnd#9]
+- Union
   :- Project [a#0]
   :  +- LocalRelation <empty>, [a#0, b#1, c#2]
   :- Project [d#3]
   :  +- LocalRelation <empty>, [d#3, e#4, f#5]
   +- Project [g#6]
      +- LocalRelation <empty>, [g#6, h#7, i#8]
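For reference, a minimal spark-shell sketch (the data values and the rand seed are illustrative, not taken from the issue) of a query that yields a plan of the shape shown above: three relations unioned together with a non-deterministic expression in the projection.

{code:scala}
import org.apache.spark.sql.functions.rand
import spark.implicits._

val df1 = Seq((1, 2, 3)).toDF("a", "b", "c")
val df2 = Seq((4, 5, 6)).toDF("d", "e", "f")
val df3 = Seq((7, 8, 9)).toDF("g", "h", "i")

// Today only deterministic projections are pushed through the Union; with the
// proposed extension the column-pruning Project is pushed into each Union child
// even though rand(10) in the top-level projection is non-deterministic.
df1.union(df2).union(df3)
  .select($"a", rand(10).as("rnd"))
  .explain(true)
{code}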



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20090) Add StructType.fieldNames to Python API

2018-02-07 Thread Hyukjin Kwon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356558#comment-16356558
 ] 

Hyukjin Kwon commented on SPARK-20090:
--

Sure, will open up a PR soon.

> Add StructType.fieldNames to Python API
> ---
>
> Key: SPARK-20090
> URL: https://issues.apache.org/jira/browse/SPARK-20090
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Joseph K. Bradley
>Assignee: Hyukjin Kwon
>Priority: Trivial
> Fix For: 2.3.0
>
>
> The Scala/Java API for {{StructType}} has a method {{fieldNames}}.  It would 
> be nice if the Python {{StructType}} did as well.
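For context, a minimal Scala sketch (the schema below is illustrative) of the existing method that the Python API would mirror:

{code:scala}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType)))

// fieldNames returns the field names in declaration order: Array("id", "name")
schema.fieldNames.foreach(println)
{code}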



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20090) Add StructType.fieldNames to Python API

2018-02-07 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356536#comment-16356536
 ] 

Reynold Xin commented on SPARK-20090:
-

Do you mind doing it? Thanks.




> Add StructType.fieldNames to Python API
> ---
>
> Key: SPARK-20090
> URL: https://issues.apache.org/jira/browse/SPARK-20090
> Project: Spark
>  Issue Type: New Feature
>  Components: PySpark, SQL
>Affects Versions: 2.1.0
>Reporter: Joseph K. Bradley
>Assignee: Hyukjin Kwon
>Priority: Trivial
> Fix For: 2.3.0
>
>
> The Scala/Java API for {{StructType}} has a method {{fieldNames}}.  It would 
> be nice if the Python {{StructType}} did as well.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22446) Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" exception incorrectly for filtered data.

2018-02-07 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356525#comment-16356525
 ] 

Liang-Chi Hsieh commented on SPARK-22446:
-

2.0 and 2.1 also have this issue.

> Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" 
> exception incorrectly for filtered data.
> ---
>
> Key: SPARK-22446
> URL: https://issues.apache.org/jira/browse/SPARK-22446
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Optimizer
>Affects Versions: 2.0.2, 2.1.2, 2.2.1
> Environment: spark-shell, local mode, macOS Sierra 10.12.6
>Reporter: Greg Bellchambers
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.3.0
>
>
> In the following, the `indexer` UDF defined inside the 
> `org.apache.spark.ml.feature.StringIndexerModel.transform` method throws an 
> "Unseen label" error, despite the label not being present in the transformed 
> DataFrame.
> Here is the definition of the indexer UDF in the transform method:
> {code:java}
> val indexer = udf { label: String =>
>   if (labelToIndex.contains(label)) {
> labelToIndex(label)
>   } else {
> throw new SparkException(s"Unseen label: $label.")
>   }
> }
> {code}
> We can demonstrate the error with a very simple example DataFrame.
> {code:java}
> scala> import org.apache.spark.ml.feature.StringIndexer
> import org.apache.spark.ml.feature.StringIndexer
> scala> // first we create a DataFrame with three cities
> scala> val df = List(
>  | ("A", "London", "StrA"),
>  | ("B", "Bristol", null),
>  | ("C", "New York", "StrC")
>  | ).toDF("ID", "CITY", "CONTENT")
> df: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 1 more 
> field]
> scala> df.show
> +---+--------+-------+
> | ID|    CITY|CONTENT|
> +---+--------+-------+
> |  A|  London|   StrA|
> |  B| Bristol|   null|
> |  C|New York|   StrC|
> +---+--------+-------+
> scala> // then we remove the row with null in CONTENT column, which removes 
> Bristol
> scala> val dfNoBristol = df.filter($"CONTENT".isNotNull)
> dfNoBristol: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ID: 
> string, CITY: string ... 1 more field]
> scala> dfNoBristol.show
> +---+--------+-------+
> | ID|    CITY|CONTENT|
> +---+--------+-------+
> |  A|  London|   StrA|
> |  C|New York|   StrC|
> +---+--------+-------+
> scala> // now create a StringIndexer for the CITY column and fit to 
> dfNoBristol
> scala> val model = {
>  | new StringIndexer()
>  | .setInputCol("CITY")
>  | .setOutputCol("CITYIndexed")
>  | .fit(dfNoBristol)
>  | }
> model: org.apache.spark.ml.feature.StringIndexerModel = strIdx_f5afa2fb
> scala> // the StringIndexerModel has only two labels: "London" and "New York"
> scala> model.labels foreach println
> London
> New York
> scala> // transform our DataFrame to add an index column
> scala> val dfWithIndex = model.transform(dfNoBristol)
> dfWithIndex: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 2 
> more fields]
> scala> dfWithIndex.show
> +---+--------+-------+-----------+
> | ID|    CITY|CONTENT|CITYIndexed|
> +---+--------+-------+-----------+
> |  A|  London|   StrA|        0.0|
> |  C|New York|   StrC|        1.0|
> +---+--------+-------+-----------+
> {code}
> The unexpected behaviour comes when we filter `dfWithIndex` for `CITYIndexed`
> equal to 1.0 and perform an action. The `indexer` UDF in the `transform` method
> throws an exception reporting the unseen label "Bristol". This is irrational
> behaviour as far as the user of the API is concerned, because no such value as
> "Bristol" appears when we show all rows of `dfWithIndex`:
> {code:java}
> scala> dfWithIndex.filter($"CITYIndexed" === 1.0).count
> 17/11/04 00:33:41 ERROR Executor: Exception in task 1.0 in stage 20.0 (TID 40)
> org.apache.spark.SparkException: Failed to execute user defined 
> function($anonfun$5: (string) => double)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
>   

[jira] [Comment Edited] (SPARK-23139) Read eventLog file with mixed encodings

2018-02-07 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356507#comment-16356507
 ] 

Jiang Xingbo edited comment on SPARK-23139 at 2/8/18 5:25 AM:
--


{quote}EventLog may contain mixed encodings such as custom exception 
message{quote}


Could you please elaborate on how this happened?


was (Author: jiangxb1987):
```
EventLog may contain mixed encodings such as custom exception message
```

Could you please elaborate on how this happened?

> Read eventLog file with mixed encodings
> ---
>
> Key: SPARK-23139
> URL: https://issues.apache.org/jira/browse/SPARK-23139
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: DENG FEI
>Priority: Major
>
> An event log may contain mixed encodings, for example in a custom exception
> message, which causes replay to fail.
> java.nio.charset.MalformedInputException: Input length = 1
> at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
>  at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
>  at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>  at java.io.InputStreamReader.read(InputStreamReader.java:184)
>  at java.io.BufferedReader.fill(BufferedReader.java:161)
>  at java.io.BufferedReader.readLine(BufferedReader.java:324)
>  at java.io.BufferedReader.readLine(BufferedReader.java:389)
>  at 
> scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72)
>  at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:836)
>  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>  at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:78)
>  at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
>  at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:694)
>  at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:507)
>  at 
> org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$4$$anon$4.run(FsHistoryProvider.scala:399)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23139) Read eventLog file with mixed encodings

2018-02-07 Thread Jiang Xingbo (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356507#comment-16356507
 ] 

Jiang Xingbo commented on SPARK-23139:
--

```
EventLog may contain mixed encodings such as custom exception message
```

Could you please elaborate on how this happened?

> Read eventLog file with mixed encodings
> ---
>
> Key: SPARK-23139
> URL: https://issues.apache.org/jira/browse/SPARK-23139
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: DENG FEI
>Priority: Major
>
> An event log may contain mixed encodings, for example in a custom exception
> message, which causes replay to fail.
> java.nio.charset.MalformedInputException: Input length = 1
> at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
>  at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
>  at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>  at java.io.InputStreamReader.read(InputStreamReader.java:184)
>  at java.io.BufferedReader.fill(BufferedReader.java:161)
>  at java.io.BufferedReader.readLine(BufferedReader.java:324)
>  at java.io.BufferedReader.readLine(BufferedReader.java:389)
>  at 
> scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72)
>  at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:836)
>  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>  at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:78)
>  at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
>  at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:694)
>  at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:507)
>  at 
> org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$4$$anon$4.run(FsHistoryProvider.scala:399)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22446) Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" exception incorrectly for filtered data.

2018-02-07 Thread Liang-Chi Hsieh (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356506#comment-16356506
 ] 

Liang-Chi Hsieh commented on SPARK-22446:
-

Yes, this is an issue in Spark 2.2. For earlier versions, let me check.

> Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" 
> exception incorrectly for filtered data.
> ---
>
> Key: SPARK-22446
> URL: https://issues.apache.org/jira/browse/SPARK-22446
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Optimizer
>Affects Versions: 2.0.2, 2.1.2, 2.2.1
> Environment: spark-shell, local mode, macOS Sierra 10.12.6
>Reporter: Greg Bellchambers
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.3.0
>
>
> In the following, the `indexer` UDF defined inside the 
> `org.apache.spark.ml.feature.StringIndexerModel.transform` method throws an 
> "Unseen label" error, despite the label not being present in the transformed 
> DataFrame.
> Here is the definition of the indexer UDF in the transform method:
> {code:java}
> val indexer = udf { label: String =>
>   if (labelToIndex.contains(label)) {
> labelToIndex(label)
>   } else {
> throw new SparkException(s"Unseen label: $label.")
>   }
> }
> {code}
> We can demonstrate the error with a very simple example DataFrame.
> {code:java}
> scala> import org.apache.spark.ml.feature.StringIndexer
> import org.apache.spark.ml.feature.StringIndexer
> scala> // first we create a DataFrame with three cities
> scala> val df = List(
>  | ("A", "London", "StrA"),
>  | ("B", "Bristol", null),
>  | ("C", "New York", "StrC")
>  | ).toDF("ID", "CITY", "CONTENT")
> df: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 1 more 
> field]
> scala> df.show
> +---+--------+-------+
> | ID|    CITY|CONTENT|
> +---+--------+-------+
> |  A|  London|   StrA|
> |  B| Bristol|   null|
> |  C|New York|   StrC|
> +---+--------+-------+
> scala> // then we remove the row with null in CONTENT column, which removes 
> Bristol
> scala> val dfNoBristol = df.filter($"CONTENT".isNotNull)
> dfNoBristol: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ID: 
> string, CITY: string ... 1 more field]
> scala> dfNoBristol.show
> +---+--------+-------+
> | ID|    CITY|CONTENT|
> +---+--------+-------+
> |  A|  London|   StrA|
> |  C|New York|   StrC|
> +---+--------+-------+
> scala> // now create a StringIndexer for the CITY column and fit to 
> dfNoBristol
> scala> val model = {
>  | new StringIndexer()
>  | .setInputCol("CITY")
>  | .setOutputCol("CITYIndexed")
>  | .fit(dfNoBristol)
>  | }
> model: org.apache.spark.ml.feature.StringIndexerModel = strIdx_f5afa2fb
> scala> // the StringIndexerModel has only two labels: "London" and "New York"
> scala> model.labels foreach println
> London
> New York
> scala> // transform our DataFrame to add an index column
> scala> val dfWithIndex = model.transform(dfNoBristol)
> dfWithIndex: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 2 
> more fields]
> scala> dfWithIndex.show
> +---+--------+-------+-----------+
> | ID|    CITY|CONTENT|CITYIndexed|
> +---+--------+-------+-----------+
> |  A|  London|   StrA|        0.0|
> |  C|New York|   StrC|        1.0|
> +---+--------+-------+-----------+
> {code}
> The unexpected behaviour comes when we filter `dfWithIndex` for `CITYIndexed`
> equal to 1.0 and perform an action. The `indexer` UDF in the `transform` method
> throws an exception reporting the unseen label "Bristol". This is irrational
> behaviour as far as the user of the API is concerned, because no such value as
> "Bristol" appears when we show all rows of `dfWithIndex`:
> {code:java}
> scala> dfWithIndex.filter($"CITYIndexed" === 1.0).count
> 17/11/04 00:33:41 ERROR Executor: Exception in task 1.0 in stage 20.0 (TID 40)
> org.apache.spark.SparkException: Failed to execute user defined 
> function($anonfun$5: (string) => double)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> 

[jira] [Commented] (SPARK-22700) Bucketizer.transform incorrectly drops row containing NaN

2018-02-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356441#comment-16356441
 ] 

Apache Spark commented on SPARK-22700:
--

User 'zhengruifeng' has created a pull request for this issue:
https://github.com/apache/spark/pull/20539

> Bucketizer.transform incorrectly drops row containing NaN
> -
>
> Key: SPARK-22700
> URL: https://issues.apache.org/jira/browse/SPARK-22700
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0, 2.3.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
> Fix For: 2.3.0
>
>
> {code}
> import org.apache.spark.ml.feature._
> val df = spark.createDataFrame(Seq((2.3, 3.0), (Double.NaN, 3.0), (6.7, 
> Double.NaN))).toDF("a", "b")
> val splits = Array(Double.NegativeInfinity, 3.0, Double.PositiveInfinity)
> val bucketizer: Bucketizer = new 
> Bucketizer().setInputCol("a").setOutputCol("aa").setSplits(splits)
> bucketizer.setHandleInvalid("skip")
> scala> df.show
> +---+---+
> |  a|  b|
> +---+---+
> |2.3|3.0|
> |NaN|3.0|
> |6.7|NaN|
> +---+---+
> scala> bucketizer.transform(df).show
> +---+---+---+
> |  a|  b| aa|
> +---+---+---+
> |2.3|3.0|0.0|
> +---+---+---+
> {code}
> When {{handleInvalid}} is set to {{skip}}, the last input row is incorrectly
> dropped, even though column 'b' is not an input column.
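For clarity, a small sketch of the intended {{skip}} semantics (reusing {{df}} and {{bucketizer}} from the snippet above; this expresses the expected behaviour, it is not the proposed fix): only NaN in the input column "a" should cause a row to be dropped, so the (6.7, NaN) row should survive.

{code:scala}
import org.apache.spark.sql.functions.col

// "skip" should only consider the input column "a":
// (2.3, 3.0) and (6.7, NaN) are kept; (NaN, 3.0) is dropped because "a" is NaN.
val expected = df.filter(!col("a").isNaN)
bucketizer.transform(expected).show()
{code}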



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23139) Read eventLog file with mixed encodings

2018-02-07 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356440#comment-16356440
 ] 

Marcelo Vanzin commented on SPARK-23139:


bq. ASCII is enough for the Spark event log.

No it's not.

bq. And if we force writing with UTF-8, we should also force reading with
UTF-8.

I'm not saying *if*, I'm saying that that's the expectation, and if that's not 
happening, it's a bug that needs to be fixed.
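For illustration only (this is not the actual FsHistoryProvider code, and the path is made up), a sketch of what reading with an explicit UTF-8 decoder can look like, including the option of replacing rather than throwing on malformed byte sequences:

{code:scala}
import java.io.FileInputStream
import java.nio.charset.{CodingErrorAction, StandardCharsets}
import scala.io.{Codec, Source}

// Decode as UTF-8 and substitute the replacement character for bytes that are
// not valid UTF-8 (e.g. an exception message written in another encoding), so a
// single bad line does not abort the replay with MalformedInputException.
val codec = Codec(StandardCharsets.UTF_8)
  .onMalformedInput(CodingErrorAction.REPLACE)
  .onUnmappableCharacter(CodingErrorAction.REPLACE)

val source = Source.fromInputStream(new FileInputStream("/tmp/eventlog"))(codec)
try source.getLines().foreach(println)
finally source.close()
{code}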

> Read eventLog file with mixed encodings
> ---
>
> Key: SPARK-23139
> URL: https://issues.apache.org/jira/browse/SPARK-23139
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: DENG FEI
>Priority: Major
>
> An event log may contain mixed encodings, for example in a custom exception
> message, which causes replay to fail.
> java.nio.charset.MalformedInputException: Input length = 1
> at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
>  at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
>  at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>  at java.io.InputStreamReader.read(InputStreamReader.java:184)
>  at java.io.BufferedReader.fill(BufferedReader.java:161)
>  at java.io.BufferedReader.readLine(BufferedReader.java:324)
>  at java.io.BufferedReader.readLine(BufferedReader.java:389)
>  at 
> scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72)
>  at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:836)
>  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>  at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:78)
>  at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
>  at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:694)
>  at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:507)
>  at 
> org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$4$$anon$4.run(FsHistoryProvider.scala:399)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22700) Bucketizer.transform incorrectly drops row containing NaN

2018-02-07 Thread zhengruifeng (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356431#comment-16356431
 ] 

zhengruifeng commented on SPARK-22700:
--

[~WeichenXu123] I have checked the others, and they seem OK.

> Bucketizer.transform incorrectly drops row containing NaN
> -
>
> Key: SPARK-22700
> URL: https://issues.apache.org/jira/browse/SPARK-22700
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0, 2.3.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
> Fix For: 2.3.0
>
>
> {code}
> import org.apache.spark.ml.feature._
> val df = spark.createDataFrame(Seq((2.3, 3.0), (Double.NaN, 3.0), (6.7, 
> Double.NaN))).toDF("a", "b")
> val splits = Array(Double.NegativeInfinity, 3.0, Double.PositiveInfinity)
> val bucketizer: Bucketizer = new 
> Bucketizer().setInputCol("a").setOutputCol("aa").setSplits(splits)
> bucketizer.setHandleInvalid("skip")
> scala> df.show
> +---+---+
> |  a|  b|
> +---+---+
> |2.3|3.0|
> |NaN|3.0|
> |6.7|NaN|
> +---+---+
> scala> bucketizer.transform(df).show
> +---+---+---+
> |  a|  b| aa|
> +---+---+---+
> |2.3|3.0|0.0|
> +---+---+---+
> {code}
> When {{handleInvalid}} is set to {{skip}}, the last input row is incorrectly
> dropped, even though column 'b' is not an input column.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23139) Read eventLog file with mixed encodings

2018-02-07 Thread DENG FEI (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356428#comment-16356428
 ] 

DENG FEI commented on SPARK-23139:
--

_ASCII_ is enough for the Spark event log.

And if we force writing with UTF-8, we should also force reading with UTF-8.

> Read eventLog file with mixed encodings
> ---
>
> Key: SPARK-23139
> URL: https://issues.apache.org/jira/browse/SPARK-23139
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: DENG FEI
>Priority: Major
>
> An event log may contain mixed encodings, for example in a custom exception
> message, which causes replay to fail.
> java.nio.charset.MalformedInputException: Input length = 1
> at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
>  at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
>  at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>  at java.io.InputStreamReader.read(InputStreamReader.java:184)
>  at java.io.BufferedReader.fill(BufferedReader.java:161)
>  at java.io.BufferedReader.readLine(BufferedReader.java:324)
>  at java.io.BufferedReader.readLine(BufferedReader.java:389)
>  at 
> scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72)
>  at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:836)
>  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>  at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:78)
>  at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
>  at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:694)
>  at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:507)
>  at 
> org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$4$$anon$4.run(FsHistoryProvider.scala:399)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23319) Skip PySpark tests for old Pandas and old PyArrow

2018-02-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356359#comment-16356359
 ] 

Apache Spark commented on SPARK-23319:
--

User 'ueshin' has created a pull request for this issue:
https://github.com/apache/spark/pull/20538

> Skip PySpark tests for old Pandas and old PyArrow
> -
>
> Key: SPARK-23319
> URL: https://issues.apache.org/jira/browse/SPARK-23319
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.0
>
>
> This is related to SPARK-23300. We currently require PyArrow >= 0.8.0 and
> Pandas >= 0.19.2. We should explicitly skip the tests when older versions are
> installed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22700) Bucketizer.transform incorrectly drops row containing NaN

2018-02-07 Thread Weichen Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356354#comment-16356354
 ] 

Weichen Xu commented on SPARK-22700:


[~podongfeng] Have you checked the other transformers that have a handleInvalid
option for a similar issue?

> Bucketizer.transform incorrectly drops row containing NaN
> -
>
> Key: SPARK-22700
> URL: https://issues.apache.org/jira/browse/SPARK-22700
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0, 2.3.0
>Reporter: zhengruifeng
>Assignee: zhengruifeng
>Priority: Major
> Fix For: 2.3.0
>
>
> {code}
> import org.apache.spark.ml.feature._
> val df = spark.createDataFrame(Seq((2.3, 3.0), (Double.NaN, 3.0), (6.7, 
> Double.NaN))).toDF("a", "b")
> val splits = Array(Double.NegativeInfinity, 3.0, Double.PositiveInfinity)
> val bucketizer: Bucketizer = new 
> Bucketizer().setInputCol("a").setOutputCol("aa").setSplits(splits)
> bucketizer.setHandleInvalid("skip")
> scala> df.show
> +---+---+
> |  a|  b|
> +---+---+
> |2.3|3.0|
> |NaN|3.0|
> |6.7|NaN|
> +---+---+
> scala> bucketizer.transform(df).show
> +---+---+---+
> |  a|  b| aa|
> +---+---+---+
> |2.3|3.0|0.0|
> +---+---+---+
> {code}
> When {{handleInvalid}} is set to {{skip}}, the last input row is incorrectly
> dropped, even though column 'b' is not an input column.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22289) Cannot save LogisticRegressionModel with bounds on coefficients

2018-02-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22289:
--
Summary: Cannot save LogisticRegressionModel with bounds on coefficients  
(was: Cannot save LogisticRegressionClassificationModel with bounds on 
coefficients)

> Cannot save LogisticRegressionModel with bounds on coefficients
> ---
>
> Key: SPARK-22289
> URL: https://issues.apache.org/jira/browse/SPARK-22289
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Nic Eggert
>Assignee: yuhao yang
>Priority: Major
> Fix For: 2.2.2, 2.3.0
>
>
> I think this was introduced in SPARK-20047.
> Trying to call save on a logistic regression model trained with bounds on its 
> parameters throws an error. This seems to be because Spark doesn't know how 
> to serialize the Matrix parameter.
> Model is set up like this:
> {code}
> val calibrator = new LogisticRegression()
>   .setFeaturesCol("uncalibrated_probability")
>   .setLabelCol("label")
>   .setWeightCol("weight")
>   .setStandardization(false)
>   .setLowerBoundsOnCoefficients(new DenseMatrix(1, 1, Array(0.0)))
>   .setFamily("binomial")
>   .setProbabilityCol("probability")
>   .setPredictionCol("logistic_prediction")
>   .setRawPredictionCol("logistic_raw_prediction")
> {code}
> {code}
> 17/10/16 15:36:59 ERROR ApplicationMaster: User class threw exception: 
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
> scala.NotImplementedError: The default jsonEncode only supports string and 
> vector. org.apache.spark.ml.param.Param must override jsonEncode for 
> org.apache.spark.ml.linalg.DenseMatrix.
>   at org.apache.spark.ml.param.Param.jsonEncode(params.scala:98)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:296)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:295)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.getMetadataToSave(ReadWrite.scala:295)
>   at 
> org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:277)
>   at 
> org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelWriter.saveImpl(LogisticRegression.scala:1182)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:254)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:253)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at 
> org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:253)
>   at 
> org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:337)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   -snip-
> {code}
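For completeness, the step that actually triggers the error above is saving the fitted model; a minimal sketch ({{training}} and the path are placeholders, not from the issue):

{code:scala}
// `training` stands in for any DataFrame with the columns configured above.
val model = calibrator.fit(training)
// The save call is what reaches Param.jsonEncode for the DenseMatrix bound and fails.
model.write.overwrite().save("/tmp/bounded-lr-model")
{code}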



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22446) Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" exception incorrectly for filtered data.

2018-02-07 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356340#comment-16356340
 ] 

Joseph K. Bradley commented on SPARK-22446:
---

[~viirya] Did you confirm this is an issue in Spark 2.2 or earlier?

> Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" 
> exception incorrectly for filtered data.
> ---
>
> Key: SPARK-22446
> URL: https://issues.apache.org/jira/browse/SPARK-22446
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Optimizer
>Affects Versions: 2.0.2, 2.1.2, 2.2.1
> Environment: spark-shell, local mode, macOS Sierra 10.12.6
>Reporter: Greg Bellchambers
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.3.0
>
>
> In the following, the `indexer` UDF defined inside the 
> `org.apache.spark.ml.feature.StringIndexerModel.transform` method throws an 
> "Unseen label" error, despite the label not being present in the transformed 
> DataFrame.
> Here is the definition of the indexer UDF in the transform method:
> {code:java}
> val indexer = udf { label: String =>
>   if (labelToIndex.contains(label)) {
> labelToIndex(label)
>   } else {
> throw new SparkException(s"Unseen label: $label.")
>   }
> }
> {code}
> We can demonstrate the error with a very simple example DataFrame.
> {code:java}
> scala> import org.apache.spark.ml.feature.StringIndexer
> import org.apache.spark.ml.feature.StringIndexer
> scala> // first we create a DataFrame with three cities
> scala> val df = List(
>  | ("A", "London", "StrA"),
>  | ("B", "Bristol", null),
>  | ("C", "New York", "StrC")
>  | ).toDF("ID", "CITY", "CONTENT")
> df: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 1 more 
> field]
> scala> df.show
> +---+--------+-------+
> | ID|    CITY|CONTENT|
> +---+--------+-------+
> |  A|  London|   StrA|
> |  B| Bristol|   null|
> |  C|New York|   StrC|
> +---+--------+-------+
> scala> // then we remove the row with null in CONTENT column, which removes 
> Bristol
> scala> val dfNoBristol = df.filter($"CONTENT".isNotNull)
> dfNoBristol: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ID: 
> string, CITY: string ... 1 more field]
> scala> dfNoBristol.show
> +---+--------+-------+
> | ID|    CITY|CONTENT|
> +---+--------+-------+
> |  A|  London|   StrA|
> |  C|New York|   StrC|
> +---+--------+-------+
> scala> // now create a StringIndexer for the CITY column and fit to 
> dfNoBristol
> scala> val model = {
>  | new StringIndexer()
>  | .setInputCol("CITY")
>  | .setOutputCol("CITYIndexed")
>  | .fit(dfNoBristol)
>  | }
> model: org.apache.spark.ml.feature.StringIndexerModel = strIdx_f5afa2fb
> scala> // the StringIndexerModel has only two labels: "London" and "New York"
> scala> model.labels foreach println
> London
> New York
> scala> // transform our DataFrame to add an index column
> scala> val dfWithIndex = model.transform(dfNoBristol)
> dfWithIndex: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 2 
> more fields]
> scala> dfWithIndex.show
> +---+--------+-------+-----------+
> | ID|    CITY|CONTENT|CITYIndexed|
> +---+--------+-------+-----------+
> |  A|  London|   StrA|        0.0|
> |  C|New York|   StrC|        1.0|
> +---+--------+-------+-----------+
> {code}
> The unexpected behaviour comes when we filter `dfWithIndex` for `CITYIndexed`
> equal to 1.0 and perform an action. The `indexer` UDF in the `transform` method
> throws an exception reporting the unseen label "Bristol". This is irrational
> behaviour as far as the user of the API is concerned, because no such value as
> "Bristol" appears when we show all rows of `dfWithIndex`:
> {code:java}
> scala> dfWithIndex.filter($"CITYIndexed" === 1.0).count
> 17/11/04 00:33:41 ERROR Executor: Exception in task 1.0 in stage 20.0 (TID 40)
> org.apache.spark.SparkException: Failed to execute user defined 
> function($anonfun$5: (string) => double)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at 
> 

[jira] [Updated] (SPARK-22446) Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" exception incorrectly for filtered data.

2018-02-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-22446:
--
Affects Version/s: (was: 2.2.0)
   (was: 2.0.0)
   2.0.2
   2.1.2
   2.2.1

> Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" 
> exception incorrectly for filtered data.
> ---
>
> Key: SPARK-22446
> URL: https://issues.apache.org/jira/browse/SPARK-22446
> Project: Spark
>  Issue Type: Bug
>  Components: ML, Optimizer
>Affects Versions: 2.0.2, 2.1.2, 2.2.1
> Environment: spark-shell, local mode, macOS Sierra 10.12.6
>Reporter: Greg Bellchambers
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.3.0
>
>
> In the following, the `indexer` UDF defined inside the 
> `org.apache.spark.ml.feature.StringIndexerModel.transform` method throws an 
> "Unseen label" error, despite the label not being present in the transformed 
> DataFrame.
> Here is the definition of the indexer UDF in the transform method:
> {code:java}
> val indexer = udf { label: String =>
>   if (labelToIndex.contains(label)) {
> labelToIndex(label)
>   } else {
> throw new SparkException(s"Unseen label: $label.")
>   }
> }
> {code}
> We can demonstrate the error with a very simple example DataFrame.
> {code:java}
> scala> import org.apache.spark.ml.feature.StringIndexer
> import org.apache.spark.ml.feature.StringIndexer
> scala> // first we create a DataFrame with three cities
> scala> val df = List(
>  | ("A", "London", "StrA"),
>  | ("B", "Bristol", null),
>  | ("C", "New York", "StrC")
>  | ).toDF("ID", "CITY", "CONTENT")
> df: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 1 more 
> field]
> scala> df.show
> +---+--------+-------+
> | ID|    CITY|CONTENT|
> +---+--------+-------+
> |  A|  London|   StrA|
> |  B| Bristol|   null|
> |  C|New York|   StrC|
> +---+--------+-------+
> scala> // then we remove the row with null in CONTENT column, which removes 
> Bristol
> scala> val dfNoBristol = df.filter($"CONTENT".isNotNull)
> dfNoBristol: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ID: 
> string, CITY: string ... 1 more field]
> scala> dfNoBristol.show
> +---+--------+-------+
> | ID|    CITY|CONTENT|
> +---+--------+-------+
> |  A|  London|   StrA|
> |  C|New York|   StrC|
> +---+--------+-------+
> scala> // now create a StringIndexer for the CITY column and fit to 
> dfNoBristol
> scala> val model = {
>  | new StringIndexer()
>  | .setInputCol("CITY")
>  | .setOutputCol("CITYIndexed")
>  | .fit(dfNoBristol)
>  | }
> model: org.apache.spark.ml.feature.StringIndexerModel = strIdx_f5afa2fb
> scala> // the StringIndexerModel has only two labels: "London" and "New York"
> scala> model.labels foreach println
> London
> New York
> scala> // transform our DataFrame to add an index column
> scala> val dfWithIndex = model.transform(dfNoBristol)
> dfWithIndex: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 2 
> more fields]
> scala> dfWithIndex.show
> +---+--------+-------+-----------+
> | ID|    CITY|CONTENT|CITYIndexed|
> +---+--------+-------+-----------+
> |  A|  London|   StrA|        0.0|
> |  C|New York|   StrC|        1.0|
> +---+--------+-------+-----------+
> {code}
> The unexpected behaviour comes when we filter `dfWithIndex` for `CITYIndexed`
> equal to 1.0 and perform an action. The `indexer` UDF in the `transform` method
> throws an exception reporting the unseen label "Bristol". This is irrational
> behaviour as far as the user of the API is concerned, because no such value as
> "Bristol" appears when we show all rows of `dfWithIndex`:
> {code:java}
> scala> dfWithIndex.filter($"CITYIndexed" === 1.0).count
> 17/11/04 00:33:41 ERROR Executor: Exception in task 1.0 in stage 20.0 (TID 40)
> org.apache.spark.SparkException: Failed to execute user defined 
> function($anonfun$5: (string) => double)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown
>  Source)
>   at 
> org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
>  Source)
>   at 
> org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at 
> org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at 
> org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>  

[jira] [Commented] (SPARK-23105) Spark MLlib, GraphX 2.3 QA umbrella

2018-02-07 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356303#comment-16356303
 ] 

Joseph K. Bradley commented on SPARK-23105:
---

Sorry for being AWOL; I unfortunately had to be away from work for a couple of 
weeks.  Thanks so much everyone for carrying through with QA!

> Spark MLlib, GraphX 2.3 QA umbrella
> ---
>
> Key: SPARK-23105
> URL: https://issues.apache.org/jira/browse/SPARK-23105
> Project: Spark
>  Issue Type: Umbrella
>  Components: Documentation, GraphX, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>Priority: Critical
> Fix For: 2.3.0
>
>
> This JIRA lists tasks for the next Spark release's QA period for MLlib and 
> GraphX. *SparkR is separate: SPARK-23114.*
> The list below gives an overview of what is involved, and the corresponding 
> JIRA issues are linked below that.
> h2. API
>  * Check binary API compatibility for Scala/Java
>  * Audit new public APIs (from the generated html doc)
>  ** Scala
>  ** Java compatibility
>  ** Python coverage
>  * Check Experimental, DeveloperApi tags
> h2. Algorithms and performance
>  * Performance tests
> h2. Documentation and example code
>  * For new algorithms, create JIRAs for updating the user guide sections & 
> examples
>  * Update Programming Guide
>  * Update website



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23308) ignoreCorruptFiles should not ignore retryable IOException

2018-02-07 Thread Márcio Furlani Carmona (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356271#comment-16356271
 ] 

Márcio Furlani Carmona commented on SPARK-23308:


That's true, Steve. I totally agree that it'd be hard to identify what is
retryable from Spark's perspective. That being said, I agree that the FS should
be responsible for that decision.

I believe one option would be to leave the retry responsibility to the FS
implementations (as it already seems to be) and to document this flag, making
clear that you might experience data loss. Another option would be to create a
special exception (CorruptedFileException?) that could be thrown by FS
implementations, letting them decide whether a file is corrupted or whether the
error is just transient.

A more complex approach would be a file blacklist mechanism rather than this
flag, similar to the `spark.blacklist.*` feature, where you decide how many
times to retry a file before considering it corrupted; then you don't need to
decide what is worth retrying. The downside is that you'll always retry, even
when there's no point in retrying.
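To make the second option concrete, a rough sketch (the exception class and the helper are hypothetical, not an existing Spark or Hadoop API): the FS implementation decides what is genuinely corrupt, and ignoreCorruptFiles only swallows that, so retryable errors such as SocketTimeoutException still propagate.

{code:scala}
import java.io.IOException

// Hypothetical marker exception a FileSystem implementation would throw once it
// has decided, after its own retries, that a file is truly corrupted.
class CorruptedFileException(path: String, cause: Throwable)
  extends IOException(s"Corrupted file: $path", cause)

// Hypothetical reader-side handling: only the marker exception is ignored;
// any other IOException (e.g. SocketTimeoutException) propagates and can be retried.
def readIgnoringCorrupt[T](path: String)(read: String => T): Option[T] =
  try Some(read(path)) catch {
    case _: CorruptedFileException => None // the FS says the file is corrupt: skip it
  }
{code}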

 

*Regarding the socket timeouts:* I also believe it's some kind of throttling.

I'm reading over 80k files with over 10TB of data. The file sizes are not 
uniform, so some files may be read in a single request, while others might get split
into multiple partitions. Considering I'm not overriding the default 
`spark.files.maxPartitionBytes` value of `128 MB`, I should get at least ~82k 
partitions, thus the same number of S3 requests. Also, the files share some 
common prefixes, which [might be bad for S3 index 
access|https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html].

I have a total of 120 executors, with 9 cores each, across 30 workers. But 
since each task involves some good computational time, the median task 
execution time is 8s, so I don't think we're getting close to the 100 TPS for a 
common prefix mentioned in the S3 documentation above.

So, I'm more inclined to say it might be some EC2 network throttling rather 
than S3 throttling. Another reason to believe on that is because I've seen some 
[503 Slow 
Down|https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html] 
errors from S3 in the past when my requests got throttled, which I'm not seeing 
this time.

> ignoreCorruptFiles should not ignore retryable IOException
> --
>
> Key: SPARK-23308
> URL: https://issues.apache.org/jira/browse/SPARK-23308
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1
>Reporter: Márcio Furlani Carmona
>Priority: Minor
>
> When `spark.sql.files.ignoreCorruptFiles` is set, Spark ignores any kind of
> RuntimeException or IOException, but some IOExceptions may occur even if the
> file is not corrupted.
> One example is the SocketTimeoutException which can be retried to possibly 
> fetch the data without meaning the data is corrupted.
>  
> See: 
> https://github.com/apache/spark/blob/e30e2698a2193f0bbdcd4edb884710819ab6397c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L163



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23300) Print out if Pandas and PyArrow are installed or not in tests

2018-02-07 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-23300:
-
Fix Version/s: (was: 2.4.0)
   2.3.0

> Print out if Pandas and PyArrow are installed or not in tests
> -
>
> Key: SPARK-23300
> URL: https://issues.apache.org/jira/browse/SPARK-23300
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.3.0
>
>
> This is related with SPARK-23292. Some tests with, for example, PyArrow and 
> Pandas are currently not running in some Python versions in Jenkins.
> At least, we should explicitly print out in the console whether we are
> skipping the tests or not.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23092) Migrate MemoryStream to DataSource V2

2018-02-07 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das reassigned SPARK-23092:
-

Assignee: Tathagata Das

> Migrate MemoryStream to DataSource V2
> -
>
> Key: SPARK-23092
> URL: https://issues.apache.org/jira/browse/SPARK-23092
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Burak Yavuz
>Assignee: Tathagata Das
>Priority: Major
> Fix For: 3.0.0
>
>
> We should migrate the MemoryStream for Structured Streaming to DataSourceV2



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23092) Migrate MemoryStream to DataSource V2

2018-02-07 Thread Tathagata Das (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tathagata Das resolved SPARK-23092.
---
   Resolution: Fixed
Fix Version/s: 3.0.0

Issue resolved by pull request 20445
[https://github.com/apache/spark/pull/20445]

> Migrate MemoryStream to DataSource V2
> -
>
> Key: SPARK-23092
> URL: https://issues.apache.org/jira/browse/SPARK-23092
> Project: Spark
>  Issue Type: Task
>  Components: Structured Streaming
>Affects Versions: 2.3.0
>Reporter: Burak Yavuz
>Priority: Major
> Fix For: 3.0.0
>
>
> We should migrate the MemoryStream for Structured Streaming to DataSourceV2



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356189#comment-16356189
 ] 

Apache Spark commented on SPARK-23314:
--

User 'icexelloss' has created a pull request for this issue:
https://github.com/apache/spark/pull/20537

> Pandas grouped udf on dataset with timestamp column error 
> --
>
> Key: SPARK-23314
> URL: https://issues.apache.org/jira/browse/SPARK-23314
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Major
>
> Under  SPARK-22216
> When testing pandas_udf on group bys, I saw this error with the timestamp 
> column.
> File "pandas/_libs/tslib.pyx", line 3593, in 
> pandas._libs.tslib.tz_localize_to_utc
> AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
> 01:29:30'), try using the 'ambiguous' argument
> For details, see Comment box. I'm able to reproduce this on the latest 
> branch-2.3 (last change from Feb 1 UTC)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23314:


Assignee: Apache Spark

> Pandas grouped udf on dataset with timestamp column error 
> --
>
> Key: SPARK-23314
> URL: https://issues.apache.org/jira/browse/SPARK-23314
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Assignee: Apache Spark
>Priority: Major
>
> Under  SPARK-22216
> When testing pandas_udf on group bys, I saw this error with the timestamp 
> column.
> File "pandas/_libs/tslib.pyx", line 3593, in 
> pandas._libs.tslib.tz_localize_to_utc
> AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
> 01:29:30'), try using the 'ambiguous' argument
> For details, see Comment box. I'm able to reproduce this on the latest 
> branch-2.3 (last change from Feb 1 UTC)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error

2018-02-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23314:


Assignee: (was: Apache Spark)

> Pandas grouped udf on dataset with timestamp column error 
> --
>
> Key: SPARK-23314
> URL: https://issues.apache.org/jira/browse/SPARK-23314
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Felix Cheung
>Priority: Major
>
> Under  SPARK-22216
> When testing pandas_udf on group bys, I saw this error with the timestamp 
> column.
> File "pandas/_libs/tslib.pyx", line 3593, in 
> pandas._libs.tslib.tz_localize_to_utc
> AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 
> 01:29:30'), try using the 'ambiguous' argument
> For details, see Comment box. I'm able to reproduce this on the latest 
> branch-2.3 (last change from Feb 1 UTC)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23348) append data using saveAsTable should adjust the data types

2018-02-07 Thread Sameer Agarwal (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356176#comment-16356176
 ] 

Sameer Agarwal commented on SPARK-23348:


yes, +1

> append data using saveAsTable should adjust the data types
> --
>
> Key: SPARK-23348
> URL: https://issues.apache.org/jira/browse/SPARK-23348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>
>  
> {code:java}
> Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t")
> Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t")
> scala> sql("select * from t").show
> {code}
>  
> This query will fail with a strange error:
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 
> (TID 15, localhost, executor driver): 
> java.lang.UnsupportedOperationException: Unimplemented type: IntegerType
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261)
> ...
> {code}
>  
> All Spark 2.X are the same. For Spark 1.6.3,
> {code}
> scala> sql("select * from tx").show
> +----+---+
> |   i|  j|
> +----+---+
> |null|  3|
> |   1|  a|
> +----+---+
> {code}
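
Until the writer adjusts types itself, a hedged workaround sketch is to cast the incoming rows to the existing table schema before appending, which reproduces the permissive Spark 1.6.3 result shown above ("c" becomes null) instead of writing files whose types disagree with the table:

{code:scala}
// Sketch only, not the proposed fix: align the new DataFrame to the target
// table's schema so mismatched columns become typed nulls before the append.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val tableSchema = spark.table("t").schema
val incoming = Seq("c" -> 3).toDF("i", "j")
val aligned = incoming.select(
  tableSchema.map(f => col(f.name).cast(f.dataType).as(f.name)): _*)
aligned.write.mode("append").saveAsTable("t")
{code}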



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23348) append data using saveAsTable should adjust the data types

2018-02-07 Thread Dongjoon Hyun (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356174#comment-16356174
 ] 

Dongjoon Hyun commented on SPARK-23348:
---

[~cloud_fan], [~smilegator], [~sameerag].
Although this is not a regression in Spark 2.3, can we have this in Apache 
Spark 2.3?

> append data using saveAsTable should adjust the data types
> --
>
> Key: SPARK-23348
> URL: https://issues.apache.org/jira/browse/SPARK-23348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>
>  
> {code:java}
> Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t")
> Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t")
> scala> sql("select * from t").show
> {code}
>  
> This query will fail with a strange error:
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 
> (TID 15, localhost, executor driver): 
> java.lang.UnsupportedOperationException: Unimplemented type: IntegerType
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261)
> ...
> {code}
>  
> All Spark 2.X are the same. For Spark 1.6.3,
> {code}
> scala> sql("select * from tx").show
> +----+---+
> |   i|  j|
> +----+---+
> |null|  3|
> |   1|  a|
> +----+---+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2018-02-07 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356168#comment-16356168
 ] 

Thomas Graves commented on SPARK-22683:
---

I agree, I think the default behavior should stay at 1. 

I ran a few tests with this patch. I definitely see an improvement in resource 
usage across all the jobs I ran. The jobs had similar run times, or were 
actually faster in a few cases. I used the default 60 second timeout.

> DynamicAllocation wastes resources by allocating containers that will barely 
> be used
> 
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>Priority: Major
>  Labels: pull-request-available
>
> While migrating a series of jobs from MR to Spark using dynamicAllocation, 
> I've noticed almost a doubling (+114% exactly) of resource consumption of 
> Spark w.r.t MR, for a wall clock time gain of 43%
> About the context: 
> - resource usage stands for vcore-hours allocation for the whole job, as seen 
> by YARN
> - I'm talking about a series of jobs because we provide our users with a way 
> to define experiments (via UI / DSL) that automatically get translated to 
> Spark / MR jobs and submitted on the cluster
> - we submit around 500 of such jobs each day
> - these jobs are usually one shot, and the amount of processing can vary a 
> lot between jobs, and as such finding an efficient number of executors for 
> each job is difficult to get right, which is the reason I took the path of 
> dynamic allocation.  
> - Some of the tests have been scheduled on an idle queue, some on a full 
> queue.
> - experiments have been conducted with spark.executor-cores = 5 and 10, only 
> results for 5 cores have been reported because efficiency was overall better 
> than with 10 cores
> - the figures I give are averaged over a representative sample of those jobs 
> (about 600 jobs) ranging from tens to thousands splits in the data 
> partitioning and between 400 to 9000 seconds of wall clock time.
> - executor idle timeout is set to 30s;
>  
> Definition: 
> - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, 
> which represent the max number of tasks an executor will process in parallel.
> - the current behaviour of the dynamic allocation is to allocate enough 
> containers to have one taskSlot per task, which minimizes latency, but wastes 
> resources when tasks are small regarding executor allocation and idling 
> overhead. 
> The results using the proposal (described below) over the job sample (600 
> jobs):
> - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in 
> resource usage, for a 37% (against 43%) reduction in wall clock time for 
> Spark w.r.t MR
> - by trying to minimize the average resource consumption, I ended up with 6 
> tasks per core, with a 30% resource usage reduction, for a similar wall clock 
> time w.r.t. MR
> What did I try to solve the issue with existing parameters (summing up a few 
> points mentioned in the comments) ?
> - change dynamicAllocation.maxExecutors: this would need to be adapted for 
> each job (tens to thousands splits can occur), and essentially remove the 
> interest of using the dynamic allocation.
> - use dynamicAllocation.backlogTimeout: 
> - setting this parameter right to avoid creating unused executors is very 
> dependent on wall clock time. One basically needs to solve the exponential 
> ramp up for the target time. So this is not an option for my use case where I 
> don't want a per-job tuning. 
> - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar 
> resource consumption at the expense of 20% more wall clock time, or a similar 
> wall clock time at the expense of 60% more resource consumption than what I 
> got using my proposal @ 6 tasks per slot (this value being optimized over a 
> much larger range of jobs as already stated)
> - as mentioned in another comment, tampering with the exponential ramp up 
> might yield task imbalance and such old executors could become contention 
> points for other exes trying to remotely access blocks in the old exes (not 
> witnessed in the jobs I'm talking about, but we did see this behavior in 
> other jobs)
> Proposal: 
> Simply add a tasksPerExecutorSlot parameter, which makes it possible to 
> specify how many tasks a single taskSlot should ideally execute to mitigate 
> the overhead of executor allocation.
> PR: https://github.com/apache/spark/pull/19881
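
The arithmetic implied by the proposal can be sketched as below; the actual change lives in the PR, and the config key name here simply follows the description rather than any final decision:

{code:scala}
// Sizing rule sketch: target executors = ceil(tasks / (tasksPerSlot * slotsPerExecutor))
// instead of the current ceil(tasks / slotsPerExecutor).
import org.apache.spark.SparkConf

val conf = new SparkConf()
val slotsPerExecutor =
  conf.getInt("spark.executor.cores", 1) / conf.getInt("spark.task.cpus", 1)
val tasksPerSlot =
  conf.getInt("spark.dynamicAllocation.tasksPerExecutorSlot", 1)  // proposed key, not final

def targetExecutors(pendingAndRunningTasks: Int): Int =
  math.ceil(pendingAndRunningTasks.toDouble / (tasksPerSlot * slotsPerExecutor)).toInt

// e.g. 600 tasks, 5 cores per executor, 1 cpu per task, 6 tasks per slot -> 20 executors
{code}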



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2018-02-07 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356168#comment-16356168
 ] 

Thomas Graves edited comment on SPARK-22683 at 2/7/18 10:24 PM:


I agree, I think the default behavior should stay at 1. 

I ran a few tests with this patch. I definitely see an improvement in resource 
usage across all the jobs I ran. The jobs had similar run times, or were 
actually faster in a few cases. I used the default 60 second timeout.

Note that none of those jobs were really long-running; they had small to 
medium size tasks.


was (Author: tgraves):
I agree, I think the default behavior should stay at 1. 

I ran a few tests with this patch. I definitely see an improvement in resource 
usage across all the jobs I ran. The jobs had similar run times, or were 
actually faster in a few cases. I used the default 60 second timeout.

> DynamicAllocation wastes resources by allocating containers that will barely 
> be used
> 
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>Priority: Major
>  Labels: pull-request-available
>
> While migrating a series of jobs from MR to Spark using dynamicAllocation, 
> I've noticed almost a doubling (+114% exactly) of resource consumption of 
> Spark w.r.t MR, for a wall clock time gain of 43%
> About the context: 
> - resource usage stands for vcore-hours allocation for the whole job, as seen 
> by YARN
> - I'm talking about a series of jobs because we provide our users with a way 
> to define experiments (via UI / DSL) that automatically get translated to 
> Spark / MR jobs and submitted on the cluster
> - we submit around 500 of such jobs each day
> - these jobs are usually one shot, and the amount of processing can vary a 
> lot between jobs, and as such finding an efficient number of executors for 
> each job is difficult to get right, which is the reason I took the path of 
> dynamic allocation.  
> - Some of the tests have been scheduled on an idle queue, some on a full 
> queue.
> - experiments have been conducted with spark.executor-cores = 5 and 10, only 
> results for 5 cores have been reported because efficiency was overall better 
> than with 10 cores
> - the figures I give are averaged over a representative sample of those jobs 
> (about 600 jobs) ranging from tens to thousands splits in the data 
> partitioning and between 400 to 9000 seconds of wall clock time.
> - executor idle timeout is set to 30s;
>  
> Definition: 
> - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, 
> which represent the max number of tasks an executor will process in parallel.
> - the current behaviour of the dynamic allocation is to allocate enough 
> containers to have one taskSlot per task, which minimizes latency, but wastes 
> resources when tasks are small regarding executor allocation and idling 
> overhead. 
> The results using the proposal (described below) over the job sample (600 
> jobs):
> - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in 
> resource usage, for a 37% (against 43%) reduction in wall clock time for 
> Spark w.r.t MR
> - by trying to minimize the average resource consumption, I ended up with 6 
> tasks per core, with a 30% resource usage reduction, for a similar wall clock 
> time w.r.t. MR
> What did I try to solve the issue with existing parameters (summing up a few 
> points mentioned in the comments) ?
> - change dynamicAllocation.maxExecutors: this would need to be adapted for 
> each job (tens to thousands splits can occur), and essentially remove the 
> interest of using the dynamic allocation.
> - use dynamicAllocation.backlogTimeout: 
> - setting this parameter right to avoid creating unused executors is very 
> dependent on wall clock time. One basically needs to solve the exponential 
> ramp up for the target time. So this is not an option for my use case where I 
> don't want a per-job tuning. 
> - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar 
> resource consumption at the expense of 20% more wall clock time, or a similar 
> wall clock time at the expense of 60% more resource consumption than what I 
> got using my proposal @ 6 tasks per slot (this value being optimized over a 
> much larger range of jobs as already stated)
> - as mentioned in another comment, tampering with the exponential ramp up 
> might yield task imbalance and such old executors could become contention 
> points for other exes trying to remotely access blocks in the old exes (not 
> witnessed in the jobs I'm 

[jira] [Updated] (SPARK-23348) append data using saveAsTable should adjust the data types

2018-02-07 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23348:
--
Description: 
 
{code:java}
Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t")

Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t")

scala> sql("select * from t").show
{code}
 
This query will fail with a strange error:

{code:java}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 
(TID 15, localhost, executor driver): java.lang.UnsupportedOperationException: 
Unimplemented type: IntegerType
 at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473)
 at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214)
 at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261)
...
{code}
 
All Spark 2.X are the same. For Spark 1.6.3,
{code}
scala> sql("select * from tx").show
+----+---+
|   i|  j|
+----+---+
|null|  3|
|   1|  a|
+----+---+
{code}

  was:
 
{code:java}
Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t")

Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t")

scala> sql("select * from t").show
{code}
 
This query will fail with a strange error:

 
{code:java}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 
(TID 15, localhost, executor driver): java.lang.UnsupportedOperationException: 
Unimplemented type: IntegerType
 at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473)
 at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214)
 at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261)
...
{code}
 


> append data using saveAsTable should adjust the data types
> --
>
> Key: SPARK-23348
> URL: https://issues.apache.org/jira/browse/SPARK-23348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>
>  
> {code:java}
> Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t")
> Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t")
> scala> sql("select * from t").show
> {code}
>  
> This query will fail with a strange error:
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 
> (TID 15, localhost, executor driver): 
> java.lang.UnsupportedOperationException: Unimplemented type: IntegerType
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261)
> ...
> {code}
>  
> All Spark 2.X are the same. For Spark 1.6.3,
> {code}
> scala> sql("select * from tx").show
> +----+---+
> |   i|  j|
> +----+---+
> |null|  3|
> |   1|  a|
> +----+---+
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23354) spark jdbc does not maintain length of data type when I move data from MS sql server to Oracle using spark jdbc

2018-02-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356132#comment-16356132
 ] 

Sean Owen commented on SPARK-23354:
---

I'm not clear on what about this involves Spark. What length do you mean? That 
first question applies even if your code is correct.

> spark jdbc does not maintain length of data type when I move data from MS sql 
> server to Oracle using spark jdbc
> ---
>
> Key: SPARK-23354
> URL: https://issues.apache.org/jira/browse/SPARK-23354
> Project: Spark
>  Issue Type: Bug
>  Components: Input/Output
>Affects Versions: 2.2.1
>Reporter: Lav Patel
>Priority: Major
>
> spark jdbc does not maintain length of data type when I move data from MS sql 
> server to Oracle using spark jdbc
>  
> To fix this, I have written code so it will figure out length of column and 
> it does the conversion.
>  
> I can put more details with a code sample if the community is interested. 
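
Pending that code sample, one hedged sketch of the idea with knobs that already exist is to carry explicit lengths to the target through the JDBC writer's createTableColumnTypes option; the URLs, table names and VARCHAR sizes below are placeholders, and real code would derive them from the source metadata:

{code:scala}
// Sketch only, not the reporter's code.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val sqlServerUrl = "jdbc:sqlserver://host;databaseName=db"   // placeholder
val oracleUrl    = "jdbc:oracle:thin:@//host:1521/service"   // placeholder

val df = spark.read.format("jdbc")
  .option("url", sqlServerUrl)
  .option("dbtable", "dbo.people")
  .load()

df.write.format("jdbc")
  .option("url", oracleUrl)
  .option("dbtable", "PEOPLE")
  // hard-coded here; in practice built from the source column metadata
  .option("createTableColumnTypes", "name VARCHAR(50), city VARCHAR(30)")
  .mode("overwrite")
  .save()
{code}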



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23348) append data using saveAsTable should adjust the data types

2018-02-07 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23348:
--
Affects Version/s: 2.0.2

> append data using saveAsTable should adjust the data types
> --
>
> Key: SPARK-23348
> URL: https://issues.apache.org/jira/browse/SPARK-23348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>
>  
> {code:java}
> Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t")
> Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t")
> scala> sql("select * from t").show
> {code}
>  
> This query will fail with a strange error:
>  
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 
> (TID 15, localhost, executor driver): 
> java.lang.UnsupportedOperationException: Unimplemented type: IntegerType
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261)
> ...
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23348) append data using saveAsTable should adjust the data types

2018-02-07 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23348:
--
Affects Version/s: 2.1.2

> append data using saveAsTable should adjust the data types
> --
>
> Key: SPARK-23348
> URL: https://issues.apache.org/jira/browse/SPARK-23348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.2, 2.2.1, 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>
>  
> {code:java}
> Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t")
> Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t")
> scala> sql("select * from t").show
> {code}
>  
> This query will fail with a strange error:
>  
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 
> (TID 15, localhost, executor driver): 
> java.lang.UnsupportedOperationException: Unimplemented type: IntegerType
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261)
> ...
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23348) append data using saveAsTable should adjust the data types

2018-02-07 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23348:
--
Affects Version/s: 2.2.1

> append data using saveAsTable should adjust the data types
> --
>
> Key: SPARK-23348
> URL: https://issues.apache.org/jira/browse/SPARK-23348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>
>  
> {code:java}
> Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t")
> Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t")
> scala> sql("select * from t").show
> {code}
>  
> This query will fail with a strange error:
>  
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 
> (TID 15, localhost, executor driver): 
> java.lang.UnsupportedOperationException: Unimplemented type: IntegerType
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261)
> ...
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23348) append data using saveAsTable should adjust the data types

2018-02-07 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-23348:
--
Description: 
 
{code:java}
Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t")

Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t")

scala> sql("select * from t").show
{code}
 
This query will fail with a strange error:

 
{code:java}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 
(TID 15, localhost, executor driver): java.lang.UnsupportedOperationException: 
Unimplemented type: IntegerType
 at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473)
 at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214)
 at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261)
...
{code}
 

  was:
 
{code:java}
Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t")

Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t")
{code}
 
This query will fail with a strange error:

 
{code:java}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 
(TID 15, localhost, executor driver): java.lang.UnsupportedOperationException: 
Unimplemented type: IntegerType
 at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473)
 at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214)
 at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261)
...
{code}
 


> append data using saveAsTable should adjust the data types
> --
>
> Key: SPARK-23348
> URL: https://issues.apache.org/jira/browse/SPARK-23348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.1, 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>
>  
> {code:java}
> Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t")
> Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t")
> scala> sql("select * from t").show
> {code}
>  
> This query will fail with a strange error:
>  
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 
> (TID 15, localhost, executor driver): 
> java.lang.UnsupportedOperationException: Unimplemented type: IntegerType
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261)
> ...
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23349) Duplicate and redundant type determination for ShuffleManager Object

2018-02-07 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-23349.
---
   Resolution: Won't Fix
Fix Version/s: (was: 2.2.1)

> Duplicate and redundant type determination for ShuffleManager Object
> 
>
> Key: SPARK-23349
> URL: https://issues.apache.org/jira/browse/SPARK-23349
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, SQL
>Affects Versions: 2.2.1
>Reporter: Phoenix_Daddy
>Priority: Minor
>
> org.apache.spark.sql.execution.exchange.ShuffleExchange
> In the "needToCopyObjectsBeforeShuffle()" function, there is a nested "if or 
> else" branch.
> The condition tested in the first-layer "if" has the same value as the 
> condition tested in the second-layer "if"; that is, the inner condition must 
> be true whenever the outer one is true.
> In addition, a further condition is only used in the second-layer "if" and 
> should not be calculated until it is needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23329) Update the function descriptions with the arguments and returned values of the trigonometric functions

2018-02-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356119#comment-16356119
 ] 

Sean Owen commented on SPARK-23329:
---

Yeah, none of the other docs talk about a Column. It's implied that operations 
operate across rows of data in columns. "angles" makes it sound like the 
operator takes several arguments. I would go with the former.

> Update the function descriptions with the arguments and returned values of 
> the trigonometric functions
> --
>
> Key: SPARK-23329
> URL: https://issues.apache.org/jira/browse/SPARK-23329
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Minor
>  Labels: starter
>
> We need an update on the function descriptions for all the trigonometric 
> functions. For example, {{cos}}, {{sin}}, and {{cot}}. Internally, the 
> implementation is based on the java.lang.Math. We need a clear description 
> about the units of the input arguments and the returned values. 
> For example, the following descriptions are lacking such info. 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L551-L555
> https://github.com/apache/spark/blob/d5861aba9d80ca15ad3f22793b79822e470d6913/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L1978
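
What the clarified descriptions need to convey can be shown concretely; the snippet below only illustrates the radians convention, not the wording that ends up in the docs:

{code:scala}
// The trigonometric functions take and return radians, mirroring java.lang.Math,
// so degree inputs must be converted first.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{cos, lit, toRadians}

val spark = SparkSession.builder().getOrCreate()
spark.range(1).select(
  cos(lit(math.Pi)),          // -1.0: input already in radians
  cos(toRadians(lit(180)))    // -1.0: degrees converted to radians first
).show()
{code}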



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23318) FP-growth: WARN FPGrowth: Input data is not cached

2018-02-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356111#comment-16356111
 ] 

Sean Owen commented on SPARK-23318:
---

[~tashoyan] did you want to submit a PR for this?

> FP-growth: WARN FPGrowth: Input data is not cached
> --
>
> Key: SPARK-23318
> URL: https://issues.apache.org/jira/browse/SPARK-23318
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.1
>Reporter: Arseniy Tashoyan
>Priority: Minor
>  Labels: MLLib,, fp-growth
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> When running FPGrowth.fit() from the _ml_ package, one can see a warning:
> WARN FPGrowth: Input data is not cached.
> This warning occurs even when the dataset of transactions is cached.
> Actually this warning comes from the FPGrowth implementation in the old 
> _mllib_ package. The new FPGrowth performs some transformations on the input 
> dataset of transactions and then passes it to the old FPGrowth - without 
> caching. Hence the warning.
> The problem looks similar to SPARK-18356
>  If you don't mind, I can push a similar fix:
> {code}
> // ml.FPGrowth
> val handlePersistence = dataset.storageLevel == StorageLevel.NONE
> if (handlePersistence) {
>   // cache the data
> }
> // then call mllib.FPGrowth
> // finally unpersist the data
> {code}
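
A slightly fleshed-out sketch of that handlePersistence pattern is below; where exactly it gets wired into ml.FPGrowth.fit is an assumption here, and the helper name is illustrative:

{code:scala}
// Sketch mirroring SPARK-18356: cache the input only if the caller has not,
// run the training body (which delegates to the old mllib.FPGrowth), then
// unpersist only what we persisted ourselves.
import org.apache.spark.sql.Dataset
import org.apache.spark.storage.StorageLevel

def withCaching[T, R](dataset: Dataset[T])(train: Dataset[T] => R): R = {
  val handlePersistence = dataset.storageLevel == StorageLevel.NONE
  if (handlePersistence) {
    dataset.persist(StorageLevel.MEMORY_AND_DISK)
  }
  try {
    train(dataset)
  } finally {
    if (handlePersistence) {
      dataset.unpersist()
    }
  }
}
{code}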



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22279) Turn on spark.sql.hive.convertMetastoreOrc by default

2018-02-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22279:


Assignee: (was: Apache Spark)

> Turn on spark.sql.hive.convertMetastoreOrc by default
> -
>
> Key: SPARK-22279
> URL: https://issues.apache.org/jira/browse/SPARK-22279
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Like Parquet, this issue aims to turn on `spark.sql.hive.convertMetastoreOrc` 
> by default.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-22279) Turn on spark.sql.hive.convertMetastoreOrc by default

2018-02-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-22279:


Assignee: Apache Spark

> Turn on spark.sql.hive.convertMetastoreOrc by default
> -
>
> Key: SPARK-22279
> URL: https://issues.apache.org/jira/browse/SPARK-22279
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> Like Parquet, this issue aims to turn on `spark.sql.hive.convertMetastoreOrc` 
> by default.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23355) convertMetastore should not ignore table properties

2018-02-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356076#comment-16356076
 ] 

Apache Spark commented on SPARK-23355:
--

User 'dongjoon-hyun' has created a pull request for this issue:
https://github.com/apache/spark/pull/20522

> convertMetastore should not ignore table properties
> ---
>
> Key: SPARK-23355
> URL: https://issues.apache.org/jira/browse/SPARK-23355
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> `convertMetastoreOrc/Parquet` ignores table properties.
> This happens for a table created by `STORED AS ORC/PARQUET`.
> Currently, `convertMetastoreParquet` is `true` by default. So, it's a 
> well-known user-facing issue. For `convertMetastoreOrc`, it has been `false` 
> and  SPARK-22279 tried to turn on that for Feature Parity with Parquet.
> However, it's indeed a regression for the existing Hive ORC table users. So, 
> it'll be reverted for Apache Spark 2.3 via 
> https://github.com/apache/spark/pull/20536 .



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23355) convertMetastore should not ignore table properties

2018-02-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23355:


Assignee: Apache Spark

> convertMetastore should not ignore table properties
> ---
>
> Key: SPARK-23355
> URL: https://issues.apache.org/jira/browse/SPARK-23355
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Dongjoon Hyun
>Assignee: Apache Spark
>Priority: Major
>
> `convertMetastoreOrc/Parquet` ignores table properties.
> This happens for a table created by `STORED AS ORC/PARQUET`.
> Currently, `convertMetastoreParquet` is `true` by default. So, it's a 
> well-known user-facing issue. For `convertMetastoreOrc`, it has been `false` 
> and  SPARK-22279 tried to turn on that for Feature Parity with Parquet.
> However, it's indeed a regression for the existing Hive ORC table users. So, 
> it'll be reverted for Apache Spark 2.3 via 
> https://github.com/apache/spark/pull/20536 .



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23355) convertMetastore should not ignore table properties

2018-02-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23355:


Assignee: (was: Apache Spark)

> convertMetastore should not ignore table properties
> ---
>
> Key: SPARK-23355
> URL: https://issues.apache.org/jira/browse/SPARK-23355
> Project: Spark
>  Issue Type: Test
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> `convertMetastoreOrc/Parquet` ignores table properties.
> This happens for a table created by `STORED AS ORC/PARQUET`.
> Currently, `convertMetastoreParquet` is `true` by default. So, it's a 
> well-known user-facing issue. For `convertMetastoreOrc`, it has been `false` 
> and  SPARK-22279 tried to turn on that for Feature Parity with Parquet.
> However, it's indeed a regression for the existing Hive ORC table users. So, 
> it'll be reverted for Apache Spark 2.3 via 
> https://github.com/apache/spark/pull/20536 .



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23355) convertMetastore should not ignore table properties

2018-02-07 Thread Dongjoon Hyun (JIRA)
Dongjoon Hyun created SPARK-23355:
-

 Summary: convertMetastore should not ignore table properties
 Key: SPARK-23355
 URL: https://issues.apache.org/jira/browse/SPARK-23355
 Project: Spark
  Issue Type: Test
  Components: SQL
Affects Versions: 2.2.1, 2.1.2, 2.0.2, 2.3.0
Reporter: Dongjoon Hyun


`convertMetastoreOrc/Parquet` ignores table properties.

This happens for a table created by `STORED AS ORC/PARQUET`.

Currently, `convertMetastoreParquet` is `true` by default. So, it's a 
well-known user-facing issue. For `convertMetastoreOrc`, it has been `false` 
and  SPARK-22279 tried to turn on that for Feature Parity with Parquet.

However, it's indeed a regression for the existing Hive ORC table users. So, 
it'll be reverted for Apache Spark 2.3 via 
https://github.com/apache/spark/pull/20536 .
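
A hedged illustration of the gap (the property name is the standard Hive/ORC codec key; whether the converted path honors it is exactly what this ticket's tests check):

{code:scala}
// With conversion enabled, properties set through TBLPROPERTIES may never reach
// the files written by the native ORC path; the compression codec of the
// produced file is the question.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().enableHiveSupport().getOrCreate()
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
spark.sql("""
  CREATE TABLE t_orc (id INT) STORED AS ORC
  TBLPROPERTIES ('orc.compress' = 'ZLIB')
""")
spark.sql("INSERT INTO t_orc VALUES (1)")
{code}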



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-22279) Turn on spark.sql.hive.convertMetastoreOrc by default

2018-02-07 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun updated SPARK-22279:
--
Fix Version/s: (was: 2.3.0)

> Turn on spark.sql.hive.convertMetastoreOrc by default
> -
>
> Key: SPARK-22279
> URL: https://issues.apache.org/jira/browse/SPARK-22279
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Like Parquet, this issue aims to turn on `spark.sql.hive.convertMetastoreOrc` 
> by default.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Reopened] (SPARK-22279) Turn on spark.sql.hive.convertMetastoreOrc by default

2018-02-07 Thread Dongjoon Hyun (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-22279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dongjoon Hyun reopened SPARK-22279:
---

This will be reverted.

> Turn on spark.sql.hive.convertMetastoreOrc by default
> -
>
> Key: SPARK-22279
> URL: https://issues.apache.org/jira/browse/SPARK-22279
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
>
> Like Parquet, this issue aims to turn on `spark.sql.hive.convertMetastoreOrc` 
> by default.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23045) Have RFormula use OneHotEncoderEstimator

2018-02-07 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-23045:
--
Summary: Have RFormula use OneHotEncoderEstimator  (was: Have RFormula use 
OneHoEncoderEstimator)

> Have RFormula use OneHotEncoderEstimator
> 
>
> Key: SPARK-23045
> URL: https://issues.apache.org/jira/browse/SPARK-23045
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.3.0
>Reporter: Bago Amirbekian
>Assignee: Bago Amirbekian
>Priority: Major
> Fix For: 2.3.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics

2018-02-07 Thread Edwina Lu (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355994#comment-16355994
 ] 

Edwina Lu commented on SPARK-23206:
---

We ([~jerryshao], [~zhz] and I) are planning a conference call via Skype to 
discuss this ticket tomorrow Thursday Feb 8 at 5pm PST. If anyone else is 
interested in joining the call, please let me know.

> Additional Memory Tuning Metrics
> 
>
> Key: SPARK-23206
> URL: https://issues.apache.org/jira/browse/SPARK-23206
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Edwina Lu
>Priority: Major
> Attachments: ExecutorsTab.png, MemoryTuningMetricsDesignDoc.pdf, 
> StageTab.png
>
>
> At LinkedIn, we have multiple clusters, running thousands of Spark 
> applications, and these numbers are growing rapidly. We need to ensure that 
> these Spark applications are well tuned – cluster resources, including 
> memory, should be used efficiently so that the cluster can support running 
> more applications concurrently, and applications should run quickly and 
> reliably.
> Currently there is limited visibility into how much memory executors are 
> using, and users are guessing numbers for executor and driver memory sizing. 
> These estimates are often much larger than needed, leading to memory wastage. 
> Examining the metrics for one cluster for a month, the average percentage of 
> used executor memory (max JVM used memory across executors /  
> spark.executor.memory) is 35%, leading to an average of 591GB unused memory 
> per application (number of executors * (spark.executor.memory - max JVM used 
> memory)). Spark has multiple memory regions (user memory, execution memory, 
> storage memory, and overhead memory), and to understand how memory is being 
> used and fine-tune allocation between regions, it would be useful to have 
> information about how much memory is being used for the different regions.
> To improve visibility into memory usage for the driver and executors and 
> different memory regions, the following additional memory metrics can be 
> tracked for each executor and driver:
>  * JVM used memory: the JVM heap size for the executor/driver.
>  * Execution memory: memory used for computation in shuffles, joins, sorts 
> and aggregations.
>  * Storage memory: memory used for caching and propagating internal data across 
> the cluster.
>  * Unified memory: sum of execution and storage memory.
> The peak values for each memory metric can be tracked for each executor, and 
> also per stage. This information can be shown in the Spark UI and the REST 
> APIs. Information for peak JVM used memory can help with determining 
> appropriate values for spark.executor.memory and spark.driver.memory, and 
> information about the unified memory region can help with determining 
> appropriate values for spark.memory.fraction and 
> spark.memory.storageFraction. Stage memory information can help identify 
> which stages are most memory intensive, and users can look into the relevant 
> code to determine if it can be optimized.
> The memory metrics can be gathered by adding the current JVM used memory, 
> execution memory and storage memory to the heartbeat. SparkListeners are 
> modified to collect the new metrics for the executors, stages and Spark 
> history log. Only interesting values (peak values per stage per executor) are 
> recorded in the Spark history log, to minimize the amount of additional 
> logging.
> We have attached our design documentation with this ticket and would like to 
> receive feedback from the community for this proposal.
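
As a small pointer (not part of the proposal itself), one of the proposed numbers is already observable today through the standard JVM management API; the executor's execution and storage usage live in the internal MemoryManager, which is what the heartbeat change would surface:

{code:scala}
// Current JVM heap usage from the MemoryMXBean; this is the "JVM used memory"
// figure referred to above.
import java.lang.management.ManagementFactory

val heap = ManagementFactory.getMemoryMXBean.getHeapMemoryUsage
println(s"JVM used heap: ${heap.getUsed} bytes (committed: ${heap.getCommitted})")
{code}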



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19870) Repeatable deadlock on BlockInfoManager and TorrentBroadcast

2018-02-07 Thread Imran Rashid (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355982#comment-16355982
 ] 

Imran Rashid commented on SPARK-19870:
--

[~eyalfa] any chance you can share those executor logs?

those WARN messages may or may not be OK -- the known situation in which they 
occur is when you intentionally do not read to the end of a partition, e.g. if 
you do a take() or a limit(), etc.

> Repeatable deadlock on BlockInfoManager and TorrentBroadcast
> 
>
> Key: SPARK-19870
> URL: https://issues.apache.org/jira/browse/SPARK-19870
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Shuffle
>Affects Versions: 2.0.2, 2.1.0
> Environment: ubuntu linux 14.04 x86_64 on ec2, hadoop cdh 5.10.0, 
> yarn coarse-grained.
>Reporter: Steven Ruppert
>Priority: Major
> Attachments: stack.txt
>
>
> Running what I believe to be a fairly vanilla spark job, using the RDD api, 
> with several shuffles, a cached RDD, and finally a conversion to DataFrame to 
> save to parquet. I get a repeatable deadlock at the very last reducers of one 
> of the stages.
> Roughly:
> {noformat}
> "Executor task launch worker-6" #56 daemon prio=5 os_prio=0 
> tid=0x7fffd88d3000 nid=0x1022b9 waiting for monitor entry 
> [0x7fffb95f3000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:207)
> - waiting to lock <0x0005445cfc00> (a 
> org.apache.spark.broadcast.TorrentBroadcast$)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
> - locked <0x0005b12f2290> (a 
> org.apache.spark.broadcast.TorrentBroadcast)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> and 
> {noformat}
> "Executor task launch worker-5" #55 daemon prio=5 os_prio=0 
> tid=0x7fffd88d nid=0x1022b8 in Object.wait() [0x7fffb96f4000]
>java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:502)
> at 
> org.apache.spark.storage.BlockInfoManager.lockForReading(BlockInfoManager.scala:202)
> - locked <0x000545736b58> (a 
> org.apache.spark.storage.BlockInfoManager)
> at 
> org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:444)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:210)
> - locked <0x0005445cfc00> (a 
> org.apache.spark.broadcast.TorrentBroadcast$)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
> - locked <0x00059711eb10> (a 
> org.apache.spark.broadcast.TorrentBroadcast)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> 

[jira] [Updated] (SPARK-21084) Improvements to dynamic allocation for notebook use cases

2018-02-07 Thread Mark Hamstra (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mark Hamstra updated SPARK-21084:
-
Description: 
One important application of Spark is to support many notebook users with a 
single YARN or Spark Standalone cluster.  We at IBM have seen this requirement 
across multiple deployments of Spark: on-premises and private cloud deployments 
at our clients, as well as on the IBM cloud.  The scenario goes something like 
this: "Every morning at 9am, 500 analysts log into their computers and start 
running Spark notebooks intermittently for the next 8 hours." I'm sure that 
many other members of the community are interested in making similar scenarios 
work.

Dynamic allocation is supposed to support these kinds of use cases by shifting 
cluster resources towards users who are currently executing scalable code.  In 
our own testing, we have encountered a number of issues with using the current 
implementation of dynamic allocation for this purpose:
*Issue #1: Starvation.* A Spark job acquires all available containers, 
preventing other jobs or applications from starting.
*Issue #2: Request latency.* Jobs that would normally finish in less than 30 
seconds take 2-4x longer than normal with dynamic allocation.
*Issue #3: Unfair resource allocation due to cached data.* Applications that 
have cached RDD partitions hold onto executors indefinitely, denying those 
resources to other applications.
*Issue #4: Loss of cached data leads to thrashing.*  Applications repeatedly 
lose partitions of cached RDDs because the underlying executors are removed; 
the applications then need to rerun expensive computations.

This umbrella JIRA covers efforts to address these issues by making 
enhancements to Spark.
Here's a high-level summary of the current planned work:
* [SPARK-21097]: Preserve an executor's cached data when shutting down the 
executor.
* [SPARK-21122]: Make Spark give up executors in a controlled fashion when the 
RM indicates it is running low on capacity.
* (JIRA TBD): Reduce the delay for dynamic allocation to spin up new executors.

Note that this overall plan is subject to change, and other members of the 
community should feel free to suggest changes and to help out.

  was:
One important application of Spark is to support many notebook users with a 
single YARN or Spark Standalone cluster.  We at IBM have seen this requirement 
across multiple deployments of Spark: on-premises and private cloud deployments 
at our clients, as well as on the IBM cloud.  The scenario goes something like 
this: "Every morning at 9am, 500 analysts log into their computers and start 
running Spark notebooks intermittently for the next 8 hours." I'm sure that 
many other members of the community are interested in making similar scenarios 
work.

Dynamic allocation is supposed to support these kinds of use cases by shifting 
cluster resources towards users who are currently executing scalable code.  In 
our own testing, we have encountered a number of issues with using the current 
implementation of dynamic allocation for this purpose:
*Issue #1: Starvation.* A Spark job acquires all available containers, 
preventing other jobs or applications from starting.
*Issue #2: Request latency.* Jobs that would normally finish in less than 30 
seconds take 2-4x longer than normal with dynamic allocation.
*Issue #3: Unfair resource allocation due to cached data.* Applications that 
have cached RDD partitions hold onto executors indefinitely, denying those 
resources to other applications.
*Issue #4: Loss of cached data leads to thrashing.*  Applications repeatedly 
lose partitions of cached RDDs because the underlying executors are removed; 
the applications then need to rerun expensive computations.

This umbrella JIRA covers efforts to address these issues by making 
enhancements to Spark.
Here's a high-level summary of the current planned work:
* [SPARK-21097]: Preserve an executor's cached data when shutting down the 
executor.
* (JIRA TBD): Make Spark give up executors in a controlled fashion when the RM 
indicates it is running low on capacity.
* (JIRA TBD): Reduce the delay for dynamic allocation to spin up new executors.

Note that this overall plan is subject to change, and other members of the 
community should feel free to suggest changes and to help out.


> Improvements to dynamic allocation for notebook use cases
> -
>
> Key: SPARK-21084
> URL: https://issues.apache.org/jira/browse/SPARK-21084
> Project: Spark
>  Issue Type: Umbrella
>  Components: Block Manager, Scheduler, Spark Core, YARN
>Affects Versions: 2.2.0, 2.3.0
>Reporter: Frederick Reiss
>Priority: Major
>
> One important application of Spark is to support many notebook users with a 
> single 

[jira] [Commented] (SPARK-22279) Turn on spark.sql.hive.convertMetastoreOrc by default

2018-02-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355952#comment-16355952
 ] 

Apache Spark commented on SPARK-22279:
--

User 'gatorsmile' has created a pull request for this issue:
https://github.com/apache/spark/pull/20536

> Turn on spark.sql.hive.convertMetastoreOrc by default
> -
>
> Key: SPARK-22279
> URL: https://issues.apache.org/jira/browse/SPARK-22279
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Dongjoon Hyun
>Priority: Major
> Fix For: 2.3.0
>
>
> Like Parquet, this issue aims to turn on `spark.sql.hive.convertMetastoreOrc` 
> by default.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23139) Read eventLog file with mixed encodings

2018-02-07 Thread Marcelo Vanzin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355946#comment-16355946
 ] 

Marcelo Vanzin commented on SPARK-23139:


Even if you change {{file.encoding}}, Spark should be forcing the written data 
to be in UTF-8. If that's not happening, then that needs to be fixed.

> Read eventLog file with mixed encodings
> ---
>
> Key: SPARK-23139
> URL: https://issues.apache.org/jira/browse/SPARK-23139
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: DENG FEI
>Priority: Major
>
> An event log may contain mixed encodings, such as a custom exception message, 
> which causes replay to fail.
> java.nio.charset.MalformedInputException: Input length = 1
> at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
>  at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
>  at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>  at java.io.InputStreamReader.read(InputStreamReader.java:184)
>  at java.io.BufferedReader.fill(BufferedReader.java:161)
>  at java.io.BufferedReader.readLine(BufferedReader.java:324)
>  at java.io.BufferedReader.readLine(BufferedReader.java:389)
>  at 
> scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72)
>  at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:836)
>  at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>  at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:78)
>  at 
> org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
>  at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:694)
>  at 
> org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:507)
>  at 
> org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$4$$anon$4.run(FsHistoryProvider.scala:399)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2018-02-07 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355943#comment-16355943
 ] 

Mark Hamstra commented on SPARK-22683:
--

I agree that setting the config to 1 should be sufficient to retain current 
behaviour; however, there seemed to be at least some urging in the discussion 
toward a default of 2. I'm not sure that we want to change Spark's default 
behavior in that way.

> DynamicAllocation wastes resources by allocating containers that will barely 
> be used
> 
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>Priority: Major
>  Labels: pull-request-available
>
> While migrating a series of jobs from MR to Spark using dynamicAllocation, 
> I've noticed almost a doubling (+114% exactly) of resource consumption of 
> Spark w.r.t MR, for a wall clock time gain of 43%
> About the context: 
> - resource usage stands for vcore-hours allocation for the whole job, as seen 
> by YARN
> - I'm talking about a series of jobs because we provide our users with a way 
> to define experiments (via UI / DSL) that automatically get translated to 
> Spark / MR jobs and submitted on the cluster
> - we submit around 500 of such jobs each day
> - these jobs are usually one shot, and the amount of processing can vary a 
> lot between jobs, and as such finding an efficient number of executors for 
> each job is difficult to get right, which is the reason I took the path of 
> dynamic allocation.  
> - Some of the tests have been scheduled on an idle queue, some on a full 
> queue.
> - experiments have been conducted with spark.executor.cores = 5 and 10, only 
> results for 5 cores have been reported because efficiency was overall better 
> than with 10 cores
> - the figures I give are averaged over a representative sample of those jobs 
> (about 600 jobs) ranging from tens to thousands splits in the data 
> partitioning and between 400 to 9000 seconds of wall clock time.
> - executor idle timeout is set to 30s;
>  
> Definition: 
> - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, 
> which represent the max number of tasks an executor will process in parallel.
> - the current behaviour of the dynamic allocation is to allocate enough 
> containers to have one taskSlot per task, which minimizes latency, but wastes 
> resources when tasks are small regarding executor allocation and idling 
> overhead. 
> The results using the proposal (described below) over the job sample (600 
> jobs):
> - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in 
> resource usage, for a 37% (against 43%) reduction in wall clock time for 
> Spark w.r.t MR
> - by trying to minimize the average resource consumption, I ended up with 6 
> tasks per core, with a 30% resource usage reduction, for a similar wall clock 
> time w.r.t. MR
> What did I try to solve the issue with existing parameters (summing up a few 
> points mentioned in the comments) ?
> - change dynamicAllocation.maxExecutors: this would need to be adapted for 
> each job (tens to thousands splits can occur), and essentially remove the 
> interest of using the dynamic allocation.
> - use dynamicAllocation.backlogTimeout: 
> - setting this parameter right to avoid creating unused executors is very 
> dependent on wall clock time. One basically needs to solve the exponential 
> ramp up for the target time. So this is not an option for my use case where I 
> don't want a per-job tuning. 
> - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar 
> resource consumption at the expense of 20% more wall clock time, or a similar 
> wall clock time at the expense of 60% more resource consumption than what I 
> got using my proposal @ 6 tasks per slot (this value being optimized over a 
> much larger range of jobs as already stated)
> - as mentioned in another comment, tampering with the exponential ramp up 
> might yield task imbalance and such old executors could become contention 
> points for other exes trying to remotely access blocks in the old exes (not 
> witnessed in the jobs I'm talking about, but we did see this behavior in 
> other jobs)
> Proposal: 
> Simply add a tasksPerExecutorSlot parameter, which makes it possible to 
> specify how many tasks a single taskSlot should ideally execute to mitigate 
> the overhead of executor allocation.
> PR: https://github.com/apache/spark/pull/19881
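
For reference, a minimal sketch of the arithmetic behind the proposal quoted
above; the parameter name comes from the description, and the defaults mirror
the reported experiments (5-core executors, spark.task.cpus = 1, 6 tasks per
slot):

{code}
import math

def target_executors(pending_tasks, executor_cores=5, task_cpus=1, tasks_per_slot=6):
    # taskSlots per executor = spark.executor.cores / spark.task.cpus
    slots_per_executor = executor_cores // task_cpus
    # Aim for tasks_per_slot tasks per slot instead of one slot per pending task.
    return int(math.ceil(pending_tasks / float(tasks_per_slot * slots_per_executor)))

# Example: 3000 pending tasks -> 100 executors, versus the 600 executors the
# current one-slot-per-task behaviour would request.
print(target_executors(3000))
{code}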



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2018-02-07 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355920#comment-16355920
 ] 

Thomas Graves commented on SPARK-22683:
---

If the config is set to 1, which keeps the current behavior, the job server 
pattern and really any other application won't be affected by default. I don't 
see this as any different than tuning max executors, for example. Really, this 
is just a more dynamic max executors.

I agree with you that this isn't optimal in some ways. For instance, it applies 
across the entire application, where you could run multiple jobs and stages. 
Each of those might not want this config, but that is a different problem for 
which we would need to support per-stage configuration. If it's a single 
application, then you should be able to set this between jobs programmatically 
if they are serial jobs (although I haven't tested this); but if that doesn't 
work, all the dynamic allocation configs would have the same issue.

 

> DynamicAllocation wastes resources by allocating containers that will barely 
> be used
> 
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>Priority: Major
>  Labels: pull-request-available
>
> While migrating a series of jobs from MR to Spark using dynamicAllocation, 
> I've noticed almost a doubling (+114% exactly) of resource consumption of 
> Spark w.r.t MR, for a wall clock time gain of 43%
> About the context: 
> - resource usage stands for vcore-hours allocation for the whole job, as seen 
> by YARN
> - I'm talking about a series of jobs because we provide our users with a way 
> to define experiments (via UI / DSL) that automatically get translated to 
> Spark / MR jobs and submitted on the cluster
> - we submit around 500 of such jobs each day
> - these jobs are usually one shot, and the amount of processing can vary a 
> lot between jobs, and as such finding an efficient number of executors for 
> each job is difficult to get right, which is the reason I took the path of 
> dynamic allocation.  
> - Some of the tests have been scheduled on an idle queue, some on a full 
> queue.
> - experiments have been conducted with spark.executor.cores = 5 and 10, only 
> results for 5 cores have been reported because efficiency was overall better 
> than with 10 cores
> - the figures I give are averaged over a representative sample of those jobs 
> (about 600 jobs) ranging from tens to thousands splits in the data 
> partitioning and between 400 to 9000 seconds of wall clock time.
> - executor idle timeout is set to 30s;
>  
> Definition: 
> - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, 
> which represent the max number of tasks an executor will process in parallel.
> - the current behaviour of the dynamic allocation is to allocate enough 
> containers to have one taskSlot per task, which minimizes latency, but wastes 
> resources when tasks are small regarding executor allocation and idling 
> overhead. 
> The results using the proposal (described below) over the job sample (600 
> jobs):
> - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in 
> resource usage, for a 37% (against 43%) reduction in wall clock time for 
> Spark w.r.t MR
> - by trying to minimize the average resource consumption, I ended up with 6 
> tasks per core, with a 30% resource usage reduction, for a similar wall clock 
> time w.r.t. MR
> What did I try to solve the issue with existing parameters (summing up a few 
> points mentioned in the comments) ?
> - change dynamicAllocation.maxExecutors: this would need to be adapted for 
> each job (tens to thousands splits can occur), and essentially remove the 
> interest of using the dynamic allocation.
> - use dynamicAllocation.backlogTimeout: 
> - setting this parameter right to avoid creating unused executors is very 
> dependent on wall clock time. One basically needs to solve the exponential 
> ramp up for the target time. So this is not an option for my use case where I 
> don't want a per-job tuning. 
> - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar 
> resource consumption at the expense of 20% more wall clock time, or a similar 
> wall clock time at the expense of 60% more resource consumption than what I 
> got using my proposal @ 6 tasks per slot (this value being optimized over a 
> much larger range of jobs as already stated)
> - as mentioned in another comment, tampering with the exponential ramp up 
> might yield task imbalance and such 

[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2018-02-07 Thread Mark Hamstra (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355895#comment-16355895
 ] 

Mark Hamstra commented on SPARK-22683:
--

A concern that I have is that the discussion seems to be very focused on Spark 
"jobs" that are actually Spark Applications that run only a single Job – i.e. a 
workload where Applications and Jobs are submitted 1:1. Tuning for just this 
kind of workload has the potential to negatively impact other Spark usage 
patterns where several Jobs run in a single Application or the common 
Job-server pattern where a single, long-running Application serves many Jobs 
over an extended time.

I'm not saying that the proposals made here and in the associated PR will 
necessarily affect those other usage patterns negatively. What I am saying is 
that there must be strong demonstration that those patterns will not be 
negatively affected before I will be comfortable approving the requested 
changes.

> DynamicAllocation wastes resources by allocating containers that will barely 
> be used
> 
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>Priority: Major
>  Labels: pull-request-available
>
> While migrating a series of jobs from MR to Spark using dynamicAllocation, 
> I've noticed almost a doubling (+114% exactly) of resource consumption of 
> Spark w.r.t MR, for a wall clock time gain of 43%
> About the context: 
> - resource usage stands for vcore-hours allocation for the whole job, as seen 
> by YARN
> - I'm talking about a series of jobs because we provide our users with a way 
> to define experiments (via UI / DSL) that automatically get translated to 
> Spark / MR jobs and submitted on the cluster
> - we submit around 500 of such jobs each day
> - these jobs are usually one shot, and the amount of processing can vary a 
> lot between jobs, and as such finding an efficient number of executors for 
> each job is difficult to get right, which is the reason I took the path of 
> dynamic allocation.  
> - Some of the tests have been scheduled on an idle queue, some on a full 
> queue.
> - experiments have been conducted with spark.executor.cores = 5 and 10, only 
> results for 5 cores have been reported because efficiency was overall better 
> than with 10 cores
> - the figures I give are averaged over a representative sample of those jobs 
> (about 600 jobs) ranging from tens to thousands splits in the data 
> partitioning and between 400 to 9000 seconds of wall clock time.
> - executor idle timeout is set to 30s;
>  
> Definition: 
> - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, 
> which represent the max number of tasks an executor will process in parallel.
> - the current behaviour of the dynamic allocation is to allocate enough 
> containers to have one taskSlot per task, which minimizes latency, but wastes 
> resources when tasks are small regarding executor allocation and idling 
> overhead. 
> The results using the proposal (described below) over the job sample (600 
> jobs):
> - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in 
> resource usage, for a 37% (against 43%) reduction in wall clock time for 
> Spark w.r.t MR
> - by trying to minimize the average resource consumption, I ended up with 6 
> tasks per core, with a 30% resource usage reduction, for a similar wall clock 
> time w.r.t. MR
> What did I try to solve the issue with existing parameters (summing up a few 
> points mentioned in the comments) ?
> - change dynamicAllocation.maxExecutors: this would need to be adapted for 
> each job (tens to thousands splits can occur), and essentially remove the 
> interest of using the dynamic allocation.
> - use dynamicAllocation.backlogTimeout: 
> - setting this parameter right to avoid creating unused executors is very 
> dependent on wall clock time. One basically needs to solve the exponential 
> ramp up for the target time. So this is not an option for my use case where I 
> don't want a per-job tuning. 
> - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar 
> resource consumption at the expense of 20% more wall clock time, or a similar 
> wall clock time at the expense of 60% more resource consumption than what I 
> got using my proposal @ 6 tasks per slot (this value being optimized over a 
> much larger range of jobs as already stated)
> - as mentioned in another comment, tampering with the exponential ramp up 
> might yield task imbalance and such old executors could become 

[jira] [Commented] (SPARK-15176) Job Scheduling Within Application Suffers from Priority Inversion

2018-02-07 Thread Alex Duvall (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355880#comment-16355880
 ] 

Alex Duvall commented on SPARK-15176:
-

Another interested party here - I'd find being able to limit max running tasks 
at a pool level extremely useful.

> Job Scheduling Within Application Suffers from Priority Inversion
> -
>
> Key: SPARK-15176
> URL: https://issues.apache.org/jira/browse/SPARK-15176
> Project: Spark
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 1.6.1
>Reporter: Nick White
>Priority: Major
>
> Say I have two pools, and N cores in my cluster:
> * I submit a job to one, which has M >> N tasks
> * N of the M tasks are scheduled
> * I submit a job to the second pool - but none of its tasks get scheduled 
> until a task from the other pool finishes!
> This can lead to unbounded denial-of-service for the second pool - regardless 
> of `minShare` or `weight` settings. Ideally Spark would support a pre-emption 
> mechanism, or an upper bound on a pool's resource usage.
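
For context, a minimal PySpark sketch of how jobs are routed to fair scheduler
pools today; the pool name and path are illustrative and an existing
SparkContext {{sc}} is assumed. Pool settings are currently limited to
schedulingMode, weight and minShare, so there is no per-pool cap on running
tasks, which is what is being asked for here.

{code}
# Route jobs submitted from this thread to the "interactive" pool.
sc.setLocalProperty("spark.scheduler.pool", "interactive")
count = sc.textFile("hdfs:///data/events").count()

# Revert this thread to the default pool.
sc.setLocalProperty("spark.scheduler.pool", "default")
{code}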



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23354) spark jdbc does not maintain length of data type when I move data from MS sql server to Oracle using spark jdbc

2018-02-07 Thread Lav Patel (JIRA)
Lav Patel created SPARK-23354:
-

 Summary: spark jdbc does not maintain length of data type when I 
move data from MS sql server to Oracle using spark jdbc
 Key: SPARK-23354
 URL: https://issues.apache.org/jira/browse/SPARK-23354
 Project: Spark
  Issue Type: Bug
  Components: Input/Output
Affects Versions: 2.2.1
Reporter: Lav Patel


Spark JDBC does not maintain the length of data types when I move data from MS 
SQL Server to Oracle using Spark JDBC.

 

To fix this, I have written code that figures out the length of each column and 
performs the conversion.

 

I can put more details with a code sample if the community is interested. 
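
In the meantime, a minimal PySpark sketch of one way to keep column lengths with
an option Spark already has; the URLs, credentials, table and column names are
illustrative, and an existing SparkSession {{spark}} plus the JDBC drivers are
assumed. The idea is to override the generated DDL types via
{{createTableColumnTypes}} when writing, so VARCHAR widths are not replaced by
the target dialect's defaults.

{code}
# Read the source table from SQL Server over JDBC.
src = (spark.read.format("jdbc")
       .option("url", "jdbc:sqlserver://mssql-host;databaseName=sales")
       .option("dbtable", "dbo.customers")
       .option("user", "...")
       .option("password", "...")
       .load())

# Write to Oracle, pinning the column types (and lengths) used in CREATE TABLE.
(src.write.format("jdbc")
    .option("url", "jdbc:oracle:thin:@oracle-host:1521/ORCL")
    .option("dbtable", "CUSTOMERS")
    .option("createTableColumnTypes", "NAME VARCHAR(64), CITY VARCHAR(32)")
    .option("user", "...")
    .option("password", "...")
    .save())
{code}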



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23345) Flaky test: FileBasedDataSourceSuite

2018-02-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li reassigned SPARK-23345:
---

Assignee: Liang-Chi Hsieh

> Flaky test: FileBasedDataSourceSuite
> 
>
> Key: SPARK-23345
> URL: https://issues.apache.org/jira/browse/SPARK-23345
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Assignee: Liang-Chi Hsieh
>Priority: Major
> Fix For: 2.3.0
>
>
> See at least once:
> https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87114/testReport/junit/org.apache.spark.sql/FileBasedDataSourceSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/
> {noformat}
> Error Message
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 650 times over 10.011358752 
> seconds. Last failure message: There are 1 possibly leaked file streams..
> Stacktrace
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 650 times over 10.011358752 
> seconds. Last failure message: There are 1 possibly leaked file streams..
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:26)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:337)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:26)
>   at 
> org.apache.spark.sql.test.SharedSparkSession$class.afterEach(SharedSparkSession.scala:114)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.afterEach(FileBasedDataSourceSuite.scala:26)
>   at 
> org.scalatest.BeforeAndAfterEach$$anonfun$1.apply$mcV$sp(BeforeAndAfterEach.scala:234)
> {noformat}
> Log file is too large to attach here, though.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23345) Flaky test: FileBasedDataSourceSuite

2018-02-07 Thread Xiao Li (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Xiao Li resolved SPARK-23345.
-
   Resolution: Fixed
Fix Version/s: 2.3.0

> Flaky test: FileBasedDataSourceSuite
> 
>
> Key: SPARK-23345
> URL: https://issues.apache.org/jira/browse/SPARK-23345
> Project: Spark
>  Issue Type: Bug
>  Components: SQL, Tests
>Affects Versions: 2.4.0
>Reporter: Marcelo Vanzin
>Priority: Major
> Fix For: 2.3.0
>
>
> See at least once:
> https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87114/testReport/junit/org.apache.spark.sql/FileBasedDataSourceSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/
> {noformat}
> Error Message
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 650 times over 10.011358752 
> seconds. Last failure message: There are 1 possibly leaked file streams..
> Stacktrace
> sbt.ForkMain$ForkError: 
> org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to 
> eventually never returned normally. Attempted 650 times over 10.011358752 
> seconds. Last failure message: There are 1 possibly leaked file streams..
>   at 
> org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:26)
>   at 
> org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:337)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:26)
>   at 
> org.apache.spark.sql.test.SharedSparkSession$class.afterEach(SharedSparkSession.scala:114)
>   at 
> org.apache.spark.sql.FileBasedDataSourceSuite.afterEach(FileBasedDataSourceSuite.scala:26)
>   at 
> org.scalatest.BeforeAndAfterEach$$anonfun$1.apply$mcV$sp(BeforeAndAfterEach.scala:234)
> {noformat}
> Log file is too large to attach here, though.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23351) checkpoint corruption in long running application

2018-02-07 Thread David Ahern (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Ahern updated SPARK-23351:

Description: 
Hi, after leaving my (somewhat high volume) Structured Streaming application 
running for some time, I get the following exception. The same exception also 
repeats when I try to restart the application. The only way to get the 
application back running is to clear the checkpoint directory, which is far from 
ideal.

Maybe a stream is not being flushed/closed properly internally by Spark when 
checkpointing?

 
 User class threw exception: 
org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to 
stage failure: Task 55 in stage 1.0 failed 4 times, most recent failure: Lost 
task 55.3 in stage 1.0 (TID 240, gbslixaacspa04u.metis.prd, executor 2): 
java.io.EOFException
 at java.io.DataInputStream.readInt(DataInputStream.java:392)
 at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:481)
 at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359)
 at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358)
 at scala.Option.getOrElse(Option.scala:121)
 at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358)
 at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:265)
 at 
org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:200)
 at 
org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
 at org.apache.spark.scheduler.Task.run(Task.scala:108)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)

  was:
hi, after leaving my (somewhat high volume) Structured Streaming application 
running for some time, i get the following exception.  The same exception also 
repeats when i try to restart the application.  The only way to get the 
application back running is to clear the checkpoint directory which is far from 
ideal.

Maybe a stream in not being flushed/closed properly internally by Spark when 
checkpointing?

 
 User class threw exception: 
org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to 
stage failure: Task 55 in stage 1.0 failed 4 times, most recent failure: Lost 
task 55.3 in stage 1.0 (TID 240, gbslixaacspa04u.metis.prd, executor 2): 
java.io.EOFException
 at java.io.DataInputStream.readInt(DataInputStream.java:392)
 at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:481)
 at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359)
 at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358)
 at scala.Option.getOrElse(Option.scala:121)
 at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358)
 at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:265)
 at 

[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2018-02-07 Thread Xuefu Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355634#comment-16355634
 ] 

Xuefu Zhang commented on SPARK-22683:
-

+1 on the idea of including this. Also, +1 on renaming the config. Thanks.

> DynamicAllocation wastes resources by allocating containers that will barely 
> be used
> 
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>Priority: Major
>  Labels: pull-request-available
>
> While migrating a series of jobs from MR to Spark using dynamicAllocation, 
> I've noticed almost a doubling (+114% exactly) of resource consumption of 
> Spark w.r.t MR, for a wall clock time gain of 43%
> About the context: 
> - resource usage stands for vcore-hours allocation for the whole job, as seen 
> by YARN
> - I'm talking about a series of jobs because we provide our users with a way 
> to define experiments (via UI / DSL) that automatically get translated to 
> Spark / MR jobs and submitted on the cluster
> - we submit around 500 of such jobs each day
> - these jobs are usually one shot, and the amount of processing can vary a 
> lot between jobs, and as such finding an efficient number of executors for 
> each job is difficult to get right, which is the reason I took the path of 
> dynamic allocation.  
> - Some of the tests have been scheduled on an idle queue, some on a full 
> queue.
> - experiments have been conducted with spark.executor.cores = 5 and 10, only 
> results for 5 cores have been reported because efficiency was overall better 
> than with 10 cores
> - the figures I give are averaged over a representative sample of those jobs 
> (about 600 jobs) ranging from tens to thousands splits in the data 
> partitioning and between 400 to 9000 seconds of wall clock time.
> - executor idle timeout is set to 30s;
>  
> Definition: 
> - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, 
> which represent the max number of tasks an executor will process in parallel.
> - the current behaviour of the dynamic allocation is to allocate enough 
> containers to have one taskSlot per task, which minimizes latency, but wastes 
> resources when tasks are small regarding executor allocation and idling 
> overhead. 
> The results using the proposal (described below) over the job sample (600 
> jobs):
> - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in 
> resource usage, for a 37% (against 43%) reduction in wall clock time for 
> Spark w.r.t MR
> - by trying to minimize the average resource consumption, I ended up with 6 
> tasks per core, with a 30% resource usage reduction, for a similar wall clock 
> time w.r.t. MR
> What did I try to solve the issue with existing parameters (summing up a few 
> points mentioned in the comments) ?
> - change dynamicAllocation.maxExecutors: this would need to be adapted for 
> each job (tens to thousands splits can occur), and essentially remove the 
> interest of using the dynamic allocation.
> - use dynamicAllocation.backlogTimeout: 
> - setting this parameter right to avoid creating unused executors is very 
> dependent on wall clock time. One basically needs to solve the exponential 
> ramp up for the target time. So this is not an option for my use case where I 
> don't want a per-job tuning. 
> - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar 
> resource consumption at the expense of 20% more wall clock time, or a similar 
> wall clock time at the expense of 60% more resource consumption than what I 
> got using my proposal @ 6 tasks per slot (this value being optimized over a 
> much larger range of jobs as already stated)
> - as mentioned in another comment, tampering with the exponential ramp up 
> might yield task imbalance and such old executors could become contention 
> points for other exes trying to remotely access blocks in the old exes (not 
> witnessed in the jobs I'm talking about, but we did see this behavior in 
> other jobs)
> Proposal: 
> Simply add a tasksPerExecutorSlot parameter, which makes it possible to 
> specify how many tasks a single taskSlot should ideally execute to mitigate 
> the overhead of executor allocation.
> PR: https://github.com/apache/spark/pull/19881



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23341) DataSourceOptions should handle path and table names to avoid confusion.

2018-02-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23341:


Assignee: Apache Spark

> DataSourceOptions should handle path and table names to avoid confusion.
> 
>
> Key: SPARK-23341
> URL: https://issues.apache.org/jira/browse/SPARK-23341
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Assignee: Apache Spark
>Priority: Major
>
> DataSourceOptions should have getters for path and table, which should be 
> passed in when creating the options.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23341) DataSourceOptions should handle path and table names to avoid confusion.

2018-02-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355563#comment-16355563
 ] 

Apache Spark commented on SPARK-23341:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/20535

> DataSourceOptions should handle path and table names to avoid confusion.
> 
>
> Key: SPARK-23341
> URL: https://issues.apache.org/jira/browse/SPARK-23341
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> DataSourceOptions should have getters for path and table, which should be 
> passed in when creating the options.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23341) DataSourceOptions should handle path and table names to avoid confusion.

2018-02-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23341:


Assignee: (was: Apache Spark)

> DataSourceOptions should handle path and table names to avoid confusion.
> 
>
> Key: SPARK-23341
> URL: https://issues.apache.org/jira/browse/SPARK-23341
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Ryan Blue
>Priority: Major
>
> DataSourceOptions should have getters for path and table, which should be 
> passed in when creating the options.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used

2018-02-07 Thread Thomas Graves (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1637#comment-1637
 ] 

Thomas Graves commented on SPARK-22683:
---

OK, thanks. I would like to try this out myself on a few jobs, but my opinion 
is that we should put this config in. If others have strong disagreement, please 
speak up; otherwise I think we can move the discussion to the PR. I do think 
we need to change the name of the config.

 

> DynamicAllocation wastes resources by allocating containers that will barely 
> be used
> 
>
> Key: SPARK-22683
> URL: https://issues.apache.org/jira/browse/SPARK-22683
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.1.0, 2.2.0
>Reporter: Julien Cuquemelle
>Priority: Major
>  Labels: pull-request-available
>
> While migrating a series of jobs from MR to Spark using dynamicAllocation, 
> I've noticed almost a doubling (+114% exactly) of resource consumption of 
> Spark w.r.t MR, for a wall clock time gain of 43%
> About the context: 
> - resource usage stands for vcore-hours allocation for the whole job, as seen 
> by YARN
> - I'm talking about a series of jobs because we provide our users with a way 
> to define experiments (via UI / DSL) that automatically get translated to 
> Spark / MR jobs and submitted on the cluster
> - we submit around 500 of such jobs each day
> - these jobs are usually one shot, and the amount of processing can vary a 
> lot between jobs, and as such finding an efficient number of executors for 
> each job is difficult to get right, which is the reason I took the path of 
> dynamic allocation.  
> - Some of the tests have been scheduled on an idle queue, some on a full 
> queue.
> - experiments have been conducted with spark.executor.cores = 5 and 10, only 
> results for 5 cores have been reported because efficiency was overall better 
> than with 10 cores
> - the figures I give are averaged over a representative sample of those jobs 
> (about 600 jobs) ranging from tens to thousands splits in the data 
> partitioning and between 400 to 9000 seconds of wall clock time.
> - executor idle timeout is set to 30s;
>  
> Definition: 
> - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, 
> which represent the max number of tasks an executor will process in parallel.
> - the current behaviour of the dynamic allocation is to allocate enough 
> containers to have one taskSlot per task, which minimizes latency, but wastes 
> resources when tasks are small regarding executor allocation and idling 
> overhead. 
> The results using the proposal (described below) over the job sample (600 
> jobs):
> - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in 
> resource usage, for a 37% (against 43%) reduction in wall clock time for 
> Spark w.r.t MR
> - by trying to minimize the average resource consumption, I ended up with 6 
> tasks per core, with a 30% resource usage reduction, for a similar wall clock 
> time w.r.t. MR
> What did I try to solve the issue with existing parameters (summing up a few 
> points mentioned in the comments) ?
> - change dynamicAllocation.maxExecutors: this would need to be adapted for 
> each job (tens to thousands splits can occur), and essentially remove the 
> interest of using the dynamic allocation.
> - use dynamicAllocation.backlogTimeout: 
> - setting this parameter right to avoid creating unused executors is very 
> dependent on wall clock time. One basically needs to solve the exponential 
> ramp up for the target time. So this is not an option for my use case where I 
> don't want a per-job tuning. 
> - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar 
> resource consumption at the expense of 20% more wall clock time, or a similar 
> wall clock time at the expense of 60% more resource consumption than what I 
> got using my proposal @ 6 tasks per slot (this value being optimized over a 
> much larger range of jobs as already stated)
> - as mentioned in another comment, tampering with the exponential ramp up 
> might yield task imbalance and such old executors could become contention 
> points for other exes trying to remotely access blocks in the old exes (not 
> witnessed in the jobs I'm talking about, but we did see this behavior in 
> other jobs)
> Proposal: 
> Simply add a tasksPerExecutorSlot parameter, which makes it possible to 
> specify how many tasks a single taskSlot should ideally execute to mitigate 
> the overhead of executor allocation.
> PR: https://github.com/apache/spark/pull/19881



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (SPARK-23319) Skip PySpark tests for old Pandas and old PyArrow

2018-02-07 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-23319:
-
Target Version/s:   (was: 2.4.0)

> Skip PySpark tests for old Pandas and old PyArrow
> -
>
> Key: SPARK-23319
> URL: https://issues.apache.org/jira/browse/SPARK-23319
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.0
>
>
> This is related to SPARK-23300. We declare PyArrow >= 0.8.0 and Pandas >= 
> 0.19.2 for now. We should explicitly skip the tests in this case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-23319) Skip PySpark tests for old Pandas and old PyArrow

2018-02-07 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-23319.
--
  Resolution: Fixed
Assignee: Hyukjin Kwon
Target Version/s: 2.4.0

Fixed in https://github.com/apache/spark/pull/20487

> Skip PySpark tests for old Pandas and old PyArrow
> -
>
> Key: SPARK-23319
> URL: https://issues.apache.org/jira/browse/SPARK-23319
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
>
> This is related to SPARK-23300. We declare PyArrow >= 0.8.0 and Pandas >= 
> 0.19.2 for now. We should explicitly skip the tests in this case.
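
For reference, a minimal sketch (not the actual pyspark.sql.tests code) of the
version-gated skip this issue asks for: detect an old or missing Pandas and skip
the affected tests instead of failing them. The same pattern applies to PyArrow.

{code}
import unittest
from distutils.version import LooseVersion

try:
    import pandas
    _have_pandas = LooseVersion(pandas.__version__) >= LooseVersion("0.19.2")
except ImportError:
    _have_pandas = False

@unittest.skipIf(not _have_pandas, "Pandas >= 0.19.2 must be installed to run this test")
class ArrowTests(unittest.TestCase):
    def test_roundtrip(self):
        pass  # placeholder for an Arrow-based test body
{code}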



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23319) Skip PySpark tests for old Pandas and old PyArrow

2018-02-07 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-23319:
-
Fix Version/s: 2.4.0

> Skip PySpark tests for old Pandas and old PyArrow
> -
>
> Key: SPARK-23319
> URL: https://issues.apache.org/jira/browse/SPARK-23319
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.0
>
>
> This is related to SPARK-23300. We declare PyArrow >= 0.8.0 and Pandas >= 
> 0.19.2 for now. We should explicitly skip the tests in this case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23319) Skip PySpark tests for old Pandas and old PyArrow

2018-02-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355531#comment-16355531
 ] 

Apache Spark commented on SPARK-23319:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/20534

> Skip PySpark tests for old Pandas and old PyArrow
> -
>
> Key: SPARK-23319
> URL: https://issues.apache.org/jira/browse/SPARK-23319
> Project: Spark
>  Issue Type: Test
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> This is related to SPARK-23300. We declare PyArrow >= 0.8.0 and Pandas >= 
> 0.19.2 for now. We should explicitly skip the tests in this case.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23300) Print out if Pandas and PyArrow are installed or not in tests

2018-02-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355483#comment-16355483
 ] 

Apache Spark commented on SPARK-23300:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/20533

> Print out if Pandas and PyArrow are installed or not in tests
> -
>
> Key: SPARK-23300
> URL: https://issues.apache.org/jira/browse/SPARK-23300
> Project: Spark
>  Issue Type: Test
>  Components: PySpark, Tests
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Hyukjin Kwon
>Priority: Major
> Fix For: 2.4.0
>
>
> This is related to SPARK-23292. Some tests with, for example, PyArrow and 
> Pandas are currently not running in some Python versions in Jenkins.
> At least, we should print out explicitly in the console whether we are 
> skipping the tests or not.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23353) Allow ExecutorMetricsUpdate events to be logged to the event log with sampling

2018-02-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23353:


Assignee: (was: Apache Spark)

> Allow ExecutorMetricsUpdate events to be logged to the event log with sampling
> --
>
> Key: SPARK-23353
> URL: https://issues.apache.org/jira/browse/SPARK-23353
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Lantao Jin
>Priority: Minor
>
> [SPARK-22050|https://issues.apache.org/jira/browse/SPARK-22050] gave a way to 
> log BlockUpdated events. The ExecutorMetricsUpdate events are also very useful 
> if they can be persisted for further analysis.
> For performance reasons, and based on an actual use case, the PR offers a 
> fraction configuration that can sample the events to be persisted. We also 
> refactor BlockUpdated logging to use the same sampling approach.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23353) Allow ExecutorMetricsUpdate events to be logged to the event log with sampling

2018-02-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23353:


Assignee: Apache Spark

> Allow ExecutorMetricsUpdate events to be logged to the event log with sampling
> --
>
> Key: SPARK-23353
> URL: https://issues.apache.org/jira/browse/SPARK-23353
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Lantao Jin
>Assignee: Apache Spark
>Priority: Minor
>
> [SPARK-22050|https://issues.apache.org/jira/browse/SPARK-22050] gave a way to 
> log BlockUpdated events. The ExecutorMetricsUpdate events are also very useful 
> if they can be persisted for further analysis.
> For performance reasons, and based on an actual use case, the PR offers a 
> fraction configuration that can sample the events to be persisted. We also 
> refactor BlockUpdated logging to use the same sampling approach.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23353) Allow ExecutorMetricsUpdate events to be logged to the event log with sampling

2018-02-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355454#comment-16355454
 ] 

Apache Spark commented on SPARK-23353:
--

User 'LantaoJin' has created a pull request for this issue:
https://github.com/apache/spark/pull/20532

> Allow ExecutorMetricsUpdate events to be logged to the event log with sampling
> --
>
> Key: SPARK-23353
> URL: https://issues.apache.org/jira/browse/SPARK-23353
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.2.1
>Reporter: Lantao Jin
>Priority: Minor
>
> [SPARK-22050|https://issues.apache.org/jira/browse/SPARK-22050] gave a way to 
> log BlockUpdated events. The ExecutorMetricsUpdate events are also very useful 
> if they can be persisted for further analysis.
> For performance reasons, and based on an actual use case, the PR offers a 
> fraction configuration that can sample the events to be persisted. We also 
> refactor BlockUpdated logging to use the same sampling approach.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23352) Explicitly specify supported types in Pandas UDFs

2018-02-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23352:


Assignee: Apache Spark

> Explicitly specify supported types in Pandas UDFs
> -
>
> Key: SPARK-23352
> URL: https://issues.apache.org/jira/browse/SPARK-23352
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Assignee: Apache Spark
>Priority: Major
>
> Currently, we don't support {{BinaryType}} in Pandas UDFs:
> {code}
> >>> from pyspark.sql.functions import pandas_udf
> >>> pudf = pandas_udf(lambda x: x, "binary")
> >>> df = spark.createDataFrame([[bytearray("a")]])
> >>> df.select(pudf("_1")).show()
> ...
> TypeError: Unsupported type in conversion to Arrow: BinaryType
> {code}
> Also, the grouped aggregate Pandas UDF fails fast on {{ArrayType}}, but it 
> seems we can support this case.
> We should clarify this in the Pandas UDF documentation, and fail fast with 
> type checking ahead of time rather than at execution time.
> Please consider this case:
> {code}
> pandas_udf(lambda x: x, BinaryType())  # we can fail fast at this stage 
> because we know the schema ahead
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23352) Explicitly specify supported types in Pandas UDFs

2018-02-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23352:


Assignee: (was: Apache Spark)

> Explicitly specify supported types in Pandas UDFs
> -
>
> Key: SPARK-23352
> URL: https://issues.apache.org/jira/browse/SPARK-23352
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, we don't support {{BinaryType}} in Pandas UDFs:
> {code}
> >>> from pyspark.sql.functions import pandas_udf
> >>> pudf = pandas_udf(lambda x: x, "binary")
> >>> df = spark.createDataFrame([[bytearray("a")]])
> >>> df.select(pudf("_1")).show()
> ...
> TypeError: Unsupported type in conversion to Arrow: BinaryType
> {code}
> Also, the grouped aggregate Pandas UDF fails fast on {{ArrayType}}, but it 
> seems we can support this case.
> We should clarify this in the Pandas UDF documentation, and fail fast with 
> type checking ahead of time rather than at execution time.
> Please consider this case:
> {code}
> pandas_udf(lambda x: x, BinaryType())  # we can fail fast at this stage 
> because we know the schema ahead
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23352) Explicitly specify supported types in Pandas UDFs

2018-02-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355442#comment-16355442
 ] 

Apache Spark commented on SPARK-23352:
--

User 'HyukjinKwon' has created a pull request for this issue:
https://github.com/apache/spark/pull/20531

> Explicitly specify supported types in Pandas UDFs
> -
>
> Key: SPARK-23352
> URL: https://issues.apache.org/jira/browse/SPARK-23352
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, we don't support {{BinaryType}} in Pandas UDFs:
> {code}
> >>> from pyspark.sql.functions import pandas_udf
> >>> pudf = pandas_udf(lambda x: x, "binary")
> >>> df = spark.createDataFrame([[bytearray("a")]])
> >>> df.select(pudf("_1")).show()
> ...
> TypeError: Unsupported type in conversion to Arrow: BinaryType
> {code}
> Also, the grouped aggregate Pandas UDF fails fast on {{ArrayType}}, but it 
> seems we can support this case.
> We should clarify this in the Pandas UDF documentation, and fail fast with 
> type checking ahead of time rather than at execution time.
> Please consider this case:
> {code}
> pandas_udf(lambda x: x, BinaryType())  # we can fail fast at this stage 
> because we know the schema ahead
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23353) Allow ExecutorMetricsUpdate events to be logged to the event log with sampling

2018-02-07 Thread Lantao Jin (JIRA)
Lantao Jin created SPARK-23353:
--

 Summary: Allow ExecutorMetricsUpdate events to be logged to the 
event log with sampling
 Key: SPARK-23353
 URL: https://issues.apache.org/jira/browse/SPARK-23353
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.2.1
Reporter: Lantao Jin


[SPARK-22050|https://issues.apache.org/jira/browse/SPARK-22050] gave a way to 
log BlockUpdated events. The ExecutorMetricsUpdate events are also very useful 
if they can be persisted for further analysis.

For performance reasons, and based on an actual use case, the PR offers a 
fraction configuration that can sample the events to be persisted. We also 
refactor BlockUpdated logging to use the same sampling approach.
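
For illustration, a minimal sketch of the kind of configuration described above.
{{spark.eventLog.logBlockUpdates.enabled}} exists today (from SPARK-22050); the
fraction key below is hypothetical until the PR settles on a name.

{code}
from pyspark import SparkConf

conf = (SparkConf()
        .set("spark.eventLog.enabled", "true")
        # Existing setting added by SPARK-22050.
        .set("spark.eventLog.logBlockUpdates.enabled", "true")
        # Hypothetical key: persist only a sampled fraction of
        # ExecutorMetricsUpdate events.
        .set("spark.eventLog.logExecutorMetricsUpdates.fraction", "0.1"))
{code}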




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23352) Explicitly specify supported types in Pandas UDFs

2018-02-07 Thread Hyukjin Kwon (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-23352:
-
Description: 
Currently, we don't support {{BinaryType}} in Pandas UDFs:

{code}
>>> from pyspark.sql.functions import pandas_udf
>>> pudf = pandas_udf(lambda x: x, "binary")
>>> df = spark.createDataFrame([[bytearray("a")]])
>>> df.select(pudf("_1")).show()
...
TypeError: Unsupported type in conversion to Arrow: BinaryType
{code}

Also, the grouped aggregate Pandas UDF fails fast on {{ArrayType}}, but it seems 
we can support this case.

We should clarify this in the Pandas UDF documentation, and fail fast with type 
checking ahead of time rather than at execution time.

Please consider this case:

{code}
pandas_udf(lambda x: x, BinaryType())  # we can fail fast at this stage because 
we know the schema ahead
{code}

  was:
Currently, we don't support {{BinaryType}} in Pandas UDFs:

{code}
>>> from pyspark.sql.functions import pandas_udf
>>> pudf = pandas_udf(lambda x: x, "binary")
>>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>>> df = spark.createDataFrame([[bytearray("a")]])
>>> df.select(pudf("_1")).show()
...
TypeError: Unsupported type in conversion to Arrow: BinaryType
{code}

Also, the grouped aggregate Pandas UDF fails fast on {{ArrayType}}, but it seems 
we can support this case.

We should clarify this in the Pandas UDF documentation, and fail fast with type 
checking ahead of time rather than at execution time.

Please consider this case:

{code}
pandas_udf(lambda x: x, BinaryType())  # we can fail fast at this stage because 
we know the schema ahead
{code}


> Explicitly specify supported types in Pandas UDFs
> -
>
> Key: SPARK-23352
> URL: https://issues.apache.org/jira/browse/SPARK-23352
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark
>Affects Versions: 2.3.0
>Reporter: Hyukjin Kwon
>Priority: Major
>
> Currently, we don't support {{BinaryType}} in Pandas UDFs:
> {code}
> >>> from pyspark.sql.functions import pandas_udf
> >>> pudf = pandas_udf(lambda x: x, "binary")
> >>> df = spark.createDataFrame([[bytearray("a")]])
> >>> df.select(pudf("_1")).show()
> ...
> TypeError: Unsupported type in conversion to Arrow: BinaryType
> {code}
> Also, the grouped aggregate Pandas UDF fails fast on {{ArrayType}}, but it seems 
> we can support this case.
> We should clarify this better in the Pandas UDF documentation, and fail fast 
> with type checking ahead of time rather than at execution time.
> Please consider this case:
> {code}
> pandas_udf(lambda x: x, BinaryType())  # we can fail fast at this stage 
> because we know the schema ahead
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23352) Explicitly specify supported types in Pandas UDFs

2018-02-07 Thread Hyukjin Kwon (JIRA)
Hyukjin Kwon created SPARK-23352:


 Summary: Explicitly specify supported types in Pandas UDFs
 Key: SPARK-23352
 URL: https://issues.apache.org/jira/browse/SPARK-23352
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark
Affects Versions: 2.3.0
Reporter: Hyukjin Kwon


Currently, we don't support {{BinaryType}} in Pandas UDFs:

{code}
>>> from pyspark.sql.functions import pandas_udf
>>> pudf = pandas_udf(lambda x: x, "binary")
>>> spark.conf.set("spark.sql.execution.arrow.enabled", "true")
>>> df = spark.createDataFrame([[bytearray("a")]])
>>> df.select(pudf("_1")).show()
...
TypeError: Unsupported type in conversion to Arrow: BinaryType
{code}

Also, the grouped aggregate Pandas UDF fails fast on {{ArrayType}}, but it seems 
we can support this case.

We should clarify this better in the Pandas UDF documentation, and fail fast 
with type checking ahead of time rather than at execution time.

Please consider this case:

{code}
pandas_udf(lambda x: x, BinaryType())  # we can fail fast at this stage because 
we know the schema ahead
{code}
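
As an editorial illustration of the proposed fail-fast behaviour (not the 
actual PySpark implementation), here is a minimal sketch that rejects an 
unsupported return type at definition time; the check_return_type helper and 
UNSUPPORTED_TYPES set are hypothetical names:

{code:python}
from pyspark.sql.types import BinaryType, DataType

# Validate the declared return type as soon as the UDF is defined, instead of
# failing later inside the Arrow conversion.
UNSUPPORTED_TYPES = (BinaryType,)

def check_return_type(return_type: DataType) -> None:
    if isinstance(return_type, UNSUPPORTED_TYPES):
        raise TypeError(
            "Unsupported returnType for pandas_udf: %s" % return_type.simpleString())

# check_return_type(BinaryType())  # raises immediately, before any job runs
{code}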



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23349) Duplicate and redundant type determination for ShuffleManager Object

2018-02-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355340#comment-16355340
 ] 

Apache Spark commented on SPARK-23349:
--

User 'wujianping10043419' has created a pull request for this issue:
https://github.com/apache/spark/pull/20530

> Duplicate and redundant type determination for ShuffleManager Object
> 
>
> Key: SPARK-23349
> URL: https://issues.apache.org/jira/browse/SPARK-23349
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, SQL
>Affects Versions: 2.2.1
>Reporter: Phoenix_Daddy
>Priority: Minor
> Fix For: 2.2.1
>
>
> org.apache.spark.sql.execution.exchange.ShuffleExchange
> In the "needToCopyObjectsBeforeShuffle()" function, there is a nested "if/else" 
> branch.
> The  condition in the first layer "if" has the same 
> value as the  condition in the second layer "if", that is, 
>  must be true when  is true.
> In addition, the  condition will be used in the second 
> layer "if" and should not be calculated until needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23350) [SS]Exception when stopping continuous processing application

2018-02-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355326#comment-16355326
 ] 

Apache Spark commented on SPARK-23350:
--

User 'yanlin-Lynn' has created a pull request for this issue:
https://github.com/apache/spark/pull/20529

> [SS]Exception when stopping continuous processing application
> -
>
> Key: SPARK-23350
> URL: https://issues.apache.org/jira/browse/SPARK-23350
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wang Yanlin
>Priority: Major
>
> A SparkException happens when stopping a continuous processing application 
> using Ctrl-C in stand-alone mode.
> 18/02/02 16:12:57 ERROR ContinuousExecution: Query yanlin-CP-job [id = 
> 007f1f44-771a-4097-aaa3-28ae35c16dd9, runId = 
> 3e1ab7c1-4d6f-475a-9d2c-45577643b0dd] terminated with error
> org.apache.spark.SparkException: Writing job failed.
>   at 
> org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:105)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3$$anonfun$apply$1.apply(ContinuousExecution.scala:268)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3$$anonfun$apply$1.apply(ContinuousExecution.scala:268)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3.apply(ContinuousExecution.scala:268)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3.apply(ContinuousExecution.scala:268)
>   at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution.runContinuous(ContinuousExecution.scala:266)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution.runActivatedStream(ContinuousExecution.scala:90)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
> Caused by: org.apache.spark.SparkException: Job 0 cancelled because 
> SparkContext was shut down
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:837)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:835)
>   at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
>   at 
> org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:835)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1831)
>   at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83)
>   at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1743)
>   at 
> org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1924)
>   at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357)
>   at org.apache.spark.SparkContext.stop(SparkContext.scala:1923)
>   at 
> org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:572)
>   at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>   at 

[jira] [Updated] (SPARK-23351) checkpoint corruption in long running application

2018-02-07 Thread David Ahern (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Ahern updated SPARK-23351:

Description: 
Hi, after leaving my (somewhat high volume) Structured Streaming application 
running for some time, I get the following exception.  The same exception also 
repeats when I try to restart the application.  The only way to get the 
application back running is to clear the checkpoint directory, which is far 
from ideal.

Maybe a stream is not being flushed/closed properly internally by Spark when 
checkpointing?

 
 User class threw exception: 
org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to 
stage failure: Task 55 in stage 1.0 failed 4 times, most recent failure: Lost 
task 55.3 in stage 1.0 (TID 240, gbslixaacspa04u.metis.prd, executor 2): 
java.io.EOFException
 at java.io.DataInputStream.readInt(DataInputStream.java:392)
 at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:481)
 at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359)
 at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358)
 at scala.Option.getOrElse(Option.scala:121)
 at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358)
 at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:265)
 at 
org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:200)
 at 
org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
 at org.apache.spark.scheduler.Task.run(Task.scala:108)
 at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)

  was:
Hi, after leaving my (somewhat high volume) Structured Streaming running for 
some time, I get the following exception.  The same exception also repeats when 
I try to restart the application.  The only way to get the application back 
running is to clear the checkpoint directory, which is far from ideal.

Maybe a stream is not being flushed/closed properly internally by Spark when 
checkpointing?

 
User class threw exception: 
org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to 
stage failure: Task 55 in stage 1.0 failed 4 times, most recent failure: Lost 
task 55.3 in stage 1.0 (TID 240, gbslixaacspa04u.metis.prd, executor 2): 
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:481)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:265)
at 

[jira] [Commented] (SPARK-23350) [SS]Exception when stopping continuous processing application

2018-02-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355311#comment-16355311
 ] 

Apache Spark commented on SPARK-23350:
--

User 'yanlin-Lynn' has created a pull request for this issue:
https://github.com/apache/spark/pull/20528

> [SS]Exception when stopping continuous processing application
> -
>
> Key: SPARK-23350
> URL: https://issues.apache.org/jira/browse/SPARK-23350
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wang Yanlin
>Priority: Major
>
> A SparkException happens when stopping a continuous processing application 
> using Ctrl-C in stand-alone mode.
> 18/02/02 16:12:57 ERROR ContinuousExecution: Query yanlin-CP-job [id = 
> 007f1f44-771a-4097-aaa3-28ae35c16dd9, runId = 
> 3e1ab7c1-4d6f-475a-9d2c-45577643b0dd] terminated with error
> org.apache.spark.SparkException: Writing job failed.
>   at 
> org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:105)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3$$anonfun$apply$1.apply(ContinuousExecution.scala:268)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3$$anonfun$apply$1.apply(ContinuousExecution.scala:268)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3.apply(ContinuousExecution.scala:268)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3.apply(ContinuousExecution.scala:268)
>   at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution.runContinuous(ContinuousExecution.scala:266)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution.runActivatedStream(ContinuousExecution.scala:90)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
> Caused by: org.apache.spark.SparkException: Job 0 cancelled because 
> SparkContext was shut down
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:837)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:835)
>   at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
>   at 
> org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:835)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1831)
>   at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83)
>   at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1743)
>   at 
> org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1924)
>   at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357)
>   at org.apache.spark.SparkContext.stop(SparkContext.scala:1923)
>   at 
> org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:572)
>   at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>   at 

[jira] [Assigned] (SPARK-23350) [SS]Exception when stopping continuous processing application

2018-02-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23350:


Assignee: Apache Spark

> [SS]Exception when stopping continuous processing application
> -
>
> Key: SPARK-23350
> URL: https://issues.apache.org/jira/browse/SPARK-23350
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wang Yanlin
>Assignee: Apache Spark
>Priority: Major
>
> A SparkException happens when stopping a continuous processing application 
> using Ctrl-C in stand-alone mode.
> 18/02/02 16:12:57 ERROR ContinuousExecution: Query yanlin-CP-job [id = 
> 007f1f44-771a-4097-aaa3-28ae35c16dd9, runId = 
> 3e1ab7c1-4d6f-475a-9d2c-45577643b0dd] terminated with error
> org.apache.spark.SparkException: Writing job failed.
>   at 
> org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:105)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3$$anonfun$apply$1.apply(ContinuousExecution.scala:268)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3$$anonfun$apply$1.apply(ContinuousExecution.scala:268)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3.apply(ContinuousExecution.scala:268)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3.apply(ContinuousExecution.scala:268)
>   at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution.runContinuous(ContinuousExecution.scala:266)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution.runActivatedStream(ContinuousExecution.scala:90)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
> Caused by: org.apache.spark.SparkException: Job 0 cancelled because 
> SparkContext was shut down
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:837)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:835)
>   at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
>   at 
> org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:835)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1831)
>   at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83)
>   at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1743)
>   at 
> org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1924)
>   at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357)
>   at org.apache.spark.SparkContext.stop(SparkContext.scala:1923)
>   at 
> org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:572)
>   at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>   at 
> 

[jira] [Assigned] (SPARK-23350) [SS]Exception when stopping continuous processing application

2018-02-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23350:


Assignee: (was: Apache Spark)

> [SS]Exception when stopping continuous processing application
> -
>
> Key: SPARK-23350
> URL: https://issues.apache.org/jira/browse/SPARK-23350
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wang Yanlin
>Priority: Major
>
> A SparkException happens when stopping a continuous processing application 
> using Ctrl-C in stand-alone mode.
> 18/02/02 16:12:57 ERROR ContinuousExecution: Query yanlin-CP-job [id = 
> 007f1f44-771a-4097-aaa3-28ae35c16dd9, runId = 
> 3e1ab7c1-4d6f-475a-9d2c-45577643b0dd] terminated with error
> org.apache.spark.SparkException: Writing job failed.
>   at 
> org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:105)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>   at 
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at 
> org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3$$anonfun$apply$1.apply(ContinuousExecution.scala:268)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3$$anonfun$apply$1.apply(ContinuousExecution.scala:268)
>   at 
> org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3.apply(ContinuousExecution.scala:268)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3.apply(ContinuousExecution.scala:268)
>   at 
> org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution.runContinuous(ContinuousExecution.scala:266)
>   at 
> org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution.runActivatedStream(ContinuousExecution.scala:90)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
>   at 
> org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
> Caused by: org.apache.spark.SparkException: Job 0 cancelled because 
> SparkContext was shut down
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:837)
>   at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:835)
>   at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
>   at 
> org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:835)
>   at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1831)
>   at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83)
>   at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1743)
>   at 
> org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1924)
>   at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357)
>   at org.apache.spark.SparkContext.stop(SparkContext.scala:1923)
>   at 
> org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:572)
>   at 
> org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
>   at 
> org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
>   at 
> 

[jira] [Created] (SPARK-23351) checkpoint corruption in long running application

2018-02-07 Thread David Ahern (JIRA)
David Ahern created SPARK-23351:
---

 Summary: checkpoint corruption in long running application
 Key: SPARK-23351
 URL: https://issues.apache.org/jira/browse/SPARK-23351
 Project: Spark
  Issue Type: Bug
  Components: Structured Streaming
Affects Versions: 2.2.0
Reporter: David Ahern


Hi, after leaving my (somewhat high volume) Structured Streaming application 
running for some time, I get the following exception.  The same exception also 
repeats when I try to restart the application.  The only way to get the 
application back running is to clear the checkpoint directory, which is far 
from ideal.

Maybe a stream is not being flushed/closed properly internally by Spark when 
checkpointing?

 
User class threw exception: 
org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to 
stage failure: Task 55 in stage 1.0 failed 4 times, most recent failure: Lost 
task 55.3 in stage 1.0 (TID 240, gbslixaacspa04u.metis.prd, executor 2): 
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:392)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:481)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358)
at 
org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:265)
at 
org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:200)
at 
org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23350) [SS]Exception when stopping continuous processing application

2018-02-07 Thread Wang Yanlin (JIRA)
Wang Yanlin created SPARK-23350:
---

 Summary: [SS]Exception when stopping continuous processing 
application
 Key: SPARK-23350
 URL: https://issues.apache.org/jira/browse/SPARK-23350
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wang Yanlin


A SparkException happens when stopping a continuous processing application 
using Ctrl-C in stand-alone mode.

18/02/02 16:12:57 ERROR ContinuousExecution: Query yanlin-CP-job [id = 
007f1f44-771a-4097-aaa3-28ae35c16dd9, runId = 
3e1ab7c1-4d6f-475a-9d2c-45577643b0dd] terminated with error
org.apache.spark.SparkException: Writing job failed.
at 
org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:105)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at 
org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at 
org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3$$anonfun$apply$1.apply(ContinuousExecution.scala:268)
at 
org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3$$anonfun$apply$1.apply(ContinuousExecution.scala:268)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
at 
org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3.apply(ContinuousExecution.scala:268)
at 
org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3.apply(ContinuousExecution.scala:268)
at 
org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58)
at 
org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution.runContinuous(ContinuousExecution.scala:266)
at 
org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution.runActivatedStream(ContinuousExecution.scala:90)
at 
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
at 
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189)
Caused by: org.apache.spark.SparkException: Job 0 cancelled because 
SparkContext was shut down
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:837)
at 
org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:835)
at scala.collection.mutable.HashSet.foreach(HashSet.scala:78)
at 
org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:835)
at 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1831)
at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1743)
at 
org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1924)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1923)
at 
org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:572)
at 
org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988)
at 
org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at 

[jira] [Commented] (SPARK-14047) GBT improvement umbrella

2018-02-07 Thread Nick Pentreath (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355216#comment-16355216
 ] 

Nick Pentreath commented on SPARK-14047:


SPARK-12375 should fix that? Can you check it against the 2.3 RC (or 
branch-2.3)? If not could you provide some code to reproduce the error?

> GBT improvement umbrella
> 
>
> Key: SPARK-14047
> URL: https://issues.apache.org/jira/browse/SPARK-14047
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Joseph K. Bradley
>Priority: Major
>
> This is an umbrella for improvements to learning Gradient Boosted Trees: 
> GBTClassifier, GBTRegressor.
> Note: Aspects of GBTs which are related to individual trees should be listed 
> under [SPARK-14045].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19870) Repeatable deadlock on BlockInfoManager and TorrentBroadcast

2018-02-07 Thread Eyal Farago (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355174#comment-16355174
 ] 

Eyal Farago commented on SPARK-19870:
-

[~irashid], I went through the executors' logs and found no errors.

I did find some fairly long GC pauses and warnings like this: 
{quote}WARN Executor: 1 block locks were not released by TID = 5413: [rdd_53_42]
{quote}
This aligns with my findings about tasks cleaning up after themselves, 
theoretically in the case of failure as well.

Since neither of the distros we're using offers any of the relevant versions 
that include your fix, I can't test this scenario with your patch (aside from 
other practical constraints :-)), so I can't really verify my assumption about 
the link between SPARK-19870 and SPARK-22083.

[~stevenruppert], [~gdanov], can you guys try running with one of the relevant 
versions?

> Repeatable deadlock on BlockInfoManager and TorrentBroadcast
> 
>
> Key: SPARK-19870
> URL: https://issues.apache.org/jira/browse/SPARK-19870
> Project: Spark
>  Issue Type: Bug
>  Components: Block Manager, Shuffle
>Affects Versions: 2.0.2, 2.1.0
> Environment: ubuntu linux 14.04 x86_64 on ec2, hadoop cdh 5.10.0, 
> yarn coarse-grained.
>Reporter: Steven Ruppert
>Priority: Major
> Attachments: stack.txt
>
>
> Running what I believe to be a fairly vanilla Spark job, using the RDD API, 
> with several shuffles, a cached RDD, and finally a conversion to DataFrame to 
> save to Parquet, I get a repeatable deadlock at the very last reducers of one 
> of the stages.
> Roughly:
> {noformat}
> "Executor task launch worker-6" #56 daemon prio=5 os_prio=0 
> tid=0x7fffd88d3000 nid=0x1022b9 waiting for monitor entry 
> [0x7fffb95f3000]
>java.lang.Thread.State: BLOCKED (on object monitor)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:207)
> - waiting to lock <0x0005445cfc00> (a 
> org.apache.spark.broadcast.TorrentBroadcast$)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
> - locked <0x0005b12f2290> (a 
> org.apache.spark.broadcast.TorrentBroadcast)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
> at 
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
> at org.apache.spark.scheduler.Task.run(Task.scala:99)
> at 
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
> {noformat}
> and 
> {noformat}
> "Executor task launch worker-5" #55 daemon prio=5 os_prio=0 
> tid=0x7fffd88d nid=0x1022b8 in Object.wait() [0x7fffb96f4000]
>java.lang.Thread.State: WAITING (on object monitor)
> at java.lang.Object.wait(Native Method)
> at java.lang.Object.wait(Object.java:502)
> at 
> org.apache.spark.storage.BlockInfoManager.lockForReading(BlockInfoManager.scala:202)
> - locked <0x000545736b58> (a 
> org.apache.spark.storage.BlockInfoManager)
> at 
> org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:444)
> at 
> org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:210)
> - locked <0x0005445cfc00> (a 
> org.apache.spark.broadcast.TorrentBroadcast$)
> at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
> - locked <0x00059711eb10> (a 
> org.apache.spark.broadcast.TorrentBroadcast)
> at 
> org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
> at 
> org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
> at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
> at 
> 

[jira] [Commented] (SPARK-23348) append data using saveAsTable should adjust the data types

2018-02-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355160#comment-16355160
 ] 

Apache Spark commented on SPARK-23348:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/20527

> append data using saveAsTable should adjust the data types
> --
>
> Key: SPARK-23348
> URL: https://issues.apache.org/jira/browse/SPARK-23348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>
>  
> {code:java}
> Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t")
> Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t")
> {code}
>  
> This query will fail with a strange error:
>  
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 
> (TID 15, localhost, executor driver): 
> java.lang.UnsupportedOperationException: Unimplemented type: IntegerType
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261)
> ...
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23348) append data using saveAsTable should adjust the data types

2018-02-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23348:


Assignee: Apache Spark

> append data using saveAsTable should adjust the data types
> --
>
> Key: SPARK-23348
> URL: https://issues.apache.org/jira/browse/SPARK-23348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>Priority: Major
>
>  
> {code:java}
> Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t")
> Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t")
> {code}
>  
> This query will fail with a strange error:
>  
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 
> (TID 15, localhost, executor driver): 
> java.lang.UnsupportedOperationException: Unimplemented type: IntegerType
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261)
> ...
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23348) append data using saveAsTable should adjust the data types

2018-02-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23348:


Assignee: (was: Apache Spark)

> append data using saveAsTable should adjust the data types
> --
>
> Key: SPARK-23348
> URL: https://issues.apache.org/jira/browse/SPARK-23348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>
>  
> {code:java}
> Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t")
> Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t")
> {code}
>  
> This query will fail with a strange error:
>  
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 
> (TID 15, localhost, executor driver): 
> java.lang.UnsupportedOperationException: Unimplemented type: IntegerType
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261)
> ...
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23329) Update the function descriptions with the arguments and returned values of the trigonometric functions

2018-02-07 Thread Mihaly Toth (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355155#comment-16355155
 ] 

Mihaly Toth commented on SPARK-23329:
-

Nice. I like that the redundant description part is simply omitted.

The only issue I see is that {{e}} is actually not an angle. At the very least, 
we could put it in the plural, like:
{code}
/**
 * @param e angles in radians
 * @return sines of the angles, as if computed by [[java.lang.Math.sin]]
 * ...
 */
{code}
 
I was also thinking that mentioning it is actually a Column does not inflate 
the lines very much, and at the same time the descriptions are more precise.

{code}
/**
 * @param e [[Column]] of angles in radians
 * @return [[Column]] of sines of the angles, as if computed by 
[[java.lang.Math.sin]]
 * ...
 */
{code}

Now, looking at this, the first one seems better because it tells the truth, 
and one can easily figure out that the angles are stored in a Column.

> Update the function descriptions with the arguments and returned values of 
> the trigonometric functions
> --
>
> Key: SPARK-23329
> URL: https://issues.apache.org/jira/browse/SPARK-23329
> Project: Spark
>  Issue Type: Documentation
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Xiao Li
>Priority: Minor
>  Labels: starter
>
> We need an update on the function descriptions for all the trigonometric 
> functions, for example {{cos}}, {{sin}}, and {{cot}}. Internally, the 
> implementation is based on java.lang.Math. We need a clear description of 
> the units of the input arguments and the returned values. 
> For example, the following descriptions are lacking such info. 
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L551-L555
> https://github.com/apache/spark/blob/d5861aba9d80ca15ad3f22793b79822e470d6913/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L1978



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23349) Duplicate and redundant type determination for ShuffleManager Object

2018-02-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23349:


Assignee: Apache Spark

> Duplicate and redundant type determination for ShuffleManager Object
> 
>
> Key: SPARK-23349
> URL: https://issues.apache.org/jira/browse/SPARK-23349
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, SQL
>Affects Versions: 2.2.1
>Reporter: Phoenix_Daddy
>Assignee: Apache Spark
>Priority: Minor
> Fix For: 2.2.1
>
>
> org.apache.spark.sql.execution.exchange.ShuffleExchange
> In the "needToCopyObjectsBeforeShuffle()" function, there is a nested "if/else" 
> branch.
> The  condition in the first layer "if" has the same 
> value as the  condition in the second layer "if", that is, 
>  must be true when  is true.
> In addition, the  condition will be used in the second 
> layer "if" and should not be calculated until needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23349) Duplicate and redundant type determination for ShuffleManager Object

2018-02-07 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355137#comment-16355137
 ] 

Apache Spark commented on SPARK-23349:
--

User 'wujianping10043419' has created a pull request for this issue:
https://github.com/apache/spark/pull/20526

> Duplicate and redundant type determination for ShuffleManager Object
> 
>
> Key: SPARK-23349
> URL: https://issues.apache.org/jira/browse/SPARK-23349
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, SQL
>Affects Versions: 2.2.1
>Reporter: Phoenix_Daddy
>Priority: Minor
> Fix For: 2.2.1
>
>
> org.apache.spark.sql.execution.exchange.ShuffleExchange
> In the "needToCopyObjectsBeforeShuffle()" function, there is a nested "if/else" 
> branch.
> The  condition in the first layer "if" has the same 
> value as the  condition in the second layer "if", that is, 
>  must be true when  is true.
> In addition, the  condition will be used in the second 
> layer "if" and should not be calculated until needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23349) Duplicate and redundant type determination for ShuffleManager Object

2018-02-07 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23349:


Assignee: (was: Apache Spark)

> Duplicate and redundant type determination for ShuffleManager Object
> 
>
> Key: SPARK-23349
> URL: https://issues.apache.org/jira/browse/SPARK-23349
> Project: Spark
>  Issue Type: Improvement
>  Components: Shuffle, SQL
>Affects Versions: 2.2.1
>Reporter: Phoenix_Daddy
>Priority: Minor
> Fix For: 2.2.1
>
>
> org.apache.spark.sql.execution.exchange.ShuffleExchange
> In the "needToCopyObjectsBeforeShuffle()" function, there is a nested "if/else" 
> branch.
> The  condition in the first layer "if" has the same 
> value as the  condition in the second layer "if", that is, 
>  must be true when  is true.
> In addition, the  condition will be used in the second 
> layer "if" and should not be calculated until needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23349) Duplicate and redundant type determination for ShuffleManager Object

2018-02-07 Thread Phoenix_Daddy (JIRA)
Phoenix_Daddy created SPARK-23349:
-

 Summary: Duplicate and redundant type determination for 
ShuffleManager Object
 Key: SPARK-23349
 URL: https://issues.apache.org/jira/browse/SPARK-23349
 Project: Spark
  Issue Type: Improvement
  Components: Shuffle, SQL
Affects Versions: 2.2.1
Reporter: Phoenix_Daddy
 Fix For: 2.2.1


org.apache.spark.sql.execution.exchange.ShuffleExchange

In the "needToCopyObjectsBeforeShuffle()" function, there is a nested "if/else" 
branch.
The  condition in the first layer "if" has the same value 
as the  condition in the second layer "if", that is, 
 must be true when  is true.
In addition, the  condition will be used in the second 
layer "if" and should not be calculated until needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-23348) append data using saveAsTable should adjust the data types

2018-02-07 Thread Wenchen Fan (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenchen Fan updated SPARK-23348:

Description: 
 
{code:java}
Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t")

Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t")
{code}
 
This query will fail with a strange error:

 
{code:java}
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 
(TID 15, localhost, executor driver): java.lang.UnsupportedOperationException: 
Unimplemented type: IntegerType
 at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473)
 at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214)
 at 
org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261)
...
{code}
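
As an editorial illustration only (not the proposed fix), here is a PySpark 
sketch of manually aligning an incoming DataFrame with the existing table's 
schema before appending; the align_to_table helper is a hypothetical name:

{code:python}
from pyspark.sql.functions import col

# Cast and reorder the incoming DataFrame's columns to match the target table's
# schema before appending, so the table does not end up with mismatched
# physical types for the same column.
def align_to_table(spark, df, table_name):
    target = spark.table(table_name).schema
    return df.select([col(f.name).cast(f.dataType) for f in target.fields])

# align_to_table(spark, df, "t").write.mode("append").saveAsTable("t")
{code}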
 

> append data using saveAsTable should adjust the data types
> --
>
> Key: SPARK-23348
> URL: https://issues.apache.org/jira/browse/SPARK-23348
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.3.0
>Reporter: Wenchen Fan
>Priority: Major
>
>  
> {code:java}
> Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t")
> Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t")
> {code}
>  
> This query will fail with a strange error:
>  
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in 
> stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 
> (TID 15, localhost, executor driver): 
> java.lang.UnsupportedOperationException: Unimplemented type: IntegerType
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214)
>  at 
> org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261)
> ...
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23348) append data using saveAsTable should adjust the data types

2018-02-07 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-23348:
---

 Summary: append data using saveAsTable should adjust the data types
 Key: SPARK-23348
 URL: https://issues.apache.org/jira/browse/SPARK-23348
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 2.3.0
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org