[jira] [Updated] (SPARK-23319) Skip PySpark tests for old Pandas and old PyArrow
[ https://issues.apache.org/jira/browse/SPARK-23319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-23319: - Fix Version/s: (was: 2.4.0) 2.3.0

> Skip PySpark tests for old Pandas and old PyArrow
> -
>
> Key: SPARK-23319
> URL: https://issues.apache.org/jira/browse/SPARK-23319
> Project: Spark
> Issue Type: Test
> Components: PySpark
> Affects Versions: 2.3.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Major
> Fix For: 2.3.0
>
> This is related to SPARK-23300. We declare PyArrow >= 0.8.0 and Pandas >= 0.19.2 for now. We should explicitly skip the tests in this case.

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
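[Editor's note] A minimal sketch of the kind of version guard this ticket implies, using only the Python standard library. The module names and minimum versions come from the issue text; the helper and test class are illustrative, not Spark's actual test harness:

{code:python}
import unittest
from distutils.version import LooseVersion

def _have_minimum_version(module_name, minimum):
    """Return True only if the module is importable and new enough."""
    try:
        module = __import__(module_name)
    except ImportError:
        return False
    return LooseVersion(module.__version__) >= LooseVersion(minimum)

_have_pandas = _have_minimum_version("pandas", "0.19.2")
_have_pyarrow = _have_minimum_version("pyarrow", "0.8.0")

@unittest.skipIf(not _have_pandas, "Pandas >= 0.19.2 is required; skipping.")
@unittest.skipIf(not _have_pyarrow, "PyArrow >= 0.8.0 is required; skipping.")
class ArrowTests(unittest.TestCase):
    def test_something_using_arrow(self):
        pass

if __name__ == "__main__":
    unittest.main()
{code}

The point of the ticket is exactly the skip message: an old or missing dependency should surface as an explicit skip, not a hard failure or a silently absent test.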
[jira] [Created] (SPARK-23356) Pushes Project to both sides of Union when expression is non-deterministic
caoxuewen created SPARK-23356:
-

Summary: Pushes Project to both sides of Union when expression is non-deterministic
Key: SPARK-23356
URL: https://issues.apache.org/jira/browse/SPARK-23356
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 2.4.0
Reporter: caoxuewen

Currently, the PushProjectionThroughUnion optimizer rule pushes a Project operator down to both sides of a Union only when all of its expressions are deterministic. In fact, just as with filter pushdown, we can also support pushing a Project down to both sides of a Union when an expression is non-deterministic. This issue fixes that. Now the explain looks like:

=== Applying Rule org.apache.spark.sql.catalyst.optimizer.PushProjectionThroughUnion ===

Input LogicalPlan:
Project [a#0, rand(10) AS rnd#9]
+- Union
   :- LocalRelation <empty>, [a#0, b#1, c#2]
   :- LocalRelation <empty>, [d#3, e#4, f#5]
   +- LocalRelation <empty>, [g#6, h#7, i#8]

Output LogicalPlan:
Project [a#0, rand(10) AS rnd#9]
+- Union
   :- Project [a#0]
   :  +- LocalRelation <empty>, [a#0, b#1, c#2]
   :- Project [d#3]
   :  +- LocalRelation <empty>, [d#3, e#4, f#5]
   +- Project [g#6]
      +- LocalRelation <empty>, [g#6, h#7, i#8]

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
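[Editor's note] A small PySpark snippet that builds the shape of plan discussed above: a Project containing the non-deterministic rand(10) sitting on top of a three-way Union. The empty relations and column names are chosen to match the plan in the issue and are otherwise arbitrary:

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import rand

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Three empty relations with the column names used in the plan above.
df1 = spark.createDataFrame([], "a int, b int, c int")
df2 = spark.createDataFrame([], "d int, e int, f int")
df3 = spark.createDataFrame([], "g int, h int, i int")

# A Project with a non-deterministic expression on top of a Union; before
# this change, PushProjectionThroughUnion left such a Project untouched.
unioned = df1.union(df2).union(df3)
unioned.select("a", rand(10).alias("rnd")).explain(True)
{code}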
[jira] [Commented] (SPARK-20090) Add StructType.fieldNames to Python API
[ https://issues.apache.org/jira/browse/SPARK-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356558#comment-16356558 ] Hyukjin Kwon commented on SPARK-20090: -- Sure, will open up a PR soon. > Add StructType.fieldNames to Python API > --- > > Key: SPARK-20090 > URL: https://issues.apache.org/jira/browse/SPARK-20090 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Affects Versions: 2.1.0 >Reporter: Joseph K. Bradley >Assignee: Hyukjin Kwon >Priority: Trivial > Fix For: 2.3.0 > > > The Scala/Java API for {{StructType}} has a method {{fieldNames}}. It would > be nice if the Python {{StructType}} did as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-20090) Add StructType.fieldNames to Python API
[ https://issues.apache.org/jira/browse/SPARK-20090?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356536#comment-16356536 ] Reynold Xin commented on SPARK-20090: - Do you mind doing it? Thanks. > Add StructType.fieldNames to Python API > --- > > Key: SPARK-20090 > URL: https://issues.apache.org/jira/browse/SPARK-20090 > Project: Spark > Issue Type: New Feature > Components: PySpark, SQL >Affects Versions: 2.1.0 >Reporter: Joseph K. Bradley >Assignee: Hyukjin Kwon >Priority: Trivial > Fix For: 2.3.0 > > > The Scala/Java API for {{StructType}} has a method {{fieldNames}}. It would > be nice if the Python {{StructType}} did as well. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
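[Editor's note] What the resulting API looks like in use, sketched in Python. The {{fieldNames()}} call is the method this ticket adds to the Python {{StructType}}; the comprehension shows the pre-existing workaround:

{code:python}
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType()),
])

# Scala/Java already expose StructType.fieldNames; the Python method added
# by this ticket returns the field names as a plain list.
print(schema.fieldNames())  # ['name', 'age']

# Before the method existed, the same result needed a comprehension:
print([f.name for f in schema.fields])
{code}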
[jira] [Commented] (SPARK-22446) Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" exception incorrectly for filtered data.
[ https://issues.apache.org/jira/browse/SPARK-22446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356525#comment-16356525 ] Liang-Chi Hsieh commented on SPARK-22446: - 2.0 and 2.1 also have this issue.

> Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" exception incorrectly for filtered data.
> ---
>
> Key: SPARK-22446
> URL: https://issues.apache.org/jira/browse/SPARK-22446
> Project: Spark
> Issue Type: Bug
> Components: ML, Optimizer
> Affects Versions: 2.0.2, 2.1.2, 2.2.1
> Environment: spark-shell, local mode, macOS Sierra 10.12.6
> Reporter: Greg Bellchambers
> Assignee: Liang-Chi Hsieh
> Priority: Major
> Fix For: 2.3.0
>
> In the following, the `indexer` UDF defined inside the `org.apache.spark.ml.feature.StringIndexerModel.transform` method throws an "Unseen label" error, despite the label not being present in the transformed DataFrame.
> Here is the definition of the indexer UDF in the transform method:
> {code:java}
> val indexer = udf { label: String =>
>   if (labelToIndex.contains(label)) {
>     labelToIndex(label)
>   } else {
>     throw new SparkException(s"Unseen label: $label.")
>   }
> }
> {code}
> We can demonstrate the error with a very simple example DataFrame.
> {code:java}
> scala> import org.apache.spark.ml.feature.StringIndexer
> import org.apache.spark.ml.feature.StringIndexer
>
> scala> // first we create a DataFrame with three cities
> scala> val df = List(
>      |   ("A", "London", "StrA"),
>      |   ("B", "Bristol", null),
>      |   ("C", "New York", "StrC")
>      | ).toDF("ID", "CITY", "CONTENT")
> df: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 1 more field]
>
> scala> df.show
> +---+--------+-------+
> | ID|    CITY|CONTENT|
> +---+--------+-------+
> |  A|  London|   StrA|
> |  B| Bristol|   null|
> |  C|New York|   StrC|
> +---+--------+-------+
>
> scala> // then we remove the row with null in CONTENT column, which removes Bristol
> scala> val dfNoBristol = df.filter($"CONTENT".isNotNull)
> dfNoBristol: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ID: string, CITY: string ... 1 more field]
>
> scala> dfNoBristol.show
> +---+--------+-------+
> | ID|    CITY|CONTENT|
> +---+--------+-------+
> |  A|  London|   StrA|
> |  C|New York|   StrC|
> +---+--------+-------+
>
> scala> // now create a StringIndexer for the CITY column and fit to dfNoBristol
> scala> val model = {
>      |   new StringIndexer()
>      |     .setInputCol("CITY")
>      |     .setOutputCol("CITYIndexed")
>      |     .fit(dfNoBristol)
>      | }
> model: org.apache.spark.ml.feature.StringIndexerModel = strIdx_f5afa2fb
>
> scala> // the StringIndexerModel has only two labels: "London" and "New York"
> scala> model.labels foreach println
> London
> New York
>
> scala> // transform our DataFrame to add an index column
> scala> val dfWithIndex = model.transform(dfNoBristol)
> dfWithIndex: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 2 more fields]
>
> scala> dfWithIndex.show
> +---+--------+-------+-----------+
> | ID|    CITY|CONTENT|CITYIndexed|
> +---+--------+-------+-----------+
> |  A|  London|   StrA|        0.0|
> |  C|New York|   StrC|        1.0|
> +---+--------+-------+-----------+
> {code}
> The unexpected behaviour comes when we filter `dfWithIndex` for `CITYIndexed` equal to 1.0 and perform an action. The `indexer` UDF in the `transform` method throws an exception reporting the unseen label "Bristol". This is irrational behaviour as far as the user of the API is concerned, because there is no such value as "Bristol" when we show all rows of `dfWithIndex`:
> {code:java}
> scala> dfWithIndex.filter($"CITYIndexed" === 1.0).count
> 17/11/04 00:33:41 ERROR Executor: Exception in task 1.0 in stage 20.0 (TID 40)
> org.apache.spark.SparkException: Failed to execute user defined function($anonfun$5: (string) => double)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
>   at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
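[Editor's note] The symptom is the optimizer evaluating the indexer UDF against rows that an earlier filter was supposed to remove. Until running a release with the fix, one workaround is to let the model skip unseen labels instead of throwing. A hedged PySpark sketch of that workaround; {{df_no_bristol}} stands in for the filtered DataFrame from the reproduction above:

{code:python}
from pyspark.ml.feature import StringIndexer

indexer = StringIndexer(inputCol="CITY", outputCol="CITYIndexed",
                        handleInvalid="skip")  # drop rows with unseen labels
model = indexer.fit(df_no_bristol)

# Even if the optimizer re-orders the filter below the indexer UDF,
# "skip" drops the unseen "Bristol" row instead of throwing.
model.transform(df_no_bristol).filter("CITYIndexed = 1.0").count()
{code}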
[jira] [Comment Edited] (SPARK-23139) Read eventLog file with mixed encodings
[ https://issues.apache.org/jira/browse/SPARK-23139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356507#comment-16356507 ] Jiang Xingbo edited comment on SPARK-23139 at 2/8/18 5:25 AM: -- {quote}EventLog may contain mixed encodings such as custom exception message{quote} Could you please elaborate on how this happened? was (Author: jiangxb1987): ``` EventLog may contain mixed encodings such as custom exception message ``` Could you please elaborate on how this happened?

> Read eventLog file with mixed encodings
> ---
>
> Key: SPARK-23139
> URL: https://issues.apache.org/jira/browse/SPARK-23139
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: DENG FEI
> Priority: Major
>
> EventLog may contain mixed encodings, such as a custom exception message, which causes replay to fail.
> java.nio.charset.MalformedInputException: Input length = 1
>   at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
>   at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
>   at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>   at java.io.InputStreamReader.read(InputStreamReader.java:184)
>   at java.io.BufferedReader.fill(BufferedReader.java:161)
>   at java.io.BufferedReader.readLine(BufferedReader.java:324)
>   at java.io.BufferedReader.readLine(BufferedReader.java:389)
>   at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72)
>   at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:836)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:78)
>   at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
>   at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:694)
>   at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:507)
>   at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$4$$anon$4.run(FsHistoryProvider.scala:399)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23139) Read eventLog file with mixed encodings
[ https://issues.apache.org/jira/browse/SPARK-23139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356507#comment-16356507 ] Jiang Xingbo commented on SPARK-23139: -- ``` EventLog may contain mixed encodings such as custom exception message ``` Could you please elaborate on how this happened?

> Read eventLog file with mixed encodings
> ---
>
> Key: SPARK-23139
> URL: https://issues.apache.org/jira/browse/SPARK-23139
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: DENG FEI
> Priority: Major
>
> EventLog may contain mixed encodings, such as a custom exception message, which causes replay to fail.
> java.nio.charset.MalformedInputException: Input length = 1
>   at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
>   at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
>   at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>   at java.io.InputStreamReader.read(InputStreamReader.java:184)
>   at java.io.BufferedReader.fill(BufferedReader.java:161)
>   at java.io.BufferedReader.readLine(BufferedReader.java:324)
>   at java.io.BufferedReader.readLine(BufferedReader.java:389)
>   at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72)
>   at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:836)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:78)
>   at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
>   at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:694)
>   at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:507)
>   at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$4$$anon$4.run(FsHistoryProvider.scala:399)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22446) Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" exception incorrectly for filtered data.
[ https://issues.apache.org/jira/browse/SPARK-22446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356506#comment-16356506 ] Liang-Chi Hsieh commented on SPARK-22446: - Yes, this is an issue in Spark 2.2. For earlier versions, let me check.

> Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" exception incorrectly for filtered data.
> ---
>
> Key: SPARK-22446
> URL: https://issues.apache.org/jira/browse/SPARK-22446
> Project: Spark
> Issue Type: Bug
> Components: ML, Optimizer
> Affects Versions: 2.0.2, 2.1.2, 2.2.1
> Environment: spark-shell, local mode, macOS Sierra 10.12.6
> Reporter: Greg Bellchambers
> Assignee: Liang-Chi Hsieh
> Priority: Major
> Fix For: 2.3.0
>
> In the following, the `indexer` UDF defined inside the `org.apache.spark.ml.feature.StringIndexerModel.transform` method throws an "Unseen label" error, despite the label not being present in the transformed DataFrame.
> Here is the definition of the indexer UDF in the transform method:
> {code:java}
> val indexer = udf { label: String =>
>   if (labelToIndex.contains(label)) {
>     labelToIndex(label)
>   } else {
>     throw new SparkException(s"Unseen label: $label.")
>   }
> }
> {code}
> We can demonstrate the error with a very simple example DataFrame.
> {code:java}
> scala> import org.apache.spark.ml.feature.StringIndexer
> import org.apache.spark.ml.feature.StringIndexer
>
> scala> // first we create a DataFrame with three cities
> scala> val df = List(
>      |   ("A", "London", "StrA"),
>      |   ("B", "Bristol", null),
>      |   ("C", "New York", "StrC")
>      | ).toDF("ID", "CITY", "CONTENT")
> df: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 1 more field]
>
> scala> df.show
> +---+--------+-------+
> | ID|    CITY|CONTENT|
> +---+--------+-------+
> |  A|  London|   StrA|
> |  B| Bristol|   null|
> |  C|New York|   StrC|
> +---+--------+-------+
>
> scala> // then we remove the row with null in CONTENT column, which removes Bristol
> scala> val dfNoBristol = df.filter($"CONTENT".isNotNull)
> dfNoBristol: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ID: string, CITY: string ... 1 more field]
>
> scala> dfNoBristol.show
> +---+--------+-------+
> | ID|    CITY|CONTENT|
> +---+--------+-------+
> |  A|  London|   StrA|
> |  C|New York|   StrC|
> +---+--------+-------+
>
> scala> // now create a StringIndexer for the CITY column and fit to dfNoBristol
> scala> val model = {
>      |   new StringIndexer()
>      |     .setInputCol("CITY")
>      |     .setOutputCol("CITYIndexed")
>      |     .fit(dfNoBristol)
>      | }
> model: org.apache.spark.ml.feature.StringIndexerModel = strIdx_f5afa2fb
>
> scala> // the StringIndexerModel has only two labels: "London" and "New York"
> scala> model.labels foreach println
> London
> New York
>
> scala> // transform our DataFrame to add an index column
> scala> val dfWithIndex = model.transform(dfNoBristol)
> dfWithIndex: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 2 more fields]
>
> scala> dfWithIndex.show
> +---+--------+-------+-----------+
> | ID|    CITY|CONTENT|CITYIndexed|
> +---+--------+-------+-----------+
> |  A|  London|   StrA|        0.0|
> |  C|New York|   StrC|        1.0|
> +---+--------+-------+-----------+
> {code}
> The unexpected behaviour comes when we filter `dfWithIndex` for `CITYIndexed` equal to 1.0 and perform an action. The `indexer` UDF in the `transform` method throws an exception reporting the unseen label "Bristol". This is irrational behaviour as far as the user of the API is concerned, because there is no such value as "Bristol" when we show all rows of `dfWithIndex`:
> {code:java}
> scala> dfWithIndex.filter($"CITYIndexed" === 1.0).count
> 17/11/04 00:33:41 ERROR Executor: Exception in task 1.0 in stage 20.0 (TID 40)
> org.apache.spark.SparkException: Failed to execute user defined function($anonfun$5: (string) => double)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
[jira] [Commented] (SPARK-22700) Bucketizer.transform incorrectly drops row containing NaN
[ https://issues.apache.org/jira/browse/SPARK-22700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356441#comment-16356441 ] Apache Spark commented on SPARK-22700: -- User 'zhengruifeng' has created a pull request for this issue: https://github.com/apache/spark/pull/20539

> Bucketizer.transform incorrectly drops row containing NaN
> -
>
> Key: SPARK-22700
> URL: https://issues.apache.org/jira/browse/SPARK-22700
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.2.0, 2.3.0
> Reporter: zhengruifeng
> Assignee: zhengruifeng
> Priority: Major
> Fix For: 2.3.0
>
> {code}
> import org.apache.spark.ml.feature._
> val df = spark.createDataFrame(Seq((2.3, 3.0), (Double.NaN, 3.0), (6.7, Double.NaN))).toDF("a", "b")
> val splits = Array(Double.NegativeInfinity, 3.0, Double.PositiveInfinity)
> val bucketizer: Bucketizer = new Bucketizer().setInputCol("a").setOutputCol("aa").setSplits(splits)
> bucketizer.setHandleInvalid("skip")
>
> scala> df.show
> +---+---+
> |  a|  b|
> +---+---+
> |2.3|3.0|
> |NaN|3.0|
> |6.7|NaN|
> +---+---+
>
> scala> bucketizer.transform(df).show
> +---+---+---+
> |  a|  b| aa|
> +---+---+---+
> |2.3|3.0|0.0|
> +---+---+---+
> {code}
> When {{handleInvalid}} is set to {{skip}}, the last row of the input is incorrectly dropped, though column 'b' is not an input column.

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
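[Editor's note] The intended "skip" semantics, sketched in PySpark: only a NaN in the *input* column should drop a row, which can be emulated by pre-filtering just that column. Column names follow the reproduction above; `spark` is an existing SparkSession:

{code:python}
from pyspark.ml.feature import Bucketizer

df = spark.createDataFrame(
    [(2.3, 3.0), (float("nan"), 3.0), (6.7, float("nan"))], ["a", "b"])
splits = [float("-inf"), 3.0, float("inf")]
bucketizer = Bucketizer(splits=splits, inputCol="a", outputCol="aa",
                        handleInvalid="skip")

# Emulate the intended behaviour: drop rows whose *input* column is NaN,
# keeping the (6.7, NaN) row since 'b' is not an input column.
expected = bucketizer.transform(df.na.drop(subset=["a"]))
expected.show()  # should contain the rows (2.3, 3.0, 0.0) and (6.7, NaN, 1.0)
{code}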
[jira] [Commented] (SPARK-23139) Read eventLog file with mixed encodings
[ https://issues.apache.org/jira/browse/SPARK-23139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356440#comment-16356440 ] Marcelo Vanzin commented on SPARK-23139: bq. ASCII is enough for the Spark event log. No it's not. bq. And if we force writing with UTF-8, we should force reading with UTF-8 too. I'm not saying *if*, I'm saying that that's the expectation, and if that's not happening, it's a bug that needs to be fixed.

> Read eventLog file with mixed encodings
> ---
>
> Key: SPARK-23139
> URL: https://issues.apache.org/jira/browse/SPARK-23139
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: DENG FEI
> Priority: Major
>
> EventLog may contain mixed encodings, such as a custom exception message, which causes replay to fail.
> java.nio.charset.MalformedInputException: Input length = 1
>   at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
>   at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
>   at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>   at java.io.InputStreamReader.read(InputStreamReader.java:184)
>   at java.io.BufferedReader.fill(BufferedReader.java:161)
>   at java.io.BufferedReader.readLine(BufferedReader.java:324)
>   at java.io.BufferedReader.readLine(BufferedReader.java:389)
>   at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72)
>   at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:836)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:78)
>   at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
>   at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:694)
>   at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:507)
>   at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$4$$anon$4.run(FsHistoryProvider.scala:399)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22700) Bucketizer.transform incorrectly drops row containing NaN
[ https://issues.apache.org/jira/browse/SPARK-22700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356431#comment-16356431 ] zhengruifeng commented on SPARK-22700: -- [~WeichenXu123] I have checked the others, and they seem OK.

> Bucketizer.transform incorrectly drops row containing NaN
> -
>
> Key: SPARK-22700
> URL: https://issues.apache.org/jira/browse/SPARK-22700
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.2.0, 2.3.0
> Reporter: zhengruifeng
> Assignee: zhengruifeng
> Priority: Major
> Fix For: 2.3.0
>
> {code}
> import org.apache.spark.ml.feature._
> val df = spark.createDataFrame(Seq((2.3, 3.0), (Double.NaN, 3.0), (6.7, Double.NaN))).toDF("a", "b")
> val splits = Array(Double.NegativeInfinity, 3.0, Double.PositiveInfinity)
> val bucketizer: Bucketizer = new Bucketizer().setInputCol("a").setOutputCol("aa").setSplits(splits)
> bucketizer.setHandleInvalid("skip")
>
> scala> df.show
> +---+---+
> |  a|  b|
> +---+---+
> |2.3|3.0|
> |NaN|3.0|
> |6.7|NaN|
> +---+---+
>
> scala> bucketizer.transform(df).show
> +---+---+---+
> |  a|  b| aa|
> +---+---+---+
> |2.3|3.0|0.0|
> +---+---+---+
> {code}
> When {{handleInvalid}} is set to {{skip}}, the last row of the input is incorrectly dropped, though column 'b' is not an input column.

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23139) Read eventLog file with mixed encodings
[ https://issues.apache.org/jira/browse/SPARK-23139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356428#comment-16356428 ] DENG FEI commented on SPARK-23139: -- _ASCII_ is enough for the Spark event log. And if we force writing with UTF-8, we should force reading with UTF-8 too.

> Read eventLog file with mixed encodings
> ---
>
> Key: SPARK-23139
> URL: https://issues.apache.org/jira/browse/SPARK-23139
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: DENG FEI
> Priority: Major
>
> EventLog may contain mixed encodings, such as a custom exception message, which causes replay to fail.
> java.nio.charset.MalformedInputException: Input length = 1
>   at java.nio.charset.CoderResult.throwException(CoderResult.java:281)
>   at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339)
>   at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
>   at java.io.InputStreamReader.read(InputStreamReader.java:184)
>   at java.io.BufferedReader.fill(BufferedReader.java:161)
>   at java.io.BufferedReader.readLine(BufferedReader.java:324)
>   at java.io.BufferedReader.readLine(BufferedReader.java:389)
>   at scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72)
>   at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:836)
>   at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461)
>   at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:78)
>   at org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58)
>   at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:694)
>   at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:507)
>   at org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$4$$anon$4.run(FsHistoryProvider.scala:399)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>   at java.lang.Thread.run(Thread.java:745)

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
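[Editor's note] Whichever direction the fix takes, the failure mode itself is easy to sidestep when reading: decode as UTF-8 but replace malformed bytes instead of throwing on them. A sketch in Python, not the Spark code path itself; the event-log path is made up:

{code:python}
import json

def replay_lines(path):
    """Yield parsed event-log lines, tolerating stray non-UTF-8 bytes.

    errors='replace' substitutes U+FFFD for malformed sequences instead of
    raising, the moral equivalent of Java's CodingErrorAction.REPLACE.
    """
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

for event in replay_lines("/tmp/spark-events/app-20180208052500-0000"):
    print(event.get("Event"))
{code}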
[jira] [Commented] (SPARK-23319) Skip PySpark tests for old Pandas and old PyArrow
[ https://issues.apache.org/jira/browse/SPARK-23319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356359#comment-16356359 ] Apache Spark commented on SPARK-23319: -- User 'ueshin' has created a pull request for this issue: https://github.com/apache/spark/pull/20538

> Skip PySpark tests for old Pandas and old PyArrow
> -
>
> Key: SPARK-23319
> URL: https://issues.apache.org/jira/browse/SPARK-23319
> Project: Spark
> Issue Type: Test
> Components: PySpark
> Affects Versions: 2.3.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Major
> Fix For: 2.4.0
>
> This is related to SPARK-23300. We declare PyArrow >= 0.8.0 and Pandas >= 0.19.2 for now. We should explicitly skip the tests in this case.

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22700) Bucketizer.transform incorrectly drops row containing NaN
[ https://issues.apache.org/jira/browse/SPARK-22700?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356354#comment-16356354 ] Weichen Xu commented on SPARK-22700: [~podongfeng] Have you checked the other transformers with a handleInvalid option for similar issues?

> Bucketizer.transform incorrectly drops row containing NaN
> -
>
> Key: SPARK-22700
> URL: https://issues.apache.org/jira/browse/SPARK-22700
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.2.0, 2.3.0
> Reporter: zhengruifeng
> Assignee: zhengruifeng
> Priority: Major
> Fix For: 2.3.0
>
> {code}
> import org.apache.spark.ml.feature._
> val df = spark.createDataFrame(Seq((2.3, 3.0), (Double.NaN, 3.0), (6.7, Double.NaN))).toDF("a", "b")
> val splits = Array(Double.NegativeInfinity, 3.0, Double.PositiveInfinity)
> val bucketizer: Bucketizer = new Bucketizer().setInputCol("a").setOutputCol("aa").setSplits(splits)
> bucketizer.setHandleInvalid("skip")
>
> scala> df.show
> +---+---+
> |  a|  b|
> +---+---+
> |2.3|3.0|
> |NaN|3.0|
> |6.7|NaN|
> +---+---+
>
> scala> bucketizer.transform(df).show
> +---+---+---+
> |  a|  b| aa|
> +---+---+---+
> |2.3|3.0|0.0|
> +---+---+---+
> {code}
> When {{handleInvalid}} is set to {{skip}}, the last row of the input is incorrectly dropped, though column 'b' is not an input column.

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-22289) Cannot save LogisticRegressionModel with bounds on coefficients
[ https://issues.apache.org/jira/browse/SPARK-22289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-22289: -- Summary: Cannot save LogisticRegressionModel with bounds on coefficients (was: Cannot save LogisticRegressionClassificationModel with bounds on coefficients)

> Cannot save LogisticRegressionModel with bounds on coefficients
> ---
>
> Key: SPARK-22289
> URL: https://issues.apache.org/jira/browse/SPARK-22289
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 2.2.0
> Reporter: Nic Eggert
> Assignee: yuhao yang
> Priority: Major
> Fix For: 2.2.2, 2.3.0
>
> I think this was introduced in SPARK-20047.
> Trying to call save on a logistic regression model trained with bounds on its parameters throws an error. This seems to be because Spark doesn't know how to serialize the Matrix parameter.
> The model is set up like this:
> {code}
> val calibrator = new LogisticRegression()
>   .setFeaturesCol("uncalibrated_probability")
>   .setLabelCol("label")
>   .setWeightCol("weight")
>   .setStandardization(false)
>   .setLowerBoundsOnCoefficients(new DenseMatrix(1, 1, Array(0.0)))
>   .setFamily("binomial")
>   .setProbabilityCol("probability")
>   .setPredictionCol("logistic_prediction")
>   .setRawPredictionCol("logistic_raw_prediction")
> {code}
> {code}
> 17/10/16 15:36:59 ERROR ApplicationMaster: User class threw exception: scala.NotImplementedError: The default jsonEncode only supports string and vector. org.apache.spark.ml.param.Param must override jsonEncode for org.apache.spark.ml.linalg.DenseMatrix.
> scala.NotImplementedError: The default jsonEncode only supports string and vector. org.apache.spark.ml.param.Param must override jsonEncode for org.apache.spark.ml.linalg.DenseMatrix.
>   at org.apache.spark.ml.param.Param.jsonEncode(params.scala:98)
>   at org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:296)
>   at org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1$$anonfun$2.apply(ReadWrite.scala:295)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
>   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
>   at scala.collection.AbstractTraversable.map(Traversable.scala:104)
>   at org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at org.apache.spark.ml.util.DefaultParamsWriter$$anonfun$1.apply(ReadWrite.scala:295)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.ml.util.DefaultParamsWriter$.getMetadataToSave(ReadWrite.scala:295)
>   at org.apache.spark.ml.util.DefaultParamsWriter$.saveMetadata(ReadWrite.scala:277)
>   at org.apache.spark.ml.classification.LogisticRegressionModel$LogisticRegressionModelWriter.saveImpl(LogisticRegression.scala:1182)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   at org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:254)
>   at org.apache.spark.ml.Pipeline$SharedReadWrite$$anonfun$saveImpl$1.apply(Pipeline.scala:253)
>   at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at org.apache.spark.ml.Pipeline$SharedReadWrite$.saveImpl(Pipeline.scala:253)
>   at org.apache.spark.ml.PipelineModel$PipelineModelWriter.saveImpl(Pipeline.scala:337)
>   at org.apache.spark.ml.util.MLWriter.save(ReadWrite.scala:114)
>   -snip-
> {code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
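[Editor's note] The failure is that the default {{Param.jsonEncode}} only knows strings and vectors, so a matrix-valued param needs its own encoding. Below is one plausible JSON shape for a dense matrix, sketched in Python for illustration; the field names mirror {{org.apache.spark.ml.linalg.DenseMatrix}}, but this is an assumption, not the encoding Spark ultimately adopted:

{code:python}
import json

def json_encode_dense_matrix(num_rows, num_cols, values, is_transposed=False):
    # One plausible flat encoding: dimensions plus column-major values.
    return json.dumps({
        "type": 1,  # hypothetical tag distinguishing dense from sparse
        "numRows": num_rows,
        "numCols": num_cols,
        "values": list(values),
        "isTransposed": is_transposed,
    })

# The lower bound from the reproduction: a 1x1 matrix holding 0.0.
print(json_encode_dense_matrix(1, 1, [0.0]))
{code}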
[jira] [Commented] (SPARK-22446) Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" exception incorrectly for filtered data.
[ https://issues.apache.org/jira/browse/SPARK-22446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356340#comment-16356340 ] Joseph K. Bradley commented on SPARK-22446: --- [~viirya] Did you confirm this is an issue in Spark 2.2 or earlier?

> Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" exception incorrectly for filtered data.
> ---
>
> Key: SPARK-22446
> URL: https://issues.apache.org/jira/browse/SPARK-22446
> Project: Spark
> Issue Type: Bug
> Components: ML, Optimizer
> Affects Versions: 2.0.2, 2.1.2, 2.2.1
> Environment: spark-shell, local mode, macOS Sierra 10.12.6
> Reporter: Greg Bellchambers
> Assignee: Liang-Chi Hsieh
> Priority: Major
> Fix For: 2.3.0
>
> In the following, the `indexer` UDF defined inside the `org.apache.spark.ml.feature.StringIndexerModel.transform` method throws an "Unseen label" error, despite the label not being present in the transformed DataFrame.
> Here is the definition of the indexer UDF in the transform method:
> {code:java}
> val indexer = udf { label: String =>
>   if (labelToIndex.contains(label)) {
>     labelToIndex(label)
>   } else {
>     throw new SparkException(s"Unseen label: $label.")
>   }
> }
> {code}
> We can demonstrate the error with a very simple example DataFrame.
> {code:java}
> scala> import org.apache.spark.ml.feature.StringIndexer
> import org.apache.spark.ml.feature.StringIndexer
>
> scala> // first we create a DataFrame with three cities
> scala> val df = List(
>      |   ("A", "London", "StrA"),
>      |   ("B", "Bristol", null),
>      |   ("C", "New York", "StrC")
>      | ).toDF("ID", "CITY", "CONTENT")
> df: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 1 more field]
>
> scala> df.show
> +---+--------+-------+
> | ID|    CITY|CONTENT|
> +---+--------+-------+
> |  A|  London|   StrA|
> |  B| Bristol|   null|
> |  C|New York|   StrC|
> +---+--------+-------+
>
> scala> // then we remove the row with null in CONTENT column, which removes Bristol
> scala> val dfNoBristol = df.filter($"CONTENT".isNotNull)
> dfNoBristol: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ID: string, CITY: string ... 1 more field]
>
> scala> dfNoBristol.show
> +---+--------+-------+
> | ID|    CITY|CONTENT|
> +---+--------+-------+
> |  A|  London|   StrA|
> |  C|New York|   StrC|
> +---+--------+-------+
>
> scala> // now create a StringIndexer for the CITY column and fit to dfNoBristol
> scala> val model = {
>      |   new StringIndexer()
>      |     .setInputCol("CITY")
>      |     .setOutputCol("CITYIndexed")
>      |     .fit(dfNoBristol)
>      | }
> model: org.apache.spark.ml.feature.StringIndexerModel = strIdx_f5afa2fb
>
> scala> // the StringIndexerModel has only two labels: "London" and "New York"
> scala> model.labels foreach println
> London
> New York
>
> scala> // transform our DataFrame to add an index column
> scala> val dfWithIndex = model.transform(dfNoBristol)
> dfWithIndex: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 2 more fields]
>
> scala> dfWithIndex.show
> +---+--------+-------+-----------+
> | ID|    CITY|CONTENT|CITYIndexed|
> +---+--------+-------+-----------+
> |  A|  London|   StrA|        0.0|
> |  C|New York|   StrC|        1.0|
> +---+--------+-------+-----------+
> {code}
> The unexpected behaviour comes when we filter `dfWithIndex` for `CITYIndexed` equal to 1.0 and perform an action. The `indexer` UDF in the `transform` method throws an exception reporting the unseen label "Bristol". This is irrational behaviour as far as the user of the API is concerned, because there is no such value as "Bristol" when we show all rows of `dfWithIndex`:
> {code:java}
> scala> dfWithIndex.filter($"CITYIndexed" === 1.0).count
> 17/11/04 00:33:41 ERROR Executor: Exception in task 1.0 in stage 20.0 (TID 40)
> org.apache.spark.SparkException: Failed to execute user defined function($anonfun$5: (string) => double)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
[jira] [Updated] (SPARK-22446) Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" exception incorrectly for filtered data.
[ https://issues.apache.org/jira/browse/SPARK-22446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-22446: -- Affects Version/s: (was: 2.2.0) (was: 2.0.0) 2.0.2 2.1.2 2.2.1

> Optimizer causing StringIndexerModel's indexer UDF to throw "Unseen label" exception incorrectly for filtered data.
> ---
>
> Key: SPARK-22446
> URL: https://issues.apache.org/jira/browse/SPARK-22446
> Project: Spark
> Issue Type: Bug
> Components: ML, Optimizer
> Affects Versions: 2.0.2, 2.1.2, 2.2.1
> Environment: spark-shell, local mode, macOS Sierra 10.12.6
> Reporter: Greg Bellchambers
> Assignee: Liang-Chi Hsieh
> Priority: Major
> Fix For: 2.3.0
>
> In the following, the `indexer` UDF defined inside the `org.apache.spark.ml.feature.StringIndexerModel.transform` method throws an "Unseen label" error, despite the label not being present in the transformed DataFrame.
> Here is the definition of the indexer UDF in the transform method:
> {code:java}
> val indexer = udf { label: String =>
>   if (labelToIndex.contains(label)) {
>     labelToIndex(label)
>   } else {
>     throw new SparkException(s"Unseen label: $label.")
>   }
> }
> {code}
> We can demonstrate the error with a very simple example DataFrame.
> {code:java}
> scala> import org.apache.spark.ml.feature.StringIndexer
> import org.apache.spark.ml.feature.StringIndexer
>
> scala> // first we create a DataFrame with three cities
> scala> val df = List(
>      |   ("A", "London", "StrA"),
>      |   ("B", "Bristol", null),
>      |   ("C", "New York", "StrC")
>      | ).toDF("ID", "CITY", "CONTENT")
> df: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 1 more field]
>
> scala> df.show
> +---+--------+-------+
> | ID|    CITY|CONTENT|
> +---+--------+-------+
> |  A|  London|   StrA|
> |  B| Bristol|   null|
> |  C|New York|   StrC|
> +---+--------+-------+
>
> scala> // then we remove the row with null in CONTENT column, which removes Bristol
> scala> val dfNoBristol = df.filter($"CONTENT".isNotNull)
> dfNoBristol: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [ID: string, CITY: string ... 1 more field]
>
> scala> dfNoBristol.show
> +---+--------+-------+
> | ID|    CITY|CONTENT|
> +---+--------+-------+
> |  A|  London|   StrA|
> |  C|New York|   StrC|
> +---+--------+-------+
>
> scala> // now create a StringIndexer for the CITY column and fit to dfNoBristol
> scala> val model = {
>      |   new StringIndexer()
>      |     .setInputCol("CITY")
>      |     .setOutputCol("CITYIndexed")
>      |     .fit(dfNoBristol)
>      | }
> model: org.apache.spark.ml.feature.StringIndexerModel = strIdx_f5afa2fb
>
> scala> // the StringIndexerModel has only two labels: "London" and "New York"
> scala> model.labels foreach println
> London
> New York
>
> scala> // transform our DataFrame to add an index column
> scala> val dfWithIndex = model.transform(dfNoBristol)
> dfWithIndex: org.apache.spark.sql.DataFrame = [ID: string, CITY: string ... 2 more fields]
>
> scala> dfWithIndex.show
> +---+--------+-------+-----------+
> | ID|    CITY|CONTENT|CITYIndexed|
> +---+--------+-------+-----------+
> |  A|  London|   StrA|        0.0|
> |  C|New York|   StrC|        1.0|
> +---+--------+-------+-----------+
> {code}
> The unexpected behaviour comes when we filter `dfWithIndex` for `CITYIndexed` equal to 1.0 and perform an action. The `indexer` UDF in the `transform` method throws an exception reporting the unseen label "Bristol". This is irrational behaviour as far as the user of the API is concerned, because there is no such value as "Bristol" when we show all rows of `dfWithIndex`:
> {code:java}
> scala> dfWithIndex.filter($"CITYIndexed" === 1.0).count
> 17/11/04 00:33:41 ERROR Executor: Exception in task 1.0 in stage 20.0 (TID 40)
> org.apache.spark.SparkException: Failed to execute user defined function($anonfun$5: (string) => double)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithoutKey$(Unknown Source)
>   at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
>   at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
>   at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
>   at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
>   at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
[jira] [Commented] (SPARK-23105) Spark MLlib, GraphX 2.3 QA umbrella
[ https://issues.apache.org/jira/browse/SPARK-23105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356303#comment-16356303 ] Joseph K. Bradley commented on SPARK-23105: --- Sorry for being AWOL; I unfortunately had to be away from work for a couple of weeks. Thanks so much everyone for carrying through with QA!

> Spark MLlib, GraphX 2.3 QA umbrella
> ---
>
> Key: SPARK-23105
> URL: https://issues.apache.org/jira/browse/SPARK-23105
> Project: Spark
> Issue Type: Umbrella
> Components: Documentation, GraphX, ML, MLlib
> Reporter: Joseph K. Bradley
> Assignee: Joseph K. Bradley
> Priority: Critical
> Fix For: 2.3.0
>
> This JIRA lists tasks for the next Spark release's QA period for MLlib and GraphX. *SparkR is separate: SPARK-23114.*
> The list below gives an overview of what is involved, and the corresponding JIRA issues are linked below that.
> h2. API
> * Check binary API compatibility for Scala/Java
> * Audit new public APIs (from the generated html doc)
> ** Scala
> ** Java compatibility
> ** Python coverage
> * Check Experimental, DeveloperApi tags
> h2. Algorithms and performance
> * Performance tests
> h2. Documentation and example code
> * For new algorithms, create JIRAs for updating the user guide sections & examples
> * Update Programming Guide
> * Update website

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23308) ignoreCorruptFiles should not ignore retryable IOException
[ https://issues.apache.org/jira/browse/SPARK-23308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356271#comment-16356271 ] Márcio Furlani Carmona commented on SPARK-23308:

That's true, Steve. I totally agree that it'd be hard to identify what is retryable from Spark's perspective. That being said, I agree that the FS should be responsible for that decision.

I believe one option would be leaving the retry responsibility to the FS implementations (as it already seems to be) and adding documentation for this flag making clear that you might experience some data loss. Another option would be creating a special exception (CorruptedFileException?) that could be thrown by FS implementations, letting them decide what counts as a corrupted file versus just a transient error.

A more complex approach would be having a file blacklist mechanism rather than this flag, similar to the `spark.blacklist.*` feature, where you can decide how many times to retry a file before considering it corrupted; then you won't need to decide what is worth retrying or not. The downside is that you'll always retry, even when there's no point in retrying.

*Regarding the socket timeouts:* I also believe it's some kind of throttling. I'm reading over 80k files with over 10TB of data. The file sizes are not uniform, so some files may be read in a single request, while others might get split into multiple partitions. Considering I'm not overriding the default `spark.files.maxPartitionBytes` value of `128 MB`, I should get at least ~82k partitions, and thus the same number of S3 requests. Also, the files share some common prefixes, which [might be bad for S3 index access|https://docs.aws.amazon.com/AmazonS3/latest/dev/request-rate-perf-considerations.html]. I have a total of 120 executors, with 9 cores each, across 30 workers. But since each task involves a good amount of computation, the median task execution time is 8s, so I don't think we're getting close to the 100 TPS for a common prefix mentioned in the S3 documentation above.

So, I'm more inclined to say it might be some EC2 network throttling rather than S3 throttling. Another reason to believe that is that I've seen some [503 Slow Down|https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html] errors from S3 in the past when my requests got throttled, which I'm not seeing this time.

> ignoreCorruptFiles should not ignore retryable IOException
> --
>
> Key: SPARK-23308
> URL: https://issues.apache.org/jira/browse/SPARK-23308
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.1
> Reporter: Márcio Furlani Carmona
> Priority: Minor
>
> When `spark.sql.files.ignoreCorruptFiles` is set, it totally ignores any kind of RuntimeException or IOException, but some possible IOExceptions may happen even if the file is not corrupted.
> One example is the SocketTimeoutException, which can be retried to possibly fetch the data without meaning the data is corrupted.
>
> See: https://github.com/apache/spark/blob/e30e2698a2193f0bbdcd4edb884710819ab6397c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala#L163

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
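[Editor's note] One way to read the first suggestion above: retry presumed-transient errors a bounded number of times, and only treat a read as corrupt once a diagnosis is made. A language-agnostic sketch in Python; which exceptions are retryable is an assumption here, since deciding that is exactly the open question in the thread:

{code:python}
import socket
import time

RETRYABLE = (socket.timeout, ConnectionResetError)  # assumed transient errors

def read_with_retry(read_fn, path, attempts=3, backoff_s=1.0):
    """Retry presumed-transient I/O errors; classify the rest."""
    for attempt in range(1, attempts + 1):
        try:
            return read_fn(path)
        except RETRYABLE:
            if attempt == attempts:
                raise  # exhausted retries: surface it, don't silently skip
            time.sleep(backoff_s * attempt)
        except (IOError, RuntimeError):
            # Under ignoreCorruptFiles semantics, only this branch should be
            # treated as "corrupt" and skipped.
            return None
{code}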
[jira] [Updated] (SPARK-23300) Print out if Pandas and PyArrow are installed or not in tests
[ https://issues.apache.org/jira/browse/SPARK-23300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-23300: - Fix Version/s: (was: 2.4.0) 2.3.0

> Print out if Pandas and PyArrow are installed or not in tests
> -
>
> Key: SPARK-23300
> URL: https://issues.apache.org/jira/browse/SPARK-23300
> Project: Spark
> Issue Type: Test
> Components: PySpark, Tests
> Affects Versions: 2.3.0
> Reporter: Hyukjin Kwon
> Assignee: Hyukjin Kwon
> Priority: Major
> Fix For: 2.3.0
>
> This is related to SPARK-23292. Some tests with, for example, PyArrow and Pandas are currently not running in some Python versions in Jenkins.
> At the least, we should explicitly print out in the console whether we are skipping the tests or not.

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
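[Editor's note] The kind of startup report the ticket asks for, sketched with only the standard library; the function name and output format are illustrative:

{code:python}
import importlib

def report_optional_test_deps(modules=("pandas", "pyarrow")):
    """Print, per module, whether dependent tests will run or be skipped."""
    for name in modules:
        try:
            mod = importlib.import_module(name)
            print("%s %s found; related tests will run." % (name, mod.__version__))
        except ImportError:
            print("%s not found; skipping related tests." % name)

report_optional_test_deps()
{code}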
[jira] [Assigned] (SPARK-23092) Migrate MemoryStream to DataSource V2
[ https://issues.apache.org/jira/browse/SPARK-23092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das reassigned SPARK-23092: - Assignee: Tathagata Das > Migrate MemoryStream to DataSource V2 > - > > Key: SPARK-23092 > URL: https://issues.apache.org/jira/browse/SPARK-23092 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Burak Yavuz >Assignee: Tathagata Das >Priority: Major > Fix For: 3.0.0 > > > We should migrate the MemoryStream for Structured Streaming to DataSourceV2 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23092) Migrate MemoryStream to DataSource V2
[ https://issues.apache.org/jira/browse/SPARK-23092?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tathagata Das resolved SPARK-23092. --- Resolution: Fixed Fix Version/s: 3.0.0 Issue resolved by pull request 20445 [https://github.com/apache/spark/pull/20445] > Migrate MemoryStream to DataSource V2 > - > > Key: SPARK-23092 > URL: https://issues.apache.org/jira/browse/SPARK-23092 > Project: Spark > Issue Type: Task > Components: Structured Streaming >Affects Versions: 2.3.0 >Reporter: Burak Yavuz >Priority: Major > Fix For: 3.0.0 > > > We should migrate the MemoryStream for Structured Streaming to DataSourceV2 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error
[ https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356189#comment-16356189 ] Apache Spark commented on SPARK-23314: -- User 'icexelloss' has created a pull request for this issue: https://github.com/apache/spark/pull/20537 > Pandas grouped udf on dataset with timestamp column error > -- > > Key: SPARK-23314 > URL: https://issues.apache.org/jira/browse/SPARK-23314 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Felix Cheung >Priority: Major > > Under SPARK-22216 > When testing pandas_udf on group bys, I saw this error with the timestamp > column. > File "pandas/_libs/tslib.pyx", line 3593, in > pandas._libs.tslib.tz_localize_to_utc > AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 > 01:29:30'), try using the 'ambiguous' argument > For details, see Comment box. I'm able to reproduce this on the latest > branch-2.3 (last change from Feb 1 UTC) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
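[Editor's note] The error comes from pandas' DST handling during tz_localize, so it reproduces (and can be silenced) directly in pandas, outside Spark. A small sketch; the timestamp is the one from the issue, and `US/Pacific` is an assumed session time zone:

{code:python}
import pandas as pd

# 2015-11-01 01:29:30 occurs twice in US/Pacific (DST fall-back), so a bare
# localize cannot infer which instant is meant and raises AmbiguousTimeError.
ts = pd.Series([pd.Timestamp("2015-11-01 01:29:30")])

# Passing `ambiguous` resolves it; here NaT marks unresolvable times.
localized = ts.dt.tz_localize("US/Pacific", ambiguous="NaT")
print(localized)
{code}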
[jira] [Assigned] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error
[ https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23314: Assignee: Apache Spark > Pandas grouped udf on dataset with timestamp column error > -- > > Key: SPARK-23314 > URL: https://issues.apache.org/jira/browse/SPARK-23314 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Felix Cheung >Assignee: Apache Spark >Priority: Major > > Under SPARK-22216 > When testing pandas_udf on group bys, I saw this error with the timestamp > column. > File "pandas/_libs/tslib.pyx", line 3593, in > pandas._libs.tslib.tz_localize_to_utc > AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 > 01:29:30'), try using the 'ambiguous' argument > For details, see Comment box. I'm able to reproduce this on the latest > branch-2.3 (last change from Feb 1 UTC) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23314) Pandas grouped udf on dataset with timestamp column error
[ https://issues.apache.org/jira/browse/SPARK-23314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23314: Assignee: (was: Apache Spark) > Pandas grouped udf on dataset with timestamp column error > -- > > Key: SPARK-23314 > URL: https://issues.apache.org/jira/browse/SPARK-23314 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Felix Cheung >Priority: Major > > Under SPARK-22216 > When testing pandas_udf on group bys, I saw this error with the timestamp > column. > File "pandas/_libs/tslib.pyx", line 3593, in > pandas._libs.tslib.tz_localize_to_utc > AmbiguousTimeError: Cannot infer dst time from Timestamp('2015-11-01 > 01:29:30'), try using the 'ambiguous' argument > For details, see Comment box. I'm able to reproduce this on the latest > branch-2.3 (last change from Feb 1 UTC) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23348) append data using saveAsTable should adjust the data types
[ https://issues.apache.org/jira/browse/SPARK-23348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356176#comment-16356176 ] Sameer Agarwal commented on SPARK-23348: yes, +1

> append data using saveAsTable should adjust the data types
> --
>
> Key: SPARK-23348
> URL: https://issues.apache.org/jira/browse/SPARK-23348
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
> Reporter: Wenchen Fan
> Priority: Major
>
> {code:java}
> Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t")
> Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t")
> scala> sql("select * from t").show
> {code}
> This query will fail with a strange error:
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 (TID 15, localhost, executor driver): java.lang.UnsupportedOperationException: Unimplemented type: IntegerType
>   at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473)
>   at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214)
>   at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261)
> ...
> {code}
> All Spark 2.x releases behave the same. For Spark 1.6.3:
> {code}
> scala> sql("select * from tx").show
> +----+---+
> |   i|  j|
> +----+---+
> |null|  3|
> |   1|  a|
> +----+---+
> {code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23348) append data using saveAsTable should adjust the data types
[ https://issues.apache.org/jira/browse/SPARK-23348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356174#comment-16356174 ] Dongjoon Hyun commented on SPARK-23348: --- [~cloud_fan], [~smilegator], [~sameerag]. Although this is not a regression at Spark 2.3, can we have this in Apache Spark 2.3?

> append data using saveAsTable should adjust the data types
> --
>
> Key: SPARK-23348
> URL: https://issues.apache.org/jira/browse/SPARK-23348
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0
> Reporter: Wenchen Fan
> Priority: Major
>
> {code:java}
> Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t")
> Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t")
> scala> sql("select * from t").show
> {code}
> This query will fail with a strange error:
> {code:java}
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 (TID 15, localhost, executor driver): java.lang.UnsupportedOperationException: Unimplemented type: IntegerType
>   at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473)
>   at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214)
>   at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261)
> ...
> {code}
> All Spark 2.x releases behave the same. For Spark 1.6.3:
> {code}
> scala> sql("select * from tx").show
> +----+---+
> |   i|  j|
> +----+---+
> |null|  3|
> |   1|  a|
> +----+---+
> {code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
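[Editor's note] Until appends adjust types automatically, the caller can align the incoming frame with the table's schema explicitly. A hedged PySpark sketch of that workaround for the reproduction above; `spark` is an existing SparkSession:

{code:python}
from pyspark.sql.functions import col

spark.createDataFrame([(1, "a")], ["i", "j"]).write.saveAsTable("t")
df = spark.createDataFrame([("c", 3)], ["i", "j"])

# Cast the incoming frame to the table's schema before appending; "c" becomes
# null for the numeric column, matching the 1.6.3 behaviour quoted above.
target = spark.table("t").schema
aligned = df.select(*[col(f.name).cast(f.dataType).alias(f.name)
                      for f in target.fields])
aligned.write.mode("append").saveAsTable("t")
spark.sql("select * from t").show()
{code}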
[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356168#comment-16356168 ] Thomas Graves commented on SPARK-22683: --- I agree, I think default behavior stays 1. I ran a few tests with this patch. I definitely see an improvement in resource usage across all the jobs I ran. The jobs were similar job run time or actually faster on a few. I used default 60 second timeout. > DynamicAllocation wastes resources by allocating containers that will barely > be used > > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle >Priority: Major > Labels: pull-request-available > > While migrating a series of jobs from MR to Spark using dynamicAllocation, > I've noticed almost a doubling (+114% exactly) of resource consumption of > Spark w.r.t MR, for a wall clock time gain of 43% > About the context: > - resource usage stands for vcore-hours allocation for the whole job, as seen > by YARN > - I'm talking about a series of jobs because we provide our users with a way > to define experiments (via UI / DSL) that automatically get translated to > Spark / MR jobs and submitted on the cluster > - we submit around 500 of such jobs each day > - these jobs are usually one shot, and the amount of processing can vary a > lot between jobs, and as such finding an efficient number of executors for > each job is difficult to get right, which is the reason I took the path of > dynamic allocation. > - Some of the tests have been scheduled on an idle queue, some on a full > queue. > - experiments have been conducted with spark.executor-cores = 5 and 10, only > results for 5 cores have been reported because efficiency was overall better > than with 10 cores > - the figures I give are averaged over a representative sample of those jobs > (about 600 jobs) ranging from tens to thousands splits in the data > partitioning and between 400 to 9000 seconds of wall clock time. > - executor idle timeout is set to 30s; > > Definition: > - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, > which represent the max number of tasks an executor will process in parallel. > - the current behaviour of the dynamic allocation is to allocate enough > containers to have one taskSlot per task, which minimizes latency, but wastes > resources when tasks are small regarding executor allocation and idling > overhead. > The results using the proposal (described below) over the job sample (600 > jobs): > - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in > resource usage, for a 37% (against 43%) reduction in wall clock time for > Spark w.r.t MR > - by trying to minimize the average resource consumption, I ended up with 6 > tasks per core, with a 30% resource usage reduction, for a similar wall clock > time w.r.t. MR > What did I try to solve the issue with existing parameters (summing up a few > points mentioned in the comments) ? > - change dynamicAllocation.maxExecutors: this would need to be adapted for > each job (tens to thousands splits can occur), and essentially remove the > interest of using the dynamic allocation. > - use dynamicAllocation.backlogTimeout: > - setting this parameter right to avoid creating unused executors is very > dependant on wall clock time. One basically needs to solve the exponential > ramp up for the target time. 
So this is not an option for my use case where I > don't want a per-job tuning. > - I've still done a series of experiments, details in the comments. > Result is that after manual tuning, the best I could get was a similar > resource consumption at the expense of 20% more wall clock time, or a similar > wall clock time at the expense of 60% more resource consumption than what I > got using my proposal @ 6 tasks per slot (this value being optimized over a > much larger range of jobs as already stated) > - as mentioned in another comment, tampering with the exponential ramp up > might yield task imbalance and such old executors could become contention > points for other exes trying to remotely access blocks in the old exes (not > witnessed in the jobs I'm talking about, but we did see this behavior in > other jobs) > Proposal: > Simply add a tasksPerExecutorSlot parameter, which makes it possible to > specify how many tasks a single taskSlot should ideally execute to mitigate > the overhead of executor allocation. > PR: https://github.com/apache/spark/pull/19881 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
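The arithmetic behind the tasksPerExecutorSlot proposal is worth spelling out. Today the allocation manager requests roughly one task slot per outstanding task; a tasks-per-slot divisor shrinks that target proportionally. The sketch below is illustrative only: the names are placeholders, not Spark's real internals, and the PR (https://github.com/apache/spark/pull/19881) is authoritative for the actual change.

{code:scala}
// Illustrative arithmetic only; names are placeholders, not Spark internals.
object TargetExecutorsSketch {
  def target(outstandingTasks: Int,
             executorCores: Int,   // spark.executor.cores
             taskCpus: Int,        // spark.task.cpus
             tasksPerSlot: Int): Int = {
    val slotsPerExecutor = executorCores / taskCpus
    // tasksPerSlot = 1 reproduces today's behavior: one slot per task.
    // tasksPerSlot = N expects each slot to run N tasks back to back,
    // so N times fewer executors are requested.
    math.ceil(outstandingTasks.toDouble / (slotsPerExecutor * tasksPerSlot)).toInt
  }
  // Example: 6000 tasks, 5 cores per executor, 1 cpu per task:
  //   tasksPerSlot = 1 -> 1200 executors; tasksPerSlot = 6 -> 200 executors.
}
{code}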
[jira] [Comment Edited] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356168#comment-16356168 ] Thomas Graves edited comment on SPARK-22683 at 2/7/18 10:24 PM: I agree, I think default behavior stays 1. I ran a few tests with this patch. I definitely see an improvement in resource usage across all the jobs I ran. The jobs were similar job run time or actually faster on a few. I used default 60 second timeout. Note none of those jobs were really long running. small to medium size tasks. was (Author: tgraves): I agree, I think default behavior stays 1. I ran a few tests with this patch. I definitely see an improvement in resource usage across all the jobs I ran. The jobs were similar job run time or actually faster on a few. I used default 60 second timeout. > DynamicAllocation wastes resources by allocating containers that will barely > be used > > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle >Priority: Major > Labels: pull-request-available > > While migrating a series of jobs from MR to Spark using dynamicAllocation, > I've noticed almost a doubling (+114% exactly) of resource consumption of > Spark w.r.t MR, for a wall clock time gain of 43% > About the context: > - resource usage stands for vcore-hours allocation for the whole job, as seen > by YARN > - I'm talking about a series of jobs because we provide our users with a way > to define experiments (via UI / DSL) that automatically get translated to > Spark / MR jobs and submitted on the cluster > - we submit around 500 of such jobs each day > - these jobs are usually one shot, and the amount of processing can vary a > lot between jobs, and as such finding an efficient number of executors for > each job is difficult to get right, which is the reason I took the path of > dynamic allocation. > - Some of the tests have been scheduled on an idle queue, some on a full > queue. > - experiments have been conducted with spark.executor-cores = 5 and 10, only > results for 5 cores have been reported because efficiency was overall better > than with 10 cores > - the figures I give are averaged over a representative sample of those jobs > (about 600 jobs) ranging from tens to thousands splits in the data > partitioning and between 400 to 9000 seconds of wall clock time. > - executor idle timeout is set to 30s; > > Definition: > - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, > which represent the max number of tasks an executor will process in parallel. > - the current behaviour of the dynamic allocation is to allocate enough > containers to have one taskSlot per task, which minimizes latency, but wastes > resources when tasks are small regarding executor allocation and idling > overhead. > The results using the proposal (described below) over the job sample (600 > jobs): > - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in > resource usage, for a 37% (against 43%) reduction in wall clock time for > Spark w.r.t MR > - by trying to minimize the average resource consumption, I ended up with 6 > tasks per core, with a 30% resource usage reduction, for a similar wall clock > time w.r.t. MR > What did I try to solve the issue with existing parameters (summing up a few > points mentioned in the comments) ? 
> - change dynamicAllocation.maxExecutors: this would need to be adapted for > each job (tens to thousands splits can occur), and essentially remove the > interest of using the dynamic allocation. > - use dynamicAllocation.backlogTimeout: > - setting this parameter right to avoid creating unused executors is very > dependant on wall clock time. One basically needs to solve the exponential > ramp up for the target time. So this is not an option for my use case where I > don't want a per-job tuning. > - I've still done a series of experiments, details in the comments. > Result is that after manual tuning, the best I could get was a similar > resource consumption at the expense of 20% more wall clock time, or a similar > wall clock time at the expense of 60% more resource consumption than what I > got using my proposal @ 6 tasks per slot (this value being optimized over a > much larger range of jobs as already stated) > - as mentioned in another comment, tampering with the exponential ramp up > might yield task imbalance and such old executors could become contention > points for other exes trying to remotely access blocks in the old exes (not > witnessed in the jobs I'm
[jira] [Updated] (SPARK-23348) append data using saveAsTable should adjust the data types
[ https://issues.apache.org/jira/browse/SPARK-23348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23348: -- Description: {code:java} Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t") Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t") scala> sql("select * from t").show {code} This query will fail with a strange error: {code:java} org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 (TID 15, localhost, executor driver): java.lang.UnsupportedOperationException: Unimplemented type: IntegerType at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261) ... {code} All Spark 2.X are the same. For Spark 1.6.3, {code} scala> sql("select * from tx").show ++---+ | i| j| ++---+ |null| 3| | 1| a| ++---+ {code} was: {code:java} Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t") Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t") scala> sql("select * from t").show {code} This query will fail with a strange error: {code:java} org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 (TID 15, localhost, executor driver): java.lang.UnsupportedOperationException: Unimplemented type: IntegerType at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261) ... {code} > append data using saveAsTable should adjust the data types > -- > > Key: SPARK-23348 > URL: https://issues.apache.org/jira/browse/SPARK-23348 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0 >Reporter: Wenchen Fan >Priority: Major > > > {code:java} > Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t") > Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t") > scala> sql("select * from t").show > {code} > > This query will fail with a strange error: > {code:java} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in > stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 > (TID 15, localhost, executor driver): > java.lang.UnsupportedOperationException: Unimplemented type: IntegerType > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261) > ... > {code} > > All Spark 2.X are the same. 
For Spark 1.6.3, > {code} > scala> sql("select * from tx").show > ++---+ > | i| j| > ++---+ > |null| 3| > | 1| a| > ++---+ > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
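Until the adjustment described above lands, a caller can avoid writing type-inconsistent Parquet files by aligning the incoming DataFrame with the existing table schema before appending. A minimal spark-shell sketch, assuming table {{t}} from the description already exists; this is a workaround illustration using only public APIs, not the fix itself:

{code:scala}
// spark-shell sketch: align the appended data with the table's schema.
import org.apache.spark.sql.functions.col
import spark.implicits._

val targetSchema = spark.table("t").schema
val incoming = Seq("c" -> 3).toDF("i", "j")
// Cast every column to the type the table already declares.
val aligned = incoming.select(targetSchema.map(f => col(f.name).cast(f.dataType)): _*)
aligned.write.mode("append").saveAsTable("t")
// "c" cannot be cast to int, so column i becomes null, matching the
// Spark 1.6.3 output above instead of failing later at read time.
{code}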
[jira] [Commented] (SPARK-23354) spark jdbc does not maintain length of data type when I move data from MS sql server to Oracle using spark jdbc
[ https://issues.apache.org/jira/browse/SPARK-23354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356132#comment-16356132 ] Sean Owen commented on SPARK-23354: --- I'm not clear what about this involves Spark. What length do you mean? The first question is whether your code is even correct. > spark jdbc does not maintain length of data type when I move data from MS sql > server to Oracle using spark jdbc > --- > > Key: SPARK-23354 > URL: https://issues.apache.org/jira/browse/SPARK-23354 > Project: Spark > Issue Type: Bug > Components: Input/Output >Affects Versions: 2.2.1 >Reporter: Lav Patel >Priority: Major > > spark jdbc does not maintain the length of a data type when I move data from MS sql > server to Oracle using spark jdbc > > To fix this, I have written code that figures out the length of each column and > does the conversion. > > I can put more details with a code sample if the community is interested. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
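For what it's worth, the JDBC writer already exposes a knob for this case: the {{createTableColumnTypes}} option (available since Spark 2.2.0) overrides the DDL column types Spark would otherwise pick when it creates the target table. A hedged sketch; the connection URLs and column names below are hypothetical, and connection credentials are omitted:

{code:scala}
// Hypothetical URLs/columns; createTableColumnTypes is applied to the
// CREATE TABLE statement Spark issues for the target database.
val df = spark.read.format("jdbc")
  .option("url", "jdbc:sqlserver://mssql-host;databaseName=src")
  .option("dbtable", "dbo.customers")
  .load()

df.write.format("jdbc")
  .option("url", "jdbc:oracle:thin:@oracle-host:1521:orcl")
  .option("dbtable", "CUSTOMERS")
  .option("createTableColumnTypes", "NAME VARCHAR(128), CITY VARCHAR(64)")
  .save()
{code}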
[jira] [Updated] (SPARK-23348) append data using saveAsTable should adjust the data types
[ https://issues.apache.org/jira/browse/SPARK-23348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23348: -- Affects Version/s: 2.0.2 > append data using saveAsTable should adjust the data types > -- > > Key: SPARK-23348 > URL: https://issues.apache.org/jira/browse/SPARK-23348 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0 >Reporter: Wenchen Fan >Priority: Major > > > {code:java} > Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t") > Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t") > scala> sql("select * from t").show > {code} > > This query will fail with a strange error: > > {code:java} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in > stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 > (TID 15, localhost, executor driver): > java.lang.UnsupportedOperationException: Unimplemented type: IntegerType > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261) > ... > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23348) append data using saveAsTable should adjust the data types
[ https://issues.apache.org/jira/browse/SPARK-23348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23348: -- Affects Version/s: 2.1.2 > append data using saveAsTable should adjust the data types > -- > > Key: SPARK-23348 > URL: https://issues.apache.org/jira/browse/SPARK-23348 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.1.2, 2.2.1, 2.3.0 >Reporter: Wenchen Fan >Priority: Major > > > {code:java} > Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t") > Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t") > scala> sql("select * from t").show > {code} > > This query will fail with a strange error: > > {code:java} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in > stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 > (TID 15, localhost, executor driver): > java.lang.UnsupportedOperationException: Unimplemented type: IntegerType > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261) > ... > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23348) append data using saveAsTable should adjust the data types
[ https://issues.apache.org/jira/browse/SPARK-23348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23348: -- Affects Version/s: 2.2.1 > append data using saveAsTable should adjust the data types > -- > > Key: SPARK-23348 > URL: https://issues.apache.org/jira/browse/SPARK-23348 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Wenchen Fan >Priority: Major > > > {code:java} > Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t") > Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t") > scala> sql("select * from t").show > {code} > > This query will fail with a strange error: > > {code:java} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in > stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 > (TID 15, localhost, executor driver): > java.lang.UnsupportedOperationException: Unimplemented type: IntegerType > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261) > ... > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23348) append data using saveAsTable should adjust the data types
[ https://issues.apache.org/jira/browse/SPARK-23348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-23348: -- Description: {code:java} Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t") Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t") scala> sql("select * from t").show {code} This query will fail with a strange error: {code:java} org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 (TID 15, localhost, executor driver): java.lang.UnsupportedOperationException: Unimplemented type: IntegerType at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261) ... {code} was: {code:java} Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t") Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t") {code} This query will fail with a strange error: {code:java} org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 (TID 15, localhost, executor driver): java.lang.UnsupportedOperationException: Unimplemented type: IntegerType at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261) ... {code} > append data using saveAsTable should adjust the data types > -- > > Key: SPARK-23348 > URL: https://issues.apache.org/jira/browse/SPARK-23348 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.2.1, 2.3.0 >Reporter: Wenchen Fan >Priority: Major > > > {code:java} > Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t") > Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t") > scala> sql("select * from t").show > {code} > > This query will fail with a strange error: > > {code:java} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in > stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 > (TID 15, localhost, executor driver): > java.lang.UnsupportedOperationException: Unimplemented type: IntegerType > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261) > ... > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23349) Duplicate and redundant type determination for ShuffleManager Object
[ https://issues.apache.org/jira/browse/SPARK-23349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-23349. --- Resolution: Won't Fix Fix Version/s: (was: 2.2.1) > Duplicate and redundant type determination for ShuffleManager Object > > > Key: SPARK-23349 > URL: https://issues.apache.org/jira/browse/SPARK-23349 > Project: Spark > Issue Type: Improvement > Components: Shuffle, SQL >Affects Versions: 2.2.1 >Reporter: Phoenix_Daddy >Priority: Minor > > org.apache.spark.sql.execution.exchange.ShuffleExchange > In the "needToCopyObjectsBeforeShuffle()" function, there is a nested if/else > branch. > The condition in the first-layer "if" has the same > value as the condition in the second-layer "if"; that is, the inner condition > must be true whenever the outer one is true. > In addition, the condition used in the second-layer > "if" should not be computed until it is needed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
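For readers puzzling over the stripped-out identifiers above: the complaint is the common pattern of re-testing an outer condition inside a nested branch and computing an expensive predicate before it is needed. A generic illustration only; {{condA}} and {{expensiveCondB}} are placeholder names, not Spark's real fields:

{code:scala}
// Placeholder names; the report's real identifiers were lost in the JIRA text.
def decide(condA: Boolean, expensiveCondB: => Boolean): Boolean = {
  // Before: an inner `if` re-tested condA, and expensiveCondB was evaluated
  // up front. After: test condA once and defer the expensive predicate.
  lazy val b = expensiveCondB   // only computed if condA holds
  condA && b
}
{code}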
[jira] [Commented] (SPARK-23329) Update the function descriptions with the arguments and returned values of the trigonometric functions
[ https://issues.apache.org/jira/browse/SPARK-23329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356119#comment-16356119 ] Sean Owen commented on SPARK-23329: --- Yeah, none of the other docs talk about a Column. It's implied that operations operate across rows of data in columns. "angles" makes it sound like the operator takes several arguments. I would go with the former. > Update the function descriptions with the arguments and returned values of > the trigonometric functions > -- > > Key: SPARK-23329 > URL: https://issues.apache.org/jira/browse/SPARK-23329 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Minor > Labels: starter > > We need an update on the function descriptions for all the trigonometric > functions. For example, {{cos}}, {{sin}}, and {{cot}}. Internally, the > implementation is based on the java.lang.Math. We need a clear description > about the units of the input arguments and the returned values. > For example, the following descriptions are lacking such info. > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L551-L555 > https://github.com/apache/spark/blob/d5861aba9d80ca15ad3f22793b79822e470d6913/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L1978 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
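To make the suggestion concrete, the wording under discussion would look roughly like the Scaladoc below in {{functions.scala}}. This is an illustration of the proposed style, not the text that was merged:

{code:scala}
// Illustration of the proposed wording, not the merged text.
/**
 * @param e angle in radians
 * @return cosine of the angle, as if computed by `java.lang.Math.cos`
 */
def cos(e: Column): Column = withExpr { Cos(e.expr) }
{code}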
[jira] [Commented] (SPARK-23318) FP-growth: WARN FPGrowth: Input data is not cached
[ https://issues.apache.org/jira/browse/SPARK-23318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356111#comment-16356111 ] Sean Owen commented on SPARK-23318: --- [~tashoyan] did you want to submit a PR for this? > FP-growth: WARN FPGrowth: Input data is not cached > -- > > Key: SPARK-23318 > URL: https://issues.apache.org/jira/browse/SPARK-23318 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 2.2.1 >Reporter: Arseniy Tashoyan >Priority: Minor > Labels: MLLib, fp-growth > Original Estimate: 24h > Remaining Estimate: 24h > > When running FPGrowth.fit() from the _ml_ package, one can see a warning: > WARN FPGrowth: Input data is not cached. > This warning occurs even when the dataset of transactions is cached. > Actually this warning comes from the FPGrowth implementation in the old _mllib_ > package. The new FPGrowth performs some transformations on the input data set of > transactions and then passes it to the old FPGrowth - without caching. Hence > the warning. > The problem looks similar to SPARK-18356. > If you don't mind, I can push a similar fix: > {code} > // ml.FPGrowth > val handlePersistence = dataset.storageLevel == StorageLevel.NONE > if (handlePersistence) { > // cache the data > } > // then call mllib.FPGrowth > // finally unpersist the data > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
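A fuller version of that sketch, modeled on the SPARK-18356 fix for KMeans rather than on any merged FPGrowth patch; it assumes the code runs inside {{ml.FPGrowth.fit()}}, where the {{$(...)}} param accessors are available:

{code:scala}
// Approximate; modeled on the SPARK-18356 KMeans fix, not the exact patch.
import org.apache.spark.storage.StorageLevel

val handlePersistence = dataset.storageLevel == StorageLevel.NONE
val items = dataset.select($(itemsCol)).rdd
  .map(r => r.getSeq[Any](0).toArray)   // shape mllib.fpm.FPGrowth expects

if (handlePersistence) {
  items.persist(StorageLevel.MEMORY_AND_DISK)   // mllib now sees cached input
}
val parentModel = new org.apache.spark.mllib.fpm.FPGrowth()
  .setMinSupport($(minSupport))
  .run(items)
if (handlePersistence) {
  items.unpersist()
}
{code}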
[jira] [Assigned] (SPARK-22279) Turn on spark.sql.hive.convertMetastoreOrc by default
[ https://issues.apache.org/jira/browse/SPARK-22279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22279: Assignee: (was: Apache Spark) > Turn on spark.sql.hive.convertMetastoreOrc by default > - > > Key: SPARK-22279 > URL: https://issues.apache.org/jira/browse/SPARK-22279 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Priority: Major > > Like Parquet, this issue aims to turn on `spark.sql.hive.convertMetastoreOrc` > by default. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-22279) Turn on spark.sql.hive.convertMetastoreOrc by default
[ https://issues.apache.org/jira/browse/SPARK-22279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-22279: Assignee: Apache Spark > Turn on spark.sql.hive.convertMetastoreOrc by default > - > > Key: SPARK-22279 > URL: https://issues.apache.org/jira/browse/SPARK-22279 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > > Like Parquet, this issue aims to turn on `spark.sql.hive.convertMetastoreOrc` > by default. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23355) convertMetastore should not ignore table properties
[ https://issues.apache.org/jira/browse/SPARK-23355?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16356076#comment-16356076 ] Apache Spark commented on SPARK-23355: -- User 'dongjoon-hyun' has created a pull request for this issue: https://github.com/apache/spark/pull/20522 > convertMetastore should not ignore table properties > --- > > Key: SPARK-23355 > URL: https://issues.apache.org/jira/browse/SPARK-23355 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0 >Reporter: Dongjoon Hyun >Priority: Major > > `convertMetastoreOrc/Parquet` ignores table properties. > This happens for a table created by `STORED AS ORC/PARQUET`. > Currently, `convertMetastoreParquet` is `true` by default. So, it's a > well-known user-facing issue. For `convertMetastoreOrc`, it has been `false` > and SPARK-22279 tried to turn on that for Feature Parity with Parquet. > However, it's indeed a regression for the existing Hive ORC table users. So, > it'll be reverted for Apache Spark 2.3 via > https://github.com/apache/spark/pull/20536 . -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23355) convertMetastore should not ignore table properties
[ https://issues.apache.org/jira/browse/SPARK-23355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23355: Assignee: Apache Spark > convertMetastore should not ignore table properties > --- > > Key: SPARK-23355 > URL: https://issues.apache.org/jira/browse/SPARK-23355 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0 >Reporter: Dongjoon Hyun >Assignee: Apache Spark >Priority: Major > > `convertMetastoreOrc/Parquet` ignores table properties. > This happens for a table created by `STORED AS ORC/PARQUET`. > Currently, `convertMetastoreParquet` is `true` by default. So, it's a > well-known user-facing issue. For `convertMetastoreOrc`, it has been `false` > and SPARK-22279 tried to turn on that for Feature Parity with Parquet. > However, it's indeed a regression for the existing Hive ORC table users. So, > it'll be reverted for Apache Spark 2.3 via > https://github.com/apache/spark/pull/20536 . -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23355) convertMetastore should not ignore table properties
[ https://issues.apache.org/jira/browse/SPARK-23355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23355: Assignee: (was: Apache Spark) > convertMetastore should not ignore table properties > --- > > Key: SPARK-23355 > URL: https://issues.apache.org/jira/browse/SPARK-23355 > Project: Spark > Issue Type: Test > Components: SQL >Affects Versions: 2.0.2, 2.1.2, 2.2.1, 2.3.0 >Reporter: Dongjoon Hyun >Priority: Major > > `convertMetastoreOrc/Parquet` ignores table properties. > This happens for a table created by `STORED AS ORC/PARQUET`. > Currently, `convertMetastoreParquet` is `true` by default. So, it's a > well-known user-facing issue. For `convertMetastoreOrc`, it has been `false` > and SPARK-22279 tried to turn on that for Feature Parity with Parquet. > However, it's indeed a regression for the existing Hive ORC table users. So, > it'll be reverted for Apache Spark 2.3 via > https://github.com/apache/spark/pull/20536 . -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23355) convertMetastore should not ignore table properties
Dongjoon Hyun created SPARK-23355: - Summary: convertMetastore should not ignore table properties Key: SPARK-23355 URL: https://issues.apache.org/jira/browse/SPARK-23355 Project: Spark Issue Type: Test Components: SQL Affects Versions: 2.2.1, 2.1.2, 2.0.2, 2.3.0 Reporter: Dongjoon Hyun `convertMetastoreOrc/Parquet` ignores table properties. This happens for a table created by `STORED AS ORC/PARQUET`. Currently, `convertMetastoreParquet` is `true` by default. So, it's a well-known user-facing issue. For `convertMetastoreOrc`, it has been `false` and SPARK-22279 tried to turn on that for Feature Parity with Parquet. However, it's indeed a regression for the existing Hive ORC table users. So, it'll be reverted for Apache Spark 2.3 via https://github.com/apache/spark/pull/20536 . -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
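To make the symptom concrete, here is a hedged reproduction sketch. The table name and property below are illustrative; the point is that a property recorded in the metastore is bypassed once the conversion hands I/O to Spark's native reader/writer:

{code:scala}
// Hedged reproduction sketch; table name and property are illustrative.
spark.sql("""
  CREATE TABLE t_orc (id INT)
  STORED AS ORC
  TBLPROPERTIES ('orc.compress'='ZLIB')
""")

// With spark.sql.hive.convertMetastoreOrc=true the insert goes through
// Spark's native ORC writer, which does not consult 'orc.compress' above;
// it uses spark.sql.orc.compression.codec instead.
spark.sql("INSERT INTO t_orc VALUES (1)")
{code}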
[jira] [Updated] (SPARK-22279) Turn on spark.sql.hive.convertMetastoreOrc by default
[ https://issues.apache.org/jira/browse/SPARK-22279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun updated SPARK-22279: -- Fix Version/s: (was: 2.3.0) > Turn on spark.sql.hive.convertMetastoreOrc by default > - > > Key: SPARK-22279 > URL: https://issues.apache.org/jira/browse/SPARK-22279 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Priority: Major > > Like Parquet, this issue aims to turn on `spark.sql.hive.convertMetastoreOrc` > by default. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Reopened] (SPARK-22279) Turn on spark.sql.hive.convertMetastoreOrc by default
[ https://issues.apache.org/jira/browse/SPARK-22279?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun reopened SPARK-22279: --- This will be reverted. > Turn on spark.sql.hive.convertMetastoreOrc by default > - > > Key: SPARK-22279 > URL: https://issues.apache.org/jira/browse/SPARK-22279 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Priority: Major > > Like Parquet, this issue aims to turn on `spark.sql.hive.convertMetastoreOrc` > by default. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23045) Have RFormula use OneHotEncoderEstimator
[ https://issues.apache.org/jira/browse/SPARK-23045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-23045: -- Summary: Have RFormula use OneHotEncoderEstimator (was: Have RFormula use OneHoEncoderEstimator) > Have RFormula use OneHotEncoderEstimator > > > Key: SPARK-23045 > URL: https://issues.apache.org/jira/browse/SPARK-23045 > Project: Spark > Issue Type: Sub-task > Components: ML >Affects Versions: 2.3.0 >Reporter: Bago Amirbekian >Assignee: Bago Amirbekian >Priority: Major > Fix For: 2.3.0 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23206) Additional Memory Tuning Metrics
[ https://issues.apache.org/jira/browse/SPARK-23206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355994#comment-16355994 ] Edwina Lu commented on SPARK-23206: --- We ([~jerryshao], [~zhz] and I) are planning a conference call via Skype to discuss this ticket tomorrow Thursday Feb 8 at 5pm PST. If anyone else is interested in joining the call, please let me know. > Additional Memory Tuning Metrics > > > Key: SPARK-23206 > URL: https://issues.apache.org/jira/browse/SPARK-23206 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Edwina Lu >Priority: Major > Attachments: ExecutorsTab.png, MemoryTuningMetricsDesignDoc.pdf, > StageTab.png > > > At LinkedIn, we have multiple clusters, running thousands of Spark > applications, and these numbers are growing rapidly. We need to ensure that > these Spark applications are well tuned – cluster resources, including > memory, should be used efficiently so that the cluster can support running > more applications concurrently, and applications should run quickly and > reliably. > Currently there is limited visibility into how much memory executors are > using, and users are guessing numbers for executor and driver memory sizing. > These estimates are often much larger than needed, leading to memory wastage. > Examining the metrics for one cluster for a month, the average percentage of > used executor memory (max JVM used memory across executors / > spark.executor.memory) is 35%, leading to an average of 591GB unused memory > per application (number of executors * (spark.executor.memory - max JVM used > memory)). Spark has multiple memory regions (user memory, execution memory, > storage memory, and overhead memory), and to understand how memory is being > used and fine-tune allocation between regions, it would be useful to have > information about how much memory is being used for the different regions. > To improve visibility into memory usage for the driver and executors and > different memory regions, the following additional memory metrics can be be > tracked for each executor and driver: > * JVM used memory: the JVM heap size for the executor/driver. > * Execution memory: memory used for computation in shuffles, joins, sorts > and aggregations. > * Storage memory: memory used caching and propagating internal data across > the cluster. > * Unified memory: sum of execution and storage memory. > The peak values for each memory metric can be tracked for each executor, and > also per stage. This information can be shown in the Spark UI and the REST > APIs. Information for peak JVM used memory can help with determining > appropriate values for spark.executor.memory and spark.driver.memory, and > information about the unified memory region can help with determining > appropriate values for spark.memory.fraction and > spark.memory.storageFraction. Stage memory information can help identify > which stages are most memory intensive, and users can look into the relevant > code to determine if it can be optimized. > The memory metrics can be gathered by adding the current JVM used memory, > execution memory and storage memory to the heartbeat. SparkListeners are > modified to collect the new metrics for the executors, stages and Spark > history log. Only interesting values (peak values per stage per executor) are > recorded in the Spark history log, to minimize the amount of additional > logging. 
> We have attached our design documentation with this ticket and would like to > receive feedback from the community for this proposal. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
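A minimal sketch of the peak-tracking idea described above, with hypothetical names; the attached design document, not this snippet, is authoritative:

{code:scala}
// Sketch only, hypothetical names: fold heartbeat samples into peaks.
case class MemorySample(jvmUsed: Long, execution: Long, storage: Long) {
  def unified: Long = execution + storage
}

final class PeakMemoryMetrics {
  var peakJvmUsed, peakExecution, peakStorage, peakUnified = 0L
  def update(s: MemorySample): Unit = {
    peakJvmUsed   = math.max(peakJvmUsed, s.jvmUsed)
    peakExecution = math.max(peakExecution, s.execution)
    peakStorage   = math.max(peakStorage, s.storage)
    peakUnified   = math.max(peakUnified, s.unified)
  }
}
// One instance per (executor, stage) would capture the per-stage peaks the
// description proposes to record in the history log and expose in the UI.
{code}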
[jira] [Commented] (SPARK-19870) Repeatable deadlock on BlockInfoManager and TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-19870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355982#comment-16355982 ] Imran Rashid commented on SPARK-19870: -- [~eyalfa] any chance you can share those executor logs? Those WARN messages may or may not be OK -- the known situation in which they occur is when you intentionally do not read to the end of a partition, e.g. if you do a take() or a limit(), etc. > Repeatable deadlock on BlockInfoManager and TorrentBroadcast > > > Key: SPARK-19870 > URL: https://issues.apache.org/jira/browse/SPARK-19870 > Project: Spark > Issue Type: Bug > Components: Block Manager, Shuffle >Affects Versions: 2.0.2, 2.1.0 > Environment: ubuntu linux 14.04 x86_64 on ec2, hadoop cdh 5.10.0, > yarn coarse-grained. >Reporter: Steven Ruppert >Priority: Major > Attachments: stack.txt > > > Running what I believe to be a fairly vanilla spark job, using the RDD api, > with several shuffles, a cached RDD, and finally a conversion to DataFrame to > save to parquet. I get a repeatable deadlock at the very last reducers of one > of the stages. > Roughly: > {noformat} > "Executor task launch worker-6" #56 daemon prio=5 os_prio=0 > tid=0x7fffd88d3000 nid=0x1022b9 waiting for monitor entry > [0x7fffb95f3000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:207) > - waiting to lock <0x0005445cfc00> (a > org.apache.spark.broadcast.TorrentBroadcast$) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) > - locked <0x0005b12f2290> (a > org.apache.spark.broadcast.TorrentBroadcast) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} > and > {noformat} > "Executor task launch worker-5" #55 daemon prio=5 os_prio=0 > tid=0x7fffd88d nid=0x1022b8 in Object.wait() [0x7fffb96f4000] >java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:502) > at > org.apache.spark.storage.BlockInfoManager.lockForReading(BlockInfoManager.scala:202) > - locked <0x000545736b58> (a > org.apache.spark.storage.BlockInfoManager) > at > org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:444) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:210) > - locked <0x0005445cfc00> (a > org.apache.spark.broadcast.TorrentBroadcast$) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206) > at > 
org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) > - locked <0x00059711eb10> (a > org.apache.spark.broadcast.TorrentBroadcast) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at >
[jira] [Updated] (SPARK-21084) Improvements to dynamic allocation for notebook use cases
[ https://issues.apache.org/jira/browse/SPARK-21084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mark Hamstra updated SPARK-21084: - Description: One important application of Spark is to support many notebook users with a single YARN or Spark Standalone cluster. We at IBM have seen this requirement across multiple deployments of Spark: on-premises and private cloud deployments at our clients, as well as on the IBM cloud. The scenario goes something like this: "Every morning at 9am, 500 analysts log into their computers and start running Spark notebooks intermittently for the next 8 hours." I'm sure that many other members of the community are interested in making similar scenarios work. Dynamic allocation is supposed to support these kinds of use cases by shifting cluster resources towards users who are currently executing scalable code. In our own testing, we have encountered a number of issues with using the current implementation of dynamic allocation for this purpose: *Issue #1: Starvation.* A Spark job acquires all available containers, preventing other jobs or applications from starting. *Issue #2: Request latency.* Jobs that would normally finish in less than 30 seconds take 2-4x longer than normal with dynamic allocation. *Issue #3: Unfair resource allocation due to cached data.* Applications that have cached RDD partitions hold onto executors indefinitely, denying those resources to other applications. *Issue #4: Loss of cached data leads to thrashing.* Applications repeatedly lose partitions of cached RDDs because the underlying executors are removed; the applications then need to rerun expensive computations. This umbrella JIRA covers efforts to address these issues by making enhancements to Spark. Here's a high-level summary of the current planned work: * [SPARK-21097]: Preserve an executor's cached data when shutting down the executor. * [SPARK-21122]: Make Spark give up executors in a controlled fashion when the RM indicates it is running low on capacity. * (JIRA TBD): Reduce the delay for dynamic allocation to spin up new executors. Note that this overall plan is subject to change, and other members of the community should feel free to suggest changes and to help out. was: One important application of Spark is to support many notebook users with a single YARN or Spark Standalone cluster. We at IBM have seen this requirement across multiple deployments of Spark: on-premises and private cloud deployments at our clients, as well as on the IBM cloud. The scenario goes something like this: "Every morning at 9am, 500 analysts log into their computers and start running Spark notebooks intermittently for the next 8 hours." I'm sure that many other members of the community are interested in making similar scenarios work. Dynamic allocation is supposed to support these kinds of use cases by shifting cluster resources towards users who are currently executing scalable code. In our own testing, we have encountered a number of issues with using the current implementation of dynamic allocation for this purpose: *Issue #1: Starvation.* A Spark job acquires all available containers, preventing other jobs or applications from starting. *Issue #2: Request latency.* Jobs that would normally finish in less than 30 seconds take 2-4x longer than normal with dynamic allocation. *Issue #3: Unfair resource allocation due to cached data.* Applications that have cached RDD partitions hold onto executors indefinitely, denying those resources to other applications. 
*Issue #4: Loss of cached data leads to thrashing.* Applications repeatedly lose partitions of cached RDDs because the underlying executors are removed; the applications then need to rerun expensive computations. This umbrella JIRA covers efforts to address these issues by making enhancements to Spark. Here's a high-level summary of the current planned work: * [SPARK-21097]: Preserve an executor's cached data when shutting down the executor. * (JIRA TBD): Make Spark give up executors in a controlled fashion when the RM indicates it is running low on capacity. * (JIRA TBD): Reduce the delay for dynamic allocation to spin up new executors. Note that this overall plan is subject to change, and other members of the community should feel free to suggest changes and to help out. > Improvements to dynamic allocation for notebook use cases > - > > Key: SPARK-21084 > URL: https://issues.apache.org/jira/browse/SPARK-21084 > Project: Spark > Issue Type: Umbrella > Components: Block Manager, Scheduler, Spark Core, YARN >Affects Versions: 2.2.0, 2.3.0 >Reporter: Frederick Reiss >Priority: Major > > One important application of Spark is to support many notebook users with a > single
[jira] [Commented] (SPARK-22279) Turn on spark.sql.hive.convertMetastoreOrc by default
[ https://issues.apache.org/jira/browse/SPARK-22279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355952#comment-16355952 ] Apache Spark commented on SPARK-22279: -- User 'gatorsmile' has created a pull request for this issue: https://github.com/apache/spark/pull/20536 > Turn on spark.sql.hive.convertMetastoreOrc by default > - > > Key: SPARK-22279 > URL: https://issues.apache.org/jira/browse/SPARK-22279 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Dongjoon Hyun >Priority: Major > Fix For: 2.3.0 > > > Like Parquet, this issue aims to turn on `spark.sql.hive.convertMetastoreOrc` > by default. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23139) Read eventLog file with mixed encodings
[ https://issues.apache.org/jira/browse/SPARK-23139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355946#comment-16355946 ] Marcelo Vanzin commented on SPARK-23139: Even if you change {{file.encoding}}, Spark should be forcing the written data to be in UTF-8. If that's not happening, then that needs to be fixed. > Read eventLog file with mixed encodings > --- > > Key: SPARK-23139 > URL: https://issues.apache.org/jira/browse/SPARK-23139 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 2.2.0 >Reporter: DENG FEI >Priority: Major > > EventLog may contain mixed encodings such as custom exception message, but > caused to replay failure. > java.nio.charset.MalformedInputException: Input length = 1 > at java.nio.charset.CoderResult.throwException(CoderResult.java:281) > at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:339) > at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) > at java.io.InputStreamReader.read(InputStreamReader.java:184) > at java.io.BufferedReader.fill(BufferedReader.java:161) > at java.io.BufferedReader.readLine(BufferedReader.java:324) > at java.io.BufferedReader.readLine(BufferedReader.java:389) > at > scala.io.BufferedSource$BufferedLineIterator.hasNext(BufferedSource.scala:72) > at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:836) > at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:461) > at > org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:78) > at > org.apache.spark.scheduler.ReplayListenerBus.replay(ReplayListenerBus.scala:58) > at > org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$replay(FsHistoryProvider.scala:694) > at > org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$mergeApplicationListing(FsHistoryProvider.scala:507) > at > org.apache.spark.deploy.history.FsHistoryProvider$$anonfun$checkForLogs$4$$anon$4.run(FsHistoryProvider.scala:399) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
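If the writer-side guarantee cannot be enforced for logs that were already written, the reader can at least be made tolerant. A sketch using only standard {{java.nio.charset}} APIs; this illustrates one possible reader-side fix, not the actual patch:

{code:scala}
// Illustration of one possible reader-side fix, not the actual patch.
import java.io.{BufferedReader, InputStream, InputStreamReader}
import java.nio.charset.{CodingErrorAction, StandardCharsets}

def tolerantUtf8Reader(in: InputStream): BufferedReader = {
  val decoder = StandardCharsets.UTF_8.newDecoder()
    .onMalformedInput(CodingErrorAction.REPLACE)       // emit U+FFFD
    .onUnmappableCharacter(CodingErrorAction.REPLACE)  // instead of throwing
  new BufferedReader(new InputStreamReader(in, decoder))
}
{code}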
[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355943#comment-16355943 ] Mark Hamstra commented on SPARK-22683: -- I agree that setting the config to 1 should be sufficient to retain current behaviour; however, there seemed to be at least some urging in the discussion toward a default of 2. I'm not sure that we want to change Spark's default behavior in that way. > DynamicAllocation wastes resources by allocating containers that will barely > be used > > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle >Priority: Major > Labels: pull-request-available > > While migrating a series of jobs from MR to Spark using dynamicAllocation, > I've noticed almost a doubling (+114% exactly) of resource consumption of > Spark w.r.t MR, for a wall clock time gain of 43% > About the context: > - resource usage stands for vcore-hours allocation for the whole job, as seen > by YARN > - I'm talking about a series of jobs because we provide our users with a way > to define experiments (via UI / DSL) that automatically get translated to > Spark / MR jobs and submitted on the cluster > - we submit around 500 of such jobs each day > - these jobs are usually one shot, and the amount of processing can vary a > lot between jobs, and as such finding an efficient number of executors for > each job is difficult to get right, which is the reason I took the path of > dynamic allocation. > - Some of the tests have been scheduled on an idle queue, some on a full > queue. > - experiments have been conducted with spark.executor-cores = 5 and 10, only > results for 5 cores have been reported because efficiency was overall better > than with 10 cores > - the figures I give are averaged over a representative sample of those jobs > (about 600 jobs) ranging from tens to thousands splits in the data > partitioning and between 400 to 9000 seconds of wall clock time. > - executor idle timeout is set to 30s; > > Definition: > - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, > which represent the max number of tasks an executor will process in parallel. > - the current behaviour of the dynamic allocation is to allocate enough > containers to have one taskSlot per task, which minimizes latency, but wastes > resources when tasks are small regarding executor allocation and idling > overhead. > The results using the proposal (described below) over the job sample (600 > jobs): > - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in > resource usage, for a 37% (against 43%) reduction in wall clock time for > Spark w.r.t MR > - by trying to minimize the average resource consumption, I ended up with 6 > tasks per core, with a 30% resource usage reduction, for a similar wall clock > time w.r.t. MR > What did I try to solve the issue with existing parameters (summing up a few > points mentioned in the comments) ? > - change dynamicAllocation.maxExecutors: this would need to be adapted for > each job (tens to thousands splits can occur), and essentially remove the > interest of using the dynamic allocation. > - use dynamicAllocation.backlogTimeout: > - setting this parameter right to avoid creating unused executors is very > dependant on wall clock time. One basically needs to solve the exponential > ramp up for the target time. 
So this is not an option for my use case where I > don't want a per-job tuning. > - I've still done a series of experiments, details in the comments. > Result is that after manual tuning, the best I could get was a similar > resource consumption at the expense of 20% more wall clock time, or a similar > wall clock time at the expense of 60% more resource consumption than what I > got using my proposal @ 6 tasks per slot (this value being optimized over a > much larger range of jobs as already stated) > - as mentioned in another comment, tampering with the exponential ramp up > might yield task imbalance and such old executors could become contention > points for other exes trying to remotely access blocks in the old exes (not > witnessed in the jobs I'm talking about, but we did see this behavior in > other jobs) > Proposal: > Simply add a tasksPerExecutorSlot parameter, which makes it possible to > specify how many tasks a single taskSlot should ideally execute to mitigate > the overhead of executor allocation. > PR: https://github.com/apache/spark/pull/19881 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355920#comment-16355920 ] Thomas Graves commented on SPARK-22683: --- If the config is set to 1, which keeps the current behavior, the job server pattern and really any other application by default won't be affected. I don't see this as any different than tuning max executors, for example. Really, this is just a more dynamic max executors. I agree with you that this isn't optimal in some ways. For instance, it applies across the entire application, where you could run multiple jobs and stages. Each of those might not want this config, but that is a different problem where we would need to support per-stage configuration, for example. If it's a single application, then you should be able to set this between jobs programmatically if they are serial jobs (although I haven't tested this); but if that doesn't work, all the dynamic allocation configs would have the same issue. > DynamicAllocation wastes resources by allocating containers that will barely > be used > > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle >Priority: Major > Labels: pull-request-available > > While migrating a series of jobs from MR to Spark using dynamicAllocation, > I've noticed almost a doubling (+114% exactly) of resource consumption of > Spark w.r.t MR, for a wall clock time gain of 43% > About the context: > - resource usage stands for vcore-hours allocation for the whole job, as seen > by YARN > - I'm talking about a series of jobs because we provide our users with a way > to define experiments (via UI / DSL) that automatically get translated to > Spark / MR jobs and submitted on the cluster > - we submit around 500 of such jobs each day > - these jobs are usually one shot, and the amount of processing can vary a > lot between jobs, and as such finding an efficient number of executors for > each job is difficult to get right, which is the reason I took the path of > dynamic allocation. > - Some of the tests have been scheduled on an idle queue, some on a full > queue. > - experiments have been conducted with spark.executor-cores = 5 and 10, only > results for 5 cores have been reported because efficiency was overall better > than with 10 cores > - the figures I give are averaged over a representative sample of those jobs > (about 600 jobs) ranging from tens to thousands splits in the data > partitioning and between 400 to 9000 seconds of wall clock time. > - executor idle timeout is set to 30s; > > Definition: > - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, > which represent the max number of tasks an executor will process in parallel. > - the current behaviour of the dynamic allocation is to allocate enough > containers to have one taskSlot per task, which minimizes latency, but wastes > resources when tasks are small regarding executor allocation and idling > overhead. > The results using the proposal (described below) over the job sample (600 > jobs): > - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in > resource usage, for a 37% (against 43%) reduction in wall clock time for > Spark w.r.t MR > - by trying to minimize the average resource consumption, I ended up with 6 > tasks per core, with a 30% resource usage reduction, for a similar wall clock > time w.r.t.
MR > What did I try to solve the issue with existing parameters (summing up a few > points mentioned in the comments) ? > - change dynamicAllocation.maxExecutors: this would need to be adapted for > each job (tens to thousands splits can occur), and essentially remove the > interest of using the dynamic allocation. > - use dynamicAllocation.backlogTimeout: > - setting this parameter right to avoid creating unused executors is very > dependant on wall clock time. One basically needs to solve the exponential > ramp up for the target time. So this is not an option for my use case where I > don't want a per-job tuning. > - I've still done a series of experiments, details in the comments. > Result is that after manual tuning, the best I could get was a similar > resource consumption at the expense of 20% more wall clock time, or a similar > wall clock time at the expense of 60% more resource consumption than what I > got using my proposal @ 6 tasks per slot (this value being optimized over a > much larger range of jobs as already stated) > - as mentioned in another comment, tampering with the exponential ramp up > might yield task imbalance and such
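A minimal sketch of the "set this between serial jobs" suggestion from the comment above, assuming a PySpark session; as the comment itself notes, whether dynamic-allocation settings are actually honored when changed on a live session is untested, so treat this as an assumption rather than a confirmed pattern.
{code}
# Hedged sketch: re-tune dynamic allocation between serial jobs in one app.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("serial-jobs")
         .config("spark.dynamicAllocation.enabled", "true")
         .getOrCreate())

# Assumption: this runtime change is picked up by the allocation manager.
spark.conf.set("spark.dynamicAllocation.maxExecutors", "200")
spark.range(10**8).selectExpr("sum(id)").collect()   # large first job

spark.conf.set("spark.dynamicAllocation.maxExecutors", "20")
spark.range(10**5).selectExpr("sum(id)").collect()   # small follow-up job
{code}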
[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355895#comment-16355895 ] Mark Hamstra commented on SPARK-22683: -- A concern that I have is that the discussion seems to be very focused on Spark "jobs" that are actually Spark Applications that run only a single Job – i.e. a workload where Applications and Jobs are submitted 1:1. Tuning for just this kind of workload has the potential to negatively impact other Spark usage patterns where several Jobs run in a single Application or the common Job-server pattern where a single, long-running Application serves many Jobs over an extended time. I'm not saying that the proposals made here and in the associated PR will necessarily affect those other usage patterns negatively. What I am saying is that there must be strong demonstration that those patterns will not be negatively affected before I will be comfortable approving the requested changes. > DynamicAllocation wastes resources by allocating containers that will barely > be used > > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle >Priority: Major > Labels: pull-request-available > > While migrating a series of jobs from MR to Spark using dynamicAllocation, > I've noticed almost a doubling (+114% exactly) of resource consumption of > Spark w.r.t MR, for a wall clock time gain of 43% > About the context: > - resource usage stands for vcore-hours allocation for the whole job, as seen > by YARN > - I'm talking about a series of jobs because we provide our users with a way > to define experiments (via UI / DSL) that automatically get translated to > Spark / MR jobs and submitted on the cluster > - we submit around 500 of such jobs each day > - these jobs are usually one shot, and the amount of processing can vary a > lot between jobs, and as such finding an efficient number of executors for > each job is difficult to get right, which is the reason I took the path of > dynamic allocation. > - Some of the tests have been scheduled on an idle queue, some on a full > queue. > - experiments have been conducted with spark.executor-cores = 5 and 10, only > results for 5 cores have been reported because efficiency was overall better > than with 10 cores > - the figures I give are averaged over a representative sample of those jobs > (about 600 jobs) ranging from tens to thousands splits in the data > partitioning and between 400 to 9000 seconds of wall clock time. > - executor idle timeout is set to 30s; > > Definition: > - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, > which represent the max number of tasks an executor will process in parallel. > - the current behaviour of the dynamic allocation is to allocate enough > containers to have one taskSlot per task, which minimizes latency, but wastes > resources when tasks are small regarding executor allocation and idling > overhead. > The results using the proposal (described below) over the job sample (600 > jobs): > - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in > resource usage, for a 37% (against 43%) reduction in wall clock time for > Spark w.r.t MR > - by trying to minimize the average resource consumption, I ended up with 6 > tasks per core, with a 30% resource usage reduction, for a similar wall clock > time w.r.t. 
MR > What did I try to solve the issue with existing parameters (summing up a few > points mentioned in the comments) ? > - change dynamicAllocation.maxExecutors: this would need to be adapted for > each job (tens to thousands splits can occur), and essentially remove the > interest of using the dynamic allocation. > - use dynamicAllocation.backlogTimeout: > - setting this parameter right to avoid creating unused executors is very > dependant on wall clock time. One basically needs to solve the exponential > ramp up for the target time. So this is not an option for my use case where I > don't want a per-job tuning. > - I've still done a series of experiments, details in the comments. > Result is that after manual tuning, the best I could get was a similar > resource consumption at the expense of 20% more wall clock time, or a similar > wall clock time at the expense of 60% more resource consumption than what I > got using my proposal @ 6 tasks per slot (this value being optimized over a > much larger range of jobs as already stated) > - as mentioned in another comment, tampering with the exponential ramp up > might yield task imbalance and such old executors could become
[jira] [Commented] (SPARK-15176) Job Scheduling Within Application Suffers from Priority Inversion
[ https://issues.apache.org/jira/browse/SPARK-15176?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355880#comment-16355880 ] Alex Duvall commented on SPARK-15176: - Another interested party here - I'd find being able to limit max running tasks at a pool level extremely useful. > Job Scheduling Within Application Suffers from Priority Inversion > - > > Key: SPARK-15176 > URL: https://issues.apache.org/jira/browse/SPARK-15176 > Project: Spark > Issue Type: Bug > Components: Scheduler >Affects Versions: 1.6.1 >Reporter: Nick White >Priority: Major > > Say I have two pools, and N cores in my cluster: > * I submit a job to one, which has M >> N tasks > * N of the M tasks are scheduled > * I submit a job to the second pool - but none of its tasks get scheduled > until a task from the other pool finishes! > This can lead to unbounded denial-of-service for the second pool - regardless > of `minShare` or `weight` settings. Ideally Spark would support a pre-emption > mechanism, or an upper bound on a pool's resource usage. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
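For context on the pool mechanics under discussion, a minimal sketch of how jobs land in separate fair-scheduler pools; the pool names are hypothetical, while FAIR mode and the per-thread pool property are standard Spark API.
{code}
# Sketch of the setup the report describes: FAIR scheduling with per-thread
# pools. As the issue notes, minShare and weight only influence scheduling
# order; they neither cap a pool's running tasks nor preempt tasks that
# already occupy the cluster's N cores.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.scheduler.mode", "FAIR")
         .getOrCreate())
sc = spark.sparkContext

# Thread 1: the big job with M >> N tasks fills every core.
sc.setLocalProperty("spark.scheduler.pool", "bulk")
# ... submit the M-task job from this thread ...

# Thread 2: the second pool's job still waits for a "bulk" task to finish,
# regardless of its minShare or weight.
sc.setLocalProperty("spark.scheduler.pool", "interactive")
# ... submit the second job from another thread ...
{code}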
[jira] [Created] (SPARK-23354) spark jdbc does not maintain length of data type when I move data from MS sql server to Oracle using spark jdbc
Lav Patel created SPARK-23354: - Summary: spark jdbc does not maintain length of data type when I move data from MS sql server to Oracle using spark jdbc Key: SPARK-23354 URL: https://issues.apache.org/jira/browse/SPARK-23354 Project: Spark Issue Type: Bug Components: Input/Output Affects Versions: 2.2.1 Reporter: Lav Patel Spark JDBC does not maintain the length of a data type when I move data from MS SQL Server to Oracle using Spark JDBC. To fix this, I have written code that figures out the length of each column and performs the conversion. I can share more details with a code sample if the community is interested. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
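The reporter's code is not attached, but one hedged sketch of a workaround uses the standard createTableColumnTypes JDBC writer option (available since Spark 2.2) to pin the target column lengths explicitly; the hosts, tables, columns, and credentials below are hypothetical.
{code}
# Spark's JDBC writer otherwise maps StringType to a default target type,
# losing the source VARCHAR lengths when moving SQL Server data to Oracle.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:sqlserver://mssql-host;databaseName=src")
      .option("dbtable", "dbo.customers")
      .option("user", "etl").option("password", "secret")
      .load())

(df.write.format("jdbc")
 .option("url", "jdbc:oracle:thin:@oracle-host:1521/ORCL")
 .option("dbtable", "CUSTOMERS")
 .option("user", "etl").option("password", "secret")
 .option("createTableColumnTypes", "NAME VARCHAR(100), CITY VARCHAR(60)")
 .save())
{code}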
[jira] [Assigned] (SPARK-23345) Flaky test: FileBasedDataSourceSuite
[ https://issues.apache.org/jira/browse/SPARK-23345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li reassigned SPARK-23345: --- Assignee: Liang-Chi Hsieh > Flaky test: FileBasedDataSourceSuite > > > Key: SPARK-23345 > URL: https://issues.apache.org/jira/browse/SPARK-23345 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Assignee: Liang-Chi Hsieh >Priority: Major > Fix For: 2.3.0 > > > See at least once: > https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87114/testReport/junit/org.apache.spark.sql/FileBasedDataSourceSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/ > {noformat} > Error Message > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 650 times over 10.011358752 > seconds. Last failure message: There are 1 possibly leaked file streams.. > Stacktrace > sbt.ForkMain$ForkError: > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 650 times over 10.011358752 > seconds. Last failure message: There are 1 possibly leaked file streams.. > at > org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421) > at > org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439) > at > org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:26) > at > org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:337) > at > org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:26) > at > org.apache.spark.sql.test.SharedSparkSession$class.afterEach(SharedSparkSession.scala:114) > at > org.apache.spark.sql.FileBasedDataSourceSuite.afterEach(FileBasedDataSourceSuite.scala:26) > at > org.scalatest.BeforeAndAfterEach$$anonfun$1.apply$mcV$sp(BeforeAndAfterEach.scala:234) > {noformat} > Log file is too large to attach here, though. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23345) Flaky test: FileBasedDataSourceSuite
[ https://issues.apache.org/jira/browse/SPARK-23345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xiao Li resolved SPARK-23345. - Resolution: Fixed Fix Version/s: 2.3.0 > Flaky test: FileBasedDataSourceSuite > > > Key: SPARK-23345 > URL: https://issues.apache.org/jira/browse/SPARK-23345 > Project: Spark > Issue Type: Bug > Components: SQL, Tests >Affects Versions: 2.4.0 >Reporter: Marcelo Vanzin >Priority: Major > Fix For: 2.3.0 > > > See at least once: > https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/87114/testReport/junit/org.apache.spark.sql/FileBasedDataSourceSuite/_It_is_not_a_test_it_is_a_sbt_testing_SuiteSelector_/ > {noformat} > Error Message > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 650 times over 10.011358752 > seconds. Last failure message: There are 1 possibly leaked file streams.. > Stacktrace > sbt.ForkMain$ForkError: > org.scalatest.exceptions.TestFailedDueToTimeoutException: The code passed to > eventually never returned normally. Attempted 650 times over 10.011358752 > seconds. Last failure message: There are 1 possibly leaked file streams.. > at > org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:421) > at > org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:439) > at > org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:26) > at > org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:337) > at > org.apache.spark.sql.FileBasedDataSourceSuite.eventually(FileBasedDataSourceSuite.scala:26) > at > org.apache.spark.sql.test.SharedSparkSession$class.afterEach(SharedSparkSession.scala:114) > at > org.apache.spark.sql.FileBasedDataSourceSuite.afterEach(FileBasedDataSourceSuite.scala:26) > at > org.scalatest.BeforeAndAfterEach$$anonfun$1.apply$mcV$sp(BeforeAndAfterEach.scala:234) > {noformat} > Log file is too large to attach here, though. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23351) checkpoint corruption in long running application
[ https://issues.apache.org/jira/browse/SPARK-23351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Ahern updated SPARK-23351: Description: hi, after leaving my (somewhat high volume) Structured Streaming application running for some time, i get the following exception. The same exception also repeats when i try to restart the application. The only way to get the application back running is to clear the checkpoint directory which is far from ideal. Maybe a stream is not being flushed/closed properly internally by Spark when checkpointing? User class threw exception: org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to stage failure: Task 55 in stage 1.0 failed 4 times, most recent failure: Lost task 55.3 in stage 1.0 (TID 240, gbslixaacspa04u.metis.prd, executor 2): java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:481) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:265) at org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:200) at org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) was: hi, after leaving my (somewhat high volume) Structured Streaming application running for some time, i get the following exception. The same exception also repeats when i try to restart the application. The only way to get the application back running is to clear the checkpoint directory which is far from ideal. Maybe a stream in not being flushed/closed properly internally by Spark when checkpointing? 
User class threw exception: org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to stage failure: Task 55 in stage 1.0 failed 4 times, most recent failure: Lost task 55.3 in stage 1.0 (TID 240, gbslixaacspa04u.metis.prd, executor 2): java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:481) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:265) at
[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355634#comment-16355634 ] Xuefu Zhang commented on SPARK-22683: - +1 on the idea of including this. Also, +1 on renaming the config. Thanks. > DynamicAllocation wastes resources by allocating containers that will barely > be used > > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle >Priority: Major > Labels: pull-request-available > > While migrating a series of jobs from MR to Spark using dynamicAllocation, > I've noticed almost a doubling (+114% exactly) of resource consumption of > Spark w.r.t MR, for a wall clock time gain of 43% > About the context: > - resource usage stands for vcore-hours allocation for the whole job, as seen > by YARN > - I'm talking about a series of jobs because we provide our users with a way > to define experiments (via UI / DSL) that automatically get translated to > Spark / MR jobs and submitted on the cluster > - we submit around 500 of such jobs each day > - these jobs are usually one shot, and the amount of processing can vary a > lot between jobs, and as such finding an efficient number of executors for > each job is difficult to get right, which is the reason I took the path of > dynamic allocation. > - Some of the tests have been scheduled on an idle queue, some on a full > queue. > - experiments have been conducted with spark.executor-cores = 5 and 10, only > results for 5 cores have been reported because efficiency was overall better > than with 10 cores > - the figures I give are averaged over a representative sample of those jobs > (about 600 jobs) ranging from tens to thousands splits in the data > partitioning and between 400 to 9000 seconds of wall clock time. > - executor idle timeout is set to 30s; > > Definition: > - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, > which represent the max number of tasks an executor will process in parallel. > - the current behaviour of the dynamic allocation is to allocate enough > containers to have one taskSlot per task, which minimizes latency, but wastes > resources when tasks are small regarding executor allocation and idling > overhead. > The results using the proposal (described below) over the job sample (600 > jobs): > - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in > resource usage, for a 37% (against 43%) reduction in wall clock time for > Spark w.r.t MR > - by trying to minimize the average resource consumption, I ended up with 6 > tasks per core, with a 30% resource usage reduction, for a similar wall clock > time w.r.t. MR > What did I try to solve the issue with existing parameters (summing up a few > points mentioned in the comments) ? > - change dynamicAllocation.maxExecutors: this would need to be adapted for > each job (tens to thousands splits can occur), and essentially remove the > interest of using the dynamic allocation. > - use dynamicAllocation.backlogTimeout: > - setting this parameter right to avoid creating unused executors is very > dependant on wall clock time. One basically needs to solve the exponential > ramp up for the target time. So this is not an option for my use case where I > don't want a per-job tuning. > - I've still done a series of experiments, details in the comments. 
> Result is that after manual tuning, the best I could get was a similar > resource consumption at the expense of 20% more wall clock time, or a similar > wall clock time at the expense of 60% more resource consumption than what I > got using my proposal @ 6 tasks per slot (this value being optimized over a > much larger range of jobs as already stated) > - as mentioned in another comment, tampering with the exponential ramp up > might yield task imbalance and such old executors could become contention > points for other exes trying to remotely access blocks in the old exes (not > witnessed in the jobs I'm talking about, but we did see this behavior in > other jobs) > Proposal: > Simply add a tasksPerExecutorSlot parameter, which makes it possible to > specify how many tasks a single taskSlot should ideally execute to mitigate > the overhead of executor allocation. > PR: https://github.com/apache/spark/pull/19881 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23341) DataSourceOptions should handle path and table names to avoid confusion.
[ https://issues.apache.org/jira/browse/SPARK-23341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23341: Assignee: Apache Spark > DataSourceOptions should handle path and table names to avoid confusion. > > > Key: SPARK-23341 > URL: https://issues.apache.org/jira/browse/SPARK-23341 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Assignee: Apache Spark >Priority: Major > > DataSourceOptions should have getters for path and table, which should be > passed in when creating the options. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23341) DataSourceOptions should handle path and table names to avoid confusion.
[ https://issues.apache.org/jira/browse/SPARK-23341?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355563#comment-16355563 ] Apache Spark commented on SPARK-23341: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/20535 > DataSourceOptions should handle path and table names to avoid confusion. > > > Key: SPARK-23341 > URL: https://issues.apache.org/jira/browse/SPARK-23341 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > > DataSourceOptions should have getters for path and table, which should be > passed in when creating the options. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23341) DataSourceOptions should handle path and table names to avoid confusion.
[ https://issues.apache.org/jira/browse/SPARK-23341?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23341: Assignee: (was: Apache Spark) > DataSourceOptions should handle path and table names to avoid confusion. > > > Key: SPARK-23341 > URL: https://issues.apache.org/jira/browse/SPARK-23341 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 2.3.0 >Reporter: Ryan Blue >Priority: Major > > DataSourceOptions should have getters for path and table, which should be > passed in when creating the options. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-22683) DynamicAllocation wastes resources by allocating containers that will barely be used
[ https://issues.apache.org/jira/browse/SPARK-22683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1637#comment-1637 ] Thomas Graves commented on SPARK-22683: --- ok thanks, I would like to try this out myself on a few jobs, but my opinion is we should put this config in, if others have strong disagreement please speak up, otherwise I think we can move the discussion to the PR. I do think we need to change the name of the config. > DynamicAllocation wastes resources by allocating containers that will barely > be used > > > Key: SPARK-22683 > URL: https://issues.apache.org/jira/browse/SPARK-22683 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.1.0, 2.2.0 >Reporter: Julien Cuquemelle >Priority: Major > Labels: pull-request-available > > While migrating a series of jobs from MR to Spark using dynamicAllocation, > I've noticed almost a doubling (+114% exactly) of resource consumption of > Spark w.r.t MR, for a wall clock time gain of 43% > About the context: > - resource usage stands for vcore-hours allocation for the whole job, as seen > by YARN > - I'm talking about a series of jobs because we provide our users with a way > to define experiments (via UI / DSL) that automatically get translated to > Spark / MR jobs and submitted on the cluster > - we submit around 500 of such jobs each day > - these jobs are usually one shot, and the amount of processing can vary a > lot between jobs, and as such finding an efficient number of executors for > each job is difficult to get right, which is the reason I took the path of > dynamic allocation. > - Some of the tests have been scheduled on an idle queue, some on a full > queue. > - experiments have been conducted with spark.executor-cores = 5 and 10, only > results for 5 cores have been reported because efficiency was overall better > than with 10 cores > - the figures I give are averaged over a representative sample of those jobs > (about 600 jobs) ranging from tens to thousands splits in the data > partitioning and between 400 to 9000 seconds of wall clock time. > - executor idle timeout is set to 30s; > > Definition: > - let's say an executor has spark.executor.cores / spark.task.cpus taskSlots, > which represent the max number of tasks an executor will process in parallel. > - the current behaviour of the dynamic allocation is to allocate enough > containers to have one taskSlot per task, which minimizes latency, but wastes > resources when tasks are small regarding executor allocation and idling > overhead. > The results using the proposal (described below) over the job sample (600 > jobs): > - by using 2 tasks per taskSlot, we get a 5% (against -114%) reduction in > resource usage, for a 37% (against 43%) reduction in wall clock time for > Spark w.r.t MR > - by trying to minimize the average resource consumption, I ended up with 6 > tasks per core, with a 30% resource usage reduction, for a similar wall clock > time w.r.t. MR > What did I try to solve the issue with existing parameters (summing up a few > points mentioned in the comments) ? > - change dynamicAllocation.maxExecutors: this would need to be adapted for > each job (tens to thousands splits can occur), and essentially remove the > interest of using the dynamic allocation. > - use dynamicAllocation.backlogTimeout: > - setting this parameter right to avoid creating unused executors is very > dependant on wall clock time. One basically needs to solve the exponential > ramp up for the target time. 
So this is not an option for my use case where I > don't want a per-job tuning. > - I've still done a series of experiments, details in the comments. > Result is that after manual tuning, the best I could get was a similar > resource consumption at the expense of 20% more wall clock time, or a similar > wall clock time at the expense of 60% more resource consumption than what I > got using my proposal @ 6 tasks per slot (this value being optimized over a > much larger range of jobs as already stated) > - as mentioned in another comment, tampering with the exponential ramp up > might yield task imbalance and such old executors could become contention > points for other exes trying to remotely access blocks in the old exes (not > witnessed in the jobs I'm talking about, but we did see this behavior in > other jobs) > Proposal: > Simply add a tasksPerExecutorSlot parameter, which makes it possible to > specify how many tasks a single taskSlot should ideally execute to mitigate > the overhead of executor allocation. > PR: https://github.com/apache/spark/pull/19881 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (SPARK-23319) Skip PySpark tests for old Pandas and old PyArrow
[ https://issues.apache.org/jira/browse/SPARK-23319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-23319: - Target Version/s: (was: 2.4.0) > Skip PySpark tests for old Pandas and old PyArrow > - > > Key: SPARK-23319 > URL: https://issues.apache.org/jira/browse/SPARK-23319 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.4.0 > > > This is related with SPARK-23300. We declare PyArrow 0.8.0 >= and Pandas >= > 0.19.2 for now. We should explicitly skip in this case. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-23319) Skip PySpark tests for old Pandas and old PyArrow
[ https://issues.apache.org/jira/browse/SPARK-23319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-23319. -- Resolution: Fixed Assignee: Hyukjin Kwon Target Version/s: 2.4.0 Fixed in https://github.com/apache/spark/pull/20487 > Skip PySpark tests for old Pandas and old PyArrow > - > > Key: SPARK-23319 > URL: https://issues.apache.org/jira/browse/SPARK-23319 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > > This is related with SPARK-23300. We declare PyArrow 0.8.0 >= and Pandas >= > 0.19.2 for now. We should explicitly skip in this case. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-23319) Skip PySpark tests for old Pandas and old PyArrow
[ https://issues.apache.org/jira/browse/SPARK-23319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-23319: - Fix Version/s: 2.4.0 > Skip PySpark tests for old Pandas and old PyArrow > - > > Key: SPARK-23319 > URL: https://issues.apache.org/jira/browse/SPARK-23319 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.4.0 > > > This is related with SPARK-23300. We declare PyArrow 0.8.0 >= and Pandas >= > 0.19.2 for now. We should explicitly skip in this case. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23319) Skip PySpark tests for old Pandas and old PyArrow
[ https://issues.apache.org/jira/browse/SPARK-23319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355531#comment-16355531 ] Apache Spark commented on SPARK-23319: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/20534 > Skip PySpark tests for old Pandas and old PyArrow > - > > Key: SPARK-23319 > URL: https://issues.apache.org/jira/browse/SPARK-23319 > Project: Spark > Issue Type: Test > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > This is related with SPARK-23300. We declare PyArrow 0.8.0 >= and Pandas >= > 0.19.2 for now. We should explicitly skip in this case. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23300) Print out if Pandas and PyArrow are installed or not in tests
[ https://issues.apache.org/jira/browse/SPARK-23300?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355483#comment-16355483 ] Apache Spark commented on SPARK-23300: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/20533 > Print out if Pandas and PyArrow are installed or not in tests > - > > Key: SPARK-23300 > URL: https://issues.apache.org/jira/browse/SPARK-23300 > Project: Spark > Issue Type: Test > Components: PySpark, Tests >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Hyukjin Kwon >Priority: Major > Fix For: 2.4.0 > > > This is related with SPARK-23292. Some tests with, for example, PyArrow and > Pandas are currently not running in some Python versions in Jenkins. > At least, we should print out if we are skipping the tests or not explicitly > in the console. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
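A minimal sketch of the gating these two issues describe, assuming the minimum versions declared in SPARK-23319; the helper name is made up, and the real PySpark test code may differ.
{code}
# Detect the packages, compare against the declared minimums, and skip the
# dependent tests with an explicit message instead of failing or going silent.
import unittest
from distutils.version import LooseVersion

def _has_at_least(package, minimum):
    try:
        module = __import__(package)
        return LooseVersion(module.__version__) >= LooseVersion(minimum)
    except ImportError:
        return False

_have_pandas = _has_at_least("pandas", "0.19.2")
_have_pyarrow = _has_at_least("pyarrow", "0.8.0")

@unittest.skipIf(not _have_pandas or not _have_pyarrow,
                 "pandas >= 0.19.2 and pyarrow >= 0.8.0 are required")
class ArrowTests(unittest.TestCase):
    def test_placeholder(self):
        self.assertTrue(True)  # stands in for the real Arrow/Pandas tests
{code}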
[jira] [Assigned] (SPARK-23353) Allow ExecutorMetricsUpdate events to be logged to the event log with sampling
[ https://issues.apache.org/jira/browse/SPARK-23353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23353: Assignee: (was: Apache Spark) > Allow ExecutorMetricsUpdate events to be logged to the event log with sampling > -- > > Key: SPARK-23353 > URL: https://issues.apache.org/jira/browse/SPARK-23353 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Lantao Jin >Priority: Minor > > [SPARK-22050|https://issues.apache.org/jira/browse/SPARK-22050] gives a way to > log BlockUpdated events. The ExecutorMetricsUpdate events are also very > useful if they can be persisted for further analysis. > For performance reasons, and based on an actual use case, the PR offers a > fraction configuration that samples the events to be persisted. We also > refactor BlockUpdated logging to use the same sampling approach. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23353) Allow ExecutorMetricsUpdate events to be logged to the event log with sampling
[ https://issues.apache.org/jira/browse/SPARK-23353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23353: Assignee: Apache Spark > Allow ExecutorMetricsUpdate events to be logged to the event log with sampling > -- > > Key: SPARK-23353 > URL: https://issues.apache.org/jira/browse/SPARK-23353 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Lantao Jin >Assignee: Apache Spark >Priority: Minor > > [SPARK-22050|https://issues.apache.org/jira/browse/SPARK-22050] gives a way to > log BlockUpdated events. The ExecutorMetricsUpdate events are also very > useful if they can be persisted for further analysis. > For performance reasons, and based on an actual use case, the PR offers a > fraction configuration that samples the events to be persisted. We also > refactor BlockUpdated logging to use the same sampling approach. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23353) Allow ExecutorMetricsUpdate events to be logged to the event log with sampling
[ https://issues.apache.org/jira/browse/SPARK-23353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355454#comment-16355454 ] Apache Spark commented on SPARK-23353: -- User 'LantaoJin' has created a pull request for this issue: https://github.com/apache/spark/pull/20532 > Allow ExecutorMetricsUpdate events to be logged to the event log with sampling > -- > > Key: SPARK-23353 > URL: https://issues.apache.org/jira/browse/SPARK-23353 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 2.2.1 >Reporter: Lantao Jin >Priority: Minor > > [SPARK-22050|https://issues.apache.org/jira/browse/SPARK-22050] gives a way to > log BlockUpdated events. The ExecutorMetricsUpdate events are also very > useful if they can be persisted for further analysis. > For performance reasons, and based on an actual use case, the PR offers a > fraction configuration that samples the events to be persisted. We also > refactor BlockUpdated logging to use the same sampling approach. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23352) Explicitly specify supported types in Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-23352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23352: Assignee: Apache Spark > Explicitly specify supported types in Pandas UDFs > - > > Key: SPARK-23352 > URL: https://issues.apache.org/jira/browse/SPARK-23352 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Assignee: Apache Spark >Priority: Major > > Currently, we don't support {{BinaryType}} in Pandas UDFs: > {code} > >>> from pyspark.sql.functions import pandas_udf > >>> pudf = pandas_udf(lambda x: x, "binary") > >>> df = spark.createDataFrame([[bytearray("a")]]) > >>> df.select(pudf("_1")).show() > ... > TypeError: Unsupported type in conversion to Arrow: BinaryType > {code} > Also, the grouped aggregate Pandas UDF fail fast on {{ArrayType}} but seems > we can support this case. > We should better clarify it in doc in Pandas UDFs, and fail fast with type > checking ahead, rather than execution time. > Please consider this case: > {code} > pandas_udf(lambda x: x, BinaryType()) # we can fail fast at this stage > because we know the schema ahead > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23352) Explicitly specify supported types in Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-23352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23352: Assignee: (was: Apache Spark) > Explicitly specify supported types in Pandas UDFs > - > > Key: SPARK-23352 > URL: https://issues.apache.org/jira/browse/SPARK-23352 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently, we don't support {{BinaryType}} in Pandas UDFs: > {code} > >>> from pyspark.sql.functions import pandas_udf > >>> pudf = pandas_udf(lambda x: x, "binary") > >>> df = spark.createDataFrame([[bytearray("a")]]) > >>> df.select(pudf("_1")).show() > ... > TypeError: Unsupported type in conversion to Arrow: BinaryType > {code} > Also, the grouped aggregate Pandas UDF fail fast on {{ArrayType}} but seems > we can support this case. > We should better clarify it in doc in Pandas UDFs, and fail fast with type > checking ahead, rather than execution time. > Please consider this case: > {code} > pandas_udf(lambda x: x, BinaryType()) # we can fail fast at this stage > because we know the schema ahead > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23352) Explicitly specify supported types in Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-23352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355442#comment-16355442 ] Apache Spark commented on SPARK-23352: -- User 'HyukjinKwon' has created a pull request for this issue: https://github.com/apache/spark/pull/20531 > Explicitly specify supported types in Pandas UDFs > - > > Key: SPARK-23352 > URL: https://issues.apache.org/jira/browse/SPARK-23352 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently, we don't support {{BinaryType}} in Pandas UDFs: > {code} > >>> from pyspark.sql.functions import pandas_udf > >>> pudf = pandas_udf(lambda x: x, "binary") > >>> df = spark.createDataFrame([[bytearray("a")]]) > >>> df.select(pudf("_1")).show() > ... > TypeError: Unsupported type in conversion to Arrow: BinaryType > {code} > Also, the grouped aggregate Pandas UDF fail fast on {{ArrayType}} but seems > we can support this case. > We should better clarify it in doc in Pandas UDFs, and fail fast with type > checking ahead, rather than execution time. > Please consider this case: > {code} > pandas_udf(lambda x: x, BinaryType()) # we can fail fast at this stage > because we know the schema ahead > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23353) Allow ExecutorMetricsUpdate events to be logged to the event log with sampling
Lantao Jin created SPARK-23353: -- Summary: Allow ExecutorMetricsUpdate events to be logged to the event log with sampling Key: SPARK-23353 URL: https://issues.apache.org/jira/browse/SPARK-23353 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 2.2.1 Reporter: Lantao Jin [SPARK-22050|https://issues.apache.org/jira/browse/SPARK-22050] gives a way to log BlockUpdated events. The ExecutorMetricsUpdate events are also very useful if they can be persisted for further analysis. For performance reasons, and based on an actual use case, the PR offers a fraction configuration that samples the events to be persisted. We also refactor BlockUpdated logging to use the same sampling approach. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
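A sketch of how the proposed knob might look in use; the block-update setting exists today (SPARK-22050), but the metrics-update fraction name below is an assumption based on this description, not a released Spark configuration.
{code}
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.eventLog.enabled", "true")
         .config("spark.eventLog.logBlockUpdates.enabled", "true")  # SPARK-22050
         # Hypothetical name for sampled ExecutorMetricsUpdate logging:
         .config("spark.eventLog.logExecutorMetricsUpdates.fraction", "0.1")
         .getOrCreate())
{code}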
[jira] [Updated] (SPARK-23352) Explicitly specify supported types in Pandas UDFs
[ https://issues.apache.org/jira/browse/SPARK-23352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon updated SPARK-23352: - Description: Currently, we don't support {{BinaryType}} in Pandas UDFs: {code} >>> from pyspark.sql.functions import pandas_udf >>> pudf = pandas_udf(lambda x: x, "binary") >>> df = spark.createDataFrame([[bytearray("a")]]) >>> df.select(pudf("_1")).show() ... TypeError: Unsupported type in conversion to Arrow: BinaryType {code} Also, the grouped aggregate Pandas UDF fail fast on {{ArrayType}} but seems we can support this case. We should better clarify it in doc in Pandas UDFs, and fail fast with type checking ahead, rather than execution time. Please consider this case: {code} pandas_udf(lambda x: x, BinaryType()) # we can fail fast at this stage because we know the schema ahead {code} was: Currently, we don't support {{BinaryType}} in Pandas UDFs: {code} >>> from pyspark.sql.functions import pandas_udf >>> pudf = pandas_udf(lambda x: x, "binary") >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true") >>> df = spark.createDataFrame([[bytearray("a")]]) >>> df.select(pudf("_1")).show() ... TypeError: Unsupported type in conversion to Arrow: BinaryType {code} Also, the grouped aggregate Pandas UDF fail fast on {{ArrayType}} but seems we can support this case. We should better clarify it in doc in Pandas UDFs, and fail fast with type checking ahead, rather than execution time. Please consider this case: {code} pandas_udf(lambda x: x, BinaryType()) # we can fail fast at this stage because we know the schema ahead {code} > Explicitly specify supported types in Pandas UDFs > - > > Key: SPARK-23352 > URL: https://issues.apache.org/jira/browse/SPARK-23352 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 2.3.0 >Reporter: Hyukjin Kwon >Priority: Major > > Currently, we don't support {{BinaryType}} in Pandas UDFs: > {code} > >>> from pyspark.sql.functions import pandas_udf > >>> pudf = pandas_udf(lambda x: x, "binary") > >>> df = spark.createDataFrame([[bytearray("a")]]) > >>> df.select(pudf("_1")).show() > ... > TypeError: Unsupported type in conversion to Arrow: BinaryType > {code} > Also, the grouped aggregate Pandas UDF fail fast on {{ArrayType}} but seems > we can support this case. > We should better clarify it in doc in Pandas UDFs, and fail fast with type > checking ahead, rather than execution time. > Please consider this case: > {code} > pandas_udf(lambda x: x, BinaryType()) # we can fail fast at this stage > because we know the schema ahead > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23352) Explicitly specify supported types in Pandas UDFs
Hyukjin Kwon created SPARK-23352: Summary: Explicitly specify supported types in Pandas UDFs Key: SPARK-23352 URL: https://issues.apache.org/jira/browse/SPARK-23352 Project: Spark Issue Type: Sub-task Components: PySpark Affects Versions: 2.3.0 Reporter: Hyukjin Kwon Currently, we don't support {{BinaryType}} in Pandas UDFs: {code} >>> from pyspark.sql.functions import pandas_udf >>> pudf = pandas_udf(lambda x: x, "binary") >>> spark.conf.set("spark.sql.execution.arrow.enabled", "true") >>> df = spark.createDataFrame([[bytearray("a")]]) >>> df.select(pudf("_1")).show() ... TypeError: Unsupported type in conversion to Arrow: BinaryType {code} Also, the grouped aggregate Pandas UDF fail fast on {{ArrayType}} but seems we can support this case. We should better clarify it in doc in Pandas UDFs, and fail fast with type checking ahead, rather than execution time. Please consider this case: {code} pandas_udf(lambda x: x, BinaryType()) # we can fail fast at this stage because we know the schema ahead {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
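A rough sketch of the fail-fast idea: validate the return type at declaration time, where the schema is already known. Treating to_arrow_type as the validation hook is an assumption based on the converter named in the traceback, and the checker function itself is hypothetical.
{code}
# Validate the pandas_udf return type up front instead of at execution time.
from pyspark.sql.types import BinaryType, to_arrow_type

def check_pandas_udf_return_type(return_type):
    try:
        to_arrow_type(return_type)  # raises TypeError for unsupported types
    except TypeError:
        raise NotImplementedError(
            "returnType %s is not supported by pandas_udf" % return_type)

check_pandas_udf_return_type(BinaryType())  # fails fast, before any job runs
{code}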
[jira] [Commented] (SPARK-23349) Duplicate and redundant type determination for ShuffleManager Object
[ https://issues.apache.org/jira/browse/SPARK-23349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355340#comment-16355340 ] Apache Spark commented on SPARK-23349: -- User 'wujianping10043419' has created a pull request for this issue: https://github.com/apache/spark/pull/20530 > Duplicate and redundant type determination for ShuffleManager Object > > > Key: SPARK-23349 > URL: https://issues.apache.org/jira/browse/SPARK-23349 > Project: Spark > Issue Type: Improvement > Components: Shuffle, SQL >Affects Versions: 2.2.1 >Reporter: Phoenix_Daddy >Priority: Minor > Fix For: 2.2.1 > > > org.apache.spark.sql.execution.exchange.ShuffleExchange > In the "needToCopyObjectsBeforeShuffle()" function, there is a nested "if/else" > branch. The condition tested in the inner "if" repeats the type check on the > ShuffleManager object that the outer "if" already performed, so the inner > condition must be true whenever the outer one is. > In addition, a value used only inside the inner "if" is computed eagerly and > should not be calculated until it is needed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
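Since the Scala source is not quoted in the issue, the shape of the two problems can be illustrated in Python with stand-in types; this is a reconstruction from the description, not the actual ShuffleExchange code.
{code}
class SortShuffleManager:  # placeholder for the real Spark shuffle manager type
    pass

# Before: redundant type check, eagerly computed threshold
def needs_copy(shuffle_manager, num_partitions, bypass_threshold):
    sort_based = isinstance(shuffle_manager, SortShuffleManager)
    if sort_based:
        if isinstance(shuffle_manager, SortShuffleManager):  # always True here
            return num_partitions > bypass_threshold
    return False

# After: test once, defer the threshold lookup to the path that uses it
def needs_copy_fixed(shuffle_manager, num_partitions, get_threshold):
    if isinstance(shuffle_manager, SortShuffleManager):
        return num_partitions > get_threshold()
    return False

sm = SortShuffleManager()
print(needs_copy(sm, 300, 200), needs_copy_fixed(sm, 300, lambda: 200))
{code}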
[jira] [Commented] (SPARK-23350) [SS]Exception when stopping continuous processing application
[ https://issues.apache.org/jira/browse/SPARK-23350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355326#comment-16355326 ] Apache Spark commented on SPARK-23350: -- User 'yanlin-Lynn' has created a pull request for this issue: https://github.com/apache/spark/pull/20529 > [SS]Exception when stopping continuous processing application > - > > Key: SPARK-23350 > URL: https://issues.apache.org/jira/browse/SPARK-23350 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wang Yanlin >Priority: Major > > SparkException happends when stopping continuous processing application, > using Ctrl-C in stand-alone mode. > 18/02/02 16:12:57 ERROR ContinuousExecution: Query yanlin-CP-job [id = > 007f1f44-771a-4097-aaa3-28ae35c16dd9, runId = > 3e1ab7c1-4d6f-475a-9d2c-45577643b0dd] terminated with error > org.apache.spark.SparkException: Writing job failed. > at > org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:105) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3$$anonfun$apply$1.apply(ContinuousExecution.scala:268) > at > org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3$$anonfun$apply$1.apply(ContinuousExecution.scala:268) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at > org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3.apply(ContinuousExecution.scala:268) > at > org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3.apply(ContinuousExecution.scala:268) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution.runContinuous(ContinuousExecution.scala:266) > at > org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution.runActivatedStream(ContinuousExecution.scala:90) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189) > Caused by: org.apache.spark.SparkException: Job 0 cancelled because > SparkContext was shut down > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:837) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:835) > at 
scala.collection.mutable.HashSet.foreach(HashSet.scala:78) > at > org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:835) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1831) > at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83) > at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1743) > at > org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1924) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1923) > at > org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:572) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) > at
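A minimal reproduction sketch, assuming the rate source and console sink (both usable with continuous mode in 2.3 for testing): killing the JVM with Ctrl-C shuts down the SparkContext while a continuous epoch is mid-write, which matches the "Job 0 cancelled because SparkContext was shut down" cause in the trace above; stopping the query first avoids that race.
{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cp-stop-demo").getOrCreate()

query = (spark.readStream.format("rate").load()
         .writeStream.format("console")
         .trigger(continuous="1 second")
         .start())

# ... instead of Ctrl-C:
query.stop()   # stop the continuous query cleanly first
spark.stop()   # then shut down the context
{code}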
[jira] [Updated] (SPARK-23351) checkpoint corruption in long running application
[ https://issues.apache.org/jira/browse/SPARK-23351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] David Ahern updated SPARK-23351: Description: Hi, after leaving my (somewhat high volume) Structured Streaming application running for some time, I get the following exception. The same exception also repeats when I try to restart the application. The only way to get the application back running is to clear the checkpoint directory, which is far from ideal. Maybe a stream is not being flushed/closed properly internally by Spark when checkpointing? User class threw exception: org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to stage failure: Task 55 in stage 1.0 failed 4 times, most recent failure: Lost task 55.3 in stage 1.0 (TID 240, gbslixaacspa04u.metis.prd, executor 2): java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:481) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:265) at org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:200) at org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) was: Hi, after leaving my (somewhat high volume) Structured Streaming running for some time, I get the following exception. The same exception also repeats when I try to restart the application. The only way to get the application back running is to clear the checkpoint directory, which is far from ideal. Maybe a stream is not being flushed/closed properly internally by Spark when checkpointing? 
User class threw exception: org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to stage failure: Task 55 in stage 1.0 failed 4 times, most recent failure: Lost task 55.3 in stage 1.0 (TID 240, gbslixaacspa04u.metis.prd, executor 2): java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:481) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:265) at
[jira] [Commented] (SPARK-23350) [SS]Exception when stopping continuous processing application
[ https://issues.apache.org/jira/browse/SPARK-23350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355311#comment-16355311 ] Apache Spark commented on SPARK-23350: -- User 'yanlin-Lynn' has created a pull request for this issue: https://github.com/apache/spark/pull/20528 > [SS]Exception when stopping continuous processing application > - > > Key: SPARK-23350 > URL: https://issues.apache.org/jira/browse/SPARK-23350 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wang Yanlin >Priority: Major > > A SparkException happens when stopping a continuous processing application > with Ctrl-C in standalone mode. > 18/02/02 16:12:57 ERROR ContinuousExecution: Query yanlin-CP-job [id = > 007f1f44-771a-4097-aaa3-28ae35c16dd9, runId = > 3e1ab7c1-4d6f-475a-9d2c-45577643b0dd] terminated with error > org.apache.spark.SparkException: Writing job failed. > at > org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:105) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3$$anonfun$apply$1.apply(ContinuousExecution.scala:268) > at > org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3$$anonfun$apply$1.apply(ContinuousExecution.scala:268) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at > org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3.apply(ContinuousExecution.scala:268) > at > org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3.apply(ContinuousExecution.scala:268) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution.runContinuous(ContinuousExecution.scala:266) > at > org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution.runActivatedStream(ContinuousExecution.scala:90) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189) > Caused by: org.apache.spark.SparkException: Job 0 cancelled because > SparkContext was shut down > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:837) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:835) > at 
scala.collection.mutable.HashSet.foreach(HashSet.scala:78) > at > org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:835) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1831) > at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83) > at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1743) > at > org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1924) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1923) > at > org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:572) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) > at
[jira] [Assigned] (SPARK-23350) [SS]Exception when stopping continuous processing application
[ https://issues.apache.org/jira/browse/SPARK-23350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23350: Assignee: Apache Spark > [SS]Exception when stopping continuous processing application > - > > Key: SPARK-23350 > URL: https://issues.apache.org/jira/browse/SPARK-23350 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wang Yanlin >Assignee: Apache Spark >Priority: Major > > A SparkException happens when stopping a continuous processing application > with Ctrl-C in standalone mode. > 18/02/02 16:12:57 ERROR ContinuousExecution: Query yanlin-CP-job [id = > 007f1f44-771a-4097-aaa3-28ae35c16dd9, runId = > 3e1ab7c1-4d6f-475a-9d2c-45577643b0dd] terminated with error > org.apache.spark.SparkException: Writing job failed. > at > org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:105) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3$$anonfun$apply$1.apply(ContinuousExecution.scala:268) > at > org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3$$anonfun$apply$1.apply(ContinuousExecution.scala:268) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at > org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3.apply(ContinuousExecution.scala:268) > at > org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3.apply(ContinuousExecution.scala:268) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution.runContinuous(ContinuousExecution.scala:266) > at > org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution.runActivatedStream(ContinuousExecution.scala:90) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189) > Caused by: org.apache.spark.SparkException: Job 0 cancelled because > SparkContext was shut down > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:837) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:835) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:78) > at > 
org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:835) > at > org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1831) > at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83) > at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1743) > at > org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1924) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1923) > at > org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:572) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) > at >
[jira] [Assigned] (SPARK-23350) [SS]Exception when stopping continuous processing application
[ https://issues.apache.org/jira/browse/SPARK-23350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23350: Assignee: (was: Apache Spark) > [SS]Exception when stopping continuous processing application > - > > Key: SPARK-23350 > URL: https://issues.apache.org/jira/browse/SPARK-23350 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wang Yanlin >Priority: Major > > A SparkException happens when stopping a continuous processing application > with Ctrl-C in standalone mode. > 18/02/02 16:12:57 ERROR ContinuousExecution: Query yanlin-CP-job [id = > 007f1f44-771a-4097-aaa3-28ae35c16dd9, runId = > 3e1ab7c1-4d6f-475a-9d2c-45577643b0dd] terminated with error > org.apache.spark.SparkException: Writing job failed. > at > org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:105) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) > at > org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) > at > org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) > at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) > at > org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) > at > org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3$$anonfun$apply$1.apply(ContinuousExecution.scala:268) > at > org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3$$anonfun$apply$1.apply(ContinuousExecution.scala:268) > at > org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) > at > org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3.apply(ContinuousExecution.scala:268) > at > org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3.apply(ContinuousExecution.scala:268) > at > org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) > at > org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) > at > org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution.runContinuous(ContinuousExecution.scala:266) > at > org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution.runActivatedStream(ContinuousExecution.scala:90) > at > org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279) > at > org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189) > Caused by: org.apache.spark.SparkException: Job 0 cancelled because > SparkContext was shut down > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:837) > at > org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:835) > at scala.collection.mutable.HashSet.foreach(HashSet.scala:78) > at > org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:835) > at > 
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1831) > at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83) > at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1743) > at > org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1924) > at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357) > at org.apache.spark.SparkContext.stop(SparkContext.scala:1923) > at > org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:572) > at > org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) > at > org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) > at >
[jira] [Created] (SPARK-23351) checkpoint corruption in long running application
David Ahern created SPARK-23351: --- Summary: checkpoint corruption in long running application Key: SPARK-23351 URL: https://issues.apache.org/jira/browse/SPARK-23351 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 2.2.0 Reporter: David Ahern Hi, after leaving my (somewhat high volume) Structured Streaming application running for some time, I get the following exception. The same exception also repeats when I try to restart the application. The only way to get the application back running is to clear the checkpoint directory, which is far from ideal. Maybe a stream is not being flushed/closed properly internally by Spark when checkpointing? User class threw exception: org.apache.spark.sql.streaming.StreamingQueryException: Job aborted due to stage failure: Task 55 in stage 1.0 failed 4 times, most recent failure: Lost task 55.3 in stage 1.0 (TID 240, gbslixaacspa04u.metis.prd, executor 2): java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:392) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$readSnapshotFile(HDFSBackedStateStoreProvider.scala:481) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:359) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider$$anonfun$org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap$1.apply(HDFSBackedStateStoreProvider.scala:358) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.org$apache$spark$sql$execution$streaming$state$HDFSBackedStateStoreProvider$$loadMap(HDFSBackedStateStoreProvider.scala:358) at org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider.getStore(HDFSBackedStateStoreProvider.scala:265) at org.apache.spark.sql.execution.streaming.state.StateStore$.get(StateStore.scala:200) at org.apache.spark.sql.execution.streaming.state.StateStoreRDD.compute(StateStoreRDD.scala:61) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323) at org.apache.spark.rdd.RDD.iterator(RDD.scala:287) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87) at org.apache.spark.scheduler.Task.run(Task.scala:108) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
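The EOFException above is thrown while restoring a state-store snapshot from the checkpoint directory, which only happens for stateful queries. For context, here is a minimal sketch of a job of that shape; it is not the reporter's actual code, and the rate source, console sink, and checkpoint path are assumptions for illustration:

{code:scala}
import org.apache.spark.sql.SparkSession

object StatefulCheckpointSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("stateful-checkpoint-sketch").getOrCreate()
    import spark.implicits._

    // Any streaming aggregation is stateful, so every batch persists its state
    // through HDFSBackedStateStoreProvider under the checkpoint location.
    val counts = spark.readStream
      .format("rate")                  // built-in test source, stand-in for the real one
      .option("rowsPerSecond", "1000")
      .load()
      .groupBy($"value" % 100)
      .count()

    counts.writeStream
      .outputMode("complete")
      .format("console")
      // On restart, state snapshots are re-read from this directory via
      // readSnapshotFile, which is where the EOFException in the report occurs.
      .option("checkpointLocation", "hdfs:///checkpoints/stateful-sketch")
      .start()
      .awaitTermination()
  }
}
{code}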
[jira] [Created] (SPARK-23350) [SS]Exception when stopping continuous processing application
Wang Yanlin created SPARK-23350: --- Summary: [SS]Exception when stopping continuous processing application Key: SPARK-23350 URL: https://issues.apache.org/jira/browse/SPARK-23350 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Wang Yanlin A SparkException happens when stopping a continuous processing application with Ctrl-C in standalone mode. 18/02/02 16:12:57 ERROR ContinuousExecution: Query yanlin-CP-job [id = 007f1f44-771a-4097-aaa3-28ae35c16dd9, runId = 3e1ab7c1-4d6f-475a-9d2c-45577643b0dd] terminated with error org.apache.spark.SparkException: Writing job failed. at org.apache.spark.sql.execution.datasources.v2.WriteToDataSourceV2Exec.doExecute(WriteToDataSourceV2.scala:105) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131) at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127) at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155) at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151) at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152) at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127) at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80) at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80) at org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3$$anonfun$apply$1.apply(ContinuousExecution.scala:268) at org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3$$anonfun$apply$1.apply(ContinuousExecution.scala:268) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77) at org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3.apply(ContinuousExecution.scala:268) at org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution$$anonfun$runContinuous$3.apply(ContinuousExecution.scala:268) at org.apache.spark.sql.execution.streaming.ProgressReporter$class.reportTimeTaken(ProgressReporter.scala:271) at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:58) at org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution.runContinuous(ContinuousExecution.scala:266) at org.apache.spark.sql.execution.streaming.continuous.ContinuousExecution.runActivatedStream(ContinuousExecution.scala:90) at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:189) Caused by: org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:837) at org.apache.spark.scheduler.DAGScheduler$$anonfun$cleanUpAfterSchedulerStop$1.apply(DAGScheduler.scala:835) at scala.collection.mutable.HashSet.foreach(HashSet.scala:78) at org.apache.spark.scheduler.DAGScheduler.cleanUpAfterSchedulerStop(DAGScheduler.scala:835) at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onStop(DAGScheduler.scala:1831) at org.apache.spark.util.EventLoop.stop(EventLoop.scala:83) at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1743) at 
org.apache.spark.SparkContext$$anonfun$stop$8.apply$mcV$sp(SparkContext.scala:1924) at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1357) at org.apache.spark.SparkContext.stop(SparkContext.scala:1923) at org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:572) at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1988) at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188) at
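The scenario is straightforward to reproduce in principle: start a query with a continuous trigger and send SIGINT to the driver. A hedged sketch follows; the rate source and console sink are assumptions, chosen only because both support continuous mode in Spark 2.3:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object ContinuousStopSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("continuous-stop-sketch").getOrCreate()

    // Continuous processing runs one long-lived Spark job. Ctrl-C fires the
    // shutdown hook, which stops the SparkContext and cancels that job, so the
    // epoch in flight fails with "Writing job failed" as in the trace above.
    val query = spark.readStream
      .format("rate")
      .load()
      .writeStream
      .format("console")
      .trigger(Trigger.Continuous("1 second"))
      .start()

    query.awaitTermination()
  }
}
{code}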
[jira] [Commented] (SPARK-14047) GBT improvement umbrella
[ https://issues.apache.org/jira/browse/SPARK-14047?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355216#comment-16355216 ] Nick Pentreath commented on SPARK-14047: SPARK-12375 should fix that? Can you check it against the 2.3 RC (or branch-2.3)? If not, could you provide some code to reproduce the error? > GBT improvement umbrella > > > Key: SPARK-14047 > URL: https://issues.apache.org/jira/browse/SPARK-14047 > Project: Spark > Issue Type: Umbrella > Components: ML >Reporter: Joseph K. Bradley >Priority: Major > > This is an umbrella for improvements to learning Gradient Boosted Trees: > GBTClassifier, GBTRegressor. > Note: Aspects of GBTs which are related to individual trees should be listed > under [SPARK-14045]. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-19870) Repeatable deadlock on BlockInfoManager and TorrentBroadcast
[ https://issues.apache.org/jira/browse/SPARK-19870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355174#comment-16355174 ] Eyal Farago commented on SPARK-19870: - [~irashid], went through the executors' logs and found no errors. I did find some decent GC pauses and warnings like this: {quote}WARN Executor: 1 block locks were not released by TID = 5413: [rdd_53_42] {quote} This aligns with my findings of tasks cleaning up after themselves, theoretically in case of failure as well. Since both distros we're using do not offer any of the relevant versions that include your fix, I can't test this scenario with your patch (aside from other practical constraints :-)), so I can't really verify my assumption about the link between SPARK-19870 and SPARK-22083. [~stevenruppert], [~gdanov], can you guys try running with one of the relevant versions? > Repeatable deadlock on BlockInfoManager and TorrentBroadcast > > > Key: SPARK-19870 > URL: https://issues.apache.org/jira/browse/SPARK-19870 > Project: Spark > Issue Type: Bug > Components: Block Manager, Shuffle >Affects Versions: 2.0.2, 2.1.0 > Environment: ubuntu linux 14.04 x86_64 on ec2, hadoop cdh 5.10.0, > yarn coarse-grained. >Reporter: Steven Ruppert >Priority: Major > Attachments: stack.txt > > > Running what I believe to be a fairly vanilla spark job, using the RDD api, > with several shuffles, a cached RDD, and finally a conversion to DataFrame to > save to parquet. I get a repeatable deadlock at the very last reducers of one > of the stages. > Roughly: > {noformat} > "Executor task launch worker-6" #56 daemon prio=5 os_prio=0 > tid=0x7fffd88d3000 nid=0x1022b9 waiting for monitor entry > [0x7fffb95f3000] >java.lang.Thread.State: BLOCKED (on object monitor) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:207) > - waiting to lock <0x0005445cfc00> (a > org.apache.spark.broadcast.TorrentBroadcast$) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) > - locked <0x0005b12f2290> (a > org.apache.spark.broadcast.TorrentBroadcast) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86) > at > org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53) > at org.apache.spark.scheduler.Task.run(Task.scala:99) > at > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {noformat} > and > {noformat} > "Executor task launch worker-5" #55 daemon prio=5 os_prio=0 > tid=0x7fffd88d nid=0x1022b8 in Object.wait() [0x7fffb96f4000] >java.lang.Thread.State: WAITING (on object monitor) > at java.lang.Object.wait(Native Method) > at java.lang.Object.wait(Object.java:502) > at > org.apache.spark.storage.BlockInfoManager.lockForReading(BlockInfoManager.scala:202) > - locked <0x000545736b58> (a > org.apache.spark.storage.BlockInfoManager) > at > 
org.apache.spark.storage.BlockManager.getLocalValues(BlockManager.scala:444) > at > org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:210) > - locked <0x0005445cfc00> (a > org.apache.spark.broadcast.TorrentBroadcast$) > at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269) > at > org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206) > at > org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66) > - locked <0x00059711eb10> (a > org.apache.spark.broadcast.TorrentBroadcast) > at > org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66) > at > org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96) > at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70) > at >
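For context, the job shape described in the report (several shuffles, a cached RDD, and a final DataFrame conversion written to Parquet) looks roughly like the sketch below. Everything here is hypothetical: the reporter's code is not attached, so names, paths, and transformations are illustrative only.

{code:scala}
import org.apache.spark.sql.SparkSession

object DeadlockShapeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("deadlock-shape-sketch").getOrCreate()
    val sc = spark.sparkContext
    import spark.implicits._

    // Shuffle 1: reduceByKey. Because the result is cached, later tasks take
    // read locks on the cached blocks through BlockInfoManager.
    val counts = sc.textFile("hdfs:///input/events")
      .map(line => (line.hashCode % 1024, 1L))
      .reduceByKey(_ + _)
      .cache()

    // Shuffle 2: a self-join over the cached RDD; each task also reads its
    // broadcast task state via TorrentBroadcast, the other lock in the dump.
    val joined = counts.join(counts.mapValues(_ * 2))

    // Final conversion to a DataFrame, saved as Parquet.
    joined.map { case (k, (a, b)) => (k, a, b) }
      .toDF("key", "count", "doubled")
      .write.parquet("hdfs:///output/counts")
  }
}
{code}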
[jira] [Commented] (SPARK-23348) append data using saveAsTable should adjust the data types
[ https://issues.apache.org/jira/browse/SPARK-23348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355160#comment-16355160 ] Apache Spark commented on SPARK-23348: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/20527 > append data using saveAsTable should adjust the data types > -- > > Key: SPARK-23348 > URL: https://issues.apache.org/jira/browse/SPARK-23348 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Priority: Major > > > {code:java} > Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t") > Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t") > {code} > > This query will fail with a strange error: > > {code:java} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in > stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 > (TID 15, localhost, executor driver): > java.lang.UnsupportedOperationException: Unimplemented type: IntegerType > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261) > ... > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23348) append data using saveAsTable should adjust the data types
[ https://issues.apache.org/jira/browse/SPARK-23348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23348: Assignee: Apache Spark > append data using saveAsTable should adjust the data types > -- > > Key: SPARK-23348 > URL: https://issues.apache.org/jira/browse/SPARK-23348 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Assignee: Apache Spark >Priority: Major > > > {code:java} > Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t") > Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t") > {code} > > This query will fail with a strange error: > > {code:java} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in > stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 > (TID 15, localhost, executor driver): > java.lang.UnsupportedOperationException: Unimplemented type: IntegerType > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261) > ... > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23348) append data using saveAsTable should adjust the data types
[ https://issues.apache.org/jira/browse/SPARK-23348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23348: Assignee: (was: Apache Spark) > append data using saveAsTable should adjust the data types > -- > > Key: SPARK-23348 > URL: https://issues.apache.org/jira/browse/SPARK-23348 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Priority: Major > > > {code:java} > Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t") > Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t") > {code} > > This query will fail with a strange error: > > {code:java} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in > stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 > (TID 15, localhost, executor driver): > java.lang.UnsupportedOperationException: Unimplemented type: IntegerType > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261) > ... > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23329) Update the function descriptions with the arguments and returned values of the trigonometric functions
[ https://issues.apache.org/jira/browse/SPARK-23329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355155#comment-16355155 ] Mihaly Toth commented on SPARK-23329: - Nice. I like that the redundant description part is simply omitted. The only issue I see is that {{e}} is actually not an angle. We could at least put it in the plural, like: {code} /** * @param e angles in radians * @return sines of the angles, as if computed by [[java.lang.Math.sin]] * ... */ {code} I was also thinking that mentioning it is actually a Column does not inflate the lines very much, and at the same time makes them more precise: {code} /** * @param e [[Column]] of angles in radians * @return [[Column]] of sines of the angles, as if computed by [[java.lang.Math.sin]] * ... */ {code} Now, looking at this, the first one seems better because it tells the truth, and one can easily figure out that the angles are stored in a Column. > Update the function descriptions with the arguments and returned values of > the trigonometric functions > -- > > Key: SPARK-23329 > URL: https://issues.apache.org/jira/browse/SPARK-23329 > Project: Spark > Issue Type: Documentation > Components: SQL >Affects Versions: 2.3.0 >Reporter: Xiao Li >Priority: Minor > Labels: starter > > We need an update on the function descriptions for all the trigonometric > functions. For example, {{cos}}, {{sin}}, and {{cot}}. Internally, the > implementation is based on java.lang.Math. We need a clear description > about the units of the input arguments and the returned values. > For example, the following descriptions are lacking such info. > https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala#L551-L555 > https://github.com/apache/spark/blob/d5861aba9d80ca15ad3f22793b79822e470d6913/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L1978 -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
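Whatever wording lands in the scaladoc, the radians contract itself is easy to demonstrate from user code. A small sketch (app and column names are assumed):

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{lit, sin}

object SinRadiansSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sin-radians-sketch").master("local[*]").getOrCreate()

    // sin takes an angle in radians, as if computed by java.lang.Math.sin:
    // sin(pi / 2) = 1.0.
    spark.range(1)
      .select(sin(lit(math.Pi / 2)).as("sin_half_pi"))
      .show()
  }
}
{code}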
[jira] [Assigned] (SPARK-23349) Duplicate and redundant type determination for ShuffleManager Object
[ https://issues.apache.org/jira/browse/SPARK-23349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23349: Assignee: Apache Spark > Duplicate and redundant type determination for ShuffleManager Object > > > Key: SPARK-23349 > URL: https://issues.apache.org/jira/browse/SPARK-23349 > Project: Spark > Issue Type: Improvement > Components: Shuffle, SQL >Affects Versions: 2.2.1 >Reporter: Phoenix_Daddy >Assignee: Apache Spark >Priority: Minor > Fix For: 2.2.1 > > > org.apache.spark.sql.execution.exchange.ShuffleExchange > In the "needToCopyObjectsBeforeShuffle()" function, there is a nested > if/else branch. > The condition in the first-layer "if" has the same > value as the condition in the second-layer "if"; that is, one must be true > whenever the other is true, so the type of the ShuffleManager object is > determined twice. > In addition, the condition that is only used in the second-layer > "if" should not be calculated until it is needed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-23349) Duplicate and redundant type determination for ShuffleManager Object
[ https://issues.apache.org/jira/browse/SPARK-23349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16355137#comment-16355137 ] Apache Spark commented on SPARK-23349: -- User 'wujianping10043419' has created a pull request for this issue: https://github.com/apache/spark/pull/20526 > Duplicate and redundant type determination for ShuffleManager Object > > > Key: SPARK-23349 > URL: https://issues.apache.org/jira/browse/SPARK-23349 > Project: Spark > Issue Type: Improvement > Components: Shuffle, SQL >Affects Versions: 2.2.1 >Reporter: Phoenix_Daddy >Priority: Minor > Fix For: 2.2.1 > > > org.apache.spark.sql.execution.exchange.ShuffleExchange > In the "needToCopyObjectsBeforeShuffle()" function, there is a nested > if/else branch. > The condition in the first-layer "if" has the same > value as the condition in the second-layer "if"; that is, one must be true > whenever the other is true, so the type of the ShuffleManager object is > determined twice. > In addition, the condition that is only used in the second-layer > "if" should not be calculated until it is needed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-23349) Duplicate and redundant type determination for ShuffleManager Object
[ https://issues.apache.org/jira/browse/SPARK-23349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-23349: Assignee: (was: Apache Spark) > Duplicate and redundant type determination for ShuffleManager Object > > > Key: SPARK-23349 > URL: https://issues.apache.org/jira/browse/SPARK-23349 > Project: Spark > Issue Type: Improvement > Components: Shuffle, SQL >Affects Versions: 2.2.1 >Reporter: Phoenix_Daddy >Priority: Minor > Fix For: 2.2.1 > > > org.apache.spark.sql.execution.exchange.ShuffleExchange > In the "needToCopyObjectsBeforeShuffle()" function, there is a nested > if/else branch. > The condition in the first-layer "if" has the same > value as the condition in the second-layer "if"; that is, one must be true > whenever the other is true, so the type of the ShuffleManager object is > determined twice. > In addition, the condition that is only used in the second-layer > "if" should not be calculated until it is needed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-23349) Duplicate and redundant type determination for ShuffleManager Object
Phoenix_Daddy created SPARK-23349: - Summary: Duplicate and redundant type determination for ShuffleManager Object Key: SPARK-23349 URL: https://issues.apache.org/jira/browse/SPARK-23349 Project: Spark Issue Type: Improvement Components: Shuffle, SQL Affects Versions: 2.2.1 Reporter: Phoenix_Daddy Fix For: 2.2.1 org.apache.spark.sql.execution.exchange.ShuffleExchange In the "needToCopyObjectsBeforeShuffle()" function, there is a nested if/else branch. The condition in the first-layer "if" has the same value as the condition in the second-layer "if"; that is, one must be true whenever the other is true, so the type of the ShuffleManager object is determined twice. In addition, the condition that is only used in the second-layer "if" should not be calculated until it is needed. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
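Schematically, the cleanup the reporter proposes is to drop the inner re-test that the outer branch already guarantees, and to defer the costly condition until it is actually consulted. The sketch below only illustrates that pattern with hypothetical names; it is not the actual ShuffleExchange code:

{code:scala}
object NestedConditionSketch {
  // Before: the outer condition is tested again inside, and the costly
  // condition is computed even on paths that never consult it.
  def before(sortBasedShuffleOn: Boolean, numPartitions: Int): Boolean = {
    val costlyCondition = numPartitions > 200 // stand-in for an expensive check
    if (sortBasedShuffleOn) {
      if (sortBasedShuffleOn && costlyCondition) true else false // redundant re-test
    } else {
      true
    }
  }

  // After: the duplicate test is gone, and `lazy val` delays the costly
  // condition until the branch that needs it is actually taken.
  def after(sortBasedShuffleOn: Boolean, numPartitions: Int): Boolean = {
    lazy val costlyCondition = numPartitions > 200
    if (sortBasedShuffleOn) costlyCondition else true
  }
}
{code}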
[jira] [Updated] (SPARK-23348) append data using saveAsTable should adjust the data types
[ https://issues.apache.org/jira/browse/SPARK-23348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-23348: Description: {code:java} Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t") Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t") {code} This query will fail with a strange error: {code:java} org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 (TID 15, localhost, executor driver): java.lang.UnsupportedOperationException: Unimplemented type: IntegerType at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473) at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214) at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261) ... {code} > append data using saveAsTable should adjust the data types > -- > > Key: SPARK-23348 > URL: https://issues.apache.org/jira/browse/SPARK-23348 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.3.0 >Reporter: Wenchen Fan >Priority: Major > > > {code:java} > Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t") > Seq("c" -> 3).toDF("i", "j").write.mode("append").saveAsTable("t") > {code} > > This query will fail with a strange error: > > {code:java} > org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in > stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 > (TID 15, localhost, executor driver): > java.lang.UnsupportedOperationException: Unimplemented type: IntegerType > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBinaryBatch(VectorizedColumnReader.java:473) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readBatch(VectorizedColumnReader.java:214) > at > org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:261) > ... > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
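Until appends adjust types automatically, one defensive workaround is to cast the incoming DataFrame to the table's schema (matching columns by name) before appending. The sketch below is written under that assumption; note that values that cannot be cast simply become null, which may or may not be acceptable for a given dataset:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object AppendCastSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("append-cast-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    Seq(1 -> "a").toDF("i", "j").write.saveAsTable("t")

    // Align the incoming frame to t's schema before appending, instead of
    // letting mismatched Parquet column types fail later at read time.
    val incoming = Seq("c" -> 3).toDF("i", "j")
    val tableSchema = spark.table("t").schema
    val aligned = incoming.select(tableSchema.map(f => col(f.name).cast(f.dataType)): _*)
    aligned.write.mode("append").saveAsTable("t")
  }
}
{code}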
[jira] [Created] (SPARK-23348) append data using saveAsTable should adjust the data types
Wenchen Fan created SPARK-23348: --- Summary: append data using saveAsTable should adjust the data types Key: SPARK-23348 URL: https://issues.apache.org/jira/browse/SPARK-23348 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.3.0 Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org